Fundamental Concepts
Introduction
1. Describe the purpose of inferential statistics.
2. Explain the key difference between descriptive statistics and inferential statistics.
3. Explain random sampling.
4. Why is it important to apply random sampling when collecting the samples from the population?
5. Explain sampling error in inference statistics
6. Explain the terms: parameter and statistic. Provide an example of each.
- A parameter is a characteristic of the target population (e.g. the population mean).
- A statistic is a measure that describes a feature of a sample (e.g. the sample mean); it is used to estimate the corresponding population parameter.
7. Explain the two types of estimates (point and interval estimates) and provide examples.
- A point estimate is a single value that is used to estimate an unknown parameter of a population. For example, the sample mean is the point estimate of the population mean.
- An interval estimate is a range of values that is likely to contain the true value of a population parameter at a given level of confidence. For example, a confidence interval is an interval estimate.
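A minimal sketch of both kinds of estimate, using made-up sample values. It computes the sample mean as a point estimate of the population mean, then a 95% interval around it; the normal critical value 1.96 is used to keep the sketch simple (a t critical value would be more appropriate for such a small sample).

```python
import math
import statistics

# Hypothetical sample, for illustration only
sample = [12.1, 11.8, 12.5, 12.0, 12.3, 11.9, 12.4, 12.2]
n = len(sample)

point_estimate = statistics.mean(sample)  # point estimate of the population mean
s = statistics.stdev(sample)              # sample standard deviation (n-1 denominator)
se = s / math.sqrt(n)                     # standard error of the mean

# 95% interval estimate using the normal critical value 1.96
lower, upper = point_estimate - 1.96 * se, point_estimate + 1.96 * se
print(point_estimate)
print((lower, upper))
```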
8. What does it mean to have an unbiased estimator of the population mean?
Sampling Distribution
1. Explain sampling distribution and the use-case.
- Definition: A sampling distribution is a probability distribution that describes the likelihood of different values a statistic can take based on different samples drawn from the same population.
- Use-case: Provides insight into how a statistic behaves, which helps us make estimates and inferences about the larger population of interest.
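The definition above can be made concrete by simulation: draw many samples from the same population, compute the statistic (here, the sample mean) for each, and look at the distribution of those values. The population below is simulated, purely for illustration.

```python
import random
import statistics

random.seed(0)
# A simulated population: 100,000 values from a Normal(50, 10) distribution
population = [random.gauss(50, 10) for _ in range(100_000)]

# One statistic (the sample mean) per sample of size 30, repeated 2,000 times:
# these values form an empirical sampling distribution of the sample mean
sample_means = [
    statistics.mean(random.sample(population, 30))
    for _ in range(2_000)
]

# The sampling distribution centres near the population mean, with much
# smaller spread than the population itself
print(statistics.mean(sample_means))
print(statistics.stdev(sample_means))
```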
2. Explain sample means and provide the formula to calculate it.
Definition: A sample mean is the average of the observations in a single sample drawn from the population.
Formula: $\bar{X} = \frac{\sum_{i=1}^nX_i}{n}$
- $\bar{X}$: The sample mean.
- $X_i$: The individual observations in the sample.
- $n$: The sample size.
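The formula translates directly into code: sum the observations, then divide by the sample size.

```python
def sample_mean(values):
    """X-bar = (sum of x_i) / n"""
    return sum(values) / len(values)

print(sample_mean([4, 8, 6, 2]))  # → 5.0
```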
3. Can the sample means value vary?
4. Explain sampling distribution of the sample means.
5. Is the sample mean an unbiased estimator of the population mean and what does this observation imply?
6. Under what circumstances does the sampling distribution resemble the population distribution regardless of the sample size?
7. Why do we use the sampling distribution instead of the population distribution?
- Cost & Time constraints: Collecting data from an entire population can be time-consuming, expensive and extremely difficult. Sampling provides a more efficient and cost-effective way to make inferences about the population.
- Non-normality: In many cases, the population distribution is not normal, which makes direct inference on it difficult.
8. Describe the concept of sample variance $(s^2)$, including the formula.
Definition: A measure of the spread or dispersion of a set of sample data.
Formula: $s^2= \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}$
- $s^2$: The sample variance.
- $n$: The sample size.
- $\bar{x}$: The sample mean.
- $x_i$: Represents each individual value in the sample.
- $\sum$ denotes the sum of the squared differences between each value and the mean.
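The formula written out explicitly, so each term is visible; note the $n-1$ denominator. The data values are arbitrary, chosen for illustration.

```python
def sample_variance(values):
    n = len(values)
    mean = sum(values) / n                               # x-bar
    return sum((x - mean) ** 2 for x in values) / (n - 1)  # squared deviations / (n-1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(sample_variance(data))
```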
9. Describe the concept of sample standard deviation $(s)$, including the formula.
Definition: A measure of how spread out the values in a dataset are from the mean, expressed in the same units as the original data.
Formula: $s = \sqrt{s^2}$
- $s$: The sample standard deviation
- $s^2$: The sample variance
10. In the formulas for $s^2$ and $s$, the denominator is $n-1$. Explain the intuition for the adjustment from subtracting 1 from the sample size $(n)$.
- The purpose is to provide an unbiased estimate of the population variance.
- The sample variance measures squared deviations from the sample mean, but the sample mean is itself only an estimate of the true population mean. Because the sample mean is, by construction, the value that minimises the sum of squared deviations within the sample, deviations measured from it are on average smaller than deviations measured from the true population mean. The sample variance therefore tends to underestimate the population variance, and dividing by $n-1$ instead of $n$ corrects this bias by making the estimate slightly larger.
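A quick simulation makes the bias visible: drawing many small samples from a population with known variance, the $n$-denominator version systematically underestimates the true variance, while the $n-1$ version averages close to it.

```python
import random

random.seed(1)
# Population is Normal(0, sd=2), so the true variance is 4
TRUE_VAR = 4.0
trials, n = 20_000, 5

biased_total, unbiased_total = 0.0, 0.0
for _ in range(trials):
    sample = [random.gauss(0, 2) for _ in range(n)]
    mean = sum(sample) / n
    ss = sum((x - mean) ** 2 for x in sample)  # sum of squared deviations
    biased_total += ss / n                      # divide by n: biased
    unbiased_total += ss / (n - 1)              # divide by n-1: unbiased

print(biased_total / trials)    # noticeably below 4
print(unbiased_total / trials)  # close to 4
```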
Standard Error
1. Describe the concept of standard error $(SE)$, including the formula.
Definition: The sample standard error measures the variability of the sample mean as an estimate of the population mean. It represents the standard deviation of the sampling distribution of the sample mean.
Formula: $SE = \frac{s}{\sqrt{n}}$
- $SE$: Standard error.
- $s$: The sample standard deviation.
- $n$: The sample size.
2. What happens to the standard error when the sample standard deviation increases?
Formula: $SE = \frac{s}{\sqrt{n}}$
The sample standard error increases when the sample standard deviation increases.
3. What happens to the standard error when the sample size increases?
Formula: $SE = \frac{s}{\sqrt{n}}$
The standard error decreases when the sample size increases.
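A numeric check of both behaviours above: SE grows in proportion to $s$ and shrinks as $\sqrt{n}$ grows.

```python
import math

def standard_error(s, n):
    """SE = s / sqrt(n)"""
    return s / math.sqrt(n)

print(standard_error(10, 25))   # 2.0
print(standard_error(20, 25))   # 4.0  (doubling s doubles SE)
print(standard_error(10, 100))  # 1.0  (quadrupling n halves SE)
```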
4. How does the law of large numbers (LLN) apply in terms of reducing the standard error?
5. Explain the difference between the Standard Error $(SE)$ and the Standard Deviation $(s)$.
- The standard deviation describes variability within a single sample.
- The standard error describes the variability of a statistic (e.g. the sample mean) across repeated samples from the population.
6. How should we report the standard error?
Degree of Freedom
1. Explain degrees of freedom in the context of population variance and sample variance.
2. Explain why the sample variance $(s^2)$ calculation uses $n-1$ degrees of freedom.
Formula: $s^2= \frac{\sum_{i=1}^n(x_i-\bar{x})^2}{n-1}$
The sample variance has $n-1$ degrees of freedom because the deviations are measured from the sample mean, which is itself computed from the data. Given the sample mean, the deviations must sum to zero, so once $n-1$ of the observations are known, the last one is fully determined. Only $n-1$ values are free to vary.
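The constraint is easy to check numerically: deviations from the sample mean always sum to zero, so the last deviation is forced by the others. The values below are arbitrary.

```python
values = [3.0, 7.0, 5.0, 9.0]
mean = sum(values) / len(values)          # 6.0
deviations = [x - mean for x in values]   # [-3.0, 1.0, -1.0, 3.0]

print(sum(deviations))                    # 0.0
# Given the first n-1 deviations, the last is fully determined:
print(-sum(deviations[:-1]) == deviations[-1])  # True
```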
3. Provide the formula to calculate degrees of freedom.
$DF = n-p$
where:
- $DF$: Degrees of freedom.
- $n$: The sample size.
- $p$: The number of parameters being estimated.
Central Limit Theorem and Law of Large Numbers
1. Explain Central Limit Theorem (CLT).
2. Why is the CLT crucial for statistical inference?
3. The commonly used rule is that CLT starts to apply when the sample size is around 30 or greater. Provide the intuition behind the sentence.
A population consists of observations that can take on a wide range of values. When the sample size $(n)$ is small, a single extreme value has a large effect on the sample mean, so means computed from different small samples can differ substantially.
As the sample size increases, the effect of any single extreme value shrinks because it is averaged with more observations, and the sample means cluster more tightly around a common value. This reduces the standard error and makes the sampling distribution of the mean increasingly normal; around $n \approx 30$ the normal approximation is usually adequate.
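A simulation sketch of the rule of thumb: draw samples of size 30 from a heavily skewed exponential population (chosen here for illustration, with mean 1) and look at the resulting sample means. Even though the population is skewed, the means centre near the population mean with a much tighter spread.

```python
import random
import statistics

random.seed(42)

def mean_of_sample(n):
    # An Exponential(rate=1) population has mean 1 and is strongly right-skewed
    return statistics.mean(random.expovariate(1.0) for _ in range(n))

means_n30 = [mean_of_sample(30) for _ in range(5_000)]

print(statistics.mean(means_n30))   # near the population mean of 1
print(statistics.stdev(means_n30))  # near 1/sqrt(30), roughly 0.18
```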
4. Explain Law of Large Numbers (LLN).
5. Explain the key difference and similarities between LLN and CLT.
- LLN and CLT are similar in that both describe the behaviour of the sample mean as the sample size increases.
- The key difference is that the CLT describes the approximate shape of the sampling distribution of the sample means, which is normal. The LLN, by contrast, describes the value of the sample mean, which gets closer and closer to the population mean as the sample size becomes large.
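A sketch of the LLN side of the contrast: the running mean of simulated fair-die rolls (expected value 3.5) drifts toward the population mean as the number of rolls grows. The CLT, by comparison, would describe the distribution of such means across many repetitions.

```python
import random

random.seed(7)

# Track the running sample mean of fair six-sided die rolls at a few checkpoints
running_sum, checkpoints = 0, {}
for i in range(1, 100_001):
    running_sum += random.randint(1, 6)
    if i in (10, 1_000, 100_000):
        checkpoints[i] = running_sum / i

print(checkpoints)  # the running mean approaches 3.5 as n increases
```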