<sub>2025-04-09 Wednesday</sub> <sub>#biostatistics </sub>
<sup>[[maps-of-content]] </sup>
> [!success]- Concept Sketch: [[]]
> ![[]]
# Statistical Foundations: Probability Models to Confidence Intervals
> [!abstract]- Quick Review
>
> **Core Essence**: Statistical inference uses probability models to make conclusions about populations based on sample data, with the normal distribution and Central Limit Theorem providing the mathematical foundation for constructing confidence intervals.
>
> **Key Concepts**:
>
> - Probability models bridge sample data to population inferences
> - The Central Limit Theorem allows us to make valid statistical conclusions regardless of population distribution
> - Confidence intervals quantify uncertainty in our population estimates
>
> **Must Remember**:
>
> - Population parameters (μ, σ) are estimated by sample statistics (x̄, s)
> - For large enough samples, the sampling distribution of x̄ is approximately normal
> - A 95% confidence interval means 95% of similarly constructed intervals would contain the true parameter
>
> **Critical Relationships**:
>
> - Larger sample size → narrower confidence intervals
> - Higher confidence level → wider confidence intervals
> - Standard error = σ/√n (decreases as sample size increases)
## Introduction to Statistical Inference
Statistical inference is the process of drawing conclusions about populations based on sample data. Since we typically can't measure entire populations, we need mathematical frameworks to make reasonable inferences from limited samples. The concepts covered here—probability models, normal distribution, Central Limit Theorem, and confidence intervals—form the backbone of this inferential process.
## Probability Models: The Foundation of Inference
**Probability models describe the likelihood of events occurring in a population**. They are essential because they provide the mathematical framework that allows us to make inferences about unknown population parameters.
### Why Probability Models Matter
Population parameters (like the mean μ and standard deviation σ) are typically unknown. We use sample statistics (like the sample mean x̄ and sample standard deviation s) to estimate these parameters, but we need a framework to understand:
1. How reliable these estimates are
2. How much they might vary across different samples
3. How to quantify our uncertainty in these estimates
> [!note] Population vs. Sample
>
> - **Population parameters**: μ (mean), σ (standard deviation)
> - **Sample statistics**: x̄ (sample mean), s (sample standard deviation)
>
> While a sample gives us specific values, probability models help us understand what these values tell us about the entire population.
### Limitations of Basic Summary Statistics
The mean and standard deviation alone don't fully describe a distribution. Different distributions can have identical means and standard deviations but very different shapes.
> [!warning] Important to Remember
> The _shape_ of a distribution matters when making probability statements. Two distributions with the same mean and standard deviation can have different probabilities for specific values if their shapes differ.
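A small simulation makes this concrete. A uniform distribution on [−√3, √3] has the same mean (0) and standard deviation (1) as the standard normal, yet assigns different probability to the same region (the interval and seed here are illustrative choices, not from the source):

```python
import math
import random

# Two distributions with the same mean (0) and standard deviation (1):
# a standard normal, and a uniform on [-sqrt(3), sqrt(3)]
# (a uniform on [a, b] has sd (b - a)/sqrt(12); here b - a = 2*sqrt(3)).
random.seed(42)
n = 200_000
normal_draws = [random.gauss(0, 1) for _ in range(n)]
uniform_draws = [random.uniform(-math.sqrt(3), math.sqrt(3)) for _ in range(n)]

def frac_within(draws, k):
    """Fraction of draws within k standard deviations of the mean."""
    return sum(abs(x) <= k for x in draws) / len(draws)

# Same mean and sd, but different probabilities for the same region:
print(f"P(|X| <= 1), normal:  {frac_within(normal_draws, 1):.3f}")   # ~0.683
print(f"P(|X| <= 1), uniform: {frac_within(uniform_draws, 1):.3f}")  # ~0.577
```

About 68% of normal draws fall within one standard deviation of the mean, versus only about 58% of uniform draws: identical summary statistics, different probability statements.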
### Probability Models as Approximations
Any probability model we choose is an approximation of the unknown true model in the population. We select models with biological or practical plausibility, acknowledging that they are imperfect but useful tools.
## The Normal Distribution: A Fundamental Probability Model
The **normal distribution** (also called Gaussian or bell-shaped distribution) is the most commonly used probability model in public health and many other fields.
### Key Properties of the Normal Distribution
1. It's perfectly **symmetric** around the mean
2. The mean equals the median
3. It's fully described by just two parameters: μ (mean) and σ (standard deviation)
4. It's denoted as N(μ, σ); note that some texts instead write N(μ, σ²), parameterizing by the variance
### The 68-95-99.7 Rule
This rule describes the percentage of values that fall within certain distances from the mean:
- Approximately **68%** of values lie within 1 standard deviation of the mean (μ ± 1σ)
- Approximately **95%** of values lie within 2 standard deviations (μ ± 2σ)
- Approximately **99.7%** (nearly 100%) of values lie within 3 standard deviations (μ ± 3σ)
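These percentages can be computed exactly for a normal distribution: P(|X − μ| ≤ kσ) = erf(k/√2), which the standard library's error function gives directly:

```python
from math import erf, sqrt

# For X ~ N(mu, sigma), P(|X - mu| <= k*sigma) = erf(k / sqrt(2)),
# which reproduces the 68-95-99.7 percentages.
for k in (1, 2, 3):
    prob = erf(k / sqrt(2))
    print(f"within {k} sd: {prob:.4%}")
# within 1 sd: 68.2689%
# within 2 sd: 95.4500%
# within 3 sd: 99.7300%
```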
## The Central Limit Theorem: A Statistical Breakthrough
The **Central Limit Theorem (CLT)** is one of the most powerful concepts in statistics. It explains why the normal distribution is so important even when our data aren't normally distributed.
### What the Central Limit Theorem States
**Regardless of the shape of the original population distribution, the sampling distribution of the sample mean will approximately follow a normal distribution if the sample size is sufficiently large.**
This has several profound implications:
1. The mean of this sampling distribution equals the population mean (μ)
2. The standard deviation of this sampling distribution (called the standard error) equals σ/√n
3. As sample size (n) increases, the standard error decreases
4. The approximation improves as sample size increases
> [!note] Sampling Distribution vs. Population Distribution
> The _population distribution_ describes how individual values are distributed in the population.
>
> The _sampling distribution of the sample mean_ describes how sample means (x̄) would be distributed if we took many samples of the same size from the population.
>
> The CLT tells us the latter will be approximately normal even if the former is not!
### Why the Central Limit Theorem Matters
The CLT allows us to make valid statistical inferences without knowing the exact shape of the population distribution. This is incredibly powerful because:
1. It lets us use normal distribution mathematics for inference
2. It works for any population distribution given a large enough sample
3. It forms the basis for constructing confidence intervals and hypothesis tests
> [!tip] How Large is "Large Enough"?
> General rule: n ≥ 30 is usually sufficient for the CLT to apply.
>
> For highly skewed distributions, you may need larger samples (n ≥ 50 or more).
>
> If the population is already normally distributed, the CLT applies for any sample size.
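The CLT can be seen in a short simulation. Here the population is exponential, which is heavily right-skewed, yet the sample means behave as the theorem predicts (the sample size and seed are illustrative choices):

```python
import random
import statistics

# CLT sketch: the population is exponential (heavily right-skewed)
# with mean mu = 1 and sd sigma = 1, yet the distribution of sample
# means for n = 50 is centered at mu with spread sigma/sqrt(n).
random.seed(1)
mu, sigma, n = 1.0, 1.0, 50
sample_means = [
    statistics.fmean(random.expovariate(1.0) for _ in range(n))
    for _ in range(20_000)
]

print(f"mean of sample means: {statistics.fmean(sample_means):.3f}")  # ~ mu = 1.0
print(f"sd of sample means:   {statistics.stdev(sample_means):.3f}")  # ~ sigma/sqrt(n) = 0.141
# Near-normal shape: roughly 95% of sample means fall within 2 SE of mu
se = sigma / n ** 0.5
coverage = sum(abs(m - mu) <= 2 * se for m in sample_means) / len(sample_means)
print(f"within mu +/- 2 SE:   {coverage:.3f}")  # close to 0.95
```

Even though individual exponential values are skewed, the sample means cluster symmetrically around μ with standard deviation σ/√n, exactly as the CLT states.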
## Confidence Intervals: Quantifying Uncertainty
Confidence intervals provide a range of plausible values for a population parameter based on sample data, along with a measure of the uncertainty in our estimate.
### Computing a Confidence Interval for the Mean
For a 95% confidence interval, the formula is:
x̄ ± 2 × (s/√n)
Where:
- x̄ is the sample mean
- s is the sample standard deviation
- n is the sample size
- s/√n is the standard error of the mean
> [!note] Where does the "2" come from?
> The multiplier "2" is an approximation for the 95% confidence level. The exact value from the normal distribution is 1.96, but 2 is commonly used for simplicity.
>
> For other confidence levels:
>
> - 90% confidence interval: x̄ ± 1.65 × (s/√n)
> - 99% confidence interval: x̄ ± 2.58 × (s/√n)
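The formula and multipliers above can be wrapped in a small helper (a hypothetical function sketched for illustration; the multipliers are the rounded values from the note, and the sample numbers are made up):

```python
from math import sqrt

# Normal-based multipliers from the note (2 is often used in place of 1.96)
Z = {0.90: 1.65, 0.95: 1.96, 0.99: 2.58}

def confidence_interval(x_bar, s, n, level=0.95):
    """Normal-approximation CI for a population mean: x_bar +/- z * s/sqrt(n)."""
    half_width = Z[level] * s / sqrt(n)
    return (x_bar - half_width, x_bar + half_width)

# Higher confidence -> wider interval, for the same data:
for level in (0.90, 0.95, 0.99):
    lo, hi = confidence_interval(x_bar=50, s=10, n=100, level=level)
    print(f"{level:.0%} CI: [{lo:.2f}, {hi:.2f}]")
```

Running this shows the 99% interval is the widest and the 90% interval the narrowest, for identical data.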
### Correct Interpretation of Confidence Intervals
> [!warning] Common Misinterpretation
> It is **incorrect** to say: "There is a 95% probability that the population mean is within this confidence interval."
>
> Once calculated, a specific confidence interval either contains the true parameter or it doesn't—there's no probability involved.
**Correct interpretation**: If we were to draw many samples from the same population and construct a 95% confidence interval from each sample, approximately 95% of these intervals would contain the true population mean.
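This long-run interpretation can be checked by simulation: draw many samples from a population with a known mean, build a 95% interval from each, and count how often the interval captures the true mean (population values and seed here are illustrative):

```python
import random
import statistics

# Coverage sketch: repeat the sampling process many times and count how
# often the 95% CI from each sample contains the true mean mu.
random.seed(7)
mu, sigma, n = 100.0, 15.0, 40
trials = 5_000
hits = 0
for _ in range(trials):
    sample = [random.gauss(mu, sigma) for _ in range(n)]
    x_bar = statistics.fmean(sample)
    se = statistics.stdev(sample) / n ** 0.5
    if x_bar - 1.96 * se <= mu <= x_bar + 1.96 * se:
        hits += 1

print(f"coverage: {hits / trials:.3f}")  # close to 0.95
```

Each individual interval either contains μ = 100 or it doesn't; the 95% refers to the fraction of intervals across repeated sampling that do.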
### Factors Affecting Confidence Interval Width
1. **Sample size (n)**: Larger sample → Narrower interval
- Doubling the sample size reduces the width by a factor of √2 (about 1.41)
2. **Confidence level**: Higher confidence → Wider interval
- 99% confidence intervals are wider than 95% intervals
3. **Sample variability (s)**: Greater variability → Wider interval
- More diverse/scattered data points lead to wider intervals
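The sample-size effect follows directly from the width formula, 2 × 1.96 × s/√n (sample values here are made up for illustration):

```python
from math import sqrt

# Width of a 95% CI is 2 * 1.96 * s / sqrt(n): doubling n shrinks the
# width by a factor of sqrt(2) (~1.41); quadrupling n halves it.
def ci_width(s, n, z=1.96):
    return 2 * z * s / sqrt(n)

s = 10
for n in (25, 50, 100):
    print(f"n = {n:3d}: width = {ci_width(s, n):.2f}")
# n =  25: width = 7.84
# n =  50: width = 5.54
# n = 100: width = 3.92
```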
### Using Confidence Intervals for Inference
Confidence intervals can be used to:
1. Estimate population parameters with a measure of precision
2. Test hypotheses about specific values
3. Compare groups to see if they likely differ in the population
> [!example]- Case Application: Birth Weights
>
> A study examines birth weights of babies born to uninsured mothers. From a sample of 100 babies, researchers find:
>
> - Sample mean (x̄): 2,350 grams
> - Sample standard deviation (s): 400 grams
>
> **Questions**:
>
> 1. Calculate a 95% confidence interval for the population mean birth weight.
> 2. Is a mean birth weight of 2,500 grams (considered healthy) plausible for this population?
>
> **Solution**:
>
> 1. 95% CI = 2,350 ± 2 × (400/√100) = 2,350 ± 2 × 40 = 2,350 ± 80 = [2,270, 2,430] grams
>
> 2. Since 2,500 grams lies outside our confidence interval, the data do not support that these babies have a healthy mean birth weight of 2,500 grams. The evidence suggests their average weight is lower than the healthy standard.
>
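The arithmetic in the case above is easy to verify in a few lines:

```python
from math import sqrt

# Checking the case numbers: x_bar = 2,350 g, s = 400 g, n = 100.
x_bar, s, n = 2350, 400, 100
se = s / sqrt(n)                  # 400 / 10 = 40
lo, hi = x_bar - 2 * se, x_bar + 2 * se
print(f"95% CI: [{lo:.0f}, {hi:.0f}] grams")     # [2270, 2430]
print(f"2500 g inside the CI? {lo <= 2500 <= hi}")  # False
```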
## Connecting the Concepts: From Probability Models to Inference
Let's see how these concepts work together in the statistical inference process:
```mermaid
flowchart TD
A[Population with unknown parameters] --> B[Sample data collection]
B --> C[Calculate sample statistics]
C --> D[Apply probability model]
D --> E[Use Central Limit Theorem]
E --> F[Construct confidence interval]
F --> G[Make inferences about population]
```
The journey from data to conclusion relies on:
1. **Probability models** providing the mathematical framework
2. **Normal distribution** offering well-understood properties
3. **Central Limit Theorem** justifying the use of normal-based methods
4. **Confidence intervals** quantifying our uncertainty
## Summary: The Power of Statistical Inference
Statistical inference allows us to reach beyond our limited data to make reasoned conclusions about populations. The concepts we've covered form the foundation of this process:
- **Probability models** help us bridge the gap between sample and population
- The **normal distribution** provides a well-understood mathematical framework
- The **Central Limit Theorem** gives us powerful tools for inference regardless of the original population distribution
- **Confidence intervals** allow us to estimate population parameters while acknowledging uncertainty
> [!important] The Single Most Important Takeaway
> **The Central Limit Theorem is the bridge that allows us to make valid statistical inferences about populations regardless of their distribution, providing we have a sufficiently large sample size.** This principle underpins virtually all parametric statistical methods.
By understanding these concepts, you can properly interpret statistical results, design better studies, and draw more valid conclusions from data—essential skills in public health, scientific research, and many other fields.
---
Reference:
- Biostats