<sub>2025-04-09 Wednesday</sub> <sub>#biostatistics </sub> <sup>[[maps-of-content]] </sup>

# Hypothesis Testing for Continuous Outcomes

> [!success]- Concept Sketch: [[]]
> ![[]]

> [!abstract]- Quick Review
>
> **Core Essence**: Statistical methods to determine whether differences in continuous outcomes between groups are meaningful or likely due to chance, focusing on t-tests and ANOVA with appropriate corrections for multiple comparisons.
>
> **Key Concepts**:
>
> - The two-sample t-test compares means between two groups using a signal-to-noise ratio
> - ANOVA extends the comparison to three or more groups using between-group vs. within-group variation
> - The Bonferroni correction controls the error rate when making multiple comparisons
>
> **Must Remember**:
>
> - Small p-values (< 0.05) provide evidence to reject the null hypothesis
> - Statistical significance shows association, not causation
> - Multiple comparisons increase the risk of false positives
>
> **Critical Relationships**:
>
> - Test statistic = signal (difference between means) ÷ noise (standard error)
> - Number of pairwise comparisons = K(K-1)/2, where K is the number of groups
> - Adjusted p-value threshold = 0.05 ÷ number of comparisons

## Introduction to Hypothesis Testing

Hypothesis testing provides a framework for making decisions about populations based on sample data. For continuous outcomes (like birth weight, blood lead levels, or test scores), we use specific statistical methods to determine whether observed differences between groups are likely due to chance or represent meaningful differences.

This guide covers three key statistical approaches:

1. Two-sample t-tests for comparing two groups
2. Analysis of Variance (ANOVA) for comparing multiple groups
3. Bonferroni correction to address multiple comparison problems

> [!tip] Continuous Outcomes
> Continuous outcomes are measurements that can take any value within a range (e.g., weight, height, blood pressure). These differ from categorical outcomes, which fall into distinct categories (e.g., yes/no, disease/no disease).

## Two-Sample T-test: Comparing Two Groups

### Purpose and Application

The two-sample t-test determines whether there is a statistically significant difference between the means of two independent populations.

> [!visual] Sketch Idea
>
> **Core Concept**: Two-Sample T-test
> **Full Description**: Statistical test comparing means of two populations by measuring the difference between sample means relative to their variability (standard error)
> **Memorable Description**: "Signal-to-noise ratio for two groups"
> **Visual Representation**: Draw two bell curves side-by-side, one for each group, with means marked. Show the difference between means as "signal" (with a vertical arrow) and the overlap/spread as "noise" (with a horizontal double arrow). Include the formula t = (x̄₁ - x̄₂)/SE as a ratio.

### Hypothesis Framework

**Null Hypothesis (H₀)**: The means of the two populations are equal.

- Mathematically: μ₁ = μ₂, or μ₁ - μ₂ = 0
- Interpretation: No association between the exposure and the outcome

**Alternative Hypothesis (H₁)**: The means of the two populations are different.

- Mathematically: μ₁ ≠ μ₂, or μ₁ - μ₂ ≠ 0
- Interpretation: An association exists between the exposure and the outcome

### The Test Statistic: T-statistic

The t-statistic measures how many standard errors the difference in sample means is from zero:

t = (x̄₁ - x̄₂) / SE of the difference

Where:

- x̄₁ and x̄₂ are the sample means
- SE is the standard error of the difference between means

> [!note] Signal vs. Noise
> The t-statistic is a ratio of "signal" (the difference between means) to "noise" (the standard error). Larger absolute t-values indicate stronger evidence against the null hypothesis.
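In practice this calculation is rarely done by hand. Here is a minimal sketch, assuming SciPy is available: `scipy.stats.ttest_ind_from_stats` computes the t-statistic and p-value directly from summary statistics, and the numbers below are taken from the birth weight case application later in this note.

```python
from scipy import stats

# Two-sample t-test from summary statistics; the figures come from
# the birth weight case application later in this note.
t_stat, p_value = stats.ttest_ind_from_stats(
    mean1=3125, std1=450, nobs1=85,  # poverty group: x̄₁, s₁, n₁
    mean2=3310, std2=425, nobs2=92,  # non-poverty group: x̄₂, s₂, n₂
    equal_var=True,  # pooled SE (≈65.8 here); set False for Welch's t-test
)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}")  # t ≈ -2.81, p ≈ 0.006
```

With raw measurements rather than summary statistics, `scipy.stats.ttest_ind` does the same job.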
### Making Decisions with P-values

The p-value is the probability of observing a t-statistic as extreme as the one calculated, assuming the null hypothesis is true.

Decision rule:

- If p-value < 0.05: Reject the null hypothesis (evidence supports a difference)
- If p-value ≥ 0.05: Fail to reject the null hypothesis (insufficient evidence for a difference)

> [!warning] Association ≠ Causation
> A significant t-test result only indicates an association between the exposure and outcome, not a causal relationship. Other study designs and criteria are needed to establish causation.

### Confidence Intervals

A 95% confidence interval for the difference in means provides a range of plausible values for the true difference.

Key insight: If the 95% confidence interval does not include zero, then p < 0.05 and we reject the null hypothesis.

> [!example]- Case Application: Birth Weight Study
>
> Researchers compared the average birth weight of babies born to mothers living in poverty versus those not in poverty.
>
> **Data:**
>
> - Group 1 (poverty): n₁ = 85, x̄₁ = 3125 g, s₁ = 450 g
> - Group 2 (non-poverty): n₂ = 92, x̄₂ = 3310 g, s₂ = 425 g
>
> **Test statistic calculation:** t = (3125 - 3310) / SE = -185 / 65.8 = -2.81
>
> **P-value:** 0.006
>
> **95% CI for the difference:** (-315 g, -55 g)
>
> **Conclusion:** Since p < 0.05 and the CI doesn't include zero, we reject the null hypothesis. There is a statistically significant difference in average birth weights between the two groups. Babies born to mothers in poverty had lower birth weights by approximately 185 g on average.

## Analysis of Variance (ANOVA): Comparing Multiple Groups

### Purpose and Application

ANOVA extends hypothesis testing to compare means across three or more independent groups simultaneously.

### Hypothesis Framework

**Null Hypothesis (H₀)**: All population means are equal.

- Mathematically: μ₁ = μ₂ = μ₃ = ... = μₖ
- Interpretation: Group membership has no association with the outcome

**Alternative Hypothesis (H₁)**: At least two population means differ.

- Interpretation: The outcome differs between at least two groups

> [!note] ANOVA Limitation
> ANOVA only tells you whether there are any differences among the groups. It does not tell you which specific groups differ from each other; follow-up tests are needed for that.

### The Test Statistic: F-statistic

The F-statistic is the ratio of between-group variation to within-group variation:

F = MSB / MSE

Where:

- MSB (Mean Square Between) = between-group variation = "signal"
- MSE (Mean Squared Error) = within-group variation = "noise"

### Components of ANOVA

1. **Grand Mean**: The average of all observations across all groups
2. **Mean Square Between (MSB)**: Measures how far each group's mean is from the grand mean, weighted by sample sizes
3. **Mean Squared Error (MSE)**: Measures how far individual observations are from their respective group means
4. **F-statistic**: The ratio MSB/MSE, which follows an F-distribution under the null hypothesis
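The same signal-to-noise logic can be checked numerically. Below is a minimal sketch, assuming SciPy is available, using `scipy.stats.f_oneway` on made-up raw measurements whose group means and within-group spread roughly mirror the blood lead case application that follows; the exact output depends on the random seed.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical raw measurements for three groups (illustrative only);
# means and spread loosely mirror the blood lead example below
group_a = rng.normal(loc=5.2, scale=2.2, size=32)
group_b = rng.normal(loc=3.1, scale=2.2, size=28)
group_c = rng.normal(loc=4.8, scale=2.2, size=30)

# One-way ANOVA: F = MSB / MSE, p-value from the F-distribution
f_stat, p_value = stats.f_oneway(group_a, group_b, group_c)
print(f"F = {f_stat:.2f}, p = {p_value:.4f}")
# With group differences this large, p is usually well below 0.05
```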
### Making Decisions

As with the t-test, we use the p-value to make decisions:

- If p-value < 0.05: Reject the null hypothesis (at least two groups differ)
- If p-value ≥ 0.05: Fail to reject the null hypothesis (insufficient evidence for differences)

### ANOVA Table

Results are typically presented in an ANOVA table:

|Source|Sum of Squares|df|Mean Square|F|p-value|
|---|---|---|---|---|---|
|Between Groups|SSB|K-1|MSB = SSB/(K-1)|F = MSB/MSE|p|
|Within Groups (Error)|SSE|N-K|MSE = SSE/(N-K)|||
|Total|SST|N-1||||

Where:

- K = number of groups
- N = total sample size across all groups

> [!example]- Case Application: Blood Lead Levels by Region
>
> Researchers compared average blood lead levels in children from three different regions.
>
> **Data:**
>
> - Region A: n₁ = 32, x̄₁ = 5.2 μg/dL
> - Region B: n₂ = 28, x̄₂ = 3.1 μg/dL
> - Region C: n₃ = 30, x̄₃ = 4.8 μg/dL
>
> **ANOVA Results:**
>
> |Source|Sum of Squares|df|Mean Square|F|p-value|
> |---|---|---|---|---|---|
> |Between Regions|86.4|2|43.2|8.64|0.0004|
> |Within Regions|435.0|87|5.0|||
> |Total|521.4|89||||
>
> **Conclusion:** With p = 0.0004 < 0.05, we reject the null hypothesis. There is strong evidence that mean blood lead levels differ among at least two of the three regions. However, ANOVA doesn't tell us which specific regions differ from each other.

## Multiple Comparisons and Bonferroni Correction

### The Multiple Comparisons Problem

When comparing multiple groups, conducting many pairwise comparisons increases the risk of Type I errors (false positives).

### Type I Error in Context

**Type I Error**: Rejecting a true null hypothesis (finding a "significant" difference when there isn't one)

For a single test with α = 0.05, there is a 5% chance of a Type I error. For m independent tests, the overall Type I error rate is:

P(at least one Type I error) = 1 - (1 - α)ᵐ

For example, with m = 3 tests at α = 0.05, this is 1 - 0.95³ ≈ 0.14, nearly triple the nominal rate.

### Number of Pairwise Comparisons

For K groups, the number of possible pairwise comparisons is:

m = K(K-1)/2

Examples:

- 3 groups → 3 comparisons
- 4 groups → 6 comparisons
- 5 groups → 10 comparisons

### Bonferroni Correction

The Bonferroni correction adjusts for multiple comparisons by making each individual test more stringent, ensuring the overall Type I error rate stays at the desired level (typically 0.05).

Two equivalent approaches:

1. **Adjusted significance threshold**:
    - New threshold = α/m
    - Example: For 3 comparisons, 0.05/3 ≈ 0.0167
2. **Adjusted p-values**:
    - Adjusted p-value = original p-value × m (capped at 1)
    - Compare adjusted p-values to the original α (0.05)

> [!note] Conservative Approach
> The Bonferroni correction is conservative: it reduces Type I errors but increases Type II errors (failing to detect true differences). Other methods, such as Tukey's HSD or Scheffé's method, may be more powerful in certain situations.
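Here is a sketch of the correction in code, assuming the statsmodels package is available: `multipletests` multiplies each p-value by m (capping at 1) and flags which comparisons survive. The p-values are those from the post-hoc case application that follows.

```python
from statsmodels.stats.multitest import multipletests

# Raw pairwise p-values (A vs B, A vs C, B vs C) from the case application below
raw_p = [0.0008, 0.42, 0.003]

# Bonferroni: adjusted p = min(1, p * m), compared against alpha
reject, adj_p, _, _ = multipletests(raw_p, alpha=0.05, method="bonferroni")

for pair, p, r in zip(["A vs B", "A vs C", "B vs C"], adj_p, reject):
    print(f"{pair}: adjusted p = {p:.4f} -> {'significant' if r else 'not significant'}")
# Expected: A vs B and B vs C remain significant; A vs C does not
```

The same function also implements less conservative procedures via `method="holm"` or `method="fdr_bh"`, which connects to the alternatives discussed later in this note.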
> [!example]- Case Application: Post-hoc Analysis of Blood Lead Levels
>
> After finding a significant ANOVA result (p = 0.0004) for blood lead levels across three regions, researchers conducted pairwise t-tests.
>
> **Original pairwise t-test results:**
>
> - Region A vs B: p = 0.0008
> - Region A vs C: p = 0.42
> - Region B vs C: p = 0.003
>
> **Bonferroni correction:**
>
> - Number of comparisons = 3
> - Adjusted significance threshold = 0.05/3 ≈ 0.0167
>
> **Decision based on adjusted threshold:**
>
> - Region A vs B: p = 0.0008 < 0.0167 → Significant difference
> - Region A vs C: p = 0.42 > 0.0167 → No significant difference
> - Region B vs C: p = 0.003 < 0.0167 → Significant difference
>
> **Conclusion after correction:** Children in Region B have significantly lower blood lead levels than those in both Region A and Region C. There is no significant difference between Regions A and C.

## Practical Applications of Hypothesis Testing

### When to Use Each Test

- **Two-Sample T-test**: Use when comparing exactly two groups (e.g., treatment vs. control, exposed vs. unexposed)
- **ANOVA**: Use when comparing three or more groups (e.g., multiple treatment arms, geographic regions, age categories)
- **Bonferroni Correction**: Apply when making multiple pairwise comparisons after a significant ANOVA result

### Interpreting Results in Context

Statistical significance should always be interpreted alongside:

1. **Effect size**: How large is the difference in practical terms?
2. **Clinical or practical relevance**: Does the difference matter in the real world?
3. **Study design**: Was the study designed to detect causation or just association?
4. **Sample size**: Larger studies can detect smaller differences as "significant"

> [!tip] Reporting Results
> When reporting results, include both p-values AND confidence intervals. P-values tell you whether there is evidence of an effect, while confidence intervals tell you about the size and precision of that effect.

> [!visual]- Sketch Idea
>
> **Core Concept**: Hypothesis Testing Decision Flow
> **Visual Representation**: A decision tree flowchart starting with "How many groups?" → Two groups leads to t-test; three or more leads to ANOVA → If ANOVA is significant, branch to "Multiple comparisons needed?" → Yes leads to Bonferroni correction. Include small visual cues for each test (bell curves for t-test, multiple distributions for ANOVA).

## Common Misconceptions and Pitfalls

### Statistical Significance ≠ Importance

A small p-value only indicates that the observed difference is unlikely to occur by chance alone. It doesn't tell you whether the difference matters in a practical sense.

### P-hacking and Multiple Testing

Running many tests until a "significant" result appears artificially increases the chance of false positives. This is why corrections like Bonferroni are crucial.

### Assumptions Matter

Both t-tests and ANOVA have assumptions:

- Approximately normal distribution of the outcome (or a large enough sample size)
- Homogeneity of variances (similar spread in each group)
- Independent observations

> [!warning] Violations of Assumptions
> When assumptions are violated, consider:
>
> - Transforming the data (e.g., log transformation)
> - Using non-parametric alternatives (e.g., Mann-Whitney U, Kruskal-Wallis)
> - Consulting with a statistician for complex situations

## Advanced Considerations

### One-sided vs. Two-sided Tests

The standard approach is a two-sided test (H₁: μ₁ ≠ μ₂), but a one-sided test (H₁: μ₁ > μ₂ or H₁: μ₁ < μ₂) can be appropriate when you have a directional hypothesis specified in advance.
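As a sketch of the directional version, assuming SciPy ≥ 1.6 (where `ttest_ind_from_stats` accepts an `alternative` keyword), the birth weight comparison can be re-run under H₁: μ₁ < μ₂:

```python
from scipy import stats

# Directional hypothesis H1: mu1 < mu2
# (poverty group hypothesized in advance to have a LOWER mean birth weight)
t_stat, p_one_sided = stats.ttest_ind_from_stats(
    mean1=3125, std1=450, nobs1=85,
    mean2=3310, std2=425, nobs2=92,
    equal_var=True,
    alternative="less",  # one-sided; "two-sided" is the default
)
print(f"t = {t_stat:.2f}, one-sided p = {p_one_sided:.4f}")
# When the observed difference is in the hypothesized direction,
# the one-sided p is half the two-sided p (here ≈ 0.003)
```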
### Other Multiple Comparison Procedures

Besides Bonferroni, other methods include:

- Tukey's Honestly Significant Difference (HSD)
- Scheffé's method
- Holm's step-down procedure
- False Discovery Rate (FDR) control

Each has different strengths depending on your specific needs and the number of comparisons.

> [!note] Alternative Post-hoc Procedure
> When comparing multiple groups to a single control group, consider Dunnett's test instead of conducting all pairwise comparisons with a Bonferroni correction.

## Summary and Integration

Hypothesis testing for continuous outcomes provides a statistical framework for comparing group means and determining whether differences are statistically significant.

**Key relationships between concepts:**

1. The two-sample t-test is a special case of ANOVA limited to exactly two groups
2. Both methods compare "signal" (differences between means) to "noise" (variability within the data)
3. ANOVA requires follow-up tests to determine which specific groups differ
4. Multiple comparisons require correction methods like Bonferroni to control error rates

> [!tip] Statistical Significance vs. Practical Importance
> A statistically significant result doesn't automatically mean the difference is large enough to be practically important. Always consider the size of the effect in the context of the field.

## Most Important Takeaway

**The essence of hypothesis testing for continuous outcomes is quantifying the signal-to-noise ratio.** Whether using t-tests or ANOVA, we are fundamentally asking: "Is the difference between groups (signal) large enough, compared to the variability within groups (noise), to suggest a real effect rather than random chance?" This balance between signal and noise is the foundation for all statistical inference in this context.

--
Reference: