biostats-part-2-visual-and-numerical-summaries

2025-04-09 Wednesday #biostatistics [[maps-of-content]]

> [!success]- Concept Sketch: [[]]
> ![[]]

# Data Summarization in Public Health: Visual & Numerical Methods

> [!abstract]- Quick Review
>
> **Core Essence**: Public health data requires both visual and numerical summaries to extract meaningful insights, as raw numbers alone are difficult to interpret. The appropriate methods depend on data type and distribution characteristics.
>
> **Key Concepts**:
>
> - Visual summaries: bar plots (categorical), histograms (continuous), scatter plots (relationships)
> - Location measures: mean and median show typical values but are affected differently by skewed data
> - Spread measures: standard deviation, IQR, and range quantify data variability and distribution
>
> **Must Remember**:
>
> - Skewed data affects mean more than median (mean is pulled toward the tail)
> - Categorical data requires caution when applying numerical summaries
> - Boxplots integrate location and spread statistics in one powerful visualization
>
> **Critical Relationships**:
>
> - Data type determines appropriate summary method (categorical vs. continuous)
> - Location and spread measures work together to fully describe a distribution
> - Visual and numerical summaries complement each other for comprehensive understanding

## Introduction to Data Summarization

Public health professionals regularly work with large datasets that cannot be meaningfully interpreted in their raw form. Data summarization techniques provide organized ways to extract insights, identify patterns, and communicate findings effectively.

This material explores two complementary approaches to data summarization:

1. **Visual summaries** - Graphical representations that allow patterns to be seen at a glance
2. **Numerical summaries** - Statistical measures that quantify key aspects of data distributions

Together, these approaches provide a comprehensive toolkit for understanding and communicating public health information.

> [!tip] Learning Approach
> As you study this material, think about the relationships between different summary methods and when each is most appropriate. The goal isn't just to know formulas or chart types, but to develop intuition about which approach will reveal meaningful insights for different data scenarios.

## Visual Summaries: Seeing Data Patterns

Visual summaries transform numbers into graphical representations that humans can process more intuitively. The appropriate visualization depends primarily on the **data type**.

### Bar Plots: Visualizing Categorical Data

**Bar plots** display categorical variables (nominal or ordinal) with vertical or horizontal bars representing each category.

**Key characteristics**:

- Each bar corresponds to a specific category
- Bar height (or length, for horizontal bars) represents frequency or count within that category
- Works for both nominal data (like gender, race) and ordinal data (like disease severity)

### Histograms: Visualizing Continuous Data

**Histograms** are specialized bar plots for continuous variables, where the range of values is divided into equal-width intervals (bins).

**Key characteristics**:

- Bins divide continuous data into intervals of equal width
- Bar height shows frequency or percentage within each bin
- Bars touch each other (no gaps) to represent continuity
- Shape reveals distribution properties (normal, skewed, bimodal)

**Practical application**:

- Calculating cumulative proportions (probabilities) by summing bar heights when the bars show percentages
- Example: summing the bars for all bins below 8 hours could show that 88% of people get less than the recommended 8 hours of sleep

### Scatter Plots: Visualizing Relationships

**Scatter plots** show relationships between two continuous variables by plotting individual data points.

**Key characteristics**:

- Each point represents one observation with two values
- Position shows relationship between variables
- Pattern reveals association type (positive, negative, none)
- How tightly the points cluster around the pattern suggests the strength of the relationship

**Example application**: In public health, scatter plots might show the relationship between gun ownership rates and firearm suicide rates across different regions.

> [!example]- Visual Summary Decision Tree
> ```mermaid
> flowchart TD
>     A["Data to Visualize"] --> B{"Data Type?"}
>     B -->|Categorical| C["Bar Plot"]
>     B -->|Continuous| D{"How many variables?"}
>     D -->|One| E["Histogram"]
>     D -->|Two| F["Scatter Plot"]
>     C --> G["Shows frequency by category"]
>     G --> H["Interpret patterns across categories"]
>     E --> I["Shows distribution shape"]
>     I --> J["Interpret central tendency and spread"]
>     F --> K["Shows relationships"]
>     K --> L["Interpret association direction and strength"]
> ```

## Numerical Summaries of Location

Numerical summaries of location identify the "middle" or typical value in a dataset. The two primary measures are the **mean** and the **median**.

### Sample Mean: The Mathematical Average

The **sample mean** (X̄) represents the mathematical average of all values.

**Formula**: X̄ = (sum of all values) ÷ (number of observations)

**Key characteristics**:

- Utilizes all data points in calculation
- Sensitive to extreme values (outliers)
- Appropriate for symmetric distributions
- Has useful mathematical properties for further analyses

> [!warning] Limitation
> The mean can be heavily influenced by extreme values, potentially misrepresenting the "typical" value in skewed distributions.

### Sample Median: The Middle Value

The **sample median** is the middle value in an ordered dataset (50th percentile).

**How to find it**:

1. Arrange all values in ascending order
2. For an odd number of observations: take the middle value
3. For an even number of observations: take the average of the two middle values

**Key characteristics**:

- More robust against outliers than the mean
- Represents the value that divides the dataset in half
- Often more representative of "typical" in skewed distributions

### Understanding Skewness

**Skewness** describes the asymmetry of a distribution and affects the relationship between mean and median.

**Left (negative) skew**:

- Tail extends to the left
- Mean < Median
- Mean pulled toward smaller values

**Right (positive) skew**:

- Tail extends to the right
- Mean > Median
- Mean pulled toward larger values

> [!visual]- Sketch Idea
>
> "Mean chases the tail, median stays central"
>
> **Visual Representation**: Draw three distribution curves side by side: (1) Left-skewed with mean < median, (2) Symmetric with mean = median, (3) Right-skewed with mean > median. Use vertical lines for each measure, with arrows showing how the mean is "pulled" toward the tail.

### Limitations with Categorical Data

**For ordinal categorical data**:

- Can calculate mean/median by assigning numerical values
- Results lack direct biological interpretation
- Values are sensitive to arbitrary number assignment

**For nominal categorical data**:

- Mean and median are inappropriate
- No meaningful ordering exists

> [!tip] When to Use What
>
> - **Mean**: Best for symmetric distributions without extreme outliers
> - **Median**: Better for skewed distributions or when outliers are present
> - **Both together**: Provide insight into distribution shape by comparing values

## Numerical Summaries of Spread

While location measures tell us about the center, spread measures tell us about variability or dispersion around that center.

### Sample Standard Deviation

The **sample standard deviation (s)** quantifies the typical deviation from the mean.

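As a rough illustration of how the sample standard deviation is built up, the sketch below computes it step by step: deviations from the mean are squared, summed, and divided by n − 1 to give the sample variance, and the square root of that variance is s. The `ages` values are hypothetical and made up purely for illustration, and the variable names are not from the source material. Python's standard `statistics` module is used only as a cross-check.

```python
import math
import statistics

# Hypothetical ages in years, invented for illustration only.
ages = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(ages)
mean = sum(ages) / n  # sample mean

# Squared deviation of each observation from the mean.
squared_deviations = [(x - mean) ** 2 for x in ages]

# Sample variance: sum of squared deviations divided by n - 1 (units: years squared).
variance = sum(squared_deviations) / (n - 1)

# Sample standard deviation: square root of the variance (units: years).
std_dev = math.sqrt(variance)

print(f"mean = {mean}, variance = {variance:.3f}, s = {std_dev:.3f}")
# → mean = 5.0, variance = 4.571, s = 2.138

# Cross-check against the standard library's sample (n - 1) versions.
print(math.isclose(std_dev, statistics.stdev(ages)))  # → True
```
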
**Key characteristics**:

- Unit matches the original data
- Value of 0 indicates no variability
- Larger values indicate greater spread
- Sensitive to outliers
- Square root of the variance

### Sample Variance

The **sample variance (s²)** is, roughly, the average of the squared deviations from the mean: the sum of squared deviations divided by n − 1.

**Key characteristics**:

- Square of the standard deviation
- Units are squared (e.g., years²)
- Often used in further statistical calculations

### Range Measures

**Sample Range**: The difference between maximum and minimum values.

- Simple to calculate
- Highly sensitive to outliers
- Only uses two values from the dataset

**Interquartile Range (IQR)**: The difference between the 75th percentile (Q3) and 25th percentile (Q1).

- Encompasses middle 50% of the data
- More robust against outliers than the range
- Used to identify potential outliers
- Key component of boxplots

## Boxplots: Integrating Location and Spread

**Boxplots** are visual summaries that elegantly combine multiple numerical measures.

**Components of a boxplot**:

- **Box**: Represents the IQR (Q1 to Q3)
- **Line inside box**: Median (50th percentile)
- **Whiskers**: Commonly extend to the most extreme observations within 1.5×IQR of the box edges; some conventions instead use fixed percentiles such as the 5th and 95th
- **Points beyond whiskers**: Potential outliers

**Interpretive value**:

- Position of median within box suggests skewness
- Box length shows spread of middle 50% of data
- Whisker length indicates tail behavior
- Presence of outliers suggests unusual observations

> [!example]- Case Study: Hospital Length of Stay
>
> A hospital administrator collected data on patient length of stay (in days) for a specific procedure:
>
> Raw9, 9, 10, 10, 10, 11, 12, 13, 14, 16, 19, 24, 49
>
> **Numerical summaries**:
>
> - Mean: 14.1 days
> - Median: 10 days
> - Range: 7-49 days
> - IQR: 7-14 days
> - Standard deviation: 10.4 days
>
> **Boxplot interpretation**:
>
> - The median (10) is closer to Q1 than Q3
> - Mean > median suggests right-skewed distribution
> - One potential outlier (49 days) affects the mean
> - The administrator should investigate the causes of extended stays to improve hospital efficiency.

```mermaid
flowchart TD
    A[Data Summary Methods] --> B[Visual Summaries]
    A --> C[Numerical Summaries]
    B --> D[Bar Plots]
    B --> E[Histograms]
    B --> F[Scatter Plots]
    C --> G[Location Measures]
    C --> H[Spread Measures]
    G --> I[Mean]
    G --> J[Median]
    H --> K[Standard Deviation]
    H --> L[Variance]
    H --> M[Range]
    H --> N[IQR]
    O[Boxplot] --> I
    O --> J
    O --> N
    O --> P[Outliers]
    Q[Data Type] --> R[Categorical]
    Q --> S[Continuous]
    R --> D
    S --> E
    S --> F
    T[Distribution Shape] --> U[Symmetric]
    T --> V[Skewed]
    U --> I
    V --> J
```

## Summary and Application

Effective data summarization requires choosing appropriate methods based on:

1. **Data type** (categorical vs. continuous)
2. **Distribution characteristics** (symmetric vs. skewed)
3. **Analysis objectives** (central tendency, spread, relationships)

Visual and numerical summaries complement each other:

- Visual summaries provide intuitive patterns
- Numerical summaries provide precise quantification
- Together they tell a complete story about the data

In public health, these techniques enable professionals to:

- Monitor health trends across populations
- Identify at-risk groups
- Evaluate intervention effectiveness
- Communicate findings to stakeholders and the public

> [!tip] Best Practice
> Always use multiple summary methods to gain comprehensive understanding. A single measure rarely tells the complete story about your data.

**Most important takeaway**: The appropriate summary method depends on your data type and distribution characteristics. Using inappropriate methods can lead to misleading conclusions in public health, where decisions affect human lives.

--

Reference:

- Biostats