univariate-descriptive-statistics-r

2025-05-29 #data-visualization #data-management #r-programming #statistical-analysis #hmp669

[[maps-of-content|🌐 Maps of Content — All Notes]]
Series: [[hmp669|HMP 669 — Data Management and Visualization]]
Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]

# Univariate Descriptive Statistics in R

> [!abstract]- Overview
>
> Descriptive statistics reveal the story hidden in your data: who your participants are, what patterns exist, and how your sample compares to the broader world.
>
> **Key Concepts**:
>
> - Variable type determines statistical approach (numeric vs categorical)
> - R functions mirror intuitive statistical concepts
> - Reproducible tables eliminate errors and save time
>
> **Critical Connections**: Understanding your data's shape → Choosing appropriate statistics → Creating professional, reproducible output → Better science communication
>
> **Must Remember**: Different variables need different measures; R makes this intuitive; automation vs manual table creation

> [!info]- Required R Packages
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`dplyr`|Data manipulation (part of tidyverse)|`install.packages("tidyverse")`|
> |`gtsummary`|Professional summary tables|`install.packages("gtsummary")`|
>
> **Setup code:**
> ```r
> library(dplyr) # or library(tidyverse)
> library(gtsummary)
> ```
>
> **Note**: Base R functions (mean, median, sd, table, etc.) require no additional packages.

> [!code]- Syntax Quick Reference
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Numeric Variables**|||
> |`mean(x)`|Calculate average|`mean(age)`|
> |`median(x)`|Calculate middle value|`median(income)`|
> |`sd(x)`|Standard deviation|`sd(test_scores)`|
> |`quantile(x)`|Percentiles (default: 0, 25, 50, 75, 100)|`quantile(blood_pressure)`|
> |`IQR(x)`|Interquartile range|`IQR(height)`|
> |`summary(x)`|Multiple stats at once|`summary(weight)`|
> |`summarize(...)`|Custom multi-stat table|`summarize(df, min=min(x), max=max(x))`|
> |**Categorical Variables**|||
> |`table(x)`|Frequency table|`table(education)`|
> |`count(data, x)`|Tidy counting|`count(df, gender)`|
> |`prop.table(table(x))`|Proportions|`prop.table(table(status))`|
> |**Combined Workflow**|||
> |`count() %>% mutate()`|Count + percentage|`df %>% count(education) %>% mutate(pct = n/sum(n)*100)`|
> |**Professional Tables**|||
> |`tbl_summary()`|Automatic summary table|`df %>% select(vars) %>% tbl_summary()`|
> |`tbl_summary(..., label=...)`|Custom labels|`tbl_summary(label=list(age~"Age in Years"))`|
> |`tbl_summary(..., statistic=...)`|Custom statistics|`tbl_summary(statistic=list(age~"{mean} ({sd})"))`|

---

## Why Descriptive Statistics Matter: The Data Detective's First Step

Before you can test hypotheses or build models, you need to **meet your data**. Think of descriptive statistics as your data's introduction: they reveal the age distributions, disease prevalence, and risk factor levels that determine how generalizable, or how distinctive, your sample is.

> [!note] The Fundamental Question
> Every dataset holds people with stories. Descriptive statistics help you understand: _Who are these people, and what do their characteristics tell us about the bigger picture?_

---

## The Two Worlds of Variables: Numeric vs Categorical

### **Numeric Variables: The Shape-shifters**

Numeric variables contain only numbers, but they can take dramatically different forms:

- **Normal distribution**: The classic bell curve, symmetric and predictable
- **Bimodal distribution**: Two peaks, like seeing both night owls and early birds in sleep data
- **Right-skewed distribution**: A long tail stretching right, common in income or medical cost data

> [!tip] Imagine numeric data as landscapes: some are gentle hills (normal), others have twin peaks (bimodal), and some have steep cliffs with long valleys (skewed). The shape tells you which statistical tools work best.

As a rule of thumb, the mean and standard deviation summarize roughly symmetric data well, while the median and IQR are more robust when the data are skewed or contain outliers.

**What We Measure in Numeric Variables:**

|Concept|What It Tells Us|R Function|
|---|---|---|
|**Central Tendency**|Where's the "middle"?|`mean()`, `median()`|
|**Spread**|How scattered are the values?|`sd()`, `IQR()`|
|**Range**|What are the extremes?|`min()`, `max()`|
|**Quantiles**|How is data distributed across percentiles?|`quantile()`|

### **Categorical Variables: The Group Storytellers**

Character or factor variables organize people into meaningful groups: education levels, disease status, treatment arms. Here, we're interested in **who belongs where** and **how groups compare in size**.

**What We Count:**

- **N (Count)**: How many people are in each group?
- **Percentage**: What proportion of the total does each group represent?

> [!insight] Why Percentages Matter
> Raw counts can mislead. Saying "50 people have the condition" feels different from "2% have the condition": percentages reveal relative importance and help you understand group dynamics.

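
To make the two kinds of variables concrete, here is a minimal base R sketch. The `costs` and `education` vectors are made-up illustration data, not values from the course dataset. The numeric part shows how a single extreme value pulls the mean away from the median in right-skewed data; the categorical part pairs each count with its percentage.

```r
# Hypothetical right-skewed numeric variable (e.g. medical costs in dollars)
costs <- c(120, 150, 180, 200, 220, 250, 300, 400, 900, 5000)

# Center and spread: the one large value inflates the mean and sd,
# while the median and IQR stay close to the bulk of the data
center_and_spread <- c(
  mean   = mean(costs),
  median = median(costs),
  sd     = sd(costs),
  iqr    = IQR(costs)
)
center_and_spread

# Hypothetical categorical variable
education <- c("High school", "College", "College", "Graduate",
               "High school", "College", "Graduate", "College")

# Counts and percentages per level (table() orders levels alphabetically)
counts <- table(education)
education_summary <- data.frame(
  level   = names(counts),
  n       = as.vector(counts),
  percent = as.vector(prop.table(counts)) * 100
)
education_summary
```

With this toy data the mean is more than three times the median, which is the usual sign that median and IQR describe the typical participant better than mean and SD.
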

---

## R Functions: Your Statistical Toolkit

R's descriptive functions read like natural language, making complex calculations intuitive:

### **For Numeric Variables**

```r
# These calls assume `age` is a numeric vector, e.g. age <- enhanced_data$age
# Add na.rm = TRUE (e.g. mean(age, na.rm = TRUE)) if the variable has missing values

# Single statistics
mean(age)       # Average age
median(age)     # Middle value when sorted
sd(age)         # Standard deviation (spread)
quantile(age)   # Default: 0%, 25%, 50%, 75%, 100%
IQR(age)        # 75th percentile - 25th percentile
summary(age)    # Multiple stats at once

# Position finders
which.min(age)  # Index of the observation with the minimum value
which.max(age)  # Index of the observation with the maximum value
first(age)      # dplyr: first value in the vector (the minimum only if sorted ascending)
last(age)       # dplyr: last value in the vector (the maximum only if sorted ascending)
```

### **The Power of `summarize()`: Multiple Stats, One Command**

```r
enhanced_data %>%
  summarize(
    minimum = min(age),
    mean = mean(age),
    maximum = max(age)
  )
```

This creates a clean data frame with one row and three columns, which is convenient for organized output.

### **For Categorical Variables**

```r
# Counting approaches (choose what feels intuitive)
table(enhanced_data$education)    # Simple frequency table
count(enhanced_data, education)   # Tidy counting

# n() only works inside summarize() or mutate(), typically after group_by()
enhanced_data %>%
  group_by(education) %>%
  summarize(n = n())

# Adding percentages
enhanced_data %>%
  count(education) %>%
  mutate(percent = n / sum(n) * 100)
```

> [!warning] Remember to multiply by 100 for percentages!
> `n / sum(n)` gives proportions (0.25), while `n / sum(n) * 100` gives percentages (25%).

---

## Professional Tables: From Manual Labor to Automated Excellence

### **The Problem with Manual Tables**

Traditional approach: Calculate statistics → Copy numbers → Paste into Word → Format → Hope for no errors

**Issues**: Time-consuming, error-prone, not reproducible

### **The `tbl_summary()` Solution**

Think of `tbl_summary()` as your automatic table generator: it detects each variable's type and applies appropriate statistics without manual specification.

```r
# Basic usage
enhanced_data %>%
  select(age, education, income) %>%
  tbl_summary()
```

**What happens automatically:**

- **Numeric variables**: Displays the median with the 25th and 75th percentiles (the first and third quartiles)
- **Categorical variables**: Shows count and percentage for each level
- **Professional formatting**: Ready for publication

### **Customization Options**

You can customize labels, decimal places, and statistics:

```r
enhanced_data %>%
  tbl_summary(
    # Custom labels
    label = list(age ~ "Participant Age"),
    # Decimal places (age shown with no decimal places)
    digits = list(age ~ 0),
    # Different statistics (mean and standard deviation for age)
    statistic = list(age ~ "{mean} ({sd})")
  )
```

**Key customization options:**

- **Labels**: Give meaningful names to variables
- **Digits**: Control decimal places displayed
- **Statistics**: Choose which descriptive statistics to calculate
- **Variable types**: Use the `type` argument if the function hasn't correctly identified a variable's type

> [!tip] Reproducibility Benefits
>
> - **No copy-paste errors**: Direct R-to-output pipeline
> - **Update automatically**: Change data, re-run code, get updated table
> - **More time for interpretation**: Less manual work means more time analyzing results
> - **Professional formatting**: Ready for sharing with collaborators or the public
> - **Coding infrastructure**: Set up exactly the way you want through code

---

## Real-World Application: Building Your First Descriptive Table

> [!example]- Descriptive Table
> Analyzing a health study with participant age, education level, and blood pressure readings:
>
> ```r
> # Step 1: Understand your variables
> health_data %>% glimpse()
>
> # Step 2: Explore individual variables
> health_data %>%
>   summarize(
>     age_mean = mean(age),
>     age_sd = sd(age),
>     bp_median = median(blood_pressure)
>   )
>
> # Step 3: Categorical summaries
> health_data %>%
>   count(education) %>%
>   mutate(percent = n / sum(n) * 100)
>
> # Step 4: Professional table
> health_data %>%
>   select(age, education, blood_pressure) %>%
>   tbl_summary()
> ```

---

## Connecting the Concepts: From Understanding to Communication

**The Flow of Descriptive Analysis:**

- Raw Data → Identify Variable Types → Choose Appropriate Statistics → Calculate with R Functions → Create Professional Tables → Understand Sample Characteristics → Inform Next Analysis Steps

> [!insight] The Bigger Picture
> Descriptive statistics aren't just numbers; they're the foundation for everything that follows. Understanding your sample's characteristics helps you choose appropriate tests, interpret results in context, and communicate findings effectively.

**Match your tool to your data type.** Numeric variables need measures of center and spread; categorical variables need counts and percentages. R's intuitive function names make this natural, and `tbl_summary()` automates the entire process while maintaining professional standards.
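
The "match your tool to your data type" rule can be sketched in a few lines of base R. This is only an illustration of the idea, not how `tbl_summary()` is implemented: `describe_variable()` and `toy_data` are hypothetical names and made-up values introduced here for the example.

```r
# Hypothetical helper: choose summary statistics based on variable type
describe_variable <- function(x) {
  if (is.numeric(x)) {
    # Numeric: measures of center and spread
    c(
      mean   = mean(x, na.rm = TRUE),
      sd     = sd(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      iqr    = IQR(x, na.rm = TRUE)
    )
  } else {
    # Character or factor: count and percentage per level
    # (table() drops missing values by default)
    counts <- table(x)
    data.frame(
      level   = names(counts),
      n       = as.vector(counts),
      percent = as.vector(counts) / sum(counts) * 100
    )
  }
}

# Hypothetical toy data standing in for a real study dataset
toy_data <- data.frame(
  age = c(34, 45, 52, 61, 29, 48),
  education = c("College", "High school", "College",
                "Graduate", "College", "High school")
)

# Apply the helper to every column; the result is a named list,
# one summary per variable
summaries <- lapply(toy_data, describe_variable)
summaries
```

Here `summaries$age` is a named numeric vector and `summaries$education` is a small data frame of counts and percentages, mirroring the numeric versus categorical split described above.
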