<sub>2025-05-29</sub> <sub>#data-visualization #data-management #r-programming #statistical-analysis #hmp669</sub>
<sup>[[maps-of-content|Maps of Content - All Notes]] </sup>
<sup>Series: [[hmp669|HMP 669 - Data Management and Visualization]]</sup>
<sup>Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]</sup>
# Univariate Descriptive Statistics in R
> [!abstract]- Overview
>
> Descriptive statistics reveal the story hidden in your data---who your participants are, what patterns exist, and how your sample compares to the broader world.
>
> **Key Concepts**:
>
> - Variable type determines statistical approach (numeric vs categorical)
> - R functions mirror intuitive statistical concepts
> - Reproducible tables eliminate errors and save time
>
> **Critical Connections**: Understanding your data's shape → Choosing appropriate statistics → Creating professional, reproducible output → Better science communication
>
> **Must Remember**: Different variable types need different measures; R's function names make the match intuitive; automated tables beat manually built ones for accuracy and reproducibility.

> [!info]- Required R Packages
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`dplyr`|Data manipulation (part of tidyverse)|`install.packages("tidyverse")`|
> |`gtsummary`|Professional summary tables|`install.packages("gtsummary")`|
>
> **Setup code:**
> ```r
> library(dplyr) # or library(tidyverse)
> library(gtsummary)
> ```
>
> **Note**: Base R functions (`mean`, `median`, `sd`, `table`, etc.) require no additional packages.

> [!code]- Syntax Quick Reference
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Numeric Variables**|||
> |`mean(x)`|Calculate average|`mean(age)`|
> |`median(x)`|Calculate middle value|`median(income)`|
> |`sd(x)`|Standard deviation|`sd(test_scores)`|
> |`quantile(x)`|Percentiles (default: 0,25,50,75,100)|`quantile(blood_pressure)`|
> |`IQR(x)`|Interquartile range|`IQR(height)`|
> |`summary(x)`|Multiple stats at once|`summary(weight)`|
> |`summarize(...)`|Custom multi-stat table|`summarize(df, min = min(x), max = max(x))`|
> |**Categorical Variables**|||
> |`table(x)`|Frequency table|`table(education)`|
> |`count(data, x)`|Tidy counting|`count(df, gender)`|
> |`prop.table(table(x))`|Proportions|`prop.table(table(status))`|
> |**Combined Workflow**|||
> |`count() %>% mutate()`|Count + percentage|`count(df, education) %>% mutate(pct = n / sum(n) * 100)`|
> |**Professional Tables**|||
> |`tbl_summary()`|Automatic summary table|`df %>% select(age, education) %>% tbl_summary()`|
> |`tbl_summary(..., label=...)`|Custom labels|`tbl_summary(label = list(age ~ "Age in Years"))`|
> |`tbl_summary(..., statistic=...)`|Custom statistics|`tbl_summary(statistic = list(age ~ "{mean} ({sd})"))`|
---
## Why Descriptive Statistics Matter: The Data Detective's First Step
Before you can test hypotheses or build models, you need to **meet your data**. Think of descriptive statistics as your data's introduction---revealing age distributions, disease prevalence, and risk factor levels that determine how generalizable and unique your sample truly is.
> [!note] The Fundamental Question
> Every dataset holds people with stories. Descriptive statistics help you understand: _Who are these people, and what do their characteristics tell us about the bigger picture?_
---
## The Two Worlds of Variables: Numeric vs Categorical
### **Numeric Variables: The Shape-shifters**
Numeric variables contain only numbers, but they can take dramatically different forms:
- **Normal distribution**: The classic bell curve---symmetric and predictable
- **Bimodal distribution**: Two peaks, like seeing both night owls and early birds in sleep data
- **Right-skewed distribution**: A long tail stretching right, common in income or medical cost data
> [!tip] Imagine numeric data as landscapes: some are gentle hills (normal), others have twin peaks (bimodal), and some have steep cliffs with long valleys (skewed). The shape tells you which statistical tools work best.
**What We Measure in Numeric Variables:**

|Concept|What It Tells Us|R Function|
|---|---|---|
|**Central Tendency**|Where's the "middle"?|`mean()`, `median()`|
|**Spread**|How scattered are the values?|`sd()`, `IQR()`|
|**Range**|What are the extremes?|`min()`, `max()`|
|**Quantiles**|How is data distributed across percentiles?|`quantile()`|
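
To make the link between shape and choice of statistic concrete, here is a minimal sketch using simulated right-skewed data. The variable `costs` and the log-normal parameters are assumptions chosen for illustration, not values from the course.

```r
# Simulated right-skewed "costs" (hypothetical data, not from the course)
set.seed(42)
costs <- rlnorm(1000, meanlog = 7, sdlog = 1)

skew_stats <- c(
  mean   = mean(costs),
  median = median(costs),
  sd     = sd(costs),
  iqr    = IQR(costs)
)
round(skew_stats)

# In a right-skewed sample the long tail pulls the mean above the median
skew_stats["mean"] > skew_stats["median"]
```

This is why the median and IQR are usually the safer summary for skewed variables, while the mean and standard deviation suit roughly symmetric ones.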
### **Categorical Variables: The Group Storytellers**
Character or factor variables organize people into meaningful groups---education levels, disease status, treatment arms. Here, we're interested in **who belongs where** and **how groups compare in size**.
**What We Count:**
- **N (Count)**: How many people in each group?
- **Percentage**: What proportion of the total does each group represent?
> [!insight] Why Percentages Matter
> Raw counts can mislead. Saying "50 people have the condition" feels different than "2% have the condition"---percentages reveal relative importance and help you understand group dynamics.
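
The count-versus-percentage point can be sketched in base R. The sample below is hypothetical: it assumes 50 affected participants out of 2,500, which is one way the "50 people" and "2%" figures above could describe the same data.

```r
# Hypothetical sample: 50 of 2,500 participants have the condition
condition <- c(rep("yes", 50), rep("no", 2450))

counts <- table(condition)             # raw counts per group
percents <- prop.table(counts) * 100   # share of the total, in percent

counts
percents
```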
---
## R Functions: Your Statistical Toolkit
R's descriptive functions read like natural language, making complex calculations intuitive:
### **For Numeric Variables**
```r
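# `age` stands for a numeric vector, e.g. enhanced_data$age
# Single statistics
mean(age)      # Average age
median(age)    # Middle value when sorted
sd(age)        # Standard deviation (spread)
quantile(age)  # Default: 0%, 25%, 50%, 75%, 100%
IQR(age)       # 75th percentile - 25th percentile
summary(age)   # Min, quartiles, median, mean, and max at once
# Position finders
first(age)      # dplyr: first value in the vector (not the minimum)
last(age)       # dplyr: last value in the vector (not the maximum)
which.min(age)  # Index of the observation with the minimum value
which.max(age)  # Index of the observation with the maximum value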
```
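
One practical wrinkle with these functions is missing data. The sketch below uses a small made-up `age` vector (an assumption for illustration) to show how `NA` values propagate unless `na.rm = TRUE` is set, and how `quantile()` accepts custom percentiles through `probs`.

```r
# Hypothetical ages with one missing value
age <- c(34, 51, 29, NA, 62, 45)

mean(age)                  # NA: missing values propagate by default
mean(age, na.rm = TRUE)    # drops the NA before averaging
median(age, na.rm = TRUE)
sd(age, na.rm = TRUE)

# Custom percentiles via the probs argument
quantile(age, probs = c(0.10, 0.50, 0.90), na.rm = TRUE)
```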
### **The Power of `summarize()`: Multiple Stats, One Command**
```r
enhanced_data %>%
  summarize(
    minimum = min(age),
    mean = mean(age),
    maximum = max(age)
  )
```
This creates a clean data frame with one row and three columns---perfect for organized output.
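
A natural extension, sketched here under assumptions, is to add `group_by()` so that `summarize()` returns one row per group instead of a single row. The small `enhanced_data` frame below is a made-up stand-in for the note's dataset, and its column values are invented.

```r
library(dplyr)

# Hypothetical stand-in for the note's enhanced_data
enhanced_data <- data.frame(
  age       = c(25, 40, 33, 58, 47, 61),
  education = c("HS", "HS", "College", "College", "Graduate", "Graduate")
)

age_by_education <- enhanced_data %>%
  group_by(education) %>%
  summarize(
    n        = n(),        # group size
    mean_age = mean(age),
    sd_age   = sd(age)
  )
age_by_education
```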
### **For Categorical Variables**
```r
# Counting approaches (choose what feels intuitive)
table(enhanced_data$education)      # Simple frequency table
count(enhanced_data, education)     # Tidy counting
# n() only works inside dplyr verbs such as summarize() or mutate()
enhanced_data %>%
  group_by(education) %>%
  summarize(n = n())
# Adding percentages
enhanced_data %>%
  count(education) %>%
  mutate(percent = n / sum(n) * 100)
```
> [!warning] Remember to multiply by 100 for percentages! `n / sum(n)` gives proportions (0.25), while `n / sum(n) * 100` gives percentages (25%).
---
## Professional Tables: From Manual Labor to Automated Excellence
### **The Problem with Manual Tables**
Traditional approach: Calculate statistics → Copy numbers → Paste into Word → Format → Hope for no errors
**Issues**: Time-consuming, error-prone, not reproducible
### **The `tbl_summary()` Solution**
Think of `tbl_summary()` as your automatic table generator---it analyzes variable types and applies appropriate statistics without manual specification.
```r
# Basic usage
enhanced_data %>%
  select(age, education, income) %>%
  tbl_summary()
```
**What happens automatically:**
- **Numeric variables**: Displays median with 25th and 75th percentiles (quantiles)
- **Categorical variables**: Shows count and percentage for each level
- **Professional formatting**: Ready for publication
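
To see these defaults without needing the course data, the sketch below runs `tbl_summary()` on `trial`, an example dataset that ships with gtsummary. Using `trial` and these three columns is an assumption made for illustration; the note's own examples use `enhanced_data`.

```r
library(dplyr)
library(gtsummary)

# `trial` is an example dataset bundled with gtsummary
default_tbl <- trial %>%
  select(age, grade, response) %>%
  tbl_summary()

default_tbl
```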
### **Customization Options**
You can customize labels, decimal places, and statistics:
```r
enhanced_data %>%
  tbl_summary(
    # Custom labels
    label = list(age ~ "Participant Age"),
    # Decimal places (transcript example: age with no decimal places)
    digits = list(age ~ 0),
    # Different statistics (transcript example: mean and standard deviation for age)
    statistic = list(age ~ "{mean} ({sd})")
  )
```
**Key customization options from the transcript:**
- **Labels** (`label`): Give meaningful names to variables
- **Digits** (`digits`): Control decimal places displayed
- **Statistics** (`statistic`): Choose which descriptive statistics to calculate
- **Variable types** (`type`): Override the type when the function has not identified it correctly
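
The variable-type override has no code above, so here is a hedged sketch of how it might look. It again assumes gtsummary's bundled `trial` data, where `response` is coded 0/1; the label text is made up.

```r
library(dplyr)
library(gtsummary)

typed_tbl <- trial %>%
  select(age, response) %>%
  tbl_summary(
    # response is coded 0/1; request one row per level instead of a single row
    type = list(response ~ "categorical"),
    label = list(response ~ "Tumor Response"),
    statistic = list(age ~ "{mean} ({sd})"),
    digits = list(age ~ 1)
  )

typed_tbl
```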
> [!tip] Reproducibility Benefits
>
> - **No copy-paste errors**: Direct R-to-output pipeline, as mentioned in the transcript
> - **Update automatically**: Change data, re-run code, get updated table
> - **More time for interpretation**: Less manual work means more time analyzing results
> - **Professional formatting**: Ready for sharing with collaborators or the public
> - **Coding infrastructure**: Set up exactly the way you want through code
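
One possible way to close the loop from R to a shareable file, sketched under assumptions: convert the summary table to a plain data frame with `as_tibble()` and write it out. The `trial` dataset and the temporary output path are illustrative choices, and other export routes may fit a given workflow better.

```r
library(dplyr)
library(gtsummary)

summary_tbl <- trial %>%
  select(age, grade) %>%
  tbl_summary()

# Convert the formatted table to a plain data frame and write it to disk.
# The output path is a temporary file chosen for illustration.
out_file <- tempfile(fileext = ".csv")
summary_df <- as_tibble(summary_tbl)
write.csv(summary_df, out_file, row.names = FALSE)
```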
---
## Real-World Application: Building Your First Descriptive Table
> [!example]- Descriptive Table
> Analyzing a health study with participant age, education level, and blood pressure readings:
>
> ```r
> # Step 1: Understand your variables
> health_data %>% glimpse()
>
> # Step 2: Explore individual variables
> health_data %>% summarize(
>   age_mean = mean(age),
>   age_sd = sd(age),
>   bp_median = median(blood_pressure)
> )
>
> # Step 3: Categorical summaries
> health_data %>%
>   count(education) %>%
>   mutate(percent = n / sum(n) * 100)
>
> # Step 4: Professional table
> health_data %>%
>   select(age, education, blood_pressure) %>%
>   tbl_summary()
> ```
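
The steps above assume a `health_data` object already exists. To try them without real data, one option is a simulated stand-in like the sketch below; every value in it (sample size, means, education levels, seed) is invented for illustration.

```r
library(dplyr)

set.seed(669)  # arbitrary seed so the simulated data is repeatable
n_people <- 200
health_data <- data.frame(
  age            = round(rnorm(n_people, mean = 50, sd = 12)),
  education      = sample(c("High school", "College", "Graduate"),
                          n_people, replace = TRUE),
  blood_pressure = round(rnorm(n_people, mean = 125, sd = 15))
)

# Step 3 from the example, run against the simulated data
education_pct <- health_data %>%
  count(education) %>%
  mutate(percent = n / sum(n) * 100)
education_pct
```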
---
## Connecting the Concepts: From Understanding to Communication
**The Flow of Descriptive Analysis:**
- Raw Data → Identify Variable Types → Choose Appropriate Statistics → Calculate with R Functions → Create Professional Tables → Understand Sample Characteristics → Inform Next Analysis Steps
> [!insight] The Bigger Picture
> Descriptive statistics aren't just numbers---they're the foundation for everything that follows. Understanding your sample's characteristics helps you choose appropriate tests, interpret results in context, and communicate findings effectively.
**Match your tool to your data type.** Numeric variables need measures of center and spread; categorical variables need counts and percentages. R's intuitive function names make this natural, and `tbl_summary()` automates the entire process while maintaining professional standards.
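
As a closing illustration of matching the tool to the data type, here is a minimal base R sketch. The helper `describe_var()` is hypothetical (it is not part of any package) and only mimics, in a simplified way, the type-based choice that `tbl_summary()` makes automatically.

```r
# Hypothetical helper: pick summary statistics from the variable's type
describe_var <- function(x) {
  if (is.numeric(x)) {
    # Numeric: center and spread
    c(
      mean   = mean(x, na.rm = TRUE),
      sd     = sd(x, na.rm = TRUE),
      median = median(x, na.rm = TRUE),
      iqr    = IQR(x, na.rm = TRUE)
    )
  } else {
    # Categorical: count and percentage per level
    counts <- table(x)
    data.frame(
      level   = names(counts),
      n       = as.vector(counts),
      percent = as.vector(prop.table(counts)) * 100
    )
  }
}

num_result <- describe_var(c(23, 35, 41, 58))
cat_result <- describe_var(c("HS", "College", "College", "Graduate"))
num_result
cat_result
```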