<sub>2025-05-29</sub>
<sub>#data-visualization #data-management #r-programming #statistical-analysis #hmp669</sub>
<sup>[[maps-of-content|🌐 Maps of Content — All Notes]] </sup>
<sup>Series: [[hmp669|HMP 669 — Data Management and Visualization]]</sup>
<sup>Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]</sup>

# Data Exploration: Ordering, Filtering & Bivariate Analysis

> [!abstract]- Overview
>
> Essential data manipulation techniques that transform raw datasets into focused, comparable groups for meaningful analysis.
>
> **Key Concepts**: Three fundamental data exploration skills
>
> - **Ordering**: Arrange data systematically using `arrange()`
> - **Filtering**: Create targeted subsets using `filter()`
> - **Bivariate Analysis**: Compare distributions across groups simultaneously
>
> **Critical Connections**: Each technique builds toward the ultimate goal of comparing how variables behave differently across groups in your data
>
> **Must Remember**: These aren't just data cleaning steps; they're the foundation for discovering patterns and relationships that drive insights

> [!info]- Required R Packages
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`dplyr`|Data manipulation (part of tidyverse)|`install.packages("tidyverse")`|
> |`gtsummary`|Professional summary tables|`install.packages("gtsummary")`|
>
> **Setup code:**
>
> ```r
> library(dplyr) # or library(tidyverse)
> library(gtsummary)
> ```
>
> **Note**: Base R functions (mean, median, sd, table, etc.) require no additional packages.

> [!code]- Syntax Reference
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Data Ordering**|||
> |`arrange(column)`|Sort ascending|`data %>% arrange(age)`|
> |`arrange(desc(column))`|Sort descending|`data %>% arrange(desc(age))`|
> |`arrange(col1, col2)`|Sort by multiple columns|`data %>% arrange(category, desc(score))`|
> |**Data Filtering**|||
> |`filter(condition)`|Keep rows meeting criteria|`data %>% filter(disease == "case")`|
> |`filter(var >= value)`|Numeric comparisons|`data %>% filter(age >= 65)`|
> |`filter(var %in% c(...))`|Multiple values (OR)|`data %>% filter(education %in% c("HS", "College"))`|
> |`filter(cond1, cond2)`|Multiple conditions (AND)|`data %>% filter(age >= 65, sex == "Female")`|
> |**Bivariate Analysis: group_by Method**|||
> |`group_by(variable)`|Start grouping|`data %>% group_by(sex)`|
> |`group_by(var1, var2)`|Multiple grouping variables|`data %>% group_by(sex, age_group)`|
> |`summarise(stat = function(var))`|Calculate group statistics|`summarise(mean_age = mean(age))`|
> |`n()`|Count observations per group|`summarise(count = n())`|
> |`ungroup()`|End grouping|Always close with `ungroup()`|
> |**Bivariate Analysis: tbl_summary Method**|||
> |`tbl_summary(by = variable)`|Professional bivariate table|`select(vars) %>% tbl_summary(by = sex)`|
> |`add_p()`|Add statistical tests|`tbl_summary(by = group) %>% add_p()`|
> |`modify_header()`|Customize column headers|`modify_header(label ~ "**Variable**")`|
> |**Common Summary Functions**|||
> |`mean(var, na.rm = TRUE)`|Calculate mean|Handle missing values with na.rm|
> |`median(var, na.rm = TRUE)`|Calculate median|More robust to outliers|
> |`sd(var, na.rm = TRUE)`|Standard deviation|Measure of variability|
> |`min(var, na.rm = TRUE)`|Minimum value|Useful for ranges|
> |`max(var, na.rm = TRUE)`|Maximum value|Useful for ranges|
> |`quantile(var, 0.25, na.rm = TRUE)`|Calculate percentiles|Replace 0.25 with desired percentile|

---

## The Journey from Chaos to Clarity

Raw data arrives in whatever order it happened to be collected: participant 1047 might be 23 years old, followed by participant 2 who's 67, then participant 891 who's 31. **This randomness obscures patterns.** Your first job as a data analyst is to bring order to this chaos, creating clear views that reveal the stories hidden in your data.

## Part 1: Bringing Order with `arrange()`

**The foundation of insight is organization.** The `arrange()` function is your primary tool for creating logical sequences in your data.

### Basic Ordering Patterns

```r
# Start with random order
nhanes_data %>%
  select(participant_id, age) %>%
  head()
# participant_id age
# 1047           23
# 2              67
# 891            31
```

**Ascending order** (smallest to largest) is the default:

```r
nhanes_data %>%
  arrange(age) %>%
  select(participant_id, age) %>%
  head()
# Illustrative output, showing the same three participants
# participant_id age
# 1047           23
# 891            31
# 2              67
```

**Descending order** uses the `desc()` wrapper:

```r
nhanes_data %>%
  arrange(desc(age)) %>%
  select(participant_id, age) %>%
  head()
# Illustrative output, showing the same three participants
# participant_id age
# 2              67
# 891            31
# 1047           23
```

> [!tip] When to Use Each Approach
>
> - **Ascending**: Finding youngest participants, lowest scores, earliest dates
> - **Descending**: Identifying oldest participants, highest values, most recent events
> - **Multiple variables**: `arrange(category, desc(score))` sorts by category first, then by score within each category

## Part 2: Focus with `filter()`

**Every analysis begins with a question about a specific group.** The `filter()` function transforms your broad dataset into a focused subset that matches your research interest.
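
Before the course examples, the ordering from Part 1 and the filtering introduced here can be tried on a small made-up data frame. This is only a sketch: the object `toy` and all of its values are hypothetical, and only the column names mirror those used in this note. It also shows how each function treats missing values.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  participant_id = c(1047, 2, 891, 15),
  age            = c(23, 67, 31, NA),
  disease        = c("control", "case", "case", "control")
)

# arrange() sorts ascending by default; rows with NA go to the end
toy_sorted <- toy %>% arrange(age)

# desc() reverses the order; rows with NA still go to the end
toy_desc <- toy %>% arrange(desc(age))

# filter() keeps rows where the condition is TRUE,
# so the row whose age is NA is dropped
adults_30_plus <- toy %>% filter(age >= 30)

toy_sorted$participant_id      # 1047  891    2   15
toy_desc$participant_id        #    2  891 1047   15
adults_30_plus$participant_id  #    2  891
```

Note that `filter()` keeps the original row order, so sort explicitly with `arrange()` if the order of the subset matters.
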
### Creating Targeted Subsets

```r
# Create a subset of only cases (people with the disease)
cases_only <- nhanes_data %>%
  filter(disease == "case")

# Use the subset for focused analysis
cases_only %>%
  summarise(
    n_cases = n(),
    mean_age = mean(age, na.rm = TRUE),
    median_age = median(age, na.rm = TRUE)
  )
```

### Common Filtering Patterns

```r
# Single condition
nhanes_data %>% filter(age >= 65)                # Seniors only
nhanes_data %>% filter(education == "College")   # College graduates only
nhanes_data %>% filter(insurance == "Private")   # Privately insured only

# Multiple conditions (AND logic)
nhanes_data %>% filter(age >= 65, sex == "Female")   # Women aged 65 and older
nhanes_data %>% filter(disease == "case", age < 50)  # Cases younger than 50

# OR logic using %in%
nhanes_data %>% filter(education %in% c("High School", "College"))
```

## Part 3: Bivariate Analysis - Seeing Relationships

Bivariate descriptive statistics let you compare how one variable behaves across different groups of another variable. Instead of asking "What's the average age?" you ask "How does average age differ between cases and controls?"

### The Conceptual Shift

**Univariate thinking**: "The average age in our sample is 45.2 years."

**Bivariate thinking**: "Cases average 52.1 years while controls average 41.8 years, a potentially meaningful difference of about 10 years."

### Method 1: The Two-Step Approach

This method separates filtering and calculation:

```r
# Step 1: Filter into one subset per group
cases <- nhanes_data %>%
  filter(disease == "case")

controls <- nhanes_data %>%
  filter(disease == "control")

# Step 2: Calculate statistics separately
cases %>%
  summarise(mean_age = mean(age, na.rm = TRUE))

controls %>%
  summarise(mean_age = mean(age, na.rm = TRUE))
```

> [!warning] This approach becomes unwieldy with multiple groups or complex comparisons. It's perfect for quick checks but inefficient for comprehensive analysis.
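
A minimal sketch of the two-step approach on made-up data shows where the extra effort comes from: each group needs its own filter and its own summary, and the comparison has to be assembled by hand. The data frame `toy` and its values are hypothetical and are not taken from the course dataset.

```r
library(dplyr)

# Hypothetical toy data; values are invented for illustration only
toy <- data.frame(
  disease = c("case", "case", "control", "control", "control"),
  age     = c(50, 60, 30, 40, NA)
)

# Step 1: one filter per group
cases    <- toy %>% filter(disease == "case")
controls <- toy %>% filter(disease == "control")

# Step 2: one summary per group; pull() extracts the single value
mean_age_cases <- cases %>%
  summarise(mean_age = mean(age, na.rm = TRUE)) %>%
  pull(mean_age)

mean_age_controls <- controls %>%
  summarise(mean_age = mean(age, na.rm = TRUE)) %>%
  pull(mean_age)

# The comparison itself is a manual third step
age_difference <- mean_age_cases - mean_age_controls
age_difference  # 20 (55 for cases minus 35 for controls)
```

With a third or fourth group, every one of these steps would have to be copied again, which is the repetition the warning above refers to.
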
### Method 2: The `group_by()` Workflow

**This is the workhorse of bivariate analysis.** The `group_by()` function creates invisible partitions in your data, allowing you to calculate statistics for each group simultaneously.

#### The Three-Step Process

```r
nhanes_data %>%
  # Step 1: Open grouping
  group_by(disease) %>%
  # Step 2: Calculate desired statistics
  summarise(
    n_participants = n(),
    min_age = min(age, na.rm = TRUE),
    mean_age = mean(age, na.rm = TRUE),
    max_age = max(age, na.rm = TRUE)
  ) %>%
  # Step 3: Close grouping
  ungroup()
```

**Output example:**

```plaintext
# A tibble: 2 × 5
  disease n_participants min_age mean_age max_age
  <chr>            <int>   <dbl>    <dbl>   <dbl>
1 case               156      22     52.1      78
2 control            312      18     41.8      82
```

> [!tip] Why This Method Excels
>
> - **Flexible**: Add any statistic you need
> - **Efficient**: One operation handles all groups
> - **Expandable**: Works with multiple grouping variables
> - **Consistent**: Same syntax regardless of group count

### Method 3: Professional Tables with `tbl_summary()`

**For publication-ready output**, the `gtsummary` package creates beautifully formatted bivariate tables:

```r
nhanes_data %>%
  select(age, education, sex) %>%
  tbl_summary(by = sex) %>%
  add_p() # Adds statistical tests
```

This produces professional tables with:

- **Columns** for each group (Male, Female)
- **Rows** for each variable of interest
- **Automatic formatting** of percentages, means, medians
- **Statistical tests** comparing groups (when `add_p()` is included)

## Real-World Application: Understanding Health Disparities

> [!example]- Understanding Health Disparities
> **"How do health behaviors differ between insured and uninsured populations?"**
>
> ```r
> # Step 1: Order data to understand age distribution
> health_data %>%
>   arrange(age) %>%
>   head(10)
>
> # Step 2: Filter to focus on adults
> adults_only <- health_data %>%
>   filter(age >= 18)
>
> # Step 3: Compare health behaviors by insurance status
> adults_only %>%
>   group_by(insurance_status) %>%
>   summarise(
>     n = n(),
>     avg_age = mean(age, na.rm = TRUE),
>     pct_smokers = mean(smoker == "Yes", na.rm = TRUE) * 100,
>     avg_bmi = mean(bmi, na.rm = TRUE),
>     pct_regular_exercise = mean(exercise_regular == "Yes", na.rm = TRUE) * 100
>   ) %>%
>   ungroup()
>
> # Step 4: Create publication-ready table
> adults_only %>%
>   select(age, smoker, bmi, exercise_regular, insurance_status) %>%
>   tbl_summary(by = insurance_status) %>%
>   add_p() %>%
>   modify_header(label ~ "**Characteristic**") %>%
>   modify_spanning_header(c("stat_1", "stat_2") ~ "**Insurance Status**")
> ```
>
> > [!note] The Analysis Pipeline
> > Notice how each step builds toward the final insight:
> > - **ordering** reveals data structure,
> > - **filtering** creates the relevant population,
> > - and **bivariate analysis** uncovers group differences that might indicate health disparities.

## Connecting the Concepts: The Analytical Flow

```mermaid
graph TD
    A[Raw Dataset] --> B[arrange: Order for understanding]
    B --> C[filter: Focus on relevant groups]
    C --> D[Bivariate Analysis: Compare across groups]
    D --> E[group_by method: Flexible calculations]
    D --> F[tbl_summary method: Professional presentation]
    E --> G[Insights about group differences]
    F --> G
```

**Bivariate thinking transforms data analysis from description to discovery.** When you shift from asking "What happened?" to "How did what happened differ between groups?", you're no longer just summarizing; you're investigating. This cognitive shift is the foundation of epidemiology, social science research, and evidence-based decision making, and it helps you ask better questions and find more meaningful answers in your data.

---

Reference

- HMP 669