<sub>2025-05-29</sub> <sub>#data-visualization #data-management #r-programming #statistical-analysis #hmp669</sub>
<sup>[[maps-of-content|🌐 Maps of Content — All Notes]] </sup>
<sup>Series: [[hmp669|HMP 669 — Data Management and Visualization]]</sup>
<sup>Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]</sup>
# Data Manipulation with dplyr: The Five Essential Verbs
> [!abstract]- Overview
>
> dplyr provides five intuitive "verbs" that handle 90% of your data manipulation needs, allowing you to transform messy real-world data into analysis-ready datasets.
>
> **Key Concepts**: The Five Essential Operations
>
> - **filter** - Choose which rows to keep
> - **select** - Choose which columns to keep
> - **arrange** - Sort rows by values
> - **mutate** - Create new columns
> - **summarize** - Collapse data into summaries
>
> **Critical Connections**: These verbs work together through the pipe operator (%>%) to create readable, step-by-step data transformations that flow naturally from left to right.
>
> **Must Remember**: Think of dplyr verbs as actions you perform on a spreadsheet.

> [!info]- Required R Packages
>
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`dplyr`|Data manipulation (part of tidyverse)|`install.packages("dplyr")`|
> |`tidyverse`|Complete data science toolkit|`install.packages("tidyverse")`|
>
> **Setup code:**
>
>
> ```r
> library(tidyverse) # loads dplyr and other essential packages
> # OR just dplyr alone:
> library(dplyr)
> ```

> [!code]- Syntax Reference
>
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Row Operations**|||
> |`filter(condition)`|Keep rows meeting criteria|`filter(age > 25)`|
> |`filter(col %in% vector)`|Keep rows where column matches any value in vector|`filter(region %in% c("North", "South"))`|
> |`filter(col1 == "value" \| col2 > 10)`|OR logic (either condition)|`filter(status == "Active" \| score > 80)`|
> |`filter(col1 == "value" & col2 > 10)`|AND logic (both conditions)|`filter(age > 25 & income > 50000)`|
> |`arrange(column)`|Sort ascending|`arrange(age)`|
> |`arrange(desc(column))`|Sort descending|`arrange(desc(income))`|
> |`arrange(col1, desc(col2))`|Multi-column sort|`arrange(region, desc(age))`|
> |**Column Operations**|||
> |`select(col1, col2, col3)`|Select specific columns|`select(name, age, income)`|
> |`select(col1:col3)`|Select range of columns|`select(name:income)`|
> |`select(-column)`|Remove specific column|`select(-unwanted_col)`|
> |`select(-c(col1, col2))`|Remove multiple columns|`select(-c(temp_col, old_col))`|
> |`select(starts_with("prefix"))`|Select columns starting with text|`select(starts_with("score_"))`|
> |`select(ends_with("suffix"))`|Select columns ending with text|`select(ends_with("_total"))`|
> |`select(contains("text"))`|Select columns containing text|`select(contains("temp"))`|
> |`rename(new_name = old_name)`|Change column name|`rename(participant_id = id)`|
> |**Creating New Columns**|||
> |`mutate(new_col = expression)`|Create new column|`mutate(bmi = weight / height^2)`|
> |`mutate(col1 = expr1, col2 = expr2)`|Create multiple columns|`mutate(total = a + b, avg = total/2)`|
> |`mutate(col = ifelse(condition, true_val, false_val))`|Conditional column creation|`mutate(adult = ifelse(age >= 18, "Yes", "No"))`|
> |`mutate(col = case_when())`|Multiple condition column|`mutate(grade = case_when(score >= 90 ~ "A", score >= 80 ~ "B", TRUE ~ "C"))`|
> |`cut(column, breaks, labels)`|Convert numeric to categorical|`cut(age, breaks = c(0,30,60,Inf), labels = c("Young","Middle","Senior"))`|
> |**Summary Operations**|||
> |`summarize(stat = function(column))`|Create summary statistic|`summarize(avg_age = mean(age))`|
> |`summarize(stat1 = func1(), stat2 = func2())`|Multiple summaries|`summarize(avg = mean(score), count = n())`|
> |`group_by(column)`|Group data for operations|`group_by(region)`|
> |`group_by(col1, col2)`|Group by multiple columns|`group_by(region, gender)`|
> |`ungroup()`|Remove grouping|`ungroup()`|
> |**Pipe Operations**|||
> |`data %>% function()`|Chain operations left to right|`data %>% filter(age > 25) %>% select(name, age)`|
> |**Common Summary Functions**|||
> |`n()`|Count observations|`summarize(count = n())`|
> |`mean(column)`|Calculate average|`summarize(avg = mean(score))`|
> |`median(column)`|Calculate median|`summarize(med = median(income))`|
> |`sd(column)`|Standard deviation|`summarize(std_dev = sd(score))`|
> |`min(column)`, `max(column)`|Minimum/Maximum values|`summarize(min_age = min(age), max_age = max(age))`|
> |`sum(column)`|Sum values|`summarize(total = sum(sales))`|
> |`first(column)`, `last(column)`|First/Last values in group|`summarize(first_purchase = first(date))`|
> |**Logical Operators**|||
> |`==`|Equal to|`filter(status == "Active")`|
> |`!=`|Not equal to|`filter(status != "Inactive")`|
> |`>`, `>=`|Greater than, greater than or equal|`filter(age > 25)`|
> |`<`, `<=`|Less than, less than or equal|`filter(score <= 100)`|
> |`&`|AND (both conditions true)|`filter(age > 25 & income > 50000)`|
> |`\|`|OR (either condition true)|`filter(region == "North" \| region == "South")`|
> |`%in%`|Value in a vector|`filter(status %in% c("Active", "Pending"))`|
> |`is.na()`|Check for missing values|`filter(is.na(score))`|
> |`!is.na()`|Check for non-missing values|`filter(!is.na(score))`|
## The Data Manipulation Reality
**Real-world data is messy.** Your perfect analysis sits on the other side of a data transformation challenge. You need to:
- Filter out irrelevant observations - filter()
- Focus on specific variables of interest - select()
- Sort data in meaningful ways - arrange()
- Create new variables from existing ones - mutate()
- Summarize complex datasets into actionable insights - summarize()
dplyr provides verbs that mirror how you naturally think about data manipulation.
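
To preview how these fit together, here is a minimal sketch using a hypothetical `survey` tibble (the column names and values are invented for illustration):

```r
library(dplyr)

# Hypothetical example data (columns are assumptions for illustration)
survey <- tibble::tibble(
  age    = c(22, 34, 41, 29, 57),
  region = c("North", "South", "North", "East", "South"),
  income = c(38000, 52000, 61000, 45000, 70000)
)

survey %>%
  filter(age > 25) %>%                      # 1. keep respondents over 25
  select(region, age, income) %>%           # 2. keep only these columns
  arrange(desc(income)) %>%                 # 3. highest income first
  mutate(income_k = income / 1000) %>%      # 4. income in thousands
  group_by(region) %>%                      #    group before summarizing
  summarize(avg_income_k = mean(income_k))  # 5. one row per region
```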
---
## Understanding dplyr Verbs
Think of dplyr verbs as **actions you perform on a dataset**. Just as you might tell someone to "filter the list," "select the important columns," or "arrange by date," dplyr uses these same intuitive terms as function names.
> [!note] Verb vs. Function
> In dplyr context, "verb" and "function" mean the same thing. The tidyverse uses "verb" because these functions represent _actions_ you take on data.
### Rows vs. Columns
**Rows = Observations** (individual data points, cases, records)
**Columns = Variables** (characteristics, measurements, attributes)
```plaintext
        Variable 1   Variable 2   Variable 3
Obs 1   A            12           High
Obs 2   B            8            Low
Obs 3   C            15           High
```
This distinction determines which verb you need:
- **Row operations**: filter, arrange
- **Column operations**: select, mutate
- **Summary operations**: summarize (collapses both)
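
As a minimal sketch, the toy table above could be built as a tibble (all names and values are placeholders):

```r
library(tibble)

# 3 rows (observations) by 3 columns (variables), mirroring the diagram
toy_data <- tibble(
  variable_1 = c("A", "B", "C"),
  variable_2 = c(12, 8, 15),
  variable_3 = c("High", "Low", "High")
)
```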
---
## 1. filter() - Choose Your Rows
**Purpose**: Keep only the observations that meet your criteria.
Think of filtering like **using a sieve** - you define the holes (conditions), and only data that fits through survives.
Filter = chooses specific rows (typically observations)
![[data-manipulation-dplyr-1748584635501.webp]]
### Basic Filtering
```r
# Keep only rows where age is greater than 25
data %>% filter(age > 25)
# Keep only rows where status equals "Active"
data %>% filter(status == "Active")
# Keep only rows where score is between 80 and 100
data %>% filter(score >= 80 & score <= 100)
```
### Advanced Filtering
```r
# Multiple conditions (AND logic)
data %>% filter(age > 25 & income > 50000)
# Either/or conditions (OR logic) - use the | symbol
data %>% filter(category == "A" | category == "B")
# Exclude specific values
data %>% filter(status != "Inactive")
# Filter based on multiple values
data %>% filter(region %in% c("North", "South", "East"))
```
> [!tip] The Vertical Bar (|) = "OR"
> When you see `|` in filter conditions, read it as "OR". This is R's way of saying "this condition OR that condition."
---
## 2. select() - Choose Your Columns
**Purpose**: Keep only the variables you need for your analysis.
Think of selecting like **choosing items from a menu** - you specify exactly what you want.
Select = chooses specific columns (typically variables)
![[data-manipulation-dplyr-1748584667029.webp]]
### Basic Selection
```r
# Select specific columns by name
data %>% select(name, age, income)
# Select a range of columns using the colon operator
data %>% select(name:income) # All columns from 'name' to 'income'
# Select columns by position
data %>% select(1:3) # First three columns
```
### Advanced Selection
```r
# Remove specific columns (use minus sign)
data %>% select(-unwanted_column)
data %>% select(-c(col1, col2, col3)) # Remove multiple columns
# Select columns that start with specific text
data %>% select(starts_with("temp_"))
# Select columns that contain specific text
data %>% select(contains("score"))
# Select columns that match a pattern
data %>% select(matches("^data_\\d+$")) # Columns like data_1, data_2, etc.
```
> [!note] The Colon Operator (:)
> In `select()`, the colon creates a sequence. `name:income` means "select all columns from 'name' through 'income' in order."
---
## 3. arrange() - Sort Your Rows
**Purpose**: Reorder rows based on the values in one or more columns.
Think of arranging like **organizing a filing cabinet** - you choose the sorting criteria.
![[data-manipulation-dplyr-1748584380736.webp]]
### Basic Arranging
```r
# Sort by one column (ascending by default)
data %>% arrange(age)
# Sort by one column (descending)
data %>% arrange(desc(age))
# Sort by multiple columns
data %>% arrange(region, desc(income)) # First by region, then by income (descending)
```
### The Power of Multiple Sorts
```r
# Complex sorting: region alphabetically, then age descending, then income ascending
data %>% arrange(region, desc(age), income)
```
This creates a hierarchical sort - like organizing students first by grade, then by GPA within each grade.
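
For instance, with a hypothetical `students` data frame (names and values invented for illustration), sorting by grade and then by GPA within each grade looks like this:

```r
library(dplyr)

# Hypothetical data for illustration
students <- tibble::tibble(
  name  = c("Ana", "Ben", "Cara", "Dev"),
  grade = c(11, 10, 10, 11),
  gpa   = c(3.9, 3.2, 3.8, 3.1)
)

students %>%
  arrange(grade, desc(gpa))
# All grade-10 rows come first (Cara 3.8, then Ben 3.2),
# followed by the grade-11 rows (Ana 3.9, then Dev 3.1)
```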
---
## 4. mutate() - Create New Columns
**Purpose**: Add new variables to your dataset, typically calculated from existing variables.
Think of mutating like **adding a calculated field in a spreadsheet** - you create new information from what you already have.
![[data-manipulation-dplyr-1748584396979.webp]]
### Basic Mutation
```r
# Create a new column based on existing columns
data %>% mutate(total_score = math_score + english_score)
# Create multiple new columns at once
data %>% mutate(
bmi = weight / (height^2),
age_group = ifelse(age >= 65, "Senior", "Non-Senior"),
log_income = log(income)
)
```
### Advanced Mutation
```r
# Use conditional logic
data %>% mutate(
performance = case_when(
score >= 90 ~ "Excellent",
score >= 80 ~ "Good",
score >= 70 ~ "Fair",
TRUE ~ "Needs Improvement"
)
)
# Create categories from continuous variables using cut()
data %>% mutate(
income_bracket = cut(income,
breaks = c(0, 30000, 60000, 100000, Inf),
labels = c("Low", "Medium", "High", "Very High"))
)
```
### rename() - Change Column Names
Sometimes you don't want a new column - you just want to fix a name:
```r
# Change column names
data %>% rename(new_name = old_name)
data %>% rename(participant_id = id, test_score = score)
```
> [!tip] Tidy Variable Names
> Choose consistent naming conventions:
>
> - **Snake case**: `participant_age`, `test_score` (recommended)
> - **Camel case**: `participantAge`, `testScore`
>
> Pick one style and stick with it. Don't create "mutant snake-camel hybrids"!
---
## 5. summarize() - Collapse to Insights
**Purpose**: Reduce your dataset to summary statistics that tell the story of your data.
Think of summarizing like **creating an executive summary** - you distill the essential information.
![[data-manipulation-dplyr-1748584417807.webp]]
### Basic Summarizing
```r
# Single summary statistic
data %>% summarize(avg_age = mean(age))
# Multiple summary statistics
data %>% summarize(
avg_age = mean(age),
median_income = median(income),
total_participants = n(),
std_dev_score = sd(test_score)
)
```
### The Power Combo: group_by() + summarize()
The real magic happens when you combine `group_by()` with `summarize()`:
```r
# Summarize by groups
data %>%
group_by(region) %>%
summarize(
avg_income = mean(income),
count = n(),
max_age = max(age)
)
# Multiple grouping variables
data %>%
group_by(region, gender) %>%
summarize(
avg_score = mean(test_score),
participants = n()
)
```
> [!note] group_by() Changes Everything
> Once you use `group_by()`, all subsequent operations happen _within each group_ until you use `ungroup()`. It's like creating separate mini-datasets for each group.
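
A minimal sketch of this behavior, assuming a small made-up `sales` data frame: the grouped `mutate()` computes its mean within each region, while the same calculation after `ungroup()` uses every row.

```r
library(dplyr)

# Hypothetical data for illustration
sales <- tibble::tibble(
  region = c("North", "North", "South", "South"),
  income = c(40000, 60000, 80000, 100000)
)

sales %>%
  group_by(region) %>%
  mutate(region_avg = mean(income)) %>%   # within each region: 50000 (North), 90000 (South)
  ungroup() %>%
  mutate(overall_avg = mean(income))      # across all rows: 70000
```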
---
## The Pipe Operator (%>%) - Connecting the Flow
**The game-changer**: The pipe operator lets you chain operations together in a readable, left-to-right flow.
### Without Pipes (Hard to Read)
```r
# Nested functions - read from inside out
result <- arrange(select(filter(data, age > 25), name, age, income), desc(income))
```
### With Pipes (Natural Flow)
```r
# Step-by-step flow - read from left to right
result <- data %>%
filter(age > 25) %>%
select(name, age, income) %>%
arrange(desc(income))
```
**Think of the pipe as "then"**:
- Take the data, **then**
- Filter for age > 25, **then**
- Select name, age, and income, **then**
- Arrange by income in descending order
> [!tip] Reading Pipes Aloud
> The pipe `%>%` can be read as "then" or "and then". This makes your code sound like natural instructions: "Take the data, then filter it, then select these columns, then arrange by income."
---
## Real-World Application: Population Health Analysis
> [!example]- Scenario: All five verbs working together
>
> "What are the key health patterns and risk factors across different age groups and regions in our population health survey?"
>
>
> ```r
> # Analyze health outcomes from the latest population survey
> health_insights <- health_survey_data %>%
> # Step 1: Focus on complete, recent survey responses
>   filter(survey_year == 2023 & !is.na(bmi) & !is.na(systolic_bp)) %>%
>
> # Step 2: Select relevant health and demographic variables
> select(patient_id, region, age, gender, bmi, systolic_bp,
> diabetes_status, smoking_status, exercise_hours_weekly) %>%
>
> # Step 3: Create new health classification variables
> mutate(
> bmi_category = case_when(
> bmi < 18.5 ~ "Underweight",
> bmi < 25 ~ "Normal",
> bmi < 30 ~ "Overweight",
> TRUE ~ "Obese"
> ),
> age_group = cut(age, breaks = c(0, 30, 50, 65, Inf),
> labels = c("Under 30", "30-49", "50-64", "65+")),
> bp_category = case_when(
> systolic_bp < 120 ~ "Normal",
> systolic_bp < 140 ~ "Elevated",
> TRUE ~ "Hypertensive"
> ),
>     risk_score = factor(
>       case_when(
>         diabetes_status == "Yes" & smoking_status == "Current" ~ "High Risk",
>         diabetes_status == "Yes" | smoking_status == "Current" ~ "Moderate Risk",
>         TRUE ~ "Low Risk"
>       ),
>       levels = c("High Risk", "Moderate Risk", "Low Risk")
>     )
> ) %>%
>
>   # Step 4: Sort so the highest-risk populations appear first (uses the factor levels set above)
>   arrange(risk_score, region, age_group) %>%
>
> # Step 5: Create summary health profiles by demographics
> group_by(region, age_group, risk_score) %>%
> summarize(
> population_count = n(),
> avg_bmi = round(mean(bmi), 1),
> pct_hypertensive = round(mean(bp_category == "Hypertensive") * 100, 1),
> avg_exercise_hours = round(mean(exercise_hours_weekly, na.rm = TRUE), 1),
> .groups = "drop" # Ungroup after summarizing
> )
> ```
---
## Connecting the Concepts
The dplyr philosophy creates a coherent system where:
- Each verb has one clear purpose - no confusion about what it does
- Verbs combine naturally - use pipes to chain operations
- Code reads like instructions - filter this, then select that, then arrange by this
- Complex analysis becomes approachable - break big problems into small, clear steps
---
## Reference
- [[hmp669|HMP 669 — Data Management and Visualization]]