<sub>2025-05-29</sub> <sub>#data-visualization #data-management #r-programming #statistical-analysis #hmp669</sub>
<sup>[[maps-of-content|🌐 Maps of Content — All Notes]]</sup> <sup>Series: [[hmp669|HMP 669 — Data Management and Visualization]]</sup> <sup>Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]</sup>

# Data Manipulation with dplyr: The Five Essential Verbs

> [!abstract]- Overview
>
> dplyr provides five intuitive "verbs" that handle 90% of your data manipulation needs, allowing you to transform messy real-world data into analysis-ready datasets.
>
> **Key Concepts**: The Five Essential Operations
>
> - **filter** - Choose which rows to keep
> - **select** - Choose which columns to keep
> - **arrange** - Sort rows by values
> - **mutate** - Create new columns
> - **summarize** - Collapse data into summaries
>
> **Critical Connections**: These verbs work together through the pipe operator (%>%) to create readable, step-by-step data transformations that flow naturally from left to right.
>
> **Must Remember**: Think of dplyr verbs as actions you perform on a spreadsheet.

> [!info]- Required R Packages
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`dplyr`|Data manipulation (part of tidyverse)|`install.packages("tidyverse")`|
> |`tidyverse`|Complete data science toolkit|`install.packages("tidyverse")`|
>
> **Setup code:**
>
> ```r
> library(tidyverse) # loads dplyr and other essential packages
> # OR just dplyr alone:
> library(dplyr)
> ```

> [!code]- Syntax Reference
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Row Operations**|||
> |`filter(condition)`|Keep rows meeting criteria|`filter(age > 25)`|
> |`filter(col %in% vector)`|Keep rows where column matches any value in vector|`filter(region %in% c("North", "South"))`|
> |`filter(col1 == "value" \| col2 > 10)`|OR logic (either condition)|`filter(status == "Active" \| score > 80)`|
> |`filter(col1 == "value" & col2 > 10)`|AND logic (both conditions)|`filter(age > 25 & income > 50000)`|
> |`arrange(column)`|Sort ascending|`arrange(age)`|
> |`arrange(desc(column))`|Sort descending|`arrange(desc(income))`|
> |`arrange(col1, desc(col2))`|Multi-column sort|`arrange(region, desc(age))`|
> |**Column Operations**|||
> |`select(col1, col2, col3)`|Select specific columns|`select(name, age, income)`|
> |`select(col1:col3)`|Select range of columns|`select(name:income)`|
> |`select(-column)`|Remove specific column|`select(-unwanted_col)`|
> |`select(-c(col1, col2))`|Remove multiple columns|`select(-c(temp_col, old_col))`|
> |`select(starts_with("prefix"))`|Select columns starting with text|`select(starts_with("score_"))`|
> |`select(ends_with("suffix"))`|Select columns ending with text|`select(ends_with("_total"))`|
> |`select(contains("text"))`|Select columns containing text|`select(contains("temp"))`|
> |`rename(new_name = old_name)`|Change column name|`rename(participant_id = id)`|
> |**Creating New Columns**|||
> |`mutate(new_col = expression)`|Create new column|`mutate(bmi = weight / height^2)`|
> |`mutate(col1 = expr1, col2 = expr2)`|Create multiple columns|`mutate(total = a + b, avg = total/2)`|
> |`mutate(col = ifelse(condition, true_val, false_val))`|Conditional column creation|`mutate(adult = ifelse(age >= 18, "Yes", "No"))`|
> |`mutate(col = case_when())`|Multiple condition column|`mutate(grade = case_when(score >= 90 ~ "A", score >= 80 ~ "B", TRUE ~ "C"))`|
> |`cut(column, breaks, labels)`|Convert numeric to categorical|`cut(age, breaks = c(0,30,60,Inf), labels = c("Young","Middle","Senior"))`|
> |**Summary Operations**|||
> |`summarize(stat = function(column))`|Create summary statistic|`summarize(avg_age = mean(age))`|
> |`summarize(stat1 = func1(), stat2 = func2())`|Multiple summaries|`summarize(avg = mean(score), count = n())`|
> |`group_by(column)`|Group data for operations|`group_by(region)`|
> |`group_by(col1, col2)`|Group by multiple columns|`group_by(region, gender)`|
> |`ungroup()`|Remove grouping|`ungroup()`|
> |**Pipe Operations**|||
> |`data %>% function()`|Chain operations left to right|`data %>% filter(age > 25) %>% select(name, age)`|
> |**Common Summary Functions**|||
> |`n()`|Count observations|`summarize(count = n())`|
> |`mean(column)`|Calculate average|`summarize(avg = mean(score))`|
> |`median(column)`|Calculate median|`summarize(med = median(income))`|
> |`sd(column)`|Standard deviation|`summarize(std_dev = sd(score))`|
> |`min(column)`, `max(column)`|Minimum/Maximum values|`summarize(min_age = min(age), max_age = max(age))`|
> |`sum(column)`|Sum values|`summarize(total = sum(sales))`|
> |`first(column)`, `last(column)`|First/Last values in group|`summarize(first_purchase = first(date))`|
> |**Logical Operators**|||
> |`==`|Equal to|`filter(status == "Active")`|
> |`!=`|Not equal to|`filter(status != "Inactive")`|
> |`>`, `>=`|Greater than, greater than or equal|`filter(age > 25)`|
> |`<`, `<=`|Less than, less than or equal|`filter(score <= 100)`|
> |`&`|AND (both conditions true)|`filter(age > 25 & income > 50000)`|
> |`\|`|OR (either condition true)|`filter(region == "North" \| region == "South")`|
> |`%in%`|Value in a vector|`filter(status %in% c("Active", "Pending"))`|
> |`is.na()`|Check for missing values|`filter(is.na(score))`|
> |`!is.na()`|Check for non-missing values|`filter(!is.na(score))`|

## The Data Manipulation Reality

**Real-world data is messy.** Your perfect analysis sits on the other side of a data transformation challenge. You need to:

- Filter out irrelevant observations - `filter()`
- Focus on specific variables of interest - `select()`
- Sort data in meaningful ways - `arrange()`
- Create new variables from existing ones - `mutate()`
- Summarize complex datasets into actionable insights - `summarize()`

dplyr provides verbs that mirror how you naturally think about data manipulation.

---

## Understanding dplyr Verbs

Think of dplyr verbs as **actions you perform on a dataset**. Just as you might tell someone to "filter the list," "select the important columns," or "arrange by date," dplyr uses these same intuitive terms as function names.

> [!note] Verb vs. Function
> In dplyr context, "verb" and "function" mean the same thing. The tidyverse uses "verb" because these functions represent _actions_ you take on data.

### Rows vs. Columns

**Rows = Observations** (individual data points, cases, records)
**Columns = Variables** (characteristics, measurements, attributes)

```plaintext
        Variable 1   Variable 2   Variable 3
Obs 1   A            12           High
Obs 2   B            8            Low
Obs 3   C            15           High
```

This distinction determines which verb you need:

- **Row operations**: filter, arrange
- **Column operations**: select, mutate
- **Summary operations**: summarize (collapses both)

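The snippets in the sections below operate on a generic `data` object. As a minimal sketch, you could build a small practice tibble like the one here; the column names (`name`, `age`, `income`, `region`, `score`, `status`) are hypothetical, chosen to match many of the examples that follow. Snippets that reference other columns (such as `weight`, `height`, or `gender`) would need analogous columns added.

```r
library(dplyr)

# Hypothetical practice dataset; columns mirror those used in many examples below
data <- tibble(
  name   = c("Ana", "Ben", "Cara", "Dev"),
  age    = c(22, 31, 45, 67),
  income = c(28000, 52000, 61000, 43000),
  region = c("North", "South", "North", "East"),
  score  = c(88, 74, 95, 81),
  status = c("Active", "Inactive", "Active", "Active")
)
```
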
---

## 1. filter() - Choose Your Rows

**Purpose**: Keep only the observations that meet your criteria.

Think of filtering like **using a sieve** - you define the holes (conditions), and only data that fits through survives.

Filter = chooses specific rows (typically observations)

![[data-manipulation-dplyr-1748584635501.webp]]

### Basic Filtering

```r
# Keep only rows where age is greater than 25
data %>% filter(age > 25)

# Keep only rows where status equals "Active"
data %>% filter(status == "Active")

# Keep only rows where score is between 80 and 100
data %>% filter(score >= 80 & score <= 100)
```

### Advanced Filtering

```r
# Multiple conditions (AND logic)
data %>% filter(age > 25 & income > 50000)

# Either/or conditions (OR logic) - use the | symbol
data %>% filter(category == "A" | category == "B")

# Exclude specific values
data %>% filter(status != "Inactive")

# Filter based on multiple values
data %>% filter(region %in% c("North", "South", "East"))
```

> [!tip] The Vertical Bar (|) = "OR"
> When you see `|` in filter conditions, read it as "OR". This is R's way of saying "this condition OR that condition."

---

## 2. select() - Choose Your Columns

**Purpose**: Keep only the variables you need for your analysis.

Think of selecting like **choosing items from a menu** - you specify exactly what you want.

Select = chooses specific columns (typically variables)

![[data-manipulation-dplyr-1748584667029.webp]]

### Basic Selection

```r
# Select specific columns by name
data %>% select(name, age, income)

# Select a range of columns using the colon operator
data %>% select(name:income)  # All columns from 'name' to 'income'

# Select columns by position
data %>% select(1:3)  # First three columns
```

### Advanced Selection

```r
# Remove specific columns (use minus sign)
data %>% select(-unwanted_column)
data %>% select(-c(col1, col2, col3))  # Remove multiple columns

# Select columns that start with specific text
data %>% select(starts_with("temp_"))

# Select columns that contain specific text
data %>% select(contains("score"))

# Select columns that match a pattern
data %>% select(matches("^data_\\d+$"))  # Columns like data_1, data_2, etc.
```

> [!note] The Colon Operator (:)
> In `select()`, the colon creates a sequence. `name:income` means "select all columns from 'name' through 'income' in order."

---

## 3. arrange() - Sort Your Rows

**Purpose**: Reorder rows based on the values in one or more columns.

Think of arranging like **organizing a filing cabinet** - you choose the sorting criteria.

![[data-manipulation-dplyr-1748584380736.webp]]

### Basic Arranging

```r
# Sort by one column (ascending by default)
data %>% arrange(age)

# Sort by one column (descending)
data %>% arrange(desc(age))

# Sort by multiple columns
data %>% arrange(region, desc(income))  # First by region, then by income (descending)
```

### The Power of Multiple Sorts

```r
# Complex sorting: region alphabetically, then age descending, then income ascending
data %>% arrange(region, desc(age), income)
```

This creates a hierarchical sort - like organizing students first by grade, then by GPA within each grade.

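To make the hierarchical sort concrete, here is what a multi-column sort would do to the hypothetical practice tibble sketched earlier; the expected row order is shown in comments.

```r
# Sort the hypothetical practice data by region (A-Z), then by income (high to low)
data %>% arrange(region, desc(income))
# Expected row order:
#   Dev   East   43000   (only "East" row)
#   Cara  North  61000   (highest income within "North")
#   Ana   North  28000
#   Ben   South  52000
```
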
---

## 4. mutate() - Create New Columns

**Purpose**: Add new variables to your dataset, typically calculated from existing variables.

Think of mutating like **adding a calculated field in a spreadsheet** - you create new information from what you already have.

![[data-manipulation-dplyr-1748584396979.webp]]

### Basic Mutation

```r
# Create a new column based on existing columns
data %>% mutate(total_score = math_score + english_score)

# Create multiple new columns at once
data %>% mutate(
  bmi = weight / (height^2),
  age_group = ifelse(age >= 65, "Senior", "Non-Senior"),
  log_income = log(income)
)
```

### Advanced Mutation

```r
# Use conditional logic
data %>% mutate(
  performance = case_when(
    score >= 90 ~ "Excellent",
    score >= 80 ~ "Good",
    score >= 70 ~ "Fair",
    TRUE ~ "Needs Improvement"
  )
)

# Create categories from continuous variables using cut()
data %>% mutate(
  income_bracket = cut(income,
                       breaks = c(0, 30000, 60000, 100000, Inf),
                       labels = c("Low", "Medium", "High", "Very High"))
)
```

### rename() - Change Column Names

Sometimes you don't want a new column - you just want to fix a name:

```r
# Change column names
data %>% rename(new_name = old_name)
data %>% rename(participant_id = id, test_score = score)
```

> [!tip] Tidy Variable Names
> Choose consistent naming conventions:
>
> - **Snake case**: `participant_age`, `test_score` (recommended)
> - **Camel case**: `participantAge`, `testScore`
>
> Pick one style and stick with it. Don't create "mutant snake-camel hybrids"!

---

## 5. summarize() - Collapse to Insights

**Purpose**: Reduce your dataset to summary statistics that tell the story of your data.

Think of summarizing like **creating an executive summary** - you distill the essential information.

![[data-manipulation-dplyr-1748584417807.webp]]

### Basic Summarizing

```r
# Single summary statistic
data %>% summarize(avg_age = mean(age))

# Multiple summary statistics
data %>% summarize(
  avg_age = mean(age),
  median_income = median(income),
  total_participants = n(),
  std_dev_score = sd(test_score)
)
```

### The Power Combo: group_by() + summarize()

The real magic happens when you combine `group_by()` with `summarize()`:

```r
# Summarize by groups
data %>%
  group_by(region) %>%
  summarize(
    avg_income = mean(income),
    count = n(),
    max_age = max(age)
  )

# Multiple grouping variables
data %>%
  group_by(region, gender) %>%
  summarize(
    avg_score = mean(test_score),
    participants = n()
  )
```

> [!note] group_by() Changes Everything
> Once you use `group_by()`, all subsequent operations happen _within each group_ until you use `ungroup()`. It's like creating separate mini-datasets for each group.

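Because grouping silently changes what later verbs compute, a minimal sketch of the note above (again assuming the hypothetical practice tibble) shows a grouped `mutate()` working within each region until `ungroup()` is called:

```r
# Grouping persists: the mean below is computed within each region,
# not across the whole dataset
data %>%
  group_by(region) %>%
  mutate(
    region_avg_income = mean(income),                # per-region average
    income_vs_region  = income - region_avg_income   # each person vs. their region
  ) %>%
  ungroup()                                          # later steps see ungrouped data
```
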
---

## The Pipe Operator (%>%) - Connecting the Flow

**The game-changer**: The pipe operator lets you chain operations together in a readable, left-to-right flow.

### Without Pipes (Hard to Read)

```r
# Nested functions - read from inside out
result <- arrange(select(filter(data, age > 25), name, age, income), desc(income))
```

### With Pipes (Natural Flow)

```r
# Step-by-step flow - read from left to right
result <- data %>%
  filter(age > 25) %>%
  select(name, age, income) %>%
  arrange(desc(income))
```

**Think of the pipe as "then"**:

- Take the data, **then**
- Filter for age > 25, **then**
- Select name, age, and income, **then**
- Arrange by income in descending order

> [!tip] Reading Pipes Aloud
> The pipe `%>%` can be read as "then" or "and then". This makes your code sound like natural instructions: "Take the data, then filter it, then select these columns, then arrange by income."

---

## Real-World Application: Population Health Analysis

> [!example]- Scenario
> All five verbs working together in a realistic scenario:
>
> "What are the key health patterns and risk factors across different age groups and regions in our population health survey?"
>
> ```r
> # Analyze health outcomes from the latest population survey
> health_insights <- health_survey_data %>%
>   # Step 1: Focus on complete, recent survey responses
>   filter(survey_year == 2023 & !is.na(bmi) & !is.na(blood_pressure)) %>%
>
>   # Step 2: Select relevant health and demographic variables
>   select(patient_id, region, age, gender, bmi, systolic_bp,
>          diabetes_status, smoking_status, exercise_hours_weekly) %>%
>
>   # Step 3: Create new health classification variables
>   mutate(
>     bmi_category = case_when(
>       bmi < 18.5 ~ "Underweight",
>       bmi < 25 ~ "Normal",
>       bmi < 30 ~ "Overweight",
>       TRUE ~ "Obese"
>     ),
>     age_group = cut(age, breaks = c(0, 30, 50, 65, Inf),
>                     labels = c("Under 30", "30-49", "50-64", "65+")),
>     bp_category = case_when(
>       systolic_bp < 120 ~ "Normal",
>       systolic_bp < 140 ~ "Elevated",
>       TRUE ~ "Hypertensive"
>     ),
>     # Store risk as a factor so sorting places the highest-risk group first
>     risk_score = factor(
>       case_when(
>         diabetes_status == "Yes" & smoking_status == "Current" ~ "High Risk",
>         diabetes_status == "Yes" | smoking_status == "Current" ~ "Moderate Risk",
>         TRUE ~ "Low Risk"
>       ),
>       levels = c("High Risk", "Moderate Risk", "Low Risk")
>     )
>   ) %>%
>
>   # Step 4: Sort to identify highest-risk populations first
>   arrange(risk_score, region, age_group) %>%
>
>   # Step 5: Create summary health profiles by demographics
>   group_by(region, age_group, risk_score) %>%
>   summarize(
>     population_count = n(),
>     avg_bmi = round(mean(bmi), 1),
>     pct_hypertensive = round(mean(bp_category == "Hypertensive") * 100, 1),
>     avg_exercise_hours = round(mean(exercise_hours_weekly, na.rm = TRUE), 1),
>     .groups = "drop"  # Ungroup after summarizing
>   )
> ```

---

## Connecting the Concepts

The dplyr philosophy creates a coherent system where:

- Each verb has one clear purpose - no confusion about what it does
- Verbs combine naturally - use pipes to chain operations
- Code reads like instructions - filter this, then select that, then arrange by this
- Complex analysis becomes approachable - break big problems into small, clear steps

-- Reference - HMP 669