creating-mutating-variables-r - Em Royce & Company

2025-05-29 #data-visualization #data-management #r-programming #statistical-analysis #hmp669

[[maps-of-content|🌐 Maps of Content — All Notes]]
Series: [[hmp669|HMP 669 — Data Management and Visualization]]
Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]

# Creating and Mutating Variables in R: From Raw Data to Meaningful Insights

> [!abstract]- Overview
>
> Transform existing data into new, meaningful variables that reveal patterns and support analysis, turning numbers into insights that matter for health research.
>
> **Key Concepts**: Three essential transformations
>
> - **Categorization**: Converting continuous measurements into clinically meaningful groups
> - **Combination**: Merging multiple variables into composite scores or indices
> - **Integration**: Seamlessly adding new variables to existing datasets
>
> **Critical Connections**:
>
> - Conditional logic drives categorization (if-then thinking)
> - The `mutate()` function serves as your primary workshop for variable creation
> - Logic checking ensures your transformations work as intended
>
> **Must Remember**:
>
> - Always verify expectations before and after creating variables
> - Use `case_when()` for complex conditions, `ifelse()` for simple ones
> - New variables should enhance, not replace, your understanding of the data

> [!info]- Required R Packages
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`dplyr`|Data manipulation and `mutate()` function|`install.packages("tidyverse")`|
> |`ggplot2`|Advanced grouping functions (`cut_interval`, `cut_number`)|`install.packages("tidyverse")`|
>
> **Setup code:**
> ```r
> library(tidyverse) # Includes both dplyr and ggplot2
> # OR separately:
> # library(dplyr)
> # library(ggplot2)
> ```
>
> **Note**: Basic functions (`ifelse`, `relevel`, `table`, `summary`) are built into base R.

> [!code]- Syntax Reference
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Conditional Logic**|||
> |`ifelse(condition, true, false)`|Simple conditional assignment|`ifelse(score >= 70, "pass", "fail")`|
> |`case_when(cond1 ~ result1, cond2 ~ result2)`|Multiple conditional assignment|`case_when(score >= 90 ~ "A", score >= 80 ~ "B")`|
> |**Categorization**|||
> |`cut_interval(x, n = 5)`|Split into equal-range groups|`cut_interval(age, n = 5)`|
> |`cut_number(x, n = 5)`|Split into equal-size groups|`cut_number(age, n = 5)`|
> |`relevel(factor, ref = "level")`|Set reference level|`relevel(risk_cat, ref = "Low")`|
> |**Variable Creation**|||
> |`mutate(new_var = expression)`|Add single new variable|`mutate(bmi = weight/height^2)`|
> |`mutate(var1 = expr1, var2 = expr2)`|Add multiple variables|`mutate(bmi = weight/height^2, age_group = cut_interval(age, 5))`|
> |**Arithmetic**|||
> |`+`, `-`, `*`, `/`, `^`|Basic math operations|`total_score = test1 + test2 + test3`|
> |**Checking**|||
> |`table(variable)`|Count categories|`table(bp_category)`|
> |`summary(variable)`|Descriptive statistics|`summary(risk_score)`|

---

## The Why: From Numbers to Knowledge

Imagine you're analyzing health data with a blood pressure reading of 145/90. That number tells you something, but what you really need to know is: _"Does this person have hypertension?"_ This is where variable creation becomes powerful: **you're not just manipulating data, you're translating measurements into meaningful categories that inform decisions**.

In health research, this transformation happens constantly:

- Converting test scores into pass/fail categories
- Combining multiple cognitive assessments into a total score
- Grouping ages into clinically relevant ranges
- Creating risk categories from continuous biomarkers

> [!note] The Bridge Concept
> Think of variable creation as building bridges between raw measurements and actionable insights.
> You're the architect deciding where those bridges should connect.

## Foundation: Conditional Logic

**The core principle**: Every new categorical variable starts with a question: _"Under what conditions should this participant be classified this way?"_

### Simple Conditions: The `ifelse()` Function

```r
# Basic structure: ifelse(condition, if_true, if_false)
pass_status <- ifelse(score >= 70, "pass", "fail")
```

**How it works**:

1. **Condition**: `score >= 70` (the test you're applying)
2. **If True**: `"pass"` (what happens when condition is met)
3. **If False**: `"fail"` (what happens when condition is not met)

> [!tip] Logic Checking Strategy
> Before running your code, estimate: _"How many participants do I expect in each category?"_ Then verify your results match your expectations.

### Complex Conditions: The `case_when()` Function

When you need multiple categories, `case_when()` becomes your powerful ally:

```r
grade <- case_when(
  score >= 90 ~ "A",
  score >= 80 ~ "B",
  score >= 70 ~ "C",
  score >= 60 ~ "D",
  TRUE ~ "F" # Default for everything else
)
```

**The pattern**: `condition ~ result`

- The tilde (`~`) connects your condition to its outcome
- Conditions are evaluated in order (first match wins)
- `TRUE ~` acts as your safety net for any unmatched cases

> [!warning] Order Matters
> `case_when()` checks conditions sequentially. Put more specific conditions first, broader ones later.
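To make the ordering warning concrete, here is a minimal sketch that runs the same grading rules in two different orders. The `score` values and the names `broad_first` and `narrow_first` are invented for illustration and are not part of the course data.

```r
library(dplyr)

# Hypothetical scores, invented for illustration
score <- c(95, 85, 72, 40)

# Broadest condition first: every score of 60 or more stops at the first rule
broad_first <- case_when(
  score >= 60 ~ "D",
  score >= 70 ~ "C",
  score >= 80 ~ "B",
  score >= 90 ~ "A",
  TRUE ~ "F"
)

# Most specific condition first: each score reaches the rule meant for it
narrow_first <- case_when(
  score >= 90 ~ "A",
  score >= 80 ~ "B",
  score >= 70 ~ "C",
  score >= 60 ~ "D",
  TRUE ~ "F"
)

print(broad_first)   # "D" "D" "D" "F"
print(narrow_first)  # "A" "B" "C" "F"
```

Both calls run without error, which is why a quick `table()` of the result is the only way to notice the first version is wrong.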
## Advanced Categorization: Automated Grouping

### Equal-Range Groups: `cut_interval()`

Sometimes you want to divide data into groups with equal spacing:

```r
age_groups <- cut_interval(age, n = 5) # 5 groups of equal year ranges
```

### Equal-Size Groups: Quantile-Based Splitting with `cut_number()`

For approximately equal numbers in each group:

```r
age_quintiles <- cut_number(age, n = 5) # 5 groups of roughly equal size
```

> [!note] When to Use Which
>
> - **Equal-range**: When the spacing matters (e.g., age decades)
> - **Equal-size**: When you want balanced groups for statistical analysis

## Combining Variables: Building Composite Scores

Health research often requires combining multiple measurements:

```r
# Body Mass Index calculation
bmi <- weight_kg / (height_m^2)

# Total cognitive score
total_score <- executive_score + memory_score + attention_score
```

**The arithmetic operators** become your tools:

- `+` and `-` for addition and subtraction
- `*` and `/` for multiplication and division
- `^` for exponentiation

## Integration: The `mutate()` Workshop

`mutate()` is where individual variable creations become part of your dataset. Think of it as your workshop where you craft new variables while keeping all your existing ones safe.
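As a small sketch of what "keeping your existing ones safe" means in practice, the example below adds a BMI column and then compares column names before and after. The three-row data frame is invented for illustration.

```r
library(dplyr)

# Hypothetical data, invented for illustration
original_data <- data.frame(
  id = 1:3,
  weight_kg = c(60, 80, 100),
  height_m = c(1.60, 1.80, 2.00)
)

# mutate() returns a new data frame; original_data itself is not modified
enhanced_data <- original_data %>%
  mutate(bmi = weight_kg / height_m^2)

names(original_data)  # id, weight_kg, height_m
names(enhanced_data)  # id, weight_kg, height_m, bmi
```

Assigning the result to a new name, as here, keeps the untouched original available for comparison during logic checks.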
### Single Variable Creation

```r
enhanced_data <- original_data %>%
  mutate(
    hypertension = ifelse(systolic_bp >= 140 | diastolic_bp >= 90, "Yes", "No")
  )
```

### Multiple Variables at Once

The real power emerges when you create several variables simultaneously:

```r
enhanced_data <- original_data %>%
  mutate(
    bmi = weight_kg / (height_m^2),
    age_group = cut_interval(age, n = 5),
    test_result = case_when(
      score >= 90 ~ "Excellent",
      score >= 80 ~ "Good",
      score >= 70 ~ "Satisfactory",
      score >= 60 ~ "Needs Improvement",
      TRUE ~ "Unsatisfactory"
    ),
    risk_category = case_when(
      bmi >= 30 ~ "High Risk",
      bmi >= 25 ~ "Moderate Risk",
      TRUE ~ "Low Risk"
    )
  )
```

> [!tip] The Beauty of Chaining
> Notice how new variables can immediately use other newly created variables within the same `mutate()` call. You can reference `bmi` in `risk_category` even though both are being created in the same operation.

## Working with Factor Levels

`ifelse()` and `case_when()` return character vectors when their results are strings, so convert a categorical variable with `factor()` before treating it as a factor (`cut_interval()` and `cut_number()` already return factors). By default, `factor()` orders levels alphabetically, and **the first level becomes the reference level**: the baseline against which all others are compared. Use `relevel()` to set a clinically meaningful reference category.

```r
# relevel() requires a factor, so convert first,
# then set "Low Risk" as the reference level
risk_category <- relevel(factor(risk_category), ref = "Low Risk")
```

This becomes crucial for statistical analysis and visualization: your reference level affects how results are interpreted and how graphs are ordered.
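The sketch below shows how level order changes at each step, using only base R. The risk labels are invented for illustration, and spelling out the full order with `factor(levels = ...)` is offered as one option for controlling plot order, not something the course notes prescribe.

```r
# Hypothetical risk labels, invented for illustration
risk_category <- c("High Risk", "Low Risk", "Moderate Risk", "Low Risk")

# factor() sorts levels alphabetically unless told otherwise
risk_factor <- factor(risk_category)
default_levels <- levels(risk_factor)

# relevel() moves one level to the front and leaves the rest in place
releveled_levels <- levels(relevel(risk_factor, ref = "Low Risk"))

# Spelling out every level controls the full order, e.g. for plot axes
ordered_levels <- levels(
  factor(risk_category, levels = c("Low Risk", "Moderate Risk", "High Risk"))
)

print(default_levels)    # "High Risk" "Low Risk" "Moderate Risk"
print(releveled_levels)  # "Low Risk" "High Risk" "Moderate Risk"
print(ordered_levels)    # "Low Risk" "Moderate Risk" "High Risk"
```

Note that `relevel()` only changes the reference level; the remaining levels keep their alphabetical order.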
## Real-World Application: Creating a Clinical Dashboard

> [!example]- Scenario
> ```r
> # Starting with raw health data
> clinical_data <- health_survey %>%
>   mutate(
>     # Blood pressure categories (AHA guidelines)
>     bp_category = case_when(
>       systolic_bp > 180 | diastolic_bp > 120 ~ "Hypertensive Crisis",
>       systolic_bp >= 140 | diastolic_bp >= 90 ~ "Hypertension Stage 2",
>       systolic_bp >= 130 | diastolic_bp >= 80 ~ "Hypertension Stage 1",
>       systolic_bp >= 120 & diastolic_bp < 80 ~ "Elevated",
>       systolic_bp < 120 & diastolic_bp < 80 ~ "Normal"
>       # No TRUE fallback: missing readings stay NA instead of being mislabeled
>     ),
>
>     # BMI calculation and categories
>     bmi = weight_kg / (height_m^2),
>     weight_status = case_when(
>       bmi >= 30 ~ "Obese",
>       bmi >= 25 ~ "Overweight",
>       bmi >= 18.5 ~ "Normal Weight",
>       bmi < 18.5 ~ "Underweight" # Explicit test so a missing BMI stays NA
>     ),
>
>     # Age grouping for analysis: cut_width() takes a bin width
>     # (cut_interval() accepts n or length, not width)
>     age_decade = cut_width(age, width = 10, boundary = 0, closed = "left"),
>
>     # Composite risk score (simplified example)
>     risk_score = (systolic_bp/140) + (bmi/25) + (age/100),
>
>     # Overall risk classification
>     overall_risk = case_when(
>       risk_score >= 2.5 ~ "High Risk",
>       risk_score >= 1.5 ~ "Moderate Risk",
>       risk_score < 1.5 ~ "Low Risk"
>     )
>   )
> ```

## The Logic Checking Process

**Before creating variables**:

- What categories do I expect?
- How many participants should fall into each?
- Do my cutpoints make clinical sense?

**After creating variables**:

```r
# Check your work (useNA = "ifany" also counts missing values,
# which table() drops by default)
table(clinical_data$bp_category, useNA = "ifany")
summary(clinical_data$risk_score)
```

**Questions to ask**:

- Do the numbers match my expectations?
- Are there any missing or unexpected categories?
- Do the distributions make sense given my data?

## Connecting It All Together

Variable creation in R follows a logical progression:

1. **Identify the need**: What insight do you want to extract?
2. **Define conditions**: What rules will guide the categorization?
3. **Choose your tool**: Simple (`ifelse`) or complex (`case_when`) conditions?
4. **Integrate thoughtfully**: Use `mutate()` to preserve your original data
5. **Verify relentlessly**: Check that your transformations work as intended
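
The final verification step can be sketched on a small invented data frame. The names `check_data` and `category_ranges` are hypothetical, and the grouped summary is one possible way to confirm cutpoints, assuming `dplyr` is loaded.

```r
library(dplyr)

# Hypothetical BMI values, invented for illustration (one is missing)
check_data <- data.frame(bmi = c(17.0, 22.0, 27.5, 31.0, NA)) %>%
  mutate(
    weight_status = case_when(
      bmi >= 30 ~ "Obese",
      bmi >= 25 ~ "Overweight",
      bmi >= 18.5 ~ "Normal Weight",
      bmi < 18.5 ~ "Underweight"
    )
  )

# Count every category, including missing values
print(table(check_data$weight_status, useNA = "ifany"))

# Compare each category against the source values it was built from:
# min and max should sit inside the intended cutpoints
category_ranges <- check_data %>%
  group_by(weight_status) %>%
  summarise(n = n(), min_bmi = min(bmi), max_bmi = max(bmi))

print(category_ranges)
```

If a category's minimum or maximum falls outside its intended range, or the missing count differs from the number of missing source values, the conditions need another look.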