<sub>2025-05-29</sub> <sub>#data-visualization #data-management #r-programming #statistical-analysis #hmp669</sub>
<sup>[[maps-of-content|🌐 Maps of Content — All Notes]] </sup>
<sup>Series: [[hmp669|HMP 669 — Data Management and Visualization]]</sup>
<sup>Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]</sup>
# Creating and Mutating Variables in R: From Raw Data to Meaningful Insights
> [!abstract]- Overview
>
> Transform existing data into new, meaningful variables that reveal patterns and support analysis---turning numbers into insights that matter for health research.
>
> **Key Concepts**: Three essential transformations
>
> - **Categorization**: Converting continuous measurements into clinically meaningful groups
> - **Combination**: Merging multiple variables into composite scores or indices
> - **Integration**: Seamlessly adding new variables to existing datasets
>
> **Critical Connections**:
>
> - Conditional logic drives categorization (if-then thinking)
> - The `mutate()` function serves as your primary workshop for variable creation
> - Logic checking ensures your transformations work as intended
>
> **Must Remember**:
>
> - Always verify expectations before and after creating variables
> - Use `case_when()` for complex conditions, `ifelse()` for simple ones
> - New variables should enhance, not replace, your understanding of the data

> [!info]- Required R Packages
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`dplyr`|Data manipulation and mutate() function|`install.packages("tidyverse")`|
> |`ggplot2`|Advanced grouping functions (cut_interval, cut_number)|`install.packages("tidyverse")`|
>
> **Setup code:**
> ```r
> library(tidyverse) # Includes both dplyr and ggplot2
> # OR separately:
> # library(dplyr)
> # library(ggplot2)
> ```
>
> **Note**: Basic functions (`ifelse`, `relevel`, `table`, `summary`) are built into base R.

> [!code]- Syntax Reference
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Conditional Logic**|||
> |`ifelse(condition, true, false)`|Simple conditional assignment|`ifelse(score >= 70, "pass", "fail")`|
> |`case_when(cond1 ~ result1, cond2 ~ result2)`|Multiple conditional assignment|`case_when(score >= 90 ~ "A", score >= 80 ~ "B")`|
> |**Categorization**|||
> |`cut_interval(x, n = 5)`|Split into equal-range groups|`cut_interval(age, n = 5)`|
> |`cut_number(x, n = 5)`|Split into equal-size groups|`cut_number(age, n = 5)`|
> |`relevel(factor, ref = "level")`|Set reference level|`relevel(risk_cat, ref = "Low")`|
> |**Variable Creation**|||
> |`mutate(new_var = expression)`|Add single new variable|`mutate(bmi = weight/height^2)`|
> |`mutate(var1 = expr1, var2 = expr2)`|Add multiple variables|`mutate(bmi = weight/height^2, age_group = cut_interval(age, 5))`|
> |**Arithmetic**|||
> |`+`, `-`, `*`, `/`, `^`|Basic math operations|`total_score = test1 + test2 + test3`|
> |**Checking**|||
> |`table(variable)`|Count categories|`table(bp_category)`|
> |`summary(variable)`|Descriptive statistics|`summary(risk_score)`|
---
## The Why: From Numbers to Knowledge
Imagine you're analyzing health data with a blood pressure reading of 145/90. That number tells you something, but what you really need to know is: _"Does this person have hypertension?"_ This is where variable creation becomes powerful---**you're not just manipulating data, you're translating measurements into meaningful categories that inform decisions**.
In health research, this transformation happens constantly:
- Converting test scores into pass/fail categories
- Combining multiple cognitive assessments into a total score
- Grouping ages into clinically relevant ranges
- Creating risk categories from continuous biomarkers
> [!note] The Bridge Concept
> Think of variable creation as building bridges between raw measurements and actionable insights. You're the architect deciding where those bridges should connect.
## Foundation: Conditional Logic
**The core principle**: Every new categorical variable starts with a question---_"Under what conditions should this participant be classified this way?"_
### Simple Conditions: The `ifelse()` Function
```r
# Basic structure: ifelse(condition, if_true, if_false)
pass_status <- ifelse(score >= 70, "pass", "fail")
```
**How it works**:
1. **Condition**: `score >= 70` (the test you're applying)
2. **If True**: `"pass"` (what happens when condition is met)
3. **If False**: `"fail"` (what happens when condition is not met)
> [!tip] Logic Checking Strategy
> Before running your code, estimate: _"How many participants do I expect in each category?"_ Then verify your results match your expectations.
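
As a minimal sketch of that check, the example below applies `ifelse()` to a small made-up `score` vector (hypothetical data, not from the course dataset) and compares the counts against an expectation written down beforehand. It also shows that a missing score stays `NA` rather than being labeled "fail".

```r
# Hypothetical test scores; the NA stands for a participant with no recorded score
score <- c(55, 70, 82, 69.5, NA, 91)

# Expectation before running: 3 scores are >= 70, 2 are below, 1 is missing
pass_status <- ifelse(score >= 70, "pass", "fail")

# useNA = "ifany" makes the missing value visible in the count
status_counts <- table(pass_status, useNA = "ifany")
print(status_counts)
```

If the counts differ from the expectation, the condition (or the data) deserves a second look before moving on.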
### Complex Conditions: The `case_when()` Function
When you need multiple categories, `case_when()` becomes your powerful ally:
```r
grade <- case_when(
  score >= 90 ~ "A",
  score >= 80 ~ "B",
  score >= 70 ~ "C",
  score >= 60 ~ "D",
  TRUE ~ "F" # Default for everything else
)
```
**The pattern**: `condition ~ result`
- The tilde (`~`) connects your condition to its outcome
- Conditions are evaluated in order (first match wins)
- `TRUE ~` acts as your safety net for any unmatched cases
> [!warning] Order Matters
> `case_when()` checks conditions sequentially. Put more specific conditions first, broader ones later.
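
The sketch below illustrates why order matters, using a hypothetical `score` vector. It also shows one way to keep missing scores out of the `TRUE ~` fallback; the explicit `is.na()` rule is an assumption about how you would want missing data handled, not a requirement of `case_when()`.

```r
library(dplyr)

# Hypothetical scores, including one missing value
score <- c(95, 85, 72, NA)

# Broad condition first: every score >= 70 matches the first rule,
# so "A" and "B" are never assigned
wrong_order <- case_when(
  score >= 70 ~ "C",
  score >= 80 ~ "B",
  score >= 90 ~ "A",
  TRUE ~ "F"
)

# Most restrictive condition first; missing scores are handled explicitly
# so the TRUE ~ fallback does not silently label them "F"
right_order <- case_when(
  is.na(score) ~ NA_character_,
  score >= 90 ~ "A",
  score >= 80 ~ "B",
  score >= 70 ~ "C",
  TRUE ~ "F"
)

print(wrong_order)  # "C" "C" "C" "F"
print(right_order)  # "A" "B" "C" NA
```

Note that in `wrong_order` the missing score ends up as "F": a comparison with `NA` is never `TRUE`, so the row falls through to the fallback.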
## Advanced Categorization: Automated Grouping
### Equal-Range Groups: `cut_interval()`
Sometimes you want to divide data into groups with equal spacing:
```r
age_groups <- cut_interval(age, n = 5) # 5 groups of equal year ranges
```
### Equal-Size Groups: `cut_number()`
For approximately equal numbers in each group:
```r
age_quintiles <- cut_number(age, n = 5) # 5 groups of roughly equal size
```
> [!note] When to Use Which
>
> - **Equal-range**: When the spacing matters (e.g., age decades)
> - **Equal-size**: When you want balanced groups for statistical analysis
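
To make the difference concrete, here is a small sketch on a hypothetical, right-skewed `age` vector (many younger participants, a few older ones). The data are invented for illustration only.

```r
library(ggplot2)  # provides cut_interval() and cut_number()

# Hypothetical ages: skewed toward younger participants
age <- c(18, 19, 20, 21, 22, 23, 25, 27, 30, 34, 41, 52, 60, 75, 88)

equal_range <- cut_interval(age, n = 5)  # 5 bins of equal width
equal_size  <- cut_number(age, n = 5)    # 5 bins with roughly equal counts

table(equal_range)  # counts differ widely across bins, and one bin is empty
table(equal_size)   # every bin holds the same number of participants
```

With skewed data, equal-range bins can leave some groups nearly empty, while equal-size bins produce cutpoints that are harder to describe in plain terms. Which trade-off is acceptable depends on the analysis.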
## Combining Variables: Building Composite Scores
Health research often requires combining multiple measurements:
```r
# Body Mass Index calculation
bmi <- weight_kg / (height_m^2)
# Total cognitive score
total_score <- executive_score + memory_score + attention_score
```
**The arithmetic operators** become your tools:
- `+` and `-` for addition and subtraction
- `*` and `/` for multiplication and division
- `^` for exponentiation
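
A short sketch of these operators on hypothetical measurements for three participants. It also shows a detail worth knowing before building composite scores: arithmetic propagates missing values, so one missing component makes the whole total `NA`.

```r
# Hypothetical measurements for three participants
weight_kg <- c(70, 85, 60)
height_m  <- c(1.75, 1.80, 1.60)

# ^ is evaluated before /, so the parentheses are optional but aid readability
bmi <- weight_kg / (height_m^2)
round(bmi, 1)  # 22.9 26.2 23.4

# One missing component makes the total missing for that participant
executive_score <- c(10, 12, NA)
memory_score    <- c(8, 9, 7)
attention_score <- c(6, 7, 5)
total_score <- executive_score + memory_score + attention_score
total_score  # 24 28 NA
```

Whether a participant with one missing subscale should get a missing total or a partial total is an analytic decision; the default behavior shown here is the conservative one.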
## Integration: The `mutate()` Workshop
`mutate()` is where individual variable creations become part of your dataset. Think of it as your workshop where you craft new variables while keeping all your existing ones safe.
### Single Variable Creation
```r
enhanced_data <- original_data %>%
  mutate(
    hypertension = ifelse(systolic_bp >= 140 | diastolic_bp >= 90,
                          "Yes", "No")
  )
```
### Multiple Variables at Once
The real power emerges when you create several variables simultaneously:
```r
enhanced_data <- original_data %>%
  mutate(
    bmi = weight_kg / (height_m^2),
    age_group = cut_interval(age, n = 5),
    test_result = case_when(
      score >= 90 ~ "Excellent",
      score >= 80 ~ "Good",
      score >= 70 ~ "Satisfactory",
      score >= 60 ~ "Needs Improvement",
      TRUE ~ "Unsatisfactory"
    ),
    risk_category = case_when(
      bmi >= 30 ~ "High Risk",
      bmi >= 25 ~ "Moderate Risk",
      TRUE ~ "Low Risk"
    )
  )
```
> [!tip] The Beauty of Chaining
> Within a single `mutate()` call, arguments are evaluated in order, so a new variable can use any variable created earlier in the same call. Here `risk_category` can reference `bmi` because `bmi` is defined above it; reversing the two would fail.
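
The following self-contained sketch demonstrates this on a tiny hypothetical `original_data` tibble (three invented participants). It also confirms that `mutate()` returns a new data frame and leaves the original untouched.

```r
library(dplyr)

# Hypothetical participants
original_data <- tibble(
  id        = 1:3,
  weight_kg = c(60, 80, 100),
  height_m  = c(1.70, 1.75, 1.80)
)

enhanced_data <- original_data %>%
  mutate(
    bmi = weight_kg / (height_m^2),  # defined first
    risk_category = case_when(       # can use bmi because it appears above
      bmi >= 30 ~ "High Risk",
      bmi >= 25 ~ "Moderate Risk",
      TRUE ~ "Low Risk"
    )
  )

ncol(original_data)  # still 3: the original data frame is unchanged
enhanced_data        # 5 columns: the originals plus bmi and risk_category
```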
## Working with Factor Levels
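`ifelse()` and `case_when()` return character vectors, not factors. When a character variable is converted to a factor (explicitly with `factor()`, or implicitly by a modeling function), R sorts the levels alphabetically, and the first level becomes the **reference level**: the baseline against which all others are compared. Convert the variable with `factor()`, then use `relevel()` to set a clinically meaningful reference category.
```r
# relevel() only works on factors, so convert the character labels first
risk_category <- factor(risk_category)
# Set "Low Risk" as the reference level
risk_category <- relevel(risk_category, ref = "Low Risk")
```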
This becomes crucial for statistical analysis and visualization---your reference level affects how results are interpreted and how graphs are ordered.
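
A minimal sketch of how level order behaves, using a hypothetical character vector of risk labels like the ones `case_when()` would produce. The `risk_factor` and `risk_ordered` names are illustrative only.

```r
# Hypothetical risk labels as plain character values
risk_category <- c("Low Risk", "High Risk", "Moderate Risk", "Low Risk")

# factor() sorts levels alphabetically, so "High Risk" becomes the reference
risk_factor <- factor(risk_category)
levels(risk_factor)  # "High Risk" "Low Risk" "Moderate Risk"

# relevel() moves the chosen level to the front and keeps the rest in order
risk_factor <- relevel(risk_factor, ref = "Low Risk")
levels(risk_factor)  # "Low Risk" "High Risk" "Moderate Risk"

# To control the full order (useful for plot axes), list every level explicitly
risk_ordered <- factor(risk_category,
                       levels = c("Low Risk", "Moderate Risk", "High Risk"))
levels(risk_ordered)  # "Low Risk" "Moderate Risk" "High Risk"
```

`relevel()` is enough when only the baseline matters, as in regression output. For graphs, where the whole sequence is visible, spelling out all levels usually gives a more sensible ordering.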
## Real-World Application: Creating a Clinical Dashboard
> [!example]- Scenario
> ```r
> # Starting with raw health data
> clinical_data <- health_survey %>%
>   mutate(
>     # Blood pressure categories (AHA guidelines)
>     bp_category = case_when(
>       systolic_bp > 180 | diastolic_bp > 120 ~ "Hypertensive Crisis",
>       systolic_bp >= 140 | diastolic_bp >= 90 ~ "Hypertension Stage 2",
>       systolic_bp >= 130 | diastolic_bp >= 80 ~ "Hypertension Stage 1",
>       systolic_bp >= 120 & diastolic_bp < 80 ~ "Elevated",
>       systolic_bp < 120 & diastolic_bp < 80 ~ "Normal"
>       # No TRUE ~ fallback: incomplete readings stay NA instead of being mislabeled
>     ),
>
>     # BMI calculation and categories
>     bmi = weight_kg / (height_m^2),
>     weight_status = case_when(
>       bmi >= 30 ~ "Obese",
>       bmi >= 25 ~ "Overweight",
>       bmi >= 18.5 ~ "Normal Weight",
>       bmi < 18.5 ~ "Underweight"  # a missing BMI stays NA
>     ),
>
>     # Age grouping for analysis (length = bin width in years)
>     age_decade = cut_interval(age, length = 10),
>
>     # Composite risk score (simplified example)
>     risk_score = (systolic_bp / 140) + (bmi / 25) + (age / 100),
>
>     # Overall risk classification
>     overall_risk = case_when(
>       risk_score >= 2.5 ~ "High Risk",
>       risk_score >= 1.5 ~ "Moderate Risk",
>       risk_score < 1.5 ~ "Low Risk"  # a missing score stays NA
>     )
>   )
> ```
## The Logic Checking Process
**Before creating variables**:
- What categories do I expect?
- How many participants should fall into each?
- Do my cutpoints make clinical sense?
**After creating variables**:
```r
# Check your work
table(clinical_data$bp_category)
summary(clinical_data$risk_score)
```
**Questions to ask**:
- Do the numbers match my expectations?
- Are there any missing or unexpected categories?
- Do the distributions make sense given my data?
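
One possible way to answer these questions in code, sketched on a hypothetical five-row `health_survey` tibble with one missing systolic reading. The 140/90 cutoff repeats the `hypertension` example from earlier in this note; the data and the `checked` name are invented for illustration.

```r
library(dplyr)

# Hypothetical survey rows, including one missing systolic reading
health_survey <- tibble(
  systolic_bp  = c(118, 135, 150, NA, 125),
  diastolic_bp = c(75, 85, 95, 70, 78)
)

checked <- health_survey %>%
  mutate(
    hypertension = ifelse(systolic_bp >= 140 | diastolic_bp >= 90, "Yes", "No")
  )

# count() reports NA as its own row, so unclassified participants are visible
checked %>% count(hypertension)

# Spot-check the cutpoint: the systolic range inside each category
checked %>%
  group_by(hypertension) %>%
  summarise(min_sbp = min(systolic_bp), max_sbp = max(systolic_bp))
```

Here the count shows three "No", one "Yes", and one `NA`. The `NA` row is the participant with a missing systolic reading, which is exactly the kind of unexpected category the checklist is meant to surface.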
## Connecting It All Together
Variable creation in R follows a logical progression:
1. **Identify the need**: What insight do you want to extract?
2. **Define conditions**: What rules will guide the categorization?
3. **Choose your tool**: Simple (`ifelse`) or complex (`case_when`) conditions?
4. **Integrate thoughtfully**: Use `mutate()` to preserve your original data
5. **Verify relentlessly**: Check that your transformations work as intended