plot-types-ggplot2-r - Em Royce & Company

2025-05-30 #data-visualization #data-management #r-programming #statistical-analysis #hmp669

[[maps-of-content|🌐 Maps of Content — All Notes]]
Series: [[hmp669|HMP 669 — Data Management and Visualization]]
Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]

# Plot types with ggplot2 in R

> [!abstract]- Overview
>
> Choose the right visualization by matching your data types and variables to proven plot patterns, creating clear stories from complex datasets.
>
> **Key Concepts**:
>
> - **Data Type Foundation**: Categorical vs numeric variables determine plot families
> - **Variable Count Strategy**: One vs two variables opens different visualization paths
> - **Implementation Toolkit**: ggplot2 provides systematic functions for each plot type
>
> **Critical Connections**:
>
> - Data structure → Plot choice → R function → Visual insight
> - Each plot type reveals different aspects of your data's story
> - Layering techniques (like jitter) enhance basic plots with additional information
>
> **Must Remember**:
>
> - Start with data type and count, not visual preferences
> - Every plot type has a specific strength: match the tool to the question
> - ggplot2's consistent grammar makes switching between plots seamless

> [!info]- Required R Packages
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`ggplot2`|Core plotting system|`install.packages("ggplot2")` or `install.packages("tidyverse")`|
> |`dplyr`|Data manipulation (optional but helpful)|`install.packages("tidyverse")`|
>
> **Setup code:**
>
> ```r
> library(ggplot2) # or library(tidyverse) for both
> library(dplyr)   # for data manipulation
> ```

> [!code]- Syntax Reference
>
> |Plot Type|Data Structure|ggplot2 Function|Key Options|
> |---|---|---|---|
> |**Categorical Data**||||
> |Bar Chart|1 categorical|`geom_bar()`|`fill`, `color`|
> |**Single Numeric Variable**||||
> |Histogram|1 numeric|`geom_histogram()`|`binwidth`, `fill`, `color`|
> |Density Plot|1 numeric|`geom_density()`|`linewidth`, `color`, `fill`|
> |**Two Numeric Variables**||||
> |Scatterplot|2 numeric|`geom_point()`|`alpha`, `size`, `color`|
> |Line Plot|2 numeric|`geom_smooth()`|`method`, `se`, `color`|
> |**Numeric + Categorical**||||
> |Boxplot|1 numeric + 1 categorical|`geom_boxplot()`|`fill`, `color`|
> |Violin Plot|1 numeric + 1 categorical|`geom_violin()`|`fill`, `draw_quantiles`|
> |**Enhancement Layer**||||
> |Data Points|Any categorical axis|`geom_jitter()`|`width`, `alpha`, `size`|

## The Decision Framework: From Data to Visualization

**Think of data visualization like choosing the right lens for a camera**: each plot type reveals different aspects of your data's story. The key is starting with your data's fundamental characteristics, not your aesthetic preferences.

### The Central Question Tree

Plot type based on data type & variable number

![[plot-types-r-1748592749251.webp]]

```plaintext
What's your primary variable type?
├── Categorical → Bar Chart
└── Numeric
    ├── One variable only?
    │   ├── Distribution shape → Histogram
    │   └── Smooth distribution → Density Plot
    └── Two variables?
        ├── Both numeric
        │   ├── Relationship → Scatterplot
        │   └── Trend → Line Plot
        └── One categorical
            ├── Distribution comparison → Boxplot
            └── Shape comparison → Violin Plot
```

## Categorical Data, Barplot: `geom_bar()`

**When to use**: You have one categorical variable and want to show frequency or counts.

**What it reveals**: How many observations fall into each category, the fundamental building blocks of your dataset.

```r
# Basic bar chart structure
ggplot(data = nhanes, aes(y = Education, fill = Education)) +
  geom_bar() +
  labs(y = "Education Level", x = "Count")
```

![[plot-types-r-1748593054964.webp]]

**Key insight**: Bar charts excel at showing relative sizes between groups. The human eye naturally compares bar lengths, making differences immediately apparent.
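`geom_bar()` does the counting itself: it tallies the rows in each category before drawing. The sketch below makes that step visible by comparing it with `geom_col()` on a table that has already been counted. It uses the `mpg` dataset that ships with ggplot2 as a stand-in, because the course's `nhanes` data frame is not reproduced in these notes. `geom_col()` and the `class_counts` name are additions for illustration only.

```r
library(ggplot2)

# Stand-in data (assumption): `mpg` ships with ggplot2; `class` is categorical.

# Option 1: hand geom_bar() the raw rows and let it count them.
p_raw <- ggplot(data = mpg, aes(y = class)) +
  geom_bar() +
  labs(y = "Vehicle class", x = "Count")

# Option 2: count first, then draw the precomputed totals with geom_col().
class_counts <- as.data.frame(table(mpg$class))
names(class_counts) <- c("class", "n")

p_counted <- ggplot(data = class_counts, aes(y = class, x = n)) +
  geom_col() +
  labs(y = "Vehicle class", x = "Count")
```

Both plots should show the same bars. Reach for `geom_col()` when the data arrive as a summary table rather than one row per observation.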
## Single Numeric Variables: Understanding Distributions

### Histograms — The Shape of Your Data: `geom_histogram()`

**When to use**: You want to see the actual distribution shape of a numeric variable.

**What it reveals**: Where your data clusters, whether it's roughly symmetric or skewed, and where outliers might hide.

```r
# Histogram with customization
ggplot(data = nhanes, aes(x = RedBloodCells)) +
  geom_histogram(binwidth = 0.1, fill = "lightblue", color = "darkblue") +
  labs(x = "Red Blood Cell Count", y = "Frequency")
```

![[plot-types-r-1748593130011.webp]]

> [!note] Binwidth Matters
> The `binwidth` parameter sets the width of each bin in the units of the x variable. Narrow bins show fine detail but might be noisy; wide bins show general patterns but miss nuances.

### Density Plots — The Smooth Story: `geom_density()`

**When to use**: You want to see distribution patterns without the "choppiness" of histogram bins.

**What it reveals**: An estimate of the underlying probability distribution of your data. Imagine a smooth curve drawn over your histogram.

```r
# Smooth density visualization
ggplot(data = nhanes, aes(x = RedBloodCells)) +
  geom_density(linewidth = 1.2, color = "darkred") +
  labs(x = "Red Blood Cell Count", y = "Density")
```

![[plot-types-r-1748593260441.webp]]

## Two Numeric Variables: Exploring Relationships

### Scatterplots — The Relationship Detector: `geom_point()`

**When to use**: You want to see how two numeric variables relate to each other.

**What it reveals**: Correlation patterns, outliers, and the strength of relationships between variables.

```r
# Scatterplot with transparency for overlapping points
ggplot(data = nhanes, aes(x = Age, y = log10(BloodLead))) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(x = "Age (years)", y = "Blood Lead Level (log10)")
```

![[plot-types-r-1748593363506.webp]]

> [!tip] Alpha Transparency
> Use `alpha` values between 0.3 and 0.7 when you have many overlapping points. This reveals density patterns that solid points would hide.
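To see what the alpha tip buys you, the sketch below draws the same scatterplot twice, once opaque and once semi-transparent. It assumes the `diamonds` dataset bundled with ggplot2 (roughly 54,000 rows) as a stand-in for `nhanes`. The variables were chosen only because they overplot heavily.

```r
library(ggplot2)

# Stand-in data (assumption): `diamonds` ships with ggplot2.
base <- ggplot(data = diamonds, aes(x = carat, y = log10(price))) +
  labs(x = "Carat", y = "Price (log10 USD)")

# Opaque points: dense regions collapse into a solid blob.
p_opaque <- base + geom_point(size = 1.5)

# Semi-transparent points: darker areas now indicate more observations.
p_alpha <- base + geom_point(alpha = 0.3, size = 1.5)
```

Print `p_opaque` and `p_alpha` side by side to compare. With data this dense, even lower alpha values may be worth trying.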
### Line Plots — Following the Trend: `geom_smooth()`

**When to use**: You want to show trends or fit a model through your data points. Note that `geom_smooth()` draws a fitted trend; to connect the observations themselves, use `geom_line()`.

**What it reveals**: The overall relationship pattern, whether linear or curved, and the confidence in that pattern.

```r
# Fitted linear trend (method = "lm") with its confidence band
ggplot(data = nhanes, aes(x = Age, y = log10(BloodLead))) +
  geom_smooth(method = "lm", se = TRUE, color = "red", fill = "pink") +
  labs(x = "Age (years)", y = "Blood Lead Level (log10)")
```

![[plot-types-r-1748593414883.webp]]

## Mixed Data Types: Numeric and Categorical

### Boxplots — The Statistical Summary: `geom_boxplot()`

**When to use**: You want to compare the distribution of a numeric variable across different categories.

**What it reveals**: **Think of a boxplot as a data biography**: it tells you the median (middle line), the middle 50% of data (the box), the typical range (whiskers), and the unusual cases (dots).

```r
# Boxplot with color coding
ggplot(data = nhanes, aes(x = AgeGroup, y = log10(BloodIron), fill = AgeGroup)) +
  geom_boxplot() +
  labs(x = "Age Group", y = "Blood Iron Level (log10)")
```

![[plot-types-r-1748593529965.webp]]

> [!note] Reading Boxplots
>
> - **Horizontal line in box**: Median (50th percentile)
> - **Box boundaries**: 25th and 75th percentiles (middle 50% of data)
> - **Whiskers**: Extend to the most extreme observations within 1.5 × IQR of the box edges (the `geom_boxplot()` default)
> - **Dots**: Outliers beyond the whiskers

### Violin Plots — The Shape Story: `geom_violin()`

**When to use**: You want to see both the summary statistics AND the distribution shape across categories.

**What it reveals**: **Imagine a density plot rotated and mirrored**: wider sections show where more data points cluster.
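The "rotated and mirrored density" description can be checked directly. This sketch, using ggplot2's bundled `mpg` data as an assumed stand-in for `nhanes`, draws a density curve and a violin for the same values. The violin's outline is a kernel density estimate of the same kind, turned on its side and reflected.

```r
library(ggplot2)

# Stand-in data (assumption): highway mileage for one vehicle class from `mpg`.
compact <- subset(mpg, class == "compact")

# An ordinary density curve of the values ...
p_density <- ggplot(data = compact, aes(x = hwy)) +
  geom_density() +
  labs(x = "Highway miles per gallon", y = "Density")

# ... and the same values as a violin: the density on its side, mirrored.
p_violin <- ggplot(data = compact, aes(x = class, y = hwy)) +
  geom_violin() +
  labs(x = "Vehicle class", y = "Highway miles per gallon")
```

The course example that follows applies the same geom to every age group at once.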
```r
# Violin plot with quantile lines
ggplot(data = nhanes, aes(x = AgeGroup, y = log10(BloodIron), fill = AgeGroup)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  labs(x = "Age Group", y = "Blood Iron Level (log10)")
```

![[plot-types-r-1748593579353.webp]]

> [!tip] Violin vs Boxplot
> Boxplots excel at quick statistical summaries; violin plots reveal distribution shapes. Use violins when the shape of the distribution matters for your analysis.

### Jitter Points — Showing Individual Stories: `geom_jitter()`

**When to use**: You want to overlay actual data points on any plot with a categorical axis.

**What it reveals**: Individual observations while preventing overplotting through small random horizontal movement.

```r
# Boxplot enhanced with jittered points
# height = 0 keeps the jitter horizontal so the plotted values stay exact
ggplot(data = nhanes, aes(x = AgeGroup, y = log10(BloodIron))) +
  geom_boxplot(fill = "lightgray", alpha = 0.7) +
  geom_jitter(width = 0.2, height = 0, alpha = 0.4, size = 0.8) +
  labs(x = "Age Group", y = "Blood Iron Level (log10)")
```

```r
# Violin plot enhanced with jittered points
ggplot(data = nhanes, aes(x = AgeGroup, y = log10(BloodIron), fill = AgeGroup)) +
  geom_violin(draw_quantiles = c(0.25, 0.5, 0.75)) +
  labs(x = "Age Group", y = "Blood Iron Level (log10)") +
  geom_jitter(shape = 16, alpha = 0.25, size = 0.75,
              position = position_jitter(width = 0.2, height = 0))
```

![[plot-types-r-1748593733415.webp]]

> [!warning] Jitter Width
> Use `width` values between 0.1 and 0.3. Too little and points overlap; too much and they spread beyond the category boundaries. `geom_jitter()` also jitters vertically by default, so set `height = 0` to keep the numeric values where they belong.

## Real-World Application: Choosing Your Visualization Strategy

> [!example]- Scenario
> **Scenario**: You're analyzing health data and need to present findings about blood lead levels.
>
> ### Step 1: Identify Your Variables
>
> - **Blood lead level**: Numeric (continuous)
> - **Age**: Numeric (continuous)
> - **Age group**: Categorical (ordered)
> - **Education**: Categorical (unordered)
>
> ### Step 2: Match Questions to Plot Types
>
> |Research Question|Variables Needed|Best Plot Choice|Why This Works|
> |---|---|---|---|
> |"What's the typical blood lead level?"|1 numeric|Histogram or Density|Shows distribution shape|
> |"How does blood lead change with age?"|2 numeric|Scatterplot or Line|Reveals relationship pattern|
> |"Do age groups differ in blood lead?"|1 numeric + 1 categorical|Boxplot or Violin|Compares distributions|
> |"What's the education breakdown?"|1 categorical|Bar chart|Shows group sizes|
>
> ### Step 3: Layer for Insight
>
> ```r
> # Comprehensive visualization combining multiple techniques
> ggplot(data = nhanes, aes(x = AgeGroup, y = log10(BloodLead), fill = AgeGroup)) +
>   geom_violin(alpha = 0.7, draw_quantiles = 0.5) +
>   # height = 0 keeps each dot at its measured value
>   geom_jitter(width = 0.2, height = 0, alpha = 0.3, size = 0.5) +
>   labs(
>     title = "Blood Lead Levels by Age Group",
>     x = "Age Group",
>     y = "Blood Lead Level (log10 transformed)",
>     caption = "Violin plots show distribution shape; dots show individual measurements"
>   ) +
>   theme_minimal()
> ```

## Connecting the Framework: Your Visualization Toolkit

Each data structure points toward specific visualization strengths:

- **Categorical data** → **Bar charts** reveal group sizes and comparisons
- **Single numeric** → **Histograms/Density plots** reveal distribution patterns
- **Two numeric** → **Scatterplots/Line plots** reveal relationships and trends
- **Mixed types** → **Boxplots/Violin plots** reveal group comparisons

> [!tip] Strategy
> Start with the basic plot that matches your data structure, then enhance with additional layers (jitter points, smooth lines, color coding) to add explanatory power.
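As one illustration of that strategy, the sketch below starts from a plain scatterplot and then adds a fitted trend as a second layer, combining the two geoms from the "Two Numeric Variables" section. It assumes ggplot2's bundled `mpg` data as a stand-in for `nhanes`; the variable names belong to that dataset, not to the course data.

```r
library(ggplot2)

# Step 1: the basic plot that matches the data structure (two numeric variables).
p_basic <- ggplot(data = mpg, aes(x = displ, y = hwy)) +
  geom_point(alpha = 0.6, size = 1.5) +
  labs(x = "Engine displacement (litres)", y = "Highway miles per gallon")

# Step 2: enhance it with a linear trend and its confidence band.
p_layered <- p_basic +
  geom_smooth(method = "lm", formula = y ~ x, se = TRUE,
              color = "red", fill = "pink")
```

Because a ggplot object can be extended with `+`, the basic plot stays available for comparison while the layered version adds the trend on top.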
**Let your data's structure guide your visualization choice, not visual preferences.** A systematic approach based on variable types and counts will consistently lead you to plots that reveal genuine insights rather than simply looking attractive. Spend less time choosing plots and more time discovering patterns in your data.
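The variable-type rules in this note can be written down as a small lookup. The function below is a hypothetical helper: the name `suggest_plot` and its return strings are inventions for illustration, not part of ggplot2 or the course material. It treats factors, characters, and logicals as categorical, and it only covers the combinations discussed here.

```r
# Hypothetical helper (not a ggplot2 function): maps variable types
# to the plot families described in this note.
suggest_plot <- function(x, y = NULL) {
  is_cat <- function(v) is.factor(v) || is.character(v) || is.logical(v)

  if (is.null(y)) {
    # One variable: categorical -> counts, numeric -> distribution
    if (is_cat(x)) return("bar chart: geom_bar()")
    if (is.numeric(x)) {
      return("histogram or density: geom_histogram() / geom_density()")
    }
  } else {
    # Two variables, both numeric -> relationship or trend
    if (is.numeric(x) && is.numeric(y)) {
      return("scatterplot or trend line: geom_point() / geom_smooth()")
    }
    # Two variables, one numeric and one categorical -> group comparison
    if ((is_cat(x) && is.numeric(y)) || (is.numeric(x) && is_cat(y))) {
      return("boxplot or violin: geom_boxplot() / geom_violin()")
    }
  }
  # Anything else (dates, two categorical variables, ...) is out of scope
  "not covered by this framework"
}

# Usage with the built-in iris data
suggest_plot(iris$Species)
suggest_plot(iris$Sepal.Length)
suggest_plot(iris$Sepal.Length, iris$Petal.Length)
suggest_plot(iris$Species, iris$Sepal.Length)
```

A lookup like this is only a starting point. The research question still decides between the two options in each family.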