r-basics-part-3-data-wrangling-plots

2025-04-07 Monday #r-programming #rstudio #data-science [[maps-of-content]] # Data Manipulation in R: Base R, dplyr, and data.table Approaches > [!success]- Concept Sketch: [[]] > ![[]] > [!abstract]- Quick Review > > **Core Essence**: R offers multiple approaches to manipulate data efficiently, from base R indexing to specialized packages like dplyr and data.table, each with its own syntax and strengths. > > **Key Concepts**: > > - Vector indexing using logical conditions > - The dplyr "grammar of data manipulation" (filter, select, mutate, summarize) > - data.table's concise, fast syntax for large datasets > - Tibbles as modern data frames in the tidyverse > > **Must Remember**: > > - The pipe operator (`%>%`) connects operations in a readable flow > - `group_by()` enables powerful split-apply-combine workflows > - data.table uses `[i, j, by]` syntax for efficient operations > - Different approaches have trade-offs in readability vs. performance > > **Critical Relationships**: > > - dplyr functions connect via the pipe operator to form data workflows > - Grouped data connects with summarize() to create aggregated statistics > - Base R, dplyr, and data.table offer different syntax for similar operations > [!code]- Code Reference > > Download > > |Command/Syntax|Purpose|Example| > |---|---|---| > |**Base R: Indexing & Subsetting**||| > |`vector[logical_condition]`|Subset elements using condition|`values[values > 10]`| > |`which(logical_vector)`|Get indices of TRUE values|`which(values > 10)`| > |`match(x, table)`|Find positions of x in table|`match(c("a", "b"), letters)`| > |`x %in% y`|Check if elements of x are in y|`c("a", "b") %in% letters`| > |`sum(logical_vector)`|Count TRUE values|`sum(values > 10)`| > |**Base R: Data Frames**||| > |`data.frame(...)`|Create a data frame|`data.frame(x = 1:3, y = c("a", "b", "c"))`| > |`df$column`|Access a column|`mtcars$mpg`| > |`df[rows, columns]`|Subset data frame|`mtcars[mtcars$cyl == 6, c("mpg", "hp")]`| > |**Base R: Plotting**||| > |`plot(x, y)`|Create scatterplot|`plot(mtcars$wt, mtcars$mpg)`| > |`hist(x)`|Create histogram|`hist(mtcars$mpg)`| > |`boxplot(formula, data)`|Create boxplot|`boxplot(mpg ~ cyl, data = mtcars)`| > |**dplyr: Core Functions**||| > |`filter(df, condition)`|Subset rows|`filter(mtcars, mpg > 20)`| > |`select(df, columns)`|Choose columns|`select(mtcars, mpg, cyl, hp)`| > |`mutate(df, new_col = expr)`|Create/modify columns|`mutate(mtcars, efficiency = mpg/wt)`| > |`summarize(df, stat = func())`|Calculate summaries|`summarize(mtcars, avg_mpg = mean(mpg))`| > |`group_by(df, columns)`|Group data|`group_by(mtcars, cyl)`| > |`arrange(df, columns)`|Sort rows|`arrange(mtcars, desc(mpg))`| > |`%>%`|Pipe operator|`mtcars %>% filter(cyl == 6) %>% select(mpg)`| > |`n()`|Count rows in group|`summarize(group_by(mtcars, cyl), count = n())`| > |`pull(df, column)`|Extract column as vector|`mtcars %>% pull(mpg)`| > |`head(df, n)`|Get first n rows|`head(mtcars, 5)`| > |`top_n(df, n, column)`|Get top n rows by column|`top_n(mtcars, 5, mpg)`| > |**data.table: Core Syntax**||| > |`data.table(...)`|Create a data.table|`data.table(x = 1:3, y = letters[1:3])`| > |`setDT(df)`|Convert data frame to data.table|`setDT(mtcars_copy)`| > |`dt[i, j, by]`|Main syntax pattern|`dt[mpg > 20, .(mean_hp = mean(hp)), by = cyl]`| > |`dt[rows]`|Filter rows|`dt[mpg > 20]`| > |`dt[, columns]`|Select columns|`dt[, .(mpg, cyl)]`| > |`dt[, new_col := expr]`|Add/modify column|`dt[, power_ratio := hp/wt]`| > |`dt[, .(stats), by = group]`|Grouped summaries|`dt[, .(avg = mean(mpg)), by = cyl]`| > |`dt[order(columns)]`|Sort rows|`dt[order(-mpg)]`| > |`.N`|Count rows|`dt[, .N, by = cyl]`| > |**Tibbles**||| > |`tibble(...)`|Create a tibble|`tibble(x = 1:3, y = letters[1:3])`| > |`as_tibble(df)`|Convert to tibble|`as_tibble(mtcars)`| ## Introduction to Data Manipulation in R R excels at transforming, filtering, and summarizing data—collectively known as "data manipulation" or "data wrangling." This material explores three distinct approaches to manipulate data in R: 1. **Base R** - The fundamental tools built into the language 2. **dplyr** - A grammar-based package for intuitive data manipulation 3. **data.table** - A high-performance package for efficient operations Understanding these approaches gives you flexibility to choose the right tool based on your needs: readability, simplicity, or performance with large datasets. ## Base R: Indexing and Subsetting ### Vector Indexing with Logical Conditions **Vector indexing** forms the foundation of data manipulation in R. You can extract elements that meet specific conditions using logical expressions. > [!note] When you apply a logical condition to a vector, R returns a logical vector (TRUE/FALSE) that can be used to select matching elements. For example: ```r # Create a vector of murder rates murder_rate <- c(0.8, 1.2, 0.5, 2.1, 0.7) # Create a logical vector for rates <= 0.71 low_rates <- murder_rate <= 0.71 # Use logical vector to subset original vector murder_rate[low_rates] ``` ### Useful Logical Functions - **sum()** with a logical vector counts TRUE values - **which()** returns the indices where values are TRUE - **match()** finds positions of values from one vector in another - **%in%** checks if elements appear in another vector > [!tip] Combine logical conditions using & (AND) and | (OR) operators to create complex filters. ## Creating Data Frames Data frames are R's primary structure for tabular data. Create them using the `data.frame()` function: ```r # Create a simple data frame df <- data.frame( name = c("Alice", "Bob", "Charlie"), age = c(25, 30, 35), score = c(98, 87, 92) ) ``` > [!warning] In older R versions (before 4.0), character vectors were automatically converted to factors in data frames. Use `stringsAsFactors = FALSE` if working with older R code. ## Visualizing Data in Base R Base R includes several functions for exploratory visualization: - **plot()** - Creates scatterplots - **hist()** - Generates histograms - **boxplot()** - Produces box-and-whisker plots ```r # Simple scatterplot plot(mtcars$wt, mtcars$mpg) # Histogram hist(mtcars$mpg) # Boxplot boxplot(mpg ~ cyl, data = mtcars) ``` ## The dplyr Approach to Data Manipulation ### Core dplyr Functions The **dplyr** package provides a consistent set of functions designed specifically for data manipulation: - **filter()** - Subset rows based on conditions - **select()** - Choose columns by name - **mutate()** - Add new columns or modify existing ones - **summarize()** - Calculate summary statistics - **arrange()** - Sort rows based on values ### The Pipe Operator: Chaining Operations The pipe operator (`%>%`) allows you to chain multiple operations together, making code more readable by passing the result of one function to the next. ```r library(dplyr) mtcars %>% filter(cyl == 6) %>% select(mpg, wt, hp) %>% mutate(power_to_weight = hp/wt) %>% arrange(desc(power_to_weight)) ``` > [!note] Think of the pipe (`%>%`) as saying "and then" between operations, allowing you to read the code as a sequence of steps. ### Filtering Rows with dplyr The `filter()` function selects rows that satisfy specific conditions: ```r # Select cars with good mileage and 6 cylinders good_cars <- mtcars %>% filter(mpg > 20 & cyl == 6) ``` ### Selecting Columns Use `select()` to keep only the columns you need: ```r # Keep only mpg, cyl, and wt columns mtcars %>% select(mpg, cyl, wt) ``` ### Creating or Modifying Columns The `mutate()` function adds new columns or changes existing ones: ```r # Add a column showing power-to-weight ratio mtcars %>% mutate(power_ratio = hp/wt) ``` ## Summarizing Data with dplyr ### Basic Summaries The `summarize()` function (also spelled `summarise()`) computes summary statistics: ```r # Calculate average and standard deviation of mpg mtcars %>% summarize( avg_mpg = mean(mpg), sd_mpg = sd(mpg), count = n() ) ``` ### Grouped Summaries `group_by()` combined with `summarize()` creates summaries for each group: ```r # Calculate average mpg for each cylinder group mtcars %>% group_by(cyl) %>% summarize( avg_mpg = mean(mpg), count = n() ) ``` > [!tip] The `n()` function used inside `summarize()` counts the number of rows in each group. ### Accessing Summarized Values After creating summaries, you can extract specific values: - Using the `

operator: `summary_df$column_name` - Using the `pull()` function: `summary_df %>% pull(column_name)` The dot (`.`) in dplyr functions refers to the data being piped in: ```r mtcars %>% filter(mpg == max(.$mpg)) # Finds the car with highest mpg ``` ## Sorting Data with dplyr ### Basic Sorting Use `arrange()` to sort rows based on column values: ```r # Sort cars by mpg (ascending order) mtcars %>% arrange(mpg) # Sort by mpg in descending order mtcars %>% arrange(desc(mpg)) ``` ### Multi-level Sorting Provide multiple columns to `arrange()` for nested sorting: ```r # Sort by cylinder group, then by descending horsepower mtcars %>% arrange(cyl, desc(hp)) ``` ### Getting Top/Bottom Rows - `head(n)` returns the first n rows - `top_n(n, column)` returns the top n rows based on column values > [!warning] `top_n()` doesn't guarantee the results are sorted. For sorted results, combine with `arrange()`. > [!case]- Case Application: dplyr Workflow > > Imagine we have a dataset of student scores from different schools and want to identify the top-performing schools. >```r > # Step 1: Calculate average scores by school > school_performance <- student_data %>% > group_by(school) %>% > summarize( > avg_score = mean(score), > num_students = n(), > passing_rate = mean(score >= 70) > ) > > # Step 2: Filter for schools with sufficient students > # and sort by average score > top_schools <- school_performance %>% > filter(num_students >= 30) %>% > arrange(desc(avg_score)) %>% > head(10) > >``` > > This workflow follows a logical sequence: group data → calculate summaries → filter results → sort and display, all connected with the pipe operator. > ## Introduction to data.table The **data.table** package provides an alternative approach to data manipulation, optimized for performance with large datasets. It uses a concise syntax but requires learning a different approach. ### Converting to data.table Convert a data frame to a data.table object with `setDT()`: ```r library(data.table) dt <- data.table(mtcars) # Create new data.table # or mtcars_dt <- copy(mtcars) setDT(mtcars_dt) # Convert existing data frame ``` > [!note] `data.table` uses **reference semantics** - operations modify the original object without creating copies. ### data.table Syntax: [i, j, by] The core data.table syntax uses matrix-like notation: - `i` - which rows to select (1st position) - `j` - what to do with those rows (2nd position) - `by` - which groups to perform operations on ```r # Basic syntax dt[i, j, by] # Example: Select rows where mpg > 20, and return cyl and hp columns dt[mpg > 20, .(cyl, hp)] ``` ### Selecting Columns in data.table ```r # Select single column (returns vector) dt[, mpg] # Select multiple columns (returns data.table) dt[, .(mpg, cyl, hp)] ``` The `.()` function (shorthand for `list()`) specifies multiple columns. ### Adding or Modifying Columns The `:=` operator modifies columns in place: ```r # Add a new column dt[, power_ratio := hp/wt] # Modify multiple columns at once dt[, `:=`( power_ratio = hp/wt, weight_class = ifelse(wt > 3, "heavy", "light") )] ``` > [!warning] Unlike dplyr, data.table modifies the original data without creating a copy. Use `copy(dt)` if you need to preserve the original. ## Subsetting and Summarizing with data.table ### Filtering Rows Place logical conditions in the first position: ```r # Select cars with 6 cylinders dt[cyl == 6] ``` ### Calculating Summaries Use functions within `.()` in the second position: ```r # Calculate average mpg dt[, .(avg_mpg = mean(mpg))] # Multiple summaries dt[, .( avg_mpg = mean(mpg), min_mpg = min(mpg), max_mpg = max(mpg) )] ``` ### Grouped Summaries Add the `by` parameter to perform grouped calculations: ```r # Average mpg by cylinder group dt[, .(avg_mpg = mean(mpg)), by = cyl] # Multiple grouping variables dt[, .(avg_mpg = mean(mpg)), by = .(cyl, vs)] ``` ## Sorting with data.table Use the `order()` function in the first position: ```r # Sort by mpg dt[order(mpg)] # Sort by mpg descending dt[order(-mpg)] # Note the minus sign # Multiple sorting columns dt[order(cyl, -mpg)] ``` ## Introduction to Tibbles **Tibbles** are modern data frames used in the tidyverse ecosystem, offering several improvements: 1. More readable printing (shows only the first few rows and columns) 2. Consistent subsetting behavior 3. Warnings for non-existent columns ### Creating and Converting Tibbles ```r library(tibble) # Create a tibble from scratch tbl <- tibble( name = c("Alice", "Bob", "Charlie"), age = c(25, 30, 35), score = c(98, 87, 92) ) # Convert a data frame to a tibble mtcars_tbl <- as_tibble(mtcars) ``` ### Tibble Benefits Tibbles differ from traditional data frames in several ways: - **Smarter printing**: Only shows the first 10 rows and columns that fit the screen - **Type consistency**: Doesn't convert strings to factors - **Stricter subsetting**: Subsetting always returns another tibble - **Column checking**: Warns when accessing non-existent columns ```r # Tibble printing example mtcars_tbl # Prints concisely with column types # Subsetting returns a tibble (not a vector) mtcars_tbl[, "mpg"] # Returns a one-column tibble ``` > [!note] Working with tidyverse packages (like dplyr) is seamless with tibbles since all operations maintain the tibble structure. ### Grouped Tibbles When you apply `group_by()` to a tibble, it returns a special "grouped tibble" that maintains information about its grouping: ```r grouped_cars <- mtcars_tbl %>% group_by(cyl) # The grouping is shown when printed grouped_cars # Summarize operations respect the grouping grouped_cars %>% summarize(avg_mpg = mean(mpg)) ``` ## Comparing Approaches: Base R vs. dplyr vs. data.table Each approach to data manipulation in R has its strengths and ideal use cases: Download |Approach|Strengths|Ideal Use Cases| |---|---|---| |**Base R**|No dependencies, always available|Simple operations, scripts that need maximum compatibility| |**dplyr**|Readable, intuitive verb-based syntax|Data analysis workflows, teaching, readability-focused projects| |**data.table**|Performance with large datasets, memory efficiency|Big data operations, production systems, performance-critical applications| ### Syntax Comparison Here's how the same operations look with different approaches: > [!example] **Filtering rows where mpg > 20 and selecting mpg, cyl columns:** > > **Base R:** >```r > result <- mtcars[mtcars$mpg > 20, c("mpg", "cyl")] > >``` > **dplyr:** >```r > result <- mtcars %>% > filter(mpg > 20) %>% > select(mpg, cyl) > >``` > **data.table:** >```r > dt <- as.data.table(mtcars) > result <- dt[mpg > 20, .(mpg, cyl)] > >``` ## Practical Workflow Considerations ### When to Use Each Approach - **Use Base R when**: - Working with simple, vector-based operations - Maximum compatibility is needed - No additional packages can be installed - **Use dplyr when**: - Readability and expressiveness are priorities - Working within the tidyverse ecosystem - Communicating analysis to others - **Use data.table when**: - Processing large datasets (millions of rows) - Performance is critical - Memory efficiency matters > [!tip] It's perfectly acceptable to mix approaches in a single project, using each tool where it shines brightest. ### Common Workflow Patterns Regardless of which approach you choose, data manipulation tasks often follow common patterns: 1. **Import/Create** → Load or create your dataset 2. **Clean** → Handle missing values, correct data types 3. **Transform** → Create new variables, reshape data 4. **Subset** → Filter rows, select columns of interest 5. **Aggregate** → Group and summarize 6. **Arrange** → Sort results as needed 7. **Export/Visualize** → Save or present your results > [!case]- Case Application: Multi-Package Workflow > > In real-world analysis, you might combine approaches for optimal results: >```r > # Load libraries > library(data.table) > library(dplyr) > library(ggplot2) > > # Load a large dataset with data.table for speed > dt <- fread("large_dataset.csv") > > # Perform initial aggregation with data.table (for performance) > summary_dt <- dt[, .( > avg_value = mean(value), > count = .N > ), by = .(group, year)] > > # Convert to tibble for tidyverse integration > summary_tbl <- as_tibble(summary_dt) > > # Use dplyr for readable further manipulation > final_result <- summary_tbl %>% > filter(count >= 30) %>% > mutate(reliability = sqrt(count)/10) %>% > arrange(desc(avg_value)) > > # Visualize with ggplot2 > ggplot(final_result, aes(x = year, y = avg_value, color = group)) + > geom_line() + > theme_minimal() > >``` > > This workflow leverages data.table's speed for the heavy lifting on the large dataset, then switches to dplyr for its readable manipulation syntax, and finally uses ggplot2 for visualization. ## Summary: Mastering Data Manipulation in R Data manipulation is at the heart of data analysis in R. The three approaches covered—Base R, dplyr, and data.table—offer different tools for similar tasks, each with its own advantages: - **Base R** provides fundamental tools that are always available - **dplyr** offers a grammar of data manipulation with readable, verb-based functions - **data.table** provides concise syntax optimized for performance The most effective R programmers understand all three approaches and choose the right one for each specific task—sometimes combining them in a single analysis. > [!abstract] Most Important Takeaway > > **The choice between Base R, dplyr, and data.table is not about which one is "best" overall, but about selecting the right tool for your specific needs, considering factors like readability, performance, and integration with other packages.** -- Reference: - Data Science, HarvardX