<sub>2025-05-28</sub> <sub>#data-visualization #data-management #r-programming #hmp669</sub>
<sup>[[maps-of-content|Maps of Content - All Notes]] </sup>
<sup>Series: [[hmp669|HMP 669 - Data Management and Visualization]]</sup>
<sup>Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]</sup>
# Reading and Writing Data in R
> [!abstract]- Learning Overview
>
>
> **Key Concepts**:
>
> - **Reading Data**: Bringing external data into R's working memory
> - **Writing Data**: Moving data from R to permanent storage or sharing
> - **Storage Distinction**: Temporary environment vs. permanent hard drive storage
>
> **Critical Connections**:
>
> - File extensions must match their corresponding R functions
> - R environment is like RAM (temporary), hard drive storage is permanent
> - R's native formats (.rda) offer significant advantages for future work
>
> **Must Remember**:
>
> - Always identify file type first, then choose matching function
> - Data in R environment disappears when you close R unless explicitly saved
> - Specialized packages can directly retrieve public datasets, skipping download steps

> [!info]- Required R Packages
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`readr`|CSV and TSV files (tidyverse approach)|`install.packages("tidyverse")`|
> |`readxl`|Reading Excel files|`install.packages("readxl")`|
> |`writexl`|Writing Excel files (simple)|`install.packages("writexl")`|
> |`openxlsx`|Advanced Excel operations (multiple sheets)|`install.packages("openxlsx")`|
> |`haven`|SAS, STATA, SPSS files|`install.packages("haven")`|
> |`nhanesA`|Direct NHANES data access|`install.packages("nhanesA")`|
> |`WDI`|World Bank data retrieval|`install.packages("WDI")`|
>
> **Basic setup:**
> ```r
> library(tidyverse) # Includes readr
> library(readxl)
> library(writexl)
> ```
>
> **For specialized data sources:**
> ```r
> library(nhanesA) # Only if accessing NHANES
> library(WDI) # Only if accessing World Bank data
> ```
>
> **Note**: Base R functions (read.csv, write.csv, save, load) require no additional packages.

> [!code]- Syntax Reference
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Reading Data**|||
> |`read.csv()`|Read CSV files|`data <- read.csv("file.csv")`|
> |`read_excel()`|Read Excel files|`data <- read_excel("file.xlsx")`|
> |`read_sas()`|Read SAS files|`data <- read_sas("file.sas7bdat")`|
> |**Writing Data**|||
> |`write.csv()`|Write CSV files|`write.csv(data, "output.csv")`|
> |`write_xlsx()`|Write Excel files|`write_xlsx(data, "output.xlsx")`|
> |**R Format Operations**|||
> |`save()`|Save R objects|`save(data, file = "data.rda")`|
> |`load()`|Load R objects|`load("data.rda")`|
> |**Environment Management**|||
> |`ls()`|List environment objects|`ls()`|
> |`rm()`|Remove objects|`rm(object_name)`|
---
## Why This Matters
**Public health data surrounds us**---from study participants, internet repositories, and collaborative sharing. But data sitting in files can't answer questions. You need to bridge the gap between static data storage and dynamic R analysis, then share your insights with the world.
This fundamental skill determines whether you spend hours wrestling with data formats or minutes moving seamlessly between data sources and analytical workflows.
## The Essential Workflow: Read → Analyze → Write
### Reading Data: Bringing the Outside World In
**Reading data** means importing information from external sources into R's working environment. Think of it as inviting data to join your analytical conversation.
The process follows a simple two-step pattern:
**Step 1: Identify the data type** by examining the file extension
- `.csv` (comma-separated values)
- `.tsv` (tab-separated values)
- `.xlsx` (Excel spreadsheet)
- `.sas7bdat` (SAS data files)
**Step 2: Select the matching function** designed for that specific format
- `read_csv()` for `.csv` files (from readr package; base R's equivalent is `read.csv()`)
- `read_excel()` for `.xlsx` files (from readxl package)
- `read_sas()` for SAS data (from haven package)
| Where you likely made the data | Data extension | Function | Package |
| ------------------------------ | -------------- | ------------ | ------- |
| Excel | .tsv | read_tsv() | readr |
| Excel | .csv | read_csv() | readr |
| Excel | .xls, .xlsx | read_excel() | readxl |
| SAS | .xpt | read_xpt() | haven |
| SAS | .sas7bdat | read_sas() | haven |
| STATA | .dta | read_dta() | haven |
| SPSS | .sav | read_sav() | haven |
> [!tip] The Golden Rule
> **Match the function to the file extension.** Each data type has evolved specific functions because different formats store information differently---like needing different keys for different locks.
#### Example in Practice
```r
# Reading a CSV file
schools <- read.csv("school_districts.csv")
# This says: "Take the CSV file called 'school_districts',
# read it into R, and create an object called 'schools'"
```
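The two-step pattern can also be sketched as a small dispatcher that inspects the extension and picks a reader. `read_any()` is a hypothetical helper written for illustration, not a function from any package. It assumes `readxl` and `haven` are installed only if those branches are actually reached, and it uses base R readers for the delimited formats. The demonstration writes a temporary CSV so the sketch runs without external files.

```r
# Hypothetical helper: choose a reader based on the file extension
read_any <- function(path) {
  ext <- tolower(tools::file_ext(path))   # Step 1: identify the data type
  switch(ext,                             # Step 2: select the matching function
    csv      = utils::read.csv(path),
    tsv      = utils::read.delim(path),
    xlsx     = ,                          # falls through to the xls branch
    xls      = readxl::read_excel(path),
    sas7bdat = haven::read_sas(path),
    stop("No reader registered for extension: ", ext)
  )
}

# Demonstration with a temporary file (made-up values)
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(district = c("A", "B"), students = c(120, 95)),
          tmp, row.names = FALSE)
schools <- read_any(tmp)
str(schools)
```

An unrecognized extension stops with an error instead of guessing, which matches the rule that the function must fit the format.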
### Direct Data Retrieval: Skip the Download
**Many public health datasets offer R packages** that directly query and load data, making the process more efficient and reducing potential errors.
Instead of: Download → Save → Read into R
You can: Query directly → Load into R
> [!note] NHANES Example
> The National Health and Nutrition Examination Survey (NHANES) examines about 5,000 people each year and releases its data in two-year cycles. The `nhanesA` package lets you directly access:
>
> - Demographics data
> - Dietary information
> - Laboratory results
> - Questionnaire responses
>
> All without ever visiting a website or managing downloads.
## Understanding R's Memory System
### The Environment: Your Temporary Workshop
When you read data into R, **where does it go?**
Your data lands in the **R environment**---visible in RStudio's upper-right panel. This is R's active memory, analogous to your computer's RAM.
> [!warning] Critical Understanding
> **The environment is temporary storage.** Objects here disappear when you:
>
> - Close R without saving
> - Experience a system crash
> - Clear your environment
>
> Think of it like items on your desk---they're immediately accessible for work, but they're not filed away for permanent keeping.
### The Hard Drive: Permanent Storage
**Saving to your hard drive is a separate, intentional step.** This is like moving important documents from your desk into a filing cabinet for long-term storage.
The distinction mirrors two specs you compare when shopping for a computer:
- **RAM (Environment)**: Fast, temporary, active memory
- **Hard Drive (Saved Files)**: Permanent, persistent storage
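
The difference between the environment and the hard drive can be seen in a few lines of base R. This is a minimal sketch with made-up values: the file goes to a temporary directory, and `rm()` stands in for what happens to unsaved objects when a session ends.

```r
# A small data frame standing in for imported data
schools <- data.frame(district = c("A", "B"), students = c(120, 95))
in_env_before <- "schools" %in% ls()   # TRUE: the object is in the environment

# Saving is the separate, deliberate step that writes to disk
path <- file.path(tempdir(), "schools.rda")
save(schools, file = path)

# Removing the object mimics losing it at the end of a session
rm(schools)
in_env_after <- "schools" %in% ls()    # FALSE: gone from the environment
on_disk <- file.exists(path)           # TRUE: the saved copy remains
```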
## Writing Data: Sharing Your Work
**Writing data** means exporting information from R for external use---sharing with collaborators or making data publicly available.
The process mirrors reading:
**Step 1: Choose your desired output format**
**Step 2: Use the matching write function**
```r
# Writing to Excel format
write_xlsx(schools, "school_districts.xlsx")
# This says: "Take the 'schools' object from R and
# save it as an Excel file called 'school_districts'"
```
| Where you want to open the data | Data extension | Function | Package |
| ------------------------------- | -------------- | ------------ | ------- |
| Excel | .tsv | write_tsv() | readr |
| Excel | .csv | write_csv() | readr |
| Excel | .xlsx | write_xlsx() | writexl |
| SAS | .xpt | write_xpt() | haven |
| SAS | .sas7bdat | write_sas() (deprecated in haven; prefer write_xpt()) | haven |
| STATA | .dta | write_dta() | haven |
| SPSS | .sav | write_sav() | haven |
### The R Format Advantage
**Once you've imported external data, save it in R's native format (.rda)** for future efficiency.
R format advantages:
- **Faster loading** of large datasets
- **No specialty packages required** (uses base R only)
- **Preserves object names** automatically
- **Maintains data structure** perfectly
```r
# Saving in R format
save(schools, file = "schools.rda")
# Next time:
load("schools.rda") # schools object appears instantly
```
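To illustrate the "maintains data structure" point, the sketch below saves the same data frame as `.rda` and as `.csv`, then compares what comes back. The `visits` data and its column names are invented for illustration. It assumes only base R.

```r
# Invented example data with a factor and a Date column
visits <- data.frame(
  clinic     = factor(c("north", "south"), levels = c("south", "north")),
  visit_date = as.Date(c("2024-01-15", "2024-02-01"))
)
original <- visits

rda_path <- tempfile(fileext = ".rda")
csv_path <- tempfile(fileext = ".csv")
save(visits, file = rda_path)
write.csv(visits, csv_path, row.names = FALSE)

rm(visits)
load(rda_path)                                  # restores 'visits' by name
from_csv <- read.csv(csv_path, stringsAsFactors = FALSE)

same_after_rda <- identical(visits, original)   # factor levels and dates intact
csv_classes <- sapply(from_csv, class)          # both columns come back as character
print(same_after_rda)
print(csv_classes)
```

The CSV round trip keeps the values but drops the factor levels and the Date class, so those would need to be rebuilt after every import.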
> [!tip] Best Practice Workflow
>
> 1. Import external data using appropriate read function
> 2. Immediately save as `.rda` for future sessions
> 3. Work with the R format version going forward
> 4. Export to other formats only when sharing externally
## R File Types: Understanding Your Options
|Extension|Purpose|Contains|
|---|---|---|
|**`.Rmd`**|R Markdown|Code + formatted output + text|
|**`.R`**|R Script|Pure code only|
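|**`.rda`**|R Data File|One or more named objects written with `save()`|
|**`.RData`**|R Workspace|Same format as `.rda`; by convention, the whole workspace|
|**`.rds`**|Single R Object|One object written with `saveRDS()`, stored without its name|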
|**`.Rproj`**|R Project|Organizational container|
> [!note] File Type Strategy
>
> - Use **`.Rmd`** for analysis reports and documentation
> - Use **`.R`** for reusable code and functions
> - Use **`.rda`** for clean, processed datasets
> - Use **`.Rproj`** to organize related files together
## Connecting the Concepts
Data handling in R follows a **logical ecosystem**:
**External World** ↔ **R Environment** ↔ **Permanent Storage**
- **Reading** moves data from external sources into your active workspace
- **Analysis** happens in the environment using temporary objects
- **Writing** moves results from workspace to permanent storage or sharing
The key insight: **R's environment is your analytical workshop**---temporary, flexible, and powerful. But like any workshop, you must deliberately save your important work.
# Examples: Reading and Writing Data in R
## Reading Data: From Files to R Environment
### CSV Files (Most Common)
```r
# Step 1: Identify → .csv file
# Step 2: Match function → read_csv()
library(readr)
my_data <- read_csv("data/health_survey.csv")
```
### Excel Files
```r
# Step 1: Identify → .xlsx or .xls file
# Step 2: Match function → read_excel()
library(readxl)
patient_data <- read_excel("data/patient_records.xlsx")
# If multiple sheets exist, specify which one
patient_data <- read_excel("data/patient_records.xlsx", sheet = "2023_data")
```
### Tab-Separated Values
```r
# Step 1: Identify → .tsv file
# Step 2: Match function → read_tsv()
library(readr)
survey_results <- read_tsv("data/survey_responses.tsv")
```
### SAS Data Files
```r
# Step 1: Identify → .sas7bdat file
# Step 2: Match function → read_sas()
library(haven)
clinical_trial <- read_sas("data/clinical_study.sas7bdat")
```
### R Data Files (Previously Saved)
```r
# Step 1: Identify → .rda file
# Step 2: Match function → load()
load("data/cleaned_dataset.rda") # Restores the object under its original name
# Alternative for .rds files (written with saveRDS()): assign to any name
my_data <- readRDS("data/cleaned_dataset.rds")
```
```
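Because `load()` restores objects under their saved names, it can be unclear what just appeared in the environment. A minimal sketch, using made-up data and temporary files, shows that `load()` invisibly returns the restored names, while `readRDS()` returns the object itself:

```r
# Made-up data written in both R formats
cleaned_dataset <- data.frame(id = 1:3, score = c(88, 92, 79))
rda_path <- tempfile(fileext = ".rda")
rds_path <- tempfile(fileext = ".rds")
save(cleaned_dataset, file = rda_path)   # stores the object and its name
saveRDS(cleaned_dataset, rds_path)       # stores the object only

rm(cleaned_dataset)
restored <- load(rda_path)     # character vector of the names that were restored
my_data  <- readRDS(rds_path)  # the object itself; you choose the name
print(restored)
```

Capturing the return value of `load()` is a simple way to check which objects a file added, or silently overwrote, in the environment.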
## Writing Data: From R Environment to Files
### CSV Files (Universal Format)
```r
# Step 1: Choose output → .csv format
# Step 2: Match function → write_csv()
library(readr)
write_csv(my_cleaned_data, "output/final_analysis.csv")
# Base R alternative
write.csv(my_cleaned_data, "output/final_analysis.csv", row.names = FALSE)
```
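The `row.names = FALSE` argument in the base R call matters because `write.csv()` otherwise writes row names as an extra, unnamed first column. A short sketch with made-up data and temporary files shows the effect when each file is read back:

```r
my_cleaned_data <- data.frame(id = 1:2, score = c(88, 92))

with_rn    <- tempfile(fileext = ".csv")
without_rn <- tempfile(fileext = ".csv")
write.csv(my_cleaned_data, with_rn)                        # default: row names written
write.csv(my_cleaned_data, without_rn, row.names = FALSE)  # row names omitted

cols_with    <- names(read.csv(with_rn))      # "X" "id" "score"
cols_without <- names(read.csv(without_rn))   # "id" "score"
print(cols_with)
print(cols_without)
```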
### Excel Files
```r
# Step 1: Choose output → .xlsx format
# Step 2: Match function → write_xlsx()
library(writexl)
write_xlsx(patient_summary, "output/patient_report.xlsx")
# For multiple sheets
library(openxlsx)
dataset_list <- list("Summary" = summary_data, "Details" = detail_data)
write.xlsx(dataset_list, "output/complete_report.xlsx")
```
### R Data Format (Recommended for Future Use)
```r
# Step 1: Choose output → .rda format
# Step 2: Match function → save()
# Save single object
save(cleaned_data, file = "data/cleaned_data.rda")
# Save multiple objects
save(data1, data2, results, file = "data/analysis_workspace.rda")
# Alternative: saveRDS for single objects
saveRDS(cleaned_data, "data/cleaned_data.rds")
```
## Direct Data Retrieval Examples
Skip the download step entirely with specialized packages:
```r
# NHANES data (no file download needed)
library(nhanesA)
demo_data <- nhanes('DEMO_J') # Gets 2017-2018 demographic data
# World Bank data
library(WDI)
gdp_data <- WDI(country = "all",
                indicator = "NY.GDP.MKTP.CD",
                start = 2010, end = 2020)
```