<sub>2025-05-28</sub> <sub>#data-visualization #data-management #r-programming #hmp669</sub>
<sup>[[maps-of-content|🌐 Maps of Content — All Notes]] </sup>
<sup>Series: [[hmp669|HMP 669 — Data Management and Visualization]]</sup>
<sup>Topic: [[hmp669#Data Visualization Using R|Data Visualization Using R]]</sup>

# Reading and Writing Data in R

> [!abstract]- Learning Overview
>
> **Key Concepts**:
>
> - **Reading Data**: Bringing external data into R's working memory
> - **Writing Data**: Moving data from R to permanent storage or sharing
> - **Storage Distinction**: Temporary environment vs. permanent hard drive storage
>
> **Critical Connections**:
>
> - File extensions must match their corresponding R functions
> - R environment is like RAM (temporary), hard drive storage is permanent
> - R's native formats (.rda) offer significant advantages for future work
>
> **Must Remember**:
>
> - Always identify file type first, then choose matching function
> - Data in R environment disappears when you close R unless explicitly saved
> - Specialized packages can directly retrieve public datasets, skipping download steps

> [!info]- Required R Packages
>
> |Package|Purpose|Installation|
> |---|---|---|
> |`readr`|CSV and TSV files (tidyverse approach)|`install.packages("tidyverse")`|
> |`readxl`|Reading Excel files|`install.packages("readxl")`|
> |`writexl`|Writing Excel files (simple)|`install.packages("writexl")`|
> |`openxlsx`|Advanced Excel operations (multiple sheets)|`install.packages("openxlsx")`|
> |`haven`|SAS, STATA, SPSS files|`install.packages("haven")`|
> |`nhanesA`|Direct NHANES data access|`install.packages("nhanesA")`|
> |`WDI`|World Bank data retrieval|`install.packages("WDI")`|
>
> **Basic setup:**
> ```r
> library(tidyverse) # Includes readr
> library(readxl)
> library(writexl)
> ```
>
> **For specialized data sources:**
> ```r
> library(nhanesA) # Only if accessing NHANES
> library(WDI)     # Only if accessing World Bank data
> ```
>
> **Note**: Base R functions (`read.csv`, `write.csv`, `save`, `load`) require no additional packages.

> [!code]- Syntax Reference
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Reading Data**|||
> |`read.csv()`|Read CSV files|`data <- read.csv("file.csv")`|
> |`read_excel()`|Read Excel files|`data <- read_excel("file.xlsx")`|
> |`read_sas()`|Read SAS files|`data <- read_sas("file.sas7bdat")`|
> |**Writing Data**|||
> |`write.csv()`|Write CSV files|`write.csv(data, "output.csv")`|
> |`write_xlsx()`|Write Excel files|`write_xlsx(data, "output.xlsx")`|
> |**R Format Operations**|||
> |`save()`|Save R objects|`save(data, file = "data.rda")`|
> |`load()`|Load R objects|`load("data.rda")`|
> |**Environment Management**|||
> |`ls()`|List environment objects|`ls()`|
> |`rm()`|Remove objects|`rm(object_name)`|

---

## Why This Matters

**Public health data surrounds us**---from study participants, internet repositories, and collaborative sharing. But data sitting in files can't answer questions. You need to bridge the gap between static data storage and dynamic R analysis, then share your insights with the world.

This fundamental skill determines whether you spend hours wrestling with data formats or minutes moving seamlessly between data sources and analytical workflows.

## The Essential Workflow: Read → Analyze → Write

### Reading Data: Bringing the Outside World In

**Reading data** means importing information from external sources into R's working environment. Think of it as inviting data to join your analytical conversation.

The process follows a simple two-step pattern:

**Step 1: Identify the data type** by examining the file extension

- `.csv` (comma-separated values)
- `.tsv` (tab-separated values)
- `.xlsx` (Excel spreadsheet)
- `.sas7bdat` (SAS data files)

**Step 2: Select the matching function** designed for that specific format

- `read_csv()` for `.csv` files (from the readr package; base R's `read.csv()` also works and needs no package)
- `read_excel()` for `.xlsx` files (from the readxl package)
- `read_sas()` for SAS data (from the haven package)

| Where you likely made the data | Data extension | Function     | Package |
| ------------------------------ | -------------- | ------------ | ------- |
| Excel                          | .tsv           | read_tsv()   | readr   |
| Excel                          | .csv           | read_csv()   | readr   |
| Excel                          | .xls, .xlsx    | read_excel() | readxl  |
| SAS                            | .xpt           | read_xpt()   | haven   |
| SAS                            | .sas7bdat      | read_sas()   | haven   |
| STATA                          | .dta           | read_dta()   | haven   |
| SPSS                           | .sav           | read_sav()   | haven   |

> [!tip] The Golden Rule
> **Match the function to the file extension.** Each data type has evolved specific functions because different formats store information differently---like needing different keys for different locks.

#### Example in Practice

```r
# Reading a CSV file with base R
schools <- read.csv("school_districts.csv")

# This says: "Take the CSV file called 'school_districts.csv',
# read it into R, and create an object called 'schools'"
```

### Direct Data Retrieval: Skip the Download

**Many public health datasets offer R packages** that directly query and load data, making the process more efficient and reducing potential errors.

Instead of: Download → Save → Read into R

You can: Query directly → Load into R

> [!note] NHANES Example
> The National Health and Nutrition Examination Survey (NHANES) examines roughly 5,000 people per year and typically releases its data in two-year cycles. The `nhanesA` package lets you directly access:
>
> - Demographics data
> - Dietary information
> - Laboratory results
> - Questionnaire responses
>
> All without ever visiting a website or managing downloads.

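
The two-step pattern from the start of this section (identify the extension, then pick the matching function) can be sketched as a small lookup. This is only an illustration: the helper name `reader_for()` is hypothetical and not part of any package, and it returns the *name* of the suggested function rather than calling it, so the sketch runs with base R alone.

```r
# Hypothetical helper: map a file extension to the reader function
# listed in the table above. Returns the function name as a string.
reader_for <- function(path) {
  ext <- tolower(tools::file_ext(path))   # Step 1: identify the extension
  switch(ext,                             # Step 2: match the function
    csv      = "readr::read_csv",
    tsv      = "readr::read_tsv",
    xls      = ,                          # falls through to the xlsx entry
    xlsx     = "readxl::read_excel",
    xpt      = "haven::read_xpt",
    sas7bdat = "haven::read_sas",
    dta      = "haven::read_dta",
    sav      = "haven::read_sav",
    rda      = "base::load",
    stop("No reader known for extension: ", ext)
  )
}

reader_for("school_districts.csv")
reader_for("data/clinical_study.sas7bdat")
```

Returning a name instead of reading the file keeps the example free of package dependencies; in real work you would simply call the matching function directly.
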
## Understanding R's Memory System

### The Environment: Your Temporary Workshop

When you read data into R, **where does it go?**

Your data lands in the **R environment**---visible in RStudio's upper-right panel. This is R's active memory, analogous to your computer's RAM.

> [!warning] Critical Understanding
> **The environment is temporary storage.** Objects here disappear when you:
>
> - Close R without saving
> - Experience a system crash
> - Clear your environment
>
> Think of it like items on your desk---they're immediately accessible for work, but they're not filed away for permanent keeping.

### The Hard Drive: Permanent Storage

**Saving to your hard drive is a separate, intentional step.** This is like moving important documents from your desk into a filing cabinet for long-term storage.

The distinction parallels computer shopping decisions:

- **RAM (Environment)**: Fast, temporary, active memory
- **Hard Drive (Saved Files)**: Permanent, persistent storage

## Writing Data: Sharing Your Work

**Writing data** means exporting information from R for external use---sharing with collaborators or making data publicly available.

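
Writing is also the deliberate step that moves an object out of the temporary environment and onto disk. The base R sketch below shows that distinction directly. The `schools` data frame is made-up illustration data, and the file goes to a temporary path so nothing in your project is touched.

```r
# Made-up example data that exists only in the environment
schools <- data.frame(district = c("A", "B"), enrollment = c(1200, 850))

rda_path <- tempfile(fileext = ".rda")   # throwaway location on disk
save(schools, file = rda_path)           # deliberate step: write to disk

rm(schools)                              # simulate losing the environment
exists("schools")                        # FALSE: the object is gone

load(rda_path)                           # restore from the saved file
exists("schools")                        # TRUE: back under its original name
```
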
The process mirrors reading:

**Step 1: Choose your desired output format**

**Step 2: Use the matching write function**

```r
# Writing to Excel format
library(writexl)
write_xlsx(schools, "school_districts.xlsx")

# This says: "Take the 'schools' object from R and
# save it as an Excel file called 'school_districts.xlsx'"
```

| Where you want to open the data | Data extension | Function     | Package |
| ------------------------------- | -------------- | ------------ | ------- |
| Excel                           | .tsv           | write_tsv()  | readr   |
| Excel                           | .csv           | write_csv()  | readr   |
| Excel                           | .xlsx          | write_xlsx() | writexl |
| SAS                             | .xpt           | write_xpt()  | haven   |
| SAS                             | .sas7bdat      | write_sas() (deprecated in haven; prefer write_xpt()) | haven |
| STATA                           | .dta           | write_dta()  | haven   |
| SPSS                            | .sav           | write_sav()  | haven   |

### The R Format Advantage

**Once you've imported external data, save it in R's native format (.rda)** for future efficiency.

R format advantages:

- **Faster loading** of large datasets
- **No specialty packages required** (uses base R only)
- **Preserves object names** automatically
- **Maintains data structure** (column types, factors, and attributes survive the round trip)

```r
# Saving in R format
save(schools, file = "schools.rda")

# Next time:
load("schools.rda")
# The 'schools' object reappears under its original name
```

> [!tip] Best Practice Workflow
>
> 1. Import external data using the appropriate read function
> 2. Immediately save as `.rda` for future sessions
> 3. Work with the R format version going forward
> 4. Export to other formats only when sharing externally

## R File Types: Understanding Your Options

|Extension|Purpose|Contains|
|---|---|---|
|**`.Rmd`**|R Markdown|Code + formatted output + text|
|**`.R`**|R Script|Pure code only|
|**`.rda`**|R Data File|One or more named objects written by `save()`; by convention a single dataset|
|**`.RData`**|R Workspace|Same format as `.rda`; by convention the whole workspace (e.g. from `save.image()`)|
|**`.Rproj`**|R Project|Organizational container|

> [!note] File Type Strategy
>
> - Use **`.Rmd`** for analysis reports and documentation
> - Use **`.R`** for reusable code and functions
> - Use **`.rda`** for clean, processed datasets
> - Use **`.Rproj`** to organize related files together

## Connecting the Concepts

Data handling in R follows a **logical ecosystem**:

**External World** ←→ **R Environment** ←→ **Permanent Storage**

- **Reading** moves data from external sources into your active workspace
- **Analysis** happens in the environment using temporary objects
- **Writing** moves results from workspace to permanent storage or sharing

The key insight: **R's environment is your analytical workshop**---temporary, flexible, and powerful. But like any workshop, you must deliberately save your important work.

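
The read, analyze, write cycle described in this note can be run end to end with base R alone. This is a minimal sketch under stated assumptions: the data values, column names, and file names are invented for illustration, and everything is written to a temporary directory.

```r
# Invented data standing in for an external file
raw_path <- file.path(tempdir(), "health_survey.csv")
write.csv(data.frame(id = 1:4, bmi = c(22.5, 27.1, 30.4, 24.8)),
          raw_path, row.names = FALSE)

# Read: external file -> environment
survey <- read.csv(raw_path)

# Analyze: work on the in-memory object
survey$obese <- survey$bmi >= 30

# Write: environment -> permanent storage (R format) plus a shareable CSV
rda_path <- file.path(tempdir(), "survey.rda")
csv_path <- file.path(tempdir(), "survey_clean.csv")
save(survey, file = rda_path)
write.csv(survey, csv_path, row.names = FALSE)
```

The `.rda` copy is the one to reload in later sessions; the CSV is the copy to hand to collaborators who may not use R.
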
# Examples: Reading and Writing Data in R

## Reading Data: From Files to R Environment

### CSV Files (Most Common)

```r
# Step 1: Identify → .csv file
# Step 2: Match function → read_csv()

library(readr)
my_data <- read_csv("data/health_survey.csv")
```

### Excel Files

```r
# Step 1: Identify → .xlsx or .xls file
# Step 2: Match function → read_excel()

library(readxl)
patient_data <- read_excel("data/patient_records.xlsx")

# If multiple sheets exist, specify which one
patient_data <- read_excel("data/patient_records.xlsx", sheet = "2023_data")
```

### Tab-Separated Values

```r
# Step 1: Identify → .tsv file
# Step 2: Match function → read_tsv()

library(readr)
survey_results <- read_tsv("data/survey_responses.tsv")
```

### SAS Data Files

```r
# Step 1: Identify → .sas7bdat file
# Step 2: Match function → read_sas()

library(haven)
clinical_trial <- read_sas("data/clinical_study.sas7bdat")
```

### R Data Files (Previously Saved)

```r
# Step 1: Identify → .rda file
# Step 2: Match function → load()

load("data/cleaned_dataset.rda") # Loads object with original name

# Alternative for .rds files (written by saveRDS()): assign to a new name
my_data <- readRDS("data/cleaned_dataset.rds")
```

## Writing Data: From R Environment to Files

### CSV Files (Universal Format)

```r
# Step 1: Choose output → .csv format
# Step 2: Match function → write_csv()

library(readr)
write_csv(my_cleaned_data, "output/final_analysis.csv")

# Base R alternative
write.csv(my_cleaned_data, "output/final_analysis.csv", row.names = FALSE)
```

### Excel Files

```r
# Step 1: Choose output → .xlsx format
# Step 2: Match function → write_xlsx()

library(writexl)
write_xlsx(patient_summary, "output/patient_report.xlsx")

# For multiple sheets
library(openxlsx)
dataset_list <- list("Summary" = summary_data, "Details" = detail_data)
write.xlsx(dataset_list, "output/complete_report.xlsx")
```

### R Data Format (Recommended for Future Use)

```r
# Step 1: Choose output → .rda format
# Step 2: Match function → save()

# Save single object
save(cleaned_data, file = "data/cleaned_data.rda")

# Save multiple objects
save(data1, data2, results, file = "data/analysis_workspace.rda")

# Alternative: saveRDS for single objects
saveRDS(cleaned_data, "data/cleaned_data.rds")
```

## Direct Data Retrieval Examples

Skip the download step entirely with specialized packages:

```r
# NHANES data (no file download needed)
library(nhanesA)
demo_data <- nhanes('DEMO_J') # Gets 2017-2018 demographic data

# World Bank data
library(WDI)
gdp_data <- WDI(country = "all", indicator = "NY.GDP.MKTP.CD",
                start = 2010, end = 2020)
```
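
The two R-native options used in the R data examples above behave differently when reading back: `load()` restores objects under the names they were saved with, while `readRDS()` returns a single object that you assign to any name you like. A small base R sketch with made-up data and temporary file paths:

```r
cleaned_data <- data.frame(id = 1:3, score = c(88, 92, 79))  # made-up data

rda_path <- tempfile(fileext = ".rda")
rds_path <- tempfile(fileext = ".rds")
save(cleaned_data, file = rda_path)   # stores the object together with its name
saveRDS(cleaned_data, rds_path)       # stores the object only
rm(cleaned_data)

restored_names <- load(rda_path)      # load() returns the names it restored
restored_names                        # "cleaned_data"

my_data <- readRDS(rds_path)          # readRDS() returns the object itself
identical(my_data, cleaned_data)      # TRUE
```
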