biostats-part-1-study-design-types-of-data

2025-04-09 Wednesday

#r-programming #rstudio #git #github

[[maps-of-content]]

# Understanding Data in Public Health Research

> [!success]- Concept Sketch: [[]]
> ![[]]

> [!abstract]- Quick Review
>
> **Core Essence**: Public health research uses specialized data vocabulary to tell the story of population health, with variables serving distinct roles in research questions and data collected through various study designs.
>
> **Key Concepts**:
>
> - Three variable types: Exposure (beginning), Outcome (end), Supporting (middle)
> - Data classification: Qualitative (categorical) vs. Quantitative (numerical)
> - Study designs determine when exposure and outcome data are collected
>
> **Must Remember**:
>
> - A variable's role changes based on the research question being asked
> - Study design influences our ability to establish causality
> - Dataset structure: rows = individuals, columns = variables
>
> **Critical Relationships**:
>
> - Exposure variables potentially influence outcome variables
> - Study design determines temporal relationship between exposure and outcome
> - Data types determine appropriate analysis methods

## Introduction to Public Health Data

Public health researchers rely on data to understand, analyze, and address health issues affecting populations. To effectively work with this data, researchers use a specific vocabulary and framework that helps them formulate questions, organize information, and draw meaningful conclusions about community health.

This specialized approach to data allows public health professionals to tell evidence-based stories about health patterns, risk factors, and interventions. Understanding these fundamentals is essential for anyone seeking to interpret or contribute to public health knowledge.
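
The dataset structure noted in the Quick Review (rows = individuals, columns = variables), together with the point that a variable's role depends on the research question, can be sketched in Python. This is only an illustration under stated assumptions: the column names, the three invented records, and the `roles_for` helper are hypothetical and do not come from any real dataset.

```python
# Hypothetical toy dataset: each dict is one row (one individual),
# each key is one column (one variable). All values are invented.
rows = [
    {"id": 1, "smoker": 1, "age": 24, "birth_weight_g": 2523},
    {"id": 2, "smoker": 0, "age": 31, "birth_weight_g": 3402},
    {"id": 3, "smoker": 1, "age": 19, "birth_weight_g": 2890},
]


def roles_for(exposure, outcome, columns):
    """Label each column with the role it plays for one research question."""
    roles = {}
    for col in columns:
        if col == exposure:
            roles[col] = "exposure"
        elif col == outcome:
            roles[col] = "outcome"
        else:
            roles[col] = "supporting"
    return roles


# Every column except the row identifier is a study variable.
columns = [c for c in rows[0] if c != "id"]

# Question 1: does smoking (exposure) affect birth weight (outcome)?
q1 = roles_for("smoker", "birth_weight_g", columns)

# Question 2 (hypothetical): does age (exposure) affect smoking status (outcome)?
q2 = roles_for("age", "smoker", columns)

print(q1)
print(q2)
```

The same `smoker` column is labeled as the exposure for the first question and as the outcome for the second, which is the role shift the Quick Review describes.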
## The Building Blocks: Variable Types

### The Three Core Variables

> [!visual]- Sketch Idea
>
> A horizontal timeline/arrow with three elements:
>
> - Left: "Exposure Variables" (beginning) with an icon representing cause/origin
> - Middle: "Supporting Variables" with connecting elements or context icons
> - Right: "Outcome Variables" (end) with a result/effect icon

**Outcome Variables** represent the end of the public health story: what ultimately happened to individuals in the study. These might include:

- Development of a disease
- Recovery from illness
- Birth weight of a child
- Health behaviors adopted

**Exposure Variables** mark the beginning of the story: what was done to individuals or what they initially experienced. Examples include:

- Treatment received
- Environmental exposure
- Behavioral factor (smoking, exercise)
- Policy implementation

**Supporting Variables** fill in the middle of the story, providing context and additional information about study participants. These typically include:

- Demographics (age, gender, race)
- Socioeconomic factors
- Pre-existing health conditions
- Other experiences or characteristics

### Variable Roles Are Fluid

An important concept to understand is that variables can change roles depending on the research question being asked. The same variable might serve as:

- An exposure in one study
- An outcome in another
- A supporting variable in yet another

> [!note] Example of Shifting Variable Roles
> Consider "smoking status":
>
> - In a lung cancer study: Smoking is the **exposure** variable
> - In a stress management study: Smoking might be the **outcome** variable (Does stress reduction decrease smoking?)
> - In a substance abuse/depression study: Smoking serves as a **supporting** variable providing context

## Types of Public Health Data

### Qualitative vs. Quantitative Variables

All variables in public health research can be classified into two broad categories:

```mermaid
graph TD
    A[Variable Types] --> B[Qualitative Variables]
    A --> C[Quantitative Variables]
    B --> D[Ordinal: Categories with natural order]
    B --> E[Nominal: Categories without natural order]
    C --> F[Continuous: Any value within range]
    C --> G[Count: Non-negative integers]
    C --> H[Binary: Only two possible values]
```

#### Qualitative Variables (Categorical)

**Qualitative variables** are best expressed with words rather than numbers:

- **Ordinal variables** have categories with a natural order
    - Example: Strongly disagree → Disagree → Neutral → Agree → Strongly agree
    - Numbers can be assigned to reflect this order (1, 2, 3, 4, 5)
- **Nominal variables** have categories with no natural order
    - Example: Race, gender, political affiliation, blood type
    - Any numbers assigned are for identification only, not for mathematical operations

#### Quantitative Variables (Numerical)

**Quantitative variables** are best expressed with numeric values:

- **Continuous variables** can take any value within a range
    - Example: Age, blood pressure, height, weight, temperature
    - Can include decimals and infinite possible values
- **Count variables** represent the number of events or items
    - Example: Number of cigarettes smoked, children in household, doctor visits
    - Only non-negative integers (0, 1, 2, 3...)
- **Binary variables** have only two possible values, typically 0 and 1
    - Example: Presence of disease (Yes=1, No=0), mortality (Dead=1, Alive=0)
    - Can be viewed as either qualitative or quantitative

> [!tip] Selecting Analysis Methods
> Understanding variable types is essential for choosing appropriate statistical methods:
>
> - Means and standard deviations for continuous variables
> - Medians and ranges for ordinal variables
> - Frequencies and proportions for nominal variables
> - Different visualization techniques based on variable type

## Public Health Datasets in Practice

### Dataset Structure

A typical public health dataset is organized with:

- Each **row** representing one individual
- Each **column** representing one variable
- A **data dictionary** explaining what each variable represents

### Examples of Public Health Datasets

#### Low Birth Weight Study

- **Purpose**: Examining factors affecting infant birth weight
- **Size**: 189 women, 11 variables (collected in 1986)
- **Example Research Question**: Does maternal smoking (exposure) affect low birth weight (outcome)?
- **Supporting Variables**: Mother's age, race, prenatal care

#### Medical Expenditure Panel Survey (MEPS)

- **Purpose**: Tracking healthcare usage, costs, and payment methods
- **Size**: Large-scale survey (subset used contains 9,000+ adults, 23 variables)
- **Example Research Question**: Does regular physical exercise (exposure) affect BMI (outcome)?
- **Note**: Uses numerical codes for categorical variables

#### General Social Survey (GSS)

- **Purpose**: Tracking attitudes, behaviors, and attributes in US adults
- **Size**: Ongoing biennial survey (subset from 1993 used)
- **Example Research Question**: Does number of siblings (exposure) affect number of children (outcome)?
- **Data Challenge**: Categories represented by words increase risk of data entry errors

> [!warning] Missing Data Representation
> Datasets use different conventions to mark missing values:
>
> - Dots (.)
> - "NA" text
> - Implausible values (e.g., -99 for age)
>
> Always check the data dictionary to understand how missing data is coded!

## Study Designs for Data Collection

### The Timing Question: When is Data Collected?

The timing of data collection is critical in public health research and determines the study design. Different designs offer varying levels of evidence for causal relationships.

```mermaid
graph LR
    A[Study Designs] --> B[Prospective Studies]
    A --> C[Retrospective Studies]
    A --> D[Cross-Sectional Studies]
    B --> E[Randomized Controlled Trial]
    B --> F[Prospective Cohort Study]
    C --> G[Case-Control Study]
    style E fill:#c6ecc6
    style G fill:#f9d6d6
    style D fill:#d6e8f9
```

### Prospective Studies: Exposure First

In prospective studies, researchers:

1. Collect exposure data first
2. Follow individuals over time
3. Observe who develops the outcome

**Randomized Controlled Trial (RCT)**

- Gold standard for establishing causality
- Researchers control and randomly assign the exposure
- Example: Randomly assigning participants to exercise program vs. control group, then tracking heart disease

**Prospective Cohort Study**

- Used when randomization is unethical or not feasible
- Groups with and without exposure are followed over time
- Example: Following smokers and non-smokers to track lung cancer development

### Retrospective Studies: Outcome First

In retrospective studies, researchers:

1. Identify individuals based on outcome status
2. Look backward in time to determine exposure history

**Case-Control Study**

- Groups are formed based on presence (cases) or absence (controls) of outcome
- Past exposure is then compared between groups
- Example: Comparing past cell phone use in brain cancer patients vs. healthy controls

### Cross-Sectional Studies: Simultaneous Collection

In cross-sectional studies:

- Data on both exposure and outcome collected at the same time
- Cannot establish temporal relationship or causality
- Useful for examining associations and prevalence
- Example: Survey measuring both current smoking habits and current respiratory symptoms

> [!visual]- Visual Note Idea
>
> **Core Concept**: Study Design Timeline
>
> **Full Description**: Different study designs collect data in different temporal sequences, affecting their ability to establish causality.
>
> **Memorable Description**: "RCTs predict the future, Case-Controls investigate the past, Cross-Sectionals capture the present"
>
> **Visual Representation**: Draw three timeline arrows:
>
> 1. Prospective (RCT): Arrow pointing right with "Exposure" at start, "Time" in middle, "Outcome" at end
> 2. Retrospective: Arrow pointing left with "Outcome" at start, "Time" in middle, "Exposure" at end
> 3. Cross-Sectional: Vertical line with "Exposure" and "Outcome" side by side

## Practical Application in Public Health Research

> [!example]- Case Application: Maternal Smoking and Birth Weight
>
> **Research Question**: Does maternal smoking during pregnancy affect infant birth weight?
>
> **Variables**:
>
> - **Exposure**: Maternal smoking during pregnancy (Binary: Yes/No)
> - **Outcome**: Infant birth weight (Continuous: grams)
> - **Supporting Variables**: Mother's age (Continuous), Prenatal care (Ordinal), Race (Nominal)
>
> **Possible Study Designs**:
>
> 1. **Cross-Sectional Study**:
>     - Survey mothers at delivery about smoking during pregnancy
>     - Measure infant birth weight at the same time
>     - Can show association but not definitively prove causation
> 2. **Prospective Cohort Study**:
>     - Identify pregnant women who smoke and don't smoke
>     - Follow them through pregnancy
>     - Compare birth weights between groups
>     - Stronger evidence for causality than cross-sectional
> 3. **Case-Control Study**:
>     - Identify infants with low birth weight (cases) and normal birth weight (controls)
>     - Compare maternal smoking rates during pregnancy
>     - Efficient for rare outcomes
>
> **Note**: A randomized controlled trial would be unethical in this case as researchers cannot assign mothers to smoke.
>
> **Data Analysis Approach**:
>
> - Compare mean birth weights between smoking/non-smoking groups
> - Create visualization showing distribution of birth weights by smoking status
> - Control for supporting variables in statistical analysis

## Summary: The Public Health Data Story

Public health data tells stories about population health through carefully structured research:

1. **Variables play specific roles** in these stories, as exposures (beginning), outcomes (end), or supporting elements (middle), but these roles can shift based on the research question.
2. **Study designs determine the temporal sequence** of data collection, which affects our ability to establish causality between exposures and outcomes.
3. **Data types guide analysis approaches**, with qualitative (categorical) and quantitative (numerical) variables requiring different statistical methods.
4. **Existing datasets provide valuable resources** for answering public health questions, but researchers must understand their structure, variables, and limitations.

> [!tip] Most Important Takeaway
> Public health research is fundamentally about telling data-driven stories of population health, where variables play specific roles that can change depending on the research question, and the study design determines how convincingly we can connect the beginning (exposure) to the end (outcome) of the story.

> [!visual]- Sketch Idea
>
> **Core Concept**: The Public Health Data Story
>
> **Full Description**: Public health data tells a story with exposure variables as the beginning, supporting variables as the middle context, and outcome variables as the end result, all influenced by study design.
>
> **Memorable Description**: "Public Health Stories: Exposures start the plot, Supporting variables add context, Outcomes reveal the ending"
>
> **Visual Representation**: Create a book or storybook with three chapters:
>
> - Chapter 1: "Exposure" (what happened first)
> - Chapter 2: "Context" (supporting variables and study design)
> - Chapter 3: "Outcome" (what happened as a result)
>
> Add small icons for different variable types and study designs in the margins of each chapter.

---

Reference:

- Biostats
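
As a closing illustration of two points from these notes, the missing-data warning and the tip on matching summaries to variable types, here is a minimal Python sketch using only the standard library. Everything in it is an assumption made for illustration: the `MISSING_CODES` set, the `clean` and `summarize` helpers, and the small invented value lists are hypothetical. A real analysis should take its missing-data codes from the dataset's data dictionary.

```python
from collections import Counter
from statistics import mean, median, stdev

# Hypothetical missing-data codes; real codes must come from the data dictionary.
MISSING_CODES = {".", "NA", -99}


def clean(values):
    """Replace known missing-data codes with None."""
    return [None if v in MISSING_CODES else v for v in values]


def summarize(values, kind):
    """Return a summary suited to the variable type, ignoring missing values."""
    present = [v for v in clean(values) if v is not None]
    if kind == "continuous":
        return {"mean": mean(present), "sd": stdev(present)}
    if kind == "ordinal":
        # Assumes ordered categories are stored as numeric codes (e.g. 1-5).
        return {"median": median(present), "range": (min(present), max(present))}
    if kind == "nominal":
        counts = Counter(present)
        return {category: n / len(present) for category, n in counts.items()}
    raise ValueError(f"unknown variable type: {kind}")


# Invented example values, each with one missing entry coded differently.
ages = [24, 31, -99, 19, 28]          # continuous
agreement = [1, 3, 4, "NA", 5, 2]     # ordinal, coded 1-5
blood_type = ["A", "O", "O", ".", "B"]  # nominal

age_summary = summarize(ages, "continuous")
agreement_summary = summarize(agreement, "ordinal")
blood_summary = summarize(blood_type, "nominal")

print(age_summary)
print(agreement_summary)
print(blood_summary)
```

Recoding missing values before summarizing matters here: left in place, the -99 would be averaged in as if it were a real age.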