<sub>2025-06-01</sub> <sub>#data-science #sas-programming </sub> <sup>[[maps-of-content|🌐 Maps of Content: All Notes]] </sup> <sup>Series: [[sas-programming-1-essentials|SAS Programming 1: Essentials]]</sup> <sup>Topic: [[sas-programming-1-essentials#Lesson 2 Accessing Data|Lesson 2: Accessing Data]]</sup>

# Understanding SAS Data

> [!abstract]- Overview
>
> **Core Essence**: SAS organizes data into structured tables with defined attributes, making it possible to access, analyze, and report on information systematically.
>
> **Key Concepts**:
>
> - **Data Categories**: Structured (defined columns) vs. unstructured (text strings)
> - **Table Anatomy**: Descriptor portion (metadata) + data portion (actual values)
> - **Column Requirements**: Every column needs a name, a type, and a length
>
> **Critical Connections**: Column attributes determine how SAS reads and processes your data, while the descriptor portion acts as a blueprint that makes structured analysis possible.
>
> **Must Remember**: Use PROC CONTENTS to examine any table's structure before working with it. Understanding your data's attributes prevents errors and guides analysis decisions.

> [!code]- Syntax Reference
>
> |Syntax|Purpose|Example|
> |---|---|---|
> |`PROC CONTENTS DATA=dataset;`|Basic table examination|`PROC CONTENTS DATA=work.sales;`|
> |`PROC CONTENTS DATA="filepath";`|External file examination|`PROC CONTENTS DATA="/data/class.sas7bdat";`|
> |`RUN;`|Execute procedure|`RUN;`|

## The Foundation: Why Data Structure Matters

Think of data like a library. **Structured data** is like books organized on labeled shelves with a card catalog: everything has a defined place and searchable attributes. **Unstructured data** is like a pile of loose papers: the information exists, but you need to sort and organize it before you can find what you need.

SAS can work with both types, but understanding the difference shapes how you approach each format.
## Understanding Data Categories

### Structured Data: Ready for Analysis

**Structured data** arrives with **defined columns and attributes** that tell SAS (and other applications) exactly how to read and display values. It's like receiving a spreadsheet where each column has a clear header and consistent data type.

**Examples include:**

- SAS tables (.sas7bdat files)
- Microsoft Access tables
- Database tables (Oracle, Teradata, Hadoop)
- Excel files with defined structure

> [!tip] SAS Advantage
> SAS uses specialized "engines" to automatically understand and read different structured data formats, so you don't need to manually specify how to interpret each type.

### Unstructured Data: Needs Preparation

**Unstructured data** lacks defined columns. Even if it appears columnar (like comma-separated values), your computer sees it as one long text string without inherent meaning.

**Examples include:**

- Text files (.txt)
- Comma-delimited files (.csv)
- JSON files
- Web logs

> [!note] The Import Process
> Unstructured data must be **imported into SAS** before analysis. During import, you're essentially teaching SAS how to interpret the text string by defining where columns begin and end, what data types to expect, and how to handle special characters.

## Anatomy of a SAS Table

A SAS table (file extension `.sas7bdat`) functions like a two-part blueprint: one section describes the structure, another holds the information.

### The Descriptor Portion: Your Data's Metadata

This contains the **table's vital statistics**:

- Table name and location
- Number of rows and columns
- Creation and modification timestamps
- **Column definitions** (names, types, lengths)

Think of this as the architectural plans: it tells you what the structure looks like without showing the actual contents.

### The Data Portion: Your Actual Information

This contains the **data values themselves**, organized in the columns defined by the descriptor portion.
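
To see the two portions side by side, here is a minimal sketch. It assumes the `sashelp.class` sample table that ships with most SAS installations is available in your environment. `PROC PRINT` and the `OBS=` data set option are not part of this lesson and appear here only for illustration.

```sas
/* Descriptor portion: report column names, types, lengths, and row count */
PROC CONTENTS DATA=sashelp.class;
RUN;

/* Data portion: list the first five rows of stored values */
/* (sashelp.class is an assumed sample table, not one from this note) */
PROC PRINT DATA=sashelp.class(OBS=5);
RUN;
```

The first step reports metadata only, while the second shows actual values, which mirrors the descriptor/data split described above.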
> [!note] SAS Terminology Translation
> SAS historically used its own terms, but they're equivalent to common data concepts:
>
> - **Table** = Data set
> - **Column** = Variable
> - **Row** = Observation
>
> These terms are interchangeable. This course uses the familiar terms, but you'll see the SAS-specific terms in logs, output, and documentation.

## Column Attributes: The Three Requirements

Every SAS column must have three essential attributes. Think of these as the column's identity card.

### 1. Name: The Column's Identity

**Requirements:**

- **Length**: 1 to 32 characters
- **First character**: Letter or underscore
- **Remaining characters**: Letters, numbers, or underscores
- **Case**: Stored as first created, but can be referenced in any case (names are not case-sensitive)

**Examples:**

```sas
/* Valid names */
Age
_Score
Student_ID
Test1_Results

/* Invalid names */
1stPlace    /* Can't start with a number */
Test-Score  /* Hyphens not allowed */
```

> [!tip] Naming Best Practice
> Even if your environment allows spaces or special characters in names, stick to SAS conventions for simplicity and consistency across different systems.

### 2. Type: How SAS Interprets the Data

SAS recognizes two fundamental column types:

#### Numeric Columns

Can store **only numeric values**:

- Digits 0-9
- Minus sign
- Single decimal point
- E for scientific notation

#### Character Columns

Can store **any text**:

- Letters
- Numbers (as text)
- Special characters
- Blank spaces

#### Special Case: SAS Dates

**SAS dates are numeric values** representing the number of days between **January 1, 1960** and a specific date. Dates before January 1, 1960 are stored as negative numbers.

```text
January 1, 1960   =  0
January 2, 1960   =  1
December 31, 1959 = -1
```

> [!note] Why This Matters
> Storing dates as numbers enables mathematical calculations (finding differences between dates) and logical sorting, while formats make them display as recognizable dates.

### 3. Length: Storage Space Allocation

**Length** defines the number of bytes allocated to store column values.
#### Numeric Columns

- **Default**: 8 bytes
- **Capacity**: About 15 to 16 significant digits
- **Consistency**: By default, every numeric column uses the same 8-byte storage

#### Character Columns

- **Range**: 1 to 32,767 bytes
- **Rule**: One byte per character in single-byte encodings
- **Strategy**: Set length to accommodate the longest expected value

**Example:**

```text
Country_Code: Length 2  (stores "US", "CA", "UK")
Country_Name: Length 20 (stores "United States", "Canada", "United Kingdom")
```

## Examining Table Structure with PROC CONTENTS

The **PROC CONTENTS** procedure creates a report showing the descriptor portion of any table. It is your window into the data's structure before analysis.

### Basic Syntax

```sas
PROC CONTENTS DATA=table-name;
RUN;
```

### Reading the Output

The PROC CONTENTS report includes several sections, but focus on the **"Alphabetic List of Variables and Attributes"**:

|#|Variable|Type|Len|
|---|---|---|---|
|3|Age|Num|8|
|6|Birthdate|Num|8|
|4|Height|Num|8|
|1|Name|Char|8|
|2|Sex|Char|1|
|5|Weight|Num|8|

**Reading this table:**

- **#**: Column position in the table (the list itself is sorted alphabetically by name)
- **Variable**: Column name
- **Type**: Num (numeric) or Char (character)
- **Len**: Storage length in bytes

> [!tip] Analysis Strategy
> Always run PROC CONTENTS on unfamiliar data before beginning analysis. Understanding column types and lengths prevents processing errors and guides your analytical approach.

## Practical Application: Evaluating a New Dataset

> [!example]- Example Scenario
> When you encounter a new SAS table, follow this systematic approach:
>
> 1. **Run PROC CONTENTS** to understand structure
> 2. **Identify column types** to plan appropriate analyses
> 3. **Check character lengths** to anticipate potential truncation issues
> 4. **Note date columns** for special formatting needs
> 5. **Review total rows/columns** for scope understanding
>
> **Example workflow:**
>
> ```sas
> /* Step 1: Examine the structure */
> PROC CONTENTS DATA=mydata.customer_info;
> RUN;
>
> /* Step 2: Based on PROC CONTENTS output, plan your analysis */
> /* If you see Date_Joined (Num, 8), it is probably a SAS date; */
> /* a date format in the Format column would confirm it */
> /* If you see Customer_ID (Char, 10), you know it's text */
> ```

## Connecting the Concepts

**Column attributes determine processing capabilities**: arithmetic is meant for numeric columns (if you use a character column in a calculation, SAS has to attempt an automatic conversion and writes a note to the log), and a value longer than a character column's length is truncated.

The **descriptor portion acts as a contract** between you and SAS: it defines what each column can hold and how SAS should interpret the values. **PROC CONTENTS reveals this contract**, letting you work confidently with your data.

---

Reference:

- SAS Programming 1: Essentials