unix-for-data-science - Em Royce & Company

2025-04-04 Friday

#r-programming #rstudio #git #github #sketch

[[maps-of-content]]

# Unix for Data Science: Command-Line Essentials

> [!success]- Concept Sketch: [[unix-for-data-science.excalidraw.svg|Concept Sketch]]
> ![[unix-for-data-science.excalidraw.svg]]

> [!abstract]- Quick Review
>
> **Core Essence**: Unix provides a powerful command-line environment for efficiently organizing and automating data science workflows
>
> **Key Concepts**:
>
> - Command Line Interface (CLI) vs. Graphical User Interface (GUI)
> - Hierarchical filesystem structure (root, home, directories)
> - Navigation commands (pwd, cd, ls)
> - File manipulation commands (mv, cp, rm)
> - Project organization using structured directories
>
> **Must Remember**:
>
> - Always use relative paths in your code for portability
> - rm permanently deletes files (no trash recovery)
> - Create a consistent project structure for organization
> - Tab autocomplete saves time and prevents errors
> - "No news is good news" - successful commands often give no output
>
> **Critical Relationships**:
>
> - Working directory ↔ relative path references
> - Parent directory (..) ↔ current directory (.)
> - Command structure: command [options] [arguments]
> - Project organization ↔ reproducibility

> [!code]- Code Reference
>
> |Command/Syntax|Purpose|Example|
> |---|---|---|
> |**Navigation Commands**|||
> |`pwd`|Print working directory|`pwd`|
> |`ls`|List directory contents|`ls -la`|
> |`cd`|Change directory|`cd projects`|
> |`cd ..`|Move up one directory|`cd ..`|
> |`cd ~`|Go to home directory|`cd ~`|
> |**Directory Management**|||
> |`mkdir`|Create directory|`mkdir new_project`|
> |`mkdir -p`|Create nested directories|`mkdir -p dir1/dir2/dir3`|
> |`rmdir`|Remove empty directory|`rmdir empty_dir`|
> |`rm -r`|Remove directory and contents|`rm -r old_project`|
> |**File Operations**|||
> |`mv`|Move or rename files|`mv file.txt ~/projects/`|
> |`cp`|Copy files|`cp data.csv backup.csv`|
> |`cp -r`|Copy directories recursively|`cp -r src backup/src`|
> |`rm`|Remove files|`rm unwanted.txt`|
> |`less`|View file contents|`less data.csv`|
> |`touch`|Create empty file|`touch README.md`|

## Introduction to Unix for Data Science

Unix-like systems are widely used in data science because their command-line tools enable efficient file management, automation, and reproducible workflows. While graphical interfaces (GUIs) are intuitive, the command line interface (CLI) offers speed, precision, and automation capabilities that become increasingly valuable as projects grow in complexity.

This guide introduces the fundamental Unix concepts and commands you'll need to organize data science projects effectively, moving from basic navigation to structured project management.

> [!note] Throughout this guide, commands will be shown in `code format`. When you see something like `pwd`, it means you type "pwd" at the command prompt in your terminal.

## The Unix Filesystem: A Hierarchical Structure

**The Unix filesystem organizes everything in a tree-like hierarchy** starting from a single root directory.
Understanding this structure is essential for effective navigation and file management.

### Key Filesystem Concepts

- **Root directory** (`/`): The top-level directory containing all other directories
- **Home directory** (`~`): Your personal directory where your files are stored
- **Working directory**: The directory you're currently in (context for commands)
- **Parent directory** (`..`): The directory one level up in the hierarchy
- **Current directory** (`.`): A reference to your current location

### Understanding Paths

There are two ways to specify locations in the filesystem:

- **Absolute paths** start from the root (`/`). A leading `~` is shell shorthand that expands to the absolute path of your home directory, so it behaves the same way.
  - Example: `/home/username/projects/data_analysis`
  - Example: `~/projects/data_analysis`
- **Relative paths** start from your current working directory
  - Example: `data_analysis/datasets` (a subdirectory of your current location)
  - Example: `../backup` (a directory in the parent of your current location)

> [!tip] When writing code for data science projects, using relative paths makes your code more portable and reproducible across different systems.

## Basic Navigation Commands

### Finding Your Location: `pwd`

**The first essential command is `pwd` (Print Working Directory),** which tells you where you are in the filesystem.

```bash
pwd
```

This might return something like `/home/username`, showing your current location.

### Listing Directory Contents: `ls`

**To see what files and directories exist in your current location, use `ls` (list):**

```bash
ls
```

Add options for more details:

- `ls -l`: Long format with permissions, size, date
- `ls -a`: Show all files (including hidden ones that start with .)
- `ls -h`: Human-readable file sizes (combine with `-l`, as in `ls -lh`)

### Changing Directories: `cd`

**To move between directories, use `cd` (change directory):**

```bash
cd projects
```

Special navigation shortcuts:

- `cd ~`: Go to home directory
- `cd ..`: Go up one level (to parent directory)
- `cd -`: Go back to previous directory

> [!warning] When you're navigating in the terminal, you don't get a visual breadcrumb trail like in a GUI. Use `pwd` frequently to confirm your location.

## Managing Directories and Files

### Creating Directories: `mkdir`

**To create a new directory, use `mkdir` (make directory):**

```bash
mkdir projects
```

Create nested directories with the `-p` flag:

```bash
mkdir -p projects/murders/data
```

### Removing Directories: `rmdir` and `rm -r`

**To delete an empty directory, use `rmdir`:**

```bash
rmdir old_project
```

For directories containing files, use `rm -r` (remove recursively):

```bash
rm -r old_project
```

> [!warning] `rm` and `rm -r` permanently delete files and directories. There is no trash bin or undo function. Be extremely careful, especially when using wildcards like `*`.

### Moving and Renaming: `mv`

**To move or rename files and directories, use `mv` (move):**

Move a file:

```bash
mv data.csv projects/murders/data/
```

Rename a file:

```bash
mv old_name.txt new_name.txt
```

### Copying Files: `cp`

**To copy files, use `cp` (copy):**

```bash
cp original.csv projects/murders/data/copy.csv
```

Add the `-r` flag to copy directories and their contents:

```bash
cp -r analysis_scripts new_project/
```

### Viewing File Content: `less`

**To quickly examine the contents of a text file, use `less`:**

```bash
less data.csv
```

- Press `q` to quit
- Use arrow keys or `Page Up`/`Page Down` to navigate
- Press `/` followed by a search term to find text

## Organizing Data Science Projects

### Creating a Project Structure

**An organized project structure improves efficiency and reproducibility.** A common approach is:

1. Create a main projects directory: `mkdir ~/projects`
2. Create separate directories for each project: `mkdir ~/projects/murders`
3. Within each project, create subdirectories for different components:
   - `data`: Raw data files
   - `rda`: Processed data
   - `figs`: Figures and visualizations
   - `src`: Source code

All four subdirectories can be created in one command using brace expansion (supported by shells such as bash and zsh):

```bash
mkdir -p ~/projects/murders/{data,rda,figs,src}
```

### Documentation and Reproducibility

**Always include a README file** that explains the project structure and purpose:

```bash
touch ~/projects/murders/README.txt
```

Edit this file to include:

- Project description
- Directory structure explanation
- Data sources and formats
- Instructions for running analyses

> [!tip] A well-documented project structure serves not only collaborators but also "future you" who may return to the project months later.

## Best Practices for Efficiency

### Using Tab Completion

**Tab completion is essential for efficiency and preventing errors.** Press Tab while typing:

- Directory or file names
- Command names
- Command options (where your shell's completion supports them)

### Relative Paths for Reproducibility

**In your analysis code (R, Python, etc.), always use relative paths:**

Instead of:

```r
data <- read.csv("/home/username/projects/murders/data/crime_data.csv")
```

Use:

```r
data <- read.csv("data/crime_data.csv")
```

Because a relative path is resolved against the working directory, this code works on any computer as long as it is run from the project's root directory (here, `murders`), which improves reproducibility.

### Automating Repetitive Tasks

**When you find yourself performing the same sequence of commands repeatedly, consider automation:**

- Create shell scripts for common workflows
- Use wildcards (`*`) to operate on multiple files
- Learn more advanced commands like `grep`, `awk`, and `sed` for text processing

> [!case]- Case Application: Setting Up a New Analysis Project
>
> Imagine you're starting a new analysis of US election data.
> Here's how you would set it up:
>
> ```bash
> # Create the project structure
> mkdir -p ~/projects/elections/{data,rda,figs,src}
>
> # Navigate to the project
> cd ~/projects/elections
>
> # Create a README
> echo "# US Elections Analysis" > README.md
> echo "Analysis of voting patterns from 2000-2020" >> README.md
>
> # Download data to the right location
> cd data
> curl -o election_results.csv https://example.com/data/elections.csv
>
> # Go back to project root
> cd ..
>
> # Create initial analysis script
> echo "# Load and clean election data" > src/01_data_prep.R
> echo "data <- read.csv('data/election_results.csv')" >> src/01_data_prep.R
> ```
>
> Notice how each file is placed in the appropriate directory, and the R script uses a relative path.

## Summary

Unix provides a powerful environment for organizing and managing data science projects through its command-line interface. By understanding the filesystem structure, mastering basic navigation and file manipulation commands, and adopting best practices for project organization, you can create efficient, reproducible workflows.

**The most important takeaway**: Invest time in learning Unix commands and creating structured project directories. This initial investment will save countless hours as your projects grow in complexity and will make your work more reproducible and shareable.

> [!NOTE]- #resources
> - [Codecademy](https://www.codecademy.com/learn/learn-the-command-line)
> - [Quora list of Linux reference books](https://www.quora.com/Which-are-the-best-Unix-Linux-reference-books)

--

Reference:

- Data Science, HarvardX
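
As a closing illustration of the advice under "Automating Repetitive Tasks", the project layout from "Creating a Project Structure" could be wrapped in a small shell function. This is only a sketch: the function name `new_project`, its two arguments, and the choice to write a one-line `README.md` are assumptions made for illustration, not part of the course material. It uses a plain loop rather than brace expansion so that it should also run in minimal POSIX shells.

```bash
# Hypothetical helper: create the data/rda/figs/src layout plus a README.
# Usage: new_project BASE_DIR PROJECT_NAME
# The body runs in a subshell ( ... ) so its variables do not leak out.
new_project() (
    base=${1:-}
    name=${2:-}

    # Refuse to run unless both arguments are given.
    if [ -z "$base" ] || [ -z "$name" ]; then
        echo "usage: new_project BASE_DIR PROJECT_NAME" >&2
        exit 1
    fi

    project="$base/$name"

    # Create each subdirectory; -p also creates missing parents
    # and does not complain if the directory already exists.
    for sub in data rda figs src; do
        mkdir -p "$project/$sub" || exit 1
    done

    # Write a README only if none exists, so reruns never overwrite notes.
    if [ ! -e "$project/README.md" ]; then
        printf '# %s\n' "$name" > "$project/README.md"
    fi

    # Print the project path so the caller can cd into it.
    echo "$project"
)
```

For example, `new_project ~/projects elections` would create `~/projects/elections` with the four subdirectories and print the new path. Running it a second time leaves an existing `README.md` untouched.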