1. Preprocessing Operations
1.1. Cleaning Data Frames
1.1.1. Planning
1.1.1.1. Determine how the data will be obtained
1.1.1.1.1. downloading
1.1.1.1.2. scraping
1.1.1.2. Make a best guess at the source of messiness
1.1.1.2.1. human
1.1.1.2.2. computer
1.1.1.3. Look at the data first
1.1.1.3.1. head
1.1.1.3.2. tail
1.1.1.4. Keep only the columns needed
1.1.1.4.1. missing values may help with filtering
1.1.1.4.2. text patterns in column names may help
1.1.1.5. Keep only the rows needed
1.1.1.5.1. missing values may help with filtering
1.1.1.5.2. the index may need resetting afterwards
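The planning steps above can be sketched with pandas. This is a minimal illustration, not a fixed recipe: the table `raw`, its column names, and the footer text are made up for the example.

```python
import pandas as pd

# Hypothetical raw table: a blank row, a footer note, and an empty helper column.
raw = pd.DataFrame({
    "Country": ["Peru", None, "Chile", "Source: made-up example"],
    "GDP 2020": [1.0, None, 2.0, None],
    "GDP 2021": [1.5, None, 2.5, None],
    "Unnamed: 3": [None, None, None, None],
})

# See the data first: the top and the bottom often reveal stray headers and footers.
print(raw.head(2))
print(raw.tail(2))

# Keep columns needed: drop fully empty columns, then keep names matching a text pattern.
kept_cols = raw.dropna(axis="columns", how="all").filter(regex=r"^(Country|GDP)")

# Keep rows needed: rows with no GDP values at all are blanks or notes.
kept_rows = kept_cols.dropna(subset=["GDP 2020", "GDP 2021"], how="all")

# Dropping rows leaves gaps in the index (0, 2), so reset it.
clean = kept_rows.reset_index(drop=True)
print(clean)
```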
1.1.2. Pay attention
1.1.2.1. Identifier uniqueness
1.1.2.1.1. simple
1.1.2.1.2. composite
1.1.2.2. Column names
1.1.2.2.1. too many characters?
1.1.2.2.2. weird symbols
1.1.2.2.3. leading and trailing spaces
1.1.2.2.4. spaces between words
1.1.2.3. Cell values
1.1.2.3.1. text
1.1.2.3.2. categories
1.1.2.3.3. numbers
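A short sketch of these checks in pandas, assuming a made-up data frame `df`. The renaming rule (strip spaces, drop symbols, join words with underscores) is only one possible convention.

```python
import pandas as pd

# Hypothetical data frame with messy column names and a text column holding numbers.
df = pd.DataFrame({
    " Country Name ": ["Peru", "Peru", "Chile"],
    "Year (%)": [2020, 2021, 2020],
    "Value": ["1.5", "n/a", "2"],
})

# Column names: trim spaces, remove symbols, replace inner spaces with underscores.
df.columns = (
    df.columns.str.strip()
    .str.replace(r"[^\w\s]", "", regex=True)
    .str.strip()
    .str.replace(r"\s+", "_", regex=True)
)

# Identifier uniqueness: a simple key (one column) versus a composite key (several columns).
simple_is_unique = df["Country_Name"].is_unique
composite_is_unique = not df.duplicated(subset=["Country_Name", "Year"]).any()

# Cell values: inspect types and distinct values before formatting anything.
print(df.dtypes)
print(df["Value"].unique())
```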
1.1.3. Saving
1.1.3.1. Plain CSV is enough, since no formatting has been done yet
1.1.3.1.1. it can also be saved in Google Sheets
1.2. Formatting
1.2.1. Text
1.2.1.1. mixture of capitalization
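As a small illustration, mixed capitalization can be normalized with pandas string methods. The values are invented, and title case is just one possible target form.

```python
import pandas as pd

# Hypothetical text column with inconsistent capitalization and stray spaces.
names = pd.Series(["  peru", "PERU ", "Peru", "cHiLe"])

# Trim spaces and apply one capitalization rule to every value.
normalized = names.str.strip().str.title()
print(normalized.unique())
```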
1.2.2. Categories
1.2.2.1. Use levels correctly
1.2.2.2. Verify range of ordinal levels
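One way to handle levels in pandas is an ordered categorical. The sketch below assumes three made-up levels and one invalid answer; values outside the declared levels become missing, so it is worth listing them first.

```python
import pandas as pd

# Hypothetical survey answers; "very high" is not an expected level.
answers = pd.Series(["low", "high", "medium", "very high", "low"])
levels = ["low", "medium", "high"]  # assumed order, lowest to highest

# Verify the range: list any values outside the expected levels.
unexpected = sorted(set(answers.dropna()) - set(levels))

# Declare the levels with their order; unexpected values turn into NaN.
ordinal = pd.Series(pd.Categorical(answers, categories=levels, ordered=True))

# With ordered levels, min and max are meaningful.
lowest = ordinal.dropna().min()
highest = ordinal.dropna().max()
print(unexpected, lowest, highest)
```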
1.2.3. Numbers
1.2.3.1. coerce?
1.2.3.2. ignore?
1.2.3.3. raise?
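These three options correspond to the `errors` argument of `pandas.to_numeric`. A minimal sketch with invented values follows; note that `errors="ignore"` is deprecated in recent pandas releases, so here "ignore" is represented by simply leaving the column as text.

```python
import pandas as pd

# Hypothetical column read as text, with one non-numeric cell.
raw = pd.Series(["10", "2.5", "n/a", "7"])

# coerce: invalid cells become NaN and the column becomes numeric.
coerced = pd.to_numeric(raw, errors="coerce")

# raise: stop at the first invalid cell, which forces you to look at it.
try:
    pd.to_numeric(raw, errors="raise")
    raised = False
except ValueError as err:
    raised = True
    print("conversion failed:", err)

# ignore: keep the original text column untouched and decide later.
ignored = raw.copy()
```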
1.2.4. Saving
1.2.4.1. pickle vs csv
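The difference can be seen with a round trip through both formats. This is a sketch with a made-up data frame and temporary file names; pickle keeps the column types, while CSV returns dates and categories as plain text.

```python
import os
import tempfile

import pandas as pd

# Hypothetical formatted data frame: a date column and an ordered category.
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-06-30"]),
    "level": pd.Categorical(["low", "high"], categories=["low", "high"], ordered=True),
})

with tempfile.TemporaryDirectory() as tmp:
    csv_path = os.path.join(tmp, "data.csv")
    pkl_path = os.path.join(tmp, "data.pkl")
    df.to_csv(csv_path, index=False)
    df.to_pickle(pkl_path)
    from_csv = pd.read_csv(csv_path)
    from_pkl = pd.read_pickle(pkl_path)

print(from_csv.dtypes)  # formatting is gone
print(from_pkl.dtypes)  # formatting is kept
```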
1.3. Merging
1.3.1. Key
1.3.1.1. DFs to be merged must share key values in one or more columns
1.3.1.1.1. DF location
1.3.1.1.2. DF uniqueness
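Before merging, it can help to check which key values are shared and whether the key is unique in each DF. The sketch below uses two invented data frames, `x` and `y`; the `validate` argument of `merge` makes pandas raise an error when the expected uniqueness does not hold.

```python
import pandas as pd

# Hypothetical data frames; "C" is repeated in y.
x = pd.DataFrame({"key": ["A", "B", "C"], "x_val": [1, 2, 3]})
y = pd.DataFrame({"key": ["B", "C", "C"], "y_val": [20, 30, 31]})

# Which key values appear in both DFs?
shared = set(x["key"]) & set(y["key"])

# Is the key unique in each DF?
x_unique = x["key"].is_unique
y_unique = y["key"].is_unique

# Ask merge to verify a one-to-one relationship.
try:
    x.merge(y, on="key", validate="one_to_one")
    merge_error = None
except pd.errors.MergeError as err:
    merge_error = err
    print("merge check failed:", err)
```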
1.3.2. Varieties of Merge
1.3.2.1. **Default result**: only matches when the same 'key' is present in both DFs
1.3.2.1.1. AKA: inner join
1.3.2.2. Keeping all rows from...
1.3.2.2.1. **LEFT (X)** DF
1.3.2.2.2. **RIGHT (Y)** DF
1.3.2.2.3. **BOTH: Left and Right (X and Y)** DFs
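The four varieties map onto the `how` argument of `DataFrame.merge`. A minimal sketch with two invented data frames, `x` (left) and `y` (right):

```python
import pandas as pd

# Hypothetical data frames sharing the keys "B" and "C".
x = pd.DataFrame({"key": ["A", "B", "C"], "x_val": [1, 2, 3]})
y = pd.DataFrame({"key": ["B", "C", "D"], "y_val": [20, 30, 40]})

inner = x.merge(y, on="key")               # default: only keys present in both
left = x.merge(y, on="key", how="left")    # all rows from X
right = x.merge(y, on="key", how="right")  # all rows from Y
# All rows from both; indicator=True adds a _merge column with each row's origin.
outer = x.merge(y, on="key", how="outer", indicator=True)

print(outer)
```

The `_merge` column in the outer result is a convenient way to see which keys failed to match on either side.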
1.3.3. Fuzzy Merge
1.3.3.1. Algorithm to match similar keys
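As one possible illustration, the standard library's `difflib.get_close_matches` can pair similar keys before a regular merge. The data, the helper `closest`, and the 0.75 cutoff are assumptions made for this sketch; dedicated fuzzy-matching packages use other similarity measures, and matches should always be reviewed by hand.

```python
import difflib

import pandas as pd

# Hypothetical data frames whose keys are spelled slightly differently.
left = pd.DataFrame({"country": ["Peru", "Brasil", "Chile"], "x": [1, 2, 3]})
right = pd.DataFrame({"country": ["Peru", "Brazil", "Chili"], "y": [10, 20, 30]})

def closest(name, candidates, cutoff=0.75):
    """Return the most similar candidate, or None if nothing passes the cutoff."""
    matches = difflib.get_close_matches(name, candidates, n=1, cutoff=cutoff)
    return matches[0] if matches else None

# Build a helper column holding the best match found in the right DF.
candidates = right["country"].tolist()
left["match"] = [closest(name, candidates) for name in left["country"]]

# Merge on the matched key; suffixes keep both spellings visible for review.
merged = left.merge(right, left_on="match", right_on="country", suffixes=("_x", "_y"))
print(merged)
```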
1.4. Transformation
1.4.1. Aggregating
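A minimal aggregation sketch with pandas `groupby`, using an invented table of cities summarized by region:

```python
import pandas as pd

# Hypothetical city-level data to be aggregated at the region level.
cities = pd.DataFrame({
    "region": ["North", "North", "South"],
    "city": ["A", "B", "C"],
    "population": [100, 300, 50],
})

# One row per region, with a count and a sum.
by_region = cities.groupby("region", as_index=False).agg(
    n_cities=("city", "count"),
    total_population=("population", "sum"),
)
print(by_region)
```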
1.4.2. Reshaping
1.4.2.1. Wide to Long
1.4.2.1.1. you avoid missing cells
1.4.2.1.2. easier for complex plots and some longitudinal methods
1.4.2.2. Long to Wide
1.4.2.2.1. easier to compute stats for unit of analysis
1.4.2.3. These processes help discover issues
1.4.2.3.1. key repetitions
1.4.2.3.2. key mistyping
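Both directions can be sketched with `melt` and `pivot`. The data frame is made up; the last step shows how reshaping exposes a repeated key, because `pivot` refuses to build a wide table from duplicated index and column pairs.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["Peru", "Chile"],
    "2020": [1.0, 2.0],
    "2021": [1.5, None],
})

# Wide to long: one row per country-year; missing cells can simply be dropped.
long = wide.melt(id_vars="country", var_name="year", value_name="gdp")
long = long.dropna(subset=["gdp"])

# Long to wide: one row per unit of analysis again.
back = long.pivot(index="country", columns="year", values="gdp")

# A repeated key makes pivot fail, which reveals the problem.
with_repeat = pd.concat([long, long.iloc[[0]]])
try:
    with_repeat.pivot(index="country", columns="year", values="gdp")
    repetition_found = False
except ValueError as err:
    repetition_found = True
    print("reshape failed:", err)
```

A mistyped key shows up differently: it produces an unexpected extra row or column in the wide table instead of an error.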
1.5. Exporting
1.5.1. From Python to Python
1.5.1.1. pickle
1.5.2. From R to R
1.5.2.1. RDS
1.5.3. From Python to R
1.5.3.1. parquet
1.5.3.2. feather
1.5.4. From R to Python
1.5.4.1. parquet
1.5.4.2. feather
1.5.5. **What about CSV?**
1.5.5.1. you may lose the formatting you applied
1.5.5.1.1. dates
1.5.5.1.2. categorical variables
1.5.5.2. the file can become too large
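If CSV is used anyway, the formatting has to be declared again when the file is read. A minimal sketch, assuming an invented data frame and an in-memory buffer in place of a file:

```python
import io

import pandas as pd

# Hypothetical formatted data frame written to CSV text.
levels = pd.CategoricalDtype(["low", "high"], ordered=True)
df = pd.DataFrame({
    "date": pd.to_datetime(["2021-01-01", "2021-06-30"]),
    "level": pd.Series(["low", "high"], dtype=levels),
})
csv_text = df.to_csv(index=False)

# A plain read returns text columns: the formatting was not stored in the file.
plain = pd.read_csv(io.StringIO(csv_text))

# The formatting must be restated by whoever reads the file.
restored = pd.read_csv(
    io.StringIO(csv_text),
    parse_dates=["date"],
    dtype={"level": levels},
)
print(plain.dtypes)
print(restored.dtypes)
```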