
1. Preprocessing Operations
1.1. Cleaning Data Frames
1.1.1. Planning
1.1.1.1. See data first
1.1.1.2. Keep columns needed
1.1.1.3. Best guess on source of messiness
1.1.1.3.1. Human
1.1.1.3.2. Computer
1.1.2. Pay attention
1.1.2.1. Identifier uniqueness
1.1.2.1.1. simple
1.1.2.1.2. composite
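A minimal sketch of checking identifier uniqueness for both a simple and a composite key. The data frame and its column names (`id`, `country`, `year`) are made up for illustration.

```python
import pandas as pd

# Hypothetical data: 'id' plays the simple key,
# ('country', 'year') plays the composite key.
df = pd.DataFrame({
    "id": [1, 2, 2, 4],
    "country": ["Peru", "Peru", "Chile", "Chile"],
    "year": [2020, 2021, 2020, 2020],
})

# Simple key: True only if no value repeats.
simple_ok = df["id"].is_unique

# Composite key: keep=False flags every row of a repeated pair,
# so all offending rows can be inspected together.
composite_dupes = df[df.duplicated(subset=["country", "year"], keep=False)]

print(simple_ok)
print(composite_dupes)
```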
1.1.2.2. Column names
1.1.2.2.1. many characters?
1.1.2.2.2. weird symbols
1.1.2.2.3. spaces between words
1.1.2.2.4. leading and trailing spaces
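One possible way to deal with the column-name problems listed above (symbols, inner spaces, leading and trailing spaces), sketched with pandas string methods. The column names are invented, and the naming convention (lowercase with underscores) is an assumption, not a rule.

```python
import pandas as pd

# Hypothetical messy headers.
df = pd.DataFrame(columns=["  Country Name ", "GDP (US$)", "Human Development Index  "])

df.columns = (
    df.columns
      .str.strip()                                     # leading and trailing spaces
      .str.replace(r"[^0-9a-zA-Z]+", "_", regex=True)  # symbols and inner spaces
      .str.strip("_")                                  # underscores left at the edges
      .str.lower()
)

print(list(df.columns))
```

Very long names can then be shortened by hand with `df.rename(columns={...})`.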
1.1.2.3. Cell values
1.1.2.3.1. text
1.1.2.3.2. categories
1.1.2.3.3. numbers
1.1.3. operations
1.1.3.1. subsetting
1.1.3.2. replace
1.1.3.2.1. brute force
1.1.3.2.2. some programming
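The two replacement strategies can be contrasted in a short sketch. The values below are invented; "brute force" is read here as listing every bad value by hand, and "some programming" as describing the good values with a pattern.

```python
import numpy as np
import pandas as pd

# Hypothetical column with percentages typed in several ways.
s = pd.Series(["12%", "n/a", "7 %", "--", "30%"])

# Brute force: enumerate each bad value explicitly.
brute = s.replace({"n/a": np.nan, "--": np.nan})

# Some programming: extract the numeric part with a regular expression;
# whatever does not match becomes missing.
programmatic = pd.to_numeric(
    s.str.extract(r"(\d+(?:\.\d+)?)", expand=False),
    errors="coerce",
)

print(brute.tolist())
print(programmatic.tolist())
```

The brute-force route is fine for a handful of known values; the pattern route scales better but should be checked against the values it silently drops.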
1.2. Formatting
1.2.1. Text
1.2.1.1. mixture of capitalization
1.2.2. Categories
1.2.2.1. Use levels correctly
1.2.2.2. Verify range of ordinal levels
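A small sketch of formatting an ordinal category: first verify that the observed values fall inside the expected levels, then declare the order. The levels and the typo `hgh` are invented for illustration.

```python
import pandas as pd

# Hypothetical ordinal variable with one mistyped level.
s = pd.Series(["low", "high", "medium", "hgh", "low"])
levels = ["low", "medium", "high"]

# Verify the range: anything outside the expected levels needs attention.
unexpected = set(s.unique()) - set(levels)

# Fix the known typo, then declare the levels in their proper order.
fixed = s.replace({"hgh": "high"})
ordinal = pd.Series(pd.Categorical(fixed, categories=levels, ordered=True))

print(unexpected)
print(ordinal.max())
print((ordinal > "low").sum())
```

Once the order is declared, comparisons and `min`/`max` follow the levels rather than the alphabet.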
1.2.3. Numbers
1.2.3.1. coerce?
1.2.3.2. ignore?
1.2.3.3. raise?
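The three options above correspond to the `errors` argument of `pd.to_numeric`. The sketch below shows `coerce` and `raise` on invented values; `ignore` (return the input unchanged) appears to be deprecated in recent pandas releases, so it is left out here.

```python
import pandas as pd

# Hypothetical column read as text, with one non-numeric cell.
raw = pd.Series(["10", "3.5", "unknown", "7"])

# coerce: unparseable cells become NaN, the rest become numbers.
coerced = pd.to_numeric(raw, errors="coerce")

# raise: stop at the first unparseable cell, useful when no bad value is expected.
try:
    pd.to_numeric(raw, errors="raise")
    raised = False
except ValueError:
    raised = True

print(coerced.tolist())
print(raised)
```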
1.3. Transformation
1.3.1. Aggregating
1.3.2. Rescaling
1.3.2.1. Normalization (min-max)
1.3.2.2. Standardization
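Both rescaling options can be written directly with pandas arithmetic. This is a sketch on invented numbers; the column names are placeholders, and the population standard deviation (`ddof=0`) is an assumption.

```python
import pandas as pd

# Hypothetical numeric columns on very different scales.
df = pd.DataFrame({"hdi": [0.5, 0.7, 0.9], "pop": [10.0, 20.0, 60.0]})

# Normalization (min-max): every column ends up in [0, 1].
minmax = (df - df.min()) / (df.max() - df.min())

# Standardization: every column ends up with mean 0 and standard deviation 1.
zscore = (df - df.mean()) / df.std(ddof=0)

print(minmax)
print(zscore)
```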
1.4. Reshaping
1.4.1. Wide to Long
1.4.1.1. avoids missing cells
1.4.1.2. easier for complex plots and some longitudinal methods
1.4.2. Long to Wide
1.4.2.1. easier to compute stats for unit of analysis
1.4.3. These processes help discover issues
1.4.3.1. key repetitions
1.4.3.2. key mistyping
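A sketch of both reshaping directions with `melt` and `pivot`, including how a repeated key surfaces as an error when going back to wide. The table and its values are invented.

```python
import pandas as pd

# Hypothetical wide table: one row per country, one column per year.
wide = pd.DataFrame({
    "country": ["Peru", "Chile"],
    "2020": [0.76, 0.85],
    "2021": [0.77, None],
})

# Wide to long: one row per (country, year); missing cells can simply be dropped.
long = (
    wide.melt(id_vars="country", var_name="year", value_name="hdi")
        .dropna(subset=["hdi"])
)

# Long to wide: one row per unit of analysis again.
back = long.pivot(index="country", columns="year", values="hdi")

# A repeated (country, year) key makes pivot fail, which exposes the issue.
with_dupe = pd.concat([long, long.iloc[[0]]])
try:
    with_dupe.pivot(index="country", columns="year", values="hdi")
    pivot_failed = False
except ValueError:
    pivot_failed = True

print(long)
print(back)
print(pivot_failed)
```

A mistyped key behaves differently: it does not raise, it shows up as an unexpected extra row or column in the wide table.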
1.5. Exporting
1.5.1. keep working in Python?
1.5.2. moving to R?
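One reasonable answer to the two questions above, sketched with a temporary folder: pickle keeps pandas dtypes when the work continues in Python, while CSV is a plain format that R can read with `read.csv`. File names and data are placeholders.

```python
import os
import tempfile

import pandas as pd

# Hypothetical clean table.
df = pd.DataFrame({"country": ["Peru", "Chile"], "hdi": [0.76, 0.85]})
out_dir = tempfile.mkdtemp()  # stand-in for a real output folder

# Staying in Python: pickle preserves dtypes (categories, dates, etc.).
pkl_path = os.path.join(out_dir, "clean.pkl")
df.to_pickle(pkl_path)
same = pd.read_pickle(pkl_path)

# Moving to R: CSV is portable, but dtypes must be declared again after reading.
csv_path = os.path.join(out_dir, "clean.csv")
df.to_csv(csv_path, index=False)

print(same.equals(df))
```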
2. EXAMPLE
2.1. Collect Data
2.1.1. uploading file
2.1.1.1. path
2.1.1.2. call
2.1.2. scraping table
2.1.2.1. path
2.1.2.2. call
2.2. See data first
2.3. clean data
2.3.1. bad
2.3.1.1. better_1
2.3.1.1.1. better_2
2.3.1.1.2. key code:
2.3.1.2. key code:
2.3.1.2.1. a)
2.3.2. key code:
2.3.2.1. a)
2.4. Format
2.4.1. explore
2.4.1.1. formatted
2.5. Merge
2.5.1. Tables
2.5.1.1. hdi
2.5.1.2. demo
2.5.2. identify KEY(s)
2.5.3. basic merge
2.5.3.1. only coincidences in both keys
2.5.4. fuzzy merge
2.5.4.1. find keys written differently
2.5.4.1.1. what was not matched?
2.5.4.1.2. go fuzzy
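The merge steps above can be sketched as follows. The two tables reuse the names `hdi` and `demo` from the outline, but their contents are invented, and the fuzzy step uses the standard library's `difflib` as a stand-in for whatever fuzzy-matching tool the lesson actually uses. The cutoff of 0.75 is an arbitrary assumption.

```python
import difflib

import pandas as pd

# Hypothetical tables whose key ('country') is written differently in places.
hdi = pd.DataFrame({"country": ["Peru", "Viet Nam", "Cote d'Ivoire"],
                    "hdi": [1, 2, 3]})
demo = pd.DataFrame({"country": ["Peru", "Vietnam", "Côte d'Ivoire"],
                     "score": [10, 20, 30]})

# Basic merge: only keys that coincide in both tables survive.
basic = hdi.merge(demo, on="country", how="inner")

# What was not matched? An outer merge with indicator=True labels each row.
outer = hdi.merge(demo, on="country", how="outer", indicator=True)
left_only = outer.loc[outer["_merge"] == "left_only", "country"].tolist()
right_only = outer.loc[outer["_merge"] == "right_only", "country"].tolist()

# Go fuzzy: propose the closest left key for each unmatched right key.
fixes = {}
for name in right_only:
    match = difflib.get_close_matches(name, left_only, n=1, cutoff=0.75)
    if match:
        fixes[name] = match[0]

# Review 'fixes' by eye before applying it, then merge again.
demo_fixed = demo.assign(country=demo["country"].replace(fixes))
fuzzy = hdi.merge(demo_fixed, on="country", how="inner")

print(fixes)
print(len(basic), len(fuzzy))
```

Fuzzy matches are suggestions, not facts, so the proposed pairs are worth checking manually before they overwrite any key.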
2.6. Scaling
2.6.1. currently
2.6.2. scaling code
2.6.3. scaled data
2.7. concatenating DFs
2.7.1. code
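A minimal sketch of concatenating data frames that share the same columns, using `pd.concat`. The two small tables are invented.

```python
import pandas as pd

# Hypothetical tables with identical columns.
a = pd.DataFrame({"country": ["Peru"], "hdi": [0.76]})
b = pd.DataFrame({"country": ["Chile"], "hdi": [0.85]})

# Stack rows; ignore_index=True rebuilds a clean 0..n-1 index.
stacked = pd.concat([a, b], ignore_index=True)

print(stacked)
```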
2.8. exporting
2.8.1. for Python
2.8.2. for R