
1. Planning
1.1. Look at the data first
1.1.1. head
1.1.2. tail
1.2. Check **data dictionary** if available
1.3. Familiarity with the data?
1.3.1. Best guess at the source of messiness
1.3.1.1. Human
1.3.1.1.1. mistyping
1.3.1.1.2. language
1.3.1.1.3. lack of standards
1.3.1.2. Computer
1.3.1.2.1. regional configuration
1.3.1.2.2. miscalibration of sensors
1.3.1.2.3. misuse of functions (defaults)
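
A minimal sketch of the "look at the data first" step from section 1.1, using only the Python standard library. The sample text `RAW` and the helper `head_and_tail` are hypothetical names introduced for illustration. In practice the lines would come from an open file rather than an in-memory string.

```python
import csv
import io
from collections import deque
from itertools import islice

# Hypothetical sample standing in for a real file on disk.
RAW = """id,city,temp
1,Lima,18.5
2,Quito,14.0
3,Bogota,13.2
4,Santiago,16.1
5,La Paz,9.8
"""


def head_and_tail(lines, n=2):
    """Return the header, the first n rows, and the last n of the remaining rows."""
    reader = csv.reader(lines)
    header = next(reader)
    head = list(islice(reader, n))
    # A deque with maxlen keeps only the last n rows it is fed,
    # so the whole file never has to sit in memory.
    tail = list(deque(reader, maxlen=n))
    return header, head, tail


header, head, tail = head_and_tail(io.StringIO(RAW), n=2)
print(header)
print(head)
print(tail)
```

Looking at both ends matters because trailing rows often hold totals, footnotes, or truncated records that the first rows do not reveal.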
2. Pay attention
2.1. a table
2.1.1. Identifier uniqueness
2.1.1.1. simple
2.1.1.2. composite
2.1.2. Column names
2.1.2.1. in the right place?
2.1.2.2. need to shrink?
2.1.2.3. numbers and special characters?
2.1.2.3.1. always safe:
2.1.2.3.2. never duplicates
2.1.2.3.3. Python vs R
2.1.2.4. leading and trailing spaces?
2.1.2.5. need to normalize to get rid of punctuation?
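
The identifier and column-name checks in 2.1.1 and 2.1.2 could be sketched as follows, assuming rows are plain tuples and column names are plain strings. The sample data and the helpers `duplicated_keys`, `clean_name`, and `clean_names` are hypothetical. The cleaning rule (lowercase ASCII letters, digits, and underscores only) is one possible convention, not the only safe one.

```python
import re
from collections import Counter

# Hypothetical data; the column names are deliberately messy.
columns = [" Country Name", "Year ", "GDP (US$)", "year"]
rows = [
    ("Peru", 2020, 1.0, 2020),
    ("Peru", 2021, 1.1, 2021),
    ("Chile", 2020, 2.0, 2020),
    ("Peru", 2020, 1.2, 2020),  # repeats the (country, year) key
]


def duplicated_keys(rows, key_positions):
    """Return the key tuples that appear in more than one row."""
    counts = Counter(tuple(row[i] for i in key_positions) for row in rows)
    return [key for key, n in counts.items() if n > 1]


def clean_name(name):
    """Trim, lowercase, and collapse runs of other characters into '_'."""
    name = re.sub(r"[^0-9a-z]+", "_", name.strip().lower()).strip("_")
    # Assumption: prefix names starting with a digit so they work as identifiers.
    return f"x{name}" if name[:1].isdigit() else name


def clean_names(names):
    """Clean every name and add a numeric suffix to repeats."""
    seen = Counter()
    result = []
    for name in map(clean_name, names):
        seen[name] += 1
        result.append(name if seen[name] == 1 else f"{name}_{seen[name]}")
    return result


print(duplicated_keys(rows, [0]))     # simple key: country alone
print(duplicated_keys(rows, [0, 1]))  # composite key: country + year
print(clean_names(columns))
```

Note that this sketch replaces accented letters with underscores and does not guard against a suffixed name colliding with an existing one. Both would need handling in real data.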
2.1.3. Cell values
2.1.3.1. text
2.1.3.1.1. leading and trailing spaces?
2.1.3.1.2. characters beyond alphanumeric?
2.1.3.2. categories
2.1.3.2.1. ALWAYS verify with a frequency table
2.1.3.2.2. the representation of missing values
2.1.3.3. numbers
2.1.3.3.1. characters other than digits introduced by number formatting (e.g. thousands separators, currency symbols)
2.1.3.3.2. leading and trailing spaces when read as text
2.1.3.3.3. the representation of dates
2.1.3.3.4. the representation of missing values
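
The cell-value checks in 2.1.3 might look like the sketch below, written with the standard library only. The set `MISSING`, the tuple `DATE_FORMATS`, and the helper functions are hypothetical. The actual missing-value markers and date formats should come from the data dictionary or from inspecting the data.

```python
from collections import Counter
from datetime import datetime

# Assumption: strings that stand for "missing" in this hypothetical dataset.
MISSING = {"", "na", "n/a", "null", "-", "?"}
# Assumption: the date layouts expected in this hypothetical dataset.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y")


def is_missing(value):
    return value.strip().lower() in MISSING


def frequency_table(values):
    """Count raw values as stored, so stray spaces and case variants stay visible."""
    return Counter(values)


def parse_number(text, thousands=",", decimal="."):
    """Return a float, or None for a missing marker; raise ValueError otherwise."""
    if is_missing(text):
        return None
    cleaned = text.strip().replace(thousands, "").replace(decimal, ".")
    return float(cleaned)


def parse_date(text):
    """Try each expected format in turn; raise ValueError if none matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(text.strip(), fmt).date()
        except ValueError:
            continue
    raise ValueError(f"unrecognized date: {text!r}")


categories = ["yes", "Yes", "yes ", "no", "N/A", "no"]
print(frequency_table(categories))
print(parse_number(" 1,234.50 "))
print(parse_number("1.234,50", thousands=".", decimal=","))
print(parse_number("n/a"))
print(parse_date("31/12/2020"))
```

The frequency table counts the raw strings on purpose: `"yes"`, `"Yes"`, and `"yes "` show up as three separate categories, which is exactly the kind of messiness the check is meant to expose.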
3. Common operations
3.1. subsetting / filtering / skipping
3.2. basic exploration
3.2.1. look for characters that distort the intended interpretation of the value
3.2.1.1. number
3.2.1.1.1. check whether anything other than a number is present in the value
3.2.1.2. text
3.2.1.2.1. check whether any character outside your expected alphabet is present
3.3. ad-hoc programming
3.3.1. replace
3.3.2. extract
3.3.3. split
3.3.4. strip / trim
3.3.5. using regular expressions
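
The operations listed in section 3 can be illustrated with Python's built-in string methods and the `re` module. The sample string `raw` and the helper `unexpected_chars` are hypothetical. The allowed alphabet (ASCII letters, digits, and spaces) is an assumption that should be widened for text in other languages.

```python
import re

# Hypothetical messy value.
raw = "  Order #A-1023;  total: US$ 1,250.00 ; status=SHIPPED  "

# strip / trim: remove leading and trailing whitespace.
trimmed = raw.strip()

# split: break on ';' and trim each piece.
parts = [piece.strip() for piece in trimmed.split(";")]

# extract: pull the order code out of the first piece with a capture group.
match = re.search(r"#([A-Z]-\d+)", parts[0])
order_code = match.group(1) if match else None

# replace: drop everything except digits and the decimal point, then convert.
amount = float(re.sub(r"[^0-9.]", "", parts[1]))


def unexpected_chars(text, allowed=r"[A-Za-z0-9 ]"):
    """Basic exploration: list the distinct characters outside the allowed set."""
    return sorted(set(re.sub(allowed, "", text)))


print(parts)
print(order_code)
print(amount)
print(unexpected_chars("José #1"))
```

Running `unexpected_chars` over a whole column before writing any replacement rule shows which characters actually need handling, so the regular expressions stay as narrow as the data allows.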