4. Cleaning data
by Professor Magallanes
1. Sources of messiness
1.1. Human
1.1.1. mistyping
1.1.2. language
1.1.3. lack of standards
1.2. Computer
1.2.1. regional configuration
1.2.2. miscalibration of sensors
1.2.3. misuse of functions (defaults)
2. Planning
2.1. Look at the data first
2.2. Keep only the columns you need
3. Pay attention
3.1. Identifier uniqueness
3.2. Column names
3.2.1. too many characters?
3.2.2. weird symbols
3.2.3. spaces between words
3.2.4. leading and trailing spaces
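
The column-name checks in 3.2 can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a prescribed method: the helper `clean_column_name` and the sample names in `raw_columns` are hypothetical, and the choice of lowercase names joined by underscores is only one possible convention.

```python
import re

def clean_column_name(name: str) -> str:
    """Normalize one column name (hypothetical helper for illustration)."""
    name = name.strip()                        # 3.2.4: leading and trailing spaces
    name = re.sub(r"\s+", "_", name)           # 3.2.3: spaces between words
    # 3.2.2: weird symbols. Note this also drops accented letters;
    # widen the character class if those must be kept.
    name = re.sub(r"[^0-9A-Za-z_]", "", name)
    return name.lower()

# Made-up column names showing each problem.
raw_columns = ["  Country Name ", "GDP (US$)", "Life  expectancy%"]
clean_columns = [clean_column_name(c) for c in raw_columns]
print(clean_columns)
```

Names that are simply too long (3.2.1) are not handled here; shortening them usually needs a manual mapping from old names to new ones.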
3.3. Cell values
3.3.1. text
3.3.1.1. same issues as column names
3.3.2. categories
3.3.2.1. same issues as column names
3.3.2.2. ALWAYS verify with frequency table
3.3.2.3. the representation of missing values
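
A frequency table before and after cleaning makes the category problems in 3.3.2 visible. The sketch below uses only the standard library; the sample values in `raw`, the helper `clean_category`, and the set of missing-value markers in `MISSING` are assumptions for illustration. The real markers have to be read off the data.

```python
from collections import Counter

# Made-up category column with inconsistent spellings and missing markers.
raw = ["Yes", "yes ", "NO", "No", "n/a", "", "yes", "N/A", "No"]

# Assumed representations of a missing value; inspect the data to build the real set.
MISSING = {"", "n/a", "na", "-", "?"}

def clean_category(value):
    """Trim, lowercase, and map missing-value markers to None (hypothetical helper)."""
    v = value.strip().lower()
    return None if v in MISSING else v

before = Counter(raw)                              # frequency table of raw values
after = Counter(clean_category(v) for v in raw)    # frequency table after cleaning

print(before)
print(after)
```

In the raw table every spelling variant shows up as its own category; after cleaning, the variants collapse and the missing values are counted under a single key.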
3.3.3. numbers
3.3.3.1. non-numeric characters introduced by number formatting (e.g. thousands separators, currency symbols)
3.3.3.2. the representation of missing values
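
For numeric columns, the two points in 3.3.3 can be handled together when converting text to numbers. The following is a rough sketch, not a definitive parser: `parse_number`, its `thousands` and `decimal` parameters, and the markers in `MISSING_NUM` are all hypothetical names chosen for this example. The separators are parameters because they depend on the regional configuration mentioned in 1.2.1.

```python
import re

# Assumed representations of a missing number; check the data for the real ones.
MISSING_NUM = {"", "na", "n/a", "-", ".."}

def parse_number(text, thousands=",", decimal="."):
    """Convert formatted text to float, or None if it marks a missing value."""
    v = text.strip()
    if v.lower() in MISSING_NUM:
        return None
    v = v.replace(thousands, "")        # drop the thousands separator
    v = v.replace(decimal, ".")         # use "." as the decimal mark
    v = re.sub(r"[^0-9.\-]", "", v)     # drop currency symbols, %, spaces
    return float(v)                     # raises ValueError if nothing parseable is left

print(parse_number("1,234.56"))
print(parse_number("1.234,56", thousands=".", decimal=","))
print(parse_number("$ 1,200"))
print(parse_number("n/a"))
```

Letting `float` raise on unparseable text is deliberate in this sketch: a value that fails to convert is surfaced for inspection rather than silently turned into a missing value.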