7. PreProcessing - CLEANING

1. Planning

1.1. See data first

1.1.1. head

1.1.2. tail
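
A minimal sketch of looking at the first and last rows before doing anything else. It uses only the standard library; the sample data and the helper names `head` and `tail` are assumptions made for illustration.

```python
import csv
import io
from collections import deque
from itertools import islice

# Hypothetical sample standing in for a real file on disk.
raw = io.StringIO(
    "id,city,amount\n"
    "1,Lima,10.5\n"
    "2,Cusco,7\n"
    "3,Piura,3.2\n"
    "4,Tacna,8\n"
)

def head(rows, n=2):
    """Return the first n rows."""
    return list(islice(rows, n))

def tail(rows, n=2):
    """Return the last n rows; the deque keeps at most n rows in memory."""
    return list(deque(rows, maxlen=n))

rows = list(csv.DictReader(raw))
first_rows = head(rows)
last_rows = tail(rows)
print(first_rows)
print(last_rows)
```

The tail is worth checking as well as the head, since exported files often end with totals or footnotes that are not data rows.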

1.2. Check **data dictionary** if available

1.3. Familiarity with the data?

1.3.1. Best guess on source of messiness

1.3.1.1. Human

1.3.1.1.1. mistyping

1.3.1.1.2. language

1.3.1.1.3. lack of standards

1.3.1.2. Computer

1.3.1.2.1. regional configuration

1.3.1.2.2. miscalibration of sensors

1.3.1.2.3. misuse of functions (defaults)

2. Pay attention

2.1. Identifier uniqueness

2.1.1. simple

2.1.2. composite
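
One possible way to test identifier uniqueness, for both a simple (single-column) and a composite (multi-column) key. The records, column names, and the helper `duplicated_keys` are hypothetical.

```python
from collections import Counter

# Hypothetical records: student_id alone repeats, so it is not a valid simple key.
records = [
    {"student_id": "A1", "term": "2023-1", "grade": 15},
    {"student_id": "A1", "term": "2023-2", "grade": 17},
    {"student_id": "B2", "term": "2023-1", "grade": 12},
    {"student_id": "B2", "term": "2023-1", "grade": 13},
]

def duplicated_keys(rows, key_columns):
    """Return each key (as a tuple) that appears more than once, with its count."""
    counts = Counter(tuple(row[c] for c in key_columns) for row in rows)
    return {key: n for key, n in counts.items() if n > 1}

simple_dupes = duplicated_keys(records, ["student_id"])
composite_dupes = duplicated_keys(records, ["student_id", "term"])
print(simple_dupes)
print(composite_dupes)
```

An empty result means the chosen column or combination of columns identifies every row uniquely.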

2.2. Column names

2.2.1. need to shrink?

2.2.2. spaces between words?

2.2.2.1. no spaces

2.2.2.2. another character

2.2.3. leading and trailing spaces?
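
A small sketch of the column-name checks above: trimming the ends and either removing inner spaces or replacing them with another character. The function name, the sample names, and the choice of `_` as the default separator are assumptions.

```python
import re

def clean_column_name(name, separator="_"):
    """Trim the ends, then replace each run of inner whitespace with `separator`."""
    name = name.strip()
    return re.sub(r"\s+", separator, name)

# Hypothetical column names as they might come from a spreadsheet.
raw_columns = ["  Student ID ", "Date of  Birth", "Final Grade  "]

with_underscores = [clean_column_name(c) for c in raw_columns]
without_spaces = [clean_column_name(c, separator="") for c in raw_columns]
print(with_underscores)
print(without_spaces)
```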

2.3. Cell values

2.3.1. text

2.3.1.1. leading and trailing spaces?

2.3.1.2. characters beyond alphabet?

2.3.2. categories

2.3.2.1. ALWAYS verify with frequency table

2.3.2.1.1. possible mistypings

2.3.2.2. the representation of missing values
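
A frequency table can be sketched with `collections.Counter`. The values below are made up, and the set of missing-value markers is an assumption that has to be adapted to each dataset.

```python
from collections import Counter

# Hypothetical category column with mistypings and several "missing" spellings.
values = ["Lima", "lima", "Lima ", "Cusco", "N/A", "", "Cusco", "Lmia", "-"]

# Frequency table: rare variants usually point to mistypings.
frequency = Counter(values)
for value, count in frequency.most_common():
    print(repr(value), count)  # repr() makes stray spaces visible

# Assumed markers for missing values in this sample.
MISSING_MARKERS = {"", "N/A", "-"}
normalized = [None if v.strip() in MISSING_MARKERS else v.strip() for v in values]
missing_count = normalized.count(None)
```

Printing with `repr` helps here because `'Lima'` and `'Lima '` would otherwise look identical in the table.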

2.3.3. numbers

2.3.3.1. non-numeric characters introduced by the number format

2.3.3.1.1. currency?

2.3.3.1.2. units of measurement?

2.3.3.2. leading and trailing spaces when read as text

2.3.3.3. the representation of dates

2.3.3.4. the representation of missing values
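
A rough sketch of how numeric and date cells read as text might be parsed. It assumes `,` is a thousands separator and `.` the decimal mark, and the missing-value markers and date formats are guesses for illustration; `parse_amount` and `parse_date` are hypothetical helpers.

```python
import re
from datetime import datetime

MISSING_MARKERS = {"", "NA", "-"}  # assumption: adapt to the dataset

def parse_amount(text):
    """Strip spaces, currency symbols and units, then convert to float."""
    text = text.strip()
    if text in MISSING_MARKERS:
        return None
    # Keep only digits, the decimal point and a minus sign.
    cleaned = re.sub(r"[^0-9.\-]", "", text)
    return float(cleaned)

def parse_date(text, formats=("%Y-%m-%d", "%d/%m/%Y")):
    """Try each assumed format in turn; return None if none matches."""
    text = text.strip()
    for fmt in formats:
        try:
            return datetime.strptime(text, fmt).date()
        except ValueError:
            continue
    return None

amounts = [parse_amount(v) for v in ["$1,250.50", " 300 ", "12 kg", "NA"]]
dates = [parse_date(v) for v in ["2023-12-31", " 31/12/2023 ", "unknown"]]
print(amounts)
print(dates)
```

Under a different regional configuration (for example `,` as the decimal mark) the cleaning rule in `parse_amount` would need to change.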

3. Common operations

3.1. subsetting / filtering / skipping
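
A sketch of the three operations on a small in-memory file: skipping preamble lines, filtering rows, and subsetting columns. The data and the number of lines to skip are assumptions.

```python
import csv
import io
from itertools import islice

# Hypothetical export with two non-data lines before the header.
raw = io.StringIO(
    "Report generated 2024\n"
    "\n"
    "id,city,amount\n"
    "1,Lima,10\n"
    "2,Cusco,7\n"
    "3,Lima,3\n"
)

# Skipping: drop the first two lines so the header is read correctly.
rows = list(csv.DictReader(islice(raw, 2, None)))

# Filtering: keep only the rows that meet a condition.
lima_rows = [r for r in rows if r["city"] == "Lima"]

# Subsetting: keep only the columns of interest.
subset = [{"id": r["id"], "amount": r["amount"]} for r in lima_rows]
print(subset)
```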

3.2. basic exploration

3.2.1. look for characters that contaminate the real interpretation of the value

3.2.1.1. number

3.2.1.1.1. check whether the value contains anything other than digits or a decimal point

3.2.1.2. text

3.2.1.2.1. check whether characters outside your alphabet are present
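
The two checks above can be sketched with regular expressions that match any character outside an allowed set. The allowed sets here are assumptions (digits and a dot for numbers, unaccented English letters and spaces for text), and `contaminants` is a hypothetical helper.

```python
import re

NOT_NUMERIC = re.compile(r"[^0-9.]")       # anything that is not a digit or a dot
NOT_ALPHABET = re.compile(r"[^A-Za-z ]")   # assumed alphabet: English letters and spaces

def contaminants(values, pattern):
    """Map each offending value to the sorted list of unexpected characters in it."""
    found = {}
    for value in values:
        chars = sorted(set(pattern.findall(value)))
        if chars:
            found[value] = chars
    return found

number_issues = contaminants(["12.5", "1,200", "30%", "7"], NOT_NUMERIC)
text_issues = contaminants(["Lima", "Cañete", "Cusco?"], NOT_ALPHABET)
print(number_issues)
print(text_issues)
```

A flagged character is not necessarily an error: `ñ` is legitimate in Spanish text, so the alphabet pattern should reflect the language of the data.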

3.3. ad-hoc programming

3.3.1. replace

3.3.2. extract

3.3.3. split

3.3.4. strip / trim

3.3.5. using regular expressions
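
A short sketch that runs each of the operations above on one made-up string, using Python's built-in string methods and the `re` module.

```python
import re

# Hypothetical messy cell value.
raw = "  Weight: 72.5 kg ; Height: 1.80 m  "

# strip / trim: remove leading and trailing spaces.
stripped = raw.strip()

# split: break the value into parts on a delimiter, trimming each part.
parts = [p.strip() for p in stripped.split(";")]

# replace: swap one literal substring for another.
replaced = parts[0].replace("Weight", "weight")

# extract with a regular expression: first integer or decimal number.
match = re.search(r"\d+(?:\.\d+)?", parts[0])
weight = float(match.group()) if match else None

# regular-expression replace: collapse runs of whitespace into one space.
collapsed = re.sub(r"\s+", " ", "too    many   spaces")

print(parts, replaced, weight, collapsed)
```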