3. Cleaning and Formatting

1. Cleaning

1.1. Planning

1.1.1. See data first

1.1.1.1. head

1.1.1.2. tail

1.1.2. Check **data dictionary** if available

1.1.3. Familiarity with the data?

1.1.3.1. Best guess on source of messiness

1.1.3.1.1. Human

1.1.3.1.2. Computer
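
As a minimal sketch of the "see data first" step, the snippet below inspects both ends of a table with `head()` and `tail()`. The data frame `raw` and its contents are hypothetical, invented only to show how a human-entered footer row can hide at the bottom of a file.

```r
# Hypothetical toy data standing in for a freshly loaded file
raw <- data.frame(
  id = 1:8,
  amount = c("10", "12", "7", "n/a", "15", "9", "11", "Total: 64"),
  stringsAsFactors = FALSE
)

first_rows <- head(raw, 3)  # first three rows: check headers and typical values
last_rows  <- tail(raw, 3)  # last three rows: footers and totals often sit here

print(first_rows)
print(last_rows)
```

Looking at the tail as well as the head is what reveals the "Total" row here, which would otherwise break a later numeric conversion.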

1.2. Pay attention

1.2.1. a table

1.2.1.1. Identifier uniqueness

1.2.1.1.1. simple

1.2.1.1.2. composite

1.2.1.2. Column names

1.2.1.2.1. in the right place?

1.2.1.2.2. need to shrink?

1.2.1.2.3. numbers and special characters?

1.2.1.2.4. leading and trailing spaces?

1.2.1.2.5. need to normalize to get rid of punctuation?

1.2.1.3. Cell values

1.2.1.3.1. text

1.2.1.3.2. categories

1.2.1.3.3. numbers
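
The checks on column names and identifiers above could be sketched in base R as follows. The data frame, its column names, and the helper `clean_names()` are all hypothetical, written for this illustration; they are not part of any package.

```r
# Hypothetical table with messy column names and a repeated key
df <- data.frame(
  c("PE", "PE", "CL", "CL"),
  c(2020, 2021, 2020, 2020),
  c(1.5, 2.0, 3.1, 3.1),
  stringsAsFactors = FALSE
)
names(df) <- c(" Country Code ", "Year (fiscal)", "GDP growth, %")

# Illustrative helper: trim, lowercase, and replace punctuation/spaces
clean_names <- function(x) {
  x <- trimws(x)                      # drop leading and trailing spaces
  x <- tolower(x)                     # one capitalization for all names
  x <- gsub("[^a-z0-9]+", "_", x)     # runs of other characters become "_"
  gsub("^_+|_+$", "", x)              # no underscores at either end
}
names(df) <- clean_names(names(df))

# Simple identifier: is one column enough to identify a row?
simple_unique <- anyDuplicated(df$country_code) == 0

# Composite identifier: is the combination of two columns unique?
key_cols <- c("country_code", "year_fiscal")
composite_unique <- anyDuplicated(df[, key_cols]) == 0
dup_rows <- duplicated(df[, key_cols])  # flags the repeated key

print(names(df))
print(df[dup_rows, ])
```

In this toy case neither the simple nor the composite key is unique, and `duplicated()` points at the row that needs attention.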

1.3. common operations

1.3.1. subsetting / filtering / skipping

1.3.2. basic exploration

1.3.2.1. look for characters that distort the intended meaning of the value

1.3.2.1.1. number

1.3.2.1.2. text

1.3.3. ad-hoc programming

1.3.3.1. replace

1.3.3.2. extract

1.3.3.3. split

1.3.3.4. strip / trim

1.3.3.5. using regular expressions
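
The operations listed above can be sketched with base R string functions. The vector `x` is a made-up example of numbers contaminated by currency symbols, separators, and stray spaces; the variable names are illustrative only.

```r
# Hypothetical contaminated values
x <- c("  $1,200 USD ", "350 USD", " $75.5 usd")

# Basic exploration: which values contain anything besides digits and dots?
contaminated <- grepl("[^0-9.]", x)

trimmed  <- trimws(x)                    # strip / trim
no_money <- gsub("[$,]", "", trimmed)    # replace (regular expression)

# extract: pull the first run of digits and dots out of each value
amount <- regmatches(no_money, regexpr("[0-9.]+", no_money))

# split: break each value on the space, then keep the second piece
parts    <- strsplit(no_money, " ", fixed = TRUE)
currency <- toupper(vapply(parts, function(p) p[2], character(1)))

result <- data.frame(amount = as.numeric(amount), currency = currency)
print(result)
```

Note that `regmatches()` returns only the elements that matched, so on real data it is worth checking that its length equals the length of the input.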

2. Formatting

2.1. You are **assuming** the data is clean

2.1.1. See DATA TYPES

2.1.1.1. R: **str()**
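
A small sketch of checking data types with `str()`, using a hypothetical data frame in which every column arrived as text. The `vapply()` line is one possible way to get the same information as a plain vector.

```r
# Hypothetical table where numbers were read in as text
df <- data.frame(
  id    = c("1", "2", "3"),
  score = c("4.5", "3.0", "5.0"),
  stringsAsFactors = FALSE
)

str(df)  # prints each column with its type and first values

# Same information as a named character vector, handy for programmatic checks
col_types <- vapply(df, class, character(1))
print(col_types)
```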

2.2. Format numeric data

2.2.1. when numeric values are clean...

2.2.1.1. formatting numeric data is the easiest!

2.2.1.1.1. R: **as.numeric()**
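
To illustrate why clean values make numeric formatting easy, the sketch below applies `as.numeric()` to a clean vector and to a dirty one. Both vectors are invented for the example.

```r
# Clean values convert directly
clean <- c("10", "3.5", "-2")
nums  <- as.numeric(clean)

# Dirty values become NA (with a warning), so cleaning has to come first
dirty     <- c("10", "3,5", "n/a")
converted <- suppressWarnings(as.numeric(dirty))
failed    <- dirty[is.na(converted)]  # values that still need cleaning

print(nums)
print(failed)
```

Listing the values that failed to convert is a quick way to find out whether the "data is clean" assumption actually holds.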

2.3. Format dates

2.3.1. avoid date inference

2.3.2. Be aware of the date/time symbols

2.3.2.1. Year

2.3.2.1.1. %y (24)

2.3.2.1.2. %Y (2024)

2.3.2.2. Month

2.3.2.2.1. %m (01-12)

2.3.2.2.2. %b (Jan, Dec)

2.3.2.2.3. %B (January, December)

2.3.2.3. Day

2.3.2.3.1. %d (01-31)

2.3.2.3.2. %a (Mon, Tue)

2.3.2.3.3. %A (Monday, Tuesday)
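
The following sketch shows one way to avoid date inference in R: pass an explicit `format` built from the symbols above to `as.Date()`. The date strings are made up. Month and weekday names (`%b`, `%B`, `%a`, `%A`) depend on the session locale, so the last line may print differently on another machine.

```r
# The same text means different dates under different formats
txt <- "05/03/24"
day_first   <- as.Date(txt, format = "%d/%m/%y")  # 5 March 2024
month_first <- as.Date(txt, format = "%m/%d/%y")  # 3 May 2024

# An unambiguous four-digit-year format for comparison
iso <- as.Date("2024-03-05", format = "%Y-%m-%d")

print(day_first == iso)
print(month_first == iso)

# Formatting back to text; names of days and months follow the locale
print(format(day_first, "%A %d %B %Y"))
```

Stating the format explicitly makes the choice between day-first and month-first visible in the code instead of leaving it to a parser's guess.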

2.4. Format text

2.4.1. decide

2.4.1.1. capitalization

2.4.1.2. normalization

2.4.2. columns are a particular case

2.4.2.1. simplicity

2.4.2.1.1. easy to reference

2.4.2.1.2. avoid your own mistypings
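
As a rough sketch of the capitalization and normalization decisions, the code below picks lowercase and single spaces as the target form. That choice, and the sample values, are assumptions for the example; the right convention depends on the data.

```r
# Hypothetical text column with inconsistent case and spacing
city <- c(" lima", "LIMA ", "Lima", "san  jose")

norm <- trimws(city)                     # remove outer spaces
norm <- tolower(norm)                    # one capitalization rule
norm <- gsub("[[:space:]]+", " ", norm)  # collapse inner runs of spaces

print(unique(norm))
```

After normalization the three spellings of the same city collapse into one value, which is what makes later grouping and merging reliable.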

2.5. Format categorical data

2.5.1. Nominal

2.5.1.1. you could keep them as they come

2.5.1.1.1. or just change the data type

2.5.1.2. never nominal as ordinal

2.5.2. Ordinal

2.5.2.1. verify ordering in categories

2.5.2.2. Homogenize range of ordinal levels

2.5.2.2.1. same min

2.5.2.2.2. same max

2.5.2.3. recoding

2.5.2.3.1. integers in a column

2.5.2.3.2. levels as labels

2.5.3. Formatted categories may be complicated to export for use in a different program
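
The categorical steps above might look like this in base R. The variables, level labels, and scale ranges are hypothetical. `factor()` with `ordered = TRUE` is the base R way to declare an ordinal variable.

```r
# Nominal: only the data type changes, no order is implied
region <- factor(c("north", "south", "north"))

# Ordinal: integers in a column recoded with levels as labels
satisfaction_raw <- c(1, 3, 2, 3)
satisfaction <- factor(
  satisfaction_raw,
  levels  = 1:3,
  labels  = c("low", "medium", "high"),
  ordered = TRUE
)
print(levels(satisfaction))               # verify the ordering
print(satisfaction[1] < satisfaction[2])  # comparisons work on ordered factors

# Homogenize ranges: q1 is scored 1-5, q2 is scored 0-4
q1 <- c(1, 5, 3)
q2 <- c(0, 4, 2)
q2_shifted <- q2 + 1  # now both share the same min and max

# For export, plain codes or plain labels travel better than factors
codes  <- as.integer(satisfaction)
labels <- as.character(satisfaction)
```

Exporting the integer codes together with a separate table of labels is one option for moving ordinal data to another program, since a CSV file does not store the level order.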