Python for PreProcessing

1. PreProcessing Operations

1.1. Cleaning Data Frames

1.1.1. Planning

1.1.1.1. See data first

1.1.1.1.1. head

1.1.1.1.2. tail

1.1.1.2. Best guess on source of messiness

1.1.1.2.1. Human

1.1.1.2.2. computer

1.1.1.3. Check **data dictionary** if available
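
A minimal sketch of the "see data first" step, assuming pandas is the data frame library in use. The toy table and its column names are hypothetical stand-ins for a freshly loaded file.

```python
import pandas as pd

# Hypothetical toy data standing in for a freshly loaded file.
df = pd.DataFrame({
    "id": [1, 2, 3, 4, 5, 6],
    "city": ["Lima", "Quito", "Bogota", "La Paz", "Santiago", "Total"],
    "population": ["10.7", "2.8", "7.9", "0.8", "6.9", "29.1"],
})

first_rows = df.head(3)  # top of the table: headers and obvious junk
last_rows = df.tail(2)   # bottom of the table: footers such as a "Total" row

print(first_rows)
print(last_rows)
print(df.dtypes)         # "population" is text here, not a number
```

Looking at both ends of the table is cheap and often reveals whether the mess comes from a person (free-text notes, totals) or from a program (export headers, encoding).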

1.1.2. Pay attention

1.1.2.1. Identifier uniqueness

1.1.2.1.1. simple

1.1.2.1.2. composite

1.1.2.2. Column names

1.1.2.2.1. many characters?

1.1.2.2.2. weird symbols

1.1.2.2.3. spaces between words

1.1.2.2.4. leading and trailing spaces

1.1.2.3. Cell values

1.1.2.3.1. text

1.1.2.3.2. categories

1.1.2.3.3. numbers
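
The checks on identifiers and column names can be sketched as follows, assuming pandas. The data and the column names are hypothetical; the renaming rule (lowercase, underscores) is one possible convention, not the only one.

```python
import pandas as pd

# Hypothetical table with messy column names and a repeated key.
df = pd.DataFrame({
    " Country Name ": ["Peru", "Peru", "Chile", "Chile"],
    "Year (obs.)": [2020, 2021, 2020, 2020],
    "GDP $": [1.0, 1.1, 2.0, 2.0],
})

# Column names: trim spaces, lowercase, turn runs of symbols/spaces into "_".
df.columns = (
    df.columns.str.strip()
    .str.lower()
    .str.replace(r"[^0-9a-z]+", "_", regex=True)
    .str.strip("_")
)

# Simple identifier: is one column enough to identify a row?
simple_is_unique = df["country_name"].is_unique

# Composite identifier: rows that repeat the (country, year) pair.
composite_dupes = df[df.duplicated(subset=["country_name", "year_obs"], keep=False)]

print(list(df.columns))
print(simple_is_unique)
print(composite_dupes)
```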

1.1.3. operations

1.1.3.1. subsetting / filtering / skipping

1.1.3.1.1. keep columns that you need

1.1.3.2. basic exploration

1.1.3.2.1. look for stray characters that distort how the value is interpreted

1.1.3.3. ad-hoc programming

1.1.3.3.1. replace

1.1.3.3.2. extract

1.1.3.3.3. split

1.1.3.3.4. strip / trim

1.1.3.3.5. using regular expressions
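
The ad-hoc operations listed above (replace, extract, split, strip, regular expressions) might look like this with the pandas string accessor. The values and column names are hypothetical.

```python
import pandas as pd

# Hypothetical raw values contaminated with symbols and mixed content.
raw = pd.DataFrame({
    "price": [" $1,200 ", "$950", "USD 3,400"],
    "location": ["Lima - Peru", "Quito - Ecuador", "Bogota - Colombia"],
    "code": ["item_0012", "item_0456", "item_7890"],
})

clean = pd.DataFrame()

# strip + replace: trim spaces, then drop anything that is not a digit or a dot.
clean["price"] = (
    raw["price"].str.strip().str.replace(r"[^\d.]", "", regex=True).astype(float)
)

# split: one column becomes two.
parts = raw["location"].str.split(" - ", expand=True)
clean["city"] = parts[0]
clean["country"] = parts[1]

# extract: the capture group keeps only the trailing digits.
clean["code_num"] = raw["code"].str.extract(r"(\d+)$", expand=False).astype(int)

print(clean)
```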

1.2. Formatting

1.2.1. See DATA TYPES

1.2.1.1. not always the data type it is supposed to be

1.2.2. Text

1.2.2.1. decide

1.2.2.1.1. capitalization

1.2.2.1.2. normalization
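
One way to settle capitalization and normalization for text, sketched with pandas and the standard `unicodedata` module. Lowercasing and removing accents are assumptions made for this example; the right choice depends on the data. `strip_accents` is a hypothetical helper.

```python
import unicodedata

import pandas as pd

# Hypothetical spellings of the same two countries.
names = pd.Series(["  PERÚ", "Peru ", "perú", "México", "MEXICO"])


def strip_accents(text: str) -> str:
    """Decompose accented characters and drop the combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# Decision taken here: trim, lowercase, remove accents.
normalized = names.str.strip().str.lower().map(strip_accents)

print(normalized.tolist())
print(normalized.nunique())  # distinct values after normalization
```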

1.2.3. Numbers

1.2.3.1. coerce?

1.2.3.2. ignore?

1.2.3.3. raise?
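
The coerce / ignore / raise question matches the `errors` argument of `pandas.to_numeric`. A minimal sketch with hypothetical values follows; `errors="ignore"` is left out because it appears to be deprecated in recent pandas releases.

```python
import pandas as pd

# Hypothetical column read as text because of one bad cell.
values = pd.Series(["10", "20.5", "n/a", "30"])

# coerce: unparseable cells become NaN, so they can be counted and inspected.
as_numbers = pd.to_numeric(values, errors="coerce")
n_failed = int(as_numbers.isna().sum())

# raise: stop at the first unparseable cell.
try:
    pd.to_numeric(values, errors="raise")
    raised = False
except ValueError:
    raised = True

print(as_numbers)
print(n_failed, raised)
```

Coercing is convenient, but it silently creates missing values, so it helps to count them right away.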

1.2.4. Dates

1.2.4.1. avoid automatic date inference

1.2.4.2. Be aware of the date/time symbols

1.2.4.2.1. Year

1.2.4.2.2. Month

1.2.4.2.3. Day
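
A small sketch of parsing dates with explicit symbols instead of inference, assuming pandas. The day-first layout of the sample strings is an assumption of this example.

```python
import pandas as pd

# Hypothetical day/month/year strings; "03/04/2021" is ambiguous without a format.
raw_dates = pd.Series(["03/04/2021", "12/11/2021", "25/12/2021"])

# Explicit symbols: %d = day, %m = month, %Y = four-digit year.
parsed = pd.to_datetime(raw_dates, format="%d/%m/%Y")

print(parsed)
print(parsed.dt.month.tolist())
```

Stating the format makes the parse fail loudly on rows that do not follow it, rather than guessing.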

1.2.5. Categorical data

1.2.5.1. Nominal

1.2.5.1.1. dichotomous

1.2.5.1.2. polytomous

1.2.5.2. Ordinal

1.2.5.2.1. consider duplicating the column(s)

1.2.5.2.2. Levels

1.2.5.2.3. Homogenize range of ordinal levels
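
Nominal and ordinal columns can be declared as pandas categoricals. This sketch uses hypothetical columns and keeps the original ordinal text next to a typed copy, following the "consider duplicating the column(s)" note.

```python
import pandas as pd

df = pd.DataFrame({
    "employed": ["yes", "no", "yes"],           # nominal, dichotomous
    "region": ["north", "south", "east"],       # nominal, polytomous
    "satisfaction": ["low", "high", "medium"],  # ordinal
})

# Nominal: categories without any order.
df["employed"] = df["employed"].astype("category")
df["region"] = df["region"].astype("category")

# Ordinal: state the levels explicitly, from lowest to highest.
levels = ["low", "medium", "high"]
df["satisfaction_ord"] = pd.Categorical(
    df["satisfaction"], categories=levels, ordered=True
)

# Order-aware operations now work on the typed copy.
above_low = df[df["satisfaction_ord"] > "low"]

print(df.dtypes)
print(above_low)
```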

1.3. Integrating

1.3.1. Appending/Concatenating

1.3.1.1. vertical

1.3.1.1.1. concept

1.3.1.2. horizontal

1.3.1.2.1. concept
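
Both kinds of concatenation can be illustrated with `pandas.concat`. The tables are hypothetical: vertical stacking adds rows with the same columns, horizontal stacking adds columns for the same rows.

```python
import pandas as pd

survey_2020 = pd.DataFrame({"id": [1, 2], "score": [10, 20]})
survey_2021 = pd.DataFrame({"id": [3, 4], "score": [30, 40]})

# Vertical: same columns, more rows.
stacked = pd.concat([survey_2020, survey_2021], axis=0, ignore_index=True)

# Horizontal: same rows (aligned on the index), more columns.
extra = pd.DataFrame({"age": [31, 45]})
side_by_side = pd.concat([survey_2020, extra], axis=1)

print(stacked)
print(side_by_side)
```

Horizontal concatenation aligns on the index, not on a key column, so it is only safe when both tables list the same units in the same order.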

1.3.2. Merging

1.3.2.1. Key

1.3.2.1.1. DFs to be merged must share common column values in one or several columns

1.3.2.2. the Varieties of Merge

1.3.2.2.1. **Default result**: only matches when the same 'key' is present in both DFs

1.3.2.2.2. Keeping all rows from...

1.3.2.3. Fuzzy Merge

1.3.2.3.1. Algorithm to match similar keys
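
The merge varieties can be sketched with `DataFrame.merge`. For the fuzzy case, this example swaps in the standard library's `difflib.get_close_matches` as the matching algorithm; that is an assumption of the sketch, and dedicated fuzzy-matching packages exist. All data and the helper `closest` are hypothetical.

```python
import difflib

import pandas as pd

left = pd.DataFrame({"country": ["Peru", "Chile", "Bolivia"], "gdp": [1, 2, 3]})
right = pd.DataFrame({"country": ["Peru", "Chile", "Brasil"], "pop": [33, 19, 214]})

# Default result: only keys present in both tables.
inner = left.merge(right, on="country")

# Keeping all rows from the left table, or from both.
keep_left = left.merge(right, on="country", how="left")
keep_all = left.merge(right, on="country", how="outer", indicator=True)

# Fuzzy merge: map each messy key to its closest clean key first.
messy = pd.DataFrame({
    "country": ["Perú", "Chille", "Bolivia"],
    "area": [1.28, 0.76, 1.10],
})


def closest(name, choices, cutoff=0.6):
    """Return the most similar choice, or None if nothing is close enough."""
    matches = difflib.get_close_matches(name, choices, n=1, cutoff=cutoff)
    return matches[0] if matches else None


clean_keys = left["country"].tolist()
messy["country_key"] = messy["country"].map(lambda n: closest(n, clean_keys))

fuzzy = left.merge(
    messy, left_on="country", right_on="country_key", suffixes=("", "_messy")
)

print(keep_all)
print(fuzzy)
```

The `indicator=True` column shows which table each row came from, which is handy for listing the keys that failed to match.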

1.4. Transformation

1.4.1. Aggregating

1.4.2. Rescaling

1.4.2.1. Normalization (min-max)

1.4.2.2. Standardization
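
A sketch of aggregating and rescaling, assuming pandas and a hypothetical income column. Min-max normalization computes (x - min) / (max - min); standardization computes (x - mean) / standard deviation. The population standard deviation (`ddof=0`) is a choice made for this example.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["north", "north", "south", "south"],
    "income": [10.0, 20.0, 30.0, 40.0],
})

# Aggregating: one row per region.
by_region = df.groupby("region", as_index=False).agg(
    mean_income=("income", "mean"),
    n=("income", "size"),
)

col = df["income"]

# Normalization (min-max): values end up between 0 and 1.
df["income_minmax"] = (col - col.min()) / (col.max() - col.min())

# Standardization: mean 0 and standard deviation 1.
df["income_z"] = (col - col.mean()) / col.std(ddof=0)

print(by_region)
print(df)
```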

1.5. Reshaping

1.5.1. Wide to Long

1.5.1.1. you avoid missing cells

1.5.1.2. easier for complex plots and some longitudinal methods

1.5.2. Long to Wide

1.5.2.1. easier to compute stats for unit of analysis

1.5.3. These processes help discover issues

1.5.3.1. key repetitions

1.5.3.2. key mistyping
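
Reshaping in both directions can be sketched with `melt` and `pivot` from pandas. The table is hypothetical. The last step shows how a repeated key surfaces during reshaping: `pivot` refuses to build the wide table.

```python
import pandas as pd

wide = pd.DataFrame({
    "country": ["Peru", "Chile"],
    "2020": [1.0, 2.0],
    "2021": [1.1, None],  # a missing cell in the wide layout
})

# Wide to long: one row per (country, year); missing cells can simply be dropped.
long = wide.melt(id_vars="country", var_name="year", value_name="gdp")
long = long.dropna(subset=["gdp"])

# Long to wide: one row per country again.
back_to_wide = long.pivot(index="country", columns="year", values="gdp").reset_index()

# A repeated key makes the reshape fail, which exposes the problem.
with_repeat = pd.concat([long, long.iloc[[0]]])
try:
    with_repeat.pivot(index="country", columns="year", values="gdp")
    pivot_failed = False
except ValueError:
    pivot_failed = True

print(long)
print(back_to_wide)
print(pivot_failed)
```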

1.6. Exporting

1.6.1. From Python to Python

1.6.1.1. pickle

1.6.1.2. parquet

1.6.1.2.1. feather

1.6.2. From R to R

1.6.2.1. RDS

1.6.3. From Python to R

1.6.3.1. parquet

1.6.3.1.1. feather

1.6.4. From R to Python

1.6.4.1. parquet

1.6.4.1.1. feather

1.6.5. **What about CSV?**

1.6.5.1. you may lose the formatting you did

1.6.5.1.1. dates

1.6.5.1.2. categorical variables

1.6.5.2. the file size can be too big!
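
The point about CSV losing formatting can be checked with a round trip. This sketch compares CSV with pickle only, since pickle needs no extra packages; `DataFrame.to_parquet` and `DataFrame.to_feather` exist in pandas but require an optional engine such as pyarrow. The data and file name are hypothetical.

```python
import io
import os
import tempfile

import pandas as pd

df = pd.DataFrame({
    "when": pd.to_datetime(["2021-01-31", "2021-02-28"]),
    "level": pd.Categorical(["low", "high"], categories=["low", "high"], ordered=True),
})

# CSV round trip: everything is written as plain text.
csv_text = df.to_csv(index=False)
from_csv = pd.read_csv(io.StringIO(csv_text))

# Pickle round trip: Python to Python, data types travel with the data.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "clean.pkl")
    df.to_pickle(path)
    from_pickle = pd.read_pickle(path)

csv_kept_types = from_csv.dtypes.equals(df.dtypes)
pickle_kept_types = from_pickle.dtypes.equals(df.dtypes)

print(from_csv.dtypes)     # dates and categories come back as plain text
print(from_pickle.dtypes)  # datetime and ordered category are preserved
print(csv_kept_types, pickle_kept_types)
```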