Parquet

1. Columnar format

1.1. why?

1.1.1. only read the columns you need

1.1.1.1. faster I/O

1.1.1.1.1. scan a subset of columns
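
For example, a minimal sketch using pyarrow (the file name and column names are made up) that reads only the columns it needs:

```python
import pyarrow.parquet as pq

# Read just two columns from a (hypothetical) events.parquet file;
# the other columns are never read from disk or deserialized.
table = pq.read_table("events.parquet", columns=["user_id", "event_time"])
print(table.schema)
```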

1.1.2. similar data is stored together

1.1.2.1. better compression because the data is more homogeneous (see the write sketch below)

1.1.2.1.1. saves space

1.1.2.2. encodings better suited to modern processors' pipelines

1.1.2.2.1. because branching is more predictable
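
A minimal write sketch, assuming pyarrow (table contents and codec choice are made up), showing how homogeneous columns are compressed column by column:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Each column holds values of a single type stored together,
# so the data is homogeneous and compresses well.
table = pa.table({
    "country": ["US", "US", "DE", "US", "DE"],   # highly repetitive strings
    "clicks":  [10, 12, 9, 11, 10],              # small, similar integers
})

# Compression is applied per column chunk; "zstd" is just one codec choice.
pq.write_table(table, "clicks.parquet", compression="zstd")
```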

1.1.3. type-specific encodings possible

1.1.3.1. dictionary encoding

1.1.3.1.1. useful when a column has many repeated strings (relatively few distinct values)

1.1.3.1.2. map a string to a number

1.1.3.1.3. only store the numbers

1.1.3.1.4. given a number, do a lookup in dictionary to find the string
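
A toy sketch of the dictionary-encoding idea in plain Python (not the actual Parquet implementation):

```python
# Column with many repeated strings but few distinct values.
column = ["apple", "banana", "apple", "apple", "banana"]

# Build the dictionary: each distinct string gets a small integer id.
dictionary = {}
encoded = []
for value in column:
    if value not in dictionary:
        dictionary[value] = len(dictionary)
    encoded.append(dictionary[value])

print(encoded)  # only the numbers are stored: [0, 1, 0, 0, 1]

# Decoding: given a number, look it up in the (reversed) dictionary.
lookup = {v: k for k, v in dictionary.items()}
assert [lookup[i] for i in encoded] == column
```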

1.1.4. skip unnecessary deserialization

1.1.4.1. skip unwanted columns

1.1.4.1.1. don't deserialize them

1.1.5. possible to operate on encoded data

1.1.6. natural fit for vectorized operations
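
A small sketch, assuming pyarrow and made-up values, of operating on a whole column at once instead of row by row:

```python
import pyarrow as pa
import pyarrow.compute as pc

# A column is a contiguous, typed array, so bulk (vectorized) kernels
# can run over it without per-row branching.
prices = pa.array([9.99, 14.50, 3.25, 7.00])
total = pc.sum(prices)               # one call over the whole column
discounted = pc.multiply(prices, 0.9)
print(total, discounted)
```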

1.2. differences from other storage systems

1.2.1. cassandra and hbase

1.2.1.1. key-value pairs

1.2.2. parquet stores data column by column, not as key-value pairs

2. Design

2.1. Framework Independent

2.1.1. How does parquet achieve this?

2.1.1.1. let people specify their own converters

2.1.1.1.1. convert parquet to your object

2.1.1.1.2. no intermediate parquet object that then gets converted to your object

2.1.1.2. sort of like a SAX-style (callback/event-driven) model
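
A purely illustrative Python sketch of the callback ("SAX-style") conversion idea; the class and method names are made up and are not the real parquet-mr converter API:

```python
class UserConverter:
    """Builds the caller's own object directly from decoded values,
    with no intermediate 'Parquet record' object in the middle."""

    def start_record(self):
        self.current = {}

    def add_value(self, field_name, value):
        # Called once per decoded field, like a SAX event handler.
        self.current[field_name] = value

    def end_record(self):
        return self.current

# A reader would drive the converter with callbacks, e.g.:
conv = UserConverter()
conv.start_record()
conv.add_value("name", "ada")
conv.add_value("age", 36)
print(conv.end_record())  # {'name': 'ada', 'age': 36}
```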

2.2. Row Groups

2.2.1. a group of rows in columnar format

2.2.1.1. the maximum amount of data buffered in memory while writing

2.2.1.2. one or more per split while reading

2.2.1.3. Roughly: 50MB to 1GB

2.2.1.3.1. buffer needed in memory
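
A minimal write sketch, assuming pyarrow (file name and sizes are made up); pyarrow expresses the row-group limit in rows rather than bytes, but it is the same buffering unit described above:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"id": list(range(1_000_000))})

# Rows are buffered in memory and flushed one row group at a time;
# here every 100,000 rows become one row group in the output file.
pq.write_table(table, "ids.parquet", row_group_size=100_000)

print(pq.ParquetFile("ids.parquet").num_row_groups)  # -> 10
```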

2.2.2. a collection of rows that will be converted to a columnar format

2.2.2.1. we don't convert the whole dataset at once

2.2.2.2. a piece at a time

2.2.3. diagram

2.2.4. diagram

2.3. Column Chunk

2.3.1. data for one column in a row group

2.3.1.1. can be read independently for efficient scans

2.3.1.2. skip column chunks

2.3.1.3. makes deserialization fast

2.3.1.3.1. on the read path
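
A small sketch, assuming pyarrow and the hypothetical events.parquet from the earlier sketch, of inspecting one column chunk's metadata inside a row group:

```python
import pyarrow.parquet as pq

meta = pq.ParquetFile("events.parquet").metadata

# Column chunk 0 of row group 0: each chunk can be located and read
# independently of the other columns.
chunk = meta.row_group(0).column(0)
print(chunk.path_in_schema, chunk.compression, chunk.total_compressed_size)
```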

2.4. Page

2.4.1. Unit of access in a column chunk

2.4.1.1. should be big enough for compression to be efficient

2.4.1.2. Minimum size to read to access a single record

2.4.1.2.1. when index pages are available

2.4.1.3. roughly: 8KB to 1MB
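
A small write sketch, assuming pyarrow (file name and contents are made up); the data_page_size argument is a target in bytes, not a hard limit:

```python
import pyarrow as pa
import pyarrow.parquet as pq

table = pa.table({"payload": ["x" * 100] * 50_000})

# Ask the writer to aim for ~64 KB data pages inside each column chunk;
# pages are the unit that gets encoded and compressed.
pq.write_table(table, "payload.parquet", data_page_size=64 * 1024)
```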

2.4.2. contains

2.4.2.1. header

2.4.2.2. repetition & definition levels

2.4.2.3. data

2.4.2.3.1. encoded

2.4.2.3.2. compressed

2.4.3. footer can be separate from the file

2.4.3.1. so that you can rewrite it cheaply

2.5. block

2.5.1. The Parquet file block size should be no larger than the HDFS block size for the file so that each Parquet block can be read from a single HDFS block (and therefore from a single datanode). It is common to set them to be the same, and indeed both defaults are for 128 MB block sizes.

2.6. file

3. Design Goal

3.1. Reduce storage cost for large datasets with stable schemas

3.2. Reduce required IO for queries

3.3. support nested data structures

3.3.1. maps, lists, JSON

3.3.2. twitter

3.4. Work for multiple frameworks

3.4.1. write binary data to disk

3.4.2. want the data to be readable by hive/pig/etc

3.4.3. however, if you use your own custom binary format, this won't work

3.5. built-in flexibility to allow evolving the format

3.6. minimal requirements to adopt

3.6.1. no company SPOF

3.6.1.1. SPOF = single point of failure: the format should not depend on any single company

3.7. Absolutely no implicit Hive/Pig/Impala/etc preference

4. history

4.1. 2012

4.1.1. Fall

4.1.1.1. twitter and cloudera merge efforts to develop columnar formats

4.2. 2013

4.2.1. March

4.2.1.1. support hive

4.2.1.1.1. hive able to read parquet

4.2.2. July

4.2.2.1. 1.0 release

4.3. 2014

4.3.1. May

4.3.1.1. went to apache incubator

4.3.1.2. opened the source to the public

5. Nested Structures

6. Encoding

7. Question

7.1. a schema for a parquet file

7.2. add field

7.2.1. can add a column to an existing type

7.2.2. can NOT remove a column

7.2.2.1. because data would be lost for old records
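
A small sketch, assuming pyarrow's dataset API (file names and fields are made up), of reading old and new files together after a column was added; the old file simply yields nulls for the new column:

```python
import pyarrow as pa
import pyarrow.dataset as ds
import pyarrow.parquet as pq

# Old file written before the "email" column existed; new file has it.
pq.write_table(pa.table({"id": [1, 2]}), "old.parquet")
pq.write_table(pa.table({"id": [3], "email": ["a@b.c"]}), "new.parquet")

# Read both under the newer schema: missing columns come back as nulls.
evolved = pa.schema([("id", pa.int64()), ("email", pa.string())])
dataset = ds.dataset(["old.parquet", "new.parquet"], schema=evolved)
print(dataset.to_table())
```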

7.3. change type

7.3.1. can change int to long

7.3.2. can NOT change long to int

7.3.2.1. info loss
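
A small sketch using pyarrow array casts just to illustrate the information-loss point (not Parquet's schema-evolution mechanism itself); the values are made up:

```python
import pyarrow as pa

# Widening (int -> long) is safe: every int32 value fits in int64.
small = pa.array([1, 2, 3], type=pa.int32())
widened = small.cast(pa.int64())      # no information lost

# Narrowing (long -> int) is not: values that don't fit are rejected.
big = pa.array([1, 2, 3_000_000_000], type=pa.int64())
try:
    big.cast(pa.int32())
except pa.ArrowInvalid as err:
    print("cannot narrow:", err)
```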