Mind Map: Parquet

1. Columnar format

1.1. why?

1.1.1. only read the columns you need: scanning a subset of columns means faster I/O

1.1.2. like data is stored together: more homogeneous values give better compression and encoding, saving space; also better suited to a modern processor's pipeline, because branching is more predictable

1.1.3. type-specific encodings become possible, e.g. dictionary encoding: for a large number of strings drawn from a limited set of distinct values, map each string to a number and store only the number; given a number, do a lookup in the dictionary to find the string

1.1.4. skip unnecessary deserialization: unwanted columns are skipped and never deserialized

1.1.5. possible to operate on encoded data

1.1.6. natural fit for vectorized operations
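
The dictionary encoding idea from 1.1.3, and operating on encoded data from 1.1.5, can be sketched in a few lines of plain Python. This is only an illustration of the concept, not Parquet's actual on-disk encoding; the function names `dictionary_encode`, `dictionary_decode`, and `count_matches` are hypothetical.

```python
def dictionary_encode(values):
    """Return (dictionary, codes): each distinct value gets a small integer code."""
    dictionary = []  # code -> value
    index = {}       # value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def dictionary_decode(dictionary, codes):
    """Look each code up in the dictionary to recover the original values."""
    return [dictionary[code] for code in codes]


def count_matches(dictionary, codes, wanted):
    """Count occurrences of `wanted` by comparing integer codes, without decoding."""
    if wanted not in dictionary:
        return 0
    wanted_code = dictionary.index(wanted)
    return sum(1 for code in codes if code == wanted_code)


column = ["us", "fr", "us", "us", "de", "fr"]
dictionary, codes = dictionary_encode(column)
print(dictionary)  # distinct strings, in first-seen order
print(codes)       # one small integer per row
print(count_matches(dictionary, codes, "us"))
```

Only the integers are stored per row; the strings appear once in the dictionary, which is why the saving grows as values repeat.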

1.2. diff

1.2.1. Cassandra and HBase: key-value pairs

1.2.2. parquet

2. Design

2.1. Framework Independent

2.1.1. How does Parquet achieve this? It lets people specify their own converters, which convert Parquet data directly to your object; there is no intermediate Parquet object that is then transferred to your object. Sort of like a SAX-style model.
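
A rough sketch of the SAX-style converter idea, in Python. Everything here is hypothetical and chosen for illustration (`UserConverter`, `read_records`, and the `start`/`add_field`/`end` callbacks are not Parquet's real API): the reader pushes events, and the caller's converter builds its own objects directly, with no intermediate record object.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    age: int


class UserConverter:
    """Caller-supplied converter: receives events and builds User objects directly."""

    def __init__(self):
        self.users = []
        self._fields = {}

    def start(self):
        # A new record begins.
        self._fields = {}

    def add_field(self, name, value):
        # One field value of the current record.
        self._fields[name] = value

    def end(self):
        # The record is complete: materialize the caller's own type.
        self.users.append(User(**self._fields))


def read_records(columns, converter):
    """Hypothetical reader: walks columnar data and emits events per record."""
    names = list(columns)
    num_rows = len(columns[names[0]]) if names else 0
    for i in range(num_rows):
        converter.start()
        for name in names:
            converter.add_field(name, columns[name][i])
        converter.end()


columns = {"name": ["ann", "bob"], "age": [31, 27]}
converter = UserConverter()
read_records(columns, converter)
print(converter.users)
```

Because the reader only knows about the callback interface, a different framework can plug in its own converter and get its own object model from the same data.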

2.2. Row Groups

2.2.1. a group of rows in columnar format; its max size is buffered in memory while writing; one or more per split while reading; roughly 50MB to 1GB of buffer needed in memory

2.2.2. a collection of rows that will be converted to a columnar format; we don't convert the whole dataset at once, only a piece at a time

2.3. Column Chunk

2.3.1. data for one column in a row group; can be read independently for efficient scans; skipping column chunks makes reads and deserialization fast
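
The relationship between row groups (2.2) and column chunks (2.3) can be sketched as follows. This is a toy model under stated assumptions: row groups are sized by row count rather than by bytes, every row has the same fields, and the names `write_row_groups` and `scan_column` are hypothetical.

```python
def write_row_groups(rows, rows_per_group):
    """Buffer rows_per_group rows at a time and pivot each buffer into column chunks."""
    groups = []
    for start in range(0, len(rows), rows_per_group):
        buffered = rows[start:start + rows_per_group]  # the in-memory buffer
        chunks = {}  # column name -> column chunk (values of that column only)
        for row in buffered:
            for name, value in row.items():
                chunks.setdefault(name, []).append(value)
        groups.append(chunks)
    return groups


def scan_column(groups, name):
    """Read one column across all row groups, never touching other column chunks."""
    for chunks in groups:
        yield from chunks[name]


cities = ["nyc", "sf", "nyc", "la", "sf"]
rows = [{"id": i, "city": city} for i, city in enumerate(cities)]
groups = write_row_groups(rows, rows_per_group=2)
ids = list(scan_column(groups, "id"))
print(len(groups), ids)
```

Only one buffer of rows is held in memory while writing, and a scan of `id` skips every `city` chunk entirely.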

2.4. Page

2.4.1. unit of access in a column chunk; should be big enough for compression to be efficient; the minimum size to read to access a single record (when index pages are available); roughly 8KB to 1MB

2.4.2. contains: header; repetition & definition levels; data (encoded and compressed)

2.4.3. footer can be separate from the file so that you can rewrite it cheaply

2.5. block

2.5.1. The Parquet file block size should be no larger than the HDFS block size for the file so that each Parquet block can be read from a single HDFS block (and therefore from a single datanode). It is common to set them to be the same, and indeed both defaults are for 128 MB block sizes.
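
The arithmetic behind this advice can be checked with a small sketch. It makes a simplifying assumption that is not from the source: Parquet blocks are laid out back to back from offset 0, ignoring the file header, footer, and any padding. The function name `blocks_spanning_hdfs_blocks` is hypothetical.

```python
MB = 1024 * 1024


def blocks_spanning_hdfs_blocks(num_blocks, parquet_block_size, hdfs_block_size):
    """Count Parquet blocks whose byte range crosses an HDFS block boundary.

    Assumes blocks are contiguous from offset 0 with no header, footer, or padding.
    """
    spanning = 0
    for i in range(num_blocks):
        first_byte = i * parquet_block_size
        last_byte = first_byte + parquet_block_size - 1
        if first_byte // hdfs_block_size != last_byte // hdfs_block_size:
            spanning += 1
    return spanning


# Equal sizes: every Parquet block sits inside exactly one HDFS block.
print(blocks_spanning_hdfs_blocks(8, 128 * MB, 128 * MB))
# Parquet block larger than the HDFS block: every block needs more than one HDFS block.
print(blocks_spanning_hdfs_blocks(8, 256 * MB, 128 * MB))
```

A Parquet block that spans two HDFS blocks may have to be fetched from two datanodes, which is the cost the sizing rule avoids.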

2.6. file

3. Design Goal

3.1. Reduce storage cost for large datasets with stable schemas

3.2. Reduce required IO for queries

3.3. support nested data structures

3.3.1. maps, lists, JSON

3.3.2. Twitter

3.4. Work for multiple frameworks

3.4.1. write binary data to disk

3.4.2. want the data available by hive/pig/etc

3.4.3. however, if you have your custom binary format, this won't work

3.5. built-in flexibility to allow evolving the format

3.6. minimal requirements to adopt

3.6.1. no company SPOF (single point of failure?)

3.7. Absolutely no implicit Hive/Pig/Impala/etc preference

4. history

4.1. 2012

4.1.1. Fall: Twitter and Cloudera merge efforts to develop columnar formats

4.2. 2013

4.2.1. March: Hive support; Hive is able to read Parquet

4.2.2. July: 1.0 release

4.3. 2014

4.3.1. May: went to the Apache Incubator, going open source to the public

5. Nested Structures

6. Encoding

7. Question

7.1. a schema for a parquet file

7.2. add field

7.2.1. can add a column to an existing type

7.2.2. can NOT remove a column, because data would be lost for old data

7.3. change type

7.3.1. can change int to long

7.3.2. can NOT change long to int: information would be lost
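
The rules in 7.2 and 7.3 can be sketched as a small compatibility check. This is an illustration of the reasoning only, not how any Parquet library validates schemas; the `WIDENING` table, the type names `int32`/`int64`, and the functions below are assumptions made for the example.

```python
# Assumed set of safe promotions for this sketch: int (32-bit) to long (64-bit).
WIDENING = {("int32", "int64")}

INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def can_change_type(old_type, new_type):
    """A type change is safe only if it is a no-op or a known widening."""
    return old_type == new_type or (old_type, new_type) in WIDENING


def fits_int32(value):
    """True if a long value could be stored as an int without losing information."""
    return INT32_MIN <= value <= INT32_MAX


def read_with_schema(old_row, new_columns):
    """Read an old row under a newer schema: added columns come back as None,
    while columns missing from the new schema are dropped (their data is lost)."""
    return {name: old_row.get(name) for name in new_columns}


print(can_change_type("int32", "int64"))  # widening
print(can_change_type("int64", "int32"))  # narrowing
print(fits_int32(2**40))                  # a long that an int cannot hold

old_row = {"id": 7, "name": "ann"}
print(read_with_schema(old_row, ["id", "name", "email"]))  # column added
print(read_with_schema(old_row, ["id"]))                   # column removed
```

Adding a column is harmless because old rows simply have no value for it, whereas removing a column or narrowing a type discards information that old data already contains.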