[En][Learning][CMU 15721] Database Compression & Storage Layout

Zebing Lin

Developer. Triathlete.

[En][Learning][CMU 15721] Database Compression & Storage Layout

Dec 14, 2020

I/O is the main bottlenecks if the DBMS has to fetch data from disk. Key trade-off is speed vs. compression ratio

Database Compression

Must produce fixed-length values (exception: var-length data stored in separate pool)
Postpone decompression for as long as possible during query execution -> late materialization
Lossless

Zone Maps: Pre-compute columnar aggregations per block that allow the DBMS to check whether queries need to access it

Run-length encoding: the value of the attribute, the start position in the column segment, the # of elements in the run. Columns better sorted intelligently to maximize compression opportunities.

Delta encoding: recording the difference between values that follow each other in the same column

Store base value in-line or a separate look-up tablse
combine with RLE to get even better compression ratios

Incremental encoding: avoids duplicating common prefixes/suffixes dictionary encoding: order-preserving encoding: encoded values need to support sorting in the same order as original values. One array of variable length strings and another array with pointers that maps to string offsets.

One can think of an in-memory database as just a large array of bytes:

The schema tells the DBMS how to convert the bytes into the appropriate type
Each tuple is prefixed with a header that constrains its meta-data NULL Data Types: most common one is a bitmap in the tuple header that specifies what attributes are null Word-Alignment: padding/reordering

N-Array Storage model

Advantage:

Fast inserts, updates and deletes
Good for queries that need the entire tuple
Can use index-oriented physical storage Disadvantage:
Not good for scanning large portions of the table and/or a subset of the attributes

Decomposition Storage Model

Advantage:

Reduces the amount wasted work because the DBMS only reads the data that it needs
Better compression

Disadvantage:

Slow for point queries, inserts, updates, and deletes because of tuple splitting/stitching