What is a Columnar Database?
A columnar database, also known as a column-oriented database, stores each table's data by column rather than by row. This layout typically improves compression and speeds up analytical queries that read only a few columns.
How Columnar Databases Store Data
In a traditional row-oriented database, each row contains all the information for a particular record. For example, if you have a table containing customer data, each row would have data like customer ID, name, address, phone number, etc.
In a columnar database, the data is stored column-by-column. So all the customer IDs would be stored together in one column, all the names in another column, addresses in another, etc.
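The difference between the two layouts can be sketched in a few lines of Python. The customer table and the helper names `to_columns` and `get_row` are hypothetical, chosen only for illustration; a real engine stores columns in typed, compressed blocks rather than Python lists.

```python
# Hypothetical customer table in row-oriented form: one dict per record.
rows = [
    {"customer_id": 1, "name": "Ada", "city": "London"},
    {"customer_id": 2, "name": "Grace", "city": "New York"},
    {"customer_id": 3, "name": "Alan", "city": "London"},
]


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for key, value in row.items():
            columns.setdefault(key, []).append(value)
    return columns


def get_row(columns, index):
    """Rebuild one record by reading position `index` from every column."""
    return {name: values[index] for name, values in columns.items()}


columns = to_columns(rows)
print(columns["name"])       # ['Ada', 'Grace', 'Alan']
print(get_row(columns, 1))   # {'customer_id': 2, 'name': 'Grace', 'city': 'New York'}
```

Values that belong to the same record share the same position in every column, which is how the row can be reassembled when a query needs it.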
This columnar storage provides advantages in the following ways:
Columnar storage often achieves high compression ratios. Figures of 10x or more are commonly cited, though the actual ratio depends on the data. The values in a column share one type and tend to repeat: a customer name column, for example, contains many repeated first and last names. General-purpose algorithms such as LZW compress this kind of data well, and column stores commonly add lightweight encodings such as run-length and dictionary encoding.
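To see why repetitive columns compress well, here is a minimal sketch of two lightweight encodings often used on columns: dictionary encoding and run-length encoding. These are swapped in for illustration and are not the LZW algorithm named above. The function names and the sample `country` column are assumptions.

```python
def dictionary_encode(values):
    """Replace each value with a small integer code; return (dictionary, codes)."""
    dictionary = []   # code -> original value
    index = {}        # original value -> code
    codes = []
    for value in values:
        if value not in index:
            index[value] = len(dictionary)
            dictionary.append(value)
        codes.append(index[value])
    return dictionary, codes


def run_length_encode(values):
    """Collapse consecutive repeats into [value, count] pairs."""
    runs = []
    for value in values:
        if runs and runs[-1][0] == value:
            runs[-1][1] += 1
        else:
            runs.append([value, 1])
    return runs


def run_length_decode(runs):
    """Expand [value, count] pairs back into the original sequence."""
    out = []
    for value, count in runs:
        out.extend([value] * count)
    return out


# Hypothetical column that happens to be sorted, so equal values are adjacent.
country = ["DE", "DE", "DE", "FR", "FR", "US", "US", "US", "US"]
dictionary, codes = dictionary_encode(country)
runs = run_length_encode(codes)
print(dictionary)  # ['DE', 'FR', 'US']
print(runs)        # [[0, 3], [1, 2], [2, 4]]
```

Nine strings shrink to a three-entry dictionary plus three runs. Run-length encoding only pays off when equal values sit next to each other, which is one reason column stores often keep data sorted.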
Only data from the relevant columns needs to be read from disk. If a query selects only customer names, only the name column is read. For wide tables this greatly reduces I/O and speeds up queries.
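Reading only the needed columns can be illustrated with a toy in-memory table that records which columns a query touches. The `ColumnTable` class and its methods are hypothetical; in a real system each `read` would correspond to fetching a column file or block from disk.

```python
class ColumnTable:
    """Toy column-oriented table that tracks which columns are accessed."""

    def __init__(self, columns):
        self._columns = columns
        self.columns_read = set()

    def read(self, name):
        """Fetch one column and remember that it was touched."""
        self.columns_read.add(name)
        return self._columns[name]

    def select(self, *names):
        """Return result tuples built only from the requested columns."""
        return list(zip(*(self.read(name) for name in names)))


table = ColumnTable({
    "customer_id": [1, 2, 3],
    "name": ["Ada", "Grace", "Alan"],
    "address": ["1 Main St", "2 High St", "3 Park Ave"],
    "phone": ["555-0100", "555-0101", "555-0102"],
})

result = table.select("name")
print(result)              # [('Ada',), ('Grace',), ('Alan',)]
print(table.columns_read)  # {'name'}
```

The address and phone columns are never touched, whereas a row store would have to read every full record to answer the same query.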
Columnar storage also lends itself to vectorization: operating on a batch of column values at a time rather than iterating row by row. Because the values of a column sit contiguously in memory, modern CPUs can apply a single SIMD instruction to several of them at once.
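The following sketch contrasts row-at-a-time and column-at-a-time processing using only the standard library. Pure Python does not issue SIMD instructions itself, so this shows the shape of the idea rather than the hardware speedup. The `amounts` data and both function names are assumptions.

```python
from array import array

# Hypothetical order amounts stored as one contiguous, typed column of doubles.
amounts = array("d", [19.5, 5.0, 42.5, 7.25])

# The same data in row-oriented form: one dict per order.
rows = [{"order_id": i, "amount": a} for i, a in enumerate(amounts)]


def total_row_at_a_time(rows):
    """Visit each record and pull out the one field of interest."""
    total = 0.0
    for row in rows:
        total += row["amount"]
    return total


def total_column_at_a_time(column):
    """Hand the whole column to a single operation."""
    return sum(column)


print(total_row_at_a_time(rows))        # 74.25
print(total_column_at_a_time(amounts))  # 74.25
```

Engines written in native code apply the same pattern to fixed-size batches of column values, which lets the compiler or hand-written kernels use SIMD instructions.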
In principle, an update that changes one column touches only that column's data. In a row-oriented database, by contrast, an update typically rewrites the entire row.
However, deletes in a column store still require tombstones to mark deleted rows, which can affect compression, and an insert has to write a value to every column. As a result, inserts and whole-row updates are generally more efficient in row-oriented databases.
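A minimal sketch of tombstone-based deletes follows, assuming rows are identified by their position in each column. The `TombstoneColumnStore` class and its `compact` step are hypothetical and simplified; real systems usually keep tombstones in bitmaps or delete vectors and compact in the background.

```python
class TombstoneColumnStore:
    """Toy column store where deletes only mark positions as dead."""

    def __init__(self, column_names):
        self.columns = {name: [] for name in column_names}
        self.deleted = set()  # tombstones: positions of logically deleted rows

    def insert(self, record):
        """Append one value to every column so positions stay aligned."""
        values = [record[name] for name in self.columns]  # fail before mutating
        for name, value in zip(self.columns, values):
            self.columns[name].append(value)
        return len(self.columns[name]) - 1  # position of the new row

    def delete(self, position):
        """Mark a row as deleted without rewriting any column."""
        self.deleted.add(position)

    def scan(self, name):
        """Read one column, skipping tombstoned positions."""
        return [v for i, v in enumerate(self.columns[name]) if i not in self.deleted]

    def compact(self):
        """Physically drop tombstoned rows from every column."""
        for name, values in self.columns.items():
            self.columns[name] = [v for i, v in enumerate(values) if i not in self.deleted]
        self.deleted.clear()


store = TombstoneColumnStore(["customer_id", "name"])
for cid, name in [(1, "Ada"), (2, "Grace"), (3, "Alan")]:
    store.insert({"customer_id": cid, "name": name})

store.delete(1)
print(store.scan("name"))          # ['Ada', 'Alan']
print(len(store.columns["name"]))  # 3 (the deleted value is still stored)
store.compact()
print(len(store.columns["name"]))  # 2
```

Note that a single insert writes to every column, while a delete is cheap until compaction. This is the trade-off that makes write-heavy workloads a poor fit.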
Columnar Database Use Cases
Column stores are well suited for data warehousing, business intelligence and analytics workloads for the following reasons:
- Dataset sizes are large, so compression provides major savings in storage.
- Queries tend to do aggregations over large numbers of records, which column stores can complete faster.
- Low update frequency compared to read operations, so update/insert penalties are less important.
Columnar databases are not optimal for transactional workloads with frequent inserts and updates. Row stores perform better for these use cases.
Popular Columnar Databases
Some popular columnar databases and their underlying technologies include:
- Google BigQuery – Built on the Dremel distributed SQL query engine, with data stored in a columnar format.
- Amazon Redshift – Columnar storage and MPP architecture. Based on the ParAccel analytic database.
- Snowflake – Stores data in a columnar format on cloud object storage such as Amazon S3, with separate compute clusters called virtual warehouses. Also runs on Microsoft Azure and Google Cloud.
- Apache Parquet – Open source columnar format that works across data frameworks like Apache Spark, etc.
- Apache Kudu – Open source columnar storage engine for the Apache Hadoop ecosystem.
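- Cassandra – Wide-column NoSQL database. It is often listed alongside column stores, but it stores the rows of each partition together rather than storing each column separately.
- ScyllaDB – Wide-column NoSQL database compatible with Cassandra.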
One of the most popular cloud columnar databases is Google BigQuery.
Important features of Google BigQuery are:
- Serverless architecture – no infrastructure to set up or manage
- Petabyte scale queries using the Dremel columnar engine
- SQL support with extensions like ARRAY and STRUCT
- Integration with other Google services such as Google Sheets
- Built-in machine learning through BigQuery ML
BigQuery is designed for fast ad-hoc queries over read-only or append-only datasets. BI tools can hook directly into BigQuery to run SQL queries and visualize data.
Amazon Redshift provides petabyte-scale data warehousing on AWS infrastructure. Some key features of Amazon Redshift are:
- Columnar storage for high compression rates and query performance
- Massively Parallel Processing (MPP) architecture
- Advanced query optimization using machine learning
- Integration with data lakes in Amazon S3
- Automatic backups and monitoring
Redshift provides a scalable, managed data warehouse. Provisioned clusters have fixed capacity, so capacity planning is required; Redshift Serverless is available as an alternative that scales capacity automatically.
Column Stores vs Row Stores
Here’s a quick comparison between columnar and row oriented databases:
|Parameter|Column Stores|Row Stores|
|---|---|---|
|Storage|By column|By row|
|Query speed|Very fast for analytics|Fast for transactions|
|Use cases|Analytics & BI|Transactions & OLTP|
|Examples|Redshift, BigQuery, Vertica|MySQL, PostgreSQL, SQLite|
Implementing a Column Store
To implement a basic column store, these are the key steps:
- Store each column as a separate file on disk.
- Use metadata to track row membership across columns.
- Compress each column file using algorithms like LZW.
- Read only relevant columns for each query.
- Employ vectorized processing for column scans.
Writes need extra machinery: deletes require tombstones, and inserts need some scheme to keep row positions synchronized across the column files.
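The steps above can be sketched with the Python standard library. This is a minimal illustration under several assumptions: columns are serialized as JSON, compressed with `zlib` (DEFLATE, swapped in for LZW because it ships with Python), and row membership is tracked purely by position plus a row count in a metadata file. The file names and the functions `write_table`, `read_columns`, and `average` are hypothetical.

```python
import json
import tempfile
import zlib
from pathlib import Path


def write_table(directory, columns):
    """Write each column to its own compressed file, plus a small metadata file."""
    directory = Path(directory)
    row_count = len(next(iter(columns.values())))
    for name, values in columns.items():
        if len(values) != row_count:
            raise ValueError(f"column {name!r} has a different length")
        payload = json.dumps(values).encode("utf-8")
        (directory / f"{name}.col").write_bytes(zlib.compress(payload))
    # Metadata: rows are identified by position, so only the count is needed.
    meta = {"row_count": row_count, "columns": list(columns)}
    (directory / "_meta.json").write_text(json.dumps(meta))


def read_columns(directory, names):
    """Decompress and load only the requested column files."""
    directory = Path(directory)
    return {
        name: json.loads(zlib.decompress((directory / f"{name}.col").read_bytes()))
        for name in names
    }


def average(directory, name):
    """Aggregate over a single column without touching the others."""
    values = read_columns(directory, [name])[name]
    return sum(values) / len(values)


with tempfile.TemporaryDirectory() as tmp:
    write_table(tmp, {
        "customer_id": [1, 2, 3],
        "name": ["Ada", "Grace", "Alan"],
        "age": [30, 40, 50],
    })
    avg_age = average(tmp, "age")
    print(avg_age)  # 40.0
```

The aggregate opens only `age.col`. Vectorized scans, tombstones, and synchronized inserts are left out to keep the sketch short.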
More advanced implementations allow:
- Distributed storage across nodes
- Parallelized queries using an MPP architecture
- Caching hot columns in memory
- Encrypted and compressed storage for security
But the core ideas remain the same – split tables into columns, then store and process them separately.
Is a Columnar Database SQL or NoSQL?
Columnar databases can be either SQL or NoSQL.
SQL columnar databases like Amazon Redshift, Google BigQuery, and Snowflake support standard SQL for querying data. They use a columnar storage engine underneath to provide performance benefits transparently to the user. The columnar nature is handled internally, while exposing standard relational SQL interfaces.
NoSQL databases like Cassandra and ScyllaDB are often grouped with columnar databases, but they are more precisely wide-column stores: data is organized into column families and partitioned by key, and the rows of a partition are stored together. They are queried with CQL, a SQL-like language, rather than standard SQL.
Advantages of Columnar Databases
Here are some of the advantages columnar databases provide:
- Compression – High compression ratios, often cited as 10x or more, reduce storage costs.
- Query Performance – Accessing only necessary columns makes queries much faster.
- Vectorization – Single operations on column vectors leverage CPU parallelism.
- Analytics – Perfect fit for analytics/BI workloads with aggregations across huge datasets.
- Flexibility – New analytic indices can be added without affecting data.
- Scalability – Scale out using distributed storage and parallel execution.
- Availability – Replication and distributed query engines provide high availability.
Disadvantages of Columnar Databases
Columnar databases also come with some limitations:
- Updates – Modifying and deleting data is slower and more complex.
- Transactions – Row-level locking is often limited or absent, which makes concurrent transactions difficult.
- Overheads – Added complexity for inserts, deletes and locking mechanisms.
- Joins – Joins on unsorted data can be slow. Efficient joins rely on sorted data or indices.
- Not fit for OLTP – Write overheads and slow point queries make them a poor match.
So columnar stores are not meant for transactional systems. They best suit analytical workloads.
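The point about joins and sortedness in the list above can be made concrete with a merge join over two sorted key columns. This is a sketch of one standard technique, not a description of how any particular product joins tables. The function name `merge_join` and the sample columns are assumptions.

```python
def merge_join(left_keys, right_keys):
    """Return (left_pos, right_pos) pairs whose keys match.

    Both key columns must already be sorted in ascending order.
    """
    pairs = []
    i = j = 0
    while i < len(left_keys) and j < len(right_keys):
        if left_keys[i] < right_keys[j]:
            i += 1
        elif left_keys[i] > right_keys[j]:
            j += 1
        else:
            key = left_keys[i]
            # Find the end of the run of equal keys on each side.
            i_end = i
            while i_end < len(left_keys) and left_keys[i_end] == key:
                i_end += 1
            j_end = j
            while j_end < len(right_keys) and right_keys[j_end] == key:
                j_end += 1
            # Every left row in the run matches every right row in the run.
            for a in range(i, i_end):
                for b in range(j, j_end):
                    pairs.append((a, b))
            i, j = i_end, j_end
    return pairs


# Hypothetical columns from a customers table and an orders table.
customer_ids = [1, 2, 3]
customer_names = ["Ada", "Grace", "Alan"]
order_customer_ids = [1, 1, 3]
order_amounts = [20, 35, 50]

pairs = merge_join(customer_ids, order_customer_ids)
joined = [(customer_names[a], order_amounts[b]) for a, b in pairs]
print(joined)  # [('Ada', 20), ('Ada', 35), ('Alan', 50)]
```

With sorted inputs each key column is scanned once. Without sortedness the engine must sort first or build a hash table or index, which is where the extra cost comes from.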
Why Use Columnar Databases?
Column-oriented databases are a strong fit for modern analytics pipelines that deal with huge datasets. By storing data column by column and processing each column separately, they can improve compression and query performance by an order of magnitude or more on suitable workloads. This makes analysis over massive datasets practical, which is why leading analytic solutions are built on columnar storage and processing.
However, column stores are not optimal for transactional workloads. Row oriented databases are better suited for point lookups, updates and inserts. The strengths of both can be obtained by using columnar storage for analytics and row storage for transactions.