Column store database formats like ORC and Parquet reach new levels of performance
Growing in popularity in the big data Hadoop world
It was not all that long ago (2005) that Stonebraker et al. wrote the paper entitled “C-Store: A Column-Oriented DBMS,” which called for an architecture that stores data in columns rather than rows. When the team at MIT first released the concept of a columnar database, they were predicting that big data was on the way and that new ways of storing and accessing data were in the cards. This paper led to many new technologies, including HPE Vertica.
It’s ten-plus years later, and the column store engine designed by the HPE Vertica team has been proven in countless success stories as an enterprise-grade solution for big data analytics. It works at petabyte scale because it economizes on storage by dense-packing the data and runs on a shared-nothing cluster of machines. It works for analytics because it can serve hundreds or thousands of concurrent users in their quest for business information.
Along the way, however, the world was excited, hyped even, by the promise of a new set of technologies powered by the open source community and Hadoop. The promise of an open source MPP environment, and the ability to apply the memory and compute resources of an entire cluster to a single task, was welcomed. Part of the promise of the Hadoop data lake is that data can be stored in any format and accessed by analytical tools. The catch is that data stored in just any format is difficult to query quickly. Many Hadoop-based analytics began life by saving data in raw semi-structured formats.
When it became clear that the raw formats were not delivering analytics fast enough for the Hadoop world, file formats like ORC and Parquet were developed, both of which offer compression and columnar storage. With this latest release, HPE Vertica now supports fast data access to both ORC and Apache Parquet. Apache Parquet is a columnar storage format available to the Hadoop ecosystem and is particularly popular in Cloudera distributions. Like Vertica’s native file format, ORC and Parquet are compressed, efficient columnar formats. In our testing, these formats are much more performant than raw formats. This is a big step forward in Hadoop delivering on the promise of big data analytics. In the same testing, our own Vertica ROS format is still the fastest by far, but the others offer a new boost to performing analytics on Hadoop.
Why Column Store?
One of the reasons the industry has adopted columnar databases is to work around the relatively slow data transfer rate of hard disk compared with memory. For example, while RAM can reach transfer rates of 17 GB/s (gigabytes per second), a Serial ATA (SATA) hard drive has a maximum transfer rate of 600 MB/s, or 0.6 GB/s. Both of these numbers are being improved all the time, but they illustrate the huge difference between disk and memory. There is a strong reason to come up with ways to limit the amount of data that a disk has to read and write.
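To make the gap concrete, here is a rough back-of-the-envelope sketch using the two transfer rates quoted above. The 1 TB table size and the 3-of-30 column split are invented for illustration, and the arithmetic assumes equally sized columns and ignores seek time, caching and parallel disks.

```python
# Back-of-the-envelope scan times using the transfer rates quoted in the text.
RAM_GB_PER_S = 17.0    # memory transfer rate from the text
SATA_GB_PER_S = 0.6    # SATA maximum transfer rate from the text


def scan_seconds(size_gb: float, rate_gb_per_s: float) -> float:
    """Time to stream size_gb of data at a sustained transfer rate."""
    return size_gb / rate_gb_per_s


# Hypothetical workload: a 1 TB table with 30 equally sized columns,
# and a query that only needs 3 of them.
table_gb = 1000.0
columns_total = 30
columns_read = 3

full_scan = scan_seconds(table_gb, SATA_GB_PER_S)
column_scan = scan_seconds(table_gb * columns_read / columns_total, SATA_GB_PER_S)
ram_scan = scan_seconds(table_gb, RAM_GB_PER_S)

print(f"whole table from disk:    {full_scan:8.1f} s")
print(f"3 of 30 columns from disk: {column_scan:7.1f} s")
print(f"whole table from memory:  {ram_scan:8.1f} s")
```

A row store has to stream every row in full to answer the query, while a column store only touches the columns the query names, which is where the tenfold saving in this toy example comes from.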
Compression is a tested and viable solution to overcome the disparity. The processing power needed to compress and decompress data is minimal and good compression may reduce the amount of data you have to read from disk by a ratio of up to 40:1. Even in-memory databases have embraced columnar, perhaps for a different motive. If you compress, you can potentially use 1/40th of the RAM, and when you conserve RAM, you can fit more data in memory. Columnar with compression is both powerful for disk access and prudent to conserve memory resources.
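To get a feel for how much a repetitive column can shrink, here is a small sketch using Python's standard zlib module. The country-code column is invented for illustration, and sorted, low-cardinality data like this is close to a best case. zlib simply stands in for whatever encoding a real column store would pick, so the ratio it prints is not a benchmark of any product or file format.

```python
import zlib

# Hypothetical column: one country code per row, already sorted.
countries = ["DE"] * 40_000 + ["FR"] * 25_000 + ["US"] * 35_000
raw = "\n".join(countries).encode("utf-8")

# Compress the column as a single block (level 6 is zlib's usual default).
compressed = zlib.compress(raw, 6)
ratio = len(raw) / len(compressed)

# Compression is lossless: decompressing gives back the original bytes.
restored = zlib.decompress(compressed)

print(f"raw bytes:        {len(raw)}")
print(f"compressed bytes: {len(compressed)}")
print(f"ratio:            {ratio:.0f}:1")
```

Because every value in a column has the same type and similar values tend to sit next to each other, a general-purpose compressor finds far more repetition than it would in row-by-row data where IDs, names and timestamps are interleaved.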
There are some details that need to be ironed out for columnar. For example, there’s a bit of overhead in taking a row of data, converting it to columns and compressing it. Advanced databases like HPE Vertica have systems in place so that a sudden burst of INSERTs won’t overwhelm them. The system contains write-optimized storage that accumulates inserts and gives the system time to process them for optimization in the column store.
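The general idea of buffering writes before moving them into columns can be sketched in a few lines. This is a toy illustration only, not how Vertica implements it: the class name `BufferedColumnTable`, the `flush_threshold` parameter and the method names are all hypothetical, and a real system would also sort, encode and persist the data during the move.

```python
class BufferedColumnTable:
    """Toy table: rows land in a row-oriented buffer, then move to columns in bulk."""

    def __init__(self, column_names, flush_threshold=1000):
        self.column_names = list(column_names)
        self.flush_threshold = flush_threshold      # hypothetical tuning knob
        self.write_buffer = []                      # row tuples, cheap to append
        self.columns = {name: [] for name in self.column_names}

    def insert(self, row):
        """Accept one row; pivot to columns only once enough rows have piled up."""
        if len(row) != len(self.column_names):
            raise ValueError("row width does not match the table's columns")
        self.write_buffer.append(tuple(row))
        if len(self.write_buffer) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Pivot all buffered rows into the column lists in a single batch."""
        if not self.write_buffer:
            return
        for name, values in zip(self.column_names, zip(*self.write_buffer)):
            self.columns[name].extend(values)
        self.write_buffer.clear()

    def scan(self, name):
        """Read one column, including rows that are still sitting in the buffer."""
        index = self.column_names.index(name)
        return self.columns[name] + [row[index] for row in self.write_buffer]


table = BufferedColumnTable(["id", "city"], flush_threshold=3)
for row in [(1, "Boston"), (2, "Austin"), (3, "Denver"), (4, "Boston")]:
    table.insert(row)

print(table.columns["id"])    # rows that have been moved into column storage
print(table.write_buffer)     # rows still waiting in the buffer
print(table.scan("city"))     # queries see both
```

The point of the design is that each INSERT costs one cheap append, while the expensive row-to-column conversion is paid once per batch instead of once per row.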
Another detail is compression optimization. There are different types of compression for different types of data. Integers will compress more efficiently with certain algorithms while strings compress better with others. ORC and Parquet let you specify which algorithms to use, while more advanced solutions will examine the column and auto-select the optimal compression. Additionally, you shouldn’t necessarily have to decompress the data in order to perform analytics. The analytics engine itself should be able to understand compressed data. The advanced solutions have both of these details figured out, while the newer solutions still require more manual work.
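Run-length encoding is one simple example of an encoding that an engine can query without decompressing first. The sketch below is a minimal illustration of that idea in plain Python; the function names and the sample column are hypothetical, and it is not the encoding used by any particular product or file format.

```python
from itertools import groupby


def rle_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, sum(1 for _ in run)) for value, run in groupby(values)]


def rle_decode(runs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, length in runs for _ in range(length)]


def rle_sum(runs):
    """Sum the column without expanding it: each run adds value * length."""
    return sum(value * length for value, length in runs)


def rle_count_equal(runs, target):
    """Count rows equal to target by adding up the matching run lengths."""
    return sum(length for value, length in runs if value == target)


# Hypothetical sorted column of order quantities (12 rows).
quantities = [1] * 5 + [2] * 3 + [7] * 4
runs = rle_encode(quantities)

print(runs)                        # [(1, 5), (2, 3), (7, 4)]
print(rle_sum(runs))               # 39
print(rle_count_equal(runs, 2))    # 3
```

Here the aggregate touches three runs instead of twelve rows. Run-length encoding pays off on sorted or low-cardinality columns; a column of unique values would gain nothing from it, which is why choosing the encoding per column matters.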
Adoption of the columnar format is good for business and good for those who need to perform analytics fast. It’s no wonder that Stonebraker’s concepts are still finding good uses in the big data analytics world.
Sponsored by HPE.