Q&A with Data Engineers: Jamie O’Mahony
Jamie O’Mahony is a Senior Kx Solutions Architect who, over the past four years, has built a number of complex enterprise database systems at financial institutions around the world. Jamie is currently based in New York City.
Q1. What lessons did you learn from using shared-nothing scale-out data architectures to support large volumes of data?
At times “Big Data” is not as big as people think, and therefore the scaling potential offered by this architecture is not required. To define the terms: “shared-nothing” applies to distributed systems in which each node is independent and shares neither memory nor disk; “scale-out” refers to horizontal scaling, that is, increasing the capacity of a system by adding more nodes, each bringing additional storage and CPUs.
In the financial domain, where datasets routinely exceed hundreds of billions of records, these techniques have been used in production systems for decades, often on single large-memory machines or with very simple architectures.
Fast access to the data, and the ability to analyze it in an environment that understands time series, is critical.
In this area, the need to join time-series data efficiently across multiple tables has meant there is less focus on shared-nothing architectures.
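To make the point concrete, here is a minimal Python/pandas sketch of the kind of time-series join meant here: an as-of join that attaches to each trade the most recent quote at or before its timestamp (in kdb+/q this is the built-in aj join). The table and column names are illustrative only.

```python
# As-of join: for each trade, find the prevailing quote at or before the trade time.
import pandas as pd

trades = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-02 09:30:01", "2024-01-02 09:30:05", "2024-01-02 09:30:09"]),
    "sym": ["AAPL", "AAPL", "AAPL"],
    "price": [150.10, 150.12, 150.08],
})
quotes = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-02 09:30:00", "2024-01-02 09:30:04", "2024-01-02 09:30:08"]),
    "sym": ["AAPL", "AAPL", "AAPL"],
    "bid": [150.05, 150.09, 150.04],
    "ask": [150.15, 150.16, 150.12],
})

# Both frames must be sorted on the join column for merge_asof.
joined = pd.merge_asof(trades.sort_values("time"), quotes.sort_values("time"),
                       on="time", by="sym")
print(joined)
```

Joins like this are far simpler when the tables sit together on one machine than when they are partitioned across many shared-nothing nodes.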
Q2. What kind of data infrastructure do you use to support applications?
Data architectures supported by kdb+ are easily extensible. Through industry benchmarks, kdb+ has demonstrated the performance needed to query billions of records very quickly. These architectures take advantage of the many cores available in production systems.
Lambda architecture is frequently the design paradigm in financial services trading applications due to its simplicity.
This is because, with a Lambda architecture, a system can handle massive quantities of data by combining batch processing of historical data with stream processing of real-time data. Although the term has only recently regained popularity, Kx has been following this methodology for over 20 years.
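As a rough sketch of what that means in practice (the names and numbers below are illustrative, not taken from any particular Kx product), a query in a Lambda-style design is answered by merging a precomputed batch view of historical data with a real-time view that is updated as events stream in:

```python
# Simplified Lambda pattern: batch layer + stream layer + serving layer.
from collections import defaultdict

# Batch view: precomputed over historical data, e.g. total traded volume up to yesterday.
batch_view = {"AAPL": 1_250_000, "MSFT": 900_000}

# Real-time view: updated incrementally as today's events arrive.
realtime_view = defaultdict(int)

def on_trade(sym: str, size: int) -> None:
    """Stream layer: fold each incoming trade into the real-time view."""
    realtime_view[sym] += size

def total_volume(sym: str) -> int:
    """Serving layer: merge the batch (historical) and real-time (intraday) results."""
    return batch_view.get(sym, 0) + realtime_view.get(sym, 0)

on_trade("AAPL", 500)
on_trade("AAPL", 300)
print(total_volume("AAPL"))  # 1250800
```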
Using a single unified programming language further simplifies the data architecture. In applications with more layers in the technology stack, data is moved between layers, adding latency and complexity and causing both performance and maintenance problems. For communication within the data infrastructure, a series of open-source APIs is provided for integration with other programming languages, as well as JDBC/ODBC interfaces for communication with third-party databases.
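As an example of the interface side, here is a hedged sketch of querying the database from another part of the stack over ODBC using Python's pyodbc package; the DSN name and table are hypothetical and assume an ODBC driver has already been configured for the database.

```python
# Query the database over ODBC from an external application.
import pyodbc

conn = pyodbc.connect("DSN=marketdata")          # hypothetical data source name
cursor = conn.cursor()
cursor.execute("SELECT sym, price FROM trades")  # illustrative query
for sym, price in cursor.fetchall():
    print(sym, price)
conn.close()
```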
Q3. Do you have some advice for people looking to handle petabyte databases?
It is commonly assumed that a petabyte database requires a massively distributed solution. However, this is not how the problem is solved in the financial services industry.
Using high-end commodity hardware, the largest banks and trading operations build sophisticated systems using very simple, elegant designs on as few machines as possible. When transaction speed is a competitive advantage, only the most efficient solutions survive.
People don’t realize how much can be done with a single machine if a little more thought is put into the design up front. In my experience, people are too quick to jump on the latest technology bandwagon.
If people rush into implementation instead of taking the time to really understand the underlying business logic and to design a system that reflects that logic, they pay the price in performance and in maintaining and enhancing the system over the life of the application.
Q4. What are the typical mistakes made on large scale data projects? In your opinion, how can they be avoided in practice?
The design stage of a large, complex project must be given sufficient importance at the start. An essential part of the process is choosing the appropriate technology. Over the last few years there have been some popular Big Data “solutions” that organizations have rushed to adopt as the basis of their enterprise systems without understanding the full implications of their choices. Not only have these systems failed to deliver on their promises, they have also created maintenance nightmares.
It is very important – already at the design stage – to involve a system administrator. The system administrator should ensure that the hardware and software are optimized in tandem. Their deep understanding of the system will pay off later when they are needed for troubleshooting issues.
Another critical omission that I have seen in a number of projects is that when software choices were being made, the organization neglected to nominate internal staff to become proficient in the technologies chosen.
Q5. How do you ensure data quality?
Ensuring data quality is not a process to be “bolted on” later; it should be an integral element of any Big Data project, both for the initial load of data from multiple legacy sources and later for the incremental loading of current data.
The database chosen plays a role in ensuring data quality. When you have a highly performant database, you can do much more extensive checking before allowing the data into your “master copy.” A high performance database allows in-depth analysis of the new data in the context of the existing system – and makes it much easier to zero in on problems early on.
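As a minimal illustration of that kind of pre-load check (the column names and rules below are purely illustrative), incoming records can be validated against the existing master copy before being appended:

```python
# Validate new records in the context of existing data before merging.
import pandas as pd

master = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-02 09:30:00"]),
    "sym": ["AAPL"],
    "price": [150.10],
})
incoming = pd.DataFrame({
    "time": pd.to_datetime(["2024-01-02 09:30:05", "2024-01-02 09:30:06"]),
    "sym": ["AAPL", "AAPL"],
    "price": [150.12, -1.0],   # second row is deliberately bad
})

def validate(new: pd.DataFrame, existing: pd.DataFrame) -> pd.DataFrame:
    """Keep only rows that pass basic quality checks against the existing data."""
    ok = new["price"] > 0                           # values must be plausible
    ok &= new["time"] >= existing["time"].max()     # no out-of-order history
    ok &= ~new.duplicated(subset=["time", "sym"])   # no duplicate keys
    return new[ok]

master = pd.concat([master, validate(incoming, master)], ignore_index=True)
print(master)   # only the valid incoming row has been added
```

The faster such checks run, the more of them you can afford to apply before data reaches the master copy.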