Big Data and the financial services industry. Interview with Simon Garland
“The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.”–Simon Garland
Q1. Talking about the financial services industry, what types of data and what quantities are common?
Simon Garland: The type of data we see the most is market data, which comes from exchanges like the NYSE, dark pools and other trading platforms. This data may consist of many billions of records of trades and quotes of securities with up to nanosecond precision — which can translate into many terabytes of data per day.
The data comes in through feed-handlers as streaming data. It is stored in-memory throughout the day and is appended to the on-disk historical database at the day’s end. Algorithmic trading decisions are made on a millisecond basis using this data. The associated risks are evaluated in real-time based on analytics that draw on intraday data that resides in-memory and historical data that resides on disk.
Q2. What are the most difficult data management requirements for high performance financial trading and risk management applications?
Simon Garland: There has been a decade-long arms race on Wall Street to achieve trading speeds that get faster every year. Global financial institutions in particular have spent heavily on high performance software products, as well as IT personnel and infrastructure just to stay competitive. Traders require accuracy, stability and security at the same time that they want to run lightning fast algorithms that draw on terabytes of historical data.
Traditional databases cannot perform at these levels. Column store databases are generally recognized to be orders of magnitude faster than regular RDBMS; and a time-series optimized columnar database is uniquely suited for delivering the performance and flexibility required by Wall Street.
Q3. And why is this important for businesses?
Simon Garland: Orders of magnitude improvements in performance will open up new possibilities for “what-if” style analytics and visualization; speeding up their pace of innovation, their awareness of real-time risks and their responsiveness to their customers.
The Internet of Things in particular is important to businesses who can now capitalize on the digitized time-series data they collect, like from smart meters and smart grids. In fact, I believe that this is only the beginning of the data volumes we will have to be handling in the years to come. We will be able to combine this information with valuable data that businesses have been collecting for decades.
Q4. One of the promise of Big Data for many businesses is the ability to effectively use both streaming data and the vast amounts of historical data that will accumulate over the years, as well as the data a business may already have warehoused, but never has been able to use. What are the main challenges and the opportunities here?
Simon Garland: This can seem like a challenge for people trying to put a system together from a streaming database; an in-memory database from a different vendor, and an historical database from yet another vendor. They then pull data from all of these applications into yet another programming environment. This method cannot give performance and long term is fragile and unmaintainable.
The opportunity here is for a database platform that unifies the software stack, like kdb+, that is robust, easily scalable and easily maintainable.
Q5. How difficult is to combine and process streaming, in-memory and historical data in real time analytics at scale?
Simon Garland: This is an important question. These functionalities can’t be added afterwards. Kdb+ was designed for streaming data, in-memory data and historical data from the beginning. It was also designed with multi-core and multi-process support from the beginning which is essential for processing large amounts of historical data in parallel on current hardware.
We were doing this for decades, even before multi-core machines existed — which is why Wall Street was an early adopter of our technology.
Q6. q programming language vs. SQL: could you please explain the main differences? And also highlight the Pros and cons of each.
Simon Garland: The q programming language is built into the database system kdb+. It is an array programming language that inherently supports the concepts of vectors and column store databases rather than the rows and records that traditional SQL supports.
The main difference is that traditional SQL doesn’t have a concept of order built in, whereas the q programming language does. Unlike traditional SQL, the language q contains a concept of order. This makes complete sense when dealing with time-series data.
Q is intuitive and the syntax is extremely concise, which leads to more productivity, less maintenance and quicker turn-around time.
Q7. Could you give us some examples of successful Big Data real time analytics projects you have been working on?
Simon Garland: Utility applications are using kdb+ for millisecond queries of tables with hundreds of billions of data points captured from millions of smart meters. Analytics on this data can be used for balancing power generation, managing blackouts and for billing and maintenance.
Internet companies with massive amounts of traffic are using kdb+ to analyze Googlebot behavior to learn how to modify pages to improve their ranking. They tell us that traditional databases simply won’t work when they have 100 million pages receiving hundreds of millions of hits per day.
In industries like pharmaceuticals, where decision-making is based on data that can be one day, one week or one month old, our customers and prospects say our column store database makes their legacy data warehouse software obsolete. It is many times faster on the same queries. The time needed for complex analyses on extremely large tables has literally been reduced from hours to seconds.
Q8. Are there any similarities in the way large data sets are used in different vertical markets such as financial service, energy & pharmaceuticals?
Simon Garland: The shared feature is that all of our customers have structured, time-series data. The scale of their data problems are completely different, as are their business use cases. The financial services industry, where kdb+ is an industry standard, demands constant improvements to real-time analytics.
Other industries, like pharma, telecom, oil and gas and utilities, have a different concept of time. They also often are working with smaller data extracts, which they often still consider “Big Data.” When data comes in one day, one week or one month after an event occurred, there is not the same sense of real-time decision making as in finance. Having faster results for complex analytics helps all industries innovate and become more responsive to their customers.
Q9. Anything else you wish to add?
Simon Garland: If we piqued your interest, we have a free, 32-bit version of kdb+ available for download on our web site.
Simon Garland, Chief Strategist, Kx Systems
Simon is responsible for upholding Kx’s high standards for technical excellence and customer responsiveness. He also manages Kx’s participation in the Securities Trading Analysis Center, overseeing all third-party benchmarking.
Prior to joining Kx in 2002, Simon worked at a database search engine company.
Before that he worked at Credit Suisse in risk management. Simon has developed software using kdb+ and q, going back to when the original k and kdb were introduced. Simon received his degree in Mathematics from the University of London and is currently based in Europe.
Follow ODBMS.org on Twittwer: @odbmsorg