Q&A with Data Engineers: Victor Olex
Victor Olex is the founder and CEO of SlashDB, an automated REST API gateway to databases. He is a pioneer of Resource Oriented Architecture, a new approach to data integration in which database content becomes discoverable and easily accessible in full context and in standard data formats. Mr. Olex has over 20 years of experience as a technology architect and software engineering consultant in the financial industry, including organizations such as Bank of America, Merrill Lynch, J.P. Morgan and Commerzbank, as well as a number of hedge funds. He began his career at CGI (formerly American Management Systems), where he built billing systems for Europe’s leading mobile telephony operators. A native of Poland, he immigrated to the U.S., where he lives with his wife.
In his spare time Victor enjoys snowboarding, tennis and cycling. Victor believes that, above all, motivation is the key to success.
Q1. What are the technical challenges in master data management for the securities industry?
The securities industry and other information-heavy enterprises such as telecom continue to face enormous challenges in data integration. The primary reason for that is the large number of information systems that were purpose-built for various aspects of their business operations. Those systems typically have a well thought out information schema and function well in their respective areas. But no department or business function works in complete isolation, so data integration is a must.
Historically, to combat those data silos, techniques such as file sharing, Extract, Transform and Load (ETL), Data Warehousing (DW), Enterprise Service Bus (ESB) and web services based on the Simple Object Access Protocol (SOAP) have been employed.
Each solution comes with its own shortcomings. Files are disjoint from the source of record and subject to errors in parsing. ETL/DW never has the full context of the original systems and doesn’t scale well. ESB and SOAP have proven good choices for distributed systems, but because they are so complicated to use, they fail miserably for tasks such as reporting or business intelligence.
Those systems are expensive too. A 2015 study by Forrester Research found that global spending on systems integration work and middleware exceeded 487 billion dollars.
Q2. What are the technical challenges of web and mobile applications?
Web and mobile applications consist of three components: a user interface, a web application service and a persistent store – typically a database. This three-tier architecture is well understood by the development community, and the main challenge is in delivering and extending those often auxiliary applications quickly, but without sacrificing the security of transactional systems.
Older web applications can be hard to modernize with a responsive user interface or to port to mobile front ends because the entire user interface, consisting of static parts and dynamic data content, is rendered on the server side. Conversely, modern applications utilize REST APIs to access data and rely on the browser to weave it into the page’s static elements. Development of those web APIs is relatively time consuming because not only does it involve programming the data access, but it also needs to address issues such as high availability and security.
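The division of labor described above can be sketched in a few lines: the server’s only job is to render database rows as JSON, leaving all presentation to the browser. This is a minimal illustration, assuming a hypothetical `customer` table; it is not SlashDB’s implementation.

```python
import json
import sqlite3

# A toy "persistent store": an in-memory SQLite database with one table.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT)")
conn.executemany("INSERT INTO customer (id, name) VALUES (?, ?)",
                 [(1, "Acme Corp"), (2, "Globex")])

def get_customers():
    """What a REST data endpoint returns: rows rendered as JSON,
    with all styling and layout left to browser-side code."""
    rows = conn.execute("SELECT id, name FROM customer").fetchall()
    return json.dumps([{"id": r[0], "name": r[1]} for r in rows])

print(get_customers())  # [{"id": 1, "name": "Acme Corp"}, {"id": 2, "name": "Globex"}]
```

In a real deployment this function would sit behind an HTTP endpoint, and the high-availability and security concerns mentioned above would be handled around it.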
Q3. What is a Resource Oriented Architecture?
Wikipedia defines Resource Oriented Architecture as a “style of software architecture and programming paradigm for designing and developing software in the form of resources with RESTful interfaces”. My own definition is this: uniform and unobstructed access to all data assets for reading and writing, in alternative representations, over the Hypertext Transfer Protocol (HTTP).
Under resource-oriented architecture, applications are programmed in terms of references to data instead of in the context of data that must be local to the application. A well implemented ROA should connect related data with hyperlinks in order to facilitate discovery and search. At the same time, it should adequately protect the stores of record from any negative impact on performance and from unauthorized access. We think we do a pretty good job at it with SlashDB.
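The hyperlinking idea can be made concrete with a small sketch: a database row is rendered as a resource whose related rows are reachable by URL rather than copied in. The gateway URL, the `artist`/`album` tables and the `_links` field name are all hypothetical here, chosen only to illustrate the pattern.

```python
BASE = "https://api.example.com"  # hypothetical gateway URL

def artist_resource(artist_id, name, album_ids):
    """Render one database row as a resource whose related rows
    (albums) are referenced by hyperlink, enabling discovery."""
    return {
        "name": name,
        "_links": {
            "self": f"{BASE}/artist/{artist_id}",
            "albums": [f"{BASE}/album/{a}" for a in album_ids],
        },
    }

resource = artist_resource(7, "Miles Davis", [101, 102])
print(resource["_links"]["albums"])
# ['https://api.example.com/album/101', 'https://api.example.com/album/102']
```

A client that understands nothing about the schema can still walk from an artist to its albums by following links, which is what makes the data discoverable.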
Another way to explain ROA is to contrast it with Service Oriented Architecture (SOA), which it complements.
| Service Oriented | Resource Oriented |
| --- | --- |
| SOAP/XML | Plain HTTP/JSON |
| Complex programming interfaces | Easy to inspect |
| Endpoint represents Action | Endpoint represents state |
| Transaction, Unit of Work | Addressable resource |
| Message (function call) | Update to a resource |
| API controlled by functional design | API evolves with data |
| Harder to adapt and scale beyond “enterprise” | Harder to support multi-step transactions |
| Harder to deprecate functionality | Clients are expected to be resilient to change |
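The first two rows of the table can be seen side by side in a small comparison: the same read expressed as a SOAP call and as a plain resource URL (the service, operation and URL are hypothetical). The envelope needs tooling to construct and parse; the URL can be typed into a browser.

```python
# Hypothetical SOAP request: an XML envelope POSTed to a single endpoint.
soap_request = """<?xml version="1.0"?>
<soap:Envelope xmlns:soap="http://schemas.xmlsoap.org/soap/envelope/">
  <soap:Body>
    <GetCustomer xmlns="http://example.com/crm"><Id>42</Id></GetCustomer>
  </soap:Body>
</soap:Envelope>"""

# The resource-oriented equivalent: the endpoint *is* the state's address.
rest_request = "GET https://api.example.com/crm/customer/42"

print(len(soap_request), "vs", len(rest_request), "characters")
```

The difference in ceremony, not just length, is why the table lists "easy to inspect" on the resource-oriented side.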
Q4. Can a Resource Oriented Architecture handle data volume diversity?
RESTful APIs, on which ROA is based, have proven to be extremely scalable, as evidenced by the many global web services on the public internet. Therefore it is rational to expect that the same technology can handle the comparatively smaller enterprise use case.
As data volume and client traffic grow, additional nodes can be added behind a load balancer to handle the load. Also, since URLs uniquely address data resources, caching is greatly simplified using standard HTTP proxies or even content delivery networks. For historical data (and data which changes infrequently) cache retention can be very long, thus avoiding additional strain on the underlying databases.
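The URL-keyed caching point can be sketched with a toy cache that honors a per-resource max-age, mimicking what a standard HTTP proxy does for GET requests. The URL and the one-day retention are hypothetical values for illustration.

```python
import time

cache = {}  # url -> (expires_at, body)

def fetch(url, max_age, loader):
    """Return the cached body if still fresh; otherwise call the loader
    (standing in for the database) and remember the result for max_age seconds."""
    now = time.monotonic()
    hit = cache.get(url)
    if hit and hit[0] > now:
        return hit[1], "HIT"
    body = loader()
    cache[url] = (now + max_age, body)
    return body, "MISS"

calls = []
loader = lambda: calls.append(1) or "2019 closing prices"

# Historical data changes rarely, so a long max-age keeps load off the database.
fetch("/db/prices/2019", max_age=86400, loader=loader)
body, status = fetch("/db/prices/2019", max_age=86400, loader=loader)
print(status, len(calls))  # HIT 1 -- the second request never touched the store
```

Because the URL uniquely identifies the resource, no application-specific cache key logic is needed; this is exactly what lets off-the-shelf proxies and CDNs do the work.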
Q5. Is it possible to offer quality-aware techniques for reusing data available on the Web?
I am known for saying that everything is possible in software. In some ways the web as it is can be thought of as a giant graph database, albeit not as formalized or structured as dedicated database systems. ProgrammableWeb lists over 17,000 publicly available REST APIs, a number which is said to be but a small fraction of the private and protected APIs already in existence.
I envision applications which seamlessly span data resources internal to the organization and those external to it, without the need to copy data into either domain. REST/HTTP serves as a wonderful abstraction which makes that possible now.
Q6. What are the technical challenges in managing smart data?
They say that beauty is in the eye of the beholder. Similarly, deciding how to process the raw data (which parts to keep, which ones to throw out and what transformations to apply) can be quite different for different applications. It is not always possible to foresee all use cases, so much like deciding on the relational database model, designing a smart data ingest takes a good amount of experience and foresight, and just the same can be difficult and expensive to change later.
Q7. Can an adaptive architecture be used to generate smart data?
I think both terms are a bit of a neologism, at least for the time being. But let me offer this thought. In the ideal scenario I wouldn’t want to persist derived data. Instead, much like an Excel formula derives a new value from other cells (some of which are raw while others can be calculated), an endpoint for derived data could build on other data endpoints. Extrapolating this analogy to resource-oriented architecture, a URL would be the equivalent of a range of cells on the spreadsheet. Now, that is much easier to imagine than to implement in practice, but not impossible either.
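The spreadsheet analogy can be sketched as endpoints named by URL, where a derived endpoint computes its value on demand from other endpoints instead of persisting the result. All the URLs and figures here are hypothetical, and a real implementation would add caching and invalidation.

```python
# "Raw cells": endpoints backed directly by stored data.
raw = {
    "/db/trades/total": lambda: 120.0,
    "/db/fees/total": lambda: 3.5,
}

# "Formula cells": endpoints computed from other endpoints, never persisted.
derived = {
    "/calc/net": lambda: get("/db/trades/total") - get("/db/fees/total"),
}

def get(url):
    """Resolve a URL to its value, whether raw or derived."""
    endpoint = raw.get(url) or derived.get(url)
    return endpoint()

print(get("/calc/net"))  # 116.5
```

When the raw endpoints change, the derived value is automatically current on the next request, which is precisely the property that persisting derived data gives up.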
Q8. Data quantity still hampers data exchanges and remains a bottleneck in architectures. What is your experience with this problem?
The challenge here is the human factor. There will always be vast competency gaps between business analysts in different departments, database administrators, statisticians, quantitative analysts and other knowledge workers. A database administrator will have no problem writing even the most complex SQL query, but generally will have limited understanding of the purpose of that data extraction, especially if the request came from another department. A quantitative analyst or a statistician will feel right at home analyzing and reshaping a comma-separated file, but will often be oblivious to assumptions that had to be made in constructing that file by joining multiple tables. As the data travels through the business process pipeline, the chances of unintended consequences compound. A resource-oriented architecture that is accessible to humans and programs alike mitigates many of those problems and increases productivity.
Q9. What are the advantages of REST API development?
The number one advantage is that they are very easy to get (no pun intended). All you need to interact with a REST API is your browser or a command line tool like curl.
Everybody understands what a URL is and what to do with it. So you open that API address in the browser and most likely a so-called JSON document will display. In SlashDB we offer alternative representations in HTML, CSV and XML to make it even more user friendly. But even JSON is a quite readable text format with some structure to it. Sending data is quite easy too.
You still work with the URL, but instead of the HTTP GET method you use POST and add content to your request. In fact, every time you fill out a form online this exact process takes place.
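The GET/POST symmetry can be demonstrated end to end with nothing but the standard library: a minimal in-process server and plain `urllib` requests. The `/note` resource and its shape are hypothetical, chosen only to show that reading and writing use the same URL.

```python
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

notes = []  # the "database" behind the resource

class Handler(BaseHTTPRequestHandler):
    def do_GET(self):
        body = json.dumps(notes).encode()
        self.send_response(200)  # numbered response code the client can react to
        self.send_header("Content-Type", "application/json")
        self.end_headers()
        self.wfile.write(body)

    def do_POST(self):
        length = int(self.headers["Content-Length"])
        notes.append(json.loads(self.rfile.read(length)))
        self.send_response(201)  # 201 Created
        self.end_headers()

    def log_message(self, *args):  # keep the demo quiet
        pass

server = HTTPServer(("127.0.0.1", 0), Handler)
threading.Thread(target=server.serve_forever, daemon=True).start()
url = f"http://127.0.0.1:{server.server_port}/note"

# POST: same URL, but we add content to the request -- just like a web form.
req = urllib.request.Request(url, data=json.dumps({"text": "hello"}).encode(),
                             headers={"Content-Type": "application/json"})
created = urllib.request.urlopen(req).status

# GET: reading the resource back needs nothing but the URL.
fetched = json.loads(urllib.request.urlopen(url).read())
print(created, fetched)
server.shutdown()
```

No specialized client library is involved; the same two verbs work against any HTTP stack on any operating system.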
The second tremendous advantage is the ubiquity and simplicity of the HTTP standard. For every request there is a numbered response code, so your program can react to any issues. And you don’t need specialized programming libraries for various databases or middleware. The HTTP client stack is pre-installed on just about every operating system, and most programming languages come with convenient libraries to work with it.
Q10. There are several types of new data stores including NoSQL, XML-based, Document-based, Key-value, Column-family and Graph data stores. What is your experience with those?
Paraphrasing Jamie Zawinski‘s law on software, I coined this statement a few years ago: “Every NoSQL database evolves until it can do SQL. Those which cannot so evolve are replaced by those which can”.
I have had brief experiences with MongoDB, Riak and a bunch of key-value stores. I firmly believe that Codd’s relational model is still the best invention for general purpose database systems. There may be implementation details (such as the use of columnar storage or fractal indices) which have a profound effect on performance in different applications, but the principles and the science of relational algebra and the expressiveness of SQL are unbeatable.
Q11. What are your experiences of using Spark in production?
It was too easy to break. Spark has a tendency to blow up in memory, especially when higher order abstractions such as its DataFrame are used.
Q12. How do you typically handle scalability problems and storage limitations?
Avoid denormalized and duplicated data sets. Pick the right storage engine for the application. For analytical purposes, columnar stores offer much more efficient use of disk space than row stores.
Q13. What is your take of the various extensions of SQL to support data analytics?
Take any opportunity to avoid sending more data over the wire than your end result requires. For that reason, if a database system implements analytical functions, by all means use them. One caveat is to avoid stored procedures. A table-valued user-defined function is a cleaner approach because it can be used in standard SQL queries for downstream analyses while allowing for the full use of those extensions. The MADlib extension (now under the Apache banner) is a library worth knowing about.
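The "compute in the database, ship only the result" advice can be illustrated with a standard SQL window function, run here against SQLite via Python for a self-contained example. The `sales` table and figures are hypothetical; SQLite supports window functions from version 3.25 onward.

```python
import sqlite3

# A toy analytical table in an in-memory database.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (day INTEGER, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [(1, 10.0), (2, 20.0), (3, 5.0)])

# The running total is computed server-side by the window function,
# so only three result rows cross the wire -- not the raw data plus
# client-side aggregation logic.
rows = conn.execute("""
    SELECT day,
           SUM(amount) OVER (ORDER BY day) AS running_total
    FROM sales
""").fetchall()
print(rows)  # [(1, 10.0), (2, 30.0), (3, 35.0)]
```

Because this is plain SQL rather than a stored procedure, the same expression can be composed into larger queries for downstream analyses, which is the point of preferring set-returning constructs over procedural ones.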