On Fintech and AI. Q&A with Hitoshi Harada
Hitoshi Harada, CTO and co-founder of Alpaca.
Q1. What are the main lessons you have learned as a database core developer?
I started my path as a database core developer by jumping into open source PostgreSQL development. I implemented window functions in PostgreSQL 8.4 and extended them in 9.0, along with some work on CTEs. The open source developer community is full of knowledge and help for any kind of developer, including me, and I learned from it much of what I needed to know as a database developer. It was a tremendous experience to work closely with some of the most senior developers in the open source database community.
This effort not only gave me the chance to be recognized as one of the major contributors to the project and to speak at a few international database conferences, but also led me to my next career step: working as a database developer on Greenplum, one of the most successful MPP databases.
In 2011, when I moved from Japan to join the Greenplum team in Silicon Valley, it was the most exciting moment of the Big Data movement, with wonderful co-workers coming from many angles. Some were industry veterans with 20+ years of database development experience at Teradata, Tandem, Netezza, Oracle, DB2, and SQL Server. Others were deeply knowledgeable storage developers from the EMC umbrella who offered a lot of insight into how the database interacts with the lower layers. Still others were brilliant software engineers rooted in Silicon Valley who could solve whatever problem came up. Luke Lonergan, who co-founded the company and built a ridiculously high-speed MPP database from the ground up, is emblematic of this group. They were amazing people, and I was fortunate to work with all of them.
The lessons learned from this great environment were countless, but I will present two here.
– The first is the importance of listening to users.
A database is inherently a complicated product, and different use cases need different solutions. There are myriad database papers from both academia and industry, and as developers we are tempted to build something cool in terms of novelty, elegance, or performance. In addition, because of the product's complexity, it takes a great deal of time and resources to develop a single feature.
I have seen many such features, both in and out of my own products, that took a long time to build and ended up being useless to end users because they were built on a developer's idea rather than a user's story. Great products with great features come about and succeed when we listen to users. Greenplum, for example, had severe stability issues when I joined the team, because the product had previously focused on performance to solve customers' most pressing pain point. Over time, users' demands shifted from performance to stability, as the product came to be viewed as a data warehouse rather than a lightning-fast data processor.
We had escalation calls every day dealing with on-site crashes, data corruption, and data unavailability, but those calls helped us understand which parts of the database needed to be fixed. It was obvious we had to fix them, even if it meant pushing back big new features that had already consumed many person-years of effort. Adopting an agile methodology, the development team quickly moved to a quality-first cycle based on this user feedback.
The result was significant: we quickly saw a dramatic reduction in the number of escalation calls and earned a great deal of trust from new and old customers alike. This was a team effort, not just by the developers but also by the escalation support team and the product management team, who spent a lot of time with our customers. I observed the same thing in open source PostgreSQL: while many new developers tried to push their favorite features into the main branch, some of the most seasoned project maintainers spent a lot of time on the non-developer mailing lists, helping users and listening to the problems they had, even when an issue seemed trivial from a developer's point of view.
– The second lesson is related to the first: during this time I learned how important it is to stay focused.
Databases are used for many purposes, and different users demand different things. It is easy to lose focus and be tempted to jump on a problem that looks easy to solve, but because the product is complicated, it is critical to deliver one good thing as quickly as possible for a database product to succeed. Greenplum did a good job of solving the pain point of data processing scalability by employing state-of-the-art parallel processing in the software layer, running 100x faster than users' existing data warehouse products that cost 10x or more, even though we were not as stable as other products.
That was acceptable in order to get to market first. Then stability became the main issue as users started relying on the product as a serious data hub. If we had tried to solve both the performance and stability problems from the start, the product would not have survived in the rapidly changing market before something like Hadoop came along.
Q2. What is Alpaca?
Alpaca is a Silicon Valley based fintech startup whose mission is to help everyone in the financial markets with the latest AI and Big Data technology. The genesis of the company goes back to the experience of our CEO, Yoshi. He worked at Lehman Brothers around the time of the subprime mortgage securitization and watched everything collapse despite all the risk hedging based on modern portfolio theory. He had a few good friends at macro hedge funds who made a lot of money during that crash, much like what you see in the movie The Big Short. He then left the industry and became an individual day trader in the forex market, where he discovered how different from institutions, and how underserved, the retail trading space was. It was clear that with recent advances in software technology, retail traders should be better equipped and able to compete with institutions. Alpaca was founded to solve this personal pain point that is common to many people.
Q3. How do you leverage AI and Big Data technology in financial trading and personal finance for retail users?
One of our initial offerings helps retail traders turn their manual trading strategy into an automated algo without writing a single line of code, using state-of-the-art deep learning technology.
We applied a concept similar to image/voice recognition to visual price charts to learn the conditions a trader uses as a strategy. Automating a trading strategy has a lot of advantages: it removes the emotion that hurts profitability, it can control risk more mechanically, and it saves the human time otherwise spent sitting and monitoring charts all day.
Even with these huge benefits, many traders have a hard time expressing their strategy in a programming language, and this product helped them do so without writing any code.
We built a proprietary AI specialized for the financial markets that learns a trader's fuzzy decision-making process from data and finds the common patterns with higher accuracy than other existing tools. Since training an AI for each user requires a lot of input data, we needed quick access to vast amounts of real-time and historical market data. The automated trading algo then needs backtesting, where it runs the same strategy against history to see how it performs in different markets, different situations, and different timeframes, as sketched below. Then the algo runs against the real-time market, which requires the lowest possible latency to market updates; you can imagine how hard it is to keep up with every movement of each market, whether that is several hundred currency pairs in forex, nearly ten thousand stocks in US equities, or countless fiat and cryptocurrency pairs across dozens of exchanges in the latest crypto market. Finding the best opportunity in this huge flow of data in real time requires our Big Data technology.
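To make the backtesting step concrete, here is a minimal, hypothetical sketch rather than Alpaca's actual engine: it replays a simple moving-average crossover rule over historical bars (the `Close` column and the crossover rule are assumptions for illustration) and reports the cumulative return.

```python
import pandas as pd


def backtest_sma_crossover(bars: pd.DataFrame, fast: int = 10, slow: int = 30) -> float:
    """Replay a simple moving-average crossover strategy over historical bars.

    `bars` is assumed to have a 'Close' column indexed by timestamp.
    Returns the cumulative return of holding one unit whenever the fast SMA
    is above the slow SMA (a toy rule, not Alpaca's strategy).
    """
    close = bars["Close"]
    fast_sma = close.rolling(fast).mean()
    slow_sma = close.rolling(slow).mean()
    # Position is 1 (long) when the fast average is above the slow one, else 0 (flat).
    # Shift by one bar so the signal from bar t is traded on bar t+1.
    position = (fast_sma > slow_sma).astype(int).shift(1).fillna(0)
    strategy_returns = position * close.pct_change().fillna(0)
    return float((1 + strategy_returns).prod() - 1)


# Example usage with synthetic data:
# bars = pd.DataFrame({"Close": [...]}, index=pd.date_range("2020-01-01", periods=500, freq="D"))
# print(backtest_sma_crossover(bars))
```

The real system, of course, has to run such loops over decades of sub-second data for many instruments at once, which is where the data infrastructure described below comes in.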
Q4. One part of your technology stack is a time series database you built from scratch. What is special about it? And why didn't you use an existing time series database available on the market?
From this experience, it was obvious that we needed a time series database that could hold decades of historical data at sub-second granularity, with write performance good enough to keep up with market changes, the ability to serve hundreds of thousands of algos simultaneously, and good support for data science toolkits.
Coming from the database world, I knew time series databases were not new at all, but I also knew that most of them were general purpose and not designed specifically for financial market data. Many quants are familiar with PyData, which has good support for this kind of data, but holding the entire market dataset in memory was not an option for us, and in particular we found that DataFrame was not designed for frequently appending new data, as the short sketch below illustrates.
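As a rough illustration of that last point (a generic pandas sketch, not Alpaca's code): growing a DataFrame one row at a time copies the existing rows on every append, so streaming ticks into a frame degrades quickly compared to buffering rows and building the frame once.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
ticks = [{"price": float(p)} for p in rng.random(5_000)]  # synthetic tick stream

# Naive: grow the DataFrame one row at a time. Each pd.concat copies every
# existing row, so the total work grows quadratically with the number of appends.
df = pd.DataFrame(ticks[:1])
for tick in ticks[1:]:
    df = pd.concat([df, pd.DataFrame([tick])], ignore_index=True)

# Better: buffer rows in a plain Python list and build the DataFrame once.
df_batched = pd.DataFrame(ticks)
```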
The Hadoop ecosystem also had many different solutions for time series data, but again they were general purpose and the latency was outside our acceptable range. SQL databases were out of the question, again because their design is unaware of ordered time series records. The only possible solution I could think of was KDB+, since I knew it was designed specifically for this kind of data and would handle our workload well, but it was a commercial product that was not affordable for a company like ours, and the technology felt dated, as its query language Q suggests.
So we sat down and discussed the options, and there were few alternatives to building something from scratch. I saw a very feasible solution that could go into production quickly by focusing on the data types we actually needed. I also knew it sounded a bit crazy to claim we would support both deep historical data and fast real-time updates in one database, since the industry trend was toward separating those two workloads.
My experience with one of our Wall Street customers back at Greenplum helped me design the right architecture for MarketStore, our time series database for financial market data. They had struggled with very frequent updates combined with demanding high-performance analytical queries. A general purpose database tends to separate those workloads: if it is good at updates, it is poor at analytics, and if it is good at analytics, it is poor at updates. I knew exactly where this trade-off comes from and was able to solve the problem with the right design.
MarketStore's storage system is designed specifically for these needs: it is written in Go, optimized for today's hardware and filesystems, employs a plugin system for different requirements, and delivers data through a modern HTTP API powered by MessagePack serialization, with a wire format optimized for lightning-fast deserialization to DataFrame and C arrays with the analytics use case in mind.
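To give a feel for that query path, here is a rough sketch using the pymarketstore Python client; the exact API may differ between versions, and the endpoint, symbol, and timeframe below are placeholder assumptions.

```python
import pymarketstore as pymkts

# Connect to a MarketStore instance (local default endpoint assumed).
client = pymkts.Client("http://localhost:5993/rpc")

# Ask for the last 100 one-minute OHLCV bars of a hypothetical symbol.
params = pymkts.Params("AAPL", "1Min", "OHLCV", limit=100)
reply = client.query(params)

# The MessagePack response deserializes straight into a pandas DataFrame.
df = reply.first().df()
print(df.tail())
```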
Q5. Will it be made open source? And if yes, how can developers contribute?
Shortly after we started using MarketStore in production, it became clear that it is quite powerful and could be useful elsewhere. Initially we thought our use case was very unique, but as more and more players come into this space, we are seeing them hit the same problems we had. So we decided to make it open source, and today it is on GitHub.
We are open to any kind of suggestion. Trying it out is a good contribution, as is sending pull requests to implement features we have not built yet. Please file issues on GitHub and let us know what works and what doesn't. As I learned earlier in my career, it is very important to listen to users.
Hitoshi Harada is CTO and co-founder of Alpaca, a fintech startup helping retail financial trading with AI and Big Data technology. Hitoshi was one of the major contributors to the PostgreSQL Global Development Group, developing window functions and CTE support from versions 8.4 through 9.1. He then became the main architect of the Greenplum MPP database, working on all aspects of the system such as parallel query, the optimizer, the replication system, distributed transactions, and storage optimization. He is a big fan of open source software and the author of PL/v8, the JavaScript procedural language engine for PostgreSQL. He has spoken at several conferences including PostgreSQL Conference, NVIDIA GTC, and the Deep Learning Summit.