Analysing a 1.4 billion record dataset with CortexDB. Q&A with Thomas Kalippke
Q1. Your current showcase is a database of 1.4 billion records (~1.2 TB data file) of New York taxi trips, built from the TLC Trip Record Data. What kind of data discovery do you do with such data?
That’s right. We use the data from all taxi rides collected in New York over nine years. So far we have not built anything custom for this data; we run the data discovery with our standard tools on a laptop (MacBook Air) and an external USB SSD (Samsung T3 with 2 TB).
The first idea was to use this data to demonstrate individual functions of our database technology and our platform. We already knew about other people’s analyses before the import and considered creating similar graphical and aggregated information. After the import and a first look at the data, we found it much more interesting and exciting to show the distinct values per field and to perform on-the-fly analyses in our 6th normal form.
In the first step, we only looked at the distinct contents of each field. For example, incorrect data in any field can be identified with a mouse click, without the need to implement an algorithm. The data shows that many credit card transactions contain a negative tip. Interestingly, tips are only recorded for credit card payments.
The data also includes the geo-coordinates of the start and end of each trip. From these we could see that some journeys supposedly led once around the earth. Since this effect occurs repeatedly, we assume that it comes from tests of the taximeters.
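One way to flag such implausible journeys by script is to compare the straight-line distance between the start and end coordinates with a generous upper bound. The following is only a minimal sketch, not the check used in the showcase: the field names (`pickup_lat`, `pickup_lon`, `dropoff_lat`, `dropoff_lon`), the sample coordinates and the 200 km threshold are all assumptions made for illustration.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in kilometres."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    # Clamp to 1.0 to guard against floating-point rounding.
    return 2 * EARTH_RADIUS_KM * math.asin(min(1.0, math.sqrt(a)))


def is_implausible(trip, max_km=200.0):
    """True if start and end are further apart than max_km (assumed bound)."""
    distance = haversine_km(trip["pickup_lat"], trip["pickup_lon"],
                            trip["dropoff_lat"], trip["dropoff_lon"])
    return distance > max_km


# Hypothetical sample records, not taken from the real data.
trips = [
    # Roughly Times Square to JFK airport: a plausible trip.
    {"pickup_lat": 40.7580, "pickup_lon": -73.9855,
     "dropoff_lat": 40.6413, "dropoff_lon": -73.7781},
    # Zeroed drop-off coordinates: thousands of kilometres away.
    {"pickup_lat": 40.7580, "pickup_lon": -73.9855,
     "dropoff_lat": 0.0, "dropoff_lon": 0.0},
]

suspicious = [t for t in trips if is_implausible(t)]
print(len(suspicious), "of", len(trips), "trips look implausible")
```

Records flagged this way are candidates for closer inspection, not automatically errors; the threshold would have to be chosen with knowledge of the actual data.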
After examining the individual fields and their contents, we proceeded combinatorially and manually picked a few data records at random. That is, we used our standard search to combine result sets in order to test whether correlations can be recognized. This led us to the idea of using pictures of celebrities shot by paparazzi. From those we knew who got into a cab, where and when. It was therefore possible to combine this information and find the exact trip in the transaction data.
Q2. What technologies do you use for data discovery and analytics?
We only use our own technology for this showcase. CortexDB forms the core, and the tools are all combined under the CortexPlatform. Tools from other vendors could also be used via our APIs, but we want to show what is possible in the standard version on a laptop with a USB disk. Of course, other developers can also use other tools on our database platform, and we are very curious what will come out of it and how other people will approach this data.
The data resides entirely on an external USB SSD, and we use it on a MacBook Air laptop. However, we did the import on a small server with 16 cores and 64 GB RAM: the import took about 3 hours and the reorganization into the 6th normal form another 4 hours. In contrast to other solutions based on this data, we can select all contents of all fields of all data records in any combination and browse through them quite playfully.
CortexDB forms the basis that lets us work playfully with this data on a laptop: it combines a document store for storing the data sets with a key/value store that indexes the content of each field (6th normal form). We therefore combine the flexibility of a schemaless approach with a defined schema for a redundancy-free index per field content.
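The general idea of pairing a document store with an index per field content can be illustrated with a toy in-memory structure. This is a sketch of the concept only and makes no claim about how CortexDB is implemented internally; the class `FieldIndexedStore`, its methods and the sample records are hypothetical.

```python
from collections import defaultdict


class FieldIndexedStore:
    """Toy store: whole records plus an index entry per (field, value)."""

    def __init__(self):
        self.documents = {}             # record id -> full record
        self.index = defaultdict(set)   # (field, value) -> set of record ids

    def insert(self, record_id, record):
        self.documents[record_id] = record
        for field, value in record.items():
            # Each distinct value is stored once as a key; records are
            # attached to it by id only.
            self.index[(field, value)].add(record_id)

    def values_of(self, field):
        """Distinct values of a field with the number of records using each."""
        return {v: len(ids) for (f, v), ids in self.index.items() if f == field}

    def select(self, **criteria):
        """Records matching all field=value criteria, via set intersection."""
        id_sets = [self.index.get((f, v), set()) for f, v in criteria.items()]
        if not id_sets:
            return []
        ids = set.intersection(*id_sets)
        return [self.documents[i] for i in sorted(ids)]


# Hypothetical sample records.
store = FieldIndexedStore()
store.insert(1, {"payment_type": "card", "passengers": 2, "tip": 3.0})
store.insert(2, {"payment_type": "cash", "passengers": 2, "tip": 0.0})
store.insert(3, {"payment_type": "card", "passengers": 1, "tip": -1.0})

print(store.values_of("payment_type"))
print(store.select(payment_type="card", passengers=2))
```

Because every field is indexed by value, listing the distinct contents of a field and combining arbitrary criteria both reduce to lookups and set intersections, without scanning the stored records.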
For data discovery we use our web-based application CortexUniplex in combination with server-side JavaScript. We can go through the data manually, but we can also run scripts to check values.
For the analysis we use the integrated functions to analyze the contents of each field (e.g. we immediately see that 80% of all tips were $1, $2 or $3). We also use another of our tools to create aggregated information and filter it as desired: the so-called pivot server, with which we calculate the results for any combination and display them graphically. Each result knows its source in the transaction data. If a transaction changes, only the affected results are regenerated.
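The principle that each aggregated result knows its source transactions, so that a change regenerates only the affected results, can be sketched as follows. This is a simplified illustration under our own assumptions and not the pivot server's actual design; the class `IncrementalPivot`, its fields and the sample data are hypothetical.

```python
from collections import defaultdict


class IncrementalPivot:
    """Sum of one field per combination of key fields, updated incrementally."""

    def __init__(self, key_fields, value_field):
        self.key_fields = key_fields
        self.value_field = value_field
        self.transactions = {}            # transaction id -> record
        self.sources = defaultdict(set)   # cell key -> ids of source transactions
        self.cells = {}                   # cell key -> aggregated sum
        self.recomputed = []              # log of regenerated cells (for illustration)

    def _key(self, tx):
        return tuple(tx[f] for f in self.key_fields)

    def _regenerate(self, key):
        ids = self.sources.get(key, set())
        if ids:
            self.cells[key] = sum(self.transactions[i][self.value_field] for i in ids)
        else:
            # No source transactions left: drop the cell.
            self.cells.pop(key, None)
            self.sources.pop(key, None)
        self.recomputed.append(key)

    def upsert(self, tx_id, tx):
        """Insert or change a transaction and regenerate only affected cells."""
        affected = set()
        old = self.transactions.get(tx_id)
        if old is not None:
            old_key = self._key(old)
            self.sources[old_key].discard(tx_id)
            affected.add(old_key)
        self.transactions[tx_id] = tx
        new_key = self._key(tx)
        self.sources[new_key].add(tx_id)
        affected.add(new_key)
        for key in affected:
            self._regenerate(key)


# Hypothetical sample data.
pivot = IncrementalPivot(["weekday", "passengers"], "tip")
pivot.upsert(1, {"weekday": "Sat", "passengers": 2, "tip": 3.0})
pivot.upsert(2, {"weekday": "Sat", "passengers": 2, "tip": 2.0})
pivot.upsert(3, {"weekday": "Mon", "passengers": 1, "tip": 1.0})

pivot.recomputed.clear()
pivot.upsert(2, {"weekday": "Sat", "passengers": 2, "tip": 5.0})  # change one transaction
print(pivot.recomputed)  # only the ("Sat", 2) cell was regenerated
```

The mapping from each cell to its source transaction ids is what makes the targeted update possible; the cell for Monday is never touched when a Saturday transaction changes.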
For the graphical representation we use the D3.js library. In this showcase it may not look as polished as the solutions from other vendors, but we just want to show the feasibility of new approaches that can be opened up to other developers and departments.
Q3. What results did you obtain from such analysis?
Because we know and can show the contents of every field in every record, our recommendation is that everyone should identify possible sources of error and exclude them from aggregated results (data discovery). For example, if the average of all tips is calculated with the negative values included, the result can only be wrong.
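A small worked example shows how strongly a single erroneous value can distort an average. The numbers below are made up for illustration and are not taken from the taxi data.

```python
from statistics import mean

# Made-up tip amounts in dollars; the negative value stands for a faulty record.
tips = [1.0, 2.0, 3.0, 2.0, -8.0]

# Exclude values that cannot be real tips before aggregating.
valid_tips = [t for t in tips if t >= 0]

mean_all = mean(tips)          # 0.0: the faulty record cancels out all real tips
mean_valid = mean(valid_tips)  # 2.0: average of the plausible values only

print(f"mean including negative values: {mean_all:.2f}")
print(f"mean of plausible values only:  {mean_valid:.2f}")
```

Whether a negative tip should be dropped, corrected or investigated depends on what caused it; the point is only that it must be noticed before aggregating.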
This naturally applies to all analyses in all databases and domains. If you do not know the possible value set and what values actually exist, selections are incomplete and results are incorrect. With the CortexPlatform it is very easy to look into the field index (6th normal form), or to have an algorithm do so, and check the actual values to identify errors.
With regard to this showcase with the taxi data, it was interesting to see that our own assumptions and the recommendations from travel guides were not correct.
For example, we suspected that most taxi rides take place on certain holidays. But this was not true. Most people took a taxi on the days of the New York Marathon and of league games in baseball and American football.
In addition, many travel guides say that a tip of 30% is appropriate. This is also not true. 80% of all tips are $1, $2 or $3. Longer journeys (in terms of both time and distance) obviously earn larger tips, but to speak of a flat rate of 30% is wrong.
Interestingly, we were right about one assumption. The best tips are paid on weekend trips when two passengers are carried. Our guess is that men are more generous when accompanied by women 😉
Qx. Anything else you wish to add?
We now provide a free version of the CortexPlatform, which can be downloaded from our website. We will soon be adding more tools to the download area so that everyone will be able to import the same taxi data.
We are also working on publishing examples on GitHub, and we want to exchange ideas and examples with other developers.
Ideas, questions and suggestions are therefore very welcome and we are also very happy to help if we are contacted directly.
—————————-
Thomas Kalippke, Product- and Partner-Management, Cortex AG