“Gaia continues to be a challenging mission in all areas even after 4 years of operation.
In total we have processed almost 800 Billion (=800,000 Million) astrometric, 160 Billion (=160,000 Million) photometric and more than 15 Billion spectroscopic observation which is the largest astronomical dataset from a science space mission until the present day.”
— Uwe Lammers.
In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past. The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.
I recall here the Objectives of the Gaia Project (source ESA Web site):
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
I have been following the GAIA mission since 2011, and I have reported it in three interviews until now.
In this interview, Uwe Lammers – Gaia’s Science Operations Manager – gives a very detailed description of the data challenges and the opportunities of the Gaia mission.
This interview is the fourth of the series, the second after the launch.
Q1. Of the raw astrometry, photometry and spectroscopy data collected so far by the Gaia spacecraft, what is their Volume, Velocity, Variety, Veracity and Value?
Since the beginning of the nominal mission in 2014 until end June 2017 the satellite has delivered about 47.5 TB compressed raw data. This data is not suitable for any scientific analysis but first has to be processed into higher-level products which inflates the volume about 4 times.
The average raw daily data rate is about 40 GB but highly variable depending on which part of the sky the satellite is currently scanning through. The data is highly-complex and interdependent but not unstructured – it does not come with a lot of meta-information as such but follows strictly defined structures. In general it is very trustworthy, however, the downstream
data processing cannot blindly assume that every single observation is valid.
As with all scientific measurements, there can be outliers which must be identified and eliminated from the data stream as part of the analysis. Regarding value, Gaia’s data set is absolutely unique in a number of ways.
Gaia is the only mission surveying the complete sky with unprecedented precision and completeness. The end results is expected to be a treasure trove for generations of astronomers to come.
Q2 How is this data transmitted to Earth?
Under normal observing conditions the data is transmitted from the satellite to the ground through a so-called phased-array-antenna (PAA) at a rate of up to 8.5 Mbps. As the satellite spins, it continuously keeps a radio beam directed towards the Earth by activating successive panels on the PAA. This is a fully electronic process as there can be no moving parts on Gaia which would otherwise disturb the precise measurements. On the Earth we use three 35m radio dishes in Spain, Australia, and Argentina to receive the telemetry from Gaia.
Q3. Calibrated processed data, high level data products and raw data. What is the difference? What kind of technical data challenges do they each pose?
That question is not easy to answer in a few words. Raw data are essentially unprocessed digital measurements from the CCDs – perhaps comparable to data from the “raw mode” of digital consumer cameras. They have to be processed with a range of complex software to turn it into higher level products from which at the end astrophysical information can be inferred. There are many technical challenges, the most basic one is still to handle the 100s of GBs of daily data. Handling means, reception, storage, processing, I/O by the scientific algorithms, backing-up, and disseminating the processed data to 5 other partner data processing centres across Europe.
Here at the Science Operations Centre (SOC) near Madrid we have chosen years ago InterSystems Caché RDMS + NetApp hardware as our storage solution and this continues to be a good solution. The system is reliable and performant which are crucial pre-requisites for us. Another technical challenge is data accountability which means to keep track of the more than 70 Mio scientific observation we get from the satellite every single day.
Q4. Who are the users for such data and what they do with it?
The data we are generating here at the SOC has no immediate users. It is sent out to the 5 other Gaia Data Processing Centres where more scientific processing takes place and more higher-level products get created. From all this processed data we are constructing a stellar catalogue which is our final result and this is what the end users – the astronomical community of world – to see. The first version of our catalogue was published 14 September last year (Gaia Data Release 1) and we are currently working hard to release the second version (DR2) in April next year.
Our end users do fundamental astronomical research with the data ranging from looking at individual stars, studies of clusters, dynamics of our Milky-Way to cosmological questions like the expansion rate of our universe. The scientific exploitation of the Gaia data has just started but already now more than 200 scientific articles have been published. This is about 1 per day since DR1 and we expect this rate to go higher up after DR2.
Q5. Can you explain at a high level how is the ground processing of Gaia data implemented?
ESA has entrusted the Gaia data processing to the Data Processing Analysis Consortium (DPAC) which the SOC is an integral part of. DPAC consists of 9 so-called Coordination Units (CU) and 6 data processing centres (DPCs) across Europe, so this is a large distributed system.
In total some 450 people from 20+ countries with a large range of educational backgrounds and experiences are forming DPAC. Roughly speaking, the CUs are responsible for writing and validating the scientific processing software which is then run in one of the DPCs (every CU is associated with exactly one DPC).
The different CUs cover different aspects of the data processing (e.g. CU3 takes care of astrometry, CU5 of photometry).
The corresponding processes run more or less independent of each other, however, due to the complex interdependencies of the Gaia data itself this is only a first approximation. Ultimately, everything depends on everything else (e.g. astrometry depends on photometry and vice versa) which means that the entire system has to be iterated to produce the final solution. As you can imagine a lot of data has the be exchanged. SOC/DPCE is the hub in a hub-and-spokes topology where the other 5 DPCs are sitting at the ends of the spokes. No data exchange between DPCs is allowed but all the data flow is centrally managed through the hub at DPCE.
Q6. How do you process the data stream in near real-time in order to provide rapid alerts to facilitate ground-base follow up?
Yes, indeed we do. For ground-based follow up observations of variable objects quick turn-around times are essential. The time difference between an observation made on-board and the confirmation of a photometric alert on the ground is typically 2 days now which is close to the optimal value given all the operational constraints we have.
Q7. What are the main technical challenges with respect to data processing, manipulation and storage you have encountered so far? and how did you solved them?
Regarding storage, the handling of 100s of GBs of raw and processed data every day has always been and remains until today quite a challenge as explained above. The Gaia data reduction task is also a formidable computational problem. Years ago we estimated the total numerical effort to produce the final catalogue at some 10^20 FLOPs and this has proven fairly accurate.
So we need quite some number-crunching capabilities in the DPCs and to continuously expand CPU resources as the data volume keeps growing in the operational phase of the mission. Moore’s law is slowly coming to an end but, fortunately, a number of algorithms are perfectly parallelizable (processing every object in the sky individually and isolated) such that CPU bottlenecks can be ameliorated by simply adding more processors to the existing systems.
Data transfers are likewise a challenge. At the moment 1 Gbps connections (public Internet) between DPCE and the other 5 DPCs are sufficient, however, in the coming years we heavily rely on seeing bandwidths increasing to 10 Gbps and beyond. Unfortunately, this is largely not under our control which is a risk to the project.
Q8. What kind of databases and analytics tools do you use for the Gaia`s data pipeline?
As explained above, for the so-called daily pipeline we have chosen InterSystems Caché and are very satisfied with this approach. We had some initial problems with the system but were able to overcome all difficulties with the help of Intersystems. We much appreciated their excellent service and customer orientation in this phase and till the present day. Regarding analytics tools we use most facilities that are part of Caché, but have also developed a suite of custom-made solutions.
Q9. How do you transform the raw information into useful and reliable stellar positions?
The raw data from the satellite is first turned into higher level-products which already includes preliminary estimates for the stellar positions. But each of these positions is then only based on a single measurements. The high accuracy of Gaia comes from combining _all_ observations that have been taken during the mission with a scheme called Astrometric Iterative Solution (AGIS) [see The astrometric core solution for the Gaia mission. Overview of models, algorithms, and software implementation].
This cannot be done on a star-by-star basis but is a global, simultaneous optimization of a large number of parameters including the 5 basic astrometric parameters of each star (about 1 Billion in total), the time-varying attitude of the satellite
(a few Million), and a number of calibration parameters (a few 10.000).
The process is iterative and in the end gives the best match between the model parameters and the actual observations. The stellar positions are two of the five astrometric parameter of each object.
Q10. What is the level of accuracy you have achieved so far?
The accuracies depend on the brightnesses of the stars – the brighter a star, the higher is the achievable accuracy. In DR1 the typical uncertainty is about 0.3 mas for the positions and parallaxes, and about 1 mas yr^-1 for the proper motions.
For positions and parallaxes a systematic component of another 0.3 mas should be added. With DR2 we are aiming to reduce these formal errors by at least a factor 3 and likewise eliminate systematic errors by the same or a larger amount.
Q11. The first catalogue of more than a billion stars from ESA’s Gaia satellite was published on 14 September 2016 – the largest all-sky survey of celestial objects to date. What data is in this catalog? What is the size and structure of the information you analysed so far?
Gaia DR1 contains astrometry, G-band photometry (brightnesses), and a modest number of variable star light curves, for a total of 1 142 679 769 sources [See Gaia Data Release 1. Summary of the astrometric, photometric, and survey properties]. For the large majority of those we only provide position and magnitude but about 2 Million stars also have parallaxes and proper motions. In DR2 these numbers will be substantially larger.
The information is structured in simple, easy-to-use tables which can be queried via the central Gaia Archive and a number of other data centres around the world.
Q12. What insights have been derived so far by analysing this data?
The astronomical community eagerly grabbed the DR1 data and since 14 September a couple of hundred scientific articles have appeared in peer-reviewed astronomical journals covering a large breads of topics.
Only to give one example: A new so-called open cluster of stars was discovered very close to the brightest star in the night sky, Sirius. All previous surveys had missed it!
Q13 How do you offer a proper infrastructure and middleware upon which scientists will be able to do exploration and modeling with this huge data set?
That is a very good question! At the moment the archive system does not allow yet real big data-mining using the entire large Gaia data set. Up to know we do not know precisely yet what scientists will want to do with the Gaia data in the end.
There is the “traditional” astronomical research which mostly uses only subsets of the data, e.g. all stars in a particular area of the sky. Such data requests can be satisfied with traditional queries to a RDBMS.
But in the future we expect also applications which will need data mining capabilities and we are experimenting with a number of different approaches using the “code-to-the-data” paradigm. The idea is that scientists will be able to upload and deploy their codes directly through a platform which allows execution with quick data access close to the archive.
For DR2 this will only be available for DPAC-internal use but, depending on experiences gained, as per DR3 it might become a service for public use. One technology we are looking at is Apache Spark for big data mining.
Q14. What software technologies do you use for accessing the Gaia catalogue and associated data?
As explained above, at the moment we are offering access to the catalogue only through a traditional RDBMS system which allows queries to be submitted in a special SQL dialect called ADQL (Astronomical Data Query Language). This DB system is not using InterSystems Caché but Postgres.
Q15. In addition to the query access, how do you “visualize” such data? Which “big data” techniques do you use for histograms production?
Visualization is done with a special custom-made application that sits close to the archive and is using not the raw data but pre-computed special objects especially constructed for fast visualization. We are not routinely using any big data techniques but are experimenting with a few key concepts.
For visualization one interesting novel application is called vaex and we are looking at it.
Histogramming of the entire data set is likewise done using pre-canned summary statistics which was generated when the data was ingested into the archive. The number of users really wanting the entire data set and this kind of functionality is very limited at the moment. We as well as the scientific community are still learning what can be done with the Gaia data set.
Q16. Which “big data” software and hardware technologies did use so far? And what are the lessons learned?
Again, we are only starting to look into big data technologies that may be useful for us. Until now most of the effort has gone into robustifying all systems and prepare DR1 and now DR2 for April next year. One issue is always that the Gaia data is so peculiar and special that COTS solutions rarely work. Most of the software systems we use are special developments.
Q.17 What are the main technical challenges ahead?
As far as the daily systems are concerned we are now finally in the routine phase. The main future challenges lie in robustifying and validating the big outer iterative loop that I described above. It has not been tested yet, so, we are executing it for the first time with real flight data.
Producing DR3 (mid to late 2020) will be a challenge as this for the first time involves output from all CUs and the results from the outer iterative loop. DR4 around end 2022 is then the final release for the nominal mission and for that we want to release “everything”. This means also the individual observation data (“epoch data”) which will inflate the total volume served by the archive by a factor 100 or so.
Qx Anything else you wish to add?
Gaia continues to be a challenging mission in all areas even after 4 years of operation. In total we have processed almost 800 Billion (=800,000 Million) astrometric, 160 Billion (=160,000 Million) photometric and more than 15 Billion spectroscopic observation which is the largest astronomical dataset from a science space mission until the present day.
Gaia is fulfilling its promises in every regard and the scientific community is eagerly looking into what is available already now and the coming data releases. This continues to be a great source of motivation for everybody working on this great mission.
Uwe Lammers. My academic background is in physics and computer science. After my PhD I joined ESA to first work on the X-ray missions EXOSAT, Beppo-SAX, and XMM-Newton before getting interested in Gaia in 2005. The first years I led the development of the so-called Astrometric Global Iterative Solution (AGIS) system and then became Gaia’s Science Operations Manager in 2014.
– The astrometric core solution for the Gaia mission. Overview of models, algorithms, and software implementation
L. Lindegren, U. Lammers et al. Astronomy & Astrophysics, Volume 538, id.A78, 47 pp. February 2012, DOI: 10.1051/0004-6361/201117905
– Gaia Data Release 1. Summary of the astrometric, photometric, and survey properties A.G.A. Brown and Gaia Collaboration, Astronomy & Astrophysics, Volume 595, id.A2, 23 pp. November 2016, DOI: 10.1051/0004-6361/201629512
– Gaia Data Release 1. Astrometry: one billion positions, two million proper motions and parallaxes L. Lindegren, U. Lammers, et al. Astronomy & Astrophysics, Volume 595, id.A4, 32 pp. November 2016, DOI: 10.1051/0004-6361/201628714
– The Gaia mission in 2015. Interview with Uwe Lammers and Vik Nagjee , ODBMS Industry Watch, March 24, 2015
– The Gaia mission, one year later. Interview with William O’Mullane. ODBMS Industry Watch, January 16, 2013
– Objects in Space vs. Friends in Facebook. ODBMS Industry Watch, April 13, 2011
– Objects in Space. ODBMS Industry Watch, February 14, 2011
Follow us on Twitter: @odbmsorg
“Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.”–Nikita Ivanov.
I have interviewed Nikita Ivanov,CTO of GridGain.
Main topics of the interview are Apache Ignite, Apache Spark and MySQL, and how well they perform on big data analytics.
Q1. What are the main technical challenges of SaaS development projects?
Nikita Ivanov: SaaS requires that the applications be highly responsive, reliable and web-scale. SaaS development projects face many of the same challenges as software development projects including a need for stability, reliability, security, scalability, and speed. Speed is especially critical for modern businesses undergoing the digital transformation to deliver real-time services to their end users. These challenges are amplified for SaaS solutions which may have hundreds, thousands, or tens of thousands of concurrent users, far more than an on-premise deployment of enterprise software.
Fortunately, in-memory computing offers SaaS developers solutions to the challenges of speed, scale and reliability.
Q2. In your opinion, what are the limitations of MySQL® when it comes to big data analytics?
Nikita Ivanov: MySQL was originally designed as a single-node system and not with the modern data center concept in mind. MySQL installations cannot scale to accommodate big data using MySQL on a single node. Instead, MySQL must rely on sharding, or splitting a data set over multiple nodes or instances, to manage large data sets. However, most companies manually shard their database, making the creation and maintenance of their application much more complex. Manually creating an application that can then perform cross-node SQL queries on the sharded data multiplies the level of complexity and cost.
MySQL was also not designed to run complicated queries against massive data sets. MySQL optimizer is quite limited, executing a single query at a time using a single thread. A MySQL query can neither scale among multiple CPU cores in a single system nor execute distributed queries across multiple nodes.
Q3. What solutions exist to enhance MySQL’s capabilities for big data analytics?
Nikita Ivanov: For companies which require real-time analytics, they may attempt to manually shard their database. Tools such as Vitess, a framework YouTube released for MySQL sharding, or ProxySQL are often used to help implement sharding.
To speed up queries, caching solutions such as Memcached and Redis are often deployed.
Many companies turn to data warehousing technologies. These solutions require ETL processes and a separate technology stack which must be deployed and managed. There are many external solutions, such as Hadoop and Apache Spark, which are quite popular. Vertica and ClickHouse have also emerged as analytics solutions for MySQL.
Apache Ignite offers speed, scale and reliability because it was built from the ground up as a high performant and highly scalable distributed in-memory computing platform.
In contrast to the MySQL single-node design, Apache Ignite automatically distributes data across nodes in a cluster eliminating the need for manual sharding. The cluster can be deployed on-premise, in the cloud, or in a hybrid environment. Apache Ignite easily integrates with Hadoop and Spark, using in-memory technology to complement these technologies and achieve significantly better performance and scale. The Apache Ignite In-Memory SQL Grid is highly optimized and easily tuned to execute high performance ANSI-99 SQL queries. The In-Memory SQL Grid offer access via JDBC/ODBC and the Ignite SQL API for external SQL commands or integration with analytics visualization software such as Tableau.
Q4. What is exactly Apache® Ignite™?
Nikita Ivanov: Apache Ignite is a high-performance, distributed in-memory platform for computing and transacting on large-scale data sets in real-time. It is 1,000x faster than systems built using traditional database technologies that are based on disk or flash technologies. It can also scale out to manage petabytes of data in memory.
Apache Ignite includes the following functionality:
· Data grid – An in-memory key value data cache that can be queried
· SQL grid – Provides the ability to interact with data in-memory using ANSI SQL-99 via JDBC or ODBC APIs
· Compute grid – A stateless grid that provides high-performance computation in memory using clusters of computers and massive parallel processing
· Service grid – A service grid in which grid service instances are deployed across the distributed data and compute grids
· Streaming analytics – The ability to consume an endless stream of information and process it in real-time
· Advanced clustering – The ability to automatically discover nodes, eliminating the need to restart the entire cluster when adding new nodes
Q5. How Apache Ignite differs from other in-memory data platforms?
Nikita Ivanov: Most in-memory computing solutions fall into one of three types: in-memory data grids, in-memory databases, or a streaming analytics engine.
Apache Ignite is a full-featured in-memory computing platform which includes an in-memory data grid, in-memory database capabilities, and a streaming analytics engine. Furthermore, Apache Ignite supports distributed ACID compliant transactions and ANSI SQL-99 including support for DML and DDL via JDBC/ODBC.
Q6. Can you use Apache® Ignite™ for Real-Time Processing of IoT-Generated Streaming Data?
Nikita Ivanov: Yes, Apache Ignite can ingest and analyze streaming data using its streaming analytics engine which is built on a high-performance and scalable distributed architecture. Because Apache Ignite natively integrates with Apache Spark, it is also possible to deploy Spark for machine learning at in-memory computing speeds.
Apache Ignite supports both high volume OLTP and OLAP use cases, supporting Hybrid Transactional Analytical Processing (HTAP) use cases, while achieving performance gains of 1000x or greater over systems which are built on disk-based databases.
Q7. How do you stream data to an Apache Ignite cluster from embedded devices?
Nikita Ivanov: It is very easy to stream data to an Apache Ignite cluster from embedded devices.
The Apache Ignite streaming functionality allows for processing never-ending streams of data from embedded devices in a scalable and fault-tolerant manner. Apache Ignite can handle millions of events per second on a moderately sized cluster for embedded devices generating massive amounts of data.
Q8. Is this different then using Apache Kafka?
Nikita Ivanov: Apache Kafka is a distributed streaming platform that lets you publish and subscribe to data streams. Kafka is most commonly used to build a real-time streaming data pipeline that reliably transfers data between applications. This is very different from Apache Ignite, which is designed to ingest, process, analyze and store streaming data.
Q9. How do you conduct real-time data processing on this stream using Apache Ignite?
Nikita Ivanov: Apache Ignite includes a connector for Apache Kafka so it is easy to connect Apache Kafka and Apache Ignite. Developers can either push data from Kafka directly into Ignite’s in-memory data cache or present the streaming data to Ignite’s streaming module where it can be analyzed and processed before being stored in memory.
This versatility makes the combination of Apache Kafka and Apache Ignite very powerful for real-time processing of streaming data.
Q10. Is this different then using Spark Streaming?
Nikita Ivanov: Spark Streaming enables processing of live data streams. This is merely one of the capabilities that Apache Ignite supports. Although Apache Spark and Apache Ignite utilize the power of in-memory computing, they address different use cases. Spark processes but doesn’t store data. It loads the data, processes it, then discards it. Ignite, on the other hand, can be used to process data and it also provides a distributed in-memory key-value store with ACID compliant transactions and SQL support.
Spark is also for non-transactional, read-only data while Ignite supports non-transactional and transactional workloads. Finally, Apache Ignite also supports purely computational payloads for HPC and MPP use cases while Spark works only on data-driven payloads.
Spark and Ignite can complement each other very well. Ignite can provide shared storage for Spark so state can be passed from one Spark application or job to another. Ignite can also be used to provide distributed SQL with indexing that accelerates Spark SQL by up to 1,000x.
Qx. Is there anything else you wish to add?
Nikita Ivanov: The world is undergoing a digital transformation which is driving companies to get closer to their customers. This transformation requires that companies move from big data to fast data, the ability to gain real-time insights from massive amounts of incoming data. Whether that data is generated by the Internet of Things (IoT), web-scale applications, or other streaming data sources, companies must put architectures in place to make sense of this river of data. As companies make this transition, they will be moving to memory-first architectures which ingest and process data in-memory before offloading to disk-based datastores and increasingly will be applying machine learning and deep learning to make understand the data. Apache Ignite continues to evolve in directions that will support and extend the abilities of memory-first architectures and machine learning/deep learning systems.
Nikita IvanovFounder & CTO, GridGain,
Nikita Ivanov is founder of Apache Ignite project and CTO of GridGain Systems, started in 2007. Nikita has led GridGain to develop advanced and distributed in-memory data processing technologies – the top Java in-memory data fabric starting every 10 seconds around the world today. Nikita has over 20 years of experience in software application development, building HPC and middleware platforms, contributing to the efforts of other startups and notable companies including Adaptec, Visa and BEA Systems. He is an active member of Java middleware community, contributor to the Java specification. He’s also a frequent international speaker with over two dozen of talks on various developer conferences globally.
Follow ODBMS.org on Twitter: @odbmsorg
” I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices.” –Vint G. Cerf
I had the pleasure to interview Vinton G. Cerf. Widely known as one of the “Fathers of the Internet,” Cerf is the co-designer of the TCP/IP protocols and the architecture of the Internet. Main topic of the interview is the Internet of Things (IoT) and its challenges, especially the safety, security and privacy of IoT devices.
Vint is currently Chief Internet Evangelist for Google.
Q1. Do you like the Internet of Things (IoT)?
Vint Cerf: This question is far too general to answer. I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices. Penetration and re-purposing of these devices can lead to denial of service attacks (botnets), invasion of privacy, harmful dysfunction, serious security breaches and many other hazards. Consequently the makers and users of such devices have a great deal to be concerned about.
Q2. Who is going to benefit most from the IoT?
Vint Cerf: The makers of the devices will benefit if they become broadly popular and perhaps even mandated to become part of local ecosystem. Think “smart cities” for example. The users of the devices may benefit from their functionality, from the information they provide that can be analyzed and used for decision-making purposes, for example. But see Q1 for concerns.
Q3. One of the most important requirement for collections of IoT devices is that they guarantee physical safety and personal security. What are the challenges from a safety and privacy perspective that the pervasive introduction of sensors and devices pose? (e.g. at home, in cars, hospitals, wearables and ingestible, etc.)
Vint Cerf: Access control and strong authentication of parties authorized to access device information or control planes will be a primary requirement. The devices must be configurable to resist unauthorized access and use. Putting physical limits on the behavior of programmable devices may be needed or at least advisable (e.g., cannot force the device to operate outside of physically limited parameters).
Q5. Consumers want privacy. With IoT physical objects in our everyday lives will increasingly detect and share observations about us. How is it possible to reconcile these two aspects?
Vint Cerf: This is going to be a tough challenge. Videocams that help manage traffic flow may also be used to monitor individuals or vehicles without their permission or knowledge, for example (cf: UK these days). In residential applications, one might want (insist on) the ability to disable the devices manually, for example. One would also want assurances that such disabling cannot be defeated remotely through the software.
Q6. Let`s talk about more about security. It is reported that badly configured “smart devices” might provide a backdoor for hackers. What is your take on this?
Vint Cerf: It depends on how the devices are connected to the rest of the world. A particularly bad scenario would have a hacker taking over the operating system of 100,000 refrigerators. The refrigerator programming could be preserved but the hacker could add any of a variety of other functionality including DDOS capacity, virus/worm/Trojan horse propagation and so on.
One might want the ability to monitor and log the sources and sinks of traffic to/from such devices to expose hacked devices under remote control, for example. This is all a very real concern.
Q7. What measures can be taken to ensure a more “secure” IoT?
Vint Cerf: Hardware to inhibit some kinds of hacking (e.g. through buffer overflows) can help. Digital signatures on bootstrap programs checked by hardware to inhibit boot-time attacks. Validation of software updates as to integrity and origin. Whitelisting of IP addresses and identifiers of end points that are allowed direct interaction with the device.
Q8. Is there a danger that IoT evolves into a possible enabling platform for cyber-criminals and/or for cyber war offenders?
Vint Cerf: There is no question this is already a problem. The DYN Corporation DDOS attack was launched by a botnet of webcams that were readily compromised because they had no access controls or well-known usernames and passwords. This is the reason that companies must feel great responsibility and be provided with strong incentives to limit the potential for abuse of their products.
Q9. What are your personal recommendations for a research agenda and policy agenda based on advances in the Internet of Things?
Vint Cerf: Better hardware reinforcement of access control and use of the IOT computational assets. Better quality software development environments to expose vulnerabilities before they are released into the wild. Better software update regimes that reduce barriers to and facilitate regular bug fixing.
Q10. The IoT is still very much a work in progress. How do you see the IoT evolving in the near future?
Vint Cerf: Chaotic “standardization” with many incompatible products on the market. Many abuses by hackers. Many stories of bugs being exploited or serious damaging consequences of malfunctions. Many cases of “one device, one app” that will become unwieldy over time. Dramatic and positive cases of medical monitoring that prevents serious medical harms or signals imminent dangers. Many experiments with smart cities and widespread sensor systems.
Many applications of machine learning and artificial intelligence associated with IOT devices and the data they generate. Slow progress on common standards.
Vinton G. Cerf co-designed the TCP/IP protocols and the architecture of the Internet and is Chief Internet Evangelist for Google. He is a member of the National Science Board and National Academy of Engineering and Foreign Member of the British Royal Society and Swedish Royal Academy of Engineering, and Fellow of ACM, IEEE, AAAS, and BCS.
Cerf received the US Presidential Medal of Freedom, US National Medal of Technology, Queen Elizabeth Prize for Engineering, Prince of Asturias Award, Japan Prize, ACM Turing Award, Legion d’Honneur and 29 honorary degrees.
Follow us on Twitter: @odbsmorg
“Looking across the senior leadership in Government, very few top Civil Servants and Ministers have come from a technical background. Most Departments will have a CTO/CIO person who may or may not also been drawn from a relevant technical background. Those that are drawn from such a background and are empowered by their senior leadership, deliver a clear advantage to their organisation.”–Sarbjit Singh Bakhshi.
I have interviewed Sarbjit Singh Bakhshi, Director of Government Affairs at Maxeler Technologies. We covered in the interview the challenges and opportunities for the UK public sector in the Post-Brexit Era.
Q1. It has been suggested that some of the key challenges in government IT are: i) change aversion, ii) lack of technocratic leadership, and iii) processes that don’t scale down. What is your take on this?
Sarbjit Singh Bakhshi: Looking across the senior leadership in Government, very few top Civil Servants and Ministers have come from a technical background. Most Departments will have a CTO/CIO person who may or may not also been drawn from a relevant technical background. Those that are drawn from such a background and are empowered by their senior leadership, deliver a clear advantage to their organisation.
Where these people are not empowered, you will often find bad technical choices made by Departments that seem to be driven by short term commercial gain rather than long term interests. In the worst cases, they’d rather patch up systems knowing they will fail in the medium term than invest in a long term solution. This leads to rather inevitable problems when they can no longer pursue this strategy.
There are also issues in terms of understanding the total cost of computing. The cost of inefficient systems that consume vast amounts of electricity are often hidden from decision makers as running costs come out of an operational budget for the Department. So the decision makers will focus on a low purchase price for some systems even if the systems cost more in the long run to operate.
There needs to be a greater understanding of the challenges of running Government IT and there should be changes in leadership to support this. A focus on total cost of ownership (including transition costs) needs to be taken into account.
Q2. What are the main barriers to use new and innovative technologies in the UK Government?
Sarbjit Singh Bakhshi: Like most advanced countries that have been running big IT programmes from the 1950’s, the UK Government has a very heterogeneous IT estate and compatibility with older and archaic systems can still be a problem. There are further issues in terms of the threat of cyber warfare that need to be considered also when delivering new systems into this environment.
There are also issues around procurement which can stifle innovative and iterative approaches to technology deployment.
The recent move to create the G-Cloud does help in this respect, but we still see too many OJEU procedures which often crowd out smaller companies from the opportunities of working with Government.
Q3. How will Brexit affect the UK’s tech industry?
Sarbjit Singh Bakhshi: As Article 50 has just been invoked and nothing has been agreed it is a little too soon to tell.
Obviously, there are concerns in the UK’s Tech Industry around getting and recruiting the best staff from around the world in the UK, access to the Digital Single Market and any agreements the EU has around the safe storage of data. I’m sure these will all be top of mind for British negotiators.
Q4. What are the challenges and opportunities for the UK public sector in the Post-Brexit Era?
Sarbjit Singh Bakhshi: The challenges are manifold, the applicability of EU laws and the future of EU citizens in the UK and how that affects the administration of the country are probably paramount.
There are also opportunities for the British tech industry in the creation of parallel Governmental systems to replace the ones we are currently using that are from the EU.
Q5. What were your main motivations to leave the Government and go to Industry?
Sarbjit Singh Bakhshi: After many successful years of operation, Maxeler is emerging into an exciting new space. Maxeler has its first cloud based product with Amazon and is able to offer onsite and/or elastic cloud operations for the first time in its history. With more Government work moving to the cloud, this is an exciting time as we can bring the high levels of Maxeler performance to Government data sets at a reasonable cost.
We have applications in cyber security, big – complex data analysis and real time networking that are essential for Governments to deal with the emerging threats from around the world.
If one considers the wider context of Government – including research and scientific computing, Maxeler also has a good story to tell. We are in STFC Daresbury and can provide exceptionally performant computing at low energy cost for supercomputer environments. Complementing this we have an active university programme with over 150 top universities around the world where scientists are using DataFlow Engines for a variety of computational tasks that are outperforming traditional CPU setups.
Maxeler is ready to offer services to the UK Government as an approved G-Cloud supplier and can use its experience with large data sets to help Government move from on-premises hosting to take advantages of the cloud and its ultra-fast computing in line with the Government’s ‘Cloud First’ technology policy..
Q6. What are your aims and goals as new appointed Director of Government Affairs at Maxeler Technologies?
Sarbjit Singh Bakhshi: Improve our relations with Government in the UK and elsewhere. Maxeler technologies has experience of working with high pressure financial institutions, we’d like to offer Government the same opportunity to deal with its most pressing computational problems in a ultra-performant, energy efficient and easy to manage way.
Sarbjit Bakhshi joined Maxeler Technologies from a long career working for the British Government. Working mainly in areas of European Union Negotiations and Technology policy and promotion, Sarbjit’s appointment marks a new phase for Maxeler Technologies and its commitment to work with Government in the UK and overseas as part of its’ next phase of expansion.
Follow us on Twitter: @odbmsorg
“I do think that one of the fascinating outcomes of the progress of AI is that it gives us a new opportunity and new means of understanding the nature of human intelligence — a chance to better know ourselves. That’s a powerful thing, and a good thing.”–Brian Christian
I have interviewed Brian Christian, coauthor of the bestseller book Algorithms to Live By.
Q1. You have worked with cognitive scientist Tom Griffiths (professor of psychology and cognitive science at UC Berkeley) to show how algorithms used by computers can also untangle very human questions. What are the main lessons learned from such a joint work?
Brian Christian: I think ultimately there are three sets of insights that come out of the exploration of human decision-making from the perspective of computer science.
The first, quite simply, is that identifying the parallels between the problems we face in everyday life and some of the canonical problems of computer science can give us explicit strategies for real-life situations. So-called “explore/exploit” algorithms tell us when to go to our favorite restaurant and when to try something new; caching algorithms suggest — counterintuitively — that the messy pile of papers on your desk may in fact be the optimal structure for that information.
Second is that even in cases where there is no straightforward algorithm or easy answer, computer science offers us both a vocabulary for making sense of the problem, and strategies — using randomness, relaxing constraints — for making headway even when we can’t guarantee we’ll get the right answer every time.
Lastly and most broadly, computer science offers us a radically different picture of rationality than the one we’re used to seeing in, say, behavioral economics, where humans are portrayed as error-prone and irrational. Computer science shows us that being rational means taking the costs of computation — the costs of decision-making itself — into account. This leads to a much more human, and much more achievable picture of rationality: one that includes making mistakes and taking chances.
Q2. How did you get the idea to write a book that merges computational models with human psychology (*)?
Brian Christian: Tom and I have known each other for 12 years at this point, and I think both of us have been thinking about some of these questions our whole lives. My background is in computer science and philosophy, and my first book The Most Human Human uses my experience as a human “confederate” in the Turing test to ask a series of questions about how computer science is changing our sense of what it means to be human. Tom’s background is in psychology and machine learning, and his research focuses around developing mathematical models of human cognition. The idea of using computer science as a means for insights about human decision-making really emerged naturally as a consequence of those interests and inquiries. One night we were having dinner together and discussing our current projects and, long story short, we realized we were each writing the same book in parallel! It was immediately apparent that it should take the form of a single collaborative effort.
Q3 In your book you explore the idea of human algorithm design. What is it?
Brian Christian: The idea is, quite simply, to look for optimal ways of approaching everyday human decision making — and to do that by identifying the underlying computational structure of the problems we face in daily life.
“Optimal stopping” problems can teach us about when to look and when to leap; the explore/exploit tradeoff tells us when to try new things and when to stick with what we know and love; caching tells us how to manage our space; scheduling theory tells us how to manage our time.
Q4. What are the similarities between the workings of computers and the human mind?
Brian Christian: First I’ll turn that question on its head and highlight one of the biggest differences.
To paraphrase the UNM’s Dave Ackley, the imperative of a computer program is “the precisely correct answer, as prompt as possible,” whereas the imperative of animal cognition (including our own) is the reverse: “a prompt response, as correct as possible.”
As computer scientists explore systems with real-time constraints, and problems sufficiently difficult that grinding out the exact solution (no matter how long it takes) simply doesn’t make sense, we are starting to see that distinction begin to narrow.
Q5. Our lives are constrained by limited space and time, limits that give rise to a particular set of problems. Do computers, too, face the same constraints?
Brian Christian: Computers of course have space constraints — there are limits to how much data can be kept in the caches or the RAM, for instance, which gives rise to caching algorithms. These, in turn, offer us ways of thinking about how we manage the limited space in our own lives (in the book we compare some of Martha Stewart’s edicts about home organization to some of the canonical results in caching theory to see which hold water and which don’t).
Many systems, too, operate under time constraints, which offers us a whole other set of insights we can draw from.
For instance, algorithms for high-frequency trading must determine in a matter of microseconds whether to take an offer or to let it go by — do you hold out for a better price, risking that you may never get as good of a deal ever again?
Many human decisions take this form, across a wide range of domains. This type of “optimal stopping” structure underlies everything from buying and selling houses to our romantic lives. Modern operating systems also include what’s known as a “scheduler,” which determines the best way to make use of the CPU’s limited time. There have been a number of high-profile cases of scheduling failures, including the 1997 Mars Pathfinder mission, where the lander made it all the way to the Martian surface successfully, but then appeared to start procrastinating once it got there: whiling away on low-priority tasks while critical work languished. Studying these failures and the methods for avoiding them can in turn give us strategies for making the most of our own limited time.
Q6. Can computer algorithms help us to have better hunches?
Brian Christian: I think so. For instance, we have a chapter on inductive reasoning that focuses on Bayesian inference. There are some lovely rules of thumb that come out of that. For instance, if you need to predict how long something will last — whether it’s how long a romantic relationship will continue, how long a company will exist, or simply how long it will be before the next bus pulls up — the best you can do, absent any familiarity with the domain, is to assume you’re exactly halfway through, and so it will last exactly as long into the future as it’s lasted already. More broadly, one of the upshots of developing an understanding of Bayesian inference is that if you have experience in a domain, your hunches are likely to be quite good. That’s intuitive enough, but the problem comes when your experiences are not a representative sample of reality.
In the modern world, you can get situations where gun violence in reality is in decline, yet the representation of gun violence in the news is going up. For this reason, it’s probably harder to be a good Bayesian than it’s ever been.
Q7. What will happen if AI system becomes better than humans at most or all cognitive tasks?
Brian Christian: That’s a huge question, and the subject in part of my next book. I think a fundamental restructuring of society is likely to happen — and may be necessary. That’s likely to be a bumpy ride, but it will raise critical and important questions.
And I do think that one of the fascinating outcomes of the progress of AI is that it gives us a new opportunity and new means of understanding the nature of human intelligence — a chance to better know ourselves. That’s a powerful thing, and a good thing.
Brian Christian is the author of The Most Human Human, a Wall Street Journal bestseller, New York Times editors’ choice, and New Yorker favorite book of the year. His writing has appeared in The New Yorker, The Atlantic, Wired, The Wall Street Journal, The Guardian and The Paris Review, as well as scientific journals such as Cognitive Science, and has been translated into eleven languages. He lives in San Francisco.
Follow us on Twitter: @odbmsorg
“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah
I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.
Main topics of the interview are: the new developments in Apache Spark 2.0 Beta, and Hadoop 3.0.0-alpha1 release ; the lessons learned from Amr´s experience of using Hadoop at Yahoo!; and the business problems that world’s leading organisations do have.
Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?
Amr Awadallah: Couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo.
(1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes),
(2) our systems weren’t economical on a $/TB basis thus making it hard to retain valuable data for longer time periods, and (3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson that I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). It was these lessons that made it clear that there is room for a startup to focus on Hadoop since (1) it was solving a very real data problems that many organizations will face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.
Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?
Amr Awadallah: Pretty much to do what I describe above, we wanted to make the Hadoop technology easy to use for organizations. That included: (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop). (2) We also created a number of proprietary system management, security, and meta-data management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.
Q3. What are the typical challenging business problems that world’s leading organisations have?
Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics. i.e. a lot of organizations just use us to do things they have been doing already, just doing these things in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets, but rather creating a dynamic micro-segment addressing each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use-case is catalyzed by of the internet-of-things. Due to smart-sensors we are able to measure the real-world better than ever before; so this use-case is about taking all that data and leveraging it to either enhance our current product/service offerings, or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.
Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drive business decisions?
Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, few brain signals), and based on that a message appears on the monitor above the infant showing the nurse if they are hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboard).
Q4. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?
Amr Awadallah: Graphs are indeed important, a lot of our customer use-cases trace back to that (not just for social media analytics, but for example anti-money laundering requires analyzing relationships between many financial accounts for detecting bad behaviors, similarly for cyber security applications). I think scalability depends a fair bit on what’s being analyzed and how scalable we mean by scalable. But for most practical purposes I would say Spark’s GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
Q5. Data security is increasing important. The risk is due to the growing number of device endpoints. What solutions do exist to minimise such risk?
Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy based enforcement of endpoint security configuration prior to granting and endpoint access to network based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including end point devices). Apache Spot just recently won InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we closely work with are: CounterTack, Securonix, Niara, and Jask.
Q6. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?
Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, that made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That coupled with a simpler developer API made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify however that this doesn’t mean that Hadoop is dead as some say. Apache Hadoop is comprised of three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce, we still rely heavily on both YARN and HDFS.
That said, the most notable features in Apache Spark 2.0 are:
1) Dataset API: It is a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s Dataframe API. It improves upon the Dataframe API by providing type-safe, object oriented programming interfaces. Users can now write User-Defined Functions and Lambda functions that provide compile time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc) in the SparkSQL engine, while also getting compile time type safety for user defined functions.
2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.
3) Structured Streaming: New stream processing API and engine that provides SQL like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.
Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.
Q7. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?
Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance highly-concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).
Some of the key benefits of the Impala approach include:
* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also any application that runs on cloud-native S3 storage (for example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, while it isn’t possible to have Apache Spark easily access the data stored in Redshift, it has to go through SQL first).
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL
Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.
That said, Redshift’s sweet spot is in a different target as a smaller datamart as most Redshift installations are in the dozen of nodes range where Redshift’s limitations in scalability, elasticity, flexibility, and requirement to maintain separate copies of data are less critical.
Q8. What is Apache Kudu, and why is it relevant for Impala Users?
Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.
Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file-system, it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records are accumulated (that adds latency between when records are written to when they become visible to the analytical engine). With Kudu as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.
Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?
Amr Awadallah: HDFS Erasure Encoding is really the main exciting new feature in Hadoop 3. Traditionally HDFS required three replicas, by default, for every data block to achieve durability, concurrent performance, and availability. Using erasure encoding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x (i.e. 200%) to just 20% extra bits for parity. This will allow us to achieve the same durability benefits of 3x replication, but comes at the cost of potentially lower concurrent performance (when more than one job are trying to access the same block at same time) and lower availability resilience in face of top-of-rack switch failures (less of an issue these days).
Other cool additions are ATS v2 and classpath isolation which you can read more about here
Q10. What is the roadmap ahead for Cloudera Enterprise?
Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast-analytics on fast-moving data (which I covered above in regards to Kudu).
The second theme is cloud, which is making Cloudera Enterprise work better in cloud environments, and make it easier to move workloads (and skill sets) from on-premise clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds a Bachelor’s and Master’s degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.
Follow us on Twitter: @odbmsorg
“Digital labor is the name for a new class of tools that can automate routine cognitive tasks. The benefits of automation are similar to previous waves. Many years ago I helped automate a reconciliation function for a large asset manager. Humans took authorization reports from their investment control system and matched them against the confirmations coming from their counterparts. This was a terrible job, and luckily no one does this anymore.
Digital labor has the potential to improve the financial services sector by improving compliance, providing more analytics for risk and control functions, and improving efficiency.”–Michael Henry
I have interviewed Michael Henry, Principal at KPMG LLP. In the interview we covered the challenges faced by financial institutions due to existing regulations standards, KPMG`s solution to automate the onboarding process for their clients, and the potential impact of Digital labor for the financial services sector.
Q1. The Organisation for Economic Co-operation and Development (OECD) proposed a Common Reporting Standard (CRS) for the Automatic Exchange of Information (AEOI) that implies a significant increase in the customer due diligence and reporting obligations of financial institutions across the world. What is the implication for your clients?
Michael Henry: The new reporting requirement will require financial institutions to collect and examine more information about their clients for the purposes of tax withholding and reporting. Banks and other regulated institutions will have to examine information from their clients to make sure they are reporting their true residence for tax purposes. This is similar to the US Internal Revenue Service’s FATCA requirements. And like FATCA, many banks will respond by asking for more documentation from their clients and adding staff to perform due diligence on that documentation.
Q2. Specifically, what is “client on boarding”? How is it normally implemented by large financial institutions?
Michael Henry: Client on boarding refers to the series of processes that a financial institution undergoes to determine whether or not it should move forward with conducting or renewing business with a given customer.
The term is inclusive of the underlying regulatory and compliance practices governed by anti-money laundering (AML) and know-your-customer (KYC) rules.
Many large financial institutions deploy thousands of staff, often in low cost offshore locations to perform this function. These staff are usually equipped with basic workflow and data management technology. At Tier 1 organizations this can cost hundreds of millions of dollars annually while pinning their reputations on the shoulders of junior resources making subjective compliance policy interpretations.
For this basic client identification and validation process, one of our clients employs thousands of people in an offshore location. Because this work is boring and repetitive, the client tells us that the attrition rate is more than 10% per month. This presents an enormous risk to the business, as banks entrust their client experience, business results, and reputations to cheap clerical labor that likely joined the bank only a few months ago.
Q3. What are the typical problems?
Michael Henry: The bank must collect information to identify the client and determine the risk that the client will engage in some kind of unlawful activity. To perform this function, the bank must process a large number of data that enter the bank electronically, or through documents. Reading and interpreting documents and trying to apply complex compliance rules using manual processes is time-consuming, error-prone, and expensive.
Technology – Workflow, case management, relational databases, and imaging technologies while mature and effective, still require human beings to read, transcribe, and interpret data.
Inconsistency – Human operators interpret complex decision-trees of rules. The risk of subjectivity grows with the size of the operation.
Accuracy – The majority of today’s onboarding representatives execute what amount to “stare and compare” and “stare, copy and enter” processes. Over the course of a business day in which hundreds of pages or documents will be read and thousands of keystrokes completed, it is inevitable that operator errors will occur.
Q4. You have worked on a solution as a service to automate the onboarding process for your clients. Can you explain in a nutshell how did you do it?
Michael Henry: The solution is comprised of multiple digital labor components to read documents and apply policy rules by machines instead of people.
Humans focus on exceptions, i.e., cases which really require human judgment. Because the exception rates are low, much of the activity becomes straight-through.
The technology uses a combination of robotics, big data, and natural language processing integrated for the solution of KYC, AML, Tax classification, and other compliance activities.
Q5. How difficult was to integrate domain knowledge into advanced technology?
Michael Henry: Domain knowledge is critical. KPMG invested significant regulatory and compliance expertise to reinvent this process for ourselves and our clients. The technology only works because of this investment.
We use advanced technology, but it is all commercially available. Our ability to define specific ontologies and compliance rules on that technology is the differentiator.
Q6. How do you capture information from SEC filings, blog entries, social media, text messages and other sources of structured and unstructured data without manual intervention?
Michael Henry: We capture information from structured and unstructured sources through a combination of technologies. Optical character recognition (OCR) and natural language processing (NLP) software drive our content enrichment process. This allows our platform to ingest unstructured documents (with or without metadata), identify them, and then extract the relevant content according to our ontological models. Some exception processing occurs at this stage, especially if the quality of the documentation is poor.
Q7. How do you integrate, organize and mine customer data?
Michael Henry: Customer data are ingested to the platform through system extracts, tying in to document repositories and the establishment of secure FTP sites. These data then pass through our content enrichment engine and ultimately reside in our MarkLogic NoSQL database.
Q8. Why did you choose MarkLogic’s Enterprise NoSQL database?
Michael Henry: First, we are solving mission-critical subjects for the world’s leading financial institutions. We needed to have an institutional-grade, enterprise-hardened database at the core of our platform.
Second is given the size of the data sets involved, we needed to have a highly scalable database that could handle petabytes of data while simultaneously staging and orchestrating multiple run-time sequences. Finally, we found MarkLogic very aligned to our vision and a good partner in bringing the solution to market.
Q9. How do you use semantics, text analytics and visualisation?
Michael Henry: Semantic analysis allows us to handle unstructured data in natural language formats. Extracting the list of beneficial owners from a 100-page trust document can take a human hours. The tools are so proficient now, that with the right ontological models we can obtain dozens of data from an unstructured document at high volumes with little human intervention. We have been able to ingest hundreds of individual loan documents and produce a data hierarchy by client, by loan, and by event.
Q10. What results did you obtain so far? What is the order of magnitude reduction in human efforts you obtained? As human involvement in the process declines, is the number of errors in reports also declining?
Michael Henry: Today, we serve more than 20 clients. In the tax compliance area, a human may spend more than an hour ingesting a W8 form and conducting due diligence. Most of this is reading KYC documents. Our platform has the ability to handle more than 10 of these per hour per human exception handler. If the task involves humans reading documents and applying validation or other policies, and the rate of actual exceptions is low, we can take 80-90% of the manual effort out. And the tools keep getting better.
More important than the productivity gain is the consistency and accuracy of the automation. No human operator can apply thousands of policy rules consistently. We continue to tune our models, and the machine never forgets.
Q11. In your opinion, what is the impact of the introduction of “Digital Labor”services for the job service market and for the society at large?
Michael Henry: Digital labor is the name for a new class of tools that can automate routine cognitive tasks. The benefits of automation are similar to previous waves. Many years ago I helped automate a reconciliation function for a large asset manager. Humans took authorization reports from their investment control system and matched them against the confirmations coming from their counterparts. This was a terrible job, and luckily no one does this anymore.
Digital labor has the potential to improve the financial services sector by improving compliance, providing more analytics for risk and control functions, and improving efficiency.
Michael Henry Principal, Financial Services, KPMG LPP
Michael is a Principal in KPMG’s Digital Labor practice with more than 25 years’ experience in financial services. Michael specializes in the application of sophisticated technologies (big data, natural language processing, artificial intelligence, machine learning, workflow and robotics) to automate compliance processes. Michael has worked with global and regional banks, and his experience includes living and working in Europe and Asia.
– ￼FATCA Onboarding & Compliance Solution. KPMG, 2015 (LINK to .PDF)
Follow us on Twitter: @odbmsorg
“While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications particularly those that are highly interactive need to bring a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.” –Ofer Bengal.
I have interviewed Ofer Bengal, Co-Founder and CEO of Redis Labs, and Yiftach Shoolman, Co-Founder and CTO of Redis Labs.
Main topics of the interview are: How is the database market evolving, proprietary vs. open source software, in-memory/ key-value data stores, and the new features of Redis.
Q1. How do you see the database market evolving?
Ofer Bengal, Yiftach Shoolman: The main trends we identify today and believe will continue in upcoming years are:
1) Non-relational databases will continue to see growing adoption, because the schema framework is ineffective when it comes to unstructured data, change in data patterns, growing data volumes, more stringent performance requirements and the way modern apps are built.
2) Multiple database models as opposed to the absolute dominance of RDMS in the past few decades, each model solving the requirements of certain use cases.
Moreover, certain modern databases can run several database models (document, graph, etc.)
3) Multiple databases (different types or the same type) serving the same app. Modern applications are based on micro service architecture, in which each micro service works with the best database for its use case.
This creates new challenges for modern databases: (a) Instant provisioning – sometime hundreds or thousands of databases are provisioned within a second, and (b) Multi-tenancy, otherwise the cost associated with managing database infrastructure becomes extremely high.
4) Database-as-a-service is growing vs. self deployed and operated databases. With enterprises gradually moving to the cloud and having to deal with multiple type databases, it makes a lot of sense to outsource deployment and ongoing operations rather than building in-house practice of DBAs and Devops.
5) Hybrid transactional and analytical processing (HTAP). Driven by the need for application analytics to drive business decision making in real time, certain modern databases can handle those two different workloads simultaneously, eliminating the need for exporting transactional data to a separate dedicated analytical database.
Q2. Proprietary vs. open source software: what are the pros and cons?
Ofer Bengal, Yiftach Shoolman: From the community perspective, open source is great. If there is a vibrant community, it pushes innovation, problem solving and compatibility issues with different environments.
From users perspective, open source is “open”, accessible, can be used by anyone, transparent, and free of charge.
It often comes with less of a danger of vendor lock-in. It is very suitable for independent developers and startups. However enterprises using open source products may have certain challenges:
1. The product is not always suitable for enterprise workloads, especially when it comes to databases. Capabilities like infinite seamless scaling, high-availability with instant failover and stable performance at scale are not always the open source developer’s top priority.
2. Commercial support must be obtained and this typically comes with a price tag which is not much different than acquiring a commercial database product.
3. Commercial support is typically provided by a single company (most probably founded by the open source creators), which creates “vendor lock-in” by itself.
4. In the case of databases, using database-as-a-service may turn out to be lower in cost compared to provisioning cloud instances and running zero cost open source software on them, because commercial can be based on efficient multi-tenant architecture.
Q3. What is the current market for in-memory, key-value data stores?
Ofer Bengal: In-memory key-value data stores (sometimes called in-memory data grids (IMDGs)) have been around since more than a decade and have proven capable of supporting digital business needs for responsive, always-on user experience; real-time, actionable insights; and dynamic scaling. They are widely employed when you want to scale/modernize legacy applications without spending additional money on extremely expensive RDBMS licenses and hardware.This is achieved by providing a scalable and reliable in-memory datastore that enables low-latency transactional and analytical processing.
While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications particularly those that are highly interactive need to bring a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.
From a Redis perspective, our innovation in data structures brings about the ability to simplify development to the extent that now most Redis users use it as a first responder and primary datastore for substantial pieces of their data. Furthermore with Redis’ data-structures, users can run operational and analytical use cases on the same database.
In addition, acceleration of other in-memory platforms like Spark is possible with Redis.
Gartner estimates that, in 2015, the stand-alone IMDG market was worth approximately $600 million, having grown by about 30% from the previous year. Gartner expects the market to continue to grow in the double-digit range through 2020 and to exceed $1 billion by 2018. Redis, one of the leaders in this space, grew in just a few years to be one of the most popular databases used by developers and enterprises.
Q4. Amazon ElastiCache supports two open-source in-memory engines: Redis and Memcached. What does it mean in practice?
Yiftach Shoolman: In practice, Amazon ElastiCache is a simple caching service that simplifies a developer experience by providing these two open source in-memory engines. Legacy applications that use simple cache can use ElastiCache seamlessly.
However, ElastiCache is single-tenant, limited to caching use cases and cannot be used as a database, lacking enterprise-grade functionalities such as infinite seamless scalability, instant failover and predictable performance.
The Redis Labs equivalent service, called Redis Cloud provides all the benefits of an enterprise-class Redis.
Q5. What are the pros and cons of Memcached and Redis?
Yiftach Shoolman: Redis can be thought of as modern database while memcached is older technology designed specifically for ephemeral caching.
The most important difference is in persistence and HA – memcached is not persistent nor HA, while Redis can operate as a full-fledged in-memory database, highly available through both in-memory replication and data persistence. This reflects the fact that caches in older architectures were not required to be highly available, but in modern architectures, built for scale and volume, cache outages can significantly impact the business and user experience.
Redis, the newer and more versatile technology allows individual data elements to be manipulated while memcached often incurs serialization/deserialization overheads that makes the entire application processing much slower. This is because Memcached can handle only simple key value use cases, whereas Redis offers many more data structures (hashes, sets, sorted sets, lists, hyperloglog..) that simplify complex data processing, analysis and operational use cases with ease.
Even when used as a cache, Redis has more sophisticated eviction policies which can be both active or passive while memcached has only a simple LRU and lazy eviction.
Redis and Memcached are both very popular open source projects, but given its richer functionality, more advanced design, many potential uses, and greater cost efficiency at scale, Redis should be your first choice in nearly every case.
Q6. For very large data sets or analytics workloads, running everything in-memory might not be cost effective. What is your take on this?
Ofer Bengal, Yiftach Shoolman: For very large data sets or analytics workloads, it is advantageous to utilize alternative memory technologies(such as Flash memory, which is a tenth of the cost), as extensions of memory rather than impose a disk access penalty. We have extended enterprise Redis in this manner to take advantage of Flash memory, while using a tiered approach (keys and hot values are still in the fastest memory, while cold values are in “slower” Flash memory) to ensure that you still see sub-millisecond latencies with millions of ops/sec throughput.
Q7. Redis was created by Salvatore Sanfilippo in 2009. What is his role today?
Ofer Bengal: Salvatore is leading the development of open source Redis within Redis Labs. He works with a group of experienced developers on extending the capabilities of Redis. A good example of this collaborative works is the recent introduction of Redis Modules, which extend Redis to a variety of new modern use cases. Salvatore wrote the API and the other team members in a very short time created and tested a few modules, such as Redisearch (a full-text search engine) and Redis-ML (enhancing the performance of Spark machine learning capabilities). Salvatore’s role is to continue the community innovation around the Redis core, together with his team of Redis Labs developers.
Q8. What are the differences of Redis Labs` version of Redis with the original one developed in 2009?
Yiftach Shoolman: Redis Labs fully supports the open source Redis versions, but enhances them with a container-like layer that adds a proxy, cluster management and a shared nothing architecture. Taken together, Redis Labs provides a solid enterprise foundation to Redis, allowing it to scale seamlessly in memory across many hundreds of servers with the high availability through persistence, in-memory cross-rack/zone/region/datacenter replication and instant automatic failover. No retooling or re-architecting is required to move from open source Redis to enterprise Redis, the process is basically effortless and immediate. Redis Labs also offers various database modules, like a RediSearch, multiple probabilistic modules like Bloom Filter, TopK, CMS, Redis-ML for Machine Learning, Redis-TS for Time Series processing, JSON and Graph support.
Q9. What are the possible scenarios of using Redis for data analytics?
Ofer Bengal, Yiftach Shoolman: Redis data structures come with built-in simple analytic operations like counting, ranking, scoring, ranges and more. Over time, probabilistic data structures have added the ability to analytically estimate millions and trillions of events, without requiring memory to store all of the events.
Set operations have made it possible to simplify comparisons, intersections, unions of sets – analytics that are usually complicated with data stores. RQL (Redis SQL) and secondary indexing, allows executing complex SQL queries on an existing Redis database. And finally recent modules like RediSearch, Neural Redis and Redis-ML have added advanced search and machine learning capabilities – not naturally occurring in any other databases.
With all of these possibilities, and with the move to automated decision making, we see increasing usage of Redis for data analytics scenarios.
Q10. How safe is a Redis server?
Yiftach Shoolman: The Redis enterprise server comes with client-based SSL authentication, built-in cloud firewall support (when running on public clouds), password authentication and role-based authorization that enables customizing security levels.
Qx. Anything else you wish to add?
Ofer Bengal: Redis is a game -changer when it comes to databases, and its progression over the last seven years has demonstrated that the industry and market are demanding performance and increasing flexibility to deal with all types of data processing, storage and analytic scenarios. Redis’ core values have always included high performance, high throughput and very low latencies. With the visionary addition of modules. The community has turned it into an all purpose datastore – suitable for any scenario that needs a database.
Ofer Bengal – Co-Founder and CEO of Redis Labs
Ofer is a serial entrepreneur who has founded and led several companies in the areas of data communications, telecommunications, Internet, homeland security and medical devices. Ofer was founder & CEO of RIT Technologies (NASDAQ: RITT), a provider of sophisticated telecommunications and data communications systems to major world carriers. He began his career as an aerospace engineer in the Israeli Air Force and then built his own aerospace engineering consulting firm. As a hobby, he has also invented, developed and licensed toy concepts to companies such as Milton Bradley, Hasbro and Tomy. Ofer holds a Bachelor of Science (cum laude) in aerospace engineering from the Technion, Israel Institute of Technology.
Yiftach Shoolman – Co-Founder and CTO of Redis Labs
Yiftach is an experienced technologist, having held leadership engineering and product roles in diverse fields from application acceleration, cloud computing and software-as-a-service (SaaS), to broadband networks and metro networks. He was the founder, president and CTO of Crescendo Networks (acquired by F5, NASDAQ:FFIV), the vice president of software development at Native Networks (acquired by Alcatel, NASDAQ: ALU) and part of the founding team at ECI Telecom broadband division, where he served as vice president of software engineering. Yiftach holds a Bachelor of Science in Mathematics and Computer Science and has completed studies for Master of Science in Computer Science at Tel-Aviv University.
Follow us on Twitter: @odbmsorg