“I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices.” –Vint G. Cerf
I had the pleasure of interviewing Vinton G. Cerf. Widely known as one of the “Fathers of the Internet,” Cerf is the co-designer of the TCP/IP protocols and the architecture of the Internet. The main topic of the interview is the Internet of Things (IoT) and its challenges, especially the safety, security and privacy of IoT devices.
Vint is currently Chief Internet Evangelist for Google.
Q1. Do you like the Internet of Things (IoT)?
Vint Cerf: This question is far too general to answer. I like the idea behind programmable, communicating devices and I believe there is great potential for useful applications. At the same time, I am extremely concerned about the safety, security and privacy of such devices. Penetration and re-purposing of these devices can lead to denial of service attacks (botnets), invasion of privacy, harmful dysfunction, serious security breaches and many other hazards. Consequently the makers and users of such devices have a great deal to be concerned about.
Q2. Who is going to benefit most from the IoT?
Vint Cerf: The makers of the devices will benefit if they become broadly popular and perhaps even mandated to become part of a local ecosystem. Think “smart cities” for example. The users of the devices may benefit from their functionality, from the information they provide that can be analyzed and used for decision-making purposes, for example. But see Q1 for concerns.
Q3. One of the most important requirements for collections of IoT devices is that they guarantee physical safety and personal security. What challenges does the pervasive introduction of sensors and devices pose from a safety and privacy perspective? (e.g. at home, in cars, hospitals, wearables and ingestibles, etc.)
Vint Cerf: Access control and strong authentication of parties authorized to access device information or control planes will be a primary requirement. The devices must be configurable to resist unauthorized access and use. Putting physical limits on the behavior of programmable devices may be needed or at least advisable (e.g., cannot force the device to operate outside of physically limited parameters).
Q5. Consumers want privacy. With IoT, physical objects in our everyday lives will increasingly detect and share observations about us. How is it possible to reconcile these two aspects?
Vint Cerf: This is going to be a tough challenge. Videocams that help manage traffic flow may also be used to monitor individuals or vehicles without their permission or knowledge, for example (cf: UK these days). In residential applications, one might want (insist on) the ability to disable the devices manually, for example. One would also want assurances that such disabling cannot be defeated remotely through the software.
Q6. Let’s talk more about security. It is reported that badly configured “smart devices” might provide a backdoor for hackers. What is your take on this?
Vint Cerf: It depends on how the devices are connected to the rest of the world. A particularly bad scenario would have a hacker taking over the operating system of 100,000 refrigerators. The refrigerator programming could be preserved but the hacker could add any of a variety of other functionality including DDOS capacity, virus/worm/Trojan horse propagation and so on.
One might want the ability to monitor and log the sources and sinks of traffic to/from such devices to expose hacked devices under remote control, for example. This is all a very real concern.
Q7. What measures can be taken to ensure a more “secure” IoT?
Vint Cerf: Hardware to inhibit some kinds of hacking (e.g. through buffer overflows) can help. Digital signatures on bootstrap programs checked by hardware to inhibit boot-time attacks. Validation of software updates as to integrity and origin. Whitelisting of IP addresses and identifiers of end points that are allowed direct interaction with the device.
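Two of the measures listed above, validating an update's integrity and origin and whitelisting the endpoints allowed to talk to a device, can be sketched in a few lines. This is only an illustration and not something described in the interview: the key, the network ranges and the function names are hypothetical, and a real device would more likely verify an asymmetric digital signature (so that no signing secret sits on the device) than the HMAC used here to keep the example within the Python standard library.

```python
import hashlib
import hmac
import ipaddress

# Hypothetical allowlist of networks permitted to interact with the device.
ALLOWED_NETWORKS = [
    ipaddress.ip_network("192.168.1.0/24"),
    ipaddress.ip_network("10.0.0.5/32"),
]


def is_allowed(peer_ip: str) -> bool:
    """Return True only if the peer address lies inside an allowlisted network."""
    addr = ipaddress.ip_address(peer_ip)
    return any(addr in net for net in ALLOWED_NETWORKS)


def sign_update(image: bytes, key: bytes) -> str:
    """Vendor side: compute an HMAC-SHA256 tag over the update image."""
    return hmac.new(key, image, hashlib.sha256).hexdigest()


def verify_update(image: bytes, tag: str, key: bytes) -> bool:
    """Device side: accept the image only if its tag matches (constant-time compare)."""
    expected = hmac.new(key, image, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, tag)


if __name__ == "__main__":
    key = b"hypothetical-shared-secret"  # assumption: provisioned at manufacture
    image = b"firmware v1.2"
    tag = sign_update(image, key)
    print(verify_update(image, tag, key))         # untouched image
    print(verify_update(image + b"!", tag, key))  # tampered image
    print(is_allowed("192.168.1.42"), is_allowed("203.0.113.9"))
```

The device refuses any image whose tag does not match and ignores peers outside the allowlist; both checks are cheap enough to run on constrained hardware.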
Q8. Is there a danger that IoT evolves into a possible enabling platform for cyber-criminals and/or for cyber war offenders?
Vint Cerf: There is no question this is already a problem. The DYN Corporation DDOS attack was launched by a botnet of webcams that were readily compromised because they had either no access controls or well-known usernames and passwords. This is the reason that companies must feel great responsibility and be provided with strong incentives to limit the potential for abuse of their products.
Q9. What are your personal recommendations for a research agenda and policy agenda based on advances in the Internet of Things?
Vint Cerf: Better hardware reinforcement of access control and use of the IOT computational assets. Better quality software development environments to expose vulnerabilities before they are released into the wild. Better software update regimes that reduce barriers to and facilitate regular bug fixing.
Q10. The IoT is still very much a work in progress. How do you see the IoT evolving in the near future?
Vint Cerf: Chaotic “standardization” with many incompatible products on the market. Many abuses by hackers. Many stories of bugs being exploited or serious damaging consequences of malfunctions. Many cases of “one device, one app” that will become unwieldy over time. Dramatic and positive cases of medical monitoring that prevents serious medical harms or signals imminent dangers. Many experiments with smart cities and widespread sensor systems.
Many applications of machine learning and artificial intelligence associated with IOT devices and the data they generate. Slow progress on common standards.
Vinton G. Cerf co-designed the TCP/IP protocols and the architecture of the Internet and is Chief Internet Evangelist for Google. He is a member of the National Science Board and National Academy of Engineering and Foreign Member of the British Royal Society and Swedish Royal Academy of Engineering, and Fellow of ACM, IEEE, AAAS, and BCS.
Cerf received the US Presidential Medal of Freedom, US National Medal of Technology, Queen Elizabeth Prize for Engineering, Prince of Asturias Award, Japan Prize, ACM Turing Award, Legion d’Honneur and 29 honorary degrees.
Follow us on Twitter: @odbmsorg
“Looking across the senior leadership in Government, very few top Civil Servants and Ministers have come from a technical background. Most Departments will have a CTO/CIO person who may or may not also have been drawn from a relevant technical background. Those that are drawn from such a background and are empowered by their senior leadership deliver a clear advantage to their organisation.”–Sarbjit Singh Bakhshi.
I have interviewed Sarbjit Singh Bakhshi, Director of Government Affairs at Maxeler Technologies. In the interview we covered the challenges and opportunities for the UK public sector in the Post-Brexit Era.
Q1. It has been suggested that some of the key challenges in government IT are: i) change aversion, ii) lack of technocratic leadership, and iii) processes that don’t scale down. What is your take on this?
Sarbjit Singh Bakhshi: Looking across the senior leadership in Government, very few top Civil Servants and Ministers have come from a technical background. Most Departments will have a CTO/CIO person who may or may not also have been drawn from a relevant technical background. Those that are drawn from such a background and are empowered by their senior leadership deliver a clear advantage to their organisation.
Where these people are not empowered, you will often find bad technical choices made by Departments that seem to be driven by short term commercial gain rather than long term interests. In the worst cases, they’d rather patch up systems knowing they will fail in the medium term than invest in a long term solution. This leads to rather inevitable problems when they can no longer pursue this strategy.
There are also issues in terms of understanding the total cost of computing. The cost of inefficient systems that consume vast amounts of electricity is often hidden from decision makers, as running costs come out of an operational budget for the Department. So the decision makers will focus on a low purchase price for some systems even if the systems cost more in the long run to operate.
There needs to be a greater understanding of the challenges of running Government IT and there should be changes in leadership to support this. Total cost of ownership (including transition costs) needs to be taken into account.
Q2. What are the main barriers to using new and innovative technologies in the UK Government?
Sarbjit Singh Bakhshi: Like most advanced countries that have been running big IT programmes since the 1950s, the UK Government has a very heterogeneous IT estate and compatibility with older and archaic systems can still be a problem. There are further issues in terms of the threat of cyber warfare that also need to be considered when delivering new systems into this environment.
There are also issues around procurement which can stifle innovative and iterative approaches to technology deployment.
The recent move to create the G-Cloud does help in this respect, but we still see too many OJEU procedures which often crowd out smaller companies from the opportunities of working with Government.
Q3. How will Brexit affect the UK’s tech industry?
Sarbjit Singh Bakhshi: As Article 50 has just been invoked and nothing has been agreed it is a little too soon to tell.
Obviously, there are concerns in the UK’s Tech Industry around recruiting the best staff from around the world to the UK, access to the Digital Single Market and any agreements the EU has around the safe storage of data. I’m sure these will all be top of mind for British negotiators.
Q4. What are the challenges and opportunities for the UK public sector in the Post-Brexit Era?
Sarbjit Singh Bakhshi: The challenges are manifold: the applicability of EU laws, the future of EU citizens in the UK, and how that affects the administration of the country are probably paramount.
There are also opportunities for the British tech industry in the creation of parallel Governmental systems to replace the ones we are currently using that are from the EU.
Q5. What were your main motivations to leave the Government and go to Industry?
Sarbjit Singh Bakhshi: After many successful years of operation, Maxeler is emerging into an exciting new space. Maxeler has its first cloud based product with Amazon and is able to offer onsite and/or elastic cloud operations for the first time in its history. With more Government work moving to the cloud, this is an exciting time as we can bring the high levels of Maxeler performance to Government data sets at a reasonable cost.
We have applications in cyber security, big and complex data analysis, and real-time networking that are essential for Governments to deal with the emerging threats from around the world.
If one considers the wider context of Government – including research and scientific computing, Maxeler also has a good story to tell. We are in STFC Daresbury and can provide exceptionally performant computing at low energy cost for supercomputer environments. Complementing this we have an active university programme with over 150 top universities around the world where scientists are using DataFlow Engines for a variety of computational tasks that are outperforming traditional CPU setups.
Maxeler is ready to offer services to the UK Government as an approved G-Cloud supplier and can use its experience with large data sets to help Government move from on-premises hosting to take advantage of the cloud and its ultra-fast computing in line with the Government’s ‘Cloud First’ technology policy.
Q6. What are your aims and goals as the newly appointed Director of Government Affairs at Maxeler Technologies?
Sarbjit Singh Bakhshi: Improve our relations with Government in the UK and elsewhere. Maxeler Technologies has experience of working with high-pressure financial institutions; we’d like to offer Government the same opportunity to deal with its most pressing computational problems in an ultra-performant, energy-efficient and easy-to-manage way.
Sarbjit Bakhshi joined Maxeler Technologies from a long career working for the British Government. Working mainly in areas of European Union Negotiations and Technology policy and promotion, Sarbjit’s appointment marks a new phase for Maxeler Technologies and its commitment to work with Government in the UK and overseas as part of its next phase of expansion.
“I do think that one of the fascinating outcomes of the progress of AI is that it gives us a new opportunity and new means of understanding the nature of human intelligence — a chance to better know ourselves. That’s a powerful thing, and a good thing.”–Brian Christian
I have interviewed Brian Christian, coauthor of the bestselling book Algorithms to Live By.
Q1. You have worked with cognitive scientist Tom Griffiths (professor of psychology and cognitive science at UC Berkeley) to show how algorithms used by computers can also untangle very human questions. What are the main lessons learned from such a joint work?
Brian Christian: I think ultimately there are three sets of insights that come out of the exploration of human decision-making from the perspective of computer science.
The first, quite simply, is that identifying the parallels between the problems we face in everyday life and some of the canonical problems of computer science can give us explicit strategies for real-life situations. So-called “explore/exploit” algorithms tell us when to go to our favorite restaurant and when to try something new; caching algorithms suggest — counterintuitively — that the messy pile of papers on your desk may in fact be the optimal structure for that information.
Second is that even in cases where there is no straightforward algorithm or easy answer, computer science offers us both a vocabulary for making sense of the problem, and strategies — using randomness, relaxing constraints — for making headway even when we can’t guarantee we’ll get the right answer every time.
Lastly and most broadly, computer science offers us a radically different picture of rationality than the one we’re used to seeing in, say, behavioral economics, where humans are portrayed as error-prone and irrational. Computer science shows us that being rational means taking the costs of computation — the costs of decision-making itself — into account. This leads to a much more human, and much more achievable picture of rationality: one that includes making mistakes and taking chances.
Q2. How did you get the idea to write a book that merges computational models with human psychology (*)?
Brian Christian: Tom and I have known each other for 12 years at this point, and I think both of us have been thinking about some of these questions our whole lives. My background is in computer science and philosophy, and my first book The Most Human Human uses my experience as a human “confederate” in the Turing test to ask a series of questions about how computer science is changing our sense of what it means to be human. Tom’s background is in psychology and machine learning, and his research focuses on developing mathematical models of human cognition. The idea of using computer science as a means for insights about human decision-making really emerged naturally as a consequence of those interests and inquiries. One night we were having dinner together and discussing our current projects and, long story short, we realized we were each writing the same book in parallel! It was immediately apparent that it should take the form of a single collaborative effort.
Q3. In your book you explore the idea of human algorithm design. What is it?
Brian Christian: The idea is, quite simply, to look for optimal ways of approaching everyday human decision making — and to do that by identifying the underlying computational structure of the problems we face in daily life.
“Optimal stopping” problems can teach us about when to look and when to leap; the explore/exploit tradeoff tells us when to try new things and when to stick with what we know and love; caching tells us how to manage our space; scheduling theory tells us how to manage our time.
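To make the “look, then leap” idea concrete, here is a small simulation of the classical secretary problem, a standard textbook instance of optimal stopping. It is a sketch under stated assumptions rather than anything quoted from the interview: candidates arrive in random order, a rejected candidate cannot be recalled, and the 37% look fraction is the well-known 1/e threshold for that specific problem. The function names and trial counts are hypothetical.

```python
import random


def look_then_leap(candidates, look_fraction=0.37):
    """Skip the first `look_fraction` of candidates, then take the first one
    that beats everything seen so far (or the last one if none does)."""
    cutoff = int(len(candidates) * look_fraction)
    best_seen = max(candidates[:cutoff], default=float("-inf"))
    for value in candidates[cutoff:]:
        if value > best_seen:
            return value
    return candidates[-1]


def success_rate(n=100, trials=2000, seed=0):
    """Estimate how often the rule ends up with the single best candidate."""
    rng = random.Random(seed)
    wins = 0
    for _ in range(trials):
        candidates = list(range(n))  # higher number means better candidate
        rng.shuffle(candidates)
        if look_then_leap(candidates) == n - 1:
            wins += 1
    return wins / trials


if __name__ == "__main__":
    print(f"picked the best candidate in {success_rate():.0%} of trials")
```

Varying `look_fraction` shows the trade-off: leaping too early means committing before a baseline exists, while looking too long means the best candidate has often already gone by.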
Q4. What are the similarities between the workings of computers and the human mind?
Brian Christian: First I’ll turn that question on its head and highlight one of the biggest differences.
To paraphrase the UNM’s Dave Ackley, the imperative of a computer program is “the precisely correct answer, as prompt as possible,” whereas the imperative of animal cognition (including our own) is the reverse: “a prompt response, as correct as possible.”
As computer scientists explore systems with real-time constraints, and problems sufficiently difficult that grinding out the exact solution (no matter how long it takes) simply doesn’t make sense, we are starting to see that distinction begin to narrow.
Q5. Our lives are constrained by limited space and time, limits that give rise to a particular set of problems. Do computers, too, face the same constraints?
Brian Christian: Computers of course have space constraints — there are limits to how much data can be kept in the caches or the RAM, for instance, which gives rise to caching algorithms. These, in turn, offer us ways of thinking about how we manage the limited space in our own lives (in the book we compare some of Martha Stewart’s edicts about home organization to some of the canonical results in caching theory to see which hold water and which don’t).
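As an illustration of what a caching algorithm does with limited space, here is a minimal Least Recently Used (LRU) cache. LRU is one widely used eviction policy; the interview does not name a specific one, so choosing it here is an assumption, as are the class and method names.

```python
from collections import OrderedDict


class LRUCache:
    """Keep at most `capacity` items, evicting the least recently used one."""

    def __init__(self, capacity):
        self.capacity = capacity
        self._items = OrderedDict()  # oldest entry first, newest last

    def get(self, key):
        if key not in self._items:
            return None
        self._items.move_to_end(key)  # touching an item makes it "recent"
        return self._items[key]

    def put(self, key, value):
        if key in self._items:
            self._items.move_to_end(key)
        self._items[key] = value
        if len(self._items) > self.capacity:
            self._items.popitem(last=False)  # drop the least recently used

    def keys(self):
        return list(self._items)


if __name__ == "__main__":
    desk = LRUCache(capacity=2)
    desk.put("tax form", 1)
    desk.put("invoice", 2)
    desk.get("tax form")   # re-reading the tax form keeps it on the desk
    desk.put("letter", 3)  # no room left: the invoice is evicted
    print(desk.keys())
```

The policy amounts to “whatever you touched last stays closest to hand”, which is the kind of rule that can be compared against household organizing advice.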
Many systems, too, operate under time constraints, which offers us a whole other set of insights we can draw from.
For instance, algorithms for high-frequency trading must determine in a matter of microseconds whether to take an offer or to let it go by — do you hold out for a better price, risking that you may never get as good of a deal ever again?
Many human decisions take this form, across a wide range of domains. This type of “optimal stopping” structure underlies everything from buying and selling houses to our romantic lives. Modern operating systems also include what’s known as a “scheduler,” which determines the best way to make use of the CPU’s limited time. There have been a number of high-profile cases of scheduling failures, including the 1997 Mars Pathfinder mission, where the lander made it all the way to the Martian surface successfully, but then appeared to start procrastinating once it got there: whiling away on low-priority tasks while critical work languished. Studying these failures and the methods for avoiding them can in turn give us strategies for making the most of our own limited time.
Q6. Can computer algorithms help us to have better hunches?
Brian Christian: I think so. For instance, we have a chapter on inductive reasoning that focuses on Bayesian inference. There are some lovely rules of thumb that come out of that. For instance, if you need to predict how long something will last — whether it’s how long a romantic relationship will continue, how long a company will exist, or simply how long it will be before the next bus pulls up — the best you can do, absent any familiarity with the domain, is to assume you’re exactly halfway through, and so it will last exactly as long into the future as it’s lasted already. More broadly, one of the upshots of developing an understanding of Bayesian inference is that if you have experience in a domain, your hunches are likely to be quite good. That’s intuitive enough, but the problem comes when your experiences are not a representative sample of reality.
In the modern world, you can get situations where gun violence in reality is in decline, yet the representation of gun violence in the news is going up. For this reason, it’s probably harder to be a good Bayesian than it’s ever been.
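The “assume you’re exactly halfway through” rule of thumb mentioned above can be checked with a short simulation. The sketch assumes that we come across each thing at a uniformly random moment in its lifetime, which is one way to formalize having no familiarity with the domain; the lifespans and function names are hypothetical. Under that assumption, guessing a total of twice the elapsed time overshoots about as often as it undershoots.

```python
import random


def predict_total(elapsed):
    """Halfway rule: guess that the total duration is twice the elapsed time."""
    return 2 * elapsed


def overshoot_fraction(trials=10_000, seed=0):
    """Fraction of trials in which the halfway rule guesses too long."""
    rng = random.Random(seed)
    overshoots = 0
    for _ in range(trials):
        total = rng.uniform(1, 100)      # hypothetical true lifespan
        elapsed = rng.uniform(0, total)  # we arrive at a random moment
        if predict_total(elapsed) > total:
            overshoots += 1
    return overshoots / trials


if __name__ == "__main__":
    print(f"overshoots in {overshoot_fraction():.1%} of trials")
```

The rule stops being the best guess as soon as you know something about typical lifespans in the domain, which is where experience-based hunches come in.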
Q7. What will happen if AI systems become better than humans at most or all cognitive tasks?
Brian Christian: That’s a huge question, and the subject in part of my next book. I think a fundamental restructuring of society is likely to happen — and may be necessary. That’s likely to be a bumpy ride, but it will raise critical and important questions.
And I do think that one of the fascinating outcomes of the progress of AI is that it gives us a new opportunity and new means of understanding the nature of human intelligence — a chance to better know ourselves. That’s a powerful thing, and a good thing.
Brian Christian is the author of The Most Human Human, a Wall Street Journal bestseller, New York Times editors’ choice, and New Yorker favorite book of the year. His writing has appeared in The New Yorker, The Atlantic, Wired, The Wall Street Journal, The Guardian and The Paris Review, as well as scientific journals such as Cognitive Science, and has been translated into eleven languages. He lives in San Francisco.
“What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).”–Amr Awadallah
I have interviewed Amr Awadallah, Chief Technology Officer at Cloudera.
The main topics of the interview are the new developments in the Apache Spark 2.0 Beta and Hadoop 3.0.0-alpha1 releases; the lessons learned from Amr’s experience of using Hadoop at Yahoo!; and the business problems that the world’s leading organisations have.
Q1. Before Cloudera, you served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organisations to use Hadoop for data analysis and business intelligence. What are the main lessons you learned in that period?
Amr Awadallah: A couple of things. First, I learned that Hadoop is capable of solving all the business intelligence problems that I had at Yahoo: (1) our systems weren’t scaling fast enough (we needed to cut down transformation times from hours to minutes), (2) our systems weren’t economical on a $/TB basis, thus making it hard to retain valuable data for longer time periods, and (3) we needed new methods to be able to store and analyze semi-structured (e.g. logs) and unstructured data (e.g. social media).
By implementing Hadoop in our team we saw first hand how it can address all these problems. The second lesson that I learned was that Hadoop, back then, was very rough to deploy and program against (it took us many months to deploy it and reprogram our transformations to run on it). It was these lessons that made it clear that there is room for a startup to focus on Hadoop since (1) it was solving very real data problems that many organizations will face, and (2) it needed a lot of polish to make it work smoothly, securely, and reliably within the enterprise.
Q2. In 2008 you founded Cloudera together with Mike Olson (Oracle), Jeff Hammerbacher (Facebook) and Christophe Bisciglia (Google). What was your main motivation at that time?
Amr Awadallah: Pretty much to do what I describe above: we wanted to make the Hadoop technology easy to use for organizations. That included (1) creating a distribution for Hadoop that bundles all the necessary open-source projects that make it work (we call that CDH, short for Cloudera Distribution for Apache Hadoop), and (2) creating a number of proprietary system management, security, and meta-data management tools around CDH to make it easier for organizations to deploy and operate Hadoop in production.
Q3. What are the typical challenging business problems that the world’s leading organisations have?
Amr Awadallah: The technology we provide is very powerful and can be used to solve many problems across many industries, but we see four common themes: The first is simply using Hadoop as a faster, bigger, cheaper system for business intelligence and data analytics. i.e. a lot of organizations just use us to do things they have been doing already, just doing these things in a more economically scalable way.
The second use case is around deeper understanding of customers, i.e. moving away from segmenting all customers into a number of predefined buckets, but rather creating a dynamic micro-segment addressing each customer in a more precise way (thus reducing false positives).
The third use case is about using data to build better products and services, and this use case is catalyzed by the Internet of Things. Due to smart sensors we are able to measure the real world better than ever before; so this use case is about taking all that data and leveraging it to either enhance our current product/service offerings, or build entirely new ones.
The fourth use case is about reducing business risk, and it manifests itself in a number of different sub-cases depending on the industry. For example, cyber-security is one of the key ways to reduce risk, and we have an open source project co-developed with Intel, called Apache Spot, which organizations can use to collect all their network flow data then use Spark machine learning algorithms to detect the anomalies in that data. Anti-money laundering and fraud detection is another way that our banking customers employ our platform to reduce risk within their businesses. Similarly, our insurance industry customers use our system to detect fraudulent claims, etc.
Q4. Can they be solved by analysing data? Can you give us some examples of how the use of advanced analytics drives business decisions?
Amr Awadallah: Yes, all the problems mentioned above can be solved with data. I want to highlight though that this isn’t necessarily about business decisions, which is what the Business Intelligence movement was about (we just help make that cheaper and faster). What this Big Data movement is about is using data to actually change our businesses in real-time (versus show the business leaders a report that they make a decision based on).
One of my favorite examples is a solution that one of our customers built to give voice to premature babies in neonatal intensive care units. They analyze the signals coming from the baby (sounds, blood pressure, heart rate, temperature, a few brain signals), and based on that a message appears on the monitor above the infant showing the nurse if they are hungry, distressed from too much noise or light, etc.
That is really what we mean by using data to create new products and services that weren’t possible before (and not just reports/dashboard).
Q4. Graphs are important. Is it possible to do scalable graph analytics? If yes, how?
Amr Awadallah: Graphs are indeed important, a lot of our customer use-cases trace back to that (not just for social media analytics, but for example anti-money laundering requires analyzing relationships between many financial accounts for detecting bad behaviors, similarly for cyber security applications). I think scalability depends a fair bit on what’s being analyzed and how scalable we mean by scalable. But for most practical purposes I would say Spark’s GraphX is good enough. For example, you can compute PageRank fairly efficiently and scalably on a cluster using GraphX.
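For readers unfamiliar with PageRank, the computation can be sketched as plain power iteration. This runs on a single machine and is not GraphX code; the edge list, damping factor and iteration count are illustrative assumptions. A distributed engine spreads the same per-edge contributions across a cluster.

```python
def pagerank(edges, damping=0.85, iterations=50):
    """Power-iteration PageRank over a list of (source, target) edges."""
    nodes = sorted({node for edge in edges for node in edge})
    out_links = {node: [] for node in nodes}
    for src, dst in edges:
        out_links[src].append(dst)

    rank = {node: 1.0 / len(nodes) for node in nodes}
    for _ in range(iterations):
        # Every node receives the "random jump" share first.
        new_rank = {node: (1.0 - damping) / len(nodes) for node in nodes}
        for src in nodes:
            # A node without outgoing links spreads its rank over all nodes.
            targets = out_links[src] or nodes
            share = damping * rank[src] / len(targets)
            for dst in targets:
                new_rank[dst] += share
        rank = new_rank
    return rank


if __name__ == "__main__":
    # Hypothetical graph: accounts as nodes, transfers as directed edges.
    transfers = [("a", "b"), ("b", "c"), ("c", "a"), ("d", "a")]
    for node, score in sorted(pagerank(transfers).items()):
        print(node, round(score, 4))
```

Each iteration only needs every edge's source rank and out-degree, which is why the algorithm partitions well across machines.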
Q5. Data security is increasingly important. The risk is due to the growing number of device endpoints. What solutions exist to minimise such risk?
Amr Awadallah: A comprehensive enterprise data security strategy seeks to mitigate the risks presented by a growing number of potentially compromised endpoints connecting to corporate networks. Endpoint security will enable one or all of the following preventative controls:
The first is policy-based enforcement of endpoint security configuration prior to granting an endpoint access to network-based corporate assets. This ensures that any endpoint connected to corporate networks meets minimum requirements for endpoint security configuration.
The second measure is endpoint based anti-malware software (the existence of which may be a policy requirement to connect to the network per the first measure). Anti-malware prevents malicious code from infecting endpoints by monitoring for changes to system configuration and unusual activity or processes.
The third measure is endpoint encryption of corporate data on hard drives, folders and even removable media.
As mentioned above, we also collaborate with Intel on Apache Spot, which tracks network flow patterns to detect anomalous communication behavior between different devices (including end point devices). Apache Spot recently won an InfoWorld 2017 Tech of the Year Award. Other advanced analytics security partners we work closely with are CounterTack, Securonix, Niara, and Jask.
Q6. You recently announced the availability of an Apache Spark 2.0 Beta release for users of the Cloudera platform. How does it work? And how does it differ from the Hadoop-based data platform?
Amr Awadallah: First, at a meta-level, Hadoop (MapReduce specifically) was very good at achieving scalable computation by spreading jobs across many CPU cores and hard disk spindles. That said, MapReduce wasn’t very efficient in how it leveraged memory to optimize the performance of data processing pipelines that have many stages or iterations.
The main power of Spark, which made it take over from MapReduce, was how it truly leveraged memory to achieve better performance in deep or iterative data pipelines. That, coupled with a simpler developer API, made Spark take over very quickly from MapReduce.
Most of our new customer implementations for data processing or data science tend to be in Spark these days, versus MapReduce.
I should clarify, however, that this doesn’t mean that Hadoop is dead as some say. Apache Hadoop is comprised of three key subsystems: (1) MapReduce for computation, (2) YARN for resource scheduling, and (3) HDFS for storage. Spark only replaces MapReduce; we still rely heavily on both YARN and HDFS.
That said, the most notable features in Apache Spark 2.0 are:
1) Dataset API: It is a new API that represents the distributed collections of objects processed by Spark’s execution engine. It is an extension of Spark’s DataFrame API. It improves upon the DataFrame API by providing type-safe, object-oriented programming interfaces. Users can now write User-Defined Functions and Lambda functions that provide compile-time type safety. With the Dataset API, users benefit from optimized operations (like sort, join, hash, etc.) in the SparkSQL engine, while also getting compile-time type safety for user-defined functions.
2) Model & Pipeline Persistence in Spark’s ML library: Machine learning Pipelines built with Spark’s ML library can now be serialized to a file and read back in.
The ability to save and reload these pipelines makes it easy for users to perform version control on the pipelines and safely distribute the pipelines. This helps in operationalizing them in production systems.
3) Structured Streaming: New stream processing API and engine that provides SQL like abstractions for authoring operations on data streams, and also improves performance by using the SparkSQL engine for processing the data streams. However, this is still an experimental API and not ready for production usage yet.
Besides the above 3 notable enhancements, there are a bunch of performance and scalability improvements across the board.
Q7. Apache Impala vs. Amazon Redshift: How Does Redshift Compare to Impala?
Amr Awadallah: Apache Impala is an analytic database engine architecturally designed to perform high-performance highly-concurrent SQL analytics on scalable, open data platforms like Hadoop’s HDFS and Amazon S3.
Impala decouples data storage from compute and lets users query data without having to move/load data specifically into an Impala storage-engine (it doesn’t have one). This architectural difference uniquely enables Impala to deliver a more flexible Business Intelligence experience than traditional database architectures like Redshift (which requires pre-loading the data).
Some of the key benefits of the Impala approach include:
* On-demand resources that are immediately ready to query existing S3 data without loading to a different data silo
* Ability to elastically grow/shrink clusters as needed due to decoupled storage and compute
* More predictable, multi-tenant isolation due to the ability to have multiple Impala clusters sharing a common S3 data repository
* Ability to share common data not only amongst Impala clusters, but also with any application that runs on cloud-native S3 storage. For example, you can have both Apache Impala and Apache Spark run against the same data asset in S3, whereas Apache Spark cannot easily access data stored in Redshift because it has to go through SQL first.
* Greater flexibility to explore new use cases, analytics, and data by directly querying S3 without rigid traditional data models and ETL
Not only does Impala deliver this additional flexibility, it does so at greater cost-performance and scalability compared to Redshift. See the following benchmark for data on that.
That said, Redshift’s sweet spot is a different target: smaller data marts. Most Redshift installations are in the dozen-node range, where Redshift’s limitations in scalability, elasticity, and flexibility, and its requirement to maintain separate copies of data, are less critical.
Q8. What is Apache Kudu, and why is it relevant for Impala Users?
Amr Awadallah: Historically we had two storage engines in our distribution: (1) HDFS which is optimized for high-throughput analytics, but doesn’t support updates/inserts and (2) HBase which is optimized for low-latency updates/inserts but isn’t good for doing high-throughput queries. To build a proper data warehouse or time-series analytics system, you typically still need to make updates/inserts and that was why we created Apache Kudu.
Kudu is a new storage system that combines the benefits of both HDFS and HBase into one: it allows for low-latency updates/inserts, but also supports high-throughput analytical queries (i.e. fast analytics on fast moving data).
Unlike HDFS, Kudu is not a file system; it is a record-based system, so the unit of storage is a record as opposed to a file. This allows Kudu to unlock Impala for real-time streaming applications that were not possible with HDFS.
In HDFS, the data would only be visible to Impala after we finish closing the file, which typically happens after a large number of records have accumulated (this adds latency between when records are written and when they become visible to the analytical engine). With Kudu, as soon as a record is written it is immediately visible to the Impala analytical engine. Finally, just like HDFS and HBase, the Kudu storage engine is fully integrated with our entire stack, not just Impala.
For example, you can also use Apache Spark for machine-learning jobs directly against Kudu.
Q9. The Apache Hadoop project recently announced its 3.0.0-alpha1 release. What is it?
Amr Awadallah: HDFS Erasure Coding is really the main exciting new feature in Hadoop 3. Traditionally, HDFS required three replicas by default for every data block to achieve durability, concurrent performance, and availability. Using erasure coding techniques, HDFS in Hadoop 3 allows us to significantly reduce the storage overhead from 3x replication (i.e. 200% extra) to just 20% extra bits for parity. This will allow us to achieve the same durability benefits as 3x replication, but comes at the cost of potentially lower concurrent performance (when more than one job is trying to access the same block at the same time) and lower availability resilience in the face of top-of-rack switch failures (less of an issue these days).
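The overhead arithmetic in this answer can be sketched in a few lines. This is a minimal illustration and not Hadoop code: the function names are hypothetical, and the (data, parity) layouts are assumed for the example. A layout of 10 data units plus 2 parity units yields the 20% figure quoted above; layouts with more parity units cost more space but tolerate more simultaneous failures.

```python
def replication_overhead(replicas: int) -> float:
    """Replication stores (replicas - 1) extra full copies of each block."""
    if replicas < 1:
        raise ValueError("replicas must be at least 1")
    return float(replicas - 1)


def erasure_coding_overhead(data_units: int, parity_units: int) -> float:
    """Extra storage for parity, as a fraction of the raw data size."""
    if data_units <= 0 or parity_units < 0:
        raise ValueError("data_units must be positive, parity_units non-negative")
    return parity_units / data_units


if __name__ == "__main__":
    print(f"3x replication: {replication_overhead(3):.0%} extra storage")
    # Illustrative layouts only; the layout a cluster actually uses is a
    # configuration choice, and these pairs are assumptions for the example.
    for data_units, parity_units in [(10, 2), (10, 4), (6, 3)]:
        extra = erasure_coding_overhead(data_units, parity_units)
        print(f"{data_units} data + {parity_units} parity: {extra:.0%} extra storage")
```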
Other cool additions are ATS v2 and classpath isolation which you can read more about here
Q10. What is the roadmap ahead for Cloudera Enterprise?
Amr Awadallah: We don’t discuss details of our product roadmap publicly, but there are three guiding themes for us in 2017: The first theme is fast analytics on fast-moving data (which I covered above with regard to Kudu).
The second theme is cloud, which means making Cloudera Enterprise work better in cloud environments, and making it easier to move workloads (and skill sets) from on-premise clusters to transient cloud clusters in AWS, Azure, and/or Google Cloud.
The third theme is simplifying data-science and machine learning development, especially reducing the time from when a new algorithm is developed to how it can be deployed into production (stay tuned for more on that front).
Amr Awadallah, Ph.D. Chief Technology Officer, Cloudera
Before co-founding Cloudera in 2008, Amr (@awadallah) was an Entrepreneur-in-Residence at Accel Partners. Prior to joining Accel he served as Vice President of Product Intelligence Engineering at Yahoo!, and ran one of the very first organizations to use Hadoop for data analysis and business intelligence. Amr joined Yahoo after they acquired his first startup, VivaSmart, in July of 2000. Amr holds Bachelor’s and Master’s degrees in Electrical Engineering from Cairo University, Egypt, and a Doctorate in Electrical Engineering from Stanford University.
Follow us on Twitter: @odbmsorg
“Digital labor is the name for a new class of tools that can automate routine cognitive tasks. The benefits of automation are similar to previous waves. Many years ago I helped automate a reconciliation function for a large asset manager. Humans took authorization reports from their investment control system and matched them against the confirmations coming from their counterparts. This was a terrible job, and luckily no one does this anymore.
Digital labor has the potential to improve the financial services sector by improving compliance, providing more analytics for risk and control functions, and improving efficiency.”–Michael Henry
I have interviewed Michael Henry, Principal at KPMG LLP. In the interview we covered the challenges faced by financial institutions due to existing regulatory standards, KPMG’s solution to automate the onboarding process for their clients, and the potential impact of digital labor on the financial services sector.
Q1. The Organisation for Economic Co-operation and Development (OECD) proposed a Common Reporting Standard (CRS) for the Automatic Exchange of Information (AEOI) that implies a significant increase in the customer due diligence and reporting obligations of financial institutions across the world. What is the implication for your clients?
Michael Henry: The new reporting requirement will require financial institutions to collect and examine more information about their clients for the purposes of tax withholding and reporting. Banks and other regulated institutions will have to examine information from their clients to make sure they are reporting their true residence for tax purposes. This is similar to the US Internal Revenue Service’s FATCA requirements. And like FATCA, many banks will respond by asking for more documentation from their clients and adding staff to perform due diligence on that documentation.
Q2. Specifically, what is “client onboarding”? How is it normally implemented by large financial institutions?
Michael Henry: Client onboarding refers to the series of processes that a financial institution undergoes to determine whether or not it should move forward with conducting or renewing business with a given customer.
The term is inclusive of the underlying regulatory and compliance practices governed by anti-money laundering (AML) and know-your-customer (KYC) rules.
Many large financial institutions deploy thousands of staff, often in low cost offshore locations to perform this function. These staff are usually equipped with basic workflow and data management technology. At Tier 1 organizations this can cost hundreds of millions of dollars annually while pinning their reputations on the shoulders of junior resources making subjective compliance policy interpretations.
For this basic client identification and validation process, one of our clients employs thousands of people in an offshore location. Because this work is boring and repetitive, the client tells us that the attrition rate is more than 10% per month. This presents an enormous risk to the business, as banks entrust their client experience, business results, and reputations to cheap clerical labor that likely joined the bank only a few months ago.
Q3. What are the typical problems?
Michael Henry: The bank must collect information to identify the client and determine the risk that the client will engage in some kind of unlawful activity. To perform this function, the bank must process a large volume of data that enters the bank electronically or through documents. Reading and interpreting documents and trying to apply complex compliance rules using manual processes is time-consuming, error-prone, and expensive.
Technology – Workflow, case management, relational databases, and imaging technologies, while mature and effective, still require human beings to read, transcribe, and interpret data.
Inconsistency – Human operators interpret complex decision-trees of rules. The risk of subjectivity grows with the size of the operation.
Accuracy – The majority of today’s onboarding representatives execute what amount to “stare and compare” and “stare, copy and enter” processes. Over the course of a business day in which hundreds of pages or documents will be read and thousands of keystrokes completed, it is inevitable that operator errors will occur.
Q4. You have worked on a solution as a service to automate the onboarding process for your clients. Can you explain in a nutshell how you did it?
Michael Henry: The solution consists of multiple digital labor components that let machines, instead of people, read documents and apply policy rules.
Humans focus on exceptions, i.e., cases which really require human judgment. Because the exception rates are low, much of the activity becomes straight-through.
The technology uses a combination of robotics, big data, and natural language processing integrated for the solution of KYC, AML, Tax classification, and other compliance activities.
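The division of work described here, where machines apply policy rules and humans see only the exceptions, can be sketched as a small rule router. This is a minimal illustration and not KPMG’s implementation: the field names, the rules, and the `route` function are all hypothetical.

```python
from typing import Callable, Dict, List, Optional, Tuple

Record = Dict[str, str]
# A rule returns an error message when a record fails it, or None when it passes.
Rule = Callable[[Record], Optional[str]]


def require_field(name: str) -> Rule:
    """Build a rule that fails when a field is missing or blank."""
    def rule(record: Record) -> Optional[str]:
        return None if record.get(name, "").strip() else f"missing {name}"
    return rule


def residence_matches_address(record: Record) -> Optional[str]:
    """Hypothetical policy: declared tax residence must match the address country."""
    if record.get("tax_residence") != record.get("address_country"):
        return "tax residence differs from address country"
    return None


RULES: List[Rule] = [
    require_field("legal_name"),
    require_field("tax_residence"),
    residence_matches_address,
]


def route(records: List[Record]) -> Tuple[List[Record], List[Tuple[Record, List[str]]]]:
    """Split records into straight-through cases and exceptions for human review."""
    straight_through: List[Record] = []
    exceptions: List[Tuple[Record, List[str]]] = []
    for record in records:
        failures = [msg for msg in (rule(record) for rule in RULES) if msg]
        if failures:
            exceptions.append((record, failures))  # a person reviews these
        else:
            straight_through.append(record)        # no human touch needed
    return straight_through, exceptions
```

Every record is checked against the same rule list, which is the consistency argument made later in the interview; only the records with at least one failure reach a human queue.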
Q5. How difficult was it to integrate domain knowledge into advanced technology?
Michael Henry: Domain knowledge is critical. KPMG invested significant regulatory and compliance expertise to reinvent this process for ourselves and our clients. The technology only works because of this investment.
We use advanced technology, but it is all commercially available. Our ability to define specific ontologies and compliance rules on that technology is the differentiator.
Q6. How do you capture information from SEC filings, blog entries, social media, text messages and other sources of structured and unstructured data without manual intervention?
Michael Henry: We capture information from structured and unstructured sources through a combination of technologies. Optical character recognition (OCR) and natural language processing (NLP) software drive our content enrichment process. This allows our platform to ingest unstructured documents (with or without metadata), identify them, and then extract the relevant content according to our ontological models. Some exception processing occurs at this stage, especially if the quality of the documentation is poor.
Q7. How do you integrate, organize and mine customer data?
Michael Henry: Customer data are ingested into the platform through system extracts, connections to document repositories, and secure FTP sites. These data then pass through our content enrichment engine and ultimately reside in our MarkLogic NoSQL database.
Q8. Why did you choose MarkLogic’s Enterprise NoSQL database?
Michael Henry: First, we are solving mission-critical problems for the world’s leading financial institutions. We needed to have an institutional-grade, enterprise-hardened database at the core of our platform.
Second, given the size of the data sets involved, we needed to have a highly scalable database that could handle petabytes of data while simultaneously staging and orchestrating multiple run-time sequences. Finally, we found MarkLogic very aligned to our vision and a good partner in bringing the solution to market.
Q9. How do you use semantics, text analytics and visualisation?
Michael Henry: Semantic analysis allows us to handle unstructured data in natural language formats. Extracting the list of beneficial owners from a 100-page trust document can take a human hours. The tools are so proficient now that, with the right ontological models, we can obtain dozens of data points from an unstructured document at high volumes with little human intervention. We have been able to ingest hundreds of individual loan documents and produce a data hierarchy by client, by loan, and by event.
Q10. What results did you obtain so far? What is the order of magnitude reduction in human efforts you obtained? As human involvement in the process declines, is the number of errors in reports also declining?
Michael Henry: Today, we serve more than 20 clients. In the tax compliance area, a human may spend more than an hour ingesting a W8 form and conducting due diligence. Most of this is reading KYC documents. Our platform has the ability to handle more than 10 of these per hour per human exception handler. If the task involves humans reading documents and applying validation or other policies, and the rate of actual exceptions is low, we can take 80-90% of the manual effort out. And the tools keep getting better.
More important than the productivity gain is the consistency and accuracy of the automation. No human operator can apply thousands of policy rules consistently. We continue to tune our models, and the machine never forgets.
Q11. In your opinion, what is the impact of the introduction of “Digital Labor” services on the job market and on society at large?
Michael Henry: Digital labor is the name for a new class of tools that can automate routine cognitive tasks. The benefits of automation are similar to previous waves. Many years ago I helped automate a reconciliation function for a large asset manager. Humans took authorization reports from their investment control system and matched them against the confirmations coming from their counterparts. This was a terrible job, and luckily no one does this anymore.
Digital labor has the potential to improve the financial services sector by improving compliance, providing more analytics for risk and control functions, and improving efficiency.
Michael Henry, Principal, Financial Services, KPMG LLP
Michael is a Principal in KPMG’s Digital Labor practice with more than 25 years’ experience in financial services. Michael specializes in the application of sophisticated technologies (big data, natural language processing, artificial intelligence, machine learning, workflow and robotics) to automate compliance processes. Michael has worked with global and regional banks, and his experience includes living and working in Europe and Asia.
– FATCA Onboarding & Compliance Solution. KPMG, 2015 (LINK to .PDF)
“While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications, particularly those that are highly interactive, need to bring a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run-time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.” –Ofer Bengal.
I have interviewed Ofer Bengal, Co-Founder and CEO of Redis Labs, and Yiftach Shoolman, Co-Founder and CTO of Redis Labs.
Main topics of the interview are: how the database market is evolving, proprietary vs. open source software, in-memory/key-value data stores, and the new features of Redis.
Q1. How do you see the database market evolving?
Ofer Bengal, Yiftach Shoolman: The main trends we identify today and believe will continue in upcoming years are:
1) Non-relational databases will continue to see growing adoption, because the schema framework is ineffective when it comes to unstructured data, change in data patterns, growing data volumes, more stringent performance requirements and the way modern apps are built.
2) Multiple database models, as opposed to the absolute dominance of RDBMS in the past few decades, each model solving the requirements of certain use cases.
Moreover, certain modern databases can run several database models (document, graph, etc.)
3) Multiple databases (different types or the same type) serving the same app. Modern applications are based on microservice architecture, in which each microservice works with the best database for its use case.
This creates new challenges for modern databases: (a) Instant provisioning – sometimes hundreds or thousands of databases are provisioned within a second, and (b) Multi-tenancy, otherwise the cost associated with managing database infrastructure becomes extremely high.
4) Database-as-a-service is growing vs. self-deployed and self-operated databases. With enterprises gradually moving to the cloud and having to deal with multiple types of databases, it makes a lot of sense to outsource deployment and ongoing operations rather than building an in-house practice of DBAs and DevOps.
5) Hybrid transactional and analytical processing (HTAP). Driven by the need for application analytics to drive business decision making in real time, certain modern databases can handle those two different workloads simultaneously, eliminating the need for exporting transactional data to a separate dedicated analytical database.
Q2. Proprietary vs. open source software: what are the pros and cons?
Ofer Bengal, Yiftach Shoolman: From the community perspective, open source is great. If there is a vibrant community, it drives innovation, problem solving and the resolution of compatibility issues with different environments.
From the user’s perspective, open source is “open”, accessible, can be used by anyone, transparent, and free of charge.
It often comes with less of a danger of vendor lock-in. It is very suitable for independent developers and startups. However enterprises using open source products may have certain challenges:
1. The product is not always suitable for enterprise workloads, especially when it comes to databases. Capabilities like infinite seamless scaling, high-availability with instant failover and stable performance at scale are not always the open source developer’s top priority.
2. Commercial support must be obtained and this typically comes with a price tag which is not much different than acquiring a commercial database product.
3. Commercial support is typically provided by a single company (most probably founded by the open source creators), which creates “vendor lock-in” by itself.
4. In the case of databases, using database-as-a-service may turn out to be lower in cost compared to provisioning cloud instances and running zero-cost open source software on them, because a commercial service can be based on an efficient multi-tenant architecture.
Q3. What is the current market for in-memory, key-value data stores?
Ofer Bengal: In-memory key-value data stores (sometimes called in-memory data grids (IMDGs)) have been around for more than a decade and have proven capable of supporting digital business needs for responsive, always-on user experience; real-time, actionable insights; and dynamic scaling. They are widely employed when you want to scale/modernize legacy applications without spending additional money on extremely expensive RDBMS licenses and hardware. This is achieved by providing a scalable and reliable in-memory datastore that enables low-latency transactional and analytical processing.
While modernizing legacy applications used to be a key reason for deploying in-memory, key-value data stores, we see that this is changing. New applications, particularly those that are highly interactive, need to bring a user experience that is very responsive under all conditions. For such new applications, an in-memory datastore, particularly one that can simplify run-time analytics like counting, scoring, managing lists and sets, is becoming a key ingredient for low latency responses and high throughput.
From a Redis perspective, our innovation in data structures brings about the ability to simplify development to the extent that now most Redis users use it as a first responder and primary datastore for substantial pieces of their data. Furthermore, with Redis’ data structures, users can run operational and analytical use cases on the same database.
In addition, acceleration of other in-memory platforms like Spark is possible with Redis.
Gartner estimates that, in 2015, the stand-alone IMDG market was worth approximately $600 million, having grown by about 30% from the previous year. Gartner expects the market to continue to grow in the double-digit range through 2020 and to exceed $1 billion by 2018. Redis, one of the leaders in this space, grew in just a few years to be one of the most popular databases used by developers and enterprises.
Q4. Amazon ElastiCache supports two open-source in-memory engines: Redis and Memcached. What does it mean in practice?
Yiftach Shoolman: In practice, Amazon ElastiCache is a simple caching service that simplifies a developer experience by providing these two open source in-memory engines. Legacy applications that use simple cache can use ElastiCache seamlessly.
However, ElastiCache is single-tenant, limited to caching use cases and cannot be used as a database, lacking enterprise-grade functionalities such as infinite seamless scalability, instant failover and predictable performance.
The Redis Labs equivalent service, called Redis Cloud, provides all the benefits of an enterprise-class Redis.
Q5. What are the pros and cons of Memcached and Redis?
Yiftach Shoolman: Redis can be thought of as a modern database, while Memcached is older technology designed specifically for ephemeral caching.
The most important difference is in persistence and high availability (HA): Memcached is neither persistent nor highly available, while Redis can operate as a full-fledged in-memory database, highly available through both in-memory replication and data persistence. This reflects the fact that caches in older architectures were not required to be highly available, but in modern architectures, built for scale and volume, cache outages can significantly impact the business and user experience.
Redis, the newer and more versatile technology, allows individual data elements to be manipulated, while Memcached often incurs serialization/deserialization overheads that make the entire application processing much slower. This is because Memcached can handle only simple key-value use cases, whereas Redis offers many more data structures (hashes, sets, sorted sets, lists, HyperLogLog, etc.) that simplify complex data processing, analysis and operational use cases.
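The difference between opaque values and server-side data structures can be sketched with two toy in-process stores. Neither class is a real client library: `BlobCache` and `HashStore` are hypothetical stand-ins, and `hincrby` is only modeled loosely on the Redis command of the same name.

```python
import json
from typing import Dict


class BlobCache:
    """Toy opaque-value cache: values are byte strings the store cannot inspect."""

    def __init__(self) -> None:
        self._data: Dict[str, bytes] = {}
        self.bytes_transferred = 0  # bytes a client would send or receive

    def get(self, key: str) -> bytes:
        value = self._data[key]
        self.bytes_transferred += len(value)
        return value

    def set(self, key: str, value: bytes) -> None:
        self.bytes_transferred += len(value)
        self._data[key] = value


class HashStore:
    """Toy structured store: each key holds a field -> integer mapping."""

    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, int]] = {}

    def hincrby(self, key: str, field: str, amount: int) -> int:
        """Increment one field in place and return its new value."""
        fields = self._data.setdefault(key, {})
        fields[field] = fields.get(field, 0) + amount
        return fields[field]


def bump_with_blob(cache: BlobCache, key: str, field: str) -> int:
    """Read-modify-write: fetch the whole object, change one field, store it all."""
    obj = json.loads(cache.get(key))                  # deserialize everything
    obj[field] = obj.get(field, 0) + 1
    cache.set(key, json.dumps(obj).encode("utf-8"))   # serialize everything
    return obj[field]
```

With the blob cache, incrementing one counter moves the entire serialized object in both directions; with the structured store, only the key, field name and increment need to travel.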
Even when used as a cache, Redis has more sophisticated eviction policies, which can be either active or passive, while Memcached has only a simple LRU and lazy eviction.
Redis and Memcached are both very popular open source projects, but given its richer functionality, more advanced design, many potential uses, and greater cost efficiency at scale, Redis should be your first choice in nearly every case.
Q6. For very large data sets or analytics workloads, running everything in-memory might not be cost effective. What is your take on this?
Ofer Bengal, Yiftach Shoolman: For very large data sets or analytics workloads, it is advantageous to utilize alternative memory technologies (such as Flash memory, which is a tenth of the cost) as extensions of memory rather than impose a disk access penalty. We have extended enterprise Redis in this manner to take advantage of Flash memory, while using a tiered approach (keys and hot values are still in the fastest memory, while cold values are in “slower” Flash memory) to ensure that you still see sub-millisecond latencies with millions of ops/sec throughput.
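The tiering idea can be sketched as a toy two-level store. This is an illustration under stated assumptions and not the Redis Labs design: the `TieredStore` class is hypothetical, two in-process dictionaries stand in for RAM and Flash, and for brevity whole entries are demoted, whereas the answer above describes keeping all keys in RAM and moving only cold values.

```python
from collections import OrderedDict
from typing import Dict, Optional


class TieredStore:
    """Toy two-tier store: a small fast tier backed by a larger slow tier."""

    def __init__(self, fast_capacity: int) -> None:
        if fast_capacity < 1:
            raise ValueError("fast_capacity must be at least 1")
        self.fast_capacity = fast_capacity
        # Ordered from coldest (first) to hottest (last); stands in for RAM.
        self._fast: "OrderedDict[str, bytes]" = OrderedDict()
        # Stands in for the slower, cheaper Flash tier.
        self._slow: Dict[str, bytes] = {}

    def put(self, key: str, value: bytes) -> None:
        self._slow.pop(key, None)      # drop any stale cold copy
        self._fast[key] = value
        self._fast.move_to_end(key)    # freshly written means hot
        self._demote_if_needed()

    def get(self, key: str) -> Optional[bytes]:
        if key in self._fast:
            self._fast.move_to_end(key)
            return self._fast[key]
        if key in self._slow:
            value = self._slow.pop(key)  # promote on access
            self._fast[key] = value
            self._demote_if_needed()
            return value
        return None

    def tier_of(self, key: str) -> Optional[str]:
        if key in self._fast:
            return "ram"
        if key in self._slow:
            return "flash"
        return None

    def _demote_if_needed(self) -> None:
        # Move the coldest entries down until the fast tier fits its budget.
        while len(self._fast) > self.fast_capacity:
            cold_key, cold_value = self._fast.popitem(last=False)
            self._slow[cold_key] = cold_value
```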
Q7. Redis was created by Salvatore Sanfilippo in 2009. What is his role today?
Ofer Bengal: Salvatore is leading the development of open source Redis within Redis Labs. He works with a group of experienced developers on extending the capabilities of Redis. A good example of this collaborative work is the recent introduction of Redis Modules, which extend Redis to a variety of new modern use cases. Salvatore wrote the API, and the other team members created and tested a few modules in a very short time, such as RediSearch (a full-text search engine) and Redis-ML (enhancing the performance of Spark machine learning capabilities). Salvatore’s role is to continue the community innovation around the Redis core, together with his team of Redis Labs developers.
Q8. What are the differences between Redis Labs’ version of Redis and the original one developed in 2009?
Yiftach Shoolman: Redis Labs fully supports the open source Redis versions, but enhances them with a container-like layer that adds a proxy, cluster management and a shared-nothing architecture. Taken together, Redis Labs provides a solid enterprise foundation to Redis, allowing it to scale seamlessly in memory across many hundreds of servers with high availability through persistence, in-memory cross-rack/zone/region/datacenter replication and instant automatic failover. No retooling or re-architecting is required to move from open source Redis to enterprise Redis; the process is basically effortless and immediate. Redis Labs also offers various database modules, such as RediSearch, multiple probabilistic modules like Bloom Filter, TopK and CMS, Redis-ML for machine learning, Redis-TS for time series processing, and JSON and Graph support.
Q9. What are the possible scenarios of using Redis for data analytics?
Ofer Bengal, Yiftach Shoolman: Redis data structures come with built-in simple analytic operations like counting, ranking, scoring, ranges and more. Over time, probabilistic data structures have added the ability to analytically estimate millions and trillions of events, without requiring memory to store all of the events.
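How a probabilistic structure can estimate a count without storing every event can be sketched with linear counting over a fixed-size bitmap. Note the substitution: Redis itself offers HyperLogLog for this purpose, and linear counting is a simpler technique swapped in here only to show the idea. The `LinearCounter` class and its default size are assumptions for the example.

```python
import hashlib
import math


class LinearCounter:
    """Estimate the number of distinct items using a fixed-size bitmap."""

    def __init__(self, num_bits: int = 1 << 16) -> None:
        if num_bits < 1:
            raise ValueError("num_bits must be positive")
        self.num_bits = num_bits
        self._bits = bytearray((num_bits + 7) // 8)  # memory use is fixed

    def add(self, item: str) -> None:
        # Hash the item to one bit position; duplicates hit the same bit.
        digest = hashlib.sha256(item.encode("utf-8")).digest()
        index = int.from_bytes(digest[:8], "big") % self.num_bits
        self._bits[index // 8] |= 1 << (index % 8)

    def estimate(self) -> float:
        set_bits = sum(bin(byte).count("1") for byte in self._bits)
        zero_bits = self.num_bits - set_bits
        if zero_bits == 0:
            raise OverflowError("bitmap is saturated; use more bits")
        # Linear counting estimator: n is approximately -m * ln(zero fraction).
        return -self.num_bits * math.log(zero_bits / self.num_bits)
```

With the default 65,536 bits the counter occupies 8 KB regardless of how many events are added, at the price of an approximate answer that degrades as the bitmap fills up.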
Set operations have made it possible to simplify comparisons, intersections and unions of sets, analytics that are usually complicated with other data stores. RQL (Redis SQL) and secondary indexing allow executing complex SQL queries on an existing Redis database. And finally, recent modules like RediSearch, Neural Redis and Redis-ML have added advanced search and machine learning capabilities not natively found in any other databases.
With all of these possibilities, and with the move to automated decision making, we see increasing usage of Redis for data analytics scenarios.
Q10. How safe is a Redis server?
Yiftach Shoolman: The Redis enterprise server comes with client-based SSL authentication, built-in cloud firewall support (when running on public clouds), password authentication and role-based authorization that enables customizing security levels.
Qx. Anything else you wish to add?
Ofer Bengal: Redis is a game-changer when it comes to databases, and its progression over the last seven years has demonstrated that the industry and market are demanding performance and increasing flexibility to deal with all types of data processing, storage and analytic scenarios. Redis’ core values have always included high performance, high throughput and very low latencies. With the visionary addition of modules, the community has turned it into an all-purpose datastore, suitable for any scenario that needs a database.
Ofer Bengal – Co-Founder and CEO of Redis Labs
Ofer is a serial entrepreneur who has founded and led several companies in the areas of data communications, telecommunications, Internet, homeland security and medical devices. Ofer was founder & CEO of RIT Technologies (NASDAQ: RITT), a provider of sophisticated telecommunications and data communications systems to major world carriers. He began his career as an aerospace engineer in the Israeli Air Force and then built his own aerospace engineering consulting firm. As a hobby, he has also invented, developed and licensed toy concepts to companies such as Milton Bradley, Hasbro and Tomy. Ofer holds a Bachelor of Science (cum laude) in aerospace engineering from the Technion, Israel Institute of Technology.
Yiftach Shoolman – Co-Founder and CTO of Redis Labs
Yiftach is an experienced technologist, having held leadership engineering and product roles in diverse fields from application acceleration, cloud computing and software-as-a-service (SaaS), to broadband networks and metro networks. He was the founder, president and CTO of Crescendo Networks (acquired by F5, NASDAQ:FFIV), the vice president of software development at Native Networks (acquired by Alcatel, NASDAQ: ALU) and part of the founding team at ECI Telecom broadband division, where he served as vice president of software engineering. Yiftach holds a Bachelor of Science in Mathematics and Computer Science and has completed studies for Master of Science in Computer Science at Tel-Aviv University.
“New regulations such as MIFID II indeed aim at increasing transparency, which in turn requires more precise reporting. These reports require a lot of data to be stored and data capture to be ultra accurate.”– Michael Hay and Oskar Mencer.
Hitachi Data Systems and Maxeler Technologies announced a cooperation around a High-performance Compliance Capture and Analytics Solution for Financial Institutions. I have interviewed Michael Hay, VP & Chief Engineer, Hitachi Data Systems, and Oskar Mencer, CEO and CTO, Maxeler Technologies Inc.
Q1. What is Multiscale Dataflow Computing?
O. Mencer: Generally, Multiscale Dataflow Computing is a computing paradigm aimed at optimizing operational efficiency of computing by computing data as it is moving through a system. We use Dataflow to minimize the sum of all distances that the data has to travel. We overlay Dataflow with a Multiscale approach of vertically optimizing the algorithm, the architecture and arithmetic.
Q2. There is an emerging EU Financial Services directive called MIFID II. This EU directive, and its associated regulation, was designed to help the regulators better handle High Frequency Trading (HFT) and so-called Dark Pools, in other words, to increase transparency in the markets. What are the technological demands posed by this new financial legislation and these compliance regulations?
M. Hay, O. Mencer: New regulations such as MIFID II indeed aim at increasing transparency, which in turn requires more precise reporting. These reports require a lot of data to be stored and data capture to be ultra accurate. It is an ideal environment for Hitachi data solutions to be combined with Maxeler’s low latency capability.
Q3. To address these challenges, Maxeler Technologies Inc. announced a collaboration with Hitachi Data Systems to offer a high-performance compliance capture and analytics solution. Can you please explain what this solution is about?
M. Hay, O. Mencer: We are combining programmable low latency compute with high capacity “Dataflow-like storage” and modern analytics software. This allows us to attack even the toughest customer challenges and provide competitive advantage within modest development time.
Q4. How can this solution help financial institutions achieve high-frequency, transaction-related record keeping mandated in European Union MiFID II and US Dodd-Frank regulations?
M. Hay: Hitachi’s Data Lake solutions can help to unify the wide range of regulatory data challenges faced by today’s financial institutions. With high end filtering and analytics capability added to the system, we can address regulation but also integration and security issues all within a single system.
Q5. In this cooperation, you have accomplished an operational prototype through the use of Maxeler’s DFE (Data Flow Engine) network cards, Dataflow based capture/decode capability executing on Dataflow hardware, a hardware accelerated NFS client, Hitachi’s CB500, Pentaho, and Hitachi Unified Storage (HUS). Can you explain how this architecture works?
M. Hay, O. Mencer: Our architecture accomplishes tight integration between realtime on-the-wire compute and storage. The realtime computing ability and reliability of the storage ensure that no data is lost and reports can be generated on time and on budget.
Q6. With your Multiscale Dataflow technology data is streamed from memory onto a chip where the data moves directly from one functional unit to another, without being written to off-chip memory until the entire process is complete. What is the advantage of this solution with respect to a classical ETL process?
O. Mencer: In a classical ETL process the database is in the critical loop. With the Multiscale Dataflow approach we remove the database from the critical loop and utilize an in-memory copy of the data for ultrafast access and in-memory analytics.
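As a rough software analogy only (this is not Maxeler’s hardware or toolchain), the idea of data moving directly from one functional unit to the next, without an intermediate store in the loop, can be sketched with Python generators. The record layout, function names and sample feed below are hypothetical and chosen purely for illustration.

```python
from typing import Dict, Iterable, Iterator

def decode(packets: Iterable[str]) -> Iterator[dict]:
    """Stage 1: parse each raw record as soon as it arrives."""
    for raw in packets:
        symbol, price, qty = raw.split(",")
        yield {"symbol": symbol, "price": float(price), "qty": int(qty)}

def keep_large(trades: Iterable[dict], min_qty: int) -> Iterator[dict]:
    """Stage 2: filter records while they are still in flight."""
    for trade in trades:
        if trade["qty"] >= min_qty:
            yield trade

def notional_by_symbol(trades: Iterable[dict]) -> Dict[str, float]:
    """Stage 3: aggregate in memory; nothing is written to a database."""
    totals: Dict[str, float] = {}
    for trade in trades:
        value = trade["price"] * trade["qty"]
        totals[trade["symbol"]] = totals.get(trade["symbol"], 0.0) + value
    return totals

# Hypothetical feed: "symbol,price,quantity"
raw_feed = ["ABC,10.0,100", "XYZ,20.0,5", "ABC,11.0,200"]

# Each record flows through all three stages one at a time; no stage
# materializes its full output before the next stage starts.
totals = notional_by_symbol(keep_large(decode(raw_feed), min_qty=50))
print(totals)
```

In a staged ETL job, each of these steps would typically write its full result to a table before the next step reads it back; here the intermediate results never leave memory.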
Q7. The overall system from packet capture to NFS write does not use a single server side CPU cycle. What does it mean in practice?
O. Mencer: We use a special substrate to create a dataflow computer by connecting vast numbers of arithmetic units, and implement networking state machines right down on the hardware level. This means that the packet flow through the system is in a tight hardware loop and only metadata travels through conventional CPUs. Additionally, on the storage side Hitachi’s Unified Storage also uses Dataflow-like structures to implement a full set of Network File Serving, a Filesystem and smart object caching for file system object I/O. In this way usage of general CPU cycles is further minimized.
The impact for customers is decreased space needed for the solution coupled with significant performance improvements.
Q8. You claim that dataflow computing can accelerate and run different applications orders of magnitude faster than conventional CPUs. Do you have any benchmarking results to share?
O. Mencer: Benchmarks are not applications and there is no claim that we can accelerate tiny benchmarks.
Our technology enables complete applications with a purpose in the real world to run orders of magnitude faster. For example, in 2011 a Tier 1 investment bank won the American Finance Technology Award for their installation of a machine from Maxeler, which reduced the time to calculate risk from 8 hours down to 2 minutes.
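To put the quoted figures in perspective, the speedup implied by going from 8 hours to 2 minutes can be worked out directly. The short calculation below uses only the two numbers given in the answer; the variable names are illustrative.

```python
import math

before_minutes = 8 * 60   # 8 hours expressed in minutes
after_minutes = 2

speedup = before_minutes / after_minutes     # ratio of old to new runtime
orders_of_magnitude = math.log10(speedup)    # powers of ten in that ratio

print(f"speedup: {speedup:.0f}x, about {orders_of_magnitude:.1f} orders of magnitude")
```

That is a factor of 240, or a little over two orders of magnitude, for this particular application.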
Q9. The Maxeler-Hitachi Data Systems solution leverages the new Amazon AWS F1 instance. Why? Can you please elaborate on this?
M. Hay, O. Mencer: Our joint hardware solution complements the F1 instance for on-premise activities in a hybrid cloud setting. It helps that the latest Maxeler generation (MAX5) is fully compatible with F1 and it is therefore easy to build a hybrid cloud solution with a single code base. If the reader would like to learn more we’re open and able to entertain discussions about finding relevant problems to engage on.
MICHAEL HAY | マイケル ヘイ
VP & CHIEF ENGINEER – HITACHI DATA SYSTEMS. GENERAL MGR, DIGITAL SOLUTIONS BUSINESS DEVELOPMENT – HITACHI, SPBD
As Vice President and Chief Engineer at Hitachi Data Systems and a General Manager of the Service Business Platform Division in Japan, Michael leads a global team that contemplates and enacts the future of Hitachi’s expanding ICT and Social Innovation portfolios. Michael engages a variety of R&D teams, using a clear understanding of market requirements, to guide direction and inspire innovation. Michael joined HDS in 2001 after serving as CEO and owner of a consultancy company focused on complex Enterprise and Systems management design and deployments. His professional background spans over 20 years and includes stints at IBM, IBM partners, and other IT start-up companies. These roles have helped Michael develop a capacity to define solutions for tomorrow’s problems. Michael holds a Master’s in Industrial Engineering with a focus in Human Factors from San Jose State and a Bachelor’s degree in Electrical Engineering from the University of New Mexico, in Albuquerque, NM.
Oskar Mencer. Prior to founding Maxeler, Oskar was Member of Technical Staff at the Computing Sciences Center at Bell Labs in Murray Hill, leading the effort in “Stream Computing”. He joined Bell Labs after receiving a PhD from Stanford University. Besides driving Maximum Performance Computing (MPC) at Maxeler, Oskar was Consulting Professor in Geophysics at Stanford University and he is also affiliated with the Computing Department at Imperial College London, having received two Best Paper Awards, an Imperial College Research Excellence Award in 2007 and a Special Award from Com.sult in 2012 for “revolutionising the world of computers”.
– Video: What is OpenSPL? Professor Michael J Flynn, Stanford University
OpenSPL is an open standard for a novel Spatial Programming Language. It is based on the core concept that a program executes in space, rather than in time sequence. All operations are assumed to be parallel unless specified to be sequential. This is similar to a factory floor where all operations execute in parallel, but each operation executes a different part of the overall process. Temporal Programming is a recipe for the execution of actions, whereas Spatial Programming builds a factory to execute the recipe.
Follow us on Twitter: @odbmsorg
“I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done.” –Rob Winters
I have interviewed Rob Winters, Head of Business Intelligence at TravelBird. The interview covers Rob’s project experience with data analytics and HPE Vertica.
Q1. What is the business of TravelBird?
Rob Winters: TravelBird builds and provides a daily selection of inspirational holiday offerings in twelve markets across Europe. Our goal is to create packages which excite the imagination and bring simplicity and joy to the act of travelling. These packages are then shared with our travellers via email, our website, and our iOS and Android applications.
Q2. What are the current data projects at TravelBird?
Rob Winters: TravelBird’s journey toward being data driven is relatively short: we began our initial Business Intelligence buildout in mid-2015. Currently our BI team is engaged in a number of projects, both more traditional BI and advanced analytics, including:
– Building data sources and training an organization in self-service BI
– Replacing our generic daily selections with personalized content selection models
– Optimizing pricing of packages based on product price volatility and customer demand
– Adjusting email frequency and timing to improve customer engagement and lifetime value
Q3. What is your experience in using predictive analytics?
Rob Winters: I have been working in the predictive analytics field for six years now across a variety of problem areas – customer service, retail, gaming, and now travel. From a technology standpoint I originally worked heavily with commercial solutions (Teradata, SAS) but for the last four years have used almost exclusively open source software including Hadoop, Spark, R, and Python.
Q4. How do you evaluate whether the insights you discover are “good”?
Rob Winters: During the initial development of our algorithms we will typically follow a basic version of CRISP-DM to build an initial working model for our problem. To test models, we always use an A/B test and typically follow a two phase process: first the model is split-test against the current operational process/human selection, then when the model consistently outperforms the status quo, we will test future model iterations against the control.
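One common way to decide whether a model “consistently outperforms” a control in a split test is a two-proportion z-test on conversion rates. The sketch below is a minimal illustration under that assumption; the interview does not say which statistical test TravelBird uses, and the conversion counts are invented for the example.

```python
import math

def two_proportion_z_test(conv_a: int, n_a: int, conv_b: int, n_b: int):
    """One-sided test of whether variant B converts better than control A.

    Returns (z, p_value), where p_value is P(Z >= z) under the null
    hypothesis that both groups share the same conversion rate.
    """
    p_a = conv_a / n_a
    p_b = conv_b / n_b
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    z = (p_b - p_a) / std_err
    p_value = 0.5 * math.erfc(z / math.sqrt(2))  # upper tail of the normal
    return z, p_value

# Hypothetical phase 1: human selection (A) versus the model (B)
z, p_value = two_proportion_z_test(conv_a=500, n_a=10_000, conv_b=600, n_b=10_000)
print(f"z = {z:.2f}, one-sided p = {p_value:.4f}")
```

In the second phase described above, the same comparison would be repeated with the current production model as the control and each new model iteration as the variant.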
Q5. Can you tell us a bit about the work you did in designing and implementing a fully automated, machine learning based content selection platform?
Rob Winters: To provide context, every day our planning team creates six unique product offerings for their target market of 50-500k customers to be shared via web, iOS/Android app, and email. Our goal was to replace that model with one that selects six unique products for each recipient based on past browsing and travel behavior. To do so, we designed an ensemble model consisting of several components:
– A customer preference model (user-item recommendation model)
– A product similarity model (item-item similarity)
– A “hotness” model to promote destinations which are trending/outperforming/expected to do well
– A portfolio model to select the right diversity for each recipient based on recommendation confidence, lifecycle state, and yield optimization of cannibalization vs product fit for a recipient
The data to feed these models is based on observing dozens of events per recipient per day, positive and negative feedback events of the recipient, all observable product features, and human expert input. The models are also able to improve themselves by continuously tuning the input parameters of each model based on recommendation split testing.
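The way such component scores might be combined can be sketched as a weighted blend followed by a simple diversity-constrained selection. Everything in this example is an assumption made for illustration: the weights, the per-category cap, the field names and the sample products are hypothetical, and the real portfolio model also accounts for lifecycle state and yield, which are omitted here.

```python
from dataclasses import dataclass
from typing import Dict, List

@dataclass
class Candidate:
    product_id: str
    category: str
    preference: float  # user-item recommendation score
    similarity: float  # item-item similarity to past behavior
    hotness: float     # trending / expected performance

# Hypothetical blend weights; in practice these would be tuned by split testing.
WEIGHTS = {"preference": 0.5, "similarity": 0.3, "hotness": 0.2}

def blended_score(c: Candidate) -> float:
    """Weighted sum of the three component model scores."""
    return (WEIGHTS["preference"] * c.preference
            + WEIGHTS["similarity"] * c.similarity
            + WEIGHTS["hotness"] * c.hotness)

def select_portfolio(candidates: List[Candidate], k: int = 6,
                     max_per_category: int = 2) -> List[Candidate]:
    """Greedy pick of the top-k products with a cap per category for diversity."""
    chosen: List[Candidate] = []
    per_category: Dict[str, int] = {}
    for c in sorted(candidates, key=blended_score, reverse=True):
        if per_category.get(c.category, 0) >= max_per_category:
            continue  # skip: this category is already well represented
        chosen.append(c)
        per_category[c.category] = per_category.get(c.category, 0) + 1
        if len(chosen) == k:
            break
    return chosen

candidates = [
    Candidate("b1", "beach", 0.9, 0.8, 0.7),
    Candidate("b2", "beach", 0.8, 0.7, 0.9),
    Candidate("b3", "beach", 0.7, 0.9, 0.6),
    Candidate("b4", "beach", 0.6, 0.6, 0.5),
    Candidate("c1", "city", 0.5, 0.4, 0.8),
    Candidate("c2", "city", 0.4, 0.5, 0.3),
    Candidate("s1", "ski", 0.3, 0.2, 0.9),
    Candidate("s2", "ski", 0.2, 0.3, 0.1),
]

picks = select_portfolio(candidates)
print([p.product_id for p in picks])
```

Without the category cap this recipient would receive four beach offers; with it, lower-scoring city and ski products fill the remaining slots, which is the kind of trade-off between product fit and diversity that the portfolio component is described as managing.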
Q6. What are the primary technologies you are using?
Rob Winters: Our technology stack consists of the following:
– Data warehousing: HPE Vertica
– Operations DBs: MySQL (web services) + Postgres (internal services)
– Recommendations serving: Redis
– Modeling/Analysis: Python, Spark via PySpark
Q7. What is your experience in using HPE Vertica?
Rob Winters: I have been using Vertica for five years in a number of organizations and facilitated the first rollout in the Netherlands. During that time I have been primarily an end user/data analyst but have also been the DBA for my deployments for the last two years.
Q8: Can you give us some more technical details about this first rollout in the Netherlands? What challenges did you solve in using HPE Vertica? What business benefits did you obtain?
Rob Winters: The objective of our rollout was to implement a centralized company data warehouse to unify several production databases plus external API data.
The existing platform was Postgres (row-based solution) and relatively limited in performance. Primary gains were significantly faster analytics, the ability to add in several terabytes of event data (which was not possible on the prior platform), and new insights into the email database regarding churn, conversion, and customer value.
Q9: What were the main criteria for you to choose HPE Vertica? Did you do any performance test for HPE Vertica?
Rob Winters: We considered a number of alternatives including Microsoft PDW, Greenplum, and Infobright.
The primary considerations were price/performance, scalability, and analytical functionality. We found Vertica to be the best option across those aspects. Regarding performance testing, we did compare Infobright and Vertica and found the latter to be both more performant and easier to work with.
Q10. What specific functionalities of HPE Vertica do you find particularly useful in your job?
Rob Winters: There are a number of aspects which I find extremely beneficial, including:
– Ease of administration
– Performance tunability is very good, much higher than (for example) Redshift
– Analytical function extensions enable extremely powerful analyses directly via SQL
– The ability to load JSON data allows very rapid data integration from new sources
Q11. Do you think it is possible to turn an employee into a data analyst?
Rob Winters: Absolutely, I’ve managed several employees who have successfully transitioned from an operations role to an analytics role. In fact, some of them have become my best analysts because they have brought a deeper domain knowledge to their analyses than someone approaching from the outside may have done. The biggest drivers for success in the transition have been:
– Attitude/eagerness to learn
– Close collaboration with a more experienced analyst, either their supervisor or a more senior peer
– Making their initial projects in areas where they are unable to fall back on domain knowledge
Rob Winters, Head of Business Intelligence at TravelBird.
Rob has been working with and leading analytics teams since 2006 across a number of industries including telco, gaming, retail, and travel. His primary focus since 2011 has been green-field implementations of technology and team creation for both traditional business intelligence and predictive analytics; full details are listed on his LinkedIn profile. He holds a bachelor’s in economics and an MBA with an IT concentration.
– Data-X: Video lectures on very practical and applied Data Analytics. Data-X is a project to produce a collection of video lectures on very practical and applied data analytics.
“I think we’re just beginning to grapple with implications of data as an economic asset” –Steve Lohr.
My last interview for this year is with Steve Lohr. Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting. We discussed Big Data and how it influences the new Artificial Intelligence awakening.
Wishing you all the best for the Holiday Season and a healthy and prosperous New Year!
Steve Lohr: Both Google and Microsoft are contributing their tools to expand and enlarge the AI community, which is good for the world and good for their businesses. But I also think the move is a recognition that algorithms are not where their long-term advantage lies. Data is.
Q2. What are the implications of that for both business and policy?
Steve Lohr: The companies with big data pools can have great economic power. Today, that shortlist would include Google, Microsoft, Facebook, Amazon, Apple and Baidu.
I think we’re just beginning to grapple with implications of data as an economic asset. For example, you’re seeing that now with Microsoft’s plan to buy LinkedIn, with its personal profiles and professional connections for more than 400 million people. In the evolving data economy, is that an antitrust issue of concern?
Q3. In this competing world of AI, what is more important, vast data pools, sophisticated algorithms or deep pockets?
Steve Lohr: The best answer to that question, I think, came from a recent conversation with Andrew Ng, a Stanford professor who worked at GoogleX, is co-founder of Coursera and is now chief scientist at Baidu. I asked him why Baidu, and he replied there were only a few places to go to be a leader in A.I. Superior software algorithms, he explained, may give you an advantage for months, but probably no more. Instead, Ng said, you look for companies with two things — lots of capital and lots of data. “No one can replicate your data,” he said. “It’s the defensible barrier, not algorithms.”
Q4. What is the interplay and implications of big data and artificial intelligence?
Steve Lohr: The data revolution has made the recent AI advances possible. We’ve seen big improvements in the last few years, for example, in AI tasks like speech recognition and image recognition, using neural network and deep learning techniques. Those technologies have been around for decades, but they are getting a huge boost from the abundance of training data because of all the web image and voice data that can be tapped now.
Q5. Is data science really only a here-and-now version of AI?
Steve Lohr: No, certainly not only. But I do find that phrase a useful way to explain to most of my readers (intelligent people, but not computer scientists) the interplay between data science and AI, and to convey that the rudiments of data-driven AI are already all around us. It’s not, surely not yet, robot armies and self-driving cars as fixtures of everyday life. But it is internet search, product recommendations, targeted advertising and elements of personalized medicine, to cite a few examples.
Q6. Technology is moving beyond increasing the odds of making a sale, to being used in higher-stakes decisions like medical diagnosis, loan approvals, hiring and crime prevention. What are the societal implications of this?
Steve Lohr: The new, higher-stakes decisions that data science and AI tools are increasingly being used to make — or assist in making — are fundamentally different than marketing and advertising. In marketing and advertising, a decision that is better on average is plenty good enough. You’ve increased sales and made more money. You don’t really have to know why.
But the other decisions you mentioned are practically and ethically very different. These are crucial decisions about individual people’s lives. Better on average isn’t good enough. For these kinds of decisions, issues of accuracy, fairness and discrimination come into play.
That, I think, argues for two things. First, some sort of auditing tool; the technology has to be able to explain itself, to explain how a data-driven algorithm came to the decision or recommendation that it did.
Second, I think it argues for having a “human in the loop” for most of these kinds of decisions for the foreseeable future.
Q7. Will data analytics move into the mainstream of the economy (far beyond the well known, born-on-the-internet success stories like Google, Facebook and Amazon)?
Steve Lohr: Yes, and I think we’re seeing that now in nearly every field — health care, agriculture, transportation, energy and others. That said, it is still very early. It is a phenomenon that will play out for years, and decades.
Recently, I talked to Jeffrey Immelt, the chief executive of General Electric, America’s largest industrial company. GE is investing heavily to put data-generating sensors on its jet engines, power turbines, medical equipment and other machines — and to hire software engineers and data scientists.
Immelt said if you go back more than a century to the origins of the company, dating back to Thomas Edison’s days, GE’s technical foundation has been materials science and physics. Data analytics, he said, will be the third fundamental technology for GE in the future.
I think that’s a pretty telling sign of where things are headed.
Steve Lohr has covered technology, business, and economics for the New York Times for more than twenty years and writes for the Times’ Bits blog. In 2013 he was part of the team awarded the Pulitzer Prize for Explanatory Reporting.
He was a foreign correspondent for a decade and served as an editor, and has written for national publications such as the New York Times Magazine, the Atlantic, and the Washington Monthly. He is the author of Go To: The Story of the Math Majors, Bridge Players, Engineers, Chess Wizards, Maverick Scientists, Iconoclasts—the Programmers Who Created the Software Revolution and Data-ism The Revolution Transforming Decision Making, Consumer Behavior, and Almost Everything Else.
He lives in New York City.