Skip to content

"Trends and Information on Big Data, Data Science, New Data Management Technologies, and Innovation."

This is the Industry Watch blog. To see the complete ODBMS.org
website with useful articles, downloads and industry information, please click here.

Jun 27 16

On Using HP Vertica. Interview with Eva Donaldson.

by Roberto V. Zicari

“After you have built out your data lake, use it. Ask it questions. You will begin to see patterns where you want to dig deeper. The Hadoop ecosystem doesn’t allow for that digging and not at a speed that is customer facing. For that, you need some sort of analytical database.”– Eva Donaldson.

I have interviewed Eva Donaldson, software engineer and data architect at iContact. Main topic of the interview is her experience in using HP Vertica.

RVZ

Q1. What is the business of iContact?

Eva Donaldson: iContact is a provider of cloud based email marketing, marketing automation and social media marketing products. We offer expert advice, design services, and an award-winning Salesforce email integration and Google Analytics tracking features specializing in small and medium sized businesses and nonprofits in U.S. and internationally.

Q2. What kind of information are your customers asking for?

Eva Donaldson: Marketing analytics including but not limited to how customers reached them, interaction with individual messages, targeting of marketing based on customer identifiers.

Q3. What are the main technical challenges you typically face when performing email marketing for small and medium businesses?

Eva Donaldson: Largely our technical challenges are based on sheer size and scope of data processing. We need to process multiple data points on each customer interaction, on each customer individually and on landing page interaction.

Q4. You attempted to build a product on Infobright. Why did you choose Infobright? What was your experience?

Eva Donaldson: We started with Infobright because we were using it for log processing and review. It worked okay for that since all the logs are always referenced by date which would come in order. For anything but the most basic querying by date Infobright failed. Tables could not be joined. Selection by any column not in order was impossible in the size data we were processing. For really large datasets some rows would just not be inserted without warning or explanation.

Q5. After that, you deployed a solution using HPE Vertica. Why did you choose HPE Vertica? Why didn`t you instead consider another open source solution?

Eva Donaldson: Once we determined that Infobright was not the correct solution, we knew already that we needed an analytical style database. I asked anyone and everyone who was working with true analytics at scale what database backend they were using and if they were happy. Three products come to the forefront: Vertica, Teradata and Oracle. The people using Oracle who were happy were complete Oracle shops. Since we do not use Oracle for anything this was not the solution for us. We decided to review Vertica, Teradata and Netezza. Of the three Vertica for our needs came out the clear winner.

Vertica installs on commodity hardware which meant we could deploy it immediately on servers we had on hand already. Scaling out is horizontal since Vertica clusters natively which meant it fit exactly in with the way we already handled our scaling practices.

After the POC with Vertica’s free version and seeing the speed and accuracy of queries, there was no doubt we had picked the right one for our needs. Continued use and expansion of the cluster has continued to prove that Vertica stands up to everything we throw at it. We have been able to easily put in a new node, migrate nodes to beefy boxes when we needed to. Performance on queries has been unequaled. We are able to return complex analytical queries in milliseconds.

As to other open source tools, we did consider them. I looked at Greenplum and I don’t remember what all other columnar data stores. There are loads of them out there. But they are all limited in one way or another and most of them are very similar in ability to Infobright. They just don’t scale to what we needed.

The other place people always think is Hadoop. Hadoop and all the related ecosystem is a great place to put stuff while you are wondering what questions you can ask. It is nice to have Hadoop (Hive, Hbase, etc.) to have a place to stick EVERYTHING without question. Then from there you can begin to do some very broad analysis to see what you have. But nothing coming out of a basic file system is going to get you the nitty-gritty analysis to answer the real questions in a timely manner. After you have built out your data lake, use it. Ask it questions. You will begin to see patterns where you want to dig deeper. The Hadoop ecosystem doesn’t allow for that digging and not at a speed that is customer facing. For that, you need some sort of analytical database.

Q6. Can you give us some technical details on how you use HPE Vertica? What are the specific features of HPE Vertica you use and for what?

Eva Donaldson: We have Vertica installed on Ubuntu 12.04 in a three node cluster. We load data via the bulk upload methods available from the JDBC driver. Querying includes many of the advanced analytical functions available in the language as well as standard SQL statements.

We use the Management Console to get insight into query performance, system health, etc. Management Console also provides a tool to suggest and build projections based on queries that have been run in the past. We run the database designer on a fairly regular basis to keep things tuned to how it is actively being used.

We do most of our loading via Pentaho DI and quite a lot of querying from that as well. We also have connectors from Pentaho reports. We have some PHP applications that reach that data as well.

Q7. To query the database, did you have as requirement to use a standard SQL interface? Or it does not really matter which query language you use?

Eva Donaldson: Yes, we required a standard SQL interface and availability of a JDBC driver to integrate the database with our other tools and applications.

Q8. Did you perform any benchmark to measure the query performance you obtain with HPE Vertica? If yes, can you tell us how did you perform such benchmark (e.g. what workloads did you use, what kind of queries did you consider, etc,)

Eva Donaldson: To perform benchmarks we loaded our biggest fact table and its related dimensions. We took our most expensive queries and a handful of “like to have” queries that did not work at all in Infobright and pushed them through Vertica. I no longer have the results of those tests but obviously we were pleased as we chose the product.

Q9. What about updates? Do you have any measures for updates as well?

Eva Donaldson: We do updates regularly with both UPDATE and MERGE statements. MERGE is a very powerful utility. I do not have specific times but again Vertica performs splendidly. Updates on millions of rows performs accurately and within seconds.

Q10. What is your experience of using various Business Intelligence, Visualization and ETL tools in their environment with HPE Vertica?

Eva Donaldson: The only BI tools we use are all part of the Pentaho suite. We use Report Designer, Analyzer and Data Integration. Since Pentaho comes with Vertica connectors it was very easy to begin working with it as the backend of our jobs and reports.

Qx Anything else you wish to add?

Eva Donaldson: If you are looking for an easy to build and maintain, performant analytical database nothing beats Vertica, hands down. If you are working with enough data that you are wondering how to process it all having an analytical database to be able to actually process the data, aggregate it, ask complicated questions from is priceless. We have gained enormous insight into our information because we can ask it questions in so many different ways and because we can get the data back in a performant manner.

———————

Eva Donaldson is a software engineer and data architect with 15+ years of experience building robust applications to both gather data and return it to assist in solving business challenges in marketing and medical environments. Her experience includes both OLAP and OLTP style databases using SQL Server, Oracle, MySQL, Infobright and HP Vertica. In addition, she has architected and developed the data consumption middle and front end tiers in PHP, C#, VB.Net and Java.

Resources

– What’s in store for Big Data analytics in 2016, Steve Sarsfield, Hewlett Packard Enterprise. ODBMS.org, 3 FEB, 2016.

– Column store database formats like ORC and Parquet reach new levels of performance. Steve Sarsfield, HP Vertica. ODBMS.org, JUNE 15, 2016

Taking on some of big data’s biggest challenges.   , HP Vertica. ODBMS.org, June 2016

Related Posts

– On the Internet of Things. Interview with Colin Mahony. ODBMS Industry Watch, March 14, 2016.

– On Big Data Analytics. Interview with Shilpa LawandeODBMS Industry Watch, December 10, 2015.

Follow us on Twitter: @odbmsorg

##

Jun 7 16

On Data Interoperability. Interview with Julie Lockner.

by Roberto V. Zicari

“From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care?”– Julie Lockner.

I have interviewed Julie Lockner.  Julie leads data platform product marketing for InterSystems. Main topics of the interview are Data Interoperability and InterSystems` data platform strategy.

RVZ

Q1. Everybody is talking about Big Data — is the term obsolete?

Julie Lockner: Well, there is no doubt that the sheer volume of data is exploding, especially with the proliferation of smart devices and the Internet of Things (IoT). An overlooked aspect of IoT is the enormous volume of data generated by a variety devices, and how to connect, integrate and manage it all.

The real challenge, though, is not just processing all that data, but extracting useful insights from the variety of device types. Put another way, not all data is created using a common standard. You want to know how to interpret data from each device, know which data from what type of device is important, and which trends are noteworthy. Better information can create better results when it can be aggregated and analyzed consistently, and that’s what we really care about. Better, higher quality outcomes, not bigger data.

Q2. If not Big Data, where do we go from here?

Julie Lockner: We always want to be focusing on helping our customers build smarter applications to solve real business challenges, such as helping them to better compete on service, roll out high-quality products quicker, simplify processes – not build solutions in search of a problem. A canonical example is in retail. Our customers want to leverage insight from every transaction they process to create a better buying experience online or at the point of sale. This means being able to aggregate information about a customer, analyze what the customer is doing while on the website, and make an offer at transaction time that would delight them. That’s the goal – a better experience – because that is what online consumers expect.

From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care? That implies we are analyzing not just more data, but better data that comes in all shapes and sizes, and that changes more frequently. It really points to the need for data interoperability.

Q3. What are the challenges software developers are telling you they have in today’s data-intensive world?

Julie Lockner: That they have too many database technologies to choose from and prefer to have a simple data platform architecture that can support multiple data models and multiple workloads within a single development environment.
We understand that our customers need to build applications that can handle a vast increase in data volume, but also a vast array of data types – static, non-static, local, remote, structured and non-structured. It must be a platform that coalesces all these things, brings services to data, offers a range of data models, and deals with data at any volume to create a more stable, long-term foundation. They want all of these capabilities in one platform – not a platform for each data type.

For software developers today, it’s not enough to pick elements that solve some aspect of a problem and build enterprise solutions around them; not all components scale equally. You need a common platform without sacrificing scalability, security, resilience, rapid response. Meeting all these demands with the right data platform will create a successful application.
And the development experience is significantly improved and productivity drastically increased when they can use a single platform that meets all these needs. This is why they work with InterSystems.

Q4. Traditionally, analytics is used with structured data, “slicing and dicing” numbers. But the traditional approach also involves creating and maintaining a data warehouse, which can only provide a historical view of data. Does this work also in the new world of Internet of Things?

Julie Lockner: I don’t think so. It is generally possible to take amorphous data and build it into a structured data model, but to respond effectively to rapidly changing events, you need to be able to take data in the form in which it comes to you.

If your data platform lacks certain fields, if you lack schema definition, you need to be able to capitalize on all these forms without generating a static model or a refinement process. With a data warehouse approach, it can take days or weeks to create fully cleansed, normalized data.
That’s just not fast enough in today’s always-on world – especially as machine-generated data is not conforming to a common format any time soon. It comes back to the need for a data platform that supports interoperability.

Q5. How hard is it to make decisions based on real-time analysis of structured and unstructured data?

Julie Lockner: It doesn’t have to be hard. You need to generate rules that feed rules engines that, in turn, drive decisions, and then constantly update those rules. That is a radical enhancement of the concept of analytics in the service of improving outcomes, as more real-time feedback loops become available.

The collection of changes we describe as Big Data will profoundly transform enterprise applications of the future. Today we can see the potential to drive business in new ways and take advantage of a convergence of trends, but it is not happening yet. Where progress has been made is the intelligence of devices and first-level data aggregation, but not in the area of services that are needed. We’re not there yet.

Q6. What’s next on the horizon for InterSystems in meeting the data platform requirements of this new world?

Julie Lockner: We continually work on our data platform, developing the most innovative ways we can think of to integrate with new technologies and new modes of thinking. Interoperability is a hugely important component. It may seem a simple task to get to the single most pertinent fact, but the means to get there may be quite complex. You need to be able to make the right data available – easily – to construct the right questions.

Data is in all forms and at varying levels of completeness, cleanliness, and accuracy. For data to be consumed as we describe, you need measures of how well you can use it. You need to curate data so it gets cleansed and you can cull what is important. You need flexibility in how you view data, too. Gathering data without imposing an orthodoxy or structure allows you to gain access to more data. Not all data will conform to a schema a priori.

Q7. Recently you conducted a benchmark test of an application based on InterSystems Caché®. Could you please summarize the main results you have obtained?

Julie Lockner: One of our largest customers is Epic Systems, one of the world’s top healthcare software companies.
Epic relies on Caché as the data platform for electronic medical record solutions serving more than half the U.S. patient population and millions of patients worldwide.

Epic tested the scalability and performance improvements of Caché version 2015.1. Almost doubling the scalability of prior versions, Caché delivers what Epic President Cark Dvorak has described as “a key strategic advantage for our user organizations that are pursuing large-scale medical informatics programs as well as aggressive growth strategies in preparation for the volume-to-value transformation in healthcare.”

Qx Anything else you wish to add?

Julie Lockner: The reason why InterSystems has succeeded in the market for so many years is a commitment to the success of those who depend on our technology. A recent Gartner Magic Quadrant report found we had the highest number of customers surveyed – 85% – who would buy from us again. That is the highest number of any vendor participating in that study.

The foundation of the company’s culture is all about helping our customers succeed. When our customers come to us with a challenge, we all pitch in to solve it. Many times our solutions may address an unusual problem that could benefit others – which then becomes the source of many of our innovations. It is one of the ways we are using problem-solving skills as a winning strategy to benefit others. When our customers are successful at using our engine to solve the world’s most important challenges, we all win.

——————-

Julie Lockner leads data platform product marketing for InterSystems. She has more than 20 years of experience in IT product marketing management and technology strategy, including roles at analyst firm ESG as well as Informatica and EMC.

—————–

Resources

“InterSystems Unveils Major New Release of Caché,” Feb. 25, 2015.

“Gartner Magic Quadrant for Operational DBMS, Donald Feinberg, Merv Adrian, Nick Heudecker, Adam M. Ronthal, and Terilyn Palanca, October 12, 2015, ID: G00271405.

– White Paper: Big Data Healthcare: Data Scalability with InterSystems Caché® and Intel® Processors (LINK to .PDF)

Related Posts

– A Grand Tour of Big Data. Interview with Alan Morrison. ODBMs Industry Watch, February 25, 2016

–  RIP Big Data. By Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, JANUARY 6, 2016.

What is data blending. By Oleg Roderick, David Sanchez, Geisinger Data Science. ODBMS.org, November 2015

Follow us on Twitter: @odbmsorg

##

May 24 16

On Data Analytics and the Enterprise. Interview with Narendra Mulani.

by Roberto V. Zicari

“A hybrid technology infrastructure that combines existing analytics architecture with new big data technologies can help companies to achieve superior outcomes.”–Narendra Mulani

I have interviewed Narendra MulaniChief Analytics Officer, Accenture Analytics. Main topics of our interview are: Data Analytics, Big Data, the Internet of Things, and their repercussion for the enterprise.

RVZ

Q1. What is your role at Accenture?

Narendra Mulani: I’m the Chief Analytics Officer at Accenture Analytics and I am responsible for building and inspiring a culture of analytics and driving Accenture’s strategic agenda for growth across the business. I lead a team of analytics professionals around the globe that are dedicated to helping clients transform into insight-driven enterprises and focused on creating value through innovative solutions that combine industry and functional knowledge with analytics and technology.

With the constantly increasing amount of data and new technologies becoming available, it truly is an exciting time for Accenture and our clients alike. I’m thrilled to be collaborating with my team and clients and taking part, first-hand, in the power of analytics and the positive disruption it is creating for businesses around globe.

Q2. What are the main drivers you see in the market for Big Data Analytics?

Narendra Mulani: Companies across industries are fighting to secure or keep their lead in the marketplace.
To excel in this competitive environment, they are looking to exploit one of their growing assets: Data.
Organizations see big data as a catalyst for their transformation into digital enterprises and as a way to secure an insight-driven competitive advantage. In particular, big data technologies are enabling companies with greater agility as it helps them to analyze data comprehensively and take more informed actions at a swifter pace. We’ve already passed the transition point with big data – instead of discussing the possibilities with big data, many are already experiencing the actual insight-driven benefits from it, including increased revenues, a larger base of loyal customers, and more efficient operations. In fact, we see our clients looking for granular solutions that leverage big data, advanced analytics and the cloud to address industry specific problems.

Q3. Analytics and Mobility: how do they correlate?

Narendra Mulani: Analytics and mobility are two digital areas that work hand-in-hand on many levels.
As an example, mobile devices and the increasingly connected world through the Internet of Things (IoT) have become two key drivers for big data analytics. As mobile devices, sensors, and the IoT are constantly creating new data sources and data types, big data analytics is being applied to transform the increasing amount of data into important and actionable insight that can create new business opportunities and outcomes. Also, this view can be reversed, where analytics feeds insight into mobile devices such as tablets to workers in offices or out in the field to enable them to make real-time decisions that could benefit their business.

Q4. Data explosion: What does it create ? Risks, Value or both?

Narendra Mulani: The data explosion that’s happening today and will continue to happen due to the Internet of Things creates a lot of opportunity for businesses. While organizations recognize the value that the data can generate, the sheer amount of data – internal data, external data, big data, small data, etc – can be overwhelming and create an obstacle for analytics adoption, project completion, and innovation. To overcome this challenge and pursue actionable insights and outcomes, organizations shouldn’t look to analyze all of the data that’s available, but identify the right data needed to solve the current project or challenge at hand to create value.

It’s also important for companies to manage the potential risk associated with the influx of data and take the steps needed to optimize and protect it. They can do this by aligning IT and business leads to jointly develop and maintain data governance and security strategies. At a high level, the strategies would govern who uses the data and how the data is analyzed and leveraged, define the technologies that would manage and analyze the data, and ensure the data is secured with the necessary standards. Suitable governance and security strategies should be requirements for insight-driven businesses. Without them, organizations could experience adverse and counter-productive results.

Q5. You introduced the concept of the “Modern Data Supply Chain”? How does it differ from the traditional Supply Chain?

Narendra Mulani: As companies’ data ecosystems are usually very complex with many data silos, a modern data supply chain helps them to simplify their data environment and generate the most value from their data. In brief, when data is treated as a supply chain, it can flow swiftly, easily and usefully through the entire organization— and also through its ecosystem of partners, including customers and suppliers.

To establish an effective modern data supply chain, companies should create a hybrid technology environment that enables a data service platform with emerging big data technologies. As a result, businesses will be able to access, manage, move, mobilize and interact with broader and deeper data sets across the organization at a much quicker pace than previously possible and place action on the attained analytics insights that could help it to more effectively deliver to its consumers, develop new innovative solutions, and differentiate in its market.

Q6. You talked about “Retooling the Enterprise”. What do you mean by this?

Narendra Mulani: Some businesses today are no longer just using analytics, they are taking the next step by transforming into insight-driven enterprises. To achieve “insight-driven enterprise” status, organizations need to retool themselves for optimization. They can pursue an insight-driven transformation by:

· Establishing a center of gravity for analytics – a center of gravity for analytics often takes the shape of a Center of Excellence or a similar concentration of talent and resources.
· Employing agile governance – build horizontal governance structures that are focused on outcomes and speed to value, and take a “test and learn” approach to rolling out new capabilities. A secure governance foundation could also improve the democratization of data throughout a business.
· Creating an inter-disciplinary high performing analytics team — field teams with diverse skills, organize talent effectively, and create innovative programs to keep the best talent engaged.
· Deploying new capabilities faster – deploy new, modern and agile technologies, as well as hybrid architectures and specifically designed toolsets, to help revolutionize how data has been traditionally managed, curated and consumed, to achieve speed to capability and desired outcomes. When appropriate, cloud technologies should be integrated into the IT mix to benefit from cloud-based usage models.
· Raising the company’s analytics IQ – have a vision of what would be your “intelligent enterprise” and implement an Analytics Academy that provides analytics training for functional business resources in addition to the core management training programs.

Q7. What are the risks from the Internet of Things? And how is it possible to handle such risks?

Narendra Mulani: The IoT is prompting an even greater focus on data security and privacy. As a company’s machines, employees and ecosystems of partners, providers, and customers become connected through the IoT, securing the data that is flowing across the IoT grid can be increasingly complex. Today’s sophisticated cyber attackers are also amplifying this complexity as they are constantly evolving and leveraging data technology to challenge a company’s security efforts.

To establish strong, effective real-time cyber defense strategy, security teams will need to employ innovative technologies to identify threat behavioral patterns — including artificial intelligence, automation, visualisation, and big data analytics – and an agile and fluid workforce to leverage the opportunities presented by technology innovations. They should also establish policies to address privacy issues that arise out of all the personal data that are being collected. Through this combination of efforts, companies will be able to strengthen its approach to cyber defense in today’s highly connected IoT world and empower cyber defenders to help their companies better anticipate and respond to cyber attacks.

Q8. What are the main lessons you have learned in implementing Big Data Analytic projects?

Narendra Mulani: Organizations should explore the entire big data technology ecosystem, take an outcome-focused approach to addressing specific business problems, and establish precise success metrics before an analytics project even begins. The big data landscape is in a constant state of change with new data sources and emerging big data technologies appearing every day that could offer a company a new value-generating opportunity. A hybrid technology infrastructure that combines existing analytics architecture with new big data technologies can help companies to achieve superior outcomes.
An outcome-focused strategy that embraces analytics experimentation and explores the possible data and technology that can help a company meet its goals and has checkpoints for measuring performance will be very valuable, as this strategy will help the analytics team to know if they should continue on course or need to make a course correction to attain the desired outcome.

Q9. Is Data Analytics only good for businesses? What about using (Big) Data for Societal issues?

Narendra Mulani: Analytics is helping businesses across industries and governments as well to make more informed decisions for effective outcomes, whether it might be to improve customer experience, healthcare or public safety.
As an example, we’re working with a utility company in the UK to help them leverage analytics insights to anticipate equipment failures and respond in near real-time to critical situations, such as leaks or adverse weather events. We are also working with a government agency to analyze its video monitoring feeds to identify potential public safety risks.

Qx Anything else you wish to add?

Narendra Mulani: Another area that’s on the rise is Artificial Intelligence – we define it as a collection of multiple technologies that enable machines to sense, comprehend, act and learn, either on their own or to augment human activities. The new technologies include machine learning, deep learning, natural language processing, video analytics and more. AI is disrupting how businesses operate and compete and we believe it will also fundamentally transform and improve how we work and live. When an organization is pursuing an AI project, it’s our belief that it should be business-oriented, people-focused, and technology rich for it to be most effective.

———

As Chief Analytics Officer and Head Geek – Accenture Analytics, Narendra Mulani is responsible for creating a culture of analytics and driving Accenture’s strategic agenda for growth across the business. He leads a dedicated team of 17,000 Analytic professionals that serve clients around the globe, focusing on value creation through innovative solutions that combine industry and functional knowledge with analytics and technology.

Narendra has held a number of leadership roles within Accenture since joining in 1997. Most recently, he was the managing director – Products North America, where he was responsible for creating value for our clients across a number of industries. Prior to that, he was managing director – Supply Chain, Accenture Management Consulting, leading a global practice responsible for defining and implementing supply chain capabilities at a diverse set of Fortune 500 clients.

Narendra graduated from Bombay University in 1978 with a Bachelor of Commerce, and received an MBA in Finance in 1982 as well as a PhD in 1985 focused on Multivariate Statistics, both from the University of Massachusetts.

Outside of work, Narendra is involved with various activities that support education and the arts. He lives in Connecticut with his wife Nita and two children, Ravi and Nikhil.

———-

Resources

– Ducati is Analytics Driven. Analytics takes Ducati around the world at speed and precision.

Accenture Analytics. Launching an insights-driven transformation.  Download the point of view on analytics operating models to better understand how high performing companies are organizing their capabilities.

– Accenture Cyber Intelligence Platform. Analytics helping organizations to continuously predict, detect and combat cyber attacks.

–  Data Acceleration: Architecture for the Modern Data Supply Chain, Accenture

Related Posts

On Big Data and Data Science. Interview with James KobielusSource: ODBMS Industry Watch,  2016-04-19

On the Internet of Things. Interview with Colin Mahony Source: ODBMS Industry Watch, 2016-03-14

A Grand Tour of Big Data. Interview with Alan MorrisonSource: ODBMS Industry Watch, 2016-02-25

On the Industrial Internet of Things. Interview with Leon GuzendaSource: ODBMS Industry Watch,  2016-01-28

On Artificial Intelligence and Society. Interview with Oren EtzioniSource: ODBMS Industry Watch,  2016-01-15

 

Follow us on Twitter: @odbmsorg

##

May 17 16

On data analytics for finance. Interview with Jason S.Cornez.

by Roberto V. Zicari

“Understanding human language remains a difficult problem. The challenges here are not only technical, but there is also a perception from popular culture that computers today perform at the level we see in science fiction. So there is a gap between what is expected and what is possible.”–Jason S.Cornez.

I have interviewed Jason S.Cornez, Chief Technology Officer, RavenPack. Main topic of the interview is unstructured data analytics for finance.

RVZ

Q1. What is the business of RavenPack?

Jason S.Cornez: We specialize in the systematic analysis of unstructured data for finance. RavenPack Analytics transforms unstructured big data sets,such as traditional news and social media, into structured granular data and indicators to help financial services firms improve their performance. RavenPack addresses the challenges posed by the characteristics of Big Data – volume, variety, veracity and velocity – by converting unstructured content into a format that can be more effectively analyzed, manipulated and deployed in financial applications.

Q2. How is Deutsche Bank using RavenPack News Analytics as an overlay to a pairs trading strategy?

Jason S.Cornez: The profits and risks from trading stock pairs are very much related to the type of information event which creates divergence. If divergence is caused by a piece of news related specifically to one constituent of the pair, there is a good chance that prices will diverge further. On the other hand, if divergence is caused by random price movements or a differential reaction to common information, convergence is more likely to follow after the initial divergence. To test the effects of news on a pairs trading strategy, Deutsche Bank used two aggregated indicators based on RavenPack’s Big Data analytics derived from news and social media data measuring sentiment and media attention.
Specifically, using the two indicators, Deutsche Bank created a filter that would ignore trades where divergence was supported by negative sentiment and abnormal news volume.
Overall, Deutsche Bank finds that applying a news analytics overlay can help differentiate between “good” price divergence (which is likely to converge) and “bad” divergence. More importantly, such ability provides significant improvements to the performance of a traditional pairs trading strategy, especially by reducing divergence risk.

Q3. Who needs sentiment analytics in finance and why?

Jason S.Cornez: Sentiment analytics can help improve performance of trading strategies,reduce risk, and monitor compliance. Quantitative investors often subscribe to RavenPack Analytics granular data. This provides them with the ability to detect relevant, novel and unexpected events – be they corporate, macroeconomic or geopolitical -so they can enter new positions, or protect existing ones. These events, and the sentiment associated with them, help drive alpha generation as a novel factor in automated trading models.

Traditional Asset Managers, such as those managing hedge funds, mutual funds, pension funds and family offices may subscribe to RavenPack Indicators to help run portfolio optimization. The Indicators provide snapshots of sentiment and information density for an entity or instrument that can be used alongside fundamental or technical indicators to build portfolios with better risk/return profiles.

Brokerage and Market Makers can leverage RavenPack sentiment data to manage risk and generate trade ideas. They rely on RavenPack’s detection of relevant, novel and unexpected events – be they corporate, macroeconomic or geopolitical – to create circuit breakers protecting them from event risk.

Risk and Compliance Managers use RavenPack data to monitor accumulation of adverse sentiment or detect headline risk. The data help risk managers locate accumulations of risk and volatility, or changes in liquidity – either by aggregating sentiment, identifying event-driven regime shifts, or by creating alerts for when sentiment indicators reach extremes. As well, RavenPack event data also aids surveillance analysts to receive fewer false positives from market abuse alerts.

Finally, Professional and Academic researchers use RavenPack data to better understand how news and social media affect markets. They want to inform their clients how to find new sources of value and, hence, research and write about how quantitative investment managers find value in the data. RavenPack’s granular data is a great source of unique data for academics to enhance their published research – be it presenting a new way to use the data or controlling for news and social media in their work.

Q4. What are the main challenges and opportunities for Big Data analytics for financial markets?

Jason S.Cornez: Much of the work so far in Big Data analytics has been confined to structured data. These are sets of labeled and elementized values, such as what you might find in a traditional database table. Tools like Hadoop and Spark have helped to make structured big data analyitcs approachable.

RavenPack has always focused on unstructured data, primarily English-language text. Doing analytics here isn’t just about data mining, it requires more sophisticated processing for each document. Understanding human language remains a difficult problem. The challenges here are not only technical, but there is also a perception from popular culture that computers today perform at the level we see in science fiction. So there is a gap between what is expected and what is possible. One of our goals here is certainly to help make computers a little smarter.

Things start to get really interesting when you produce analytics by marrying structured data with unstructured data. A simple example could be a news story where an analyst expects mortgage rates to hit 4% by summer. It is certainly great if a computer understands that this is a story about interest rate guidance, but so much better if the computer is able to combine this with historical mortgage rates to know that the rates are currently rising, but still far below historical norms. As an industry, I don’t think too much has been done here yet, but that we’ll be seeing more activity here in the coming years.

Financial markets rely on information in order to be efficient. Big Data analytics promises to provide more information, and more types of information, faster than was previously possible. A more efficient market could help to level the playing field, as it were. And even if markets never become truly efficient, the financial industry sees that Big Data analytics can certainly help them. Several of these opportunities were addressed in the answer to the previous question.

Q5. What are your practical experience in building an infrastructure for Big Data Analytics of mostly unstructured text content, in realtime?

Jason S.Cornez: RavenPack has been processing Big Data since before Cloud Computing was a practical reality. We noticed that most competitors in the news analytics space were offering software solutions, whereas RavenPack has always been a service provider. We sell data, not software. As such we invested in our own infrastructure maintained at trusted hosting facilities. This was perhaps not the easiest or cheapest route, but it leads to compelling products that are relatively easy for a customer to adopt.

From the beginning, we’ve built a distributed system where collection, storage, classification, analytics, publication, and monitoring all run on distinct machines connected by a high-speed network. We learned virtualization technologies so that we could leverage our hardware investments more efficiently. We’ve been rigorous about maintaining a separation of concerns and establishing well-defined interfaces between our components. This not only makes our system robust, but it also allows us to choose the best technologies for each task.

In recent years, we’ve migrated to Cloud Computing and our early investments in distributed systems are really paying off. Most of our components work directly in the cloud and also scale without additional engineering work.

Q6. How do you manage to have a very low latency?

Jason S.Cornez: Low latency has always been a requirement of the system. Starting with low-latency, realtime processing in mind led to many of the architectural decisions that I mentioned above – especially about being distributed and being able to leverage big hardware. It’s painful to think about re-engineering an existing system that wasn’t designed with low latency in mind.

A specific observation is that storage, especially magnetic based storage, is far slower than CPU and also far slower than networking. So we have a heavily multi-threaded system where all storage tasks are delegated to background threads and the flow of data in the realtime system never needs to wait on a database.

Speaking of multi-threading, RavenPack performs various types of classification on each document. Many of these are independent and can be performed in parallel. As well, within a single document and single type of classification, many aspects work only on local information, such as a paragraph. This work can also be done in parallel. As more powerful, multi-core machines continue to appear, our system can continue to improve.

Of course, low latency really begins with good algorithms and good tools. We measure the system as a whole on a daily basis and we profile our code for both speed and space on a regular basis. At times, there is a trade-off between a feature and doing it feasibly. We often sacrifice a new feature until we can solve how to implement it without negatively impacting the performance of our system.

Q7. What are the main technological challenges you are currently facing?

Jason S.Cornez: There are many challenges ahead. Some of the obvious ones are about branching out from English into other languages, or from plain text to other media formats.

On the purely technical side, we see that cloud computing and big data are still very young fields. Cloud resources are much more ephemeral than those in a controlled, hosted environment. We must adapt software to work well in the face of disappearing machines and inaccessible resources. One example is startup time of a system. Traditionally, startup is a rare event and our servers run for a long time. But now that changes, and system startup is much more frequent and hence must be made more efficient. We are evolving rapidly in these areas right now.

Perhaps the biggest challenge remains the perception gap that I mentioned earlier. I’m very proud of the system we’ve built, but it remains possible for a human to find an entity or an event in a document that our system misses. I don’t think this problem will ever go away, but I’m confident RavenPack is making great strides here.

Q8. Why and how do you use Allegro Common Lisp?

Jason S.Cornez: RavenPack has been using Franz Allegro Common Lisp since we began. It is the primary language we use for analysis and classification of unstructured text. Common Lisp is an excellent language for both exploratory programming and high performance computing.

Common Lisp is a multi-paradigm language, or even a paradigm-neutral language. So the engineer has the flexibility to map from concept to code in the most natural way possible. Some concepts map naturally to an object-oriented design, others to a functional design, and other to an imperative design. The language naturally supports all of these so you never need to map from your concept into the philosophy of the language. And further, lisp is a programmable programming language, so as new paradigms come along, they can be added to the language by any developer. This is so easy and natural in Common Lisp that you often do it even when there is only a single use case in mind.

Common Lisp also shines for deploying and maintaining production software. Of course, it supports native OS threads, native machine compilation, and high performance garbage collection. But as well, you can attach to, inspect, modify and patch live systems.

Q9. What are the main lessons you learned so far?

Jason S.Cornez: It’s been a long and interesting journey, and nearly everything we know now has been learned along the way. One way I like to think about the main lessons learned is to consider what I believe to be the barriers that might make it difficult for a competitor or potential client to replicate what we’ve done.

A significant selling-point of our product that provides lots of value to our clients is our extensive historical archive of analytics. This of course is derived from our archive of content. The curation of such an archive is much harder than most people imagine. There is the minor issue of implementing the spec that the provider supplies. But the fun begins as you realize that the archive is incomplete and in multiple incompatible formats, some of them not documented at all. There are multiple timestamps, many with no timezone. The realtime feed looks different from the historical archive. The list goes on.

None of this is meant as a complaint about our content partners – this is the nature of things. And even having learned this lesson, there isn’t much we could have done differently. Of course, we now have a checklist of questions we give to any new content provider – and they often improve their offering as a result of working with us. But if we hear that incorporating someone’s content will be easy, we now know to take this with a grain of salt.

Qx Anything else you wish to add?

Jason S.Cornez: Thanks for this opportunity. I hope it has been helpful.

———————————
Jason S.Cornez, Chief Technology Officer, RavenPack.
Jason joined RavenPack in 2003 and is responsible for the design and implementation of the RavenPack software platform. He is a hands-on technology leader, with a consistent record of delivering break-through products. 
A Silicon Valley start-up veteran with 20 years of professional experience, Jason combines technical know-how with an understanding of business needs to turn vision into reality. Jason holds a Master’s Degree in Computer Science, along with undergraduate degrees in Mathematics and EECS, from the Massachusetts Institute of Technology.

——————————

Resources

–  Common Lisp Educational Resources:  list of books, Lisp-oriented web sites and tutorials.

–  Basic Lisp Techniques: The PDF file provides an introduction to the Common Lisp language.

–  Mean Reversion II: Pairs Trading Strategies (LINK to .PDF) – Registration required-, Deutsche Bank, Feb. 16, 2016.  In this paper, Deutsche Bank shows how to use RavenPack News Analytics as an overlay to a pairs trading strategy.

Related Posts

– Enterprise Information Extraction. BY Yunyao Li, Research Manager, Scalable Natural Language Processing (SNaP) Group, IBM Research–Almaden.ODBMS.org, MAY 9, 2016.

– Big Data: Content and Technology. BY Gio Wiederhold, ODBMS.org, May 2016

– Above the Clouds: What Modern IT Portends. BY Filippo Balestrieri and Bernardo A. Huberman, Hewlett Packard Labs, ODBMS.org, MARCH 30, 2016.

– “Civility in the Age of Artificial Intelligence”. BY STEVE LOHR, technology reporter for The New York Times and the author of “Data-ism”.ODBMS.org, FEBRUARY 6, 2016.

Follow ODBMS.org on Twitter: @odbmsorg
##

May 2 16

Using NoSQL for Ireland’s Online Tax Research Database.

by Roberto V. Zicari

“When the Institute began to look for a new platform, it became apparent that a relational database was not the best solution to effectively manage and deliver our XML content.”–Martin Lambe.

The Irish Tax Institute is the leading representative and educational body for Ireland’s AITI Chartered Tax Advisers (CTA) and is the only professional body exclusively dedicated to tax. One of their service is TaxFind – Ireland’s Leading Online Tax Research Database, offering Search to 200,000 pages of tax content, over 8,000 pages of Irish tax legislation, Irish Tax Institute tax technical papers, over 25 leading tax commentary publications, and 1000s of Irish Tax Review articles.

I did a joint interview with Martin Lambe, CEO of the Irish Tax Institute and Sam Herbert, Client Services Director at 67 Bricks.
Main topics of the interview are the data challenges they currently face, and the implementation of TaxFind using MarkLogic.

RVZ

Q1. What are the main data challenges you currently have at the Irish Tax Institute?

Martin Lambe: The Irish Tax Institute moved its publication workflow to an XML-based process in 2009 and we have a large archive of valuable tax information contained in quite complex XML format. The main challenge was to find a solution that could store the repository of data (XML and other formats) and provide a simple search interface that directs users very quickly to the most relevant result. The “findability” of relevant content is crucial.

Q2. What is the TaxFind research database?

Martin Lambe: The Irish Tax Institute is the main provider of tax information in Ireland and TaxFind is the Institute’s online tax research database. TaxFind offers subscribers access to Irish tax legislation and guidance that includes tax technical papers from seminars and conferences, as well as over 30 tax commentary publications. It is used by thousands of CTAs in Ireland on a daily basis to assist in their tax research.

Q3. Who are the members that benefit from this TaxFind research database?

Martin Lambe: TaxFind serves the Chartered Tax Adviser (CTA) community in Ireland and other tax professionals such as those in the global accounting firms.

Q4. Why did you discard your previous implementation with a relational database system?

Martin Lambe: The previous database was literally creaking at the seams. Users were increasingly frustrated with difficulties accessing the database on different browsers and the old platform did not support mobile devices or tablets. When the Institute began to look for a new platform, it became apparent that a relational database was not the best solution to effectively manage and deliver our XML content. XML content stored in a NoSQL document database is indexed specifically for the search engine and this means the performance of our search engine and the relevancy of results is dramatically improved.

Q5. Why did you select MarkLogic`s NoSQL database platform?

Sam Herbert: MarkLogic is scalable to support fast querying across large amounts of data, it deals with XML content very well (and most of the tax data is either in XML, or in HTML that can be treated as XHTML), and has good searching. It is also a good environment to develop in – it has excellent documentation, and good tooling. It helps that it uses XQuery as one of its query languages, rather than a proprietary database-specific language.

Q6. Is SQL still important for you?

Sam Herbert: I don’t think it’s true to say that any particular type of technology is “important” to ITI – it’s all about how it can benefit users. From a 67 Bricks perspective, we work with relational databases, NoSQL databases, and graph databases depending on what shape the data is and what the needs are around querying it.

Q7 Why not choose an open source solution?

Sam Herbert: We’re using Open Source components in other parts of the system, and we’re keen on using Open Source where possible. However, for the data store, there aren’t any Open Source alternatives that have the combination of good scalability, good support for XML content, a standard query language, and powerful searching that we were looking for.

Q8. Can you tell us a bit about the architecture of the new implementation of the TaxFind research database

Sam Herbert: There are three major components:

– a frontend display and service layer written using the Play framework
– the MarkLogic data store
– a semantic enrichment component using Semaphore SmartLogic and the ITI taxonomy

The Play component is what users interact with – both for human users coming to the web site, and automated use of the web services. The bulk of the data retrieval and manipulation is done via a set of XQuery functions defined within the MarkLogic store. When new data is uploaded, it is processed within the Play code, enriched using Semaphore SmartLogic, and then stored in MarkLogic.

Q9. How do you manage to integrate Irish Tax Institute`s tax data, bringing together in excess of 300,000 pages of tax content including archive material in Word, PDF, XML and HTML?

Sam Herbert: The most complex part of the data is the XML content. These are very large XML files representing legislation, books, and other tax materials, that are inter-related in complex ways, and with a lot of deeply nested hierarchy. An important part of managing the data was splitting these into appropriately sized fragments, and then identifying the linking between different files – for example a piece of legislation will refer to other legislation, and commentary will refer to that legislation, and a new piece of legislation may supersede an earlier piece.

The non-XML content is larger in volume, but each individual document is smaller and is structurally simpler. Managing this content was largely a matter of loading it in and letting it be indexed.

Q10. How do you capture and digitize information in various formats and make it searchable?

Sam Herbert: Making it searchable is straightforward – it’s making it searchable in ways that support the expectations of the users that’s much more difficult.

A good search experience requires both subject matter expertise and good automated tests.

The basic search is using MarkLogic’s full text search. The next step was to work with tax experts within and outside the ITI to identify appropriate facets within the content with which to group the results – based on a combination of what the user requirements were and what was supported by the data.

There were additional complexities around weighting the search results to make the “best” results come at the top in as many circumstances as possible – for example, weighting terms within headings, weighting more recent content, weighting content based on its category so legislation is more important than commentary, and weighting content higher based on its popularity. The semantic enrichment based on tax terms from the ITI taxonomy also enhances the searching.

Q11. How do you ensure that this solution is scalable?

Sam Herbert: The solution is deployed to a load-balanced cluster using Amazon Web Services. The Play frontend is purely stateless REST. This means that we can scale to support more users easily by spinning up more servers – and using AWS makes this easy. Overall, using AWS has been a big win for us, in terms of being able to get servers running easily, being able to increase and decrease things like their memory size easily, and the various ancillary services it provides like DNS and load balancing. By making sure we can scale to support additional data, we can use MarkLogic effectively.

————-

Martin Lambe is Chief Executive of the Irish Tax Institute. His previous role within the Institute was that of Director of Finance.

Sam Herbert is Client Services Director at 67 Bricks, a company that works with information owners (particularly publishers) who want to enrich their content to make it more structured, granular, flexible and reusable.
67 Bricks utilises its deep understanding of the content enrichment challenge to help publishers develop systems and capabilities to increase the value of their content. With expertise in XML, business analysis, semantic tagging and software development, 67 Bricks works closely with its clients to develop and implement content enrichment capabilities and enriched content digital products.

————-
Resources

Irish Tax Institute

TaxFind

67 Bricks

MarkLogic

Related Posts

The rise of immutable data stores. By Alan Morrison, Senior Manager, PwC Center for technology and innovation (CTI). ODBMS.org

Unthink: Moving Beyond the Constraints of Relational Databases. by Tom McGrath, MarkLogic. ODBMS.org March 14, 2016.

MarkLogic Case Study: Royal Society of Chemistry.ODBMS.org

On making information accessible. Interview with David Leeming. ODBMS Industry Watch, on July 30, 2014

Follow us on Twitter: @odbmsorg

##

Apr 19 16

On Big Data and Data Science. Interview with James Kobielus

by Roberto V. Zicari

“One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.”– James Kobielus

On the topics of Big Data, and Data Science, I have interviewed James Kobielus, IBM Big Data Evangelist.

RVZ

Q1. What kind of companies generate Big Data, besides the Internet giants?

James Kobielus: Big data isn’t something you “generate.” Rather, the term refers to the ability to achieve differentiated value from advanced analytics on trustworthy data at any scale. In other words, it’s a best practice, not a specific type of data or even a specific scale of data (measured in volume, velocity, and/or variety).

When considered in this light, you can identify big data analytic applications in every industry. Every C-level executive has strategic applications of big data. Here are just a smattering:

  • Chief Marketing Officers have been the prime movers on many big data initiatives that involve Hadoop, NoSQL, and other approaches. Their primary applications consist of marketing campaign optimization, customer churn and loyalty, upsell and cross-sell analysis, targeted offers, behavioral targeting, social media monitoring, sentiment analysis, brand monitoring, influencer analysis, customer experience optimization, content optimization, and placement optimization
  • Chief Information Officers use big data platforms for data discovery, data integration, business analytics, advanced analytics, exploratory data science.
  • Chief Operations Officers rely on big data for supply chain optimization, defect tracking, sensor monitoring, and smart grid, among other applications.
  • Chief Information Security Officer run security incident and event management, anti-fraud detection, and other sensitive applications on big data.
  • Chief Technology Officers do IT log analysis, event analytics, network analytics, and other systems monitoring, troubleshooting, and optimization applications on big data.
  • Chief Financial Officers run complex financial risk analysis and mitigation modeling exercises on big data platforms.

Q2. What are the most challenging problems you are facing when analysing Big Data?

James Kobielus: Searching for actionable intelligence in big data involves building and testing advanced-analytics models against large volumes of complex data that may be flowing in at high velocities.

At these scales, it’s easy to get overwhelmed in your analysis unless you automate the end-to-end processes of extracting intelligence at scale. Automation can also help control the cost of managing a growing volume of algorithmic models against ever expanding big-data collections. The key processes that need automating are data discovery, profiling, sampling, and preparation, as well as model building, scoring, and deployment.

Q3. How do you typically handle them?

James Kobielus: Automating the modeling process will boost data scientist productivity by an order of magnitude, freeing them from drudgery so that they can focus on the sorts of exploration, modeling, and visualization challenges that demand expert human judgment. Data scientists can accelerate their modeling automation initiatives by following these steps:

  • Virtualize access to data, metadata, rules, and predictive models, as well as to data integration, data warehousing, and advanced analytic applications through a BI semantic virtualization layer;
  • Unify access, governance, orchestration, automation, and administration across these resources within a service-oriented architecture;
  • Explore commercial tools that support maximum automation of model development, scoring, deployment, and execution;
  • Consolidate, accelerate, and deepen predictive analytics through integration into big-data platforms with scalable in-database execution; and
  • Migrate existing analytical data marts into multidomain big-data platforms with unified data, metadata, and model governance within service-oriented virtualization framework.

Q4. What are in your experience the typical mistakes made in large scale data projects?

James Kobielus: One of the most typical mistakes in large-scale data projects is losing sight of the biases that may skew the insights you extract.

Even if you accept that a data scientist’s integrity is rock-solid, intentions pure, skills stellar, and discipline rigorous, there’s no denying that bias may creep inadvertently into their work with big data. The biases may be minor or major, episodic or systematic, tangential or material to their findings and recommendations. Whatever their nature, the biases must be understood and corrected as fully as possible.

Here are some of the key sources of bias that may crop up in a data scientist’s work with big data:

  • Cognitive bias: This is the tendency to make skewed decisions based on pre-existing cognitive and heuristic factors–such as a misunderstanding of probabilities–rather than on the data and other hard evidence. You might say that the educated intuition that drives data science is rife with cognitive bias, but that’s not always a bad thing.
  • Selection bias: This is the tendency to skew your choice of data sources to those that may be most available, convenient, and cost-effective for your purposes, as opposed to being necessarily the most valid and relevant for your study. Clearly, data scientists do not have unlimited budgets, may operate under tight deadlines, and don’t use data for which they lack authorization. These constraints may introduce an unconscious bias in the big-data collections they are able to assemble.
  • Sampling bias: This is the tendency to skew the sampling of data sets toward subgroups of the population most relevant to the initial scope of a data-science project, thereby making it unlikely that you will uncover any meaningful correlations that may apply to other segments. Another source of sampling bias is “data dredging,” in which the data scientist uses regression techniques that may find correlations in samples but that may not be statistically significant in the wider population. Consequently, you’re likely to spuriously confirm your initial model for the segments that happen to make the sampling cut.
  • Modeling bias: Beyond the biases just discussed, this is the tendency to skew data-science models by starting with a biased set of project assumptions that drive selection of the wrong variables, the wrong data, the wrong algorithms, and the wrong metrics of fitness. In addition, overfitting of models to past data without regard for predictive lift is a common bias. Likewise, failure to score and iterate models in a timely fashion with fresh observational data also introduces model decay, hence bias.
  • Funding bias: This may be the most silent but pernicious bias in data-scientific studies of all sorts. It’s the unconscious tendency to skew all modeling assumptions, interpretations, data, and applications to favor the interests of the party–employer, customer, sponsor, etc.–that employs or otherwise financially supports the data-science initiative. Funding bias makes it highly unlikely that data scientists will uncover disruptive insights that will “break the rice bowl” in which they make their living.

Q5. How do you measure “success” when analysing data?

James Kobielus: You measure success in your ability to distill useful insights in a timely fashion from the data at your disposal.

Q6. What skills are required to be an effective Data Scientist?

James Kobielus: Data science’s learning curve is formidable. To a great degree, you will need a degree, or something substantially like it, to prove you’re committed to this career. You will need to submit yourself to a structured curriculum to certify you’ve spent the time, money and midnight oil necessary for mastering this demanding discipline.

Sure, there are run-of-the-mill degrees in data-science-related fields, and then there are uppercase, boldface, bragging-rights “DEGREES.” To some extent, it matters whether you get that old data-science sheepskin from a traditional university vs. an online school vs. a vendor-sponsored learning program. And it matters whether you only logged a year in the classroom vs. sacrificed a considerable portion of your life reaching for the golden ring of a Ph.D. And it certainly matters whether you simply skimmed the surface of old-school data science vs. pursued a deep specialization in a leading-edge advanced analytic discipline.

But what matters most to modern business isn’t that every data scientist has a big honking doctorate. What matters most is that a substantial body of personnel has a common grounding in core curriculum of skills, tools and approaches. Ideally, you want to build a team where diverse specialists with a shared foundation can collaborate productively.

Big data initiatives thrive if all data scientists have been trained and certified on a curriculum with the following foundation:

  • Paradigms and practices: Every data scientist should acquire a grounding in core concepts of data science, analytics and data management. They should gain a common understanding of the data science lifecycle, as well as the typical roles and responsibilities of data scientists in every phase. They should be instructed on the various role(s) of data scientists and how they work in teams and in conjunction with business domain experts and stakeholders. And they learn a standard approach for establishing, managing and operationalizing data science projects in the business.
  • Algorithms and modeling: Every data scientist should obtain a core understanding of linear algebra, basic statistics, linear and logistic regression, data mining, predictive modeling, cluster analysis, association rules, market basket analysis, decision trees, time-series analysis, forecasting, machine learning, Bayesian and Monte Carlo Statistics, matrix operations, sampling, text analytics, summarization, classification, primary components analysis, experimental design, unsupervised learning constrained optimization.
  • Tools and platforms: Every data scientist should master a core group of modeling, development and visualization tools used on your data science projects, as well as the platforms used for storage, execution, integration and governance of big data in your organization. Depending on your environment, and the extent to which data scientists work with both structured and unstructured data, this may involve some combination of data warehousing, Hadoop, stream computing, NoSQL and other platforms. It will probably also entail providing instruction in MapReduce, R and other new open-source development languages, in addition to SPSS, SAS and any other established tools.
  • Applications and outcomes: Every data scientist should learn the chief business applications of data science in your organization, as well as in how to work best with subject-domain experts. In many companies, data science focuses on marketing, customer service, next best offer, and other customer-centric applications. Often, these applications require that data scientists understand how to leverage customer data acquired from structured survey tools, sentiment analysis software, social media monitoring tools and other sources. It also essential that every data scientist gain an understanding of the key business outcomes–such as maximizing customer lifetime value–that should focus their modeling initiatives.

Classroom instruction is important, but a curriculum that is 100 percent devoted to reading books, taking tests and sitting through lectures is insufficient. Hands-on laboratory work is paramount for a truly well-rounded data scientist. Make sure that your data scientists acquire certifications and degrees that reflect them actually developing statistical models that use real data and address substantive business issues.

A business-oriented data-science curriculum should produce expert developers of statistical and predictive models. It should not degenerate into a program that produces analytics geeks with heads stuffed with theory but whose diplomas are only fit for hanging on the wall.

Q7. Hadoop vs. Spark: what are the pros and cons?

James Kobielus: Big data analytics infrastructures are growing more hybridized than ever. Every new technology—such as Hadoop, in-memory databases, and graph databases—finds its specific niche in terms of use cases, deployment modes, and applications for which it is best suited.

Even as Apache Spark pushes more deeply into big-data environments, it won’t substantially change this trend. Yes, of course Spark is on the fast track to ubiquity in big-data analytics. This is especially true for the next generation of machine-learning applications that feed on growing in-memory pools and require low-latency distributed computations for streaming and graph analytics. But those use cases aren’t the sum total of big-data analytics and never will be.

As we all grow more infatuated with Spark, it’s important to continually remind ourselves of what it’s not suitable for. If, for example, one considers all the critical data management, integration, and preparation tasks that must be performed prior to modeling in Spark, it’s clear that these will not be executed in any of the Spark engines (Spark SQL, Spark Streaming, GraphX). Instead, they’ll be carried out in the data platforms and elastic clusters (HDFS, Cassandra, HBase, Mesos, cloud services, etc.) upon which those engines run. Likewise, you’d be hardpressed to find anyone who’s seriously considering Spark in isolation for data warehousing, data governance, master data management, or operational business intelligence.

Above all else, Spark is the new power tool for data scientists who are pushing boundaries in the emerging era of in-memory big data analytics in low-latency scenarios of all types. Spark is proving its value as a development tool for the new generation of data scientists building the in-memory statistical models upon which it all will depend.

Let’s not fall into the delusion that everything is converging toward Spark, as if it were the ravenous maw that will devour every other big-data analytics tool and platform. Spark is just another approach that’s being fitted to and optimized for specific purposes.

And let’s resist the hype that treats Spark as Hadoop’s “successor.” This implies that Hadoop and other big-data approaches are “legacy,” rather than what they are, which is foundational. For example, no one is seriously considering doing “data lakes,” “data reservoirs,” or “data refineries” on anything but Hadoop or NoSQL.

——————–

James Kobielus is an industry veteran and serves as IBM Big Data Evangelist; Senior Program Director for Product Marketing in Big Data Analytics; and Team Lead, Technical Marketing, IBM Big Data & Analytics Hub. He spearheads thought leadership activities across the IBM Analytics solution portfolio. He has spoken at such leading industry events as IBM Insight, Hadoop Summit, and Strata. He has published several business technology books and is a very popular provider of original commentary on blogs and many social media.

Resources

–  Master of Information and Data Science,  UC Berkeley School of Information.

– MS in Data Science, NYU Center for Data Science.

– Free data science curriculum, kdnuggets.com

Data Science | Coursera

– Master of Science in Data Science – Data Science Institute

Data Mining and Applications Graduate Certificate, Stanford

The European Data Science Academy (EDSA) designs curricula for data science training and data science education across the European Union (EU).

-The EDISON project will focus on activities to establish the new profession of ‘Data Scientist’, following the emergence of Data Science technologies (also referred to as Data Intensive or Big Data technologies) which changes the way research is done, how scientists think and how the research data are used and shared. This includes definition of the required skills, competences framework/profile, corresponding Body Of Knowledge and model curriculum. It will develop a sustainability/business model to ensure a sustainable increase of Data Scientists, graduated from universities and trained by other professional education and training institutions in Europe. 
EDISON will facilitate the establishment of a Data Science education and training infrastructure at major European universities by promoting experience of ‘champion’ universities involving them into coordinated development and implementation of the model curriculum and creation of cooperative educational and training infrastructure.

Related Posts

– RIP Big Data, By Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, January  2016

Open Source Software and IBM’s Big Data platform. By Cynthia M. Saracco, senior solutions architect at IBM’s Silicon Valley Laboratory. ODBMS.org, April 2016.

Looking back at Big Data in 2015, By Cynthia M. Saracco, IBM Senior Solution Architect, ODBMS.org. November 2015

–  Heuristics for a Data Scientist: A common sense approach. BY Silvia Dassiè, Data Scientist at Ryanair. ODBMS.org, December 2015

The rise of immutable data stores. By Alan Morrison, Senior Manager, PwC Center for technology and innovation. ODBMS.org. October 2015

Follow us on Twitter: @odbmsorg

##

Apr 2 16

Recruit Institute of Technology. Interview with Alon Halevy

by Roberto V. Zicari

” A revolution will happen when tools like Siri can truly serve as your personal assistant and you start relying on such an assistant throughout your day. To get there, these systems need more knowledge about your life and preferences, more knowledge about the world, better conversational interfaces and at least basic commonsense reasoning capabilities. We’re still quite far from achieving these goals.”–Alon Halevy

I have interviewed Alon Halevy, CEO at Recruit Institute of Technology.

RVZ

Q1. What is the mission of the Recruit Institute of Technology?

Alon Halevy: Before I describe the mission, I should introduce our parent company Recruit Holdings to those who may not be familiar with it. Recruit (founded in 1960), is a leading “life-style” information services and human resources company in Japan with services in the areas of recruitment, advertising, employment placement, staffing, education, housing and real estate, bridal, travel, dining, beauty, automobiles and others. The company is currently expanding worldwide and operates similar businesses in the U.S., Europe and Asia. In terms of size, Recruit has over 30,000 employees and its revenues are similar to those of Facebook at this point in time.

The mission of R.I.T is threefold. First, being the lab of Recruit Holdings, our goal is to develop technologies that improve the products and services of our subsidiary companies and create value for our customers from  the vast collections of data we have. Second, our mission is to advance scientific knowledge by contributing to the research community through publications in top-notch venues. Third, we strive to use technology for social good. This latter goal may be achieved through contributing to open-source software, working on digital artifacts that would be of general use to society, or even working with experts in a particular domain to contribute to a cause.

Q2. Isn`t similar to the mission of the Allen Institute for Artificial Intelligence?

Alon Halevy: The Allen Institute is a non-profit whose admirable goal is to make fundamental contributions to Artificial Intelligence. While R.I.T strives to make fundamental contributions to A.I and related areas such as data management, we plan to work closely with our subsidiary companies and to impact the world through their products.

Q3. Driverless cars, digital Personal Assistants (e.g. Siri), Big Data, the Internet of Things, Robots: Are we on the brink of the next stage of the computer revolution?

Alon Halevy: I think we are seeing many applications in which AI and data (big or small) are starting to make a real difference and affecting people’s lives. We will see much more of it in the next few years as we refine our techniques. A revolution will happen when tools like Siri can truly serve as your personal assistant and you start relying on such an assistant throughout your day. To get there, these systems need more knowledge about your life and preferences, more knowledge about the world, better conversational interfaces and at least basic commonsense reasoning capabilities. We’re still quite far from achieving these goals.

Q4. You were for more than 10 years senior staff research scientist at Google, leading the Structured Data Group in Google Research. Was it difficult to leave Google?

Alon Halevy: It was extremely difficult leaving Google! I struggled with the decision for quite a while, and waving goodbye to my amazing team on my last day was emotionally heart wrenching. Google is an amazing company and I learned so much from my colleagues there. Fortunately, I’m very excited about my new colleagues and the entrepreneurial spirit of Recruit.
One of my goals at R.I.T is to build a lab with the same culture as that of Google and Google Research. So in a sense, I’m hoping to take Google with me. Some of my experiences from a decade at Google that are relevant to building a successful research lab are described in a blog post I contributed to the SIGMOD blog in September, 2015.

Q5. What is your vision for the next three years for the Recruit Institute of Technology?

Alon Halevy: I want to build a vibrant lab with world-class researchers and engineers. I would like the lab to become a world leader in the broad area of making data usable, which includes data discovery, cleaning, integration, visualization and analysis.
In addition, I would like the lab to build collaborations with disciplines outside of Computer Science where computing techniques can make an even broader impact on society.

Q6. What are the most important research topics you intend to work on?

Alon Halevy: One of the roadblocks to applying AI and analysis techniques more widely within enterprises is data preparation.
Before you can analyze data or apply AI techniques to it, you need to be able to discover which datasets exist in the enterprise, understand the semantics of a dataset and its underlying assumptions, and to combine disparate datasets as needed. We plan to work on the full spectrum of these challenges with the goal of enabling many more people in the enterprise to explore their data.

Recruit being a lifestyle company, another  fundamental question we plan to investigate is whether technology can help people make better life decisions. In particular, can technology help you take into consideration many factors in your life as you make decisions and steer you towards decisions that will make you happier over time. Clearly, we’ll need more than computer scientists to even ask the right questions here.

Q7. If we delegate decisions to machines, who will be responsible for the consequences? What are the ethical responsibilities of designers of intelligent systems?

Alon Halevy: You got an excellent answer from Oren Etzioni to this question in a recent interview. I agree with him fully and could not say it any better than he did.

Qx Anything you wish to add?

Alon Halevy: Yes. We’re hiring! If you’re a researcher or strong engineer who wants to make real impact on products and services in the fascinating area of lifestyle events and decision making, please consider R.I.T!

———-

Alon Halevy is the Executive Director of the Recruit Institute of Technology. From 2005 to 2015 he headed the Structured Data Management Research group at Google. Prior to that, he was a professor of Computer Science at the University of Washington in Seattle, where he founded the Database Group. In 1999, Dr. Halevy co-founded Nimble Technology, one of the first companies in the Enterprise Information Integration space, and in 2004, Dr. Halevy founded Transformic, a company that created search engines for the deep web, and was acquired by Google.
Dr. Halevy is a Fellow of the Association for Computing Machinery, received the Presidential Early Career Award for Scientists and Engineers (PECASE) in 2000, and was a Sloan Fellow (1999-2000). Halevy is the author of the book “The Infinite Emotions of Coffee”, published in 2011, and serves on the board of the Alliance of Coffee Excellence.
He is also a co-author of the book “Principles of Data Integration”, published in 2012.
Dr. Halevy received his Ph.D in Computer Science from Stanford University in 1993 and his Bachelors from the Hebrew University in Jerusalem.

Resources

– Civility in the Age of Artificial Intelligence,  by STEVE LOHR, technology reporter for The New York Times, ODBMS.org

The threat from AI is real, but everyone has it wrong, by Robert Munro, CEO Idibon, ODBMS.org

Related Posts

On Artificial Intelligence and Society. Interview with Oren Etzioni, ODBMS Industry Watch.

– On Big Data and Society. Interview with Viktor Mayer-Schönberger ODBMS Industry Watch.

Follow us on Twitter: @odbmsorg

##

Mar 14 16

On the Internet of Things. Interview with Colin Mahony

by Roberto V. Zicari

“Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data.”– Colin Mahony

I have interviewed Colin Mahony, SVP & General Manager, HPE Big Data Platform. Topics of the interview are: The challenges of the Internet of Things, the opportunities for Data Analytics, the positioning of HPE Vertica and HPE Cloud Strategy.

RVZ

Q1. Gartner says 6.4 billion connected “things” will be in use in 2016, up 30 percent from 2015.  How do you see the global Internet of Things (IoT) market developing in the next years?

Colin Mahony: As manufacturers connect more of their “things,” they have an increased need for analytics to derive insight from massive volumes of sensor or machine data. I see these manufacturers, particularly manufacturers of commodity equipment, with a need to provide more value-added services based on their ability to provide higher levels of service and overall customer satisfaction. Data analytics platforms are key to making that happen. Also, we could see entirely new analytical applications emerge, driven by what consumers want to know about their devices and combine that data with, say, their exercise regimens, health vitals, social activities, and even driving behavior, for full personal insight.
Ultimately, the Internet of Things will drive a need for the Analyzer of Things, and that is our mission.

Q2. What Challenges and Opportunities bring the Internet of Things (IoT)? 

Colin Mahony: Frankly, manufacturers are terrified to flood their data centers with these unprecedented volumes of sensor and network data. The reason? Traditional data warehouses were designed well before the Internet of Things, or, at least before OT (operational technology) like medical devices, industrial equipment, cars, and more were connected to the Internet. So, having an analytical platform to provide the scale and performance required to handle these volumes is important, but customers are taking more of a two- or three-tier approach that involves some sort of analytical processing at the edge before data is sent to an analytical data store. Apache Kafka is also becoming an important tier in this architecture, serving as a message bus, to collect and push that data from the edge in streams to the appropriate database, CRM system, or analytical platform for, as an example, correlation of fault data over months or even years to predict and prevent part failure and optimize inventory levels.

Q3. Big Data: In your opinion, what are the current main demands/needs in the market?

Colin Mahony: All organizations want – and need – to become data-driven organizations. I mean, who wants to make such critical decisions based on half answers and anecdotal data? That said, traditional companies with data stores and systems going back 30-40 years don’t have the same level playing field as the next market disruptor that just received their series B funding and only knows that analytics is the life blood of their business and all their critical decisions.
The good news is that whether you are a 100-year old insurance company or the next Uber or Facebook, you can become a data-driven organization by taking an open platform approach that uses the best tool for the job and can incorporate emerging technologies like Kafka and Spark without having to bolt on or buy all of that technology from a single vendor and get locked in.  Understanding the difference between an open platform with a rich ecosystem and open source software as one very important part of that ecosystem has been a differentiator for our customers.

Beyond technology, we have customers that establish analytical centers of excellence that actually work with the data consumers – often business analysts – that run ad-hoc queries using their preferred data visualization tool to get the insight need for their business unit or department. If the data analysts struggle, then this center of excellence, which happens to report up through IT, collaborates with them to understand and help them get to the analytical insight – rather than simply halting the queries with no guidance on how to improve.

Q4. How do you embed analytics and why is it useful? 

Colin Mahony: OEM software vendors, particularly, see the value of embedding analytics in their commercial software products or software as a service (SaaS) offerings.  They profit by creating analytic data management features or entirely new applications that put customers on a faster path to better, data-driven decision making. Offering such analytics capabilities enables them to not only keep a larger share of their customer’s budget, but at the same time greatly improve customer satisfaction. To offer such capabilities, many embedded software providers are attempting unorthodox fixes with row-oriented OLTP databases, document stores, and Hadoop variations that were never designed for heavy analytic workloads at the volume, velocity, and variety of today’s enterprise. Alternatively, some companies are attempting to build their own big data management systems. But such custom database solutions can take thousands of hours of research and development, require specialized support and training, and may not be as adaptable to continuous enhancement as a pure-play analytics platform. Both approaches are costly and often outside the core competency of businesses that are looking to bring solutions to market quickly.

Because it’s specifically designed for analytic workloads, HPE Vertica is quite different from other commercial alternatives. Vertica differs from OLTP DBMS and proprietary appliances (which typically embed row-store DBMSs) by grouping data together on disk by column rather than by row (that is, so that the next piece of data read off disk is the next attribute in a column, not the next attribute in a row). This enables Vertica to read only the columns referenced by the query, instead of scanning the whole table as row-oriented databases must do. This speeds up query processing dramatically by reducing disk I/O.

You’ll find Vertica as the core analytical engine behind some popular products, including Lancope, Empirix, Good Data, and others as well as many HPE offerings like HPE Operations Analytics, HPE Application Defender, and HPE App Pulse Mobile, and more.

Q5. How do you make a decision when it is more appropriate to “consume and deploy” Big Data on premise, in the cloud, on demand and on Hadoop?

Colin Mahony: The best part is that you don’t need to choose with HPE. Unlike most emerging data warehouses as a service where your data is trapped in their databases when your priorities or IT policies change, HPE offers the most complete range of deployment and consumption models. If you want to spin up your analytical initiative on the cloud for a proof-of-concept or during the holiday shopping season for e-retailers, you can do that easily with HPE Vertica OnDemand.
If your organization finds that due to security or confidentiality or privacy concerns you need to bring your analytical initiative back in house, then you can use HPE Vertica Enterprise on-premises without losing any customizations or disruption to your business. Have petabyte volumes of largely unstructured data where the value is unknown? Use HPE Vertica for SQL on Hadoop, deployed natively on your Hadoop cluster, regardless of the distribution you have chosen. Each consumption model, available in the cloud, on-premise, on-demand, or using reference architectures for HPE servers, is available to you with that same trusted underlying core.

Q6. What are the new class of infrastructures called “composable”? Are they relevant for Big Data?

Colin Mahony: HPE believes that a new architecture is needed for Big Data – one that is designed to power innovation and value creation for the new breed of applications while running traditional workloads more efficiently.
We call this new architectural approach Composable Infrastructure. HPE has a well-established track record of infrastructure innovation and success. HPE Converged Infrastructure, software-defined management, and hyper-converged systems have consistently proven to reduce costs and increase operational efficiency by eliminating silos and freeing available compute, storage, and networking resources. Building on our converged infrastructure knowledge and experience, we have designed a new architecture that can meet the growing demands for a faster, more open, and continuous infrastructure.

Q7. What is HPE Cloud Strategy? 

Colin Mahony: Hybrid cloud adoption is continuing to grow at a rapid rate and a majority of our customers recognize that they simply can’t achieve the full measure of their business goals by consuming only one kind of cloud.
HPE Helion not only offers private cloud deployments and managed private cloud services, but we have created the HPE Helion Network, a global ecosystem of service providers, ISVs, and VARs dedicated to delivering open standards-based hybrid cloud services to enterprise customers. Through our ecosystem, our customers gain access to an expanded set of cloud services and improve their abilities to meet country-specific data regulations.

In addition to the private cloud offerings, we have a strategic and close alliance with Microsoft Azure, which enables many of our offerings, including Haven OnDemand, in the public cloud. We also work closely with Amazon because our strategy is not to limit our customers, but to ensure that they have the choices they need and the services and support they can depend upon.

Q8. What are the advantages of an offering like Vertica in this space?

Colin Mahony: More and more companies are exploring the possibility of moving their data analytics operations to the cloud. We offer HPE Vertica OnDemand, our data warehouse as a service, for organizations that need high-performance enterprise class data analytics for all of their data to make better business decisions now. Built by design to drastically improve query performance over traditional relational database systems, HPE Vertica OnDemand is engineered from the same technology that powers the HPE Vertica Analytics Platform. For organizations that want to select Amazon hardware and still maintain the control over the installation, configuration, and overall maintenance of Vertica for ultimate performance and control, we offer Vertica AMI (Amazon Machine Image). The Vertica AMI is a bring-your-own-license model that is ideal for organizations that want the same experience as on-premise installations, only without procuring and setting up hardware. Regardless of which deployment model to choose, we have you covered for “on demand” or “enterprise cloud” options.

Q9. What is HPE Vertica Community Edition?

Colin Mahony: We have had tens of thousands of downloads of the HPE Vertica Community Edition, a freemium edition of HPE Vertica with all of the core features and functionality that you experience with our core enterprise offering. It’s completely free for up to 1 TB of data storage across three nodes. Companies of all sizes prefer the Community Edition to download, install, set-up, and configure Vertica very quickly on x86 hardware or use our Amazon Machine Image (AMI) for a bring-your-own-license approach to the cloud.

Q10. Can you tell us how Kiva.org, a non-profit organization, uses on-demand cloud analytics to leverage the internet and a worldwide network of microfinance institutions to help fight poverty? 

Colin Mahony: HPE is a major supporter of Kiva.org, a non-profit organization with a mission to connect people through lending to alleviate poverty. Kiva.org uses the internet and a worldwide network of microfinance institutions to enable individuals lend as little as $25 to help create opportunity around the world. When the opportunity arose to help support Kiva.org with an analytical platform to further the cause, we jumped at the opportunity. Kiva.org relies on Vertica OnDemand to reduce capital costs, leverage the SaaS delivery model to adapt more quickly to changing business requirements, and work with over a million lenders, hundreds of field partners and volunteers, across the world. To see a recorded Webinar with HPE and Kiva.org, see here.

Qx Anything else you wish to add?

Colin Mahony: We appreciate the opportunity to share the features and benefits of HPE Vertica as well as the bright market outlook for data-driven organizations. However, I always recommend that any organization that is struggling with how to get started with their analytics initiative to speak and meet with peers to learn best practices and avoid potential pitfalls. The best way to do that, in my opinion, is to visit with the more than 1,000 Big Data experts in Boston from August 29 – September 1st at the HPE Big Data Conference. Click here to learn more and join us for 40+ technical deep-dive sessions.

————-

Colin Mahony, SVP & General Manager, HPE Big Data Platform

Colin Mahony leads the Hewlett Packard Enterprise Big Data Platform business group, which is responsible for the industry leading Vertica Advanced Analytics portfolio, the IDOL Enterprise software that provides context and analysis of unstructured data, and Haven OnDemand, a platform for developers to leverage APIs and on demand services for their applications.
In 2011, Colin joined Hewlett Packard as part of the highly successful acquisition of Vertica, and took on the responsibility of VP and General Manager for HP Vertica, where he guided the business to remarkable annual growth and recognized industry leadership. Colin brings a unique combination of technical knowledge, market intelligence, customer relationships, and strategic partnerships to one of the fastest growing and most exciting segments of HP Software.

Prior to Vertica, Colin was a Vice President at Bessemer Venture Partners focused on investments primarily in enterprise software, telecommunications, and digital media. He established a great network and reputation for assisting in the creation and ongoing operations of companies through his knowledge of technology, markets and general management in both small startups and larger companies. Prior to Bessemer, Colin worked at Lazard Technology Partners in a similar investor capacity.

Prior to his venture capital experience, Colin was a Senior Analyst at the Yankee Group serving as an industry analyst and consultant covering databases, BI, middleware, application servers and ERP systems. Colin helped build the ERP and Internet Computing Strategies practice at Yankee in the late nineties.

Colin earned an M.B.A. from Harvard Business School and a bachelor’s degrees in Economics with a minor in Computer Science from Georgetown University.  He is an active volunteer with Big Brothers Big Sisters of Massachusetts Bay and the Joey Fund for Cystic Fibrosis.

Resources

What’s in store for Big Data analytics in 2016, Steve Sarsfield, Hewlett Packard Enterprise. ODBMS.org, 3 FEB, 2016

What’s New in Vertica 7.2?: Apache Kafka Integration!, HPE, last edited February 2, 2016

Gartner Says 6.4 Billion Connected “Things” Will Be in Use in 2016, Up 30 Percent From 2015, Press release, November 10, 2015

The Benefits of HP Vertica for SQL on Hadoop, HPE, July 13, 2015

Uplevel Big Data Analytics with Graph in Vertica – Part 5: Putting graph to work for your business , Walter Maguire, Chief Field Technologist, HP Big Data Group, ODBMS.org, 2 Nov, 2015

HP Distributed R ,ODBMS.org,  19 FEB, 2015.

Understanding ROS and WOS: A Hybrid Data Storage Model, HPE, October 7, 2015

Related Posts

On Big Data Analytics. Interview with Shilpa LawandeSource: ODBMS Industry Watch, Published on December 10, 2015

On HP Distributed R. Interview with Walter Maguire and Indrajit RoySource: ODBMS Industry Watch, Published on April 9, 2015

Follow us on Twitter: @odbmsorg

##

Feb 25 16

A Grand Tour of Big Data. Interview with Alan Morrison

by Roberto V. Zicari

“Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.”–Alan Morrison.

I have interviewed Alan Morrison, senior research fellow at PwC, Center for Technology and Innovation.
Main topic of the interview is how the Big Data market is evolving.

RVZ

Q1. How do you see the Big Data market evolving? 

Alan Morrison: We should note first of all how true Big Data and analytics methods emerged and what has been disruptive. Over the course of a decade, web companies have donated IP and millions of lines of code that serves as the foundation for what’s being built on top.  In the process, they’ve built an open source culture that is currently driving most big data-related innovation. As you mentioned to me last year, Roberto, a lot of database innovation was the result of people outside the world of databases changing what they thought needed to be fixed, people who really weren’t versed in the database technologies to begin with.

Enterprises and the database and analytics systems vendors who serve them have to constantly adjust to the innovation that’s being pushed into the open source big data analytics pipeline. Open source machine learning is becoming the icing on top of that layer cake.

Q2. In your opinion what are the challenges of using Big Data technologies in the enterprise?

Alan Morrison: Traditional enterprise developers were thrown for a loop back in the late 2000s when it comes to open source software, and they’re still adjusting. The severity of the problem differs depending on the age of the enterprise. In our 2012 issue of the Forecast on DevOps, we made clear distinctions between three age classes of companies: legacy mainstream enterprises, pre-cloud enterprises and cloud natives. Legacy enterprises could have systems that are 50 years old or more still in place and have simply added to those. Pre-cloud enterprises are fighting with legacy that’s up to 20 years old. Cloud natives don’t have to fight legacy and can start from scratch with current tech.

DevOps (dev + ops) is an evolution of agile development that focuses on closer collaboration between developers and operations personnel. It’s a successor to agile development, a methodology that enables multiple daily updates to operational codebases and feedback-response loop tuning by making small code changes and see how those change user experience and behaviour. The linked article makes a distinction between legacy, pre-cloud and cloud native enterprises in terms of their inherent level of agility:

Fig1
 Most enterprises are in the legacy mainstream group, and the technology adoption challenges they face are the same regardless of the technology. To build feedback-response loops for a data-driven enterprise in a legacy environment is more complicated in older enterprises. But you can create guerilla teams to kickstart the innovation process.

Q3. Is the Hadoop ecosystem now ready for enterprise deployment at large scale? 

Alan MorrisonHadoop is ten years old at this point, and Yahoo, a very large mature enterprise, has been running Hadoop on 10,000 nodes for years now. Back in 2010, we profiled a legacy mainstream media company who was doing logfile analysis from all of its numerous web properties on a Hadoop cluster quite effectively. Hadoop is to the point where people in their dens and garages are putting it on Raspberry Pi systems. Lots of companies are storing data in or staging it from HDFS. HDFS is a given. MapReduce, on the other hand, has given way to Spark.

HDFS preserves files in their original format immutably, and that’s important. That innovation was crucial to data-driven application development a decade ago. But Hadoop isn’t the end state for distributed storage, and NoSQL databases aren’t either. It’s best to keep in mind that alternatives to Hadoop and its ecosystem are emerging.

I find it fascinating what folks like LinkedIn and Metamarkets are doing data architecture wise with the Kappa architecture–essentially a stream processing architecture that also works for batch analytics, a system where operational and analytical data are one and the same. That’s appropriate for fully online, all-digital businesses.  You can use HDFS, S3, GlusterFS or some other file system along with a database such as Druid. On the transactional side of things, the nascent IPFS (the Interplanetary File System) anticipates both peer-to-peer and the use of blockchains in environments that are more and more distributed. Here’s a diagram we published last year that describes this evolution to date:
Fig2

From PWC Technology Forecast 2015

People shouldn’t be focused on Hadoop, but what Hadoop has cleared a path for that comes next.

Q4. What are in your opinion the most innovative Big Data technologies?

Alan Morrison: The rise of immutable data stores (HDFS, Datomic, Couchbase and other comparable databases, as well as blockchains) was significant because it was an acknowledgement that data history and permanence matters, the technology is mature enough and the cost is low enough to eliminate the need to overwrite. These data stores also established that eliminating overwrites also eliminates a cause of contention. We’re moving toward native cloud and eventually the P2P fog (localized, more truly distributed computing) that will extend the footprint of the cloud for the Internet of things.

Unsupervised machine learning has made significant strides in the past year or two, and it has become possible to extract facts from unstructured data, building on the success of entity and relationship extraction. What this advance implies is the ability to put humans in feedback loops with machines, where they let machines discover the data models and facts and then tune or verify those data models and facts.

In other words, large enterprises now have the capability to build their own industry- and organization-specific knowledge graphs and begin to develop cognitive or intelligent apps on top those knowledge graphs, along the lines of what Cirrus Shakeri of Inventurist envisions.

Fig3

From Cirrus Shakeri, “From Big Data to Intelligent Applications,”  post, January 2015 

At the core of computable semantic graphs (Shakeri’s term for knowledge graphs or computable knowledge bases) is logically consistent semantic metadata. A machine-assisted process can help with entity and relationship extraction and then also ontology generation.

Computability = machine readability. Semantic metadata–the kind of metadata cognitive computing apps use–can be generated with the help of a well-designed and updated ontology. More and more, these ontologies are uncovered in text rather than hand built, but again, there’s no substitute for humans in the loop. Think of the process of cognitive app development as a continual feedback-response loop process. The use of agents can facilitate the construction of these feedback loops.

Q5. In a recent note Carl Olofson, Research Vice President, Data Management Software Research, IDC, predicted the RIP of “Big Data” as a concept. What is your view on this?

Alan Morrison: I agree the term is nebulous and can be misleading, and we’ve had our fill of it. But that doesn’t mean it won’t continue to be used. Here’s how we defined it back in 2009:

Big Data is not a precise term; rather, it is a characterization of the never-ending accumulation of all kinds of data, most of it unstructured. It describes data sets that are growing exponentially and that are too large, too raw, or too unstructured for analysis using relational database techniques. Whether terabytes or petabytes, the precise amount is less the issue than where the data ends up and how it is used. (See https://www.pwc.com/us/en/technology-forecast/assets/pwc-tech-forecast-issue3-2010.pdf, pg. 6.)

For that issue of the Forecast, we focused on how Hadoop was being piloted in enterprises and the ecosystem that was developing around it. Hadoop was the primary disruptive technology, as well as NoSQL databases. It helps to consider the data challenge of the 2000s and how relational databases and enterprise data warehousing techniques were falling short at that point.  Hadoop has reduced the cost of analyzing data by an order of magnitude and allows processing of very large unstructured datasets. NoSQL has made it possible to move away from rigid data models and standard ETL.

“Big Data” can continue to be shorthand for petabytes of unruly, less structured data. But why not talk about the system instead of just the data? I like the term that George Gilbert of Wikibon latched on to last year. I don’t know if he originated it, but he refers to the System of Intelligence. That term gets us beyond the legacy, pre-web “business intelligence” term, more into actionable knowledge outputs that go beyond traditional reporting and into the realm of big data, machine learning and more distributed systems. The Hadoop ecosystem, other distributed file systems, NoSQL databases and the new analytics capabilities that rely on them are really at the heart of a System of Intelligence.

Q6. How many enterprise IT systems do you think we will need to interoperate in the future? 

Alan Morrison: I like Geoffrey Moore‘s observations about a System of Engagement that emerged after the System of Record, and just last year George Gilbert was adding to that taxonomy with a System of Intelligence. But you could add further to that with a System of Collection that we still need to build. Just to be consistent, the System of Collection articulates how the Internet of Things at scale would function on the input side. The System of Engagement would allow distribution of the outputs. For the outputs of the System of Collection to be useful, that system will need to interoperate in various ways with the other systems.

To summarize, there will actually be four enterprise IT systems that will need to interoperate, ultimately. Three of these exist, and one still needs to be created.

The fuller picture will only emerge when this interoperation becomes possible.

Q7. What are the  requirements, heritage and legacy of such systems?

Alan Morrison: The System of Record (RDBMSes) still relies on databases and tech with their roots in the pre-web era. I’m not saying these systems haven’t been substantially evolved and refined, but they do still reflect a centralized, pre-web mentality. Bitcoin and Blockchain make it clear that the future of Systems of Record won’t always be centralized. In fact, microtransaction flows in the Internet of Things at scale will depend on the decentralized approaches,  algorithmic transaction validation, and immutable audit trail creation which blockchain inspires.

The Web is only an interim step in the distributed system evolution. P2P systems will eventually complemnt the web, but they’ll take a long time to kick in fully–well into the next decade. There’s always the S-curve of adoption that starts flat for years. P2P has ten years of an installed base of cloud tech, twenty years of web tech and fifty years plus of centralized computing to fight with. The bitcoin blockchain seems to have kicked P2P in gear finally, but progress will be slow through 2020.

The System of Engagement (requiring Web DBs) primarily relies on Web technnology (MySQL and NoSQL) in conjunction with traditional CRM and other customer-related structured databases.

The System of Intelligence (requiring Web file systems and less structured DBs) primarily relies on NoSQL, Hadoop, the Hadoop ecosystem and its successors, but is built around a core DW/DM RDBMS analytics environment with ETLed structured data from the System of Record and System of Engagement. The System of Intelligence will have to scale and evolve to accommodate input from the System of Collection.

The System of Collection (requiring distributed file systems and DBs) will rely on distributed file system successors to Hadoop and HTTP such as IPFS and the more distributed successors to MySQL+ NoSQL. Over the very long term, a peer-to-peer architecture will emerge that will become necessary to extend the footprint of the internet of things and allow it to scale.

Q8. Do you already have the piece parts to begin to build out a 2020+ intersystem vision now?

Alan Morrison: Contextual, ubiquitous computing is the vision of the 2020s, but to get to that, we need an intersystem approach. Without interoperation of the four systems I’ve alluded to, enterprises won’t be able to deliver the context required for competitive advantage. Without sufficient entity and relationship disambiguation via machine learning in machine/human feedback loops, enterprises won’t be able to deliver the relevance for competitive advantage.

We do have the piece parts to begin to build out an intersystem vision now. For example, interoperation is a primary stumbling block that can be overcome now. Middleware has been overly complex and inadequate to the current-day task, but middleware platforms such as EnterpriseWeb are emerging that can reach out as an integration fabric for all systems, up and down the stack. Here’s how the integration fabric becomes an essential enabler for the intersystem approach:

Fig4
PwC, 2015

A lot of what EnterpriseWeb (full disclosure: a JBR partner of PwC) does hinges on the creation and use of agents and semantic metadata that enable the data/logic virtualization. That’s what makes the desiloing possible. One of the things about the EnterpriseWeb platform is that it’s a full stack virtual integration and application platform, using methods that have data layer granularity, but process layer impact. Enterprise architects can tune their models and update operational processes at the same time. The result: every change is model-driven and near real-time. Stacks can all be simplified down to uniform, virtualized composable entities using enabling technologies that work at the data layer. Here’s how they work:

Fig5
PwC, 2015

So basically you can do process refinement across these systems, and intersystem analytics views thus also become possible.

Qx anything else you wish to add? 

Alan Morrison: We always quote science fiction writer William Gibson, who said,

“The future is already here — it’s just not very evenly distributed.”

Enterprises would do best to remind themselves what’s possible now and start working with it. You’ve got to grab onto that technology edge and let it pull you forward. If you don’t understand what’s possible, most relevant to your future business success and how to use it, you’ll never make progress and you’ll always be reacting to crises. Leading enterprises have a firm grasp of the technology edge that’s relevant to them. Better data analysis and disambiguation through semantics is central to how they gain competitive advantage today.

We do a ton of research to get to the big picture and find the real edge, where tech could actually have a major business impact. And we try to think about what the business impact will be, rather than just thinking about the tech. Most folks who are down in the trenches are dismissive of the big picture, but the fact is they aren’t seeing enough of the horizon to make an informed judgement. They are trying to use tools they’re familiar with to address problems the tools weren’t designed for. Alongside them should be some informed contrarians and innovators to provide balance and get to a happy medium.

That’s how you counter groupthink in an enterprise. Executives need to clear a path for innovation and foster a healthy, forward-looking, positive and tolerant mentality. If the workforce is cynical, that’s an indication that they lack a sense of purpose or are facing systemic or organizational problems they can’t overcome on their own.

—————–
Alan Morrison (@AlanMorrison) is a senior research fellow at PwC, a longtime technology trends analyst and an issue editor of the firm’s Technology Forecast

Resources

Data-driven payments. How financial institutions can win in a networked economy, BY, Mark Flamme, Partner; Kevin Grieve, Partner;  Mike Horvath, Principal Strategy&. FEBRUARY 4, 2016, ODBMS.org

The rise of immutable data stores, By Alan Morrison, Senior Manager, PwC Center for technology and innovation (CTI), OCTOBER 9, 2015, ODBMS.org

The enterprise data lake: Better integration and deeper analytics, By Brian Stein and Alan Morrison, PwC, AUGUST 20, 2014 ODBMS.org

Related Posts

On the Industrial Internet of Things. Interview with Leon Guzenda , ODBMS Industry Watch, January 28, 2016

On Big Data and Society. Interview with Viktor Mayer-Schönberger , ODBMS Industry Watch, January 8, 2016

On Big Data Analytics. Interview with Shilpa Lawande , ODBMS Industry Watch, December 10, 2015

On Dark Data. Interview with Gideon Goldin , ODBMS Industry Watch, November 16, 2015

Follow us on Twitter: @odbmsorg

##

Feb 9 16

Orleans, the technology behind Xbox Halo4 and Halo5. Interview with Phil Bernstein

by Roberto V. Zicari

“Orleans is an open-source programming framework for .NET that simplifies the development of distributed applications, that is, ones that run on many servers in a datacenter.”– Phil Bernstein.

I have interviewed, Phil Bernstein,a well known data base researcher and Distinguished Scientist at Microsoft Research, where he has worked for over 20 years. We discussed his latest project “Orleans”.

RVZ

Q1. With the project “Orleans” you and your team invented the “Virtual Actor abstraction”. What is it? 

Phil Bernstein: Orleans is an open-source programming framework for .NET that simplifies the development of distributed applications, that is, ones that run on many servers in a datacenter. In Orleans, objects are actors, by which we mean that they don’t share memory.

In Orleans, actors are virtual in the same sense as virtual memory: an object is activated on demand, i.e. when one of its methods is invoked. If an object is already active when it’s invoked, the Orleans runtime will use its object directory to find the object and invoke it. If the runtime determines that the object isn’t active, the runtime will choose a server on which to activate the object, invoke the object’s constructor on that server to load its state, invoke the method, and update the object directory so it can direct future calls to the object.

Conversely, an object is deactivated when it hasn’t been invoked for some time. In that case, the runtime calls the object’s deactivate method, which does whatever cleanup is needed before freeing up the object’s runtime resources.

Q2. How is it possible to build distributed interactive applications, without the need to learn complex programming patterns? 

Phil Bernstein:  The virtual actor model hides distribution from the developer. You write code as if your program runs on one machine. The Orleans runtime is responsible for distributing objects across servers, which is something that doesn’t affect the program logic. Of course, there are performance and fault tolerance implications of distribution.
But Orleans is able to hide them too.

Q3. Building interactive services that are scalable and reliable is hard. How do you ensure that Orleans applications scale-up and are reliable?

Phil Bernstein:  The biggest impediment to scaling out an app across servers is to ensure no server is a bottleneck. Orleans does this by evenly distributing the objects across servers. This automatically balances the load.

As for reliability, the virtual actor model makes this automatic. If a server fails, then of course all of the objects that were active on that server are gone. No problem. The Orleans runtime detects the server failure and knows which objects were active on the failed server. So the next time any of those objects is invoked, it takes its usual course of action, that is, it chooses a server on which to activate the object, loads the object, and invokes it.

Q4. What about the object’s state? Doesn’t that disappear when its server fails?

Phil Bernstein:  Yes, of course all of the object’s main memory state is lost. It’s up to the object’s methods to save object state persistently, typically just before returning from a method that modifies the object’s state.

Q5. Is this transactional?

Phil Bernstein:  No, not yet. We’re working on adding a transaction mechanism. Coming soon.

Q6. Can you give us an example of an Orleans application?

Phil Bernstein:  Orleans is used for developing large-scale on-line games. For example, all of the cloud services for Halo 4 and Halo 5, the popular Xbox games, run on Orleans. Example object types are players, game consoles, game instances, weapons caches, and leaderboards. Orleans is also used for Internet of Things, communications, and telemetry applications. All of these applications are naturally actor-oriented, so they fit well with the Orleans programming model.

Q7. Why does the traditional three-tier architecture with stateless front-ends, stateless middle tier and a storage layer have limited scalability? 

Phil Bernstein:  The usual bottleneck is the storage layer. To solve this, developers add a middle tier to cache some state and thereby reduce the storage load. However, this middle tier loses the concurrency control semantics of storage, and now you have the hard problem of distributed cache invalidation. To enforce storage semantics, Orleans makes it trivial to express cached items as objects. And to avoid concurrency control problems, it routes requests to a single instance of each object, which is ordinarily single-threaded.

Also, a middle-tier cache does data shipping to the storage servers, which can be inefficient. With Orleans, you have an object-oriented cache and do function shipping instead.

Q8. How does Orleans differ from other Actor platforms such as Erlang and Akka? 

Phil Bernstein: In Erlang and Akka, the developer controls actor lifecycle. You explicitly create an actor and choose the server on which it’s activated. Fixing the actor’s location at creation time prevents automating load balancing, actor migration, and server failure handling. For example, if an actor fails, you need code to catch the exception and resurrect the actor on another server. In Orleans, this is all automatic.

Another difference is the communications model. Orleans uses asynchronous RPC’s. Erlang and Akka use one-way messages.

Q9. Database people sometimes focus exclusively on the data model and query language, and don’t consider the problem of writing a scalable application on top of the database. How is Orleans addressing this issue?

Phil Bernstein:  In a database-centric view, an app is a set of stored procedures with a stateless front-end and possibly a middle-tier cache. To scale out the app with this design, you need to partition the database into finer slices every time you want to add servers. By contrast, if your app runs on servers that are separate from the database, as it does with Orleans, you can add servers to scale out the app without scaling out the storage. This is easier, more flexible, and less expensive. For example, you can run with more app servers during the day when there’s heavier usage and fewer servers at night when the workload dies down. This is usually infeasible at the database server layer, since it would require migrating parts of the database twice a day.

Q10. Why did you transfer the core Orleans technology to 343 Industries ?

Phil Bernstein:  Orleans was developed in Microsoft Research starting in 2009. Like any research project, after several years of use in production, it was time to move it into a product group, which can better afford the resources to support it. Initially, that was 343 Industries, the biggest Orleans user, which ships the Halo game. After Halo 5 shipped, the Orleans group moved to the parent organization, Microsoft Game Studios, which provides technology to Halo and many other Xbox games.

In Microsoft Research, we are still working on Orleans technology and collaborate closely with the product group. For example, we recently published code to support geo-distributed applications on Orleans, and we’re currently working on adding a transaction mechanism.

Q11. The core Orleans technology was also made available as open source in January 2015. Are developers actively contributing to this? 

Phil Bernstein: Yes, there is a lot of activity, with contributions from developers both inside and outside Microsoft. You can see the numbers on GitHub – roughly 25 active contributors and over 25 more occasional contributors – with fully-tested releases published every couple of months. After the core .NET runtime and Roslyn compiler projects, Orleans is the next most popular .NET Foundation project on GitHub.

 ——————

Phil Bernstein is a Distinguished Scientist at Microsoft Research, where he has worked for over 20 years. Before Microsoft, he was a product architect and researcher at Digital Equipment Corp. and a professor at Harvard University. He has published over 150 papers and two books on the theory and implementation of database systems, especially on transaction processing and data integration, which are still the major areas of his work. He is an ACM Fellow, a winner of the ACM SIGMOD Innovations Award, a member of the Washington State Academy of Sciences and a member of the U.S. National Academy of Engineering. He received a B.S. degree from Cornell and M.Sc. and Ph.D. from University of Toronto.

Resources

Microsoft Research homepage for Orleans

Orleans code on GitHub

Orleans documentation

Related Posts

On the Industrial Internet of Things. Interview with Leon Guzenda ODBMS Industry Watch, Published on 2016-01-28

Challenges and Opportunities of The Internet of Things. Interview with Steve Cellini ODBMS Industry Watch, Published on 2015-10-07

Follow ODBMS.org on Twitter: @odbmsorg

##