ODBMS Industry Watch » InterSystems
Trends and Information on Big Data, New Data Management Technologies, Data Science and Innovation.
http://www.odbms.org/blog

New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker
http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/
Wed, 30 Nov 2016

“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.

I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.

RVZ

Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?

Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.

Q2. Can you explain the methodology you used for this new Magic Quadrant?

Nick Heudecker: Several of Gartner’s research methodologies are public. The Magic Quadrant methodology can be found here.

We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.

Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.

Q3. Why were there no Visionaries this year?

Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.

Q4. Were you surprised by the analysis and some of the results you obtained?

Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.

Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?

Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.

It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.
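
The convergence Heudecker describes can be pictured with a small, hedged sketch. The example below is not MongoDB’s BI Connector or Couchbase’s N1QL; it simply uses Python’s built-in sqlite3 module (assuming a SQLite build with the JSON1 functions enabled, the default in recent Python distributions) to show the general pattern of running SQL over schema-less JSON documents. The table and field names are invented for illustration.

```python
import json
import sqlite3

# Store schema-less JSON documents in a single column, then query them with SQL.
# Illustrative only: the table and fields are not any vendor's actual schema.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE docs (body TEXT)")

documents = [
    {"type": "customer", "name": "Ada", "orders": 3},
    {"type": "customer", "name": "Grace", "orders": 7, "vip": True},  # extra field, no schema change
    {"type": "product", "name": "Widget"},
]
conn.executemany("INSERT INTO docs VALUES (?)",
                 [(json.dumps(d),) for d in documents])

# SQL over the documents, including a field only some of them carry.
rows = conn.execute("""
    SELECT json_extract(body, '$.name')   AS name,
           json_extract(body, '$.orders') AS orders
    FROM docs
    WHERE json_extract(body, '$.type') = 'customer'
""").fetchall()
print(rows)   # [('Ada', 3), ('Grace', 7)]
```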

Q6. What are the other “Vendors to Consider”?

Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.

——————————-
Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.

——————————-

Resources
(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016. ID: G00293203. Analyst(s): Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald.

– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).

Related Posts

– MarkLogic Named a Next-Generation Database Challenger in 2016 Gartner Magic Quadrant. By GARY BLOOM, Chief Executive Officer and President, MARKLOGIC

– MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant

– Accelerating Business Value with a Multi-Model, Multi-Workload Data Platform

– NuoDB Recognized by Gartner in Critical Capabilities for Operational Database Management Systems. Elastic SQL database achieves top five score in all four use cases.

– Clustrix Recognized in Gartner Magic Quadrant for Operational Database Management Systems

– Learn why EDB is named a “Challenger” in the 2016 Gartner ODBMS Magic Quadrant

– DataStax Receives Highest Scores in 2 Use Cases in Gartner’s Critical Capabilities for Operational Database Management Systems

– Gartner Scores Oracle Highest In 3 of 4 Use Cases: Gartner Critical Capabilities for Operational Database Management Systems Report

– Gartner Critical Capabilities For Operational Database Management Systems 2016 – Redis Labs Ranked Second Highest In 2/4 Categories (Link – Registration required)

 

Follow us on Twitter: @odbmsorg

##

On Data Interoperability. Interview with Julie Lockner.
http://www.odbms.org/blog/2016/06/on-data-interoperability-interview-with-julie-lockner/
Tue, 07 Jun 2016

“From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care?”– Julie Lockner.

I have interviewed Julie Lockner, who leads data platform product marketing for InterSystems. The main topics of the interview are data interoperability and InterSystems’ data platform strategy.

RVZ

Q1. Everybody is talking about Big Data — is the term obsolete?

Julie Lockner: Well, there is no doubt that the sheer volume of data is exploding, especially with the proliferation of smart devices and the Internet of Things (IoT). An overlooked aspect of IoT is the enormous volume of data generated by a variety of devices, and how to connect, integrate and manage it all.

The real challenge, though, is not just processing all that data, but extracting useful insights from the variety of device types. Put another way, not all data is created using a common standard. You want to know how to interpret data from each device, know which data from what type of device is important, and which trends are noteworthy. Better information can create better results when it can be aggregated and analyzed consistently, and that’s what we really care about. Better, higher quality outcomes, not bigger data.

Q2. If not Big Data, where do we go from here?

Julie Lockner: We always want to be focusing on helping our customers build smarter applications to solve real business challenges, such as helping them to better compete on service, roll out high-quality products quicker, simplify processes – not build solutions in search of a problem. A canonical example is in retail. Our customers want to leverage insight from every transaction they process to create a better buying experience online or at the point of sale. This means being able to aggregate information about a customer, analyze what the customer is doing while on the website, and make an offer at transaction time that would delight them. That’s the goal – a better experience – because that is what online consumers expect.

From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care? That implies we are analyzing not just more data, but better data that comes in all shapes and sizes, and that changes more frequently. It really points to the need for data interoperability.

Q3. What are the challenges software developers are telling you they have in today’s data-intensive world?

Julie Lockner: That they have too many database technologies to choose from and prefer to have a simple data platform architecture that can support multiple data models and multiple workloads within a single development environment.
We understand that our customers need to build applications that can handle a vast increase in data volume, but also a vast array of data types – static, non-static, local, remote, structured and non-structured. It must be a platform that coalesces all these things, brings services to data, offers a range of data models, and deals with data at any volume to create a more stable, long-term foundation. They want all of these capabilities in one platform – not a platform for each data type.

For software developers today, it’s not enough to pick elements that solve some aspect of a problem and build enterprise solutions around them; not all components scale equally. You need a common platform that doesn’t sacrifice scalability, security, resilience, or rapid response. Meeting all these demands with the right data platform will create a successful application.
And the development experience is significantly improved and productivity drastically increased when they can use a single platform that meets all these needs. This is why they work with InterSystems.

Q4. Traditionally, analytics is used with structured data, “slicing and dicing” numbers. But the traditional approach also involves creating and maintaining a data warehouse, which can only provide a historical view of data. Does this work also in the new world of Internet of Things?

Julie Lockner: I don’t think so. It is generally possible to take amorphous data and build it into a structured data model, but to respond effectively to rapidly changing events, you need to be able to take data in the form in which it comes to you.

If your data lacks certain fields, or if you lack a schema definition, you need to be able to capitalize on all these forms without generating a static model or going through a refinement process. With a data warehouse approach, it can take days or weeks to create fully cleansed, normalized data.
That’s just not fast enough in today’s always-on world – especially as machine-generated data is not conforming to a common format any time soon. It comes back to the need for a data platform that supports interoperability.
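
As a rough illustration of the “take data in the form in which it comes to you” point, here is a minimal Python sketch (all records and field names invented) that ingests heterogeneous device readings as-is and defers interpretation to query time, instead of forcing them through a warehouse-style cleansing step first.

```python
from statistics import mean

# Device readings arrive in different shapes; none are rejected or reshaped on ingest.
readings = [
    {"device": "pump-1", "ts": 1, "temperature_c": 71.2},
    {"device": "pump-2", "ts": 1, "temp_f": 160.1},        # different unit and field name
    {"device": "wearable-9", "ts": 2, "heart_rate": 62},   # different domain entirely
]

def temperature_c(record):
    """Interpret temperature at query time, whatever form it arrived in."""
    if "temperature_c" in record:
        return record["temperature_c"]
    if "temp_f" in record:
        return (record["temp_f"] - 32) * 5.0 / 9.0
    return None

temps = [t for t in (temperature_c(r) for r in readings) if t is not None]
print(f"mean temperature: {mean(temps):.1f} C")
```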

Q5. How hard is it to make decisions based on real-time analysis of structured and unstructured data?

Julie Lockner: It doesn’t have to be hard. You need to generate rules that feed rules engines that, in turn, drive decisions, and then constantly update those rules. That is a radical enhancement of the concept of analytics in the service of improving outcomes, as more real-time feedback loops become available.
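
A toy sketch of the “rules feeding rules engines, constantly updated” idea follows; the rule names, thresholds, and events are invented for illustration and do not correspond to any InterSystems API.

```python
# A toy rules engine: rules are (predicate, action) pairs evaluated on each event,
# and the rule set itself can be updated as real-time feedback arrives.
rules = {
    "high_heart_rate": (lambda e: e.get("heart_rate", 0) > 120,
                        lambda e: print(f"alert clinician for {e['patient']}")),
}

def process(event):
    for name, (predicate, action) in rules.items():
        if predicate(event):
            action(event)

process({"patient": "p-17", "heart_rate": 131})      # fires the alert

# Feedback loop: tighten the threshold without stopping the stream.
rules["high_heart_rate"] = (lambda e: e.get("heart_rate", 0) > 110,
                            lambda e: print(f"alert clinician for {e['patient']}"))
process({"patient": "p-17", "heart_rate": 115})      # now fires under the updated rule
```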

The collection of changes we describe as Big Data will profoundly transform enterprise applications of the future. Today we can see the potential to drive business in new ways and take advantage of a convergence of trends, but it is not happening yet. Progress has been made in the intelligence of devices and in first-level data aggregation, but not in the area of services that are needed. We’re not there yet.

Q6. What’s next on the horizon for InterSystems in meeting the data platform requirements of this new world?

Julie Lockner: We continually work on our data platform, developing the most innovative ways we can think of to integrate with new technologies and new modes of thinking. Interoperability is a hugely important component. It may seem a simple task to get to the single most pertinent fact, but the means to get there may be quite complex. You need to be able to make the right data available – easily – to construct the right questions.

Data is in all forms and at varying levels of completeness, cleanliness, and accuracy. For data to be consumed as we describe, you need measures of how well you can use it. You need to curate data so it gets cleansed and you can cull what is important. You need flexibility in how you view data, too. Gathering data without imposing an orthodoxy or structure allows you to gain access to more data. Not all data will conform to a schema a priori.

Q7. Recently you conducted a benchmark test of an application based on InterSystems Caché®. Could you please summarize the main results you have obtained?

Julie Lockner: One of our largest customers is Epic Systems, one of the world’s top healthcare software companies.
Epic relies on Caché as the data platform for electronic medical record solutions serving more than half the U.S. patient population and millions of patients worldwide.

Epic tested the scalability and performance improvements of Caché version 2015.1. Almost doubling the scalability of prior versions, Caché delivers what Epic President Carl Dvorak has described as “a key strategic advantage for our user organizations that are pursuing large-scale medical informatics programs as well as aggressive growth strategies in preparation for the volume-to-value transformation in healthcare.”

Qx Anything else you wish to add?

Julie Lockner: The reason InterSystems has succeeded in the market for so many years is a commitment to the success of those who depend on our technology. A recent Gartner Magic Quadrant report found that 85% of our surveyed customers would buy from us again – the highest share of any vendor participating in that study.

The foundation of the company’s culture is all about helping our customers succeed. When our customers come to us with a challenge, we all pitch in to solve it. Many times our solutions may address an unusual problem that could benefit others – which then becomes the source of many of our innovations. It is one of the ways we are using problem-solving skills as a winning strategy to benefit others. When our customers are successful at using our engine to solve the world’s most important challenges, we all win.

——————-

Julie Lockner leads data platform product marketing for InterSystems. She has more than 20 years of experience in IT product marketing management and technology strategy, including roles at analyst firm ESG as well as Informatica and EMC.

—————–

Resources

“InterSystems Unveils Major New Release of Caché,” Feb. 25, 2015.

“Gartner Magic Quadrant for Operational DBMS,” Donald Feinberg, Merv Adrian, Nick Heudecker, Adam M. Ronthal, and Terilyn Palanca, October 12, 2015, ID: G00271405.

– White Paper: Big Data Healthcare: Data Scalability with InterSystems Caché® and Intel® Processors (LINK to .PDF)

Related Posts

– A Grand Tour of Big Data. Interview with Alan Morrison. ODBMS Industry Watch, February 25, 2016

– RIP Big Data. By Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, January 6, 2016.

– What is data blending. By Oleg Roderick, David Sanchez, Geisinger Data Science. ODBMS.org, November 2015

Follow us on Twitter: @odbmsorg

##

The Gaia mission in 2015. Interview with Uwe Lammers and Vik Nagjee
http://www.odbms.org/blog/2015/03/gaia-mission/
Tue, 24 Mar 2015

“Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.”–Dr. Uwe Lammers.

“The Gaia mission is considered to be the largest data processing challenge in astronomy.”–Vik Nagjee

In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past.

The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.

I recall here the Objectives of the Gaia Project (source ESA Web site):

“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”

I have been following the Gaia mission since 2011 and have reported on it in two previous interviews. This is the third interview in the series, and the first since the launch.
The interview is with Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency, and Vik Nagjee, Product Manager for Data Platforms at InterSystems.

RVZ

Q1. Could you please elaborate in some detail what is the goal and what are the expected results of the Gaia mission?

Uwe Lammers: We are trying to construct the most consistent, most complete and most accurate astronomical catalog ever done. Completeness means to observe all objects in the sky that are brighter than a so-called magnitude limit of 20. These are mostly stars in our Milky Way up to 1.5 billion in number. In addition, we expect to observe as many as 10 million other galaxies, hundreds of thousands of celestial bodies in our solar system (mostly asteroids), tens of thousands of new exo-planets, and more. Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.

Vik Nagjee: The data collected from Gaia will ultimately result in a three-dimensional map of the Milky Way, plotting over a billion celestial objects at a distance of up to 30,000 light years. This will reveal the composition, formation and evolution of the Galaxy, and will enable the testing of Albert Einstein’s Theory of Relativity, the space-time continuum, and gravitational waves, among other things. As such, the Gaia mission is considered to be the largest data processing challenge in astronomy.

Orbiting the Lagrange 2 (L2) point, a fixed spot 1.5 million kilometers from Earth, Gaia will measure the position, movement, and brightness of more than a billion celestial objects, looking at each one an average of 70 times over the course of five years. Gaia’s measurements will be much more complete, powerful, and accurate than anything that has been done before. ESA scientists estimate that Gaia will find hundreds of thousands of new celestial objects, including extra-solar planets, and the failed stars known as brown dwarfs. In addition, because Gaia can so accurately measure the position and movement of the stars, it will provide valuable information about the galaxy’s past – and future – evolution.

Read more about the Gaia mission here.

Q2. What is the size and structure of the information you analysed so far?

Uwe Lammers: From the start of the nominal mission on 25 July until today, we have received about 13 terabytes of compressed binary telemetry from the satellite. The daily pipeline running here at the Science Operations Centre (SOC) has processed all this and generated about 48 TB of higher-level data products for downstream systems.
At the end of the mission, the Main Database (MDB) is expected to hold more than 1 petabyte of data. The structure of the data is complex and this is one of the main challenges of the project. Our data model contains about 1,500 tables with thousands of fields in total, and many inter-dependencies. The final catalog to be released sometime around 2020 will have a simpler structure, and there will be ways to access and work with it in a convenient form, of course.

Q3. Since the launch of Gaia in December 2013, what intermediate results did you obtain by analysing the data received so far?

Uwe Lammers: Last year we found our first supernova (exploding star) with the prototype of the so-called Science Alert pipeline. When this system is fully operational, we expect to find several of these per day. The recent detection of a micro-lensing event was another nice demonstration of Gaia’s capabilities.

Q4. Did you find out any unexpected information and/or confirmation of theories by analysing the data generated by Gaia so far?

Uwe Lammers: It is still too early in the mission to prove or disprove established astronomical theories. For that we need to collect more data and do much more processing. The daily SOC pipeline is only one, the first part, of a large distributed system that involves five other Data Processing Centres (DPCs), each running complex scientific algorithms on the data. The whole system is designed to improve the results iteratively, step by step, until the final accuracy has been reached. However, there will certainly be intermediate results. One simple example of an unexpected early finding is that Gaia gets hit by micro-meteoroids much more often than pre-launch estimates predicted.

Q5. Could you please explain at some high level the Gaia’s data pipeline?

Uwe Lammers: Hmmm, that’s not easy to do in a few words. The daily pipeline at the SOC converts compact binary telemetry of the satellite into higher level products for the downstream systems at the SOC and the other processing centres. This sounds simple, but it is not – mainly because of the complex dependencies and the fact that data does not arrive from the satellite in strict time order. The output of the daily pipeline is only the start as mentioned above.

From the SOC, data gets sent out daily to the other DPCs, which perform more specialized processing. After a number of months we declare the current data segment as closed, receive the outputs from the other DPCs back at the SOC, and integrate all into a coherent next version of the MDB. The creation of it marks the end of the current iteration and the start of a new one. This cyclic processing will go on for as many iterations as needed to converge to a final result.
An important key process is the Astrometric Global Iterative Solution (AGIS), which will give us the astrometric part of the catalog. As the name suggests, it is in itself an iterative process and we run it likewise here at the SOC.
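
For readers unfamiliar with block-iterative schemes, the toy Python sketch below illustrates the kind of alternation Dr. Lammers describes: observations are modelled as a source position plus an attitude error for their time block, and the two sets of unknowns are solved for in turn until the residuals settle. This is only a conceptual illustration with invented numbers; the real AGIS solves a vastly larger least-squares problem, with calibration and global parameters and a quasar-based constraint to pin down the overall reference frame.

```python
import random

# Toy block iteration: observation = (source position) + (attitude error of its
# time block) + noise. Alternate between solving for sources and for attitude
# blocks. Purely conceptual -- not the real AGIS mathematics.
random.seed(1)
true_pos = {s: random.uniform(0.0, 100.0) for s in range(5)}   # per-source positions
true_att = {b: random.uniform(-1.0, 1.0) for b in range(4)}    # per-block attitude errors
obs = [(s, b, true_pos[s] + true_att[b] + random.gauss(0, 0.01))
       for s in true_pos for b in true_att for _ in range(3)]

pos = {s: 0.0 for s in true_pos}
att = {b: 0.0 for b in true_att}
for _ in range(20):
    for s in pos:                                  # source update, attitude held fixed
        vals = [x - att[b] for (src, b, x) in obs if src == s]
        pos[s] = sum(vals) / len(vals)
    for b in att:                                  # attitude update, sources held fixed
        vals = [x - pos[src] for (src, blk, x) in obs if blk == b]
        att[b] = sum(vals) / len(vals)

residuals = [x - pos[s] - att[b] for (s, b, x) in obs]
print(max(abs(r) for r in residuals))              # shrinks to roughly the noise level
```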

Vik Nagjee: To add on to what Dr. Lammers describes, Gaia data processing is handled by a pan-European collaboration, the Gaia Data Processing and Analysis Consortium (DPAC), and consists of about 450 scientists and engineers from across Europe. The DPAC is organized into nine Coordination Units (CUs); each CU is responsible for a specific portion of the Gaia data processing challenge.

One of the CUs – CU3: Core Processing – is responsible for unpacking, decompressing, and processing the science data retrieved from the satellite to provide rapid monitoring and feedback of the spacecraft and payload performances at the ultra-precise accuracy levels targeted by the mission. In other words, CU3 is responsible for ensuring the accuracy of the data collected by Gaia, as it is being collected, to ensure the accuracy of the eventual 3-D catalog of the Milky Way.

Over its lifetime, Gaia will generate somewhere between 500,000 and 1 million GB of data. On an average day, approximately 50 million objects will “transit” Gaia’s field of view, resulting in about 285 GB of data. When Gaia is surveying a densely populated portion of the galaxy, the daily amount could be 7 to 10 times as much, climbing to over 2,000 GB of data in a day.
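
A quick back-of-the-envelope check of those figures, using only the numbers quoted in this answer (the share of dense-field days is an assumption added purely for illustration):

```python
# Rough consistency check of the volumes quoted above.
daily_gb = 285                      # typical day
mission_days = 5 * 365
typical_total_gb = daily_gb * mission_days
print(typical_total_gb)             # ~520,000 GB, i.e. ~0.5 million GB

# If some fraction of days scan dense regions at ~2,000 GB/day (fraction assumed),
# the total moves toward the 1 million GB upper estimate.
dense_fraction = 0.15               # assumption for illustration only
total_gb = mission_days * (dense_fraction * 2000 + (1 - dense_fraction) * daily_gb)
print(round(total_gb))              # ~990,000 GB
```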

There is an eight-hour window of time each day when raw data from Gaia is downloaded to one of three ground stations.
The telemetry is sent to the European Space Astronomy Centre (ESAC) in Spain – the home of CU3: Core Processing – where the data is ingested and staged.
The initial data treatment converts the data into the complex astrometric data models required for further computation. These astrometric objects are then sent to various other Coordination Units, each of which is responsible for looking at different aspects of the data. Eventually the processed data will be combined into a comprehensive catalog that will be made available to astronomers around the world.

In addition to performing the initial data treatment, ESAC also processes the resulting astrometric data with some complex algorithms to take a “first-look” at the data, making sure that Gaia is operating correctly and sending back good information. This processing occurs on the Initial Data Treatment / First Look (IDT/FL) Database; the data platform for the IDT/FL database is InterSystems Caché.

Q6. Observations made and conclusions drawn are only as good as the data that supports them. How do you evaluate the “quality” of the data you receive, and how do you separate the “noise” from the valuable information?

Uwe Lammers: A very good question! If you refer to the final catalog, this is a non-trivial problem and a whole dedicated group of people is working on it. The main issue is, of course, that we do not know the “true” values as in simulations. We work with models, e.g., models of the stars’ positions and the satellite orientation. With those we can predict the observations, and the difference between the predicted and the observed values tells us how well our models represent reality. We can also do consistency checks. For instance, we do two runs of AGIS, one with only the observations from odd months and another one from even months, and both must give similar results. But we will also make use of external astronomical knowledge to validate results, e.g., known distances to particular stars. For distinguishing “noise” from “signal,” we have implemented robust outlier rejection schemes. The quality of the data coming directly from the satellite and from the daily pipeline is assessed with a special system called First Look running also at the SOC.
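
One common family of robust outlier-rejection schemes is iterative clipping against a robust scale estimate. The sketch below uses the median absolute deviation and invented numbers; it illustrates the general idea only and is not necessarily the exact scheme used in the Gaia pipeline.

```python
from statistics import median

def mad_clip(values, n_sigma=3.0, max_iter=10):
    """Iteratively discard points more than n_sigma robust standard deviations
    (estimated from the median absolute deviation) away from the median."""
    kept = list(values)
    for _ in range(max_iter):
        med = median(kept)
        mad = median(abs(v - med) for v in kept)
        robust_sigma = 1.4826 * mad          # MAD -> sigma for Gaussian noise
        survivors = [v for v in kept if abs(v - med) <= n_sigma * robust_sigma]
        if len(survivors) == len(kept):
            break
        kept = survivors
    return kept

measurements = [10.01, 9.98, 10.02, 10.00, 9.99, 14.7, 10.03]   # one corrupted value
print(mad_clip(measurements))        # the 14.7 outlier is rejected
```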

Vik Nagjee: The CU3: Core Processing Unit is responsible for ensuring the accuracy of the data being collected by Gaia, as it is being collected, so as to ensure the accuracy of the eventual 3-D catalog of the Milky Way.

InterSystems Caché is the data platform used by CU3 to quickly determine that Gaia is working properly and that the data being downloaded is trustworthy. Caché was chosen for this task because of its proven ability to rapidly ingest large amounts of data, populate extremely complex astrometric data models, and instantly make the data available for just-in-time analytics using SQL, NoSQL, and object paradigms.

One million GB of data easily qualifies as Big Data. What makes InterSystems Caché unique is not so much its ability to handle very large quantities of data, but its abilities to provide just-in-time analytics on just the right data.
We call this “Big Slice” — which is where analytics is performed just-in-time for a focused result.

A good analogy is how customer service benefits from occasional Big Data analytics. Breakthrough customer service comes from improving service at the point of service, one customer at a time, based on just-in-time processing of a Big Slice – the data relevant to the customer and her interactions. Back to the Gaia mission: at the conclusion of five years of data collection, a true Big Data exercise will plot the 3-D map of the galaxy. Yet frequently ensuring data accuracy is an example of the increasing strategic need for our “Big Slice” concept.

Q7. What kind of databases and analytics tools do you use for the Gaia`s data pipeline?

Uwe Lammers: At the SOC all systems use InterSystems’ Caché database. Despite some initial hiccups, Caché has proved to be a good choice for us. For analytics we use a few popular generic astronomical tools (e.g., topcat), but most are custom-made and specific to Gaia data. All DPCs had originally used relational databases, but some have migrated to Apache’s Hadoop.

Q8. Specifically for the Initial Data Treatment/First Look (IDT/FL) database, what are the main data management challenges you have?

Uwe Lammers: The biggest challenge is clearly the data volume and the steady incoming stream that will not stop for the next five years. The satellite sends us 40-100 GB of compressed raw data every day, which the daily pipeline needs to process and store in near real time; otherwise we quickly accumulate backlogs.
This means all components – the hardware, databases, and software – have to run and work robustly more or less around the clock. The IDTFL database grows daily by a few hundred gigabytes, but not all data has to be kept forever. An automatic cleanup process deletes data that falls outside the chosen retention periods. Keeping all this machinery running around the clock is tough!
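
The cleanup process Dr. Lammers mentions can be pictured with a minimal sketch like the one below, which uses Python’s sqlite3 purely as a stand-in store; the table name and the 30-day retention period are invented for illustration.

```python
import sqlite3
import time

RETENTION_DAYS = 30                                  # illustrative retention period
CUTOFF = time.time() - RETENTION_DAYS * 86400

conn = sqlite3.connect(":memory:")                   # stand-in for the pipeline store
conn.execute("CREATE TABLE intermediate_products (id INTEGER, created REAL, payload BLOB)")
conn.execute("INSERT INTO intermediate_products VALUES (1, ?, x'00')", (time.time() - 40 * 86400,))
conn.execute("INSERT INTO intermediate_products VALUES (2, ?, x'00')", (time.time(),))

# The cleanup pass: drop whatever has fallen out of its retention window.
deleted = conn.execute("DELETE FROM intermediate_products WHERE created < ?", (CUTOFF,)).rowcount
conn.commit()
print(f"deleted {deleted} expired products")         # deleted 1 expired products
```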

Vik Nagjee: Gaia’s data pipeline imposes some rather stringent requirements on the data platform used for the Initial Data Treatment/First Look (IDT/FL) database. The technology must be capable of ingesting a large amount of data and converting it into complex objects very quickly. In addition, the data needs to be immediately accessible for just-in-time analytics using SQL.

ESAC initially attempted to use traditional relational technology for the IDT/FL database, but soon discovered that a traditional RDBMS couldn’t ingest discrete objects quickly enough. To achieve the required insert rate, the data would have to be ingested as large BLOBs of approximately 50,000 objects, which would make further analysis extremely difficult. In particular, the first look process, which requires rapid, just-in-time analytics of the discrete astrometric data, would be untenable. Another drawback to using traditional relational technology, in addition to the typical performance and scalability challenges, was the high cost of the hardware that would be needed.

Since traditional RDBMS technology couldn’t meet the stringent demands imposed by CU3, ESAC decided to use InterSystems Caché.

Q9. How did you solve such challenges and what lessons did you learn until now?

Uwe Lammers: I have a good team of talented and very motivated people and this is certainly one aspect.
In case of problems we are also totally dependent on quick response times from the hardware vendors, the software developers and InterSystems. This has worked well in the past, and InterSystems’ excellent support in all cases where the database was involved is much appreciated. As far as the software is concerned, the clear lesson is that rigorous validation testing is essential – the more the better. There can never be too much. As a general lesson, one of my favorite quotes from Einstein captures it well: “Everything should be made as simple as possible, but no simpler.”

Q10. What is the usefulness of the CU3’s IDT/FL database for the Gaia’s mission so far?

Uwe Lammers: It is indispensable. It is the central working repository of all input/output data for the daily pipeline including the important health monitoring of the satellite.

Vik Nagjee: The usefulness of CU3’s IDT/FL database was proven early in Gaia’s mission. During the commissioning period for the satellite, an initial look at the data it was generating showed that extraneous light was being gathered. If the situation couldn’t be corrected, the extra light could significantly degrade Gaia’s ability to see and measure faint objects.

It was hypothesized that water vapor from the satellite outgassed in the vacuum of space, and refroze on Gaia’s mirrors, refracting light into its focal plane. Although this phenomenon was anticipated (and the mirrors equipped with heaters for that very reason), the amount of ice deposited was more than expected. Heating the mirrors melted the ice and solved the problem.

Scientists continue to rely on the IDT/FL database to provide just-in-time feedback about the efficacy and reliability of the data they receive from Gaia.

Qx Anything else you wish to add?

Uwe Lammers: Gaia is by far the most interesting and challenging project I have ever worked on.
It is fascinating to see science, technology, and a large diverse group of people working together trying to create something truly great and lasting. Please all stay tuned for exciting results from Gaia to come!

Vik Nagjee: As Dr. Lammers said, Gaia is truly one of the most interesting and challenging computing projects of all time. I’m honored to have been a contributor to this project, and cannot wait to see the results from the Gaia catalog. Here’s to unraveling the chemical and dynamical history of our Galaxy!

——————–
Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency.
Uwe Lammers has a PhD in Physics and a degree in Computer Science and has been working for the European Space Agency on a number of space science missions for the past 20 years. After being involved in the X-ray missions EXOSAT, BeppoSAX, and XMM-Newton, he turned his attention to Gaia in 2004.
Starting in late 2005, together with William O’Mullane, he built up the Gaia Science Operations Centre (SOC) at ESAC near Madrid. From early 2006 to mid-2014 he was in charge of the development of AGIS, and he now leads the SOC as Gaia Science Operations Manager.

Vik Nagjee is a Product Manager for Data Platforms at InterSystems.
He’s responsible for Performance and Scalability of InterSystems Caché, and spends the rest of his time helping people (prospects, application partners, end users, etc.) find perfect solutions for their data, processing, and system architecture needs.

Resources

ESA Web site: The GAIA Mission

ESA’s website for the Gaia Scientific Community.

Related Posts

The Gaia mission, one year later. Interview with William O’Mullane. ODBMS Industry Watch, January 16, 2013 

Objects in Space. ODBMS Industry Watch, February 14, 2011

Follow ODBMS.org on Twitter: @odbmsorg

##

Big Data: Three questions to InterSystems.
http://www.odbms.org/blog/2014/01/big-data-three-questions-to-intersystems/
Mon, 13 Jan 2014

“The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS. “–Iran Hutchinson.

I start this new year with a new series of short interviews with leading vendors of Big Data technologies. I call them “Big Data: three questions to”. The first of these interviews is with Iran Hutchinson, Big Data Specialist at InterSystems.

RVZ

Q1. What is your current “Big Data” products offering?

Iran Hutchinson: InterSystems has actually been in the Big Data business for some time, since 1978, long before anyone called it that. We currently offer an integrated database, integration and analytics platform based on InterSystems Caché®, our flagship product, to enable Big Data breakthroughs in a variety of industries.

Launched in 1997, Caché is an advanced object database that provides in-memory speed with persistence, and the ability to ingest huge volumes of transactional data at insanely high velocity. It is massively scalable, because of its very lean design. Its efficient multidimensional data structures require less disk space and provide faster SQL performance than relational databases. Caché also provides sophisticated analytics, enabling real-time queries against transactional data with minimal maintenance and hardware requirements.
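
As a purely conceptual sketch of what “multidimensional data structures” means in practice – sparse, hierarchical storage addressed by subscript paths – the Python fragment below uses nested dictionaries as an analogy. It is not Caché’s actual implementation or API, and the record layout is invented.

```python
from collections import defaultdict

# Conceptual analogy only: sparse storage addressed by subscript paths,
# where nothing is stored for subscripts that are absent.
def tree():
    return defaultdict(tree)

db = tree()
db["Patient"][1017]["Name"] = "A. Example"
db["Patient"][1017]["Visit"]["2016-11-30"]["BP"] = "120/80"
db["Patient"][4242]["Name"] = "B. Example"

# Traversal only touches the subscripts that actually exist.
for pid, record in sorted(db["Patient"].items()):
    print(pid, record["Name"])
```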

InterSystems Ensemble® is our seamless platform for integrating and developing connected applications. Ensemble can be used as a central processing hub or even as a backbone for nationwide networks. By integrating this connectivity with our high-performance Caché database, as well as with new technologies for analytics, high availability, security, and mobile solutions, we can deliver a rock-solid and unified Big Data platform, not a patchwork of disparate solutions.

We also offer additional technologies built on our integrated platform, such as InterSystems HealthShare®, a health informatics platform that enables strategic interoperability and analytics for action. Our TrakCare unified health information system is likewise built upon this same integrated framework.

Q2. Who are your current customers and how do they typically use your products?

Iran Hutchinson: We continually update our technology to enable customers to better manage, ingest and analyze Big Data. Our clients are in healthcare, financial services, aerospace, utilities – industries that have extremely demanding requirements for performance and speed. For example, Caché is the world’s most widely used database in healthcare. Entire countries, such as Sweden and Scotland, run their national health systems on Caché, as well as top hospitals and health systems around the world. One client alone runs 15 percent of the world’s equity trades through InterSystems software, and all of the top 10 banks use our products.

Caché is also being used by the European Space Agency to map a billion stars – the largest data processing task in astronomy to date. (See The Gaia Mission One Year Later.)

Our configurable ACID (Atomicity Consistency Isolation Durability) capabilities and ECP-based approach enable us to handle these kinds of very large-scale, very high-performance, transactional Big Data applications.

Q3. What are the main new technical features you are currently working on and why?

Iran Hutchinson: There are several new paradigms we are focusing on, but let’s focus on analytics. Once you absorb all that Big Data, you want to run analytics. And that’s where the three V’s of Big Data – volume, velocity and variety – are critically important.

Let’s talk about the variety of data. Most popular Big Data analytics solutions start with the assumption of structured data – rows and columns – when the most interesting data is unstructured, or text-based data. A lot of our competitors still struggle with unstructured data, but we solved this problem with Caché in 1997, and we keep getting better at it. InterSystems Caché offers both vertical and horizontal scaling, enabling schema-less and schema-based (SQL) querying options for both structured and unstructured data.
As a result, our clients today are running analytics on all their data – and we mean real-time, operational data, not the data that is aggregated a week later or a month later for boardroom presentations.

A lot of development has been done in the area of schema-less data stores or so-called document stores, which are mainly key-value stores. The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS. Some companies now offer SQL querying on schema-less data stores as an add-on or plugin. InterSystems Caché provides a high-performance key-value store with native SQL support.

The commonly available SQL-based solutions also require a predefinition of what the user is interested in. But if you don’t know the data, how do you know what’s interesting? Embedded within Caché is a unique and powerful text analysis technology, called iKnow, that analyzes unstructured data out of the box, without requiring any predefinition through ontologies or dictionaries. Whether it’s English, German, or French, iKnow can automatically identify concepts and understand their significance – and do that in real-time, at transaction speeds.

iKnow enables not only lightning-fast analysis of unstructured data, but also equally efficient Google-like keyword searching via SQL with a technology called iFind.
And because we married that iKnow technology with another real-time OLAP-type technology we call DeepSee, we make it possible to embed this analytic capability into your applications. You can extract complex concepts and build cubes on both structured AND unstructured data. We blend keyword search and concept discovery, so you can express a SQL query and pull out both concepts and keywords on unstructured data.

Much of our current development activity is focused on enhancing our iKnow technology for a more distributed environment.
This will allow people to upload a data set, structured and/or unstructured, and organize it in a flexible and dynamic way by just stepping through a brief series of graphical representation of the most relevant content in the data set. By selecting, in the graphs, the elements you want to use, you can immediately jump into the micro-context of these elements and their related structured and unstructured information objects. Alternately, you can further segment your data into subsets that fit the use you had in mind. In this second case, the set can be optimized by a number of classic NLP strategies such as similarity extension, typicality pattern parallelism, etc. The data can also be wrapped into existing cubes or into new ones, or fed into advanced predictive models.

So our goal is to offer our customers a stable solution that really uses both structured and unstructured data in a distributed and scalable way. We will demonstrate the results of our efforts in a live system at our next annual customer conference, Global Summit 2014.

We also have a software partner that has built a very exciting social media application, using our analytics technology. It’s called Social Knowledge, and it lets you monitor what people are saying on Twitter and Facebook – in real-time. Mind you, this is not keyword search, but concept analysis – a very big difference. So you can see if there’s a groundswell of consumer feedback on your new product, or your latest advertising campaign. Social Knowledge can give you that live feedback – so you can act on it right away.

In summary, today InterSystems provides SQL and DeepSee over our shared data architecture to do structured data analysis.
And for unstructured data, we offer iKnow semantic analysis technology and iFind, our iKnow-powered search mechanism, to enable information discovery in text. These features will be enabled for text analytics in future versions of our shared-nothing data architectures.

Related Posts

The Gaia mission, one year later. Interview with William O’Mullane.
ODBMS Industry Watch, January 16, 2013

Operational Database Management Systems. Interview with Nick Heudecker. ODBMS Industry Watch, December 16, 2013.

Challenges and Opportunities for Big Data. Interview with Mike Hoskins. ODBMS Industry Watch, December 3, 2013.

On Analyzing Unstructured Data. — Interview with Michael Brands.
ODBMS Industry Watch, July 11, 2012.

Resources

ODBMS.org: Big Data Analytics, NewSQL, NoSQL, Object Database Vendors –Free Resources.

ODBMS.org: Big Data and Analytical Data Platforms, NewSQL, NoSQL, Object Databases– Free Downloads and Links.

ODBMS.org: Expert Articles.

Follow ODBMS.org on Twitter: @odbmsorg

##

The Gaia mission, one year later. Interview with William O’Mullane.
http://www.odbms.org/blog/2013/01/the-gaia-mission-one-year-later-interview-with-william-omullane/
Wed, 16 Jan 2013

“We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.” — William O’Mullane.

The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”. I recall here the Objectives and the Mission of the Gaia Project (source ESA Web site):
OBJECTIVES:
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
THE MISSION:
“Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way, in the process revealing the composition, formation and evolution of the Galaxy. Gaia will provide unprecedented positional and radial velocity measurements with the accuracies needed to produce a stereoscopic and kinematic census of about one billion stars in our Galaxy and throughout the Local Group. This amounts to about 1 per cent of the Galactic stellar population. Combined with astrophysical information for each star, provided by on-board multi-colour photometry, these data will have the precision necessary to quantify the early formation, and subsequent dynamical, chemical and star formation evolution of the Milky Way Galaxy.
Additional scientific products include detection and orbital classification of tens of thousands of extra-solar planetary systems, a comprehensive survey of objects ranging from huge numbers of minor bodies in our Solar System, through galaxies in the nearby Universe, to some 500 000 distant quasars. It will also provide a number of stringent new tests of general relativity and cosmology.”

In February of last year, I interviewed William O’Mullane, Science Operations Development Manager at the European Space Agency, and Vik Nagjee, Product Manager, Core Technologies, at InterSystems Corporation, both deeply involved with the initial Proof-of-Concept of the data management part of the project.

A year later, I asked William O’Mullane (European Space Agency) and Jose Ruperez (InterSystems Spain) some follow-up questions.

RVZ

Q1. The original goal of the Gaia mission was to “observe around 1,000,000,000 celestial objects”. Is this still true? Are you ready for that?

William O’Mullane: YES ! We will have a Ground Segment Readiness Review next Spring and a Flight Acceptance Review before summer. We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.

Q2. The plan was to launch the Gaia satellite in early 2013. Is this plan confirmed?

William O’Mullane: Currently September 2013 (rather than Q1) is the official launch date.

Q3. Did the data requirements for the project change in the last year? If yes, how?

William O’Mullane: The downlink rate has not changed, so we know how much comes into the system – still only about 100 TB over 5 years. Data processing volumes depend on how many intermediate steps we keep in different locations. Not much change there since last year.

Q4. The sheer volume of data that is expected to be captured by the Gaia satellite poses a technical challenge. What work has been done in the last year to prepare for such a challenge? What did you learn from the Proof-of-Concept of the data management part of this project?

William O’Mullane: I suppose we learned the same lessons as other projects. We have multiple processing centres with different needs met by different systems. We did not try for a single unified approach across these centers.
The CNES have gone completely to Hadoop for their processing. At ESAC we are going to InterSystems Caché. Last year only AGIS was on Caché – now the main daily processing chain is in Caché also [Edit: see also Q.9 ]. There was a significant boost in performance here but it must be said some of this was probably internal to the system, in moving it we looked at some bottlenecks more closely.

Jose Ruperez: We are very pleased to know that last year was only AGIS and now they have several other databases in Caché.

William O’Mullane: The second operations rehearsal is just drawing to a close. This was run completely on Caché (the first rehearsal used Oracle). There were of course some minor problems (also with our software), but in general, from a Caché perspective, it went well.

Q5. Could you please give us some numbers related to performance? Could you also tells us what bottlenecks did you look at, and how did you avoid them?

William O’Mullane: It would take me time to dig out the numbers, but we got a factor of 10 in some places with a combination of better queries and the removal of some code bottlenecks. We seem to regularly see a factor of 10 on “non-optimized” systems.

Q6. Is it technically possible to interchange data between Hadoop and Caché ? Does it make sense for the project?

Jose Ruperez: The raw data downloaded from the satellite link every day can, in general, be loaded into any database. ESAC has chosen InterSystems Caché for performance reasons, but not only that. William also explains how cost-effectiveness, as well as the support from InterSystems, were key. Other centers can try and use other products.

William O’Mullane:
This is a valid point – support is a major reason for our using Caché. InterSystems work with us very well and respond to needs quickly. InterSystems certainly have a very developer-oriented culture, which matches our team well.
Hadoop is one thing and HDFS is another, but of course they go together. In many ways our DataTrain/Whiteboard does “map reduce”, with some improvements for our specific problem. There are Hadoop database interfaces, so it could work with Caché.
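
For readers unfamiliar with the pattern O’Mullane alludes to, a generic map/reduce pass over observation records might look like the sketch below. The record layout is invented and this illustrates the pattern only; it is not the Gaia DataTrain API.

```python
from collections import defaultdict

# Generic map/reduce over observation records, grouped by source id.
observations = [
    {"source_id": 101, "mag": 15.2},
    {"source_id": 101, "mag": 15.3},
    {"source_id": 202, "mag": 18.9},
]

def map_phase(records):
    for r in records:
        yield r["source_id"], r["mag"]

def reduce_phase(pairs):
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return {key: sum(vals) / len(vals) for key, vals in groups.items()}

print(reduce_phase(map_phase(observations)))   # mean magnitude per source
```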

Q7. Could you tell us a bit more about what you have learned so far with this project? In particular, what are the implications for Caché, now that the main daily processing chain is also stored in Caché?

Jose Ruperez: A relevant number regarding InterSystems Caché performance is the ability to insert over 100,000 records per second, sustained over several days. This also means that InterSystems Caché, in turn, has to write hundreds of megabytes per second to disk. To me, this is still mind-boggling.
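
A back-of-the-envelope check of those figures, with the average record size stated as an explicit assumption since the interview does not give it:

```python
# The record size is an assumption for illustration; the interview does not state it.
inserts_per_second = 100_000
assumed_record_bytes = 3_000                     # assumption: a few KB per record
mb_per_second = inserts_per_second * assumed_record_bytes / 1e6
print(f"~{mb_per_second:.0f} MB/s sustained")    # ~300 MB/s, i.e. hundreds of MB/s

seconds_per_day = 86_400
print(f"~{inserts_per_second * seconds_per_day / 1e9:.1f} billion records/day")  # ~8.6
```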

William O’Mullane:
And done with fairly standard NetApp storage. Caché and NetApp engineers sat together here at ESAC to align the configuration of both systems to get the maximum I/O for Java through Caché to NetApp. There were several low-level page size settings, etc., which were modified for this.

Q8. What else still need to be done?

William O’Mullane: Well, we have most of the parts, but it is not a well-oiled machine yet. We need more robustness and a little more automation across the board.

Q9. Your high-level architecture a year ago consisted of two databases, a so-called Main Database and an AGIS Database.
The Main Database was supposed to hold all data from Gaia and the products of processing. (This was expected to grow from a few TBs to a few hundred TBs during the mission.) AGIS only required a subset of this data for analytics purposes. Could you tell us how the architecture has evolved in the last year?

William O’Mullane: This remains precisely the same.

Q10. For the AGIS database, were you able to generate realistic data and load on the system?

William O’Mullane: We have run large-scale AGIS tests with 50,000,000 sources, or about 4,500,000,000 simulated observations. This worked rather nicely and well within requirements. We confirmed going from 2 to 10 to 50 million sources that the problem scales as expected. The final (end of mission 2018) requirement is for 100,000,000 sources, so for now we are quite confident with the load characteristics. The simulation had a realistic source distribution in magnitude and coordinates (i.e., real sky-like inhomogeneities are seen).

Q11. What results did you obtain in tuning and configuring the AGIS system in order to meet the strict insert requirements, while still optimizing sufficiently for down-stream querying of the data?

William O’Mullane: We still have bottlenecks in the update servers, but the 50 million test still ran inside one month on a small in-house cluster. So the 100 million in 3 months (system requirement) will be easily met, especially with new hardware.

Q12. What are the next steps planned for the Gaia mission and what are the main technical challenges ahead?

William O’Mullane: AGIS is the critical piece of Gaia software for astrometry, but before that the daily data treatment must be run. This so-called Initial Data Treatment (IDT) is our main focus right now. It must be robust and smoothly operating for the mission, and able to cope with non-nominal situations occurring while commissioning the instrument. So some months of consolidation, bug fixing and operational rehearsals lie ahead for us. The future challenge, I expect, will not be technical but rather will come when we see the real data and it is not exactly as we expect/hope it will be.
I may be pleasantly surprised of course. Ask me next year …

———–
William O’Mullane, Science Operations Development Manager, European Space Agency.
William O’Mullane has a PhD in Physics and a background in Computer Science and has worked on space science projects since 1996, when he assisted with the production of the Hipparcos CD-ROMs. During this period he was also involved with the Planck and Integral science ground segments, as well as contemplating the Gaia data processing problem. From 2000-2005 Wil worked on developing the US National Virtual Observatory (NVO) and on the Sloan Digital Sky Survey (SDSS) in Baltimore, USA. In August 2005 he rejoined the European Space Agency as Gaia Science Operations Development Manager to lead the ESAC development effort for the Gaia Data Processing and Analysis Consortium.

José Rupérez, Senior Engineer at InterSystems.
He has been providing technical advice to customers and partners in Spain and Portugal for the last 10 years. In particular, he has been working with the European Space Agency since December 2008. Before InterSystems, José developed his career at eSkye Solutions and A.T. Kearney in the United States, always in software. He started his career working for Alcatel in 1994 as a Software Engineer. José holds a Bachelor of Science in Physics from Universidad Complutense (Madrid, Spain) and a Master of Science in Computer Science from Ball State University (Indiana, USA). He has also attended courses at the MIT Sloan School of Business.
—————-

Related Posts
Objects in Space -The biggest data processing challenge to date in astronomy: The Gaia mission. February 14, 2011

Objects in Space: “Herschel” the largest telescope ever flown. March 18, 2011

Objects in Space vs. Friends in Facebook. April 13, 2011

Resources

Gaia Overview (ESA)

Gaia Web page at ESA Spacecraft Operations.

ESA’s web site for the Gaia scientific community.

Gaia library (ESA) collates Gaia conference proceedings, selected reports, papers, and articles on the Gaia mission as well as public DPAC documents.

“Implementing the Gaia Astrometric Global Iterative Solution (AGIS) in Java”. William O’Mullane, Uwe Lammers, Lennart Lindegren, Jose Hernandez and David Hobbs. Aug. 2011

“Implementing the Gaia Astrometric Solution”, William O’Mullane, PhD Thesis, 2012
##

You can follow ODBMS.org on Twitter : @odbmsorg.
——————————-

On Analyzing Unstructured Data. — Interview with Michael Brands. http://www.odbms.org/blog/2012/07/on-analyzing-unstructured-data-interview-with-michael-brands/ http://www.odbms.org/blog/2012/07/on-analyzing-unstructured-data-interview-with-michael-brands/#comments Wed, 11 Jul 2012 06:33:41 +0000 http://www.odbms.org/blog/?p=1590 “The real difference will be made by those companies that will be able to fully exploit and integrate their structured and unstructured data into so called active analytics. With Active Analytics enterprises will be able to use both quantitative and qualitative data and drive action based on a plain understanding of 100% of their data”– Michael Brands.

It is reported that 80% of all data in an enterprise is unstructured information. How do we manage unstructured data? I have interviewed Michael Brands, an expert on analyzing unstructured data and currently a senior product manager for the i.Know technology at InterSystems.

RVZ

Q1. It is commonly said that more than 80% of all data in an enterprise is unstructured information. Examples are telephone conversations, voicemails, emails, electronic documents, paper documents, images, web pages, video and hundreds of other formats. Why is unstructured data important for an enterprise?

Michael Brands: Well, unstructured data is important for organizations in general in at least three ways.
First of all, 90% of what people do in a business day is unstructured, and the results of most of these activities can only be captured as unstructured data.
Second, it is generally acknowledged in the modern economy that knowledge is the biggest asset of companies, and most of this knowledge, since it's developed by people, is recorded in unstructured formats.

The last and maybe most unexpected argument to underpin the importance of unstructured data is the fact that large research organizations such as Gartner and IDC state that "80% of business is conducted on unstructured data."
If we take these three elements together, it is even surprising to see that most organizations invest heavily in business intelligence applications to improve their business, while these applications only cover a very small portion of the data (20% in the most optimistic estimation) that is actually important for their business.
If we look at this from a different perspective, we think enterprises that really want to be leading and make a difference will invest heavily in technologies that help them understand and exploit their unstructured data, because if we only look at the numbers (and that's the small portion of data most enterprises already understand very well), the area of unstructured data will be the one where the difference will be made over the next couple of years.
However, the real difference will be made by those companies that are able to fully exploit and integrate their structured and unstructured data into so-called active analytics. With Active Analytics, enterprises will be able to use both quantitative and qualitative data and drive action based on a plain understanding of 100% of their data.
At InterSystems we have a unique technology offering that was especially designed to help our customers and partners do exactly that, and we're proud that our partners who actually deploy the technology to fully exploit 100% of their data make a real difference in their market and grow much faster than their competitors.

Q2. What is the main difference between semi-structured and unstructured information?

Michael Brands: The very short and bold answer to this question would be to say that semi-structured is just a euphemism for unstructured.
However, a more in-depth answer is that semi-structured data is a combination of structured and unstructured data in the same data channel.
Typically, semi-structured data comes out of forms that foresee specific free-text areas to describe specific parts of the required information. This way a "structured" (meta)data field describes, with a fair degree of abstraction, the contents of the associated text field.
A typical example will help to clarify this: in an electronic medical record system, the notes section in which a doctor can record his observations about a specific patient in free text is typically semi-structured. This means the doctor doesn't have to write all observations in one text; he can typically "categorize" his observations under different headers such as "Patient History", "Family History", "Clinical Findings", "Diagnosis" and more.
Subdividing such text entry environments into a series of different fields with a fixed header is a very common example of semi-structured data.
Another very popular example of semi-structured data is e-mail, mp3 or video data. These data types contain mainly unstructured data, but this unstructured data is always attached to some more structured data such as Author, Subject or Title, Summary, etc.

Q3. The most common example of unstructured data is text. Several applications store portions of their data as unstructured text that is typically implemented as plain text, in rich text format (RTF), as XML, or as a BLOB (Binary Large Object). It is very hard to extract meaning from this content. How can iKnow help here?

Michael Brands: iKnow can help here in a very specific and unique way because it is able to structure these texts into chains of concepts and relations.
What this means is that iKnow will be able to tell you without prior knowledge what the most important concepts in these texts are and how they are related to each other.
This is why, when we talk about iKnow, we say the technology is proactive.
Any other technology that analyzes text will need a domain-specific model (statistical, ontological or syntactical) containing a lot of domain-specific knowledge in order to make some sense out of the texts it is supposed to analyze. iKnow, thanks to its unique way of splitting sentences into concepts and relations, doesn't need this.
It will fully automatically perform the analysis and highlighting tasks students usually perform as a first step in understanding and memorizing a course text book.

Q4. How exactly do you extract conceptual meaning from unstructured data? Which text-analytical methods do you use for that?

Michael Brands: The process we use to extract meaning out of texts is unique because of the following: we do not split sentences into individual words and then try to recombine these words by means of a syntactic parser, an ontology (which essentially is a dictionary combined with a hierarchical model that describes a specific domain), or a statistical model. What iKnow does instead is split sentences by identifying relational words or word groups in a sentence.
This approach is based on a couple of long known facts about language and communication.

First of all, analytical semantics discovered years ago that every sentence is nothing more than a chain of conceptual word groups (often called Noun Phrases or Prepositional Phrases in formal linguistics) tied together by relations (often called Verb Phrases in formal linguistics). So a sentence will semantically always be built as a chain of a concept followed by a relation, followed by another concept, again followed by another relation and another concept, etc.
This basic conception of a binary sentence structure consisting of Noun-headed phrases (concepts) and Verb-headed phrases (relations) is at the heart of almost all major approaches to automated syntactic sentence analysis. However, this knowledge is only used by state-of-the-art analysis algorithms to construct second-order syntactic dependency structure representations of a sentence, rather than to effectively analyze the meaning of a sentence.

A second important discovery underpinning the iKnow approach is the fact, discovered by behavioral psychology and neuropsychiatry, that humans only understand and operate with a very small set of different relations to express links between facts, events, or thoughts. Not only is the set of different relations people use and understand very limited, it is also a universal set. In other words, people only use a limited number of different relations, and these relations are the same for everybody regardless of language, education, cultural background or anything else.
This discovery can teach us a lot about how basic mechanisms of learning, like derivation and inference, work. More important for our purposes, though, is that we can derive from this that, in sharp contrast with the set of concepts, which is infinite and has different subsets for each specific domain, the set of relations is limited and universal.
The combination of these two elements, namely the basic binary concept-relation structure of language and the universality and limitedness of the set of relations, led to the development of the iKnow approach after a thorough analysis of a lot of state-of-the-art techniques.
Our conclusion from this analysis is that the main problem of all classical approaches to text analysis is that they all focus essentially on the endless and domain-specific set of concepts, because they were mostly created to serve the specific needs of a specific domain.
Thanks to this domain-specific focus, the number of elements a system needs to know upfront can be controlled. Nevertheless, a "serious" application quickly integrates several million different concepts. This need for large collections of predefined concepts to describe the application domain, commonly called dictionaries, taxonomies or ontologies, leads to a couple of serious problems.
First of all, the time needed to set up and tune such applications is substantial and expensive, because domain experts are needed to come up with the most appropriate concepts. Second, the footprint of these systems is rather big and their maintenance costly and time-consuming, because specialists need to follow what's going on in the domain and adapt the knowledge of the application.
Third, it's very difficult to open up a domain-specific application for other domains, because in these other domains concepts might have different meanings or even contradict each other, which can create serious problems at the level of the parsing logic.
Therefore iKnow was built to perform a completely different kind of analysis: by focusing on the relations we can build systems with a very small footprint (an average language model only contains several tens of thousands of relations and a very small number of context-based disambiguation rules).
Moreover, our system is not domain-specific; it can work with data from very different domains at the same time and doesn't need expert input. Splitting up sentences by means of relations, and solving the ambiguous cases by means of rules that use the function (concept or relation) of the surrounding words or word groups, is a computationally very efficient and fast process. (An ambiguous case is one in which a word or word group can express both a concept and a relation; e.g. "walk" is a concept in the sentence "Brussels Paris would be quite a walk." and a relation in the sentence "Pete and Mary walk to school.") This also ensures a system that learns as it analyzes more data, because it in effect "learns" the concepts from the texts: it identifies them as the groups of words between the relations, before the first relation, and between the last relation and the end of the sentence.
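To make the relation-based splitting more concrete, here is a minimal Java sketch. It is not iKnow code: the tiny relation lexicon, the single disambiguation rule and all class and method names are invented for illustration, but it follows the mechanism described above, using the "walk" sentences from the answer as the ambiguous case.

```java
import java.util.*;

// Illustrative only: a toy relation-based splitter in the spirit of the
// approach described above. The tiny "relation lexicon" and the single
// disambiguation rule are invented for this example.
public class RelationSplitter {

    // A handful of words that (usually) act as relations.
    private static final Set<String> RELATIONS =
            new HashSet<>(Arrays.asList("is", "are", "walk", "walks", "to", "and"));

    public static List<String> split(String sentence) {
        String[] words = sentence.replaceAll("[.,]", "").toLowerCase().split("\\s+");
        List<String> chain = new ArrayList<>();
        StringBuilder concept = new StringBuilder();

        for (int i = 0; i < words.length; i++) {
            boolean isRelation = RELATIONS.contains(words[i]);
            // Toy disambiguation rule: "walk" preceded by "a" is a concept
            // ("quite a walk"); otherwise it is treated as a relation.
            if (words[i].equals("walk") && i > 0 && words[i - 1].equals("a")) {
                isRelation = false;
            }
            if (isRelation) {
                if (concept.length() > 0) {            // close the current concept
                    chain.add("C:" + concept.toString().trim());
                    concept.setLength(0);
                }
                chain.add("R:" + words[i]);            // record the relation
            } else {
                concept.append(words[i]).append(' ');  // keep building the concept
            }
        }
        if (concept.length() > 0) {
            chain.add("C:" + concept.toString().trim());
        }
        return chain;
    }

    public static void main(String[] args) {
        // Prints: [C:pete, R:and, C:mary, R:walk, R:to, C:school]
        System.out.println(split("Pete and Mary walk to school."));
    }
}
```

The concepts are simply whatever falls between the relations, which is the point made in the answer: no dictionary of concepts is needed, only the small, closed set of relations plus a few disambiguation rules.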

Q5. How “precise” is the meaning you extract from unstructured data? Do you have a way to validate it?

Michael Brands: This is a very interesting question because it raises two very difficult topics in the area of semantic data analysis, namely: how do you define precision, and how do you evaluate results generated by semantic technologies?
If we use the classical definition of precision in this area, it describes what percentage of the documents given back by a system, in response to a query asking for documents containing information about certain concepts, actually contains useful information about these concepts.
Based on this definition of precision we can say iKnow scores very close to 100%, because it outperforms competing technologies in its efficiency at detecting which words in a sentence belong together and form meaningful groups, and how they relate to each other.
Even if we'd use other, more challenging definitions of precision, like the syntactic or formal correctness of the word groups identified by iKnow, we score very high percentages, but it's evident we're dependent on the quality of the input. If the input doesn't use punctuation marks accurately or contains a lot of non-letter characters, that will affect our precision. Moreover, how precision is perceived and defined varies a lot from one use case to another.
Evaluation is a very complex and subjective operation in this area, because what's considered to be good or bad heavily depends on what people want to do with the technology and what their background is. So far we let our customers and partners decide after an evaluation period whether the technology does what they expect from it, and we haven't had any "no-gos" yet.

Q6. How do you process very large scale archives of data?

Michael Brands: The architecture of the system has been set up to be as flexible as possible and to make sure processes can be executed in parallel where possible and desirable. Moreover, the system provides different modes to load data: a batch load, which has been especially designed to pump large amounts of existing data, such as document archives, into a system as fast as possible; a single-source load, especially designed to add individual documents to a system at transactional speed; and a small-batch mode to add limited sets of documents to a system in one process.
On top of that, the loading architecture foresees different steps in the loading process: data to be loaded needs to be listed or staged, the data can be converted (meaning the data that has to be indexed can be adapted to get better indexing and analysis results), and, of course, the data will be loaded into the system.
These different steps can partially be done in parallel and in multiple processes to ensure the best possible performance and flexibility.
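As an illustration of how such a staged, partially parallel loading pipeline can be wired together, here is a minimal Java sketch. None of the class or method names come from the iKnow API; the indexing call itself is stubbed out, and the stage/convert/load steps simply mirror the description above.

```java
import java.util.List;
import java.util.concurrent.*;

// Illustrative sketch of a stage -> convert -> load pipeline run with a
// pool of workers. None of this is iKnow API; the names are invented.
public class BatchLoader {

    interface Document { String id(); String rawText(); }

    private final ExecutorService workers = Executors.newFixedThreadPool(8);

    // Step 1: list/stage the sources that belong to this batch.
    List<Document> stage(List<Document> archive) {
        return archive;                       // in reality: select, dedupe, register
    }

    // Step 2: optional conversion to improve indexing and analysis results.
    String convert(Document doc) {
        return doc.rawText().replaceAll("\\s+", " ").trim();
    }

    // Step 3: load one converted document into the engine (stubbed out here).
    void load(String docId, String preparedText) {
        // indexer.index(docId, preparedText);   // placeholder for the real call
    }

    // Batch mode: convert and load documents in parallel.
    void runBatch(List<Document> archive) throws InterruptedException {
        List<Callable<Void>> tasks = new java.util.ArrayList<>();
        for (Document doc : stage(archive)) {
            tasks.add(() -> { load(doc.id(), convert(doc)); return null; });
        }
        workers.invokeAll(tasks);             // block until the whole batch is done
        workers.shutdown();
    }
}
```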

Q7. On the one hand we have mining of text data, and on the other hand we have database transactions on structured data: how do you relate them to each other?

Michael Brands: Well there are two different perspectives in this question:
On the one hand, it's important to underline that all textual data indexed with iKnow can be used as if it were structured data, because the API foresees appropriate methods that allow you to query the textual data the same way you would query traditional row-column data. These methods come in three different flavors: they can be called as native Caché ObjectScript methods, they can be called from within a SQL environment as stored procedures, and they are also available as web services.

On the other hand, there's the fact that all structured data that has a link with the indexed texts can be used as metadata within iKnow. Based on this structured metadata, filters can be created and used within the iKnow API to make sure the API returns exactly the results you need.
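The following sketch illustrates the general idea of filtering indexed-text results by structured metadata. It is plain Java written for this article, not the iKnow API (which, as described above, exposes this functionality through its own methods, SQL stored procedures and web services); the record type and field names are invented.

```java
import java.util.*;
import java.util.stream.Collectors;

// Illustrative only: combining structured metadata with indexed-text results.
public class MetadataFilterExample {

    record IndexedDocument(String id, String department, int year,
                           List<String> topConcepts) {}

    // Return the top concepts of documents matching a structured filter.
    static Map<String, List<String>> conceptsFor(List<IndexedDocument> docs,
                                                 String department, int fromYear) {
        return docs.stream()
                .filter(d -> d.department().equals(department) && d.year() >= fromYear)
                .collect(Collectors.toMap(IndexedDocument::id,
                                          IndexedDocument::topConcepts));
    }

    public static void main(String[] args) {
        List<IndexedDocument> docs = List.of(
                new IndexedDocument("n1", "cardiology", 2011,
                        List.of("chest pain", "family history")),
                new IndexedDocument("n2", "oncology", 2010,
                        List.of("tumor size", "follow-up")));
        // Only the cardiology note from 2011 onwards is returned.
        System.out.println(conceptsFor(docs, "cardiology", 2011));
    }
}
```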

_______________________
Michael Brands previously founded i.Know NV, a company specialized in analyzing unstructured data. InterSystems acquired i.Know in 2010, and since then he has been serving as a senior product manager for the i.Know technology at InterSystems.
i.Know's technology is embedded in the InterSystems technology platform.

Related Posts

Managing Big Data. An interview with David Gorbet (July 2, 2012)

Big Data: Smart Meters — Interview with Markus Gerdes (June 18, 2012)

Big Data for Good (June 4, 2012)

On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum (February 1, 2012)

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica (November 16, 2011)

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com (November 2, 2011)

Analytics at eBay. An interview with Tom Fastner (October 6, 2011)


##

Big Data: Smart Meters — Interview with Markus Gerdes. http://www.odbms.org/blog/2012/06/big-data-smart-meters-interview-with-markus-gerdes/ http://www.odbms.org/blog/2012/06/big-data-smart-meters-interview-with-markus-gerdes/#comments Mon, 18 Jun 2012 06:51:46 +0000 http://www.odbms.org/blog/?p=1501 “For a large to medium sized German utility, which has about 240,000 conventional meters, quarter-hour meter readings would produce 960,000 sets of meter data to be processed and stored each hour once replaced by smart meters. And every hour another 960,000 sets of meter data have to be processed.” — Markus Gerdes.

80 percent of all households in Germany will have to be equipped with smart meters by 2020, according to an EU single market directive.
Why smart meters? A smart meter, as described by e.On, is “a digital device which can be read remotely and allows customers to check their own energy consumption at any time. This helps them to control their usage better and to identify concrete ways to save energy. Every customer can access their own consumption data online in graphic form displayed in quarter-hour intervals. There is also a great deal of additional information, such as energy saving tips. Similarly, measurements can be made using a digital display in the home in real time and the current usage viewed.” This means Big Data. How do we store, and use all these machine-generated data?
To better understand this, I have interviewed Dr. Markus Gerdes, Product Manager at BTC, a company specialized in the energy sector.

RVZ

Q1. What are the main business activities of BTC?

Markus Gerdes: BTC provides various IT services: besides the basics of system management, e.g. hosting services, security services or the new field of mobile security services, BTC primarily delivers IT and process consulting and system integration services for different industries, especially for utilities.
This means BTC plans and rolls out IT architectures, integrates and customizes IT applications, and migrates data for ERP, CRM and other applications. BTC also delivers its own IT applications if desired: in particular, BTC's Smart Metering solution BTC Advanced Meter Management (BTC AMM) is increasingly known in the smart meter market and has drawn customers' interest at this stage of the market, not only in Germany but also in Turkey and other European countries.

Q2. According to an EU single market directive and the German Federal Government, 80 percent of all households in Germany will have to be equipped with smart meters by 2020. How many smart meters will have to be installed? What will the government do with all the data generated?

Markus Gerdes: Currently, 42 million electricity meters are installed in Germany. Thus, about 34 million meters need to be exchanged in Germany by 2020 according to the EU directive. In order to achieve this aim, in 2011 the German EnWG (law on the energy industry) added some new requirements: smart meters have to be installed where customers' electricity consumption is more than 6,000 kWh per year, for decentralized feed-in with more than 7 kW, and in considerably refurbished or newly constructed buildings, if this is technically feasible.
In this context, technically feasible means that the installed smart meters are certified (as a precondition they have to use the protection profiles) and must be commercially available in the meter market. An amendment to the EnWG is due in September 2012, and it is generally expected that this threshold of 6,000 kWh will be lowered. The government will actually not be in charge of the data collected by the smart meters. It is the metering companies who have to provide the data to the distribution net operators and utility companies. The data is then used for billing and as an input to customer feedback systems, for example, and potentially for grid analyses under the use of pseudonyms.

Q3. Smart Metering: Could you please give us some detail on what Smart Metering means in the Energy sector?

Markus Gerdes: Smart Metering means opportunities. The technology itself does no more or less than deliver data, foremost a timestamp plus a measured value, from a metering system via a communication network to an IT system, where it is prepared and provided to other systems. If necessary this may even be done in real time. This data can be relevant to different market players in different resolutions and aggregations as a basis for other services.
Furthermore, smart meters offer new features like complex tariffs, load limitations, etc. The data and the new features will lead to optimized processes with respect to quality, speed and costs. This type of processing will finally lead to new services, products and solutions – some of which we do not even know today. In combination with other technologies and information types, the smart metering infrastructure will be the backbone of smart home applications and the so-called smart grid.
For instance, BTC develops scenarios to combine the BTC AMM with the control of virtual power plants or even with the BTC grid management and control application BTC PRINS. This means: smart markets become reality.
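To picture the payload being discussed, here is a minimal Java sketch of a smart meter data point: an identifier, a timestamp and a measured value at 15-minute resolution. The class is invented for illustration and is not the BTC AMM data model.

```java
import java.time.Duration;
import java.time.Instant;

// Illustrative sketch of the minimal payload a smart meter delivers:
// a meter identifier, a timestamp and a measured value.
public class MeterReading {

    static final Duration RESOLUTION = Duration.ofMinutes(15);  // standard load profile

    final String meterId;
    final Instant timestamp;
    final double kilowattHours;   // consumption in the 15-minute interval

    MeterReading(String meterId, Instant timestamp, double kilowattHours) {
        this.meterId = meterId;
        this.timestamp = timestamp;
        this.kilowattHours = kilowattHours;
    }

    public static void main(String[] args) {
        // 96 intervals of 15 minutes make up one day's load profile per meter.
        long pointsPerDay = Duration.ofDays(1).dividedBy(RESOLUTION);
        System.out.println("Data points per meter per day: " + pointsPerDay);  // 96
    }
}
```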

Q4. BTC AG has developed an advanced meter management system for the energy industry. What is it?

Markus Gerdes: The BTC AMM is an innovative software system which allows meter service providers to manage, control and read out smart meters, and to provide these meter readings and other possibly relevant information, e.g. status information and information on meter corruption and manipulation, to authorized market partners.
Data and control signals for the smart meter can also be provided by the system.
Because the BTC AMM was developed as a new solution, BTC has been able to focus particularly on mass data management and workflows optimized for smart meter mass processes. In combination with a clear and easy-to-use frontend, we bring our customers a high-performance solution for their most important requirements.
In addition, our modular concept and the use of open standards make our vendor-independent solution not only fit easily into a utility's IT architecture but also future-proof.

Q5. What kind of data management requirements do you have for this application? What kind of data is a smart meter producing and at what speed? How do you plan to store and process all the data generated by these smart meters?

Markus Gerdes: Let me address the issue of the data volume and the frequency of data sent first. The BTC AMM is designed to collect the data of several million smart meters. In a standard scenario, each of the smart meters sends a load profile with a resolution of 15 minutes to the BTC AMM. This means that at least 96 data points have to be stored by the BTC AMM per day and meter. This implies both a huge amount of data to be stored and high-frequency data traffic.
Hence, the data management system needs to be highly performant in both dimensions. In order to process time series, BTC has developed a specific, highly efficient time series management which runs with different database providers. This enables the BTC AMM to cope even with data sent at a higher frequency. For certain smart grid use cases the BTC AMM processes metering data sent from the meters on the scale of seconds.

Q6. The system you are developing is based on the InterSystems Caché® database system. How do you use Caché?

Markus Gerdes: BTC uses InterSystems Caché as its Meter Data Management solution. This means the incoming data from the smart meters is saved into the database and the information is provided, e.g. to backend systems via web services or to other interfaces, so that the data can be used for market partner communication or customer feedback systems. All this means the BTC AMM has to handle thousands of read and write operations per second.

Q7. You said that one of the critical challenges you are facing is to "master up the mass data efficiency in communicating with smart meters and the storage and processing of measured time series". Can you please elaborate on this? What is the volume of the data sets involved?

Markus Gerdes: For a large to medium sized German utility, which has about 240,000 conventional meters, quarter-hour meter readings would produce 960,000 sets of meter data to be processed and stored each hour once the meters are replaced by smart meters. And every hour another 960,000 sets of meter data have to be processed.
In addition, calculations, aggregations and plausibility checks are necessary. Moreover, incoming tasks have to be processed and the relevant data has to be delivered to backend applications. This means that the underlying database as well as the AMM processes may have to process the incoming data every 15 minutes while reading thousands of time series per minute simultaneously.
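A quick back-of-the-envelope check of those figures, as a small Java sketch (the numbers are the ones quoted in the answer):

```java
// Quick back-of-the-envelope check of the figures quoted above.
public class MeterVolumeEstimate {
    public static void main(String[] args) {
        long meters = 240_000;            // meters at a large-to-medium German utility
        long readingsPerHour = 4;         // quarter-hour load profile

        long setsPerHour = meters * readingsPerHour;
        long setsPerDay = setsPerHour * 24;

        System.out.println("Meter data sets per hour: " + setsPerHour);  // 960,000
        System.out.println("Meter data sets per day:  " + setsPerDay);   // 23,040,000
    }
}
```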

Q8. How did you test the performance of the underlying database system when handling data streams? What results did you obtain so far?

Markus Gerdes: We designed a load profile generator and used it to simulate the meter readings of more than 1 million smart meters. The tests included the writing of quarter-hour meter readings. Actually, the problem with this test was the speed of the generator in providing the data, not the speed of the AMM. In fact we are able to write more than 12,000 time series per second. This is far more than enough to cope even with full meter rollouts.
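The shape of such a test harness might look like the following Java sketch. It is not BTC's generator; the writer interface is a stand-in for whatever actually persists the time series, and the throughput it reports obviously depends entirely on that implementation.

```java
import java.util.concurrent.ThreadLocalRandom;

// Illustrative sketch of a load-profile generator used to measure write
// throughput. TimeSeriesWriter is a stand-in, not the actual AMM interface.
public class LoadProfileGenerator {

    interface TimeSeriesWriter { void write(long meterId, long epochSecond, double value); }

    static long generate(TimeSeriesWriter writer, long meters, int pointsPerMeter) {
        long start = System.nanoTime();
        long written = 0;
        for (long meter = 0; meter < meters; meter++) {
            long t = 0;
            for (int p = 0; p < pointsPerMeter; p++) {
                writer.write(meter, t, ThreadLocalRandom.current().nextDouble(0.0, 2.5));
                t += 900;                                   // 15-minute steps
                written++;
            }
        }
        double seconds = (System.nanoTime() - start) / 1e9;
        System.out.printf("Wrote %d points at %.0f points/second%n", written, written / seconds);
        return written;
    }

    public static void main(String[] args) {
        // No-op writer, just to exercise the generator itself.
        generate((meterId, epochSecond, value) -> { }, 1_000, 96);
    }
}
```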

Q9. What is the current status of this project? What are the lessons learned so far? And the plans ahead? Are there any similar systems implemented in Europe?

Markus Gerdes: At the moment we think that our BTC AMM and database performance can handle the upcoming mass data during the next years, including a full smart meter rollout in Germany. Nevertheless, in terms of smart grid and smart home appliances and an increasing amount of real-time event processing, both read and write, it is necessary to get a clear view of future technologies to speed up processing of mass data (e.g. in-memory).
In addition we still have to keep an eye on usability. Although we hope that smart metering in the end will lead to complete machine-to-machine communication, we always have to expect errors and disturbances from technology, communication or even the human factor. As event-driven processes are time-critical, we still have to work on solutions for fast and efficient handling, analysis and processing of mass errors.
_________________

Dr. Markus Gerdes, Product Manager BTC AMM / BTC Smarter Metering Suite, BTC Business Technology Consulting AG.
Since 2009, Mr. Gerdes has worked on several research, development and consulting projects in the area of smart metering. He has been involved in research and consulting in the utilities, industry and public sectors, regarding IT architecture, solutions and IT security.
He is experienced in the development of energy management solutions.

Interview with Iran Hutchinson, Globals. http://www.odbms.org/blog/2011/06/interview-with-iran-hutchinson-globals/ http://www.odbms.org/blog/2011/06/interview-with-iran-hutchinson-globals/#comments Mon, 13 Jun 2011 22:06:22 +0000 http://www.odbms.org/blog/?p=820 “ The newly launched Globals initiative is not about creating a new database.
It is, however, about exposing the core multi-dimensional arrays directly to developers." — Iran Hutchinson.

__________________________

InterSystems recently launched a new initiative: Globals.
I wanted to know more about Globals. I have therefore interviewed Iran Hutchinson, software/systems architect at InterSystems and one of the people behind the Globals project.

RVZ

Q1. InterSystems recently launched a new database product: Globals. Why a new database? What is Globals?

Iran Hutchinson: InterSystems has continually provided innovative database technology to its technology partners for over 30 years. Understanding customer needs to build rich, high-performance, and scalable applications resulted in a database implementation with a proven track record. The core of the database technology is multi-dimensional arrays (aka globals).
The newly launched Globals initiative is not about creating a new database. It is, however, about exposing the core multi-dimensional arrays directly to developers. By closely integrating access into development technologies like Java and JavaScript, developers can take full advantage of high-performance access to our core database components.

We undertook this project to build much broader awareness of the technology that lies at the heart of all of our products. In doing so, we hope to build a thriving developer community conversant in the Globals technology, and aware of the benefits to this approach of building applications.

Q2. You classify Globals as a NoSQL-database. Is this correct? What are the differences and similarities of Globals with respect to other NoSQL databases in the market?

Iran Hutchinson: While Globals can be classified as a NoSQL database, it goes beyond the definition of other NoSQL databases. As you know, there are many different offerings in NoSQL and no key comparison matrices or feature lists. Below we list some comparisons and differences, with hopes of later expanding the available information on the globalsdb.org website.

Globals differs from other NoSQL databases in a number of ways.

o It is not limited to one of the known paradigms in NoSQL (Column/Wide Column, Key-Value, Graph, Document, etc.). You can build your own paradigm on top of the core engine. This is an approach we took as we evolved Caché to support objects, xml, and relational, to name a few.
o Globals still offers optional transactions and locking. Though efficient in implementation we wanted to make sure that locking and transactions were at the discretion of the developer.
o MVCC is built into the database.
o Globals runs in-memory and writes data to disk.
o There is currently no sharding or replication available in Globals. We are discussing options for these features.
o Globals builds on the over 33 years of success of Caché. It is well proven. It is the exact same database technology. Globals will continue to evolve, and receive the innovations going into the core of Caché.
o Our goal with Globals is to be a very good steward of the project and technology. The Globals initiative will also start to drive contests and events to further promote adoption of the technology, as well as innovative approaches to building applications. We see this stewardship as a key differentiator, along with the underlying flexible core technology.

Globals shares similar traits with other NoSQL databases in the market.

o It is free for development and deployment.
o The data model can optionally use a schema. We mitigate the impact of using schemas by using the same infrastructure we use to store the data. The schema information and the data are both stored in globals.
o Developers can index their data.
o The Globals Document Store (GDS) API enables a document paradigm and a query language for data stored using the GDS API. GDS is also an example of how to build a storage paradigm in Globals. The Globals APIs are open source and available on GitHub.
o Globals is fast and efficient at storing data. We know performance is one of many hallmarks of NoSQL. Globals can store data at rates exceeding 100,000 objects/records per second.
o Different technology APIs are available for use with Globals. We've released two Java APIs, and the JavaScript API is imminent.

Q3. How do you position Globals with respect to Caché? Who should use Globals and who should use Caché?

Iran Hutchinson: Today, Globals offers multi-dimensional array storage, whereas Caché offers a much richer set of features. Caché (and the InterSystems technology it powers including Ensemble, DeepSee, HealthShare, and TrakCare) offers a core underlying object technology, native web services, distributed communication via ECP (Enterprise Cache Protocol), strategies for high availability, interactive development environment, industry standard data access (JDBC, ODBC, SQL, XML, etc.) and a host of other enterprise ready features.

Anyone can use Globals or Caché to tackle challenges with large data volumes (terabytes, petabytes, etc.), high transactions (100,000+ per second), and complex data (healthcare, financial, aerospace, etc.). However, Caché provides much of the needed out-of-box tooling and technology to get started rapidly building solutions in our core technology, as well as a variety of languages. Currently provided as Java APIs, Globals is a toolkit to build the infrastructure already provided by Caché. Use Caché if you want to get started today; use Globals if you have a keen interest in building the infrastructure of your data management system.

Q4. Globals offers multi-dimensional array storage. Can you please briefly explain this feature, and how this can be beneficial for developers?

Iran Hutchinson: It is beneficial to go here. I grabbed the following paragraphs directly from this page:

Summary Definition: A global is a persistent sparse multi-dimensional array, which consists of one or more storage elements or “nodes”. Each node is identified by a node reference (which is, essentially, its logical address). Each node consists of a name (the name of the global to which this node belongs) and zero or more subscripts.

Subscripts may be of any of the types String, int, long, or double. Subscripts of any of these types can be mixed among the nodes of the same global, at the same or different levels.

Benefits for developers: Globals does not limit developers to using objects, key-value, or any other type of storage paradigm. Developers are free to think of the optimal storage paradigm for what they are working on. With this flexibility, and the history of successful applications powered by globals, we think developers can begin building applications with confidence.
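For readers who want to see the idea in code, the following is a conceptual Java illustration of a sparse multi-dimensional array: nodes addressed by a name plus zero or more subscripts of mixed types, with only the nodes that are set consuming storage. It is plain Java written for this article, not the Globals Java API.

```java
import java.util.*;

// Conceptual illustration of a sparse multi-dimensional array ("global"):
// nodes are addressed by a name plus zero or more subscripts of mixed types.
// Plain Java for illustration only; this is not the Globals Java API.
public class SparseGlobal {

    // Each node address is the global name followed by its subscripts.
    private final Map<List<Object>, Object> nodes = new HashMap<>();

    public void set(Object value, String name, Object... subscripts) {
        List<Object> address = new ArrayList<>();
        address.add(name);
        address.addAll(Arrays.asList(subscripts));
        nodes.put(address, value);            // only set nodes consume storage
    }

    public Object get(String name, Object... subscripts) {
        List<Object> address = new ArrayList<>();
        address.add(name);
        address.addAll(Arrays.asList(subscripts));
        return nodes.get(address);
    }

    public static void main(String[] args) {
        SparseGlobal g = new SparseGlobal();
        // Mixed subscript types (String, int, double) at different levels.
        g.set("Gaia", "mission", "name");
        g.set(2013, "mission", "launch", "year");
        g.set(99.5, "mission", "telemetry", 42, 0.25);

        System.out.println(g.get("mission", "name"));             // Gaia
        System.out.println(g.get("mission", "launch", "year"));   // 2013
    }
}
```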

Q5. Globals does not include Objects. Is it possible to use Globals if my data is made of Java objects? If yes, how?

Iran Hutchinson: Globals exposes a multi-dimensional sparse array directly to Java and other languages. While Globals itself does not include direct Java object storage technology like JPA or JDO, one can easily store and retrieve data in Java objects using the APIs documented here. Anyone can also extend Globals to support popular Java object storage and retrieval interfaces.

One of the core concepts in Globals is that it is not limited to a paradigm, like objects, but can be used in many paradigms. As an example, the new GDS (Globals Document Store) API enables developers to use the NoSQL document paradigm to store their objects in Globals. GDS is available here (more docs to come).

Q6. Is Globals open source?

Iran Hutchinson: Globals itself is not open source. However, the Globals APIs hosted at the github location are open source.

Q7. Do you plan to create a Globals Community? And if yes, what will you offer to the community and what do you expect back from the community?

Iran Hutchinson: We created a community for Globals from the beginning. One of the main goals of the Globals initiative is to create a thriving community around the technology, and applications built on the technology.
We offer the community:
• Proven core data management technology
• An enthusiastic technology partner that will continue to evolve and support the project:
◦ Marketing the project globally
◦ Continual underlying technology evolution
◦ Involvement in the forums and open source technology development
◦ Participation in or hosting events and contests around Globals
• A venue to not only express ideas, but take a direct role in bringing those ideas to life in technology
• For those who want to build a business around Globals, 30+ years of experience in supplying software developers with the technology to build successful breakthrough applications.

____________________________________

Iran Hutchinson serves as product manager and software/systems architect at InterSystems. He is one of the people behind the Globals project. He has held architecture and development positions at startups and Fortune 50 companies. He focuses on language platforms, data management technologies, distributed/cloud computing, and high-performance computing. When not on the trail talking with fellow geeks or behind the computer, you can find him eating (just look for the nearest steak house).
___________________________________

Resources

Globals.
Globals is a free database from InterSystems. Globals offers multi-dimensional storage. The first version is for Java. Software | Intermediate | English | LINK | May 2011

Globals APIs
The Globals APIs are open source and available at the GitHub location.

Related Posts

Interview with Jonathan Ellis, project chair of Apache Cassandra.

The evolving market for NoSQL Databases: Interview with James Phillips.

– “Marrying objects with graphs”: Interview with Darren Wood.

“Distributed joins are hard to scale”: Interview with Dwight Merriman.

On Graph Databases: Interview with Daniel Kirstenpfad.

Interview with Rick Cattell: There is no “one size fits all” solution.
———-

Objects in Space. http://www.odbms.org/blog/2011/02/objects-in-space/ http://www.odbms.org/blog/2011/02/objects-in-space/#comments Mon, 14 Feb 2011 11:10:21 +0000 http://www.odbms.org/blog/?p=521 Objects in Space.
–The biggest data processing challenge to date in astronomy: The Gaia mission.–

I became aware of a very interesting project at the European Space Agency: the Gaia mission. It is considered by experts "the biggest data processing challenge to date in astronomy".

I wanted to know more. I have interviewed William O'Mullane, Science Operations Development Manager at the European Space Agency, and Vik Nagjee, Product Manager, Core Technologies, at InterSystems Corporation, both deeply involved with the Proof-of-Concept for the data management part of this project.

Hope you'll enjoy this interview.
RVZ

Q1. Space missions are long-term, generally 15 to 20 years in length. The European Space Agency plans to launch a satellite called Gaia in 2012. What is Gaia supposed to do?

O'Mullane:
Well, we have now learned we launch in early 2013; delays are fairly commonplace in complex space missions. Gaia is ESA's ambitious space astrometry mission, the main objective of which is to astrometrically and spectro-photometrically map 1,000 million celestial objects (mostly in our galaxy) with unprecedented accuracy. The satellite will downlink close to 100 TB of raw telemetry data over 5 years.
To achieve its required accuracy of a few tens of microarcseconds in astrometry, a highly involved processing of this data is required. The data processing is a pan-European effort undertaken by the Gaia Data Processing and Analysis Consortium. The result is a phase-space map of our galaxy, helping to untangle its evolution and formation.

Q2. In your report, “Charting the Galaxy with the Gaia Satellite”, you write “the Gaia mission is considered the biggest data processing challenge to date in astronomy. Gaia is expected to observe around 1,000,000,000 celestial objects”. What kind of information Gaia is expected to collect on celestial objects? And what kind of information Gaia itself needs in order to function properly?

O'Mullane:
Gaia has two telescopes and a Radial Velocity Spectrometer (RVS). Images are taken simultaneously from each telescope to calculate position on the sky and magnitude. Two special sets of CCDs at the end of the focal plane record images in red and blue bands to provide photometry for every object. Then, for a large number of objects, spectrographic images are recorded.
From the combination of this data we derive very accurate positions, distances and motions for celestial objects. Additionally we get metallicities, temperatures, etc., which allow the objects to be classified into different star groups. We use the term celestial objects since not all objects that Gaia observes are stars.
Gaia will, for example, see many asteroids and improve their known orbits. Many planets will be detected (though not directly observed).

As for any satellite, Gaia has a barrage of on-board instrumentation such as gyroscopes, star trackers and thermometers, which are read out and downlinked as 'housekeeping' telemetry. All of this information is needed to track Gaia's state and position. Perhaps rather more unique to Gaia is the use of the data taken through the scientific instruments for 'self' calibration. Gaia is all about beating noise with statistics.

Q3. All Gaia data processing software is written in Java. What are the main functionalities of the Gaia data processing software? What were the data requirements for this software?

O'Mullane:
The functionality is to process all of the observed data and reduce it to a set of star catalogue entries which represent the observed stars. The data requirements vary – in the science operations centre we downlink 50-80 GB per day, so there is a requirement on the daily processing software to be able to process this volume of data in 8 hours. This is the strictest requirement, because if we are swamped by the incoming data stream we may never recover.

The astrometric solution involves a few TB of data extracted from the complete set; the process runs less frequently (every six months to one year) and the requirement on that software is to do its job in 4 weeks.
There are many systems like this for photometry, spectroscopy, non-single stars, classification and variability analysis; each has its own requirements on data volume and processing time.

Q4. What are the main technical challenges with respect to data processing, manipulation and storage this project poses?

Nagjee:
The sheer volume of data that is expected to be captured by the Gaia satellite poses a technical challenge. For example, 1 billion celestial objects will be surveyed, and roughly 1000 observations (100*10) will be captured for each object, totaling around 1000 billion observations; each observation is represented as a discrete Java object and contains many properties that express various characteristics of these celestial bodies.
The trick is to not only capture this information and ingest (insert) it into the database very quickly, but to also be able to do this as discrete (non-BLOB) objects so that downstream processing can be facilitated in an easy manner. Additionally, the system needs to be optimized for very fast inserts (writes) and also for very fast queries (reads). All of this needs to be done in a very economical and cost-effective fashion – reducing power, energy, cooling, etc. requirements, and also reducing costs as much as possible.
These are some of the challenges that this project poses.

Q5. A specific part of the Gaia data processing software is the so called Astrometric Global Iterative Solution (AGIS). This software is written in Java. What is the function of such module? And what specific data requirements and technical challenges does it have?

Nagjee:
AGIS is a solution or program that iteratively turns the raw data into meaningful information.

O'Mullane:
AGIS takes a subset of the data, so-called well-behaved or primary stars, and basically fits the observations to the astrometric model. This involves refining the satellite attitude (using the known basic angle and the multiple simultaneous observations in each telescope) and the calibration (given an attitude and known star positions, the stars should appear at definite points on the CCD at given times). It's a huge (in fact too huge) matrix inversion, but we do a block-iterative estimation based on the conjugate gradient method. This requires looping (or iterating) over the observational data up to 40 times to converge the solution. So we need the I/O to be reasonable – we know the access pattern though, so we can organize the data such that the reads are almost serial.
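The overall shape of such a block-iterative scheme can be sketched as follows. This is a conceptual illustration only, not AGIS code: the interfaces and method names are invented, and the real solver's conjugate-gradient update is reduced here to an abstract per-block update step.

```java
// Conceptual sketch of the block-iterative shape described above: stream the
// observations in nearly serial order, update the source, attitude and
// calibration blocks, and repeat until the solution converges (up to ~40
// passes). All types and method names are invented; this is not AGIS code.
public class BlockIterativeSolver {

    interface ObservationStore { Iterable<double[]> scanInAccessOrder(); }

    interface Block {                       // e.g. sources, attitude, calibration
        void accumulate(double[] observation);
        double updateAndReportChange();     // apply the block update, return residual change
    }

    static void solve(ObservationStore store, Block[] blocks,
                      int maxIterations, double tolerance) {
        for (int iter = 0; iter < maxIterations; iter++) {
            for (double[] obs : store.scanInAccessOrder()) {
                for (Block block : blocks) {
                    block.accumulate(obs);  // gather contributions from each observation
                }
            }
            double change = 0.0;
            for (Block block : blocks) {
                change = Math.max(change, block.updateAndReportChange());
            }
            if (change < tolerance) {
                break;                      // converged before the iteration cap
            }
        }
    }
}
```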

Q6. In your high level architecture you use two databases, a so called Main Database and an AGIS Database. What is the rational for this choice and what are the functionalities expected from the two databases?

O'Mullane:
Well, the Main Database will hold all data from Gaia and the products of processing. This will grow from a few TBs to a few hundred TBs during the mission. It's a large repository of data. Now, we could consider lots of tasks reading and updating this, but the accounting would be a nightmare – and astronomers really like to know the provenance of their results. So we made the Main Database a little slower to update, declaring a version as immutable. One version is then the input to all processing tasks, the outputs of which form the next version.
AGIS meanwhile (and other tasks) only requires a subset of this data. Often the tasks are iterative and will need to scan over the data frequently or access it in some particular way. Again, it is easier for the designer of each task to also design and optimize his own data system than to try to make one system work for all.

Q7. You write that the AGIS database will contain data for roughly 500,000,000 sources (totaling 50,000,000,000 observations). This is roughly 100 Terabytes of Java data objects. Can you please tell us what the attributes of the Java objects are and what you plan to do with these 100 Terabytes of data? Is scalability an important factor in your application? Do you have specific time requirements for handling the 100 Terabytes of Java data objects?

O'Mullane:
Last question first – 4 weeks is the target time for running AGIS, that's ~40 passes through the dataset. The 500 million sources plus observations for Gaia are a subset of the Gaia data; we estimate it to be around 10 TB. Reduced to its quintessential element, each object observation is a time: the time when the celestial image crosses the fiducial line on the CCD. For one transit there are 10 such observations, with an average of 80 transits per source over the 5-year mission lifetime. Carried with that is the small cut-out image from the CCD. In addition there are various other pieces of information, such as which telescope was used and residuals, which are carried around for each observation. For each source we calculate the astrometric parameters, which you may see as 6 numbers plus errors. These are: the position (alpha and delta), the distance or parallax (varPi), the transverse motion (muAlpha and muDelta) and the radial velocity (muRvs).
Then there is a magnitude estimation and various other parameters. We see no choice but to scale the application to achieve the run times we desire, and it has always been designed as a distributed application.

Nagjee:
The AGIS Data Model comprises several objects and is defined in terms of Java interfaces. Specifically, AGIS treats each observation as a discrete AstroElementary object. As described in the paper, the AstroElementary object contains various properties (mostly of the IEEE long data type) and is roughly 600 bytes on disk. In addition, the AGIS database contains several supporting indexes which are built during the ingestion phase. These indexes assist with queries during AGIS processing, and also provide fast ad-hoc reporting capabilities. Using InterSystems Caché, with its Caché eXTreme for Java capability, multiple AGIS Java programs will ingest the 100 Terabytes of data generated by Gaia as 50,000,000,000 discrete AstroElementary objects within 5 days (yielding roughly 115,000 object inserts per second, sustained over 5 days).

Internally, we will spread the data across several database files within Caché using our Global and Subscript Mapping capabilities (you can read more about these capabilities here), while providing seamless access to the data across all ranges. The spreading of the data across multiple database files is mainly done for manageability.
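As a rough illustration of the per-source result described above, here is a small Java data holder shaped after the six astrometric parameters named in these answers (alpha, delta, varPi, muAlpha, muDelta, muRvs). It is not the actual AstroElementary interface or any other class from the Gaia data model; the field layout is an assumption made for this sketch.

```java
// Illustrative data holder shaped after the parameters named above.
// Not the actual Gaia data model; field choices here are assumptions.
public class SourceSolution {

    long sourceId;

    // The six astrometric parameters described in the answer above.
    double alpha;      // position: right ascension
    double delta;      // position: declination
    double varPi;      // parallax (distance)
    double muAlpha;    // proper motion in alpha
    double muDelta;    // proper motion in delta
    double muRvs;      // radial velocity

    // Formal errors reported alongside the parameters.
    double[] errors = new double[6];
}
```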

Q8. You conducted a proof-of-concept with Caché for the AGIS database. What were the main technical challenges of such proof-of-concept and what are the main results you obtained? Why did you select Caché and not a relational database for the AGIS database?

O'Mullane:
We have worked with Oracle for years and we can run AGIS on Derby. We have tested MySQL and Postgres (though not with AGIS). To make the relational systems work fast enough we had to reduce our row count – this we did by effectively combining objects into BLOBs, with the result that the RDBMS became more like a file system. Tests with Caché have shown we can achieve the read and write speeds we require without having to group data into BLOBs. This obviously is more flexible. It may have been possible to do this with another product, but each time we had a problem InterSystems came (quickly) and showed us how to get around it or fixed it. For the recent test we requested the writing of a specific representative dataset within 24 hours on our hardware – this was achieved in 12 hours. Caché is also a more cost-effective solution.

Nagjee:
The main technical challenge of such a Proof-of-Concept is to be able to generate realistic data and load on the system, and to tune and configure the system to be able to meet the strict insert requirements, while still optimizing sufficiently for down-stream querying of the data.
The white paper (.pdf) discusses the results, but in summary, we were able to ingest 5,000,000,000 AstroElementary objects (roughly 1/10th of the eventual projected amount) in around 12 hours. Our target was to ingest this data within 24 hours, and we were successful at doing it in half the time.

Caché is an extremely high-performance database, and as the Proof-of-Concept outlined in the white paper proves, Caché is more than capable of handling the stringent time requirements imposed by the Gaia project, even when run on relatively modest hardware.
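A quick arithmetic check of the ingest figures quoted in these answers, as a small Java sketch:

```java
// Back-of-the-envelope check of the ingest rates quoted above.
public class IngestRateCheck {
    public static void main(String[] args) {
        long pocObjects = 5_000_000_000L;          // Proof-of-Concept run
        long pocSeconds = 12L * 3600;              // ~12 hours
        System.out.println("PoC rate: " + pocObjects / pocSeconds + " objects/second");
        // ~115,740 objects/second

        long missionObjects = 50_000_000_000L;     // full AGIS ingest target
        long missionSeconds = 5L * 24 * 3600;      // 5 days
        System.out.println("Target rate: " + missionObjects / missionSeconds + " objects/second");
        // ~115,740 objects/second, matching the ~115,000/second figure quoted earlier
    }
}
```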

Q9. How do you handle data ingestion in the AGIS database, and how do you publish back updated objects into the main database?

Nagjee:
We use the Caché eXTreme for Java capability to interact between the Java AGIS application and the Caché database.

Q10. One main component of the proof-of-concept is the new Caché eXTreme for Java. Why is it important, and how did it get used in the proof-of-concept? How do you ensure low latency data storage and retrieval in the AGIS solution?

Nagjee:
Caché eXTreme for Java is a new capability of the InterSystems Caché database that exposes Caché's enterprise and high-performance features to Java via the JNI (Java Native Interface). It enables "in-process" communication between Java and Caché, thereby providing extremely low-latency data storage and retrieval.

Q11. What are the next steps planned for AGIS project and the main technical challenges ahead?

Nagjee:
The next series of testing will focus on ingesting even more data sources – up to 50% of the total projected objects. Next, we'll work on tuning the application and system for reads (queries), and will also continue to explore additional deployment options for the read/query phase (for example, ESA and InterSystems are looking at deploying hundreds of AGIS nodes in the Amazon EC2 cloud so as to reduce the amount of hardware that ESA has to purchase).

O'Mullane:
Well, on the technical side we need to move AGIS definitively to Caché for production. There are always communication bottlenecks to be investigated which limit scalability. AGIS itself requires several further developments, such as a more robust outlier scheme and a more complete set of calibration equations. AGIS is in a good state but needs more work to deal with the mission data.

———————-
William O'Mullane, Science Operations Development Manager, European Space Agency.
William O'Mullane has a background in Computer Science and has worked on space science projects since 1996, when he assisted with the production of the Hipparcos CD-ROMs. During this period he was also involved with the Planck and Integral science ground segments, as well as contemplating the Gaia data processing problem. From 2000-2005 Wil worked on developing the US National Virtual Observatory (NVO) and on the Sloan Digital Sky Survey (SDSS) in Baltimore, USA. In August 2005 he rejoined the European Space Agency as Gaia Science Operations Development Manager to lead the ESAC development effort for the Gaia Data Processing and Analysis Consortium.

Vik Nagjee, Product Manager, Core Technologies, InterSystems Corporation.
Vik Nagjee is the Product Manager for Core Technologies in the Systems Development group at InterSystems. He is responsible for the areas of Reliability, Availability, Scalability, and Performance for InterSystems’ core products – Caché and Ensemble. Prior to joining InterSystems in 2008, Nagjee held several positions, including Security Architect, Development Lead, and head of the performance & scalability group, at Epic Systems Corporation, a leading healthcare application vendor in the US.
————————-
For additional reading:

The Gaia Mission :
1. “Pinpointing the Milky Way” (download paper, .pdf).

2. “Gaia: Organisation and challenges for the data processing” (download paper, .pdf).

3. "Gaia Data Processing Architecture 2009" (download paper, .pdf).

4. “To Boldly Go Where No Man has Gone Before: Seeking Gaia’s Astrometric Solution with AGIS” (download paper, .pdf).

The AGIS Database

5. "Charting the Galaxy with the Gaia Satellite" (download white paper, .pdf).

New Lecture Notes http://www.odbms.org/blog/2010/06/new-lecture-notes/ http://www.odbms.org/blog/2010/06/new-lecture-notes/#comments Mon, 07 Jun 2010 15:33:39 +0000 http://www.odbms.org/blog/?p=279 I have published a new complete set of lecture notes on object databases, from Martin Hulin (Hochschule Ravensburg-Weingarten).

The lecture notes (397 slides) are called: “Modern Database Techniques”, and are available for free download.

The Content and Course Structure:

Learning module 0: Course Information, Welcome: 1 h
Learning module 1: Introduction to Object Oriented Databases: 4 h
Learning module 2: Concepts of Object Oriented Databases: 40 h
Learning module 3: Client Application and Language Binding: 20 h
Learning module 4: Advanced storage concepts, performance tuning: 10 h
Learning module 5: System Management: 20 h
Learning module 6: Security in Databases: 10 h
Learning module 7: Distributed and Mobile Databases: 25 h
Web conference: Exam Preparation: 2 h

Target group: Students of Computer Science in 5th semester or higher.
Prerequisites: Relational Databases, SQL, Object Oriented Programming, UML.
Duration: Student workload: 120 h – 150 h. Lecture and supervised exercises: 64 lessons (45 min).

For all exercises, Martin uses the object-oriented database management system Caché from InterSystems.

You can DOWNLOAD (PDF) the lecture notes at ODBMS.ORG.

##
