On the InterSystems IRIS Data Platform.
ODBMS Industry Watch, 9 February 2018. http://www.odbms.org/blog/2018/02/on-the-intersystems-iris-data-platform/

“We believe that businesses today are looking for ways to leverage the large amounts of data collected, which is driving them to try to minimize, or eliminate, the delay between event, insight, and action to embed data-driven intelligence into their real-time business processes.” –Simon Player

I have interviewed Simon Player, Director of Development for TrakCare and Data Platforms; Helene Lengler, Regional Director for DACH & BeNeLux; and Joe Lichtenberg, Director of Marketing for Data Platforms. All three work at InterSystems. We talked about the new InterSystems IRIS Data Platform.


Q1. You recently  announced the InterSystems IRIS Data Platform®. What is it?

Simon Player: We believe that businesses today are looking for ways to leverage the large amounts of data collected, which is driving them to try to minimize, or eliminate, the delay between event, insight, and action to embed data-driven intelligence into their real-time business processes.

It is time for database software to evolve and offer multiple capabilities to manage that business data within a single, integrated software solution. This is why we chose to include the term ‘data platform’ in the product’s name.
InterSystems IRIS Data Platform supports transactional and analytic workloads concurrently, in the same engine, without requiring moving, mapping, or translating the data, eliminating latency and complexity. It incorporates multiple, disparate and dissimilar data sources, supports embedded real-time analytics, easily scales for growing data and user volumes, interoperates seamlessly with other systems, and provides flexible, agile, DevOps-compatible deployment capabilities.

InterSystems IRIS provides concurrent transactional and analytic processing capabilities; support for multiple, fully synchronized data models (relational, hierarchical, object, and document); a complete interoperability platform for integrating disparate data silos and applications; and sophisticated structured and unstructured analytics capabilities supporting both batch and real-time use cases in a single product built from the ground up with a single architecture. The platform also provides an open analytics environment for incorporating best-of-breed analytics into InterSystems IRIS solutions, and offers flexible deployment capabilities to support any combination of cloud and on-premises deployments.

Q2. How is InterSystems IRIS Data Platform positioned with respect to other Big Data platforms in the market (e.g. Amazon Web Services, Cloudera, Hortonworks Data Platform, Google Cloud Platform, IBM Watson Data Platform and Watson Analytics, Oracle Data Cloud system, Microsoft Azure, to name a few) ?

Joe Lichtenberg: Unlike other approaches that require organizations to implement and integrate different technologies, InterSystems IRIS delivers all of the functionality in a single product with a common architecture and development experience, making it faster and easier to build real-time, data-rich applications. However, it is an open environment and can integrate with existing technologies already in use in the customer’s environment.

Q3. How do you ensure High Performance with Horizontal and Vertical Scalability? 

Simon Player: Scaling a system vertically by increasing its capacity and resources is a common, well-understood practice. Recognizing this, InterSystems IRIS includes a number of built-in capabilities that help developers leverage the gains and optimize performance. The main areas of focus are Memory, IOPS and Processing management. Some of these tuning mechanisms operate transparently, while others require specific adjustments on the developer’s own part to take full advantage.
One example of those capabilities is parallel query execution. Built on a flexible infrastructure for maximizing CPU usage, it spawns one process per CPU core and is most effective with large data volumes, such as analytical workloads that perform large aggregations.

When vertical scaling does not provide the complete solution—for example, when you hit the inevitable hardware (or budget) ceiling—data platforms can also be scaled horizontally. Horizontal scaling fits very well with virtual and cloud infrastructure, in which additional nodes can be quickly and easily provisioned as the workload grows, and decommissioned if the load decreases.
InterSystems IRIS accomplishes this by providing the ability to scale for both increasing user volume and increasing data volume.

For increased user capacity, we leverage a distributed cache with an architectural solution that partitions users transparently across a tier of application servers sitting in front of our data server(s). Each application server handles user queries and transactions using its own cache, while all data is stored on the data server(s), which automatically keeps the application server caches in sync.
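The cache synchronization described above can be sketched roughly as follows. This is a minimal illustration, not InterSystems IRIS code: the class names `DataServer` and `AppServer` are hypothetical, and the invalidate-on-write strategy is an assumption, since the interview does not describe the actual synchronization protocol.

```python
class DataServer:
    """Holds the authoritative copy of all data (hypothetical stand-in)."""

    def __init__(self):
        self._store = {}
        self._app_servers = []

    def register(self, app_server):
        self._app_servers.append(app_server)

    def read(self, key):
        return self._store.get(key)

    def write(self, key, value, origin=None):
        self._store[key] = value
        # Assumed sync strategy: drop stale copies on every other app server.
        for server in self._app_servers:
            if server is not origin:
                server.invalidate(key)


class AppServer:
    """Serves a subset of users from its own local cache (hypothetical stand-in)."""

    def __init__(self, data_server):
        self._data = data_server
        self._cache = {}
        data_server.register(self)

    def get(self, key):
        if key not in self._cache:  # cache miss: fetch from the data server
            self._cache[key] = self._data.read(key)
        return self._cache[key]

    def put(self, key, value):
        self._cache[key] = value
        self._data.write(key, value, origin=self)

    def invalidate(self, key):
        self._cache.pop(key, None)


data = DataServer()
app_a, app_b = AppServer(data), AppServer(data)
app_a.put("patient:1", "admitted")
first_read = app_b.get("patient:1")    # fetched from the data server, then cached
app_a.put("patient:1", "discharged")   # invalidates the copy cached on app_b
second_read = app_b.get("patient:1")   # re-fetched, so it sees the update
```

In this sketch users attached to either application server see the same data, while repeated reads are served from the local cache until another server writes the same key.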

For increased data volume, we distribute the workload to a sharded cluster with partitioned data storage, along with the corresponding caches, providing horizontal scaling for queries and data ingestion. In a basic sharded cluster, a sharded table is partitioned horizontally into roughly equal sets of rows called shards, which are distributed across a number of shard data servers. For example, if a table with 100 million rows is partitioned across four shard data servers, each stores a shard containing about 25 million rows. Queries against a sharded table are decomposed into multiple shard-local queries to be run in parallel on multiple servers; the results are then transparently combined and returned to the user. This distributed data layout can further be exploited for parallel data loading and with third party frameworks like Apache Spark.
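The decomposition into shard-local queries can be illustrated with a small scatter-gather sketch. It is a toy model under stated assumptions, not the platform's implementation: rows are placed by a simple modulo rule, threads stand in for the four shard data servers, and the function names are made up for the example.

```python
from concurrent.futures import ThreadPoolExecutor

NUM_SHARDS = 4  # matches the four shard data servers in the example above


def shard_for(row_id, num_shards=NUM_SHARDS):
    """Assumed placement rule: simple modulo on an integer row id."""
    return row_id % num_shards


def build_shards(rows, num_shards=NUM_SHARDS):
    shards = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(row["id"], num_shards)].append(row)
    return shards


def shard_local_sum(shard):
    """The shard-local part of an AVG query: a partial sum and a row count."""
    return sum(row["amount"] for row in shard), len(shard)


def sharded_average(shards):
    """Run the shard-local queries in parallel, then combine the partial results."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        partials = list(pool.map(shard_local_sum, shards))
    total = sum(partial[0] for partial in partials)
    count = sum(partial[1] for partial in partials)
    return total / count


rows = [{"id": i, "amount": i % 10} for i in range(100_000)]
shards = build_shards(rows)
shard_sizes = [len(shard) for shard in shards]
average = sharded_average(shards)
```

Each shard returns only a partial sum and a count, so the combining step handles four small results instead of every row. That is why an average has to be assembled from sums and counts rather than from per-shard averages.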

Horizontal clusters require greater attention to the networking component to ensure that it provides sufficient bandwidth for the multiple systems involved and is entirely transparent to the user and the application.

Q4. How can you simultaneously process both transactional and analytic workloads in a single database?

Simon Player: At the core of InterSystems IRIS is a proven, enterprise-grade, distributed, hybrid transactional-analytic processing (HTAP) database. It can ingest and store transactional data at very high rates while simultaneously processing high volumes of analytic workloads on real-time data (including ACID-compliant transactional data) and non-real-time data. This architecture eliminates the delays associated with moving real-time data to a different environment for analytic processing. InterSystems IRIS is built on a distributed architecture to support large data volumes, enabling organizations to analyze very large data sets while simultaneously processing large amounts of real-time transactional data.

Q5. There is a wide range of analytics, including business intelligence, predictive analytics, distributed big data processing, real-time analytics, and machine learning. How do you support them in the InterSystems IRIS Data Platform?

Simon Player: Many of these capabilities are built into the platform itself and leverage that tight integration to simultaneously process both transactional and analytic workloads; however, we realize that there are multiple use cases where customers and partners would like InterSystems IRIS Data Platform to access data on other systems or to build solutions that leverage best-of-breed tools (such as ML algorithms, Spark etc.) to complement our platform and quickly access data stored on it.
That’s why we chose to provide open analytics capabilities supporting industry standard APIs such as UIMA, Java Integration, xDBC and other connectivity options.

Q6. What about third-party analytics tools? 

Simon Player:  The InterSystems IRIS Data Platform offers embedded analytics capabilities such as business intelligence, distributed big data processing & natural language processing, which can handle both structured and unstructured data with ease. It is designed as an Open Analytics Platform, built around a universal, high-performance and highly scalable data store.
Third-party analytics tools can access data stored on the platform via standard APIs including ODBC, JDBC, .NET, SOAP, REST, and the new Apache Spark Connector. In addition, the platform supports working with industry-standard analytical artifacts such as predictive models expressed in PMML and unstructured data processing components adhering to the UIMA standard.

Q7. How does InterSystems IRIS Data Platform integrate into existing infrastructures and with existing best-of-breed technologies (including your own products)?

Simon Player:  InterSystems IRIS offers a powerful, flexible integration technology that enables you to eliminate “siloed” data by connecting people, processes, and applications. It includes the comprehensive range of technologies needed for any connectivity task.
InterSystems IRIS can connect to your existing data and applications, enabling you to leverage your investment, rather than “ripping and replacing.” With its flexible connectivity capabilities, solutions based on InterSystems IRIS can easily be deployed in any client environment.

Built-in support for standard APIs enables solutions based on InterSystems IRIS to leverage applications that use Java, .NET, JavaScript, and many other languages. Support for popular data formats, including JSON, XML, and more, cuts down time to connect to other systems.

A comprehensive library of adapters provides out-of-the-box connectivity and data transformations for packaged applications, databases, industry standards, protocols, and technologies – including SQL, SOAP, REST, HTTP, FTP, SAP, TCP, LDAP, Pipe, Telnet, and Email.

Object inheritance minimizes the effort required to build any needed custom adapters. Using InterSystems IRIS’ unit testing service, custom adapters can be tested without first having to complete the entire solution. Traceability of each event allows efficient analysis and debugging.

The InterSystems IRIS messaging engine offers guaranteed message delivery, content-based routing, high-performance message transformation, and support for both synchronous and asynchronous interactions. InterSystems IRIS has a graphical editor for business process orchestration, a business rules engine, and a workflow editor that enable you to automate your enterprise-wide business procedures or create new composite applications. With world-class support for XML, SOAP, JSON and REST, InterSystems IRIS is ideal for creating an Enterprise Service Bus (ESB) or employing a Service-Oriented Architecture (SOA).

Because it includes a high performance transactional-analytic database, InterSystems IRIS can store and analyze messages as they flow through your system. It enables business activity monitoring, alerting, real-time business intelligence, and event processing.

· Other integration points with industry standards or best-of-breed technologies include the ability to easily and securely transport files between client machines and the server via our Managed File Transfer (MFT) capability. This functionality leverages state-of-the-art MFT providers like Box, Dropbox and KiteWorks to provide a simple client that non-technical users can install and companies can pre-configure and brand. InterSystems IRIS connects with these providers as a peer and exposes common APIs (e.g. to manage users).

· When using Apache Spark for large distributed data processing and analytics tasks, the Spark Connector will leverage the distributed data layout of sharded tables and push computation as close to the data as possible, increasing parallelism and thus overall throughput significantly vs regular JDBC connections.

Q8. What market segments do you address with IRIS  Data Platform?

Helene Lengler: InterSystems IRIS is an open platform that suits virtually any industry, but we will be initially focusing on a couple of core market segments, primarily due to varying regional demand. For instance, we will concentrate on the financial services industry in the US or UK and the retail and logistics market in the DACH and Benelux regions. Additionally, in Germany and Japan, our major focus will be on the manufacturing industry, where we see a rapidly growing demand for data-driven solutions, especially in the areas of predictive maintenance and predictive analytics.
We are convinced that InterSystems IRIS is ideal for this and also for other kinds of IoT applications with its ability to handle large-scale transactional and analytic workloads. On top of this, we are also looking to engage with companies that are at the very beginning of product development – in other words, start-ups and innovators working on solutions that require a robust, future-proof data platform.

Q9. Are there any proof of concepts available? 

Helene Lengler: Yes. Although the solution has only been available to selected partners for a couple of weeks, we have already completed the first successful migration in Germany. A partner that offers an Enterprise Information Management System, which allows organizations to archive and access all of their data, documents, emails and paper files, has been able to migrate from InterSystems Caché to InterSystems IRIS in as little as a couple of hours and – most importantly – without any issues at all. The partner decided to move to InterSystems IRIS because they are in the process of signing a contract with one of the biggest players in the German travel & transport industry. With customers like this, you are looking at data volumes in the Petabyte range very, very shortly, meaning you require the right technology from the start in order to be able to scale horizontally – using InterSystems IRIS technologies such as sharding – as well as vertically.

In addition, we were able to show a live IoT demonstrator at our InterSystems DACH Symposium in November 2017. This proof of concept is actually a lighthouse example of what the new platform brings to the table: A team of three different business partners and InterSystems experts leveraged InterSystems IRIS’ capabilities to rapidly develop and implement a fully functional solution for a predictive maintenance scenario. Numerous other test scenarios and PoCs are currently being conducted in various industry segments with different partners around the globe.

Q10. Can developers already use InterSystems IRIS Data Platform? 

Simon Player: Yes. Starting on 1/31, developers can use our sandbox, the InterSystems IRIS Experience, at www.intersystems.com/experience.

Qx. Anything else you wish to add?

Simon Player: The public is welcome to join the discussion on how to graduate from database to data platform on our developer community at https://community.intersystems.com.

Simon Player is director of development for both TrakCare and Data Platforms at InterSystems. Simon has used and developed on InterSystems technologies since the early 1990s. He holds a BSc in Computer Sciences from the University of Manchester.


Helene Lengler is the Regional Managing Director for the DACH and Benelux regions. She joined InterSystems in July 2016 and has more than 25 years of experience in the software technology industry. During her professional career, she has held various senior positions at Oracle, including Vice President (VP) Sales Fusion Middleware and member of the executive board at Oracle Germany, VP Enterprise Sales and VP of Oracle Direct. Prior to her 16 years at Oracle, she worked for the Digital Equipment Corporation in several business disciplines such as sales, marketing and presales.
Helene holds a Masters degree from the Julius-Maximilians-University in Würzburg and a post-graduate Business Administration degree from AKAD in Pinneberg.

Joe Lichtenberg is responsible for product and industry marketing for data platform software at InterSystems. Joe has decades of experience working with various data management, analytics, and cloud computing technology providers.


InterSystems IRIS Data Platform, Product Page.

E-Book (IDC): Slow Data Kills Business.

White Paper (ESG): Building Smarter, Faster, and Scalable Data-rich Applications for Businesses that Operate in Real Time. 

Achieving Horizontal Scalability, Alain Houf – Sales Engineer, InterSystems

Horizontal Scalability with InterSystems IRIS

Press release: InterSystems IRIS Data Platform™ Now Available.


Gaia Mission maps 1 Billion stars. Interview with Uwe Lammers
ODBMS Industry Watch, 16 August 2017. http://www.odbms.org/blog/2017/08/gaia-mission-maps-1-billion-stars-interview-with-uwe-lammers/

“Gaia continues to be a challenging mission in all areas even after 4 years of operation.
In total we have processed almost 800 Billion (=800,000 Million) astrometric, 160 Billion (=160,000 Million) photometric and more than 15 Billion spectroscopic observations, which is the largest astronomical dataset from a science space mission until the present day.”

— Uwe Lammers.

In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past. The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.
I recall here the Objectives of the Gaia Project (source ESA Web site):

“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”

I have been following the Gaia mission since 2011, and I have reported on it in three interviews until now.
In this interview, Uwe Lammers, Gaia’s Science Operations Manager, gives a very detailed description of the data challenges and the opportunities of the Gaia mission.

This interview is the fourth of the series, the second after the launch.



Q1. Of the raw astrometry, photometry and spectroscopy data collected so far by the Gaia spacecraft, what is their Volume, Velocity, Variety, Veracity and Value?

Since the beginning of the nominal mission in 2014 until the end of June 2017, the satellite has delivered about 47.5 TB of compressed raw data. This data is not suitable for any scientific analysis but first has to be processed into higher-level products, which inflates the volume about 4 times.

The average raw daily data rate is about 40 GB but highly variable depending on which part of the sky the satellite is currently scanning through. The data is highly-complex and interdependent but not unstructured – it does not come with a lot of meta-information as such but follows strictly defined structures. In general it is very trustworthy, however, the downstream data processing cannot blindly assume that every single observation is valid.
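As a rough sanity check, the volume figures quoted so far can be combined in a few lines. The sketch assumes decimal units (1 TB = 1000 GB) and treats the 40 GB average as constant, which the answer itself says it is not.

```python
RAW_TB = 47.5          # compressed raw data delivered so far (from the interview)
INFLATION_FACTOR = 4   # processed products are "about 4 times" larger
DAILY_RAW_GB = 40      # average raw daily data rate
GB_PER_TB = 1000       # assumption: decimal units

processed_tb = RAW_TB * INFLATION_FACTOR
days_at_average_rate = RAW_TB * GB_PER_TB / DAILY_RAW_GB
years_at_average_rate = days_at_average_rate / 365.25

print(f"processed volume: about {processed_tb:.0f} TB")
print(f"days of observing implied by the average rate: {days_at_average_rate:.0f}")
print(f"that is about {years_at_average_rate:.2f} years")
```

The implied duration of a little over three years is consistent with a nominal mission that started in 2014 and a cut-off at the end of June 2017.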

As with all scientific measurements, there can be outliers which must be identified and eliminated from the data stream as part of the analysis. Regarding value, Gaia’s data set is absolutely unique in a number of ways.
Gaia is the only mission surveying the complete sky with unprecedented precision and completeness. The end result is expected to be a treasure trove for generations of astronomers to come.

Q2. How is this data transmitted to Earth?

Under normal observing conditions the data is transmitted from the satellite to the ground through a so-called phased-array-antenna (PAA) at a rate of up to 8.5 Mbps. As the satellite spins, it continuously keeps a radio beam directed towards the Earth by activating successive panels on the PAA. This is a fully electronic process as there can be no moving parts on Gaia which would otherwise disturb the precise measurements. On the Earth we use three 35m radio dishes in Spain, Australia, and Argentina to receive the telemetry from Gaia.
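To put the downlink rate in perspective, the following back-of-the-envelope calculation estimates how long it takes to transmit an average day of data. It assumes that the roughly 40 GB average daily volume quoted earlier is what goes over the link, that 1 GB = 10^9 bytes, and that the peak rate is sustained with no protocol overhead. None of these assumptions is confirmed in the interview.

```python
DAILY_RAW_GB = 40      # average raw daily data rate quoted earlier in the interview
DOWNLINK_MBPS = 8.5    # maximum downlink rate through the phased-array antenna

bits_per_day = DAILY_RAW_GB * 1e9 * 8          # assumption: 1 GB = 10^9 bytes
seconds_needed = bits_per_day / (DOWNLINK_MBPS * 1e6)
hours_needed = seconds_needed / 3600

print(f"about {hours_needed:.1f} hours of downlink per day at the peak rate")
```

Under these assumptions an average day of data needs on the order of ten hours of ground-station contact.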

Q3. Calibrated processed data, high level data products and raw data. What is the difference? What kind of technical data challenges do they each pose?

That question is not easy to answer in a few words. Raw data are essentially unprocessed digital measurements from the CCDs – perhaps comparable to data from the “raw mode” of digital consumer cameras. They have to be processed with a range of complex software to turn them into higher-level products from which, in the end, astrophysical information can be inferred. There are many technical challenges; the most basic one is still handling the 100s of GBs of daily data. Handling means reception, storage, processing, I/O by the scientific algorithms, backing up, and disseminating the processed data to 5 other partner data processing centres across Europe.

Here at the Science Operations Centre (SOC) near Madrid we chose InterSystems Caché RDBMS + NetApp hardware as our storage solution years ago, and this continues to be a good solution. The system is reliable and performant, which are crucial prerequisites for us. Another technical challenge is data accountability, which means keeping track of the more than 70 million scientific observations we get from the satellite every single day.

Q4. Who are the users of such data and what do they do with it?

The data we are generating here at the SOC has no immediate users. It is sent out to the 5 other Gaia Data Processing Centres, where more scientific processing takes place and more higher-level products are created. From all this processed data we are constructing a stellar catalogue, which is our final result, and this is what the end users – the astronomical community of the world – get to see. The first version of our catalogue was published on 14 September last year (Gaia Data Release 1) and we are currently working hard to release the second version (DR2) in April next year.

Our end users do fundamental astronomical research with the data, ranging from looking at individual stars, studies of clusters, and the dynamics of our Milky Way to cosmological questions like the expansion rate of our universe. The scientific exploitation of the Gaia data has just started, but more than 200 scientific articles have already been published. This is about 1 per day since DR1 and we expect this rate to rise further after DR2.

Q5. Can you explain at a high level how is the ground processing of Gaia data implemented?

ESA has entrusted the Gaia data processing to the Data Processing and Analysis Consortium (DPAC), of which the SOC is an integral part. DPAC consists of 9 so-called Coordination Units (CU) and 6 data processing centres (DPCs) across Europe, so this is a large distributed system.

In total some 450 people from 20+ countries with a large range of educational backgrounds and experiences form DPAC. Roughly speaking, the CUs are responsible for writing and validating the scientific processing software, which is then run in one of the DPCs (every CU is associated with exactly one DPC).

The different CUs cover different aspects of the data processing (e.g. CU3 takes care of astrometry, CU5 of photometry).
The corresponding processes run more or less independently of each other; however, due to the complex interdependencies of the Gaia data itself, this is only a first approximation. Ultimately, everything depends on everything else (e.g. astrometry depends on photometry and vice versa), which means that the entire system has to be iterated to produce the final solution. As you can imagine, a lot of data has to be exchanged. SOC/DPCE is the hub in a hub-and-spokes topology where the other 5 DPCs are sitting at the ends of the spokes. No direct data exchange between DPCs is allowed; all the data flow is centrally managed through the hub at DPCE.
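The routing rule implied by this hub-and-spokes topology is simple enough to write down. In the sketch below the hub name DPCE comes from the interview, while the spoke labels `DPC-1` to `DPC-5` and the `route` function are hypothetical placeholders.

```python
HUB = "DPCE"
# Hypothetical labels for the five other data processing centres.
SPOKES = {"DPC-1", "DPC-2", "DPC-3", "DPC-4", "DPC-5"}


def route(source, destination):
    """Return the list of sites a data product passes through."""
    sites = SPOKES | {HUB}
    if source not in sites or destination not in sites:
        raise ValueError("unknown site")
    if source == destination:
        return [source]
    if HUB in (source, destination):
        return [source, destination]       # hub <-> spoke is a direct link
    return [source, HUB, destination]      # spoke -> spoke must pass through the hub


path = route("DPC-1", "DPC-3")
direct = route("DPCE", "DPC-2")
```

Every transfer between two spokes takes two hops, which is the price of having a single place where all data flow can be tracked and managed.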

Q6. How do you process the data stream in near real-time in order to provide rapid alerts to facilitate ground-based follow-up?

Yes, indeed we do. For ground-based follow up observations of variable objects quick turn-around times are essential. The time difference between an observation made on-board and the confirmation of a photometric alert on the ground is typically 2 days now which is close to the optimal value given all the operational constraints we have.

Q7. What are the main technical challenges with respect to data processing, manipulation and storage you have encountered so far? And how did you solve them?

Regarding storage, the handling of 100s of GBs of raw and processed data every day has always been and remains until today quite a challenge as explained above. The Gaia data reduction task is also a formidable computational problem. Years ago we estimated the total numerical effort to produce the final catalogue at some 10^20 FLOPs and this has proven fairly accurate.

So we need quite some number-crunching capabilities in the DPCs and to continuously expand CPU resources as the data volume keeps growing in the operational phase of the mission. Moore’s law is slowly coming to an end but, fortunately, a number of algorithms are perfectly parallelizable (processing every object in the sky individually and isolated) such that CPU bottlenecks can be ameliorated by simply adding more processors to the existing systems.
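The effect of adding processors to a perfectly parallelizable workload can be estimated with simple arithmetic. The total of 10^20 floating-point operations is the figure from the interview; the sustained per-core rate of 10^10 operations per second is a made-up illustrative value, and perfect scaling is an idealization that only applies to the per-object algorithms mentioned above.

```python
TOTAL_FLOPS = 1e20        # estimated numerical effort for the final catalogue
SECONDS_PER_DAY = 86_400


def wall_clock_days(cores, flops_per_core_per_s=1e10):
    """Days needed if the work splits perfectly across `cores`.

    The default per-core rate is an illustrative assumption, not a
    measured value from any of the data processing centres.
    """
    return TOTAL_FLOPS / (cores * flops_per_core_per_s) / SECONDS_PER_DAY


for cores in (100, 1_000, 10_000):
    print(f"{cores:>6} cores: about {wall_clock_days(cores):,.0f} days")
```

With these numbers, 1,000 cores would need roughly 116 days of uninterrupted computing, and each tenfold increase in cores cuts the time by a factor of ten.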

Data transfers are likewise a challenge. At the moment 1 Gbps connections (public Internet) between DPCE and the other 5 DPCs are sufficient; however, in the coming years we rely heavily on bandwidths increasing to 10 Gbps and beyond. Unfortunately, this is largely not under our control, which is a risk to the project.

Q8. What kind of databases and analytics tools do you use for Gaia’s data pipeline?

As explained above, for the so-called daily pipeline we have chosen InterSystems Caché and are very satisfied with this approach. We had some initial problems with the system but were able to overcome all difficulties with the help of InterSystems. We much appreciated their excellent service and customer orientation in this phase and till the present day. Regarding analytics tools we use most facilities that are part of Caché, but have also developed a suite of custom-made solutions.

Q9. How do you transform the raw information into useful and reliable stellar positions?

The raw data from the satellite is first turned into higher-level products, which already include preliminary estimates for the stellar positions. But each of these positions is then only based on a single measurement. The high accuracy of Gaia comes from combining _all_ observations that have been taken during the mission with a scheme called Astrometric Global Iterative Solution (AGIS) [see The astrometric core solution for the Gaia mission. Overview of models, algorithms, and software implementation].

This cannot be done on a star-by-star basis but is a global, simultaneous optimization of a large number of parameters including the 5 basic astrometric parameters of each star (about 1 Billion in total), the time-varying attitude of the satellite (a few Million), and a number of calibration parameters (a few tens of thousands).

The process is iterative and in the end gives the best match between the model parameters and the actual observations. The stellar positions are two of the five astrometric parameters of each object.
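The flavour of such a block-iterative global fit can be shown with a toy problem. The sketch below is not AGIS: it assumes a deliberately simple measurement model in which every observation is the sum of one "star" parameter and one "attitude" parameter plus noise, and it alternates between the two parameter blocks, solving each by least squares while the other is held fixed. All names and sizes are illustrative.

```python
import random

rng = random.Random(0)
N_STARS, N_EPOCHS = 50, 20
NOISE = 0.001

true_star = [rng.uniform(-1.0, 1.0) for _ in range(N_STARS)]
true_att = [rng.uniform(-0.1, 0.1) for _ in range(N_EPOCHS)]
# Toy measurement model (an assumption): observation = star + attitude + noise.
obs = [[true_star[i] + true_att[j] + rng.gauss(0.0, NOISE)
        for j in range(N_EPOCHS)] for i in range(N_STARS)]

star = [0.0] * N_STARS
att = [0.0] * N_EPOCHS


def rms_residual():
    """Root-mean-square mismatch between the model and the observations."""
    total = sum((obs[i][j] - star[i] - att[j]) ** 2
                for i in range(N_STARS) for j in range(N_EPOCHS))
    return (total / (N_STARS * N_EPOCHS)) ** 0.5


history = [rms_residual()]
for _ in range(5):
    # Block 1: least-squares star parameters given the current attitude estimate.
    star = [sum(obs[i][j] - att[j] for j in range(N_EPOCHS)) / N_EPOCHS
            for i in range(N_STARS)]
    # Block 2: least-squares attitude parameters given the updated star estimate.
    att = [sum(obs[i][j] - star[i] for i in range(N_STARS)) / N_STARS
           for j in range(N_EPOCHS)]
    history.append(rms_residual())
```

The residual drops from the scale of the signal to the scale of the noise, even though no star is ever fitted in isolation. In this toy only the sum of the two blocks is constrained, so a constant offset can move freely between them.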

Q10. What is the level of accuracy you have achieved so far?

The accuracies depend on the brightnesses of the stars – the brighter a star, the higher is the achievable accuracy. In DR1 the typical uncertainty is about 0.3 mas for the positions and parallaxes, and about 1 mas yr^-1 for the proper motions.
For positions and parallaxes a systematic component of another 0.3 mas should be added. With DR2 we are aiming to reduce these formal errors by at least a factor of 3 and likewise reduce systematic errors by the same or a larger amount.

There is then still quite some way to go to reach the end-of-mission accuracies (e.g. 25 micro-arcsec for a magnitude 15 star) but the DR2 catalogue will already become a game changer for astronomy!
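What these numbers mean for distance measurements can be illustrated with the standard relation between parallax and distance (a parallax of 1 mas corresponds to 1000 parsecs). The sketch uses only the random uncertainties quoted above and ignores the systematic component; the distances are arbitrary examples.

```python
def parallax_mas(distance_pc):
    """Parallax in milliarcseconds for a star at the given distance in parsecs."""
    return 1000.0 / distance_pc


def relative_parallax_error(distance_pc, sigma_mas):
    """Parallax uncertainty as a fraction of the parallax itself."""
    return sigma_mas / parallax_mas(distance_pc)


DR1_SIGMA_MAS = 0.3                # typical DR1 uncertainty quoted above
END_OF_MISSION_SIGMA_MAS = 0.025   # 25 micro-arcsec for a magnitude 15 star

for distance in (100, 1_000, 10_000):   # parsecs, illustrative distances
    dr1 = relative_parallax_error(distance, DR1_SIGMA_MAS)
    final = relative_parallax_error(distance, END_OF_MISSION_SIGMA_MAS)
    print(f"{distance:>6} pc: DR1 {dr1:.1%}, end of mission {final:.2%}")
```

For a star 1000 parsecs away the parallax is 1 mas, so a 0.3 mas uncertainty is a 30% error, while the end-of-mission target would bring it down to 2.5%.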

Q11. The first catalogue of more than a billion stars from ESA’s Gaia satellite was published on 14 September 2016 – the largest all-sky survey of celestial objects to date. What data is in this catalog? What is the size and structure of the information you analysed so far?

Gaia DR1 contains astrometry, G-band photometry (brightnesses), and a modest number of variable star light curves, for a total of 1 142 679 769 sources [See Gaia Data Release 1. Summary of the astrometric, photometric, and survey properties]. For the large majority of those we only provide position and magnitude but about 2 Million stars also have parallaxes and proper motions. In DR2 these numbers will be substantially larger.
The information is structured in simple, easy-to-use tables which can be queried via the central Gaia Archive  and a number of other data centres around the world.

Q12. What insights have been derived so far by analysing this data?

The astronomical community eagerly grabbed the DR1 data and since 14 September a couple of hundred scientific articles have appeared in peer-reviewed astronomical journals covering a large breadth of topics.
To give just one example: A new so-called open cluster of stars was discovered very close to the brightest star in the night sky, Sirius. All previous surveys had missed it!

Q13. How do you offer a proper infrastructure and middleware upon which scientists will be able to do exploration and modeling with this huge data set?

That is a very good question! At the moment the archive system does not yet allow real big data mining using the entire large Gaia data set. Up to now we do not know precisely what scientists will want to do with the Gaia data in the end.

There is the “traditional” astronomical research which mostly uses only subsets of the data, e.g. all stars in a particular area of the sky. Such data requests can be satisfied with traditional queries to a RDBMS.

But in the future we expect also applications which will need data mining capabilities and we are experimenting with a number of different approaches using the “code-to-the-data” paradigm. The idea is that scientists will be able to upload and deploy their codes directly through a platform which allows execution with quick data access close to the archive.

For DR2 this will only be available for DPAC-internal use but, depending on experiences gained, from DR3 onwards it might become a service for public use. One technology we are looking at is Apache Spark for big data mining.

Q14. What software technologies do you use for accessing the Gaia catalogue and associated data?

As explained above, at the moment we are offering access to the catalogue only through a traditional RDBMS system which allows queries to be submitted in a special SQL dialect called ADQL (Astronomical Data Query Language). This DB system is not using InterSystems Caché but Postgres.
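A typical "all stars in a particular area of the sky" request looks roughly like the following in ADQL. The helper below only builds the query text. `CONTAINS`, `POINT` and `CIRCLE` are standard ADQL geometry functions, but the table name `gaiadr1.gaia_source`, the column names, and the example coordinates are assumptions that should be checked against the archive's schema documentation.

```python
def cone_search_adql(ra_deg, dec_deg, radius_deg, limit=100,
                     table="gaiadr1.gaia_source"):
    """Build an ADQL cone-search query as a string.

    The table and column names are assumptions, not taken from the interview.
    """
    if not 0.0 <= ra_deg < 360.0 or not -90.0 <= dec_deg <= 90.0:
        raise ValueError("coordinates out of range")
    if radius_deg <= 0 or limit <= 0:
        raise ValueError("radius and limit must be positive")
    return (
        f"SELECT TOP {int(limit)} source_id, ra, dec, phot_g_mean_mag\n"
        f"FROM {table}\n"
        f"WHERE 1 = CONTAINS(POINT('ICRS', ra, dec),\n"
        f"                   CIRCLE('ICRS', {ra_deg}, {dec_deg}, {radius_deg}))\n"
        f"ORDER BY phot_g_mean_mag"
    )


# Illustrative coordinates in degrees; any sky position works the same way.
query = cone_search_adql(ra_deg=56.75, dec_deg=24.12, radius_deg=0.5, limit=10)
print(query)
```

The resulting text can be pasted into the archive's query interface or submitted through a client of one's choice; the sketch deliberately stops short of making a network call.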

Q15. In addition to the query access, how do you “visualize” such data? Which “big data” techniques do you use for histograms production?

Visualization is done with a special custom-made application that sits close to the archive and is using not the raw data but pre-computed special objects especially constructed for fast visualization. We are not routinely using any big data techniques but are experimenting with a few key concepts.

For visualization one interesting novel application is called vaex and we are looking at it.
Histogramming of the entire data set is likewise done using pre-canned summary statistics which were generated when the data was ingested into the archive. The number of users really wanting the entire data set and this kind of functionality is very limited at the moment. We as well as the scientific community are still learning what can be done with the Gaia data set.
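
The idea of answering histogram requests from statistics computed at ingestion time can be sketched as follows. This is a toy illustration, not the archive's implementation: the fine bin width, the function names and the sample magnitudes are all hypothetical. Each ingested chunk is reduced once to counts per fine bin; a later histogram request only merges and re-bins those counts and never touches the raw values again.

```python
import math
from collections import Counter

FINE_BIN_WIDTH = 0.5  # hypothetical fine bin width, in magnitudes


def summarize_chunk(magnitudes, bin_width=FINE_BIN_WIDTH):
    """At ingestion: reduce one chunk of values to counts per fine bin."""
    counts = Counter()
    for m in magnitudes:
        counts[math.floor(m / bin_width)] += 1
    return counts


def merge_summaries(summaries):
    """Combine per-chunk summaries into one summary for the whole set."""
    total = Counter()
    for summary in summaries:
        total.update(summary)
    return total


def rebin(fine_counts, factor):
    """At query time: merge `factor` adjacent fine bins into one coarse bin."""
    coarse = Counter()
    for fine_bin, n in fine_counts.items():
        coarse[fine_bin // factor] += n
    return coarse


# Two hypothetical ingestion chunks.
chunk_a = [10.2, 10.7, 11.9, 14.3]
chunk_b = [10.4, 12.1, 14.9, 15.5]
stored = [summarize_chunk(chunk_a), summarize_chunk(chunk_b)]

# Histogram with 2-magnitude bins, built from the stored summaries only.
coarse = rebin(merge_summaries(stored), factor=4)
for coarse_bin in sorted(coarse):
    low = coarse_bin * 4 * FINE_BIN_WIDTH
    print(f"[{low:.1f}, {low + 4 * FINE_BIN_WIDTH:.1f}): {coarse[coarse_bin]}")
```

The trade-off in this sketch is that the finest resolution a user can ask for is fixed by the bin width chosen at ingestion.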

Q16. Which “big data” software and hardware technologies did you use so far? And what are the lessons learned?

Again, we are only starting to look into big data technologies that may be useful for us. Until now most of the effort has gone into robustifying all systems and preparing DR1 and now DR2 for April next year. One issue is always that the Gaia data is so peculiar and special that COTS solutions rarely work. Most of the software systems we use are special developments.

Q17. What are the main technical challenges ahead?

As far as the daily systems are concerned we are now finally in the routine phase. The main future challenges lie in robustifying and validating the big outer iterative loop that I described above. It has not been tested yet, so, we are executing it for the first time with real flight data.

Producing DR3 (mid to late 2020) will be a challenge as this for the first time involves output from all CUs and the results from the outer iterative loop. DR4 around end 2022 is then the final release for the nominal mission and for that we want to release “everything”. This means also the individual observation data (“epoch data”) which will inflate the total volume served by the archive by a factor of 100 or so.

Qx Anything else you wish to add?

Gaia continues to be a challenging mission in all areas even after 4 years of operation. In total we have processed almost 800 Billion (=800,000 Million) astrometric, 160 Billion (=160,000 Million) photometric and more than 15 Billion spectroscopic observations, which is the largest astronomical dataset from a science space mission until the present day.

Gaia is fulfilling its promises in every regard and the scientific community is eagerly exploring what is already available and awaiting the coming data releases. This continues to be a great source of motivation for everybody working on this great mission.

Uwe Lammers.  My academic background is in physics and computer science. After my PhD I joined ESA to first work on the X-ray missions EXOSAT, Beppo-SAX, and XMM-Newton before getting interested in Gaia in 2005. The first years I led the development of the so-called Astrometric Global Iterative Solution (AGIS) system and then became Gaia’s Science Operations Manager in 2014.

– The astrometric core solution for the Gaia mission. Overview of models, algorithms, and software implementation
L. Lindegren, U. Lammers et al. Astronomy & Astrophysics, Volume 538, id.A78, 47 pp. February 2012, DOI: 10.1051/0004-6361/201117905

– Gaia Data Release 1. Summary of the astrometric, photometric, and survey properties A.G.A. Brown and Gaia Collaboration, Astronomy & Astrophysics, Volume 595, id.A2, 23 pp. November 2016, DOI: 10.1051/0004-6361/201629512

– Gaia Data Release 1. Astrometry: one billion positions, two million proper motions and parallaxes L. Lindegren, U. Lammers, et al. Astronomy & Astrophysics, Volume 595, id.A4, 32 pp. November 2016, DOI: 10.1051/0004-6361/201628714

– The Gaia Archive, Alcione Mora et al. Astrometry and Astrophysics in the Gaia Sky, Proceedings IAU Symposium No. 330, 2017

Related Posts

– The Gaia mission in 2015. Interview with Uwe Lammers and Vik Nagjee. ODBMS Industry Watch, March 24, 2015

– The Gaia mission, one year later. Interview with William O’Mullane.  ODBMS Industry Watch, January 16, 2013

– Objects in Space vs. Friends in Facebook.  ODBMS Industry Watch, April 13, 2011

– Objects in Space.  ODBMS Industry Watch, February 14, 2011


Follow us on Twitter: @odbmsorg


New Gartner Magic Quadrant for Operational Database Management Systems. Interview with Nick Heudecker http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/ http://www.odbms.org/blog/2016/11/new-gartner-magic-quadrant-for-operational-database-management-systems-interview-with-nick-heudecker/#comments Wed, 30 Nov 2016 20:30:20 +0000 http://www.odbms.org/blog/?p=4272

“It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.”–Nick Heudecker.

I have interviewed Nick Heudecker, Research Director on Gartner’s Data & Analytics team.
The main topic of the interview is the new Magic Quadrant for Operational Database Management Systems.


Q1. You have published the new Magic Quadrant for Operational Database Management Systems (*). How do you define the operational database management system market?

Nick Heudecker: We define a DBMS as a complete software system used to define, create, manage, update and query a database. DBMSs provide interfaces to independent programs and tools that both support and govern the performance of a variety of concurrent workload types. There is no presupposition that DBMSs must support the relational model or that they must support the full set of possible data types in use today. OPDBMSs must include functionality to support backup and recovery, and have some form of transaction durability — although the atomicity, consistency, isolation and durability model is not a requirement. OPDBMSs may support multiple delivery models, such as stand-alone DBMS software, certified configurations, cloud (public and private) images or versions, and database appliances.

Q2. Can you explain the methodology you used for this new Magic Quadrant?

Nick Heudecker: Several Gartner methodologies are public. The Magic Quadrant methodology can be found here.

We use a number of data sources when we’re creating the Magic Quadrant for Operational Database Management Systems.
We survey vendor reference customers and include data from our interactions with Gartner clients. We also consider earlier information and any news about vendors’ products, customers and finances that came to light during the time frame for our analysis.

Once we have the data, we score vendors across the various dimensions of Completeness of Vision and Ability to Execute.
One thing that’s important to note is Magic Quadrants are relative assessments of vendors in a market. We couldn’t have one vendor on an MQ because it would be right in the middle – there’s nothing to compare it to.

Q3. Why were there no Visionaries this year?

Nick Heudecker: We determined there was an overall lack of vision in the market. After a few years of rapid feature expansion, the focus has shifted to operational excellence and execution. Even Leaders shifted to the left on vision, but are still placed in the Leaders quadrant based on their vision for the development of hybrid database management, hardware optimization and integration, emerging deployment models such as containerization, as well as vertical features.

Q4. Were you surprised by the analysis and some of the results you obtained?

Nick Heudecker: The lack of overall vision in the market struck us the most. Other than in a few notable cases, we received largely the same story from most vendors. The explosion of features, and the vendors emerging to implement them, has slowed. The features that initiated the expansion, such as storing new data types, geographically distributed storage, cloud and flexible data consistency models, have become common. Today, nearly every established or emerging DBMS vendor supports these features to some degree. The OPDBMS market has shifted from a phase of rapid innovation to a phase of maturing products and capabilities.

Q5. Do you believe the “NoSQL” label will continue to distinguish DBMSs?

Nick Heudecker: If you look at the entire operational DBMS space, there’s already a great deal of convergence between NoSQL vendors, as well as between NoSQL and traditionally relational vendors. Nearly every vendor, nonrelational and relational, supports multiple data types, like JSON documents, graph or wide-column. NoSQL vendors are adding SQL: MongoDB’s BI Connector and Couchbase’s N1QL are good, if diverse, examples. They’re also adding things like schema management and data validation capabilities.
On the relational side, they’re adding horizontal scaling options and alternative consistency models, as well as modern APIs. And everyone either has or is adding in-memory and cloud capabilities.

It is too soon to call the operational DBMS market a commodity market, but it’s easy to see a future where that is the case.

Q6. What are the other “Vendors to Consider”?

Nick Heudecker: The other vendors to consider are vendors that did not meet the inclusion requirements for the Magic Quadrant. Usually this is because they missed our minimum revenue requirements, but that doesn’t mean they don’t have compelling products.

Nick Heudecker is a Research Director on Gartner’s Data & Analytics team. His coverage includes data management technologies and practices.


(*) Magic Quadrant for Operational Database Management Systems. Published: 05 October 2016. ID: G00293203. Analyst(s): Nick Heudecker, Donald Feinberg, Merv Adrian, Terilyn Palanca, Rick Greenwald

– Complimentary Gartner Research: 100 Data and Analytics Predictions Through 2020. Get exclusive access to Gartner’s top 100 data and analytics predictions through 2020. Plus access other relevant Gartner research including Magic Quadrant reports for database and data warehouse solutions, and the market guide for in-memory computing (LINK to MemSQL web site – registration required).

Related Posts

MarkLogic Named a Next-Generation Database Challenger in 2016 Gartner Magic Quadrant. By GARY BLOOM, Chief Executive Officer and President MARKLOGIC

MarkLogic Recognized in New Gartner® Magic Quadrant. Gartner Magic Quadrant for Operational Database Management Systems positions MarkLogic® the highest for ability to execute in the Challengers Quadrant

– Accelerating Business Value with a Multi-Model, Multi-Workload Data Platform

– NuoDB Recognized by Gartner in Critical Capabilities for Operational Database Management Systems. Elastic SQL database achieves top five score in all four use cases.

– Clustrix Recognized in Gartner Magic Quadrant for Operational Database Management Systems

– Learn why EDB is named a “Challenger” in the 2016 Gartner ODBMS Magic Quadrant

– DataStax Receives Highest Scores in 2 Use Cases in Gartner’s Critical Capabilities for Operational Database Management Systems

– Gartner Scores Oracle Highest In 3 of 4 Use Cases: Gartner Critical Capabilities for Operational Database Management Systems Report

– Gartner Critical Capabilities For Operational Database Management Systems 2016 – Redis Labs Ranked Second Highest In 2/4 Categories (Link – Registration required)


Follow us on Twitter: @odbmsorg


On Data Interoperability. Interview with Julie Lockner. http://www.odbms.org/blog/2016/06/on-data-interoperability-interview-with-julie-lockner/ http://www.odbms.org/blog/2016/06/on-data-interoperability-interview-with-julie-lockner/#comments Tue, 07 Jun 2016 16:47:14 +0000 http://www.odbms.org/blog/?p=4151

“From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care?”– Julie Lockner.

I have interviewed Julie Lockner. Julie leads data platform product marketing for InterSystems. Main topics of the interview are Data Interoperability and InterSystems’ data platform strategy.


Q1. Everybody is talking about Big Data — is the term obsolete?

Julie Lockner: Well, there is no doubt that the sheer volume of data is exploding, especially with the proliferation of smart devices and the Internet of Things (IoT). An overlooked aspect of IoT is the enormous volume of data generated by a variety of devices, and how to connect, integrate and manage it all.

The real challenge, though, is not just processing all that data, but extracting useful insights from the variety of device types. Put another way, not all data is created using a common standard. You want to know how to interpret data from each device, know which data from what type of device is important, and which trends are noteworthy. Better information can create better results when it can be aggregated and analyzed consistently, and that’s what we really care about. Better, higher quality outcomes, not bigger data.

Q2. If not Big Data, where do we go from here?

Julie Lockner: We always want to be focusing on helping our customers build smarter applications to solve real business challenges, such as helping them to better compete on service, roll out high-quality products quicker, simplify processes – not build solutions in search of a problem. A canonical example is in retail. Our customers want to leverage insight from every transaction they process to create a better buying experience online or at the point of sale. This means being able to aggregate information about a customer, analyze what the customer is doing while on the website, and make an offer at transaction time that would delight them. That’s the goal – a better experience – because that is what online consumers expect.

From a healthcare perspective, how can we aggregate all the medical data, in all forms from multiple sources, such as wearables, home medical devices, MRI images, pharmacies and so on, and also blend in intelligence or new data sources, such as genomic data, so that doctors can make better decisions at the point of care? That implies we are analyzing not just more data, but better data that comes in all shapes and sizes, and that changes more frequently. It really points to the need for data interoperability.

Q3. What are the challenges software developers are telling you they have in today’s data-intensive world?

Julie Lockner: That they have too many database technologies to choose from and prefer to have a simple data platform architecture that can support multiple data models and multiple workloads within a single development environment.
We understand that our customers need to build applications that can handle a vast increase in data volume, but also a vast array of data types – static, non-static, local, remote, structured and non-structured. It must be a platform that coalesces all these things, brings services to data, offers a range of data models, and deals with data at any volume to create a more stable, long-term foundation. They want all of these capabilities in one platform – not a platform for each data type.

For software developers today, it’s not enough to pick elements that solve some aspect of a problem and build enterprise solutions around them; not all components scale equally. You need a common platform without sacrificing scalability, security, resilience, rapid response. Meeting all these demands with the right data platform will create a successful application.
And the development experience is significantly improved and productivity drastically increased when they can use a single platform that meets all these needs. This is why they work with InterSystems.

Q4. Traditionally, analytics is used with structured data, “slicing and dicing” numbers. But the traditional approach also involves creating and maintaining a data warehouse, which can only provide a historical view of data. Does this work also in the new world of Internet of Things?

Julie Lockner: I don’t think so. It is generally possible to take amorphous data and build it into a structured data model, but to respond effectively to rapidly changing events, you need to be able to take data in the form in which it comes to you.

If your data platform lacks certain fields, if you lack schema definition, you need to be able to capitalize on all these forms without generating a static model or a refinement process. With a data warehouse approach, it can take days or weeks to create fully cleansed, normalized data.
That’s just not fast enough in today’s always-on world – especially as machine-generated data is not conforming to a common format any time soon. It comes back to the need for a data platform that supports interoperability.

Q5. How hard is it to make decisions based on real-time analysis of structured and unstructured data?

Julie Lockner: It doesn’t have to be hard. You need to generate rules that feed rules engines that, in turn, drive decisions, and then constantly update those rules. That is a radical enhancement of the concept of analytics in the service of improving outcomes, as more real-time feedback loops become available.

The collection of changes we describe as Big Data will profoundly transform enterprise applications of the future. Today we can see the potential to drive business in new ways and take advantage of a convergence of trends, but it is not happening yet. Where progress has been made is the intelligence of devices and first-level data aggregation, but not in the area of services that are needed. We’re not there yet.

Q6. What’s next on the horizon for InterSystems in meeting the data platform requirements of this new world?

Julie Lockner: We continually work on our data platform, developing the most innovative ways we can think of to integrate with new technologies and new modes of thinking. Interoperability is a hugely important component. It may seem a simple task to get to the single most pertinent fact, but the means to get there may be quite complex. You need to be able to make the right data available – easily – to construct the right questions.

Data is in all forms and at varying levels of completeness, cleanliness, and accuracy. For data to be consumed as we describe, you need measures of how well you can use it. You need to curate data so it gets cleansed and you can cull what is important. You need flexibility in how you view data, too. Gathering data without imposing an orthodoxy or structure allows you to gain access to more data. Not all data will conform to a schema a priori.

Q7. Recently you conducted a benchmark test of an application based on InterSystems Caché®. Could you please summarize the main results you have obtained?

Julie Lockner: One of our largest customers is Epic Systems, one of the world’s top healthcare software companies.
Epic relies on Caché as the data platform for electronic medical record solutions serving more than half the U.S. patient population and millions of patients worldwide.

Epic tested the scalability and performance improvements of Caché version 2015.1. Almost doubling the scalability of prior versions, Caché delivers what Epic President Carl Dvorak has described as “a key strategic advantage for our user organizations that are pursuing large-scale medical informatics programs as well as aggressive growth strategies in preparation for the volume-to-value transformation in healthcare.”

Qx Anything else you wish to add?

Julie Lockner: The reason why InterSystems has succeeded in the market for so many years is a commitment to the success of those who depend on our technology. A recent Gartner Magic Quadrant report found we had the highest number of customers surveyed – 85% – who would buy from us again. That is the highest number of any vendor participating in that study.

The foundation of the company’s culture is all about helping our customers succeed. When our customers come to us with a challenge, we all pitch in to solve it. Many times our solutions may address an unusual problem that could benefit others – which then becomes the source of many of our innovations. It is one of the ways we are using problem-solving skills as a winning strategy to benefit others. When our customers are successful at using our engine to solve the world’s most important challenges, we all win.


Julie Lockner leads data platform product marketing for InterSystems. She has more than 20 years of experience in IT product marketing management and technology strategy, including roles at analyst firm ESG as well as Informatica and EMC.



“InterSystems Unveils Major New Release of Caché,” Feb. 25, 2015.

“Gartner Magic Quadrant for Operational DBMS,” Donald Feinberg, Merv Adrian, Nick Heudecker, Adam M. Ronthal, and Terilyn Palanca, October 12, 2015, ID: G00271405.

– White Paper: Big Data Healthcare: Data Scalability with InterSystems Caché® and Intel® Processors (LINK to .PDF)

Related Posts

– A Grand Tour of Big Data. Interview with Alan Morrison. ODBMS Industry Watch, February 25, 2016

–  RIP Big Data. By Carl Olofson, Research Vice President, Data Management Software Research, IDC. ODBMS.org, JANUARY 6, 2016.

What is data blending. By Oleg Roderick, David Sanchez, Geisinger Data Science. ODBMS.org, November 2015

Follow us on Twitter: @odbmsorg


The Gaia mission in 2015. Interview with Uwe Lammers and Vik Nagjee http://www.odbms.org/blog/2015/03/gaia-mission/ http://www.odbms.org/blog/2015/03/gaia-mission/#comments Tue, 24 Mar 2015 10:10:00 +0000 http://www.odbms.org/blog/?p=3810

“Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.”–Dr. Uwe Lammers.

“The Gaia mission is considered to be the largest data processing challenge in astronomy.”–Vik Nagjee

In December of 2013, the European Space Agency (ESA) launched a satellite called Gaia on a five-year mission to map the galaxy and learn about its past.

The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”.

I recall here the Objectives of the Gaia Project (source ESA Web site):

“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”

I have been following the Gaia mission since 2011, and I have reported on it in two interviews until now. This is the third interview of the series, the first one after the launch.
The interview is with Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency, and Vik Nagjee, Product Manager for Data Platforms at InterSystems.


Q1. Could you please elaborate in some detail what is the goal and what are the expected results of the Gaia mission?

Uwe Lammers: We are trying to construct the most consistent, most complete and most accurate astronomical catalog ever done. Completeness means to observe all objects in the sky that are brighter than a so-called magnitude limit of 20. These are mostly stars in our Milky Way up to 1.5 billion in number. In addition, we expect to observe as many as 10 million other galaxies, hundreds of thousands of celestial bodies in our solar system (mostly asteroids), tens of thousands of new exo-planets, and more. Some believe that the Gaia data will revolutionize astronomy! Only time will tell if that is true, but it is clear that it will be a treasure trove for astronomers for decades to come.

Vik Nagjee: The data collected from Gaia will ultimately result in a three-dimensional map of the Milky Way, plotting over a billion celestial objects at a distance of up to 30,000 light years. This will reveal the composition, formation and evolution of the Galaxy, and will enable the testing of Albert Einstein’s Theory of Relativity, the space-time continuum, and gravitational waves, among other things. As such, the Gaia mission is considered to be the largest data processing challenge in astronomy.

Orbiting the Lagrange 2 (L2) point, a fixed spot 1.5 million kilometers from Earth, Gaia will measure the position, movement, and brightness of more than a billion celestial objects, looking at each one an average of 70 times over the course of five years. Gaia’s measurements will be much more complete, powerful, and accurate than anything that has been done before. ESA scientists estimate that Gaia will find hundreds of thousands of new celestial objects, including extra-solar planets, and the failed stars known as brown dwarfs. In addition, because Gaia can so accurately measure the position and movement of the stars, it will provide valuable information about the galaxy’s past – and future – evolution.

Read more about the Gaia mission here.

Q2. What is the size and structure of the information you analysed so far?

Uwe Lammers: From the start of the nominal mission on 25 July until today, we have received about 13 terabytes of compressed binary telemetry from the satellite. The daily pipeline running here at the Science Operations Centre (SOC) has processed all this and generated about 48 TB of higher-level data products for downstream systems.
At the end of the mission, the Main Database (MDB) is expected to hold more than 1 petabyte of data. The structure of the data is complex and this is one of the main challenges of the project. Our data model contains about 1,500 tables with thousands of fields in total, and many inter-dependencies. The final catalog to be released sometime around 2020 will have a simpler structure, and there will be ways to access and work with it in a convenient form, of course.

Q3. Since the launch of Gaia in December 2013, what intermediate results did you obtain by analysing the data received so far?

Uwe Lammers: Last year we found our first supernova (exploding star)  with the prototype of the so-called Science Alert pipeline. When this system is fully operational, we expect to find several of these per day. The recent detection of a micro-lensing event was another nice demonstration of Gaia’s capabilities.

Q4. Did you find out any unexpected information and/or confirmation of theories by analysing the data generated by Gaia so far?

Uwe Lammers: It is still too early in the mission to prove or disprove established astronomical theories. For that we need to collect more data and do much more processing. The daily SOC pipeline is only one, the first part, of a large distributed system that involves five other Data Processing Centres (DPCs), each running complex scientific algorithms on the data. The whole system is designed to improve the results iteratively, step by step, until the final accuracy has been reached. However, there will certainly be intermediate results. One simple example of an unexpected early finding is that Gaia gets hit by micro-meteoroids much more often than pre-launch estimates predicted.

Q5. Could you please explain, at a high level, Gaia’s data pipeline?

Uwe Lammers: Hmmm, that’s not easy to do in a few words. The daily pipeline at the SOC converts compact binary telemetry of the satellite into higher level products for the downstream systems at the SOC and the other processing centres. This sounds simple, but it is not – mainly because of the complex dependencies and the fact that data does not arrive from the satellite in strict time order. The output of the daily pipeline is only the start as mentioned above.

From the SOC, data gets sent out daily to the other DPCs, which perform more specialized processing. After a number of months we declare the current data segment as closed, receive the outputs from the other DPCs back at the SOC, and integrate all into a coherent next version of the MDB. The creation of it marks the end of the current iteration and the start of a new one. This cyclic processing will go on for as many iterations as needed to converge to a final result.
An important key process is the Astrometric Global Iterative Solution (AGIS), which will give us the astrometric part of the catalog. As the name suggests, it is in itself an iterative process and we run it likewise here at the SOC.

Vik Nagjee: To add on to what Dr. Lammers describes, Gaia data processing is handled by a pan-European collaboration, the Gaia Data Processing and Analysis Consortium (DPAC), and consists of about 450 scientists and engineers from across Europe. The DPAC is organized into nine Coordination Units (CUs); each CU is responsible for a specific portion of the Gaia data processing challenge.

One of the CUs – CU3: Core Processing – is responsible for unpacking, decompressing, and processing the science data retrieved from the satellite to provide rapid monitoring and feedback of the spacecraft and payload performances at the ultra-precise accuracy levels targeted by the mission. In other words, CU3 is responsible for ensuring the accuracy of the data collected by Gaia, as it is being collected, to ensure the accuracy of the eventual 3-D catalog of the Milky Way.

Over its lifetime, Gaia will generate somewhere between 500,000 and 1 million GB of data. On an average day, approximately 50 million objects will “transit” Gaia’s field of view, resulting in about 285 GB of data. When Gaia is surveying a densely populated portion of the galaxy, the daily amount could be 7 to 10 times as much, climbing to over 2,000 GB of data in a day.
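
As a rough cross-check of these figures, the short calculation below multiplies the quoted average daily volume over a five-year mission and applies the quoted 7x to 10x factor for dense regions. The daily volume, the five-year duration and the multipliers come from the interview; the 365.25 days per year and the variable names are assumptions made for this sketch.

```python
AVERAGE_GB_PER_DAY = 285      # quoted average daily volume
MISSION_YEARS = 5             # nominal mission length quoted in the post
DAYS_PER_YEAR = 365.25        # assumption for this estimate

mission_days = MISSION_YEARS * DAYS_PER_YEAR
total_at_average_rate_gb = AVERAGE_GB_PER_DAY * mission_days

# Daily volume when surveying dense regions (7 to 10 times the average).
dense_day_low_gb = AVERAGE_GB_PER_DAY * 7
dense_day_high_gb = AVERAGE_GB_PER_DAY * 10

print(f"Mission total at the average rate: {total_at_average_rate_gb:,.0f} GB")
print(f"Dense-region day: {dense_day_low_gb:,} to {dense_day_high_gb:,} GB")
```

At the average rate alone the total lands near the lower end of the quoted 500,000 to 1 million GB range, and the dense-region days account for the rest of the spread.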

There is an eight-hour window of time each day when raw data from Gaia is downloaded to one of three ground stations.
The telemetry is sent to the European Space Astronomy Centre (ESAC) in Spain – the home of CU3: Core Processing – where the data is ingested and staged.
The initial data treatment converts the data into the complex astrometric data models required for further computation. These astrometric objects are then sent to various other Coordination Units, each of which is responsible for looking at different aspects of the data. Eventually the processed data will be combined into a comprehensive catalog that will be made available to astronomers around the world.

In addition to performing the initial data treatment, ESAC also processes the resulting astrometric data with some complex algorithms to take a “first-look” at the data, making sure that Gaia is operating correctly and sending back good information. This processing occurs on the Initial Data Treatment / First Look (IDT/FL) Database; the data platform for the IDT/FL database is InterSystems Caché.

Q6. Observations made and conclusions drawn are only as good as the data that supports them. How do you evaluate the “quality” of the data you receive? and how do you discard the “noise” from the valuable information?

Uwe Lammers: A very good question! If you refer to the final catalog, this is a non-trivial problem and a whole dedicated group of people is working on it. The main issue is, of course, that we do not know the “true” values as in simulations. We work with models, e.g., models of the stars’ positions and the satellite orientation. With those we can predict the observations, and the difference between the predicted and the observed values tells us how well our models represent reality. We can also do consistency checks. For instance, we do two runs of AGIS, one with only the observations from odd months and another one from even months, and both must give similar results. But we will also make use of external astronomical knowledge to validate results, e.g., known distances to particular stars. For distinguishing “noise” from “signal,” we have implemented robust outlier rejection schemes. The quality of the data coming directly from the satellite and from the daily pipeline is assessed with a special system called First Look running also at the SOC.
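
Two of the checks described above, robust outlier rejection and the comparison of odd-month and even-month runs, can be sketched on toy data. This is not the AGIS software or the First Look system: the median-absolute-deviation rule, the threshold, the tolerance, the function names and the sample values are all assumptions chosen for illustration, and a plain mean stands in for the real astrometric solution.

```python
from statistics import mean, median


def reject_outliers(values, threshold=5.0):
    """Keep values within `threshold` scaled MADs of the median."""
    med = median(values)
    mad = median([abs(v - med) for v in values])
    if mad == 0:
        return list(values)
    scale = 1.4826 * mad  # converts the MAD to a sigma for Gaussian noise
    return [v for v in values if abs(v - med) <= threshold * scale]


def split_by_month_parity(observations):
    """Split (month_index, value) pairs into odd-month and even-month values."""
    odd = [v for month, v in observations if month % 2 == 1]
    even = [v for month, v in observations if month % 2 == 0]
    return odd, even


def consistency_check(observations, tolerance):
    """Estimate the same quantity from both halves and compare the results."""
    odd, even = split_by_month_parity(observations)
    est_odd = mean(reject_outliers(odd))
    est_even = mean(reject_outliers(even))
    return abs(est_odd - est_even) <= tolerance, est_odd, est_even


# Hypothetical measurements of one quantity; month 8 contains a glitch.
observations = [
    (1, 10.0), (2, 10.1), (3, 9.9), (4, 10.0), (5, 10.1),
    (6, 9.9), (7, 10.0), (8, 55.0), (9, 10.05), (10, 9.95),
]
ok, est_odd, est_even = consistency_check(observations, tolerance=0.05)
print(ok, round(est_odd, 4), round(est_even, 4))
```

Without the rejection step the single glitch would dominate the even-month estimate and the two halves would disagree, which is the kind of discrepancy such a split run is meant to expose.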

Vik Nagjee: The CU3: Core Processing Unit is responsible for ensuring the accuracy of the data being collected by Gaia, as it is being collected, so as to ensure the accuracy of the eventual 3-D catalog of the Milky Way.

InterSystems Caché is the data platform used by CU3 to quickly determine that Gaia is working properly and that the data being downloaded is trustworthy. Caché was chosen for this task because of its proven ability to rapidly ingest large amounts of data, populate extremely complex astrometric data models, and instantly make the data available for just-in-time analytics using SQL, NoSQL, and object paradigms.

One million GB of data easily qualifies as Big Data. What makes InterSystems Caché unique is not so much its ability to handle very large quantities of data, but its ability to provide just-in-time analytics on just the right data.
We call this “Big Slice” — which is where analytics is performed just-in-time for a focused result.

A good analogy is how customer service benefits from occasional Big Data analytics. Breakthrough customer service comes from improving service at the point of service, one customer at a time, based on just-in-time processing of a Big Slice – the data relevant to the customer and her interactions. Back to the Gaia mission: at the conclusion of five years of data collection, a true Big Data exercise will plot the solar map. Yet, frequently ensuring data accuracy is an example of the increasing strategic need for our “Big Slice” concept.

Q7. What kind of databases and analytics tools do you use for Gaia’s data pipeline?

Uwe Lammers: At the SOC all systems use InterSystems’ Caché database. Despite some initial hiccups, Caché has proved to be a good choice for us. For analytics we use a few popular generic astronomical tools (e.g., topcat), but most are custom-made and specific to Gaia data. All DPCs had originally used relational databases, but some have migrated to Apache’s Hadoop.

Q8. Specifically for the Initial Data Treatment/First Look (IDT/FL) database, what are the main data management challenges you have?

Uwe Lammers: The biggest challenge is clearly the data volumes and the steady incoming stream that will not stop for the next five years. The satellite sends us 40-100 GB of compressed raw data every day, which the daily pipeline needs to process, storing the output in near real time, as otherwise we quickly accumulate backlogs.
This means all components (the hardware, databases, and software) have to run and work robustly more or less around the clock. The IDT/FL database grows daily by a few hundred gigabytes, but not all data has to be kept forever. An automatic cleanup process deletes data that falls outside the chosen retention periods. Keeping all this machinery running around the clock is tough!
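A retention-based cleanup of the kind described above could look roughly like the following sketch. The categories, the retention periods, and the record layout are hypothetical; the interview does not say what the real periods are or how the cleanup process is implemented.

```python
from datetime import datetime, timedelta

# Hypothetical retention periods per data category (not Gaia's actual values).
RETENTION = {
    "raw": timedelta(days=30),
    "intermediate": timedelta(days=7),
}

def cleanup(records, now):
    """Return only the records that are still inside their retention period."""
    return [
        r for r in records
        if now - r["created"] <= RETENTION[r["category"]]
    ]

now = datetime(2015, 3, 1)
records = [
    {"id": 1, "category": "raw", "created": datetime(2015, 2, 20)},           # 9 days old
    {"id": 2, "category": "raw", "created": datetime(2015, 1, 1)},            # 59 days old
    {"id": 3, "category": "intermediate", "created": datetime(2015, 2, 27)},  # 2 days old
    {"id": 4, "category": "intermediate", "created": datetime(2015, 2, 10)},  # 19 days old
]
kept = cleanup(records, now)
kept_ids = [r["id"] for r in kept]
```

In a real database the same rule would normally be a scheduled delete statement with a date predicate rather than an in-memory filter.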

Vik Nagjee: Gaia’s data pipeline imposes some rather stringent requirements on the data platform used for the Initial Data Treatment/First Look (IDT/FL) database. The technology must be capable of ingesting a large amount of data and converting it into complex objects very quickly. In addition, the data needs to be immediately accessible for just-in-time analytics using SQL.

ESAC initially attempted to use traditional relational technology for the IDT/FL database, but soon discovered that a traditional RDBMS couldn’t ingest discrete objects quickly enough. To achieve the required insert rate, the data would have to be ingested as large BLOBs of approximately 50,000 objects, which would make further analysis extremely difficult. In particular, the first look process, which requires rapid, just-in-time analytics of the discrete astrometric data, would be untenable. Another drawback to using traditional relational technology, in addition to the typical performance and scalability challenges, was the high cost of the hardware that would be needed.

Since traditional RDBMS technology couldn’t meet the stringent demands imposed by CU3, ESAC decided to use InterSystems Caché.

Q9. How did you solve such challenges and what lessons did you learn until now?

Uwe Lammers: I have a good team of talented and very motivated people and this is certainly one aspect.
In case of problems we are also totally dependent on quick response times from the hardware vendors, the software developers and InterSystems. This has worked well in the past, and InterSystems’ excellent support in all cases where the database was involved is much appreciated. As far as the software is concerned, the clear lesson is that rigorous validation testing is essential – the more the better. There can never be too much. As a general lesson, one of my favorite quotes from Einstein captures it well: “Everything should be made as simple as possible, but no simpler.”

Q10. What is the usefulness of CU3’s IDT/FL database for the Gaia mission so far?

Uwe Lammers: It is indispensable. It is the central working repository of all input/output data for the daily pipeline including the important health monitoring of the satellite.

Vik Nagjee: The usefulness of CU3’s IDT/FL database was proven early in Gaia’s mission. During the commissioning period for the satellite, an initial look at the data it was generating showed that extraneous light was being gathered. If the situation couldn’t be corrected, the extra light could significantly degrade Gaia’s ability to see and measure faint objects.

It was hypothesized that water vapor from the satellite outgassed in the vacuum of space, and refroze on Gaia’s mirrors, refracting light into its focal plane. Although this phenomenon was anticipated (and the mirrors equipped with heaters for that very reason), the amount of ice deposited was more than expected. Heating the mirrors melted the ice and solved the problem.

Scientists continue to rely on the IDT/FL database to provide just-in-time feedback about the efficacy and reliability of the data they receive from Gaia.

Qx. Anything else you wish to add?

Uwe Lammers: Gaia is by far the most interesting and challenging project I have ever worked on.
It is fascinating to see science, technology, and a large diverse group of people working together trying to create something truly great and lasting. Please all stay tuned for exciting results from Gaia to come!

Vik Nagjee: As Dr. Lammers said, Gaia is truly one of the most interesting and challenging computing projects of all time. I’m honored to have been a contributor to this project, and cannot wait to see the results from the Gaia catalog. Here’s to unraveling the chemical and dynamical history of our Galaxy!

Dr. Uwe Lammers, Gaia Science Operations Manager at the European Space Agency.
Uwe Lammers has a PhD in Physics and a degree in Computer Science and has been working for the European Space Agency on a number of space science missions for the past 20 years. After working on the X-ray missions EXOSAT, BeppoSAX, and XMM-Newton, he turned his attention to Gaia in 2004.
As of late 2005, together with William O’Mullane, he built up the Gaia Science Operations Centre (SOC) at ESAC near Madrid. From early 2006 to mid-2014 he was in charge of the development of AGIS and is now leading the SOC as Gaia Science Operations Manager.

Vik Nagjee is a Product Manager for Data Platforms at InterSystems.
He’s responsible for Performance and Scalability of InterSystems Caché, and spends the rest of his time helping people (prospects, application partners, end users, etc.) find perfect solutions for their data, processing, and system architecture needs.


ESA Web site: The GAIA Mission

ESA’s website for the Gaia Scientific Community.

Related Posts

The Gaia mission, one year later. Interview with William O’Mullane. ODBMS Industry Watch, January 16, 2013 

Objects in Space. ODBMS Industry Watch, February 14, 2011

Follow ODBMS.org on Twitter: @odbmsorg


Big Data: Three questions to InterSystems. http://www.odbms.org/blog/2014/01/big-data-three-questions-to-intersystems/ http://www.odbms.org/blog/2014/01/big-data-three-questions-to-intersystems/#comments Mon, 13 Jan 2014 10:41:24 +0000 http://www.odbms.org/blog/?p=2880

“The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS.” –Iran Hutchinson

I start this new year with a new series of short interviews with leading vendors of Big Data technologies. I call them “Big Data: three questions to”. The first such interview is with Iran Hutchinson, Big Data Specialist at InterSystems.


Q1. What is your current “Big Data” product offering?

Iran Hutchinson: InterSystems has actually been in the Big Data business for some time, since 1978, long before anyone called it that. We currently offer an integrated database, integration and analytics platform based on InterSystems Caché®, our flagship product, to enable Big Data breakthroughs in a variety of industries.

Launched in 1997, Caché is an advanced object database that provides in-memory speed with persistence, and the ability to ingest huge volumes of transactional data at insanely high velocity. It is massively scalable, because of its very lean design. Its efficient multidimensional data structures require less disk space and provide faster SQL performance than relational databases. Caché also provides sophisticated analytics, enabling real-time queries against transactional data with minimal maintenance and hardware requirements.

InterSystems Ensemble® is our seamless platform for integrating and developing connected applications. Ensemble can be used as a central processing hub or even as a backbone for nationwide networks. By integrating this connectivity with our high-performance Caché database, as well as with new technologies for analytics, high-availability, security, and mobile solutions, we can deliver a rock-solid and unified Big Data platform, not a patchwork of disparate solutions.

We also offer additional technologies built on our integrated platform, such as InterSystems HealthShare®, a health informatics platform that enables strategic interoperability and analytics for action. Our TrakCare unified health information system is likewise built upon this same integrated framework.

Q2. Who are your current customers and how do they typically use your products?

Iran Hutchinson: We continually update our technology to enable customers to better manage, ingest and analyze Big Data. Our clients are in healthcare, financial services, aerospace, utilities – industries that have extremely demanding requirements for performance and speed. For example, Caché is the world’s most widely used database in healthcare. Entire countries, such as Sweden and Scotland, run their national health systems on Caché, as well as top hospitals and health systems around the world. One client alone runs 15 percent of the world’s equity trades through InterSystems software, and all of the top 10 banks use our products.

It is also being used by the European Space Agency to map a billion stars – the largest data processing task in astronomy to date. (See The Gaia Mission One Year Later.)

Our configurable ACID (Atomicity Consistency Isolation Durability) capabilities and ECP-based approach enable us to handle these kinds of very large-scale, very high-performance, transactional Big Data applications.

Q3. What are the main new technical features you are currently working on and why?

Iran Hutchinson: There are several new paradigms we are focusing on, but let’s focus on analytics. Once you absorb all that Big Data, you want to run analytics. And that’s where the three V’s of Big Data – volume, velocity and variety – are critically important.

Let’s talk about the variety of data. Most popular Big Data analytics solutions start with the assumption of structured data – rows and columns – when the most interesting data is unstructured, or text-based data. A lot of our competitors still struggle with unstructured data, but we solved this problem with Caché in 1997, and we keep getting better at it. InterSystems Caché offers both vertical and horizontal scaling, enabling schema-less and schema-based (SQL) querying options for both structured and unstructured data.
As a result, our clients today are running analytics on all their data – and we mean real-time, operational data, not the data that is aggregated a week later or a month later for boardroom presentations.

A lot of development has been done in the area of schema-less data stores or so-called document stores, which are mainly key-value stores. The absence of a schema has some flexibility advantages, although for querying the data, the absence of a schema presents some challenges to people accustomed to a classic RDBMS. Some companies now offer SQL querying on schema-less data stores as an add-on or plugin. InterSystems Caché provides a high-performance key-value store with native SQL support.
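The trade-off between schema-less access and SQL querying can be illustrated with a short sketch. It does not use Caché or any InterSystems API; Python's built-in `sqlite3` module stands in for a SQL engine, and the documents, table name, and column names are hypothetical.

```python
import sqlite3

# Schema-less store: each key maps to a document with whatever fields it has.
store = {
    "star:1": {"name": "Sirius", "magnitude": -1.46},
    "star:2": {"name": "Vega", "magnitude": 0.03, "constellation": "Lyra"},
    "star:3": {"name": "Unnamed"},  # this document has no magnitude field
}

# Querying without a schema: the caller has to guard against missing fields.
bright_kv = sorted(
    doc["name"]
    for doc in store.values()
    if doc.get("magnitude") is not None and doc["magnitude"] < 1.0
)

# SQL over the same data: documents are projected onto declared columns,
# and a missing field simply becomes NULL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE star (id TEXT PRIMARY KEY, name TEXT, magnitude REAL)")
conn.executemany(
    "INSERT INTO star VALUES (?, ?, ?)",
    [(key, doc.get("name"), doc.get("magnitude")) for key, doc in store.items()],
)
bright_sql = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM star WHERE magnitude < 1.0 ORDER BY name"
    )
]
conn.close()
```

Both queries return the same names; the difference is where the knowledge about the fields lives, in the application code for the key-value case and in the table definition for the SQL case.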

The commonly available SQL-based solutions also require a predefinition of what the user is interested in. But if you don’t know the data, how do you know what’s interesting? Embedded within Caché is a unique and powerful text analysis technology, called iKnow, that analyzes unstructured data out of the box, without requiring any predefinition through ontologies or dictionaries. Whether it’s English, German, or French, iKnow can automatically identify concepts and understand their significance – and do that in real-time, at transaction speeds.

iKnow enables not only lightning-fast analysis of unstructured data, but also equally efficient Google-like keyword searching via SQL with a technology called iFind.
And because we married that iKnow technology with another real-time OLAP-type technology we call DeepSee, we make it possible to embed this analytic capability into your applications. You can extract complex concepts and build cubes on both structured AND unstructured data. We blend keyword search and concept discovery, so you can express a SQL query and pull out both concepts and keywords on unstructured data.

Much of our current development activity is focused on enhancing our iKnow technology for a more distributed environment.
This will allow people to upload a data set, structured and/or unstructured, and organize it in a flexible and dynamic way by just stepping through a brief series of graphical representations of the most relevant content in the data set. By selecting, in the graphs, the elements you want to use, you can immediately jump into the micro-context of these elements and their related structured and unstructured information objects. Alternatively, you can further segment your data into subsets that fit the use you had in mind. In this second case, the set can be optimized by a number of classic NLP strategies such as similarity extension, typicality pattern parallelism, etc. The data can also be wrapped into existing cubes or into new ones, or fed into advanced predictive models.

So our goal is to offer our customers a stable solution that really uses both structured and unstructured data in a distributed and scalable way. We will demonstrate the results of our efforts in a live system at our next annual customer conference, Global Summit 2014.

We also have a software partner that has built a very exciting social media application, using our analytics technology. It’s called Social Knowledge, and it lets you monitor what people are saying on Twitter and Facebook – in real-time. Mind you, this is not keyword search, but concept analysis – a very big difference. So you can see if there’s a groundswell of consumer feedback on your new product, or your latest advertising campaign. Social Knowledge can give you that live feedback – so you can act on it right away.

In summary, today InterSystems provides SQL and DeepSee over our shared data architecture to do structured data analysis.
And for unstructured data, we offer iKnow semantic analysis technology and iFind, our iKnow-powered search mechanism, to enable information discovery in text. These features will be enabled for text analytics in future versions of our shared-nothing data architectures.

Related Posts

The Gaia mission, one year later. Interview with William O’Mullane.
ODBMS Industry Watch, January 16, 2013

Operational Database Management Systems. Interview with Nick Heudecker. ODBMS Industry Watch, December 16, 2013.

Challenges and Opportunities for Big Data. Interview with Mike Hoskins. ODBMS Industry Watch, December 3, 2013.

On Analyzing Unstructured Data. — Interview with Michael Brands.
ODBMS Industry Watch, July 11, 2012.


ODBMS.org: Big Data Analytics, NewSQL, NoSQL, Object Database Vendors –Free Resources.

ODBMS.org: Big Data and Analytical Data Platforms, NewSQL, NoSQL, Object Databases– Free Downloads and Links.

ODBMS.org: Expert Articles.

Follow ODBMS.org on Twitter: @odbmsorg


The Gaia mission, one year later. Interview with William O’Mullane. http://www.odbms.org/blog/2013/01/the-gaia-mission-one-year-later-interview-with-william-omullane/ http://www.odbms.org/blog/2013/01/the-gaia-mission-one-year-later-interview-with-william-omullane/#comments Wed, 16 Jan 2013 07:57:23 +0000 http://www.odbms.org/blog/?p=1864

“We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.” –William O’Mullane

The Gaia mission is considered by the experts “the biggest data processing challenge to date in astronomy”. I recall here the Objectives and the Mission of the Gaia Project (source ESA Web site):
“To create the largest and most precise three dimensional chart of our Galaxy by providing unprecedented positional and radial velocity measurements for about one billion stars in our Galaxy and throughout the Local Group.”
“Gaia is an ambitious mission to chart a three-dimensional map of our Galaxy, the Milky Way, in the process revealing the composition, formation and evolution of the Galaxy. Gaia will provide unprecedented positional and radial velocity measurements with the accuracies needed to produce a stereoscopic and kinematic census of about one billion stars in our Galaxy and throughout the Local Group. This amounts to about 1 per cent of the Galactic stellar population. Combined with astrophysical information for each star, provided by on-board multi-colour photometry, these data will have the precision necessary to quantify the early formation, and subsequent dynamical, chemical and star formation evolution of the Milky Way Galaxy.
Additional scientific products include detection and orbital classification of tens of thousands of extra-solar planetary systems, a comprehensive survey of objects ranging from huge numbers of minor bodies in our Solar System, through galaxies in the nearby Universe, to some 500 000 distant quasars. It will also provide a number of stringent new tests of general relativity and cosmology.”

Last February, I interviewed William O’Mullane, Science Operations Development Manager at the European Space Agency, and Vik Nagjee, Product Manager, Core Technologies, at InterSystems Corporation, both deeply involved with the initial Proof-of-Concept of the data management part of the project.

A year later, I asked William O’Mullane (European Space Agency) and Jose Ruperez (InterSystems Spain) some follow-up questions.


Q1. The original goal of the Gaia mission was to “observe around 1,000,000,000 celestial objects”. Is this still true? Are you ready for that?

William O’Mullane: YES! We will have a Ground Segment Readiness Review next Spring and a Flight Acceptance Review before summer. We will observe at LEAST 1,000,000,000 celestial objects. If we launched today we would cope with difficulty – but we are on track to be ready by September when we actually launch. This is a game changer for astronomy thus very challenging for us, but we have done many large scale tests to gain confidence in our ability to process the complex and voluminous data arriving on ground and turn it into catalogues. I still feel the galaxy has plenty of scope to throw us an unexpected curve ball though and really challenge us in the data processing.

Q2. The plan was to launch the Gaia satellite in early 2013. Is this plan confirmed?

William O’Mullane: Currently September 2013 in Q1 is the official launch date.

Q3. Did the data requirements for the project change in the last year? If yes, how?

William O’Mullane: The downlink rate has not changed, so we know how much comes into the system: still only about 100 TB over 5 years. Data processing volumes depend on how many intermediate steps we keep in different locations. Not much change there since last year.
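As a rough back-of-the-envelope check, the figure of about 100 TB over 5 years can be turned into an average daily volume. The sketch below assumes decimal units (1 TB = 1000 GB) and an even spread over the mission, neither of which is stated in the interview.

```python
TOTAL_TB = 100          # total downlinked volume quoted in the interview
YEARS = 5               # mission duration quoted in the interview
GB_PER_TB = 1000        # assumption: decimal units
DAYS_PER_YEAR = 365.25  # assumption: average year length

total_days = YEARS * DAYS_PER_YEAR
gb_per_day = TOTAL_TB * GB_PER_TB / total_days  # average, assuming an even spread
tb_per_year = TOTAL_TB / YEARS

print(f"average volume: {gb_per_day:.1f} GB/day, {tb_per_year:.0f} TB/year")
```

Under these assumptions the average comes out at a little under 55 GB per day, or 20 TB per year.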

Q4. The sheer volume of data that is expected to be captured by the Gaia satellite poses a technical challenge. What work has been done in the last year to prepare for such a challenge? What did you learn from the Proof-of-Concept of the data management part of this project?

William O’Mullane: I suppose we learned the same lessons as other projects. We have multiple processing centres with different needs met by different systems. We did not try for a single unified approach across these centers.
The CNES have gone completely to Hadoop for their processing. At ESAC we are going to InterSystems Caché. Last year only AGIS was on Caché – now the main daily processing chain is in Caché also [Edit: see also Q.9 ]. There was a significant boost in performance here, but it must be said that some of this was probably internal to the system: in moving it, we looked at some bottlenecks more closely.

Jose Ruperez: We are very pleased to know that last year it was only AGIS and now they have several other databases in Caché.

William O’Mullane: The second operations rehearsal is just drawing to a close. This was run completely on Caché (the first rehearsal used Oracle). There were of course some minor problems (also with our software), but in general, from the Caché perspective, it went well.

Q5. Could you please give us some numbers related to performance? Could you also tell us what bottlenecks you looked at, and how you avoided them?

William O’Mullane: It would take me time to dig out numbers, but we got a factor of 10 in some places with a combination of better queries and removing some code bottlenecks. We seem to regularly see a factor of 10 on “non-optimized” systems.

Q6. Is it technically possible to interchange data between Hadoop and Caché ? Does it make sense for the project?

Jose Ruperez: The raw data downloaded from the satellite link every day can be loaded in any database in general. ESAC has chosen InterSystems Caché for performance reasons, but not only. William also explains how cost-effectiveness as well as the support from InterSystems were key. Other centers can try and use other products.

William O’Mullane: This is a valid point – support is a major reason for our using Caché. InterSystems work with us very well and respond to needs quickly. InterSystems certainly have a very developer-oriented culture which matches our team well.
Hadoop is one thing, HDFS is another, but of course they go together. In many ways our DataTrain Whiteboard does “map reduce” with some improvements for our specific problem. There are Hadoop database interfaces, so it could work with Caché.
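For readers unfamiliar with the term, the generic “map reduce” pattern mentioned above can be sketched in a few lines. This is only the textbook pattern, not the DataTrain Whiteboard design, whose details are not described here; the function name `map_reduce` and the sample observations are hypothetical.

```python
from collections import defaultdict
from functools import reduce

def map_reduce(records, mapper, reducer):
    """Apply mapper to every record, group values by key, then reduce each group."""
    groups = defaultdict(list)
    for record in records:
        # Map step: each record yields zero or more (key, value) pairs.
        for key, value in mapper(record):
            groups[key].append(value)
    # Reduce step: fold the values collected for each key into one result.
    return {key: reduce(reducer, values) for key, values in groups.items()}

# Hypothetical observations, each tagged with the source it belongs to.
observations = [
    {"source": "A", "flux": 1.0},
    {"source": "B", "flux": 2.0},
    {"source": "A", "flux": 3.0},
]

# Count the observations per source.
counts = map_reduce(
    observations,
    mapper=lambda obs: [(obs["source"], 1)],
    reducer=lambda a, b: a + b,
)
```

In a distributed system the map and reduce steps run on different machines and the grouping step moves data between them; here everything runs in one process to keep the pattern visible.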

Q7. Could you tell us a bit more about what you have learned so far with this project? In particular, what is the implication for Caché, now that the main daily processing chain is also stored in Caché?

Jose Ruperez: A relevant number, regarding InterSystems Caché performance, is to be able to insert over 100,000 records per second sustained over several days. This also means that InterSystems Caché, in turn, has to write hundreds of MegaBytes per second to disk. To me, this is still mind-boggling.

William O’Mullane: And done with fairly standard NetApp storage. Caché and NetApp engineers sat together here at ESAC to align the configuration of both systems to get the max IO for Java through Caché to NetApp. There were several low-level page size settings etc. which were modified for this.
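To put the insert rate quoted above into perspective, the arithmetic can be written out. The rate of 100,000 records per second comes from the interview; the disk throughput of 200 MB/s used below is only an illustrative value for “hundreds of megabytes per second”, and decimal megabytes are assumed.

```python
RECORDS_PER_SECOND = 100_000  # sustained insert rate quoted in the interview
SECONDS_PER_DAY = 86_400

# Number of records inserted in one day at the sustained rate.
records_per_day = RECORDS_PER_SECOND * SECONDS_PER_DAY

def bytes_per_record(disk_mb_per_s):
    """Average bytes written per record for a given disk throughput.

    Assumes decimal megabytes and that all writes are attributable to inserts,
    so the result includes index and journal overhead, not just payload.
    """
    return disk_mb_per_s * 1_000_000 / RECORDS_PER_SECOND

# Illustrative value only: 200 MB/s is an assumption, not a measured figure.
example_bytes = bytes_per_record(200)
```

At the quoted rate this is 8.64 billion records per day, and under the 200 MB/s assumption each record accounts for about 2 KB written to disk.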

Q8. What else still needs to be done?

William O’Mullane: Well, we have most of the parts, but it is not a well-oiled machine yet. We need more robustness and a little more automation across the board.

Q9. Your high-level architecture a year ago consisted of two databases, a so-called Main Database and an AGIS Database. The Main Database was supposed to hold all data from Gaia and the products of processing. (This was expected to grow from a few TBs to a few hundred TBs during the mission.) AGIS required only a subset of this data for analytics purposes. Could you tell us how the architecture has evolved in the last year?

William O’Mullane: This remains precisely the same.

Q10. For the AGIS database, were you able to generate realistic data and load on the system?

William O’Mullane: We have run large scale AGIS tests with 50,000,000 sources or about 4,500,000,000 simulated observations. This worked rather nicely and well within requirements. We confirmed going from 2 to 10 to 50 million sources that the problem scales as expected. The final (end of mission 2018) requirement is for 100,000,000 sources, so for now we are quite confident with the load characteristics. The simulation had a realistic source distribution in magnitude and coordinates (i.e. real sky like inhomogeneities are seen).
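The test figures above imply an average number of observations per source, which can be worked out directly. The projection to 100 million sources below assumes the same ratio holds at the larger size; that is an assumption made for illustration, not something stated in the interview.

```python
TEST_SOURCES = 50_000_000           # sources in the large-scale AGIS test
TEST_OBSERVATIONS = 4_500_000_000   # simulated observations in that test
FINAL_SOURCES = 100_000_000         # end-of-mission requirement

# Average observations per source in the test run.
obs_per_source = TEST_OBSERVATIONS / TEST_SOURCES

# Assumption: the same ratio applies to the final requirement.
projected_observations = int(FINAL_SOURCES * obs_per_source)

print(f"{obs_per_source:.0f} observations per source")
print(f"{projected_observations:,} observations projected for the final run")
```

This gives 90 observations per source in the test, and about 9 billion observations for 100 million sources if the ratio is unchanged.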

Q11. What results did you obtain in tuning and configuring the AGIS system in order to meet the strict insert requirements, while still optimizing sufficiently for down-stream querying of the data?

William O’Mullane: We still have bottlenecks in the update servers, but the 50 million test still ran inside one month on a small in-house cluster. So the 100 million in 3 months (system requirement) will be easily met, especially with new hardware.

Q12. What are the next steps planned for the Gaia mission and what are the main technical challenges ahead?

William O’Mullane: AGIS is the critical piece of Gaia software for astrometry, but before that the daily data treatment must be run. This so-called Initial Data Treatment (IDT) is our main focus right now. It must be robust and smoothly operating for the mission and able to cope with non-nominal situations occurring in commissioning the instrument. So some months of consolidation, bug fixing and operational rehearsals for us. The future challenge I expect not to be technical, but rather to come when we see the real data and it is not exactly as we expect/hope it will be. I may be pleasantly surprised of course. Ask me next year …

William O’Mullane, Science Operations Development Manager, European Space Agency.
William O’Mullane has a PhD in Physics and a background in Computer Science and has worked on space science projects since 1996, when he assisted with the production of the Hipparcos CD-ROMs. During this period he was also involved with the Planck and Integral science ground segments as well as contemplating the Gaia data processing problem. From 2000-2005 Wil worked on developing the US National Virtual Observatory (NVO) and on the Sloan Digital Sky Survey (SDSS) in Baltimore, USA. In August 2005 he rejoined the European Space Agency as Gaia Science Operations Development Manager to lead the ESAC development effort for the Gaia Data Processing and Analysis Consortium.

José Rupérez, Senior Engineer at InterSystems.
He has been providing technical advice to customers and partners in Spain and Portugal for the last 10 years. In particular, he has been working with the European Space Agency since December 2008. Before InterSystems, José developed his career at eSkye Solutions and A.T. Kearney in the United States, always in software. He started his career working for Alcatel in 1994 as a Software Engineer. José holds a Bachelor of Science in Physics from Universidad Complutense (Madrid, Spain) and a Master of Science in Computer Science from Ball State University (Indiana, USA). He has also attended courses at the MIT Sloan School of Business.

Related Posts
Objects in Space -The biggest data processing challenge to date in astronomy: The Gaia mission. February 14, 2011

Objects in Space: “Herschel” the largest telescope ever flown. March 18, 2011

Objects in Space vs. Friends in Facebook. April 13, 2011


Gaia Overview (ESA)

Gaia Web page at ESA Spacecraft Operations.

ESA’s web site for the Gaia scientific community.

Gaia library (ESA) collates Gaia conference proceedings, selected reports, papers, and articles on the Gaia mission as well as public DPAC documents.

“Implementing the Gaia Astrometric Global Iterative Solution (AGIS) in Java”. William O’Mullane, Uwe Lammers, Lennart Lindegren, Jose Hernandez and David Hobbs. Aug. 2011

“Implementing the Gaia Astrometric Solution”, William O’Mullane, PhD Thesis, 2012

You can follow ODBMS.org on Twitter : @odbmsorg.

On Analyzing Unstructured Data. — Interview with Michael Brands. http://www.odbms.org/blog/2012/07/on-analyzing-unstructured-data-interview-with-michael-brands/ http://www.odbms.org/blog/2012/07/on-analyzing-unstructured-data-interview-with-michael-brands/#comments Wed, 11 Jul 2012 06:33:41 +0000 http://www.odbms.org/blog/?p=1590

“The real difference will be made by those companies that will be able to fully exploit and integrate their structured and unstructured data into so-called active analytics. With Active Analytics enterprises will be able to use both quantitative and qualitative data and drive action based on a plain understanding of 100% of their data.” –Michael Brands

It is reported that 80% of all data in an enterprise is unstructured information. How do we manage unstructured data? I have interviewed Michael Brands, an expert on analyzing unstructured data and currently a senior product manager for the i.Know technology at InterSystems.


Q1. It is commonly said that more than 80% of all data in an enterprise is unstructured information. Examples are telephone conversations, voicemails, emails, electronic documents, paper documents, images, web pages, video and hundreds of other formats. Why is unstructured data important for an enterprise?

Michael Brands: Well, unstructured data is important for organizations in general in at least three ways.
First of all, 90% of what people do in a business day is unstructured, and the results of most of these activities can only be captured in unstructured data.
Second, it is generally acknowledged in the modern economy that knowledge is the biggest asset of companies, and most of this knowledge, since it’s developed by people, is recorded in unstructured formats.

The last and maybe most unexpected argument to underpin the importance of unstructured data is the fact that large research organizations such as Gartner and IDC state that “80% of business is conducted on unstructured data.”
If we take these three elements together, it is even surprising to see that most organizations invest heavily in business intelligence applications to improve their business, yet these applications cover only a very small portion of the data (20% in the most optimistic estimation) that is actually important for their business.
If we look at this from a different perspective, we think enterprises that really want to lead and make a difference will invest heavily in technologies that help them to understand and exploit their unstructured data. The numbers are the small portion of data most enterprises already understand very well, so unstructured data is the area where the difference will be made over the next couple of years.
However, the real difference will be made by those companies that will be able to fully exploit and integrate their structured and unstructured data into so-called active analytics. With Active Analytics enterprises will be able to use both quantitative and qualitative data and drive action based on a plain understanding of 100% of their data.
At InterSystems we have a unique technology offering that was especially designed to help our customers and partners do exactly that, and we’re proud that our partners who actually deploy the technology to fully exploit 100% of their data make a real difference in their market and grow way faster than their competitors.

Q2. What is the main difference between semi-structured and unstructured information?

Michael Brands: The very short and bold answer to this question would be to say that semi-structured is just a euphemism for unstructured.
However, a more in-depth answer is that semi-structured data is a combination of structured and unstructured data in the same data channel.
Typically, semi-structured data comes out of forms that provide specific free-text areas to describe specific parts of the required information. This way a “structured” (meta)data field describes, with a fair degree of abstraction, the contents of the associated text field.
A typical example will help to clarify this: In an electronic medical record system the notes section in which a doctor can record his observations about a specific patient in free text is typically semi-structured which means the doctor doesnʼt have to write all observations in one text but he can typically “categorize” his observations under different headers such as: “Patient History”, “Family History”, “Clinical Findings”, “Diagnose” and more.
Subdividing such text entry environments into a series of different fields with a fixed header is a very common example of semi-structured data.
Another very popular example of semi-structured data is e-mail, MP3, or video data. These data types contain mainly unstructured data, but this unstructured data is always attached to some more structured data such as author, subject or title, summary, etc.

Q3. The most common example of unstructured data is text. Several applications store portions of their data as unstructured text that is typically implemented as plain text, in rich text format (RTF), as XML, or as a BLOB (Binary Large Object). It is very hard to extract meaning from this content. How can iKnow help here?

Michael Brands: iKnow can help here in a very specific and unique way because it is able to structure these texts into chains of concepts and relations.
What this means is that iKnow will be able to tell you without prior knowledge what the most important concepts in these texts are and how they are related to each other.
This is why, when we talk about iKnow, we say the technology is proactive.
Any other technology that analyses text will need a domain-specific model (statistical, ontological or syntactical) containing a lot of domain-specific knowledge in order to make some sense out of the texts it is supposed to analyze. iKnow, thanks to its unique way of splitting sentences into concepts and relations, doesn’t need this.
It will fully automatically perform the analysis and highlighting tasks students usually perform as a first step in understanding and memorizing a course text book.

Q4. How do you exactly make conceptual meaning out of unstructured data? Which text analytical methods do you use for that?

Michael Brands: The process we use to extract meaning out of texts is unique for the following reason: we do not split sentences into individual words and then try to recombine these words by means of a syntactic parser, an ontology (which essentially is a dictionary combined with a hierarchical model that describes a specific domain), or a statistical model. Instead, iKnow splits sentences by identifying the relational words or word groups in a sentence.
This approach is based on a couple of long-known facts about language and communication.

First of all, analytical semantics discovered years ago that every sentence is nothing more than a chain of conceptual word groups (often called Noun Phrases or Prepositional Phrases in formal linguistics) tied together by relations (often called Verb Phrases in formal linguistics). So semantically a sentence will always be built as a chain: a concept followed by a relation, followed by another concept, followed by another relation and another concept, and so on.
This basic conception of a binary sentence structure consisting of Noun-headed phrases (concepts) and Verb-headed phrases (relations) is at the heart of almost all major approaches to automated syntactic sentence analysis. However this knowledge is only used by state-of-the-art analysis algorithms to construct second order syntactic dependency structure representations of a sentence rather than to effectively analyze the meaning of a sentence.

A second important discovery underpinning the iKnow approach is the fact, established by behavioral psychology and neuropsychiatry, that humans only understand and operate a very small set of different relations to express links between facts, events, or thoughts. Not only is the set of different relations people use and understand very limited, it is also a universal set. In other words, people only use a limited number of different relations, and these relations are the same for everybody regardless of language, education, or cultural background.
This discovery can teach us a lot about how basic mechanisms for learning, such as derivation and inference, work. But more important for our purposes is that we can derive from it that, in sharp contrast with the set of concepts, which is infinite and has different subsets for each specific domain, the set of relations is limited and universal.
The combination of these two elements, namely the basic binary concept-relation structure of language and the universality and limitedness of the set of relations, led to the development of the iKnow approach after a thorough analysis of a lot of state-of-the-art techniques.
Our conclusion from this analysis is that the main problem of all classical approaches to text analysis is that they focus essentially on the endless and domain-specific set of concepts, because they were mostly created to serve the specific needs of a specific domain.
Thanks to this domain specific focus the number of elements a system needs to know upfront can be controlled. Nevertheless a “serious” application quickly integrates several millions of different concepts. This need for large collections of predefined concepts to describe the application domain, commonly called dictionaries, taxonomies or ontologies, leads to a couple of serious problems.
First of all, the time needed to set up and tune such applications is substantial, and the work is expensive, because domain experts are needed to come up with the most appropriate concepts. Second, the footprint of these systems is rather big and their maintenance costly and time-consuming, because specialists need to follow what’s going on in the domain and adapt the knowledge of the application.
Third, it’s very difficult to open up a domain-specific application to other domains, because in these other domains concepts might have different meanings or even contradict each other, which can create serious problems at the level of the parsing logic.
Therefore iKnow was built to perform a completely different kind of analysis: by focusing on the relations we can build systems with a very small footprint (an average language model only contains several tens of thousands of relations and a very small number of context-based disambiguation rules).
Moreover, our system is not domain specific: it can work with data from very different domains at the same time and doesn’t need expert input. Sentences are split up by means of relations. The ambiguous cases are those in which a word or word group can express both a concept and a relation; for example, “walk” is a concept in the sentence “Brussels Paris would be quite a walk” and a relation in the sentence “Pete and Mary walk to school”. These cases are solved by rules that use the function (concept or relation) of the surrounding words or word groups to decide whether the ambiguous word is a concept or a relation. This is a computationally very efficient and fast process, and it results in a system that learns as it analyses more data: it “learns” the concepts from the texts by identifying them as the groups of words between the relations, before the first relation, and between the last relation and the end of the sentence.
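To make the idea of splitting a sentence at its relations more concrete, here is a minimal toy sketch in Python. It is not the iKnow algorithm or its API: the relation lexicon `RELATIONS` and the function `split_sentence` are hypothetical names invented for this illustration, the lexicon holds three hand-picked entries instead of tens of thousands, and the context-based disambiguation rules mentioned above are not modelled at all.

```python
# Toy illustration only. RELATIONS is a hypothetical, hand-picked lexicon;
# a real language model would contain tens of thousands of relations plus
# disambiguation rules, none of which are modelled here.
RELATIONS = {"prescribed", "for", "walk to"}


def split_sentence(sentence, relations=RELATIONS):
    """Return a chain of ("concept" | "relation", text) pairs."""
    tokens = sentence.lower().rstrip(".").split()
    max_len = max(len(r.split()) for r in relations)
    chain, concept, i = [], [], 0
    while i < len(tokens):
        match = None
        # Prefer the longest relation phrase that starts at position i.
        for n in range(min(max_len, len(tokens) - i), 0, -1):
            candidate = " ".join(tokens[i:i + n])
            if candidate in relations:
                match = candidate
                break
        if match:
            # Everything collected since the previous relation is a concept.
            if concept:
                chain.append(("concept", " ".join(concept)))
                concept = []
            chain.append(("relation", match))
            i += len(match.split())
        else:
            concept.append(tokens[i])
            i += 1
    if concept:
        chain.append(("concept", " ".join(concept)))
    return chain


for part in split_sentence("Pete and Mary walk to school."):
    print(part)
```

Note that the concepts ("pete and mary", "school") are never listed anywhere in advance: they fall out as the word groups left over between the known relations, which is the point the answer makes about a small, domain-independent footprint.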

Q5. How “precise” is the meaning you extract from unstructured data? Do you have a way to validate it?

Michael Brands: This is a very interesting question because it raises two very difficult topics in the area of semantic data analysis, namely: how do you define precision, and how do you evaluate results generated by semantic technologies?
If we use the classical definition of precision in this area, it describes what percentage of the documents given back by a system in response to a query asking for documents containing information about certain concepts actually contains useful information about these concepts.
Based on this definition of precision we can say iKnow scores very close to 100%, because it outperforms competing technologies in its efficiency at detecting which words in a sentence belong together and form meaningful groups, and how they relate to each other.
Even if we use other, more challenging definitions of precision, such as the syntactic or formal correctness of the word groups identified by iKnow, we score very high percentages, but it’s evident we’re dependent on the quality of the input. If the input doesn’t use punctuation marks accurately or contains a lot of non-letter characters, that will affect our precision. Moreover, how precision is perceived and defined varies a lot from one use case to another.
Evaluation is a very complex and subjective operation in this area, because what’s considered good or bad depends heavily on what people want to do with the technology and what their background is. So far we have let our customers and partners decide after an evaluation period whether the technology does what they expect from it, and we haven’t had any “no-gos” yet.
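The classical definition of precision used in this answer can be written down in a few lines. The sketch below is the standard information-retrieval formula, not a description of how iKnow is actually evaluated; the function name and the document identifiers are made up for illustration.

```python
def precision(returned, relevant):
    """Share of returned documents that are actually relevant to the query."""
    returned = set(returned)
    if not returned:
        return 0.0  # convention chosen here for an empty result set
    return len(returned & set(relevant)) / len(returned)


# Hypothetical query result: four documents returned, three of them useful.
returned_docs = ["d1", "d2", "d3", "d4"]
relevant_docs = {"d1", "d2", "d3", "d9"}
print(precision(returned_docs, relevant_docs))
```

Precision says nothing about relevant documents the system failed to return ("d9" above), which is one reason the perceived quality of a system varies from one use case to another.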

Q6. How do you process very large scale archives of data?

Michael Brands: The architecture of the system has been set up to be as flexible as possible and to make sure processes can be executed in parallel where possible and desirable. Moreover, the system provides different modes to load data: a batch load, which has been specially designed to pump large amounts of existing data such as document archives into a system as fast as possible; a single-source load, which is designed to add individual documents to a system at transactional speed; and a small-batch mode to add limited sets of documents to a system in one process.
On top of that, the loading architecture provides different steps in the loading process: data to be loaded needs to be listed or staged; the data can be converted (this means the data that has to be indexed can be adapted to get better indexing and analysis results); and, of course, the data will be loaded into the system.
These different steps can partially be done in parallel and in multiple processes to ensure the best possible performance and flexibility.

Q7. On one hand we have mining text data, and on the other hand we have database transactions on structured data: how do you relate them to each other?

Michael Brands: Well, there are two different perspectives in this question:
On the one hand, it’s important to underline that all textual data indexed with iKnow can be used as if it were structured data, because the API provides appropriate methods that allow you to query the textual data the same way you would query traditional row-column data. These methods come in three different flavors: they can be called as native Caché ObjectScript methods, they can be called from within a SQL environment as stored procedures, and they are also available as web services.

On the other hand, there’s the fact that all structured data that has a link with the indexed texts can be used as metadata within iKnow. Based on this structured metadata, filters can be created and used within the iKnow API to make sure the API returns exactly the results you need.

Michael Brands previously founded i.Know NV, a company specialized in analyzing unstructured data. In 2010 InterSystems acquired i.Know, and since then he has served as a senior product manager for the i.Know technology at InterSystems.
i.Know’s technology is embedded in the InterSystems technology platform.

Related Posts

Managing Big Data. An interview with David Gorbet (July 2, 2012)

Big Data: Smart Meters — Interview with Markus Gerdes (June 18, 2012)

Big Data for Good (June 4, 2012)

On Big Data Analytics: Interview with Florian Waas, EMC/Greenplum (February 1, 2012)

On Big Data: Interview with Shilpa Lawande, VP of Engineering at Vertica (November 16, 2011)

On Big Data: Interview with Dr. Werner Vogels, CTO and VP of Amazon.com (November 2, 2011)

Analytics at eBay. An interview with Tom Fastner (October 6, 2011)


Big Data: Smart Meters — Interview with Markus Gerdes. http://www.odbms.org/blog/2012/06/big-data-smart-meters-interview-with-markus-gerdes/ http://www.odbms.org/blog/2012/06/big-data-smart-meters-interview-with-markus-gerdes/#comments Mon, 18 Jun 2012 06:51:46 +0000 http://www.odbms.org/blog/?p=1501 “For a large to medium sized German utility, which has about 240,000 conventional meters, quarter-hour meter readings would produce 960,000 sets of meter data to be processed and stored each hour once replaced by smart meters. And every hour another 960,000 sets of meter data have to be processed.” — Markus Gerdes.

80 percent of all households in Germany will have to be equipped with smart meters by 2020, according to an EU single market directive.
Why smart meters? A smart meter, as described by e.On, is “a digital device which can be read remotely and allows customers to check their own energy consumption at any time. This helps them to control their usage better and to identify concrete ways to save energy. Every customer can access their own consumption data online in graphic form displayed in quarter-hour intervals. There is also a great deal of additional information, such as energy saving tips. Similarly, measurements can be made using a digital display in the home in real time and the current usage viewed.” This means Big Data. How do we store, and use all these machine-generated data?
To better understand this, I have interviewed Dr. Markus Gerdes, Product Manager at BTC, a company specialized in the energy sector.


Q1. What are the main business activities of BTC?

Markus Gerdes: BTC provides various IT-services: besides the basics of system management, e.g. hosting services, security services or the new field of mobile security services, BTC primarily delivers IT- and process consulting and system integration services for different industries, especially for utilities.
This means BTC plans and rolls out IT architectures, integrates and customizes IT applications, and migrates data for ERP, CRM and other applications. BTC also delivers its own IT applications if desired: in particular, BTC’s smart metering solution BTC Advanced Meter Management (BTC AMM) is increasingly known in the smart meter market and has drawn customers’ interest at this stage of the market, not only in Germany but also, for example, in Turkey and other European countries.

Q2. According to an EU single market directive and the German Federal Government, 80 percent of all households in Germany will have to be equipped with smart meters by 2020. How many smart meters will have to be installed? What will the government do with all the data generated?

Markus Gerdes: Currently, 42 million electricity meters are installed in Germany. Thus, about 34 million meters need to be exchanged in Germany by 2020 according to the EU directive. In order to achieve this aim, the German EnWG (law on the energy industry) added some new requirements in 2011: smart meters have to be installed where customers’ electricity consumption is more than 6,000 kWh per year, at decentralized feed-in with more than 7 kW, and in considerably refurbished or newly constructed buildings, if this is technically feasible.
In this context, technically feasible means that the installed smart meters are certified (as a precondition they have to use the protection profiles) and are commercially available in the meter market. An amendment to the EnWG is due in September 2012, and it is generally expected that the threshold of 6,000 kWh will be lowered. The government will actually not be in charge of the data collected by the smart meters. It is the metering companies who have to provide the data to the distribution network operators and utility companies. The data is then used for billing and, for example, as an input to customer feedback systems, and potentially for grid analyses under the use of pseudonyms.

Q3. Smart Metering: Could you please give us some detail on what Smart Metering means in the Energy sector?

Markus Gerdes: Smart Metering means opportunities. The technology itself does no more or less than deliver data, foremost a timestamp plus a measured value, from a metering system via a communication network to an IT-system, where it is prepared and provided to other systems. If necessary this may even be done in real time. This data can be relevant to different market players in different resolutions and aggregations as a basis for other services.
Furthermore, smart meters offer new features like complex tariffs, load limitations etc. The data and the new features will lead to optimized processes with respect to quality, speed and costs. The type of processing will finally lead to new services, products and solutions – some of which we do not even know today. In combination with other technologies and information types the smart metering infrastructure will be the backbone of smart home applications and the so-called smart grid.
For instance, BTC develops scenarios to combine the BTC AMM with the control of virtual power plants or even with the BTC grid management and control application BTC PRINS. This means: smart markets become reality.

Q4. BTC AG has developed an advanced meter management system for the energy industry. What is it?

Markus Gerdes: The BTC AMM is an innovative software system which allows meter service providers to manage, control, and read out smart meters and to provide these meter readings and other possibly relevant information (e.g. status information, or information on meter corruption and manipulation) to authorized market partners.
The system can also provide data and control signals to the smart meters.
Because the BTC AMM was developed as a new solution, BTC has been able to focus particularly on mass data management and on workflows optimized for smart meter mass processes. In combination with a clear and easy-to-use frontend, we bring our customers a high-performance solution for their most important requirements.
In addition, our modular concept and the use of open standards not only make our vendor-independent solution fit easily into utilities’ IT architectures but also make it future-proof.

Q5. What kind of data management requirements do you have for this application? What kind of data is a smart meter producing and at what speed? How do you plan to store and process all the data generated by these smart meters?

Markus Gerdes: Let me address the issue of the data volume and frequency of data sent first. The BTC AMM is designed to collect the data of several million smart meters. In a standard scenario each of the smart meters sends a load profile with a resolution of 15 minutes to the BTC AMM. This means that at least 96 data points have to be stored by the BTC AMM per day and meter. This implies both a huge amount of data to be stored and high-frequency data traffic.
Hence, the data management system needs to perform well in both dimensions. In order to process time series, BTC has developed a specific, highly efficient time series management component which runs with databases from different providers. This enables the BTC AMM to cope even with data sent at a higher frequency. For certain smart grid use cases the BTC AMM processes metering data sent from the meters on the scale of seconds.
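The figure of 96 data points per day and meter follows directly from the 15-minute resolution. The short Python sketch below reproduces it and shows how the count grows when the resolution moves towards the scale of seconds. The function name and the fleet size of one million meters are assumptions made for illustration; the interview only says "several million".

```python
SECONDS_PER_DAY = 24 * 60 * 60


def points_per_day(resolution_seconds):
    """Number of readings one meter produces per day at a fixed resolution."""
    if resolution_seconds <= 0 or SECONDS_PER_DAY % resolution_seconds:
        raise ValueError("resolution must be positive and divide a day evenly")
    return SECONDS_PER_DAY // resolution_seconds


quarter_hour = points_per_day(15 * 60)   # standard load profile
per_second = points_per_day(1)           # smart grid use case, 1-second data
fleet_daily = 1_000_000 * quarter_hour   # hypothetical fleet of 1 million meters

print(quarter_hour, per_second, fleet_daily)
```

Moving from quarter-hour to one-second readings multiplies the volume per meter by 900, which illustrates why both storage volume and write frequency have to be considered together.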

Q6. The system you are developing is based on the InterSystems Caché® database system. How do you use Caché?

Markus Gerdes: BTC uses InterSystems Caché as Meter Data Management solution. This means the incoming data from the smart meters is saved into the database and the information provided e.g. to backend-systems via webservices or to other interfaces so that the data can be used for market partner communication or customer feedback systems. And all this means the BTC AMM has to handle thousands of read- and write-operations per second.

Q7. You said that one of the critical challenges you are facing is to “master up the mass data efficiency in communicating with smart meters and the storage and processing of measured time series”. Can you please elaborate on this? What is the volume of the data sets involved?

Markus Gerdes: For a large to medium sized German utility, which has about 240,000 conventional meters, quarter-hour meter readings would produce 960,000 sets of meter data to be processed and stored each hour once replaced by smart meters. And every hour another 960,000 sets of meter data have to be processed.
In addition calculations, aggregations and plausibility checks are necessary. Moreover incoming tasks have to be processed and the relevant data has to be delivered to backend applications. This means that the underlying database as well as the AMM-processes may have to process the incoming data every 15 minutes while reading thousands of time series per minute simultaneously.
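The hourly volume quoted above can be checked with a few lines of arithmetic. The sketch below uses only the two numbers given in the answer (240,000 meters, one reading every 15 minutes); the daily total and the average rate per second are derived values added here for orientation and assume readings are spread evenly, which real traffic will not be.

```python
METERS = 240_000        # meters of the example utility (from the interview)
READINGS_PER_HOUR = 4   # one reading every 15 minutes


def sets_per_hour(meters, readings_per_hour=READINGS_PER_HOUR):
    """Sets of meter data arriving per hour for a given number of meters."""
    return meters * readings_per_hour


hourly = sets_per_hour(METERS)
daily = hourly * 24
avg_per_second = hourly / 3600   # assumes a perfectly even arrival rate

print(hourly, daily, round(avg_per_second, 1))
```

In practice the readings arrive in bursts at the quarter-hour boundaries, so the peak rate the database has to absorb is well above this average.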

Q8. How did you test the performance of the underlying database system when handling data streams? What results did you obtain so far?

Markus Gerdes: We designed a load profile generator and used it to simulate the meter readings of more than 1 million smart meters. The tests included the writing of quarter-hour meter readings. Actually, the problem with this test was the speed at which the generator could provide the data, not the speed of the AMM. In fact we are able to write more than 12,000 time series per second. This is more than enough to cope even with full meter rollouts.

Q9. What is the current status of this project? What are the lessons learned so far? And the plans ahead? Are there any similar systems implemented in Europe?

Markus Gerdes: At the moment we think that the performance of our BTC AMM and database is sufficient to handle the upcoming mass data during the next years, including a full smart meter rollout in Germany. Nevertheless, in terms of smart grid and smart home appliances and an increasing amount of real-time event processing, both read and write, it is necessary to get a clear view of future technologies to speed up the processing of mass data (e.g. in-memory).
In addition we still have to keep an eye on usability. Although we hope that smart metering in the end will lead to complete machine-to-machine-communication we always have to expect errors and disturbances from technology, communication or even the human factor. As event driven processes are time critical we still have to work on solutions for fast and efficient handling, analyses and processing of mass errors.

Dr. Markus Gerdes, Product Manager BTC AMM / BTC Smarter Metering Suite, BTC Business Technology Consulting AG.
Since 2009 Mr. Gerdes has worked on several research, development and consulting projects in the area of smart metering. He has been involved in research and consulting in the utilities, industry and public sectors, regarding IT architecture, IT solutions, and IT security.
He is experienced in the development of energy management solutions.

Interview with Iran Hutchinson, Globals. http://www.odbms.org/blog/2011/06/interview-with-iran-hutchinson-globals/ http://www.odbms.org/blog/2011/06/interview-with-iran-hutchinson-globals/#comments Mon, 13 Jun 2011 22:06:22 +0000 http://www.odbms.org/blog/?p=820 “ The newly launched Globals initiative is not about creating a new database.
It is however, about exposing the core multi-dimensional arrays directly to developers.” — Iran Hutchinson.


InterSystems recently launched a new initiative: Globals.
I wanted to know more about Globals. I have therefore interviewed Iran Hutchinson, software/systems architect at InterSystems and one of the people behind the Globals project.


Q1. InterSystems recently launched a new database product: Globals. Why a new database? What is Globals?

Iran Hutchinson: InterSystems has continually provided innovative database technology to its technology partners for over 30 years. Understanding customer needs to build rich, high-performance, and scalable applications resulted in a database implementation with a proven track record. The core of the database technology is multi-dimensional arrays (aka globals).
The newly launched Globals initiative is not about creating a new database. It is, however, about exposing the core multi-dimensional arrays directly to developers. By closely integrating access into development technologies like Java and JavaScript, developers can take full advantage of high-performance access to our core database components.

We undertook this project to build much broader awareness of the technology that lies at the heart of all of our products. In doing so, we hope to build a thriving developer community conversant in the Globals technology, and aware of the benefits to this approach of building applications.

Q2. You classify Globals as a NoSQL-database. Is this correct? What are the differences and similarities of Globals with respect to other NoSQL databases in the market?

Iran Hutchinson: While Globals can be classified as a NoSQL database, it goes beyond the definition of other NoSQL databases. As you know, there are many different offerings in NoSQL and no key comparison matrices or feature lists. Below we list some comparisons and differences, with hopes of later expanding the available information on the globalsdb.org website.

Globals differs from other NoSQL databases in a number of ways.

o It is not limited to one of the known paradigms in NoSQL (Column/Wide Column, Key-Value, Graph, Document, etc.). You can build your own paradigm on top of the core engine. This is an approach we took as we evolved Caché to support objects, XML, and relational, to name a few.
o Globals still offers optional transactions and locking. Though efficient in implementation we wanted to make sure that locking and transactions were at the discretion of the developer.
o MVCC is built into the database.
o Globals runs in-memory and writes data to disk.
o There is currently no sharding or replication available in Globals. We are discussing options for these features.
o Globals builds on the over 33 years of success of Caché. It is well proven. It is the exact same database technology. Globals will continue to evolve, and receive the innovations going into the core of Caché.
o Our goal with Globals is to be a very good steward of the project and technology. The Globals initiative will also start to drive contests and events to further promote adoption of the technology, as well as innovative approaches to building applications. We see this stewardship as a key differentiator, along with the underlying flexible core technology.

• Globals shares similar traits with other NoSQL databases in the market.

o It is free for development and deployment.
o The data model can optionally use a schema. We mitigate the impact of using schemas by using the same infrastructure we use to store the data. The schema information and the data are both stored in globals.
o Developers can index their data.
o The document paradigm enabled by the Globals Document Store (GDS) API enables a query language for data stored using the GDS API. GDS is also an example of how to build a storage paradigm in Globals. Globals APIs are open source and available on the github link.
o Globals is fast and efficient at storing data. We know performance is one of many hallmarks of NoSQL. Globals can store data at rates exceeding 100,000 objects/records per second.
o Different technology APIs are available for use with Globals. We’ve released 2 Java APIs and the JavaScript API is imminent.

Q3. How do you position Globals with respect to Caché? Who should use Globals and who should use Caché?

Iran Hutchinson: Today, Globals offers multi-dimensional array storage, whereas Caché offers a much richer set of features. Caché (and the InterSystems technology it powers including Ensemble, DeepSee, HealthShare, and TrakCare) offers a core underlying object technology, native web services, distributed communication via ECP (Enterprise Cache Protocol), strategies for high availability, interactive development environment, industry standard data access (JDBC, ODBC, SQL, XML, etc.) and a host of other enterprise ready features.

Anyone can use Globals or Caché to tackle challenges with large data volumes (terabytes, petabytes, etc.), high transactions (100,000+ per second), and complex data (healthcare, financial, aerospace, etc.). However, Caché provides much of the needed out-of-box tooling and technology to get started rapidly building solutions in our core technology, as well as a variety of languages. Currently provided as Java APIs, Globals is a toolkit to build the infrastructure already provided by Caché. Use Caché if you want to get started today; use Globals if you have a keen interest in building the infrastructure of your data management system.

Q4. Globals offers multi-dimensional array storage. Can you please briefly explain this feature, and how this can be beneficial for developers?

Iran Hutchinson: It is beneficial to go here. I grabbed the following paragraphs directly from this page:

Summary Definition: A global is a persistent sparse multi-dimensional array, which consists of one or more storage elements or “nodes”. Each node is identified by a node reference (which is, essentially, its logical address). Each node consists of a name (the name of the global to which this node belongs) and zero or more subscripts.

Subscripts may be of any of the types String, int, long, or double. Subscripts of any of these types can be mixed among the nodes of the same global, at the same or different levels.

Benefits for developers: Globals does not limit developers to using objects, key-value, or any other type of storage paradigm. Developers are free to think of the optimal storage paradigm for what they are working on. With this flexibility, and the history of successful applications powered by globals, we think developers can begin building applications with confidence.
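The summary definition above can be pictured with a small in-memory model. The Python sketch below is only an illustration of the data structure: it is not the Globals API, it does not persist anything, and the class `SparseGlobal` with its methods `set`, `get` and `children` is hypothetical. The Java subscript types String, int, long and double are approximated by Python's `str`, `int` and `float`.

```python
class SparseGlobal:
    """In-memory stand-in for one global; nothing here is persisted."""

    ALLOWED = (str, int, float)  # rough Python analogue of String/int/long/double

    def __init__(self, name):
        self.name = name
        self._nodes = {}  # subscript tuple -> value; only nodes that were set exist

    def _check(self, subscripts):
        for s in subscripts:
            if isinstance(s, bool) or not isinstance(s, self.ALLOWED):
                raise TypeError(f"unsupported subscript type: {type(s).__name__}")
        return tuple(subscripts)

    def set(self, value, *subscripts):
        """Store a value at the node addressed by zero or more subscripts."""
        self._nodes[self._check(subscripts)] = value

    def get(self, *subscripts, default=None):
        """Return the value at a node, or default if the node was never set."""
        return self._nodes.get(self._check(subscripts), default)

    def children(self, *subscripts):
        """Distinct next-level subscripts found below the given node."""
        prefix = self._check(subscripts)
        n = len(prefix)
        return {k[n] for k in self._nodes if len(k) > n and k[:n] == prefix}


patients = SparseGlobal("patients")
patients.set("Smith", 1001, "name")          # int and str subscripts mixed
patients.set(72.5, 1001, "weight", 2011)     # a deeper level under the same node
patients.set("Jones", "temp-7", "name")      # str subscript at the first level

print(patients.get(1001, "name"))
print(patients.children(1001))
```

Because only the nodes that were explicitly set occupy space, the array is sparse: asking for `patients.get(1002, "name")` simply yields the default. How a developer lays out the subscripts (as key-value pairs, documents, or object fields) is exactly the freedom of paradigm the answer describes.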

Q5. Globals does not include Objects. Is it possible to use Globals if my data is made of Java objects? If yes, how?

Iran Hutchinson: Globals exposes a multi-dimensional sparse array directly to Java and other languages. While Globals itself does not include direct Java object storage technology like JPA or JDO, one can easily store and retrieve data in Java objects using the APIs documented here. Anyone can also extend Globals to support popular Java object storage and retrieval interfaces.

One of the core concepts in Globals is that it is not limited to a paradigm, like objects, but can be used in many paradigms. As an example, the new GDS (Globals Document Store) API enables developers to use the NoSQL document paradigm to store their objects in Globals. GDS is available here (more docs to come).

Q6. Is Globals open source?

Iran Hutchinson: Globals itself is not open source. However, the Globals APIs hosted at the GitHub location are open source.

Q7. Do you plan to create a Globals Community? And if yes, what will you offer to the community and what do you expect back from the community?

Iran Hutchinson: We created a community for Globals from the beginning. One of the main goals of the Globals initiative is to create a thriving community around the technology, and applications built on the technology.
We offer the community:
• Proven core data management technology
• An enthusiastic technology partner that will continue to evolve and support the project
  ◦ Marketing the project globally
  ◦ Continual underlying technology evolution
  ◦ Involvement in the forums and open source technology development
  ◦ Participation in or hosting events and contests around Globals
• A venue to not only express ideas, but take a direct role in bringing those ideas to life in technology
• For those who want to build a business around Globals, 30+ years of experience in supplying software developers with the technology to build successful breakthrough applications.


Iran Hutchinson serves as product manager and software/systems architect at InterSystems. He is one of the people behind the Globals project. He has held architecture and development positions at startups and Fortune 50 companies. He focuses on language platforms, data management technologies, distributed/cloud computing, and high performance computing. When not on the trail talking with fellow geeks or behind the computer, you can find him eating (just look for the nearest steak house).


Globals is a free database from InterSystems. Globals offers multi-dimensional storage. The first version is for Java. Software | Intermediate | English | LINK | May 2011

Globals APIs
The Globals APIs are open source and available at the GitHub location.

Related Posts

Interview with Jonathan Ellis, project chair of Apache Cassandra.

The evolving market for NoSQL Databases: Interview with James Phillips.

“Marrying objects with graphs”: Interview with Darren Wood.

“Distributed joins are hard to scale”: Interview with Dwight Merriman.

On Graph Databases: Interview with Daniel Kirstenpfad.

Interview with Rick Cattell: There is no “one size fits all” solution.
