The first comprehensive textbook on data integration, from theoretical principles to implementation issues and current challenges raised by the semantic web and cloud computing. How do you approach answering queries when your data is stored in multiple databases that were designed independently by different people? This is the first comprehensive book on data integration and is written by three of the most respected experts in the field.
This book provides an extensive introduction to the theory and concepts underlying today’s data integration techniques, with detailed instruction for their application and concrete examples throughout to explain the concepts. Data integration is the problem of answering queries that span multiple data sources (e.g., databases, web pages). Data integration problems surface in multiple contexts, including enterprise information integration, query processing on the Web, coordination between government agencies, and collaboration between scientists. In some cases, data integration is the key bottleneck to making progress in a field. The authors provide a working knowledge of data integration concepts and techniques, giving you the tools you need to develop a complete and concise package of algorithms and applications.
In order for a data integration system to process a query over a set of data sources, the system must know which sources are available, what data exist in each source, and how each source can be accessed. The source descriptions in a data integration system encode this information. In this chapter we study the different components of source descriptions and identify the trade-offs involved in designing formalisms for source descriptions. To put the topic of this chapter in context, consider the architecture of a data integration system, redrawn in Figure 3.1. Recall that a user (or an application) poses a query to the data integration system using the relations and attributes of the mediated schema. The system then reformulates the query into a query over the data sources. The result of the reformulation is called a logical query plan. The logical query plan is later optimized so it runs efficiently. In this chapter we show how source descriptions are expressed and how the system uses them to reformulate the user’s query into a logical query plan.
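To make the role of source descriptions concrete, the following is a minimal sketch in Python. It is a deliberately simplified illustration, not one of the formalisms the chapter develops: the `SourceDescription` class, the `reformulate` function, the source names `S1` to `S3`, the `Movie` and `Review` relations, and the access labels are all hypothetical. Each description records which source exists, what data it holds in terms of the mediated schema, and how it is accessed; reformulation then selects the sources able to answer a query.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SourceDescription:
    """Hypothetical, simplified source description."""
    name: str          # which source is available
    relation: str      # mediated-schema relation the source holds data for
    attributes: tuple  # mediated-schema attributes the source can return
    access: str        # how the source is accessed (illustrative label)


def reformulate(relation, wanted, sources):
    """Turn a query over the mediated schema into a toy logical plan.

    The plan is a list with one (source, access, attributes) step for
    every source that covers the relation and all requested attributes.
    """
    plan = []
    for s in sources:
        if s.relation == relation and set(wanted) <= set(s.attributes):
            plan.append((s.name, s.access, tuple(wanted)))
    return plan


# Assumed example sources; none of these come from the book.
SOURCES = [
    SourceDescription("S1", "Movie", ("title", "director", "year"), "sql"),
    SourceDescription("S2", "Movie", ("title", "year"), "web-form"),
    SourceDescription("S3", "Review", ("title", "rating"), "sql"),
]

plan = reformulate("Movie", ["title", "year"], SOURCES)
for step in plan:
    print(step)
```

Here the query asks for `title` and `year` of `Movie`, so both `S1` and `S2` appear in the plan while `S3` is skipped. A real system would also combine sources through joins, handle sources that cover only part of a relation, and pass the plan to an optimizer, none of which this sketch attempts.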
The World Wide Web offers a vast array of data in many forms. The majority of this data is structured for presentation not to machines but to humans, in the form of HTML tables, lists, and forms-based search interfaces. These extremely heterogeneous sources were created by individuals around the world and cover a very broad collection of topics in over 100 languages. Building systems that offer data integration services on this vast collection of data requires many of the techniques described thus far in the book, but also raises its own unique challenges. While the Web offers many kinds of structured content, including XML (discussed in Chapter 11) and RDF (discussed in Chapter 12), the predominant representation by far is HTML.
Structured data appears on HTML pages in several forms. Figure 15.1 shows the most common forms: HTML tables, HTML lists, and formatted “cards” or templates. Chapter 9 discusses how one might extract content from a given HTML page. However, there are a number of additional challenges posed by Web data integration, beyond the task of wrapping pages.
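As a small illustration of the first of these forms, the sketch below pulls the rows out of an HTML table using only Python's standard `html.parser` module. It is not the wrapper technique of Chapter 9: the `TableExtractor` class, the `extract_rows` helper, and the sample page are assumptions made for this example, and the code ignores nested tables, `rowspan`/`colspan`, and tables used purely for layout.

```python
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None   # cells of the row being read, or None
        self._cell = None  # text fragments of the cell being read, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        # Text outside a cell (e.g. whitespace between tags) is ignored.
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_rows(html):
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    return parser.rows


# Made-up sample page for illustration.
PAGE = """
<table>
  <tr><th>City</th><th>Country</th></tr>
  <tr><td>Lima</td><td>Peru</td></tr>
  <tr><td>Oslo</td><td>Norway</td></tr>
</table>
"""

rows = extract_rows(PAGE)
header = rows[0]
records = [dict(zip(header, row)) for row in rows[1:]]
print(records)
```

Treating the first row as a header is itself an assumption. On real Web pages the header may be missing, repeated, or spread over several rows, which is one reason that wrapping a page is only the start of the integration problem.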