Deep Web Content Monitoring

PhD Thesis by S. Mohammadreza Khelghati,  University of Twente.


In this chapter, we describe our motivations for this thesis, the research questions it raises, and its general structure.

Data is one of the keys to success. Whether you are a fraud detection officer in a tax office, a data journalist [48, 65] or a business analyst, your primary concern is to access all the data that is relevant to your topic of interest. In any of these roles, an in-depth analysis is infeasible without a comprehensive data collection. Therefore, broadening the coverage of available information is a vital task. In such an information-thirsty environment, accessing every source of information is valuable. This emphasizes the role of the web as one of the biggest and main sources of data [94]. The web has an abundance of valuable public data, continuously produced and shared [52]. Web data has been used for a wide range of applications (e.g., business competitive intelligence [143] or crawling the social web) to understand complex economic and social phenomena [52].

Nowadays, the most common approach to look for information on the web is posing queries on general search engines such as Google [61, 93, 94]. However, none of these search engines covers all the indexed web data [3, 19]. They also miss data behind web forms. This data, hidden from search engines behind search forms, is known as the hidden web or deep web. It is estimated that the deep web contains data at a scale several times bigger than the data accessible through search engines, called the surface web [15, 70, 115].

In accessing deep web data, finding all the interesting data sources and harvesting them through their own interfaces can be a difficult, time-consuming and tiring task. Considering the huge amount of information that might be related to one's information needs, it might even be impossible for a person to cover all the deep web sources of their interest. Therefore, there is a great demand for applications that can facilitate access to this big amount of data that is locked behind web search forms.

Of course, the availability of an up-to-date crawl including the deep web would definitely facilitate collecting all relevant information on a given entity. However, given the software and hardware requirements of keeping an up-to-date crawl of the whole web, this still seems impracticable even for big organizations like Google [3, 104, 105, 133].

Consequently, to access web data, a journalist has to resort to using general search engines, direct querying of the deep web, or both. In all of these cases, the laborious work of querying, navigating through results, downloading data, storing it and tracking its changes can significantly benefit from automatic web data access approaches [51, 65, 142].
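The query-navigate-download-store-track workflow described above can be sketched programmatically. The following is a minimal, illustrative sketch only, not the approach developed in this thesis: it assumes a hypothetical paginated search interface (the `fetch_page` stub stands in for posing a query to a real deep web search form over HTTP) and uses content hashes so that a later run could detect changed records.

```python
# Illustrative sketch of an automated deep-web harvesting loop.
# fetch_page is a stand-in for submitting a query to a search form;
# a real implementation would issue HTTP requests and parse results.
import hashlib

def fetch_page(query, page):
    """Hypothetical paginated search interface: returns the records
    on result page `page` for `query` (canned data for this sketch)."""
    data = {0: ["record-a", "record-b"], 1: ["record-c"]}
    return data.get(page, [])

def harvest(query):
    """Download all result pages for `query`, storing each record with
    a content hash so later runs can detect changes to the source."""
    store = {}
    page = 0
    while True:
        records = fetch_page(query, page)
        if not records:  # an empty page signals the last result page
            break
        for rec in records:
            store[rec] = hashlib.sha256(rec.encode()).hexdigest()
        page += 1
    return store

snapshot = harvest("example query")
print(len(snapshot))  # 3 records harvested
```

Comparing the stored hashes of two such snapshots taken at different times is one simple way to track changes in a source, which is one of the manual chores the paragraph above lists.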

With this thesis, we aim to develop a practical, automatic web data access approach that provides users with collections of all the data relevant to their topics of interest.
