Skip to content

An Empirical Performance Evaluation of Relational Keyword Search Systems

  1.  

In the past decade, extending the keyword search paradigm to relational data has been an active area of research within the database and information retrieval (IR) community. A large number of approaches have been proposed and implemented, but despite numerous publications, there remains a severe lack of standardization for system evaluations. This lack of standardization has resulted in contradictory results from Different evaluations and the numerous discrepancies muddle what advantages are proffered by different approaches. In this Paper, we present a thorough empirical performance evaluation of relational keyword search systems. Our results indicate that many existing search techniques do not provide acceptable performance for realistic retrieval tasks. In particular, memory consumption precludes many search techniques from scaling beyond small datasets with tens of thousands of vertices. We also explore the relationship between execution time and factors varied in previous evaluations; our analysis indicates that these factors have relatively little impact on performance. In summary, our work confirms previous claims regarding the unacceptable performance of these systems and underscores the need for standardization—as exemplified by the IR community when evaluating these retrieval systems.

  1. INTRODUCTION

The ubiquitous search text box has transformed the way people interact with information. Nearly half of all Internet users use a search engine daily, performing in excess of 4 billion searches. The success of keyword search stems from what it does not require—namely, a specialized query language or knowledge of the underlying structure of the data. Internet users increasingly demand keyword search interfaces for accessing information, and it is natural to extend this paradigm to relational data. This extension has been an active area of research throughout the past decade. Despite a significant number of research papers being published in this area, no research prototypes have transitioned from proof-of-concept implementations into deployed systems. The lack of technology transfer coupled with discrepancies among existing evaluations indicates a need for a thorough, independent empirical evaluation of proposed search techniques.

As part of previous work in this area, we created the first benchmark to evaluate relational keyword search techniques. This benchmark satisfies calls from the research community to standardize the evaluation of these search techniques, and our evaluation of search effectiveness revealed that many search techniques perform comparably despite contrary claims in the literature. During our evaluation of search effectiveness, we were surprised by the difficulty we had searching our data sets. In particular, straightforward implementations of many search techniques could not scale to databases with hundreds of thousands of tuples, which forced us to write “lazy” versions of their core algorithms and reduce their memory footprint. Even then, we were surprised by the excessive runtime of many search techniques. Other researchers have recently reported similar experiences. Baid et al. state ,

Our shared experience with existing search techniques suggests that the ad hoc evaluations that appear in the literature are inadequate. This sentiment is supported by our survey of existing evaluations and by others who are familiar with the practices established by the IR community for the evaluation of retrieval systems. In this paper, we augment our prior work with an evaluation of existing search techniques’ runtime performance. Our findings indicate that much room for improvement exists.

1.3 LITRATURE SURVEY

SEARCH ENGINE USE, TECHNICAL REPORT, PEW INTERNET AND AM

PUBLICATION: D. Fallows, Life Project, http://www.pewinternet.org/Reports/ 2008/Search-Engine-Use.aspx. Aug. 2008.

Some 84% of adult internet users, about 108 million Americans, have used search engines to help them find information on the Web. Only the act of sending and receiving email, with about 120 million users, eclipses searching in popularity as an internet activity. On an average day, about 68 million Americans, or about 53% of internet users, will go online. More than half of them, over 38 million people, will use a search engine. Again, second in popularity only to sending or receiving email, searching is becoming a daily habit for about a third of all internet users. American internet users pose about 4 billion queries per month. Many of these include popular queries like “Britney Spears” or “The Bible,” and many others include unique search terms reflecting a user’s personal interest or need. Popular search engines retrieve and index only a fraction of the countless tens of billions of pages on the Web. Google, the leader in sheer numbers, recently announced the company had doubled the number of their indexed pages to 8 billion; Microsoft indexes 5 billion; Yahoo is estimated to index about 4 billion, and AskJeeves about 2.3 billion pages.1  Promises of more and more searchable data seem to emerge regularly, from searching local listings to searching inside books. There are hundreds, if not thousands, of search engines in the world, although just a few capture a high proportion of the audience. Users can search in a multitude of languages. While searchers vary dramatically in their habits, the average user spends a total of about 43 minutes a month conducting some 34 searches, viewing on average 1.9 pages per search.2 The term “average user” belies the wide variety of styles among searchers: slightly more than one third of adults polled here, 35%, will conduct a search at least once a day, with the most enthusiastic of them, about two thirds of that group, searching at least several times a day. Roughly another third, 36%, searches less than daily but at least weekly. And just under a third, 29%, searches every few weeks or less.

A FRAMEWORK FOR EVALUATING DATABASE KEYWORD SEARCH STRATEGIES

PUBLICATION: J. Coffman and A.C. Weaver, Proc. 19th ACM Int’l Conf. Information and Knowledge Management (CIKM ’10), pp. 729-738, Oct. 2010.

With regard to keyword search systems for structured data, research during the past decade has largely focused on performance. Researchers have validated their work using ad hoc experiments that may not re ect real-world workloads. We illustrate the wide deviation in existing evaluations and present an evaluation framework designed to validate the next decade of research in this eld. Our comparison of 9 state-of-the-art keyword search systems contradicts the retrieval ectiveness purported by existing evaluations and reinforces the need for standardized evaluation. Our results also suggest that there remains considerable room for improvement in this eld. We found that many techniques cannot scale to even moderately-sized datasets that contain roughly a million tuples. Given that existing databases are considerably larger than this threshold, our results motivate the creation of new algorithms and indexing techniques that scale to meet both current and future workloads.

KEYWORD SEARCH ON STRUCTURED AND SEMI-STRUCTURED DATA

PUBLICATION: Y. Chen, W. Wang, Z. Liu, and X. Lin, Proc. ACM SIGMOD Int’l Conf. Management of Data (SIGMOD ’09), pp. 1005-1010, June 2009.

Empowering users to access databases using simple keywords can relieve the users from the steep learning curve of mastering a structured query language and understanding complex and possibly fast evolving data schemas. In this tutorial, we give an overview of the state-of-the-art techniques for supporting keyword search on structured and semi-structured data, including query result de¯nition, ranking functions, result generation and top-k query processing, snippet generation, result clustering, query cleaning, performance optimization, and search quality evaluation. Various data models will be discussed, including relational data, XML data, graph-structured data, data streams, and work- °ows. We also discuss applications that are built upon keyword search, such as keyword based database selection, query generation, and analytical processing. Finally we identify the challenges and opportunities of future research to advance the ¯eld.

EVALUATING THE EFFECTIVENESS OF KEYWORD SEARCH

PUBLICATION:  W. Webber, IEEE Data Eng. Bull., vol. 33, no. 1, pp. 54-59, Mar. 2010.

The prevalence of free text search in web search engines has inspired recent interest in keyword search on relational databases. Whereas relational queries formally specify matching tuples, keyword queries are imprecise expressions of the user’s information need. The correctness of search results depends on the user’s subjective assessment. As a result, the empirical evaluation of a keyword retrieval system’s effectiveness is essential. In this paper, we examine the evolving practices and resources for effectiveness evaluation of keyword searches on relational databases. We compare practices with the longer-standing full-text evaluation methodologies in information retrieval. In the light of this comparison, we make some suggestions for the future development of the art in evaluating keyword search effectiveness.

CHAPTER 2

2.0 SYSTEM ANALYSIS

2.1 EXISTING SYSTEM:

In existing system, extending the keyword search paradigm to relational data has been an active area of research within the database and information retrieval (IR) community. A large number of approaches have been proposed and implemented, but despite numerous publications, there remains a severe lack of standardization for system evaluations. This lack of standardization has resulted in contradictory results from Different evaluations and the numerous discrepancies muddle what advantages are proffered by different approaches.

2.1.1 DISADVANTAGES:

Keyword Search without ranking.

Execution time is more.

2.2 PROPOSED SYSTEM:

In proposed system, empirical performance evaluation of relational keyword search systems. Our results indicate that many existing search techniques do not provide acceptable performance for realistic retrieval tasks. In particular, memory consumption precludes many search techniques from scaling beyond small datasets with tens of thousands of vertices. We also explore the relationship between execution time and factors varied in previous evaluations; our analysis indicates that these factors have relatively little impact on performance. In summary, our work confirms previous claims regarding the unacceptable performance of these systems and underscores the need for standardization as exemplified by the IR community when evaluating these retrieval systems.

2.2.1 ADVANTAGES:

Keyword Search with ranking.

Execution Time consumption is less.

File length and Execution time can be seen.

Ranking can be seen by using chart.

2.3 HARDWARE & SOFTWARE REQUIREMENTS:

2.3.1 HARDWARE REQUIREMENT:

v    Processor                                 –    Pentium –IV

  • Speed                                      –    1.1 GHz
    • RAM                                       –    256 MB (min)
    • Hard Disk                               –   20 GB
    • Floppy Drive                           –    1.44 MB
    • Key Board                              –    Standard Windows Keyboard
    • Mouse                                     –    Two or Three Button Mouse
    • Monitor                                   –    SVGA

 

2.3.2 SOFTWARE REQUIREMENTS:

  • Operating System                   :           Windows XP
  • Front End                                :           Microsoft Visual Studio .NET 2008
  • Back End                                :           MS-SQL Server 2005
  • Document                               :           MS-Office 2007

CHAPTER 3

3.0 SYSTEM DESIGN:

Data Flow Diagram / Use Case Diagram / Flow Diagram:

  • The DFD is also called as bubble chart. It is a simple graphical formalism that can be used to represent a system in terms of the input data to the system, various processing carried out on these data, and the output data is generated by the system
  • The data flow diagram (DFD) is one of the most important modeling tools. It is used to model the system components. These components are the system process, the data used by the process, an external entity that interacts with the system and the information flows in the system.
  • DFD shows how the information moves through the system and how it is modified by a series of transformations. It is a graphical technique that depicts information flow and the transformations that are applied as data moves from input to output.
  • DFD is also known as bubble chart. A DFD may be used to represent a system at any level of abstraction. DFD may be partitioned into levels that represent increasing information flow and functional detail.

NOTATION:

SOURCE OR DESTINATION OF DATA:

External sources or destinations, which may be people or organizations or other entities

 

DATA SOURCE:

Here the data referenced by a process is stored and retrieved.

 

PROCESS:

People, procedures or devices that produce data. The physical component is not identified.

DATA FLOW:

Data moves in a specific direction from an origin to a destination. The data flow is a “packet” of data.

MODELING RULES:

There are several common modeling rules when creating DFDs:

  1. All processes must have at least one data flow in and one data flow out.
  2. All processes should modify the incoming data, producing new forms of outgoing data.
  3. Each data store must be involved with at least one data flow.
  4. Each external entity must be involved with at least one data flow.
  5. A data flow must be attached to at least one process.

3.1 BLOCK DIAGRAM

3.2 DATAFLOW DIAGRAM

UML DIAGRAMS:

3.2 USE CASE DIAGRAM:

3.3 CLASS DIAGRAM:

3.4 SEQUENCE DIAGRAM:

3.5 ACTIVITY DIAGRAM:

CHAPTER 4

4.0 IMPLEMENTATION:

We reimplemented BANKS, DISCOVER, DISCOVER-II, and DPBF and obtained implementations of BANKS-II (i.e., the bidirectional search algorithm), BLINKS, and STAR. All the search techniques are implemented in Java. For some search techniques, we also had access to others’ implementations. Among the implementations, we found that our reimplementations generally outperform the implementations provided by others. Exceptions to this trend were the result of correcting significant implementation defects. Our experiments do not compare against traditional IR systems because more traditional systems do not consider the relationships among database tuples, which is an important aspect of relational keyword search. Our implementation of BANKS adheres to its original description although it queries the database dynamically to identify nodes (tuples) that contain query keywords. Our implementation of DISCOVER borrows its successor’s query processing techniques. Both DISCOVER and DISCOVER- II are executed with the sparse algorithm, which provides the best performance for queries with AND  semantics . BLINKS’s block index was created using breadth-first partitioning and contains 50 nodes per block. STAR uses the edge weighting scheme proposed by Ding et al. for undirected graphs.

Tags: