Maximal termsets as a query structuring mechanism

Research and Advanced Technology for Digital Libraries

2010

Given a set of keyphrases, we analyze how Web queries with these phrases can be formed that, taken altogether, return a specified number of hits. The use case of this problem is a plagiarism detection system that searches the Web for potentially plagiarized passages in a given suspicious document. For the query formulation problem we develop a heuristic search strategy based on co-occurrence probabilities. Compared to the maximal termset strategy [3], which can be considered the most sensible non-heuristic baseline, our expected savings are on average 50% when queries for 9 or 10 phrases are to be constructed.

Section: Basic Definitions and The Baseline Methods

mentioning

confidence: 99%

Section: Introduction

mentioning

confidence: 99%

Capacity-Constrained Query Formulation

Research and Advanced Technology for Digital Libraries

2010

“…Shapiro and Taksa [11] suggest a rather simple open-end query formulation approach for which it is straightforward to find situations where the approach fails although appropriate queries exist. A more involved maximal termset method is proposed by Pôssas et al. [10]. However, both approaches focus on finding a whole set of queries instead of just one maximum query and neither Shapiro and Taksa …”

Section: B Related Work

mentioning

confidence: 99%

Search Strategies for Keyword-based Queries

2010 Workshops on Database and Expert Systems Applications

2010

Given a set of keywords, we find a maximum Web query (containing the most keywords possible) that respects user-defined bounds on the number of returned hits. We assume a real-world setting where the user is not given direct access to a Web search engine's index, i.e., querying is possible only through an interface. The goal to be optimized is the overall number of submitted Web queries. One original contribution of our research is the formalization and theoretical foundation of the problem. But, in particular, we develop a co-occurrence probability informed search strategy for the problem. The performance gain achieved with our approach is substantial: compared to the uninformed baseline (without co-occurrence information) the expected savings are up to 20% in the number of submitted queries and runtime.

“…The retrieved web documents can then be delivered to a text reuse detection system for an in-depth analysis. We focus on the query formulation problem as the crucial first step in the detection process and present a new query formulation strategy that achieves convincing results: compared to a maximal termset query formulation strategy [10,14], which is the most sensible non-heuristic baseline, we save on average 70% of the queries in realistic experiments. With respect to the candidate documents' quality, our heuristic retrieves documents that are, on average, more similar to the given document than the results of previously published query formulation strategies [4,8].…”

mentioning

confidence: 99%

Candidate Document Retrieval for Web-Scale Text Reuse Detection

String Processing and Information Retrieval

2011

Given a document d, the task of text reuse detection is to find those passages in d which in identical or paraphrased form also appear in other documents. To solve this problem at web-scale, keywords representing d's topics have to be combined to web queries. The retrieved web documents can then be delivered to a text reuse detection system for an in-depth analysis. We focus on the query formulation problem as the crucial first step in the detection process and present a new query formulation strategy that achieves convincing results: compared to a maximal termset query formulation strategy [10,14], which is the most sensible non-heuristic baseline, we save on average 70% of the queries in realistic experiments. With respect to the candidate documents' quality, our heuristic retrieves documents that are, on average, more similar to the given document than the results of previously published query formulation strategies [4,8].