Inverted index structures are the mainstay of modern text retrieval systems. They can be constructed quickly using off-line merge-based methods, and provide efficient support for a variety of querying modes. In this paper we examine the task of on-line index construction, that is, how to build an inverted index when the underlying data must be continuously queryable, and the documents must be indexed and available for search as soon as they are inserted. When straightforward approaches are used, document insertions become increasingly expensive as the size of the database grows. This paper describes a mechanism based on controlled partitioning that can be adapted to suit different balances of insertion and querying operations, and is faster and scales better than previous methods. Using experiments on 100 GB of web data we demonstrate the efficiency of our methods in practice, showing that they dramatically reduce the cost of on-line index construction.
Inverted index structures are a core element of current text retrieval systems. They can be constructed quickly using offline approaches, in which one or more passes are made over a static set of input data, and, at the completion of the process, an index is available for querying. However, there are search environments in which even a small delay in timeliness cannot be tolerated, and the index must always be queryable and up to date. Here we describe and analyze a geometric partitioning mechanism for online index construction that provides a range of tradeoffs between costs, and can be adapted to different balances of insertion and querying operations. Detailed experimental results are provided that show the extent of these tradeoffs, and that these new methods can yield substantial savings in online indexing costs.
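The abstract does not spell out how the partitions are managed, so the following Python sketch is only one plausible reading of a geometric partitioning scheme, not the authors' implementation. The class name `GeometricIndex`, the parameters `buffer_size` (b) and `radix` (r), and the capacity rule (r - 1) * r^k * b for level k are all assumptions made for illustration. New postings accumulate in an in-memory buffer; when the buffer fills, it is merged into the smallest partition, and any partition that outgrows its capacity is merged into the next level.

```python
from collections import defaultdict


def merge(old, new):
    """Merge two term -> postings maps, keeping older postings first."""
    merged = {term: list(postings) for term, postings in old.items()}
    for term, postings in new.items():
        merged.setdefault(term, []).extend(postings)
    return merged


def size(partition):
    """Total number of postings stored in a partition."""
    return sum(len(postings) for postings in partition.values())


class GeometricIndex:
    """Toy inverted index whose partitions grow geometrically (hypothetical)."""

    def __init__(self, buffer_size=4, radix=2):
        self.b = buffer_size              # postings buffered before a flush
        self.r = radix                    # growth factor between levels
        self.buffer = defaultdict(list)   # in-memory term -> doc ids
        self.buffered = 0
        self.partitions = []              # partitions[k] is a dict or None

    def capacity(self, level):
        # Assumed limit for level k (0-based): (r - 1) * r**k * b postings.
        return (self.r - 1) * self.r ** level * self.b

    def insert(self, doc_id, terms):
        for term in set(terms):
            self.buffer[term].append(doc_id)
            self.buffered += 1
        if self.buffered >= self.b:
            self._flush()

    def _flush(self):
        carry = dict(self.buffer)
        self.buffer = defaultdict(list)
        self.buffered = 0
        level = 0
        while True:
            if level == len(self.partitions):
                self.partitions.append(None)
            existing = self.partitions[level]
            if existing is not None:
                # Absorb this level's partition into the data being carried.
                carry = merge(existing, carry)
                self.partitions[level] = None
            if size(carry) <= self.capacity(level):
                self.partitions[level] = carry
                return
            level += 1                    # too big: cascade to the next level

    def query(self, term):
        # Every partition and the buffer must be consulted, so the index
        # stays queryable immediately after each insertion.
        result = list(self.buffer.get(term, []))
        for partition in self.partitions:
            if partition:
                result.extend(partition.get(term, []))
        return sorted(result)
```

In this sketch a larger `radix` means fewer partitions to consult per query but more merging work per insertion, which is the kind of insertion-versus-query tradeoff the abstract refers to.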
In certain English finite complement clauses, inclusion of the complementizer "that" is optional. Previous research has identified various factors that influence when native speakers tend to produce or omit the complementizer, including syntactic weight, clause juncture constraints, and predicate frequency. The present study addresses the question of the extent to which German and Spanish learners of English as a second language (L2) produce and omit the complementizer under similar conditions. A total of 3,622 instances of English adjectival, object, and subject complement constructions were retrieved from the International Corpus of English and the German and Spanish components of the International Corpus of Learner English. A logistic regression model suggests that L2 learners’ and natives’ production is largely governed by the same factors. However, in comparison with native speakers, L2 learners display a lower rate of complementizer omission. They are more impacted by processing-related factors such as complexity and clause juncture, and less sensitive to verb-construction cue validity.