While the rapid introduction of novel tasks and new datasets fosters active research in a community, keeping track of the resulting abundance of research activity across different areas and datasets is likely to become increasingly difficult. The community could greatly benefit from an automatic system able to summarize scientific results, e.g., in the form of a leaderboard. In this paper we build two datasets and develop a framework (TDMS-IE) aimed at automatically extracting task, dataset, metric, and score from NLP papers, towards the automatic construction of leaderboards. Experiments show that our model outperforms several baselines by a large margin. Our model is a first step towards automatic leaderboard construction, e.g., in the NLP domain.
Search task extraction in information retrieval is the process of identifying search intents over a set of queries relating to the same topical information need. Search tasks may span multiple search sessions. Most existing research on search task extraction has focused on identifying tasks within a single session, where the notion of a session is defined by a fixed-length time window. By contrast, in this work we seek to identify tasks that span multiple sessions. To identify tasks, we conduct a global analysis of a query log in its entirety, without restricting analysis to individual temporal windows. To capture inherent task semantics, we represent queries as vectors in an abstract space. We learn the embedding of query words in this space by leveraging the temporal and lexical contexts of queries. To evaluate the effectiveness of the proposed query embedding, we conduct experiments on clustering queries into tasks, with particular interest in measuring cross-session search task recall. Results of our experiments demonstrate that task extraction effectiveness, including cross-session recall, is improved significantly by our proposed method of embedding query terms using their temporal and lexical contexts.
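The abstract above describes embedding query words from their temporal and lexical contexts and then clustering query vectors into tasks regardless of session boundaries. The following is a minimal toy sketch of that idea, not the authors' actual model: word vectors are built from simple co-occurrence counts (same-query words give lexical context; words in temporally adjacent log entries give temporal context), queries are averaged word vectors, and a greedy cosine-threshold pass groups them into tasks. The example log, the window size, and the similarity threshold are all illustrative assumptions.

```python
import math
from collections import Counter

# Hypothetical toy query log: (session_id, query) pairs in temporal order.
LOG = [
    (1, "python pandas merge"),
    (1, "pandas join dataframes"),
    (2, "flight tickets paris"),
    (2, "cheap flights paris"),
    (3, "pandas merge on index"),
]

def train_embeddings(log, window=1):
    """Toy word vectors from co-occurrence counts: a word's context is the
    words of its own query (lexical) plus the words of temporally
    neighbouring queries in the log (temporal)."""
    vocab = sorted({w for _, q in log for w in q.split()})
    index = {w: i for i, w in enumerate(vocab)}
    vecs = {w: [0.0] * len(vocab) for w in vocab}
    queries = [q.split() for _, q in log]
    for i, words in enumerate(queries):
        lo, hi = max(0, i - window), min(len(queries), i + window + 1)
        context = [w for j in range(lo, hi) for w in queries[j]]
        for w in words:
            for c in context:
                if c != w:
                    vecs[w][index[c]] += 1.0
    return vecs

def embed_query(query, vecs):
    """Represent a query as the mean of its word vectors."""
    words = [w for w in query.split() if w in vecs]
    dim = len(next(iter(vecs.values())))
    out = [0.0] * dim
    for w in words:
        for k, v in enumerate(vecs[w]):
            out[k] += v / len(words)
    return out

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def cluster_tasks(log, vecs, threshold=0.3):
    """Greedy clustering: attach each query to the first cluster whose
    representative vector is similar enough. Session ids are ignored,
    so a task can collect queries from different sessions."""
    clusters = []  # list of (representative_vector, [queries])
    for _, q in log:
        v = embed_query(q, vecs)
        for rep, members in clusters:
            if cosine(v, rep) >= threshold:
                members.append(q)
                break
        else:
            clusters.append((v, [q]))
    return [members for _, members in clusters]
```

A real system would replace the count vectors with learned embeddings (e.g., a word2vec-style model trained over the log) and a stronger clustering algorithm; the sketch only shows how ignoring session boundaries lets cross-session queries land in the same task cluster.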
Queries in patent prior art search, which are full patent applications, are much longer than standard ad hoc search and web search topics. Standard information retrieval (IR) techniques are not entirely effective for patent prior art search because of the presence of ambiguous terms in these massive queries. Reducing patent queries by extracting small numbers of key terms has been shown to be ineffective, mainly because it is not clear what the focus of the query is. An optimal query reduction algorithm must thus seek to retain the terms useful for retrieval, favouring recall of relevant patents, while removing terms that impair retrieval effectiveness. We propose a new query reduction technique that decomposes a patent application into constituent text segments and computes Language Modeling (LM) similarities by calculating the probability of generating each segment from the top-ranked documents. We reduce a patent query by removing the least similar segments, hypothesizing that removing the segments most dissimilar to the pseudo-relevant documents can increase retrieval precision by discarding non-useful context, while retaining the useful context needed for high recall. Experiments on the CLEF-IP 2010 patent prior art search collection show that the proposed method outperforms standard pseudo-relevance feedback (PRF) and a naive query reduction method based on removal of unit frequency terms (UFTs).
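The reduction step described above can be sketched in a few lines: score each query segment by the (smoothed) probability of generating its terms from a unigram language model of the pseudo-relevant documents, then drop the lowest-scoring segments. This is a simplified illustration under assumed inputs, not the paper's implementation: it uses add-one smoothing instead of a collection background model, and the example documents, segments, and `keep_ratio` are invented for demonstration.

```python
import math
from collections import Counter

def segment_score(segment, feedback_docs):
    """Average log-probability of generating the segment's terms from an
    add-one-smoothed unigram model of the pseudo-relevant documents."""
    counts = Counter(w for d in feedback_docs for w in d.split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # +1 slot for unseen terms
    terms = segment.split()
    score = sum(math.log((counts[w] + 1) / (total + vocab)) for w in terms)
    return score / len(terms)  # length-normalise so long segments aren't penalised

def reduce_query(segments, feedback_docs, keep_ratio=0.5):
    """Rank segments by LM similarity to the top-ranked (pseudo-relevant)
    documents and keep only the most similar fraction of them."""
    ranked = sorted(segments,
                    key=lambda s: segment_score(s, feedback_docs),
                    reverse=True)
    keep = max(1, int(len(ranked) * keep_ratio))
    return ranked[:keep]

# Illustrative usage with invented data: a segment about the invention's
# subject matter scores higher than boilerplate unrelated to the
# pseudo-relevant documents, so the boilerplate segment is removed.
feedback_docs = ["semiconductor wafer etching process",
                 "plasma etching of wafers"]
segments = ["legal boilerplate claims priority", "wafer etching method"]
reduced = reduce_query(segments, feedback_docs, keep_ratio=0.5)
```

A production system would smooth against the full collection statistics rather than add-one counts, but the ranking principle, keeping segments the feedback model is likely to generate, is the same.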
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations: citations that display the context of the citation and indicate whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.