We study information goals and patterns of attention in exploratory search for health information on the Web, reporting results of a large-scale log-based study. We examine search activity associated with the goal of diagnosing illness from symptoms versus more general information-seeking about health and illness. We decompose exploratory health search into evidence-based and hypothesis-directed information seeking. Evidence-based search centers on the pursuit of details and relevance of signs and symptoms. Hypothesis-directed search includes the pursuit of content on one or more illnesses, including risk factors, treatments, and therapies for illnesses, and on the discrimination among different diseases under the uncertainty that exists in advance of a confirmed diagnosis. These different goals of exploratory health search are not independent, and transitions can occur between them within or across search sessions. We construct a classifier that identifies medically-related search sessions in log data. Given a set of search sessions flagged as health-related, we show how we can identify different intentions persisting as foci of attention within those sessions. Finally, we discuss how insights about foci dynamics can help us better understand exploratory health search behavior and better support health search on the Web.
Machine-generated documents such as email or dynamic web pages are single instantiations of a pre-defined structural template. As such, they can be viewed as a hierarchy of template and document specific content. This hierarchical template representation has several important advantages for document clustering and classification. First, templates capture common topics among the documents, while filtering out the potentially noisy variabilities such as personal information. Second, template representations scale far better than document representations since a single template captures numerous documents. Finally, since templates group together structurally similar documents, they can propagate properties between all the documents that match the template. In this paper, we use these advantages for document classification by formulating an efficient and effective hierarchical label propagation and discovery algorithm. The labels are propagated first over a template graph (constructed based on either term-based or topic-based similarities), and then to the matching documents. We evaluate the performance of the proposed algorithm using a large donated email corpus and show that the resulting template graph is significantly more compact than the corresponding document graph and the hierarchical label propagation is both efficient and effective in increasing the coverage of the baseline document classification algorithm. We demonstrate that the template label propagation achieves more than 91% precision and 93% recall, while increasing the label coverage by more than 11%.
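The two-stage propagation described above (labels spread over the template graph first, then down to matching documents) can be illustrated with a minimal sketch. This is not the paper's algorithm: the weighted-vote rule, the `threshold` parameter, and the function names `propagate_template_labels` and `label_documents` are assumptions made only for illustration.

```python
from collections import defaultdict


def propagate_template_labels(edges, seed_labels, threshold=0.5, max_iters=10):
    """Spread labels over a template graph by weighted neighbor voting.

    edges: {(template_a, template_b): similarity}, similarities assumed > 0.
    seed_labels: {template: label} for the templates labeled up front.
    threshold: assumed cutoff on the share of edge weight backing a label.
    """
    neighbors = defaultdict(dict)
    for (u, v), weight in edges.items():
        neighbors[u][v] = weight
        neighbors[v][u] = weight

    labels = dict(seed_labels)
    for _ in range(max_iters):
        updates = {}
        for node, nbrs in neighbors.items():
            if node in labels:
                continue
            total = sum(nbrs.values())
            votes = defaultdict(float)
            for nbr, weight in nbrs.items():
                if nbr in labels:
                    votes[labels[nbr]] += weight
            if votes and total > 0:
                label, support = max(votes.items(), key=lambda kv: kv[1])
                if support / total >= threshold:
                    updates[node] = label
        if not updates:
            break  # nothing changed in this round, so propagation has converged
        labels.update(updates)
    return labels


def label_documents(doc_to_template, template_labels):
    """Second stage: each document inherits the label of its template."""
    return {
        doc: template_labels[template]
        for doc, template in doc_to_template.items()
        if template in template_labels
    }


# Hypothetical toy data: five templates, one of them labeled in advance.
edges = {("t1", "t2"): 0.9, ("t2", "t3"): 0.8, ("t4", "t5"): 0.7}
template_labels = propagate_template_labels(edges, {"t1": "receipt"})
doc_labels = label_documents({"d1": "t2", "d2": "t3", "d3": "t4"}, template_labels)
```

Because propagation runs over templates rather than documents, the graph in this sketch has one node per template no matter how many documents match it, which is the compactness argument the abstract makes.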
Unsupervised template induction over email data is a central component in applications such as information extraction, document classification, and auto-reply. The benefits of automatically generating such templates are known for structured data, e.g., machine-generated HTML emails. However, much less work has been done in performing the same task over unstructured email data. We propose a technique for inducing high-quality templates from plain text emails at scale based on the suffix array data structure. We evaluate this method against an industry-standard approach for finding similar content based on shingling, running both algorithms over two corpora: a synthetically created email corpus for a high level of experimental control, as well as user-generated emails from the well-known Enron email corpus. Our experimental results show that the proposed method is more robust to variations in cluster quality than the baseline and that its templates contain more text from the emails, which would benefit extraction tasks by identifying transient parts of the emails. Our study indicates that templates induced using suffix arrays contain approximately half as much noise (measured as entropy) as templates induced using shingling. Furthermore, the suffix array approach is substantially more scalable, proving to be an order of magnitude faster than shingling even for modestly sized training clusters. Public corpus analysis shows that email clusters contain on average 4 segments of common phrases, where each of the segments contains on average 9 words, thus showing that templatization could help users reduce the email writing effort by an average of 35 words per email in an assistance or auto-reply related task.
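To make the suffix array idea concrete, here is a minimal sketch that finds the longest phrase shared by two plain text emails. It is only an illustration under stated assumptions, not the paper's induction method: it works on whitespace tokens, handles just two emails, builds the suffix array naively by sorting, and the function name `longest_common_phrase` and the sentinel token are hypothetical.

```python
def longest_common_phrase(email_a, email_b):
    """Return the longest token sequence shared by two emails.

    Builds a naive token-level suffix array over both emails joined by a
    sentinel, then compares adjacent suffixes that come from different emails.
    """
    tokens_a, tokens_b = email_a.split(), email_b.split()
    sentinel = "\x00"  # assumed never to occur as a real token
    seq = tokens_a + [sentinel] + tokens_b
    boundary = len(tokens_a)  # positions before this index belong to email_a

    # Naive construction by sorting suffixes; fine for a sketch, too slow at scale.
    suffix_array = sorted(range(len(seq)), key=lambda i: seq[i:])

    best = []
    for x, y in zip(suffix_array, suffix_array[1:]):
        if (x < boundary) == (y < boundary):
            continue  # both suffixes start in the same email
        # Length of the common prefix, never matching across the sentinel.
        n = 0
        while (
            x + n < len(seq)
            and y + n < len(seq)
            and seq[x + n] == seq[y + n]
            and seq[x + n] != sentinel
        ):
            n += 1
        if n > len(best):
            best = seq[x:x + n]
    return " ".join(best)


# Hypothetical example emails that differ only in the variable fields.
shared = longest_common_phrase(
    "Hi Bob your order 123 has shipped today",
    "Hi Alice your order 456 has shipped today",
)
```

In this toy pair the fixed segments are "Hi", "your order", and "has shipped today", while the name and order number are the transient parts; a template would keep the former and mark the latter as slots.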
Pseudo-relevance feedback (PRF) improves search quality by expanding the query using terms from high-ranking documents from an initial retrieval. Although PRF can often result in large gains in effectiveness, running two queries is time consuming, limiting its applicability. We describe a PRF method that uses corpus pre-processing to achieve query-time speeds that are near those of the original queries. Specifically, Relevance Modeling, a language modeling based PRF method, can be recast to benefit substantially from finding pairwise document relationships in advance. Using the resulting Fast Relevance Model (fastRM), we substantially reduce the online retrieval time and still benefit from expansion. We further explore methods for reducing the preprocessing time and storage requirements of the approach, allowing us to achieve up to a 10% increase in MAP over unexpanded retrieval, while only requiring 1% of the time of standard expansion.
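For orientation, the standard two-pass expansion that the abstract contrasts against can be sketched as follows. This is a simplified relevance-model-style estimate, not fastRM itself (no pairwise document relationships are precomputed), and the smoothing value `mu`, the cutoffs `k` and `n_terms`, and the function names are assumptions chosen for illustration.

```python
import math
from collections import Counter


def query_log_likelihood(query, doc, coll_tf, coll_len, mu=10.0):
    """log P(query | doc) under Dirichlet smoothing (mu is an assumed value)."""
    tf = Counter(doc)
    score = 0.0
    for term in query:
        p_coll = coll_tf[term] / coll_len
        p = (tf[term] + mu * p_coll) / (len(doc) + mu)
        if p == 0.0:
            return float("-inf")  # term never occurs in the collection
        score += math.log(p)
    return score


def expansion_terms(query, docs, k=2, n_terms=3):
    """First pass: rank docs. Then weight each top doc's terms by its score."""
    coll_tf = Counter(term for doc in docs for term in doc)
    coll_len = sum(coll_tf.values())
    ranked = sorted(
        ((query_log_likelihood(query, doc, coll_tf, coll_len), i)
         for i, doc in enumerate(docs)),
        reverse=True,
    )[:k]
    ranked = [(score, i) for score, i in ranked if score > float("-inf")]
    if not ranked:
        return []

    # Turn log-likelihoods into normalized document weights.
    max_score = max(score for score, _ in ranked)
    weights = {i: math.exp(score - max_score) for score, i in ranked}
    norm = sum(weights.values())

    # P(w | R) is approximated by sum over top docs of P(w | d) * weight(d).
    relevance_model = Counter()
    for i, weight in weights.items():
        for term, count in Counter(docs[i]).items():
            relevance_model[term] += (weight / norm) * count / len(docs[i])

    candidates = [t for t, _ in relevance_model.most_common() if t not in query]
    return candidates[:n_terms]


# Hypothetical toy collection of tokenized documents.
docs = [
    ["jaguar", "car", "engine"],
    ["jaguar", "cat", "jungle"],
    ["stock", "market", "price"],
]
new_terms = expansion_terms(["jaguar", "car"], docs, k=1)
```

The expanded query (original terms plus `new_terms`) would then be run as a second retrieval; that second pass is the query-time cost the abstract's method aims to avoid by moving work into corpus pre-processing.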