Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining 2013
DOI: 10.1145/2487575.2488195

Exploratory analysis of highly heterogeneous document collections

Abstract: We present an effective multifaceted system for exploratory analysis of highly heterogeneous document collections. Our system is based on intelligently tagging individual documents in a purely automated fashion and exploiting these tags in a powerful faceted browsing framework. Tagging strategies employed include both unsupervised and supervised approaches based on machine learning and natural language processing. As one of our key tagging strategies, we introduce the KERA algorithm (Keyword Extraction for Rep…

Cited by 8 publications (5 citation statements)
References 22 publications

“…Buchanan and Loizides () and Loizides and Buchanan () mention that (a) the scientific understanding of document triage is limited, (b) the research on triage is fragmented, and (c) triage has attracted little attention. Existing studies of triage include text classifiers for prioritization (Macskassy & Provost, ), display configurations (Bae et al, ), user activity logging (Badi et al, ; Bae et al, ), paper versus electronic documents (Buchanan & Loizides, ), enhancing section headings (Buchanan & Owen, ), visual search patterns (Loizides & Buchanan, ), and tag clouds (Maiya, Thompson, Loaiza‐Lemos, & Rolfe, ). Aside from the latter, studies use uniform document formats and small information spaces (e.g., 200 documents).…”
Section: Related Work (mentioning)
confidence: 99%
“…During the process of indexing and ingesting the DTIC document set into our search engine, we apply our extractors to encountered text and store both measured quantities and measured properties in the search engine index. In addition, the search engine performs keyphrase extraction on documents using the KERA algorithm described in [8]. Using Solr filter queries, extracted keyphrases can be used to produce a tag cloud for any subset of the document set.…”
Section: An Application: Mqsearch (mentioning)
confidence: 99%
“…From the tag cloud, we see that documents containing quantities measured in U/mL tend to cover topics such as breast cancer and prostate cancer research. The search results can be filtered further along any of these dimensions. Filtering by LDA-discovered topics is also supported but not shown in the figure [9].…”
Section: An Application: Mqsearch (mentioning)
confidence: 99%
“…In this section, we show that we can utilize the labeled resources created by our system to learn discriminative patterns that help us gain insights into a dataset (Don et al (2007), Larsen and Aone (1999), Cheng et al (2007), Maiya et al (2013)). We utilize the top n/5 unambiguous labeled instances for this task, where n is size of the dataset.…”
Section: Mining Patterns and Insights (mentioning)
confidence: 99%