Proceedings of the Ninth ACM International Conference on Web Search and Data Mining 2016
DOI: 10.1145/2835776.2835780

Hierarchical Label Propagation and Discovery for Machine Generated Email

Abstract: Machine-generated documents such as email or dynamic web pages are single instantiations of a pre-defined structural template. As such, they can be viewed as a hierarchy of template and document specific content. This hierarchical template representation has several important advantages for document clustering and classification. First, templates capture common topics among the documents, while filtering out the potentially noisy variabilities such as personal information. Second, template representations scal…

Cited by 22 publications (10 citation statements)
References 21 publications
“…For example, processing the cluster of emails belonging to a single template enables the ability to determine portions of these emails that are fixed across all instantiations. Resultant fixed text and even fixed images have been shown to be useful for classification tasks [30,37].…”
Section: Applications Of Templates (mentioning)
confidence: 99%
“…• Categories - This attribute type models the content using a fixed small set of topics or categories. It might be, for instance, an output of some textual classifier or clustering algorithm that runs over the private content [5,17,29].…”
Section: Attribute Types (mentioning)
confidence: 99%
“…• Structure - This attribute type models the content using its inherent structure, regardless of the content topic. For instance, for email corpora, a structure can be represented via structural templates [3,29]. For personal files, it can be represented, among other options, as a file type.…”
Section: Attribute Types (mentioning)
confidence: 99%
“…Kiritchenko and Matwin deal with the sparsity issue by using a co-training algorithm to build weak classifiers, then label unlabeled examples, and add the most confident predictions to the labeled set [13]. Somewhat similarly, Wendt et al utilize a graph-based label propagation algorithm to label unlabeled emails from a small set of labeled emails, but do so at the template level to improve scalability in very large mail provider systems [25].…”
Section: Email Content Mining (mentioning)
confidence: 99%
“…While there is very little published work on using structural templates for processing commercial email data, templates have been used for annotating semantic types within the DOM trees of emails [27] and used in hierarchical classification of emails [25]. To our knowledge, no techniques have yet been published that propose template induction for plain text email content.…”
Section: Template Induction (mentioning)
confidence: 99%
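
The Email Content Mining statement above describes graph-based label propagation from a small labeled set, run over templates rather than individual emails. The sketch below is only a generic illustration of that idea, not the method of the cited paper: the function name `propagate_labels`, the template identifiers, the edge weights, and the clamped weighted-averaging update are all assumptions made for the example.

```python
from collections import defaultdict


def propagate_labels(edges, seeds, num_iters=20):
    """Generic label propagation over a weighted, undirected template graph.

    edges: dict mapping (template_a, template_b) -> similarity weight (hypothetical values).
    seeds: dict mapping template -> known label (the small labeled set).
    Returns a dict mapping every template to its highest-scoring label, or None.
    """
    # Build a symmetric adjacency structure.
    neighbors = defaultdict(dict)
    for (a, b), weight in edges.items():
        neighbors[a][b] = weight
        neighbors[b][a] = weight

    labels = sorted(set(seeds.values()))

    # Every node starts with zero scores; seed nodes get 1.0 for their label.
    scores = {node: {label: 0.0 for label in labels} for node in neighbors}
    for node, label in seeds.items():
        scores.setdefault(node, {lab: 0.0 for lab in labels})[label] = 1.0

    for _ in range(num_iters):
        updated = {}
        for node, current in scores.items():
            total = sum(neighbors[node].values())
            if node in seeds or total == 0:
                # Seed labels are clamped; isolated nodes keep their scores.
                updated[node] = current
                continue
            # Unlabeled node: weighted average of its neighbors' scores.
            updated[node] = {
                label: sum(w * scores[m][label] for m, w in neighbors[node].items()) / total
                for label in labels
            }
        scores = updated

    return {
        node: max(s, key=s.get) if any(v > 0 for v in s.values()) else None
        for node, s in scores.items()
    }


# Four hypothetical templates; only t1 and t4 carry labels.
edges = {("t1", "t2"): 1.0, ("t2", "t3"): 0.2, ("t3", "t4"): 1.0}
seeds = {"t1": "travel", "t4": "finance"}
result = propagate_labels(edges, seeds)
```

Working at the template level keeps the graph small: each node stands for a whole cluster of emails, so one propagated label covers every email instantiated from that template.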