The World Wide Web Conference 2019
DOI: 10.1145/3308558.3313720

RiSER: Learning Better Representations for Richly Structured Emails

Cited by 12 publications (12 citation statements)
References 31 publications
“…Topic classification categorizes Web pages based on their topic or subject (e.g., whether a page is about "news" or a "movie") for topic-specific search engines and Web content management [22]. Approaches to topic classification range from using only the URL [1,11], to using the page content [5,27], to including the structure as well as the content [14].…”
Section: Web Page Classification (mentioning)
confidence: 99%
“…Thus, if we train a model on a subset of pages from a repository R and then use pages from R in our test data, it will be too easy for the model to identify them. To prevent the effects of memorization [14], we ensured that pages from the same host were either all in the training or all in the test set.…”
Section: Balance In Test Vs Training Set (mentioning)
confidence: 99%
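
The host-level separation described in this statement can be illustrated with a minimal sketch. It assumes pages are identified by URL and assigns each host to one side of the split by hashing the hostname; the hashing scheme and the names `host_of` and `split_by_host` are hypothetical illustrations, not the citing authors' procedure.

```python
import hashlib
from urllib.parse import urlsplit


def host_of(url: str) -> str:
    """Return the lowercased hostname of a URL (empty string if absent)."""
    return (urlsplit(url).hostname or "").lower()


def split_by_host(urls, test_fraction=0.2):
    """Split URLs so that all pages from one host land on the same side.

    Assumption: the host is mapped to a pseudo-random value in [0, 1)
    via SHA-256, so the assignment is deterministic across runs.
    """
    train, test = [], []
    for url in urls:
        digest = hashlib.sha256(host_of(url).encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64
        (test if bucket < test_fraction else train).append(url)
    return train, test
```

Because the assignment depends only on the hostname, no host can contribute pages to both sets, although the realized test share may deviate from `test_fraction` when hosts differ greatly in size.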
“…Multi-modal extraction: The incorporation of visual information into IE was proposed by Aumann et al (2006), who attempted to learn a fitness function to calculate the visual similarity of a document to one in its training set to extract elements like headlines and authors. Other recent approaches that attempt to address the layout structure of documents are CharGrid (Katti et al, 2018), which represents a document as a two-dimensional grid of characters, RiSER, an extraction technique targeted at templated emails (Kocayusufoglu et al, 2019), and that by Liu et al (2018), which presents an RNN method for learning DOM-tree rules. However, none of these address the OpenIE setting, which requires understanding the relationship between different text fields on the page.…”
Section: Related Work (mentioning)
confidence: 99%
“…With the increase in machine-generated emails, recent studies have shifted their focus away from dialogs and towards parsing and categorizing (Aberdeen et al, 2010; Zhang et al, 2017) or threading notifications (Ailon et al, 2013), as well as automated template induction (Proskurnia et al, 2017; Castro et al, 2018; Kocayusufoglu et al, 2019).…”
Section: Related Work (mentioning)
confidence: 99%