Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing 2018
DOI: 10.18653/v1/d18-1099

Supervised and Unsupervised Methods for Robust Separation of Section Titles and Prose Text in Web Documents

Abstract: The text in many web documents is organized into a hierarchy of section titles and corresponding prose content, a structure which provides potentially exploitable information on discourse structure and topicality. However, this organization is generally discarded during text collection, and collecting it is not straightforward: the same visual organization can be implemented in a myriad of different ways in the underlying HTML. To remedy this, we present a flexible system for automatically extracting the hiera…

Cited by 9 publications (6 citation statements)
References 8 publications
“…Sentence embedding models are promising techniques that are used to capture sentences semantics and their relations. There are different applications that rely on encoding semantic meaning of privacy policies sentences, such as applications interested in checking Android apps' behaviors against what is stated in their privacy policies. In fact, many privacy related applications such as Liu, Fella, & Liao (2016), Gopinath, Wilson, & Sadeh (2018), Sun (2018), and Harkous et al. (2018) use word or sentence embedding models as part of the automatic analysis of privacy policies. However, it is not entirely clear to what extent sentence embeddings are effective in capturing the semantics of privacy policies sentences. Therefore, to ensure the successfulness of such applications, it is crucial to report the advantages and disadvantages of using sentence embeddings and suggest improvements if needed.…”
Section: Introduction (mentioning)
confidence: 99%
“…Tuarob et al (2015) designed some features and use Random Forest (Breiman, 2001) and Support Vector Machine (Bishop and Nasrabadi, 2006) to predict section headings. Mysore Gopinath et al (2018) propose a system for section titles separation. MTD represents a more recent approach, fusing text, visual, and layout information to detect section headings from scientific papers in the HierDoc dataset.…”
Section: Table of Contents (ToC) Extraction (mentioning)
confidence: 99%
“…fectively also requires several additional capabilities such as reasoning over vagueness and ambiguity, understanding elements such as lists (including when they are intended to be exhaustive and when they are not (Bhatia et al, 2016)), effectively incorporating 'co-text'-aspects of web document structure such as document headers that are meaningful semantically to the content of privacy policies (Mysore Gopinath et al, 2018) and incorporating domain knowledge (for example, understanding whether information is sensitive requires background knowledge in the form of applicable regulation). Privacy policies also differ from several closely related domains, such as legal texts which are largely meant to be processed by domain experts.…”
Section: Task (mentioning)
confidence: 99%
“…There have also been efforts to analyze vague statements in privacy policies (Lebanoff and Liu, 2018), and explore how benchmarks in this domain can be constructed through crowdsourcing (Wilson et al, 2016c; Audich et al, 2018). Lastly, there has been research focused on identifying header information in privacy policies (Mysore Gopinath et al, 2018) and generating them (Gopinath et al, 2020 (Harkous et al, 2018) 120…
Section: Other Applications (mentioning)
confidence: 99%