Supervised and Unsupervised Methods for Robust Separation of Section Titles and Prose Text in Web Documents

Gopinath, Abhijith Athreya Mysore; Wilson, Shomir; Sadeh, Norman

doi:10.18653/v1/d18-1099

Cited by 9 publications

(6 citation statements)

References 8 publications


“…Sentence embedding models are promising techniques that are used to capture sentences semantics and their relations. There are different applications that rely on encoding semantic meaning of privacy policies sentences, such as applications interested in checking Android apps' behaviors against what is stated in their privacy policies. In fact, many privacy related applications such as Liu, Fella, & Liao (2016), Gopinath, Wilson, & Sadeh (2018), Sun (2018), and Harkous et al. (2018) use word or sentence embedding models as part of the automatic analysis of privacy policies. However, it is not entirely clear to what extent sentence embeddings are effective in capturing the semantics of privacy policies sentences. Therefore, to ensure the successfulness of such applications, it is crucial to report the advantages and disadvantages of using sentence embeddings and suggest improvements if needed.…”

Section: Introduction (mentioning)

confidence: 99%

Utilizing Sentence Embedding for Dangerous Permissions Detection in Android Apps' Privacy Policies

Baalous

Poet

2021

International Journal of Information Security and Privacy


Privacy policies analysis relies on understanding sentences meaning in order to identify sentences of interest to privacy related applications. In this paper, the authors investigate the strengths and limitations of sentence embeddings to detect dangerous permissions in Android apps privacy policies. Sent2Vec sentence embedding model was utilized and trained on 130,000 Android apps privacy policies. The terminology extracted by the sentence embedding model was then compared with the gold standard on a dataset of 564 privacy policies. This work seeks to provide answers to researchers and developers interested in extracting privacy related information from privacy policies using sentence embedding models. In addition, it may help regulators interested in deploying sentence embedding models to check for privacy policies' compliance with the government regulations and to identify points of inconsistencies or violations.


“…Tuarob et al (2015) designed some features and use Random Forest (Breiman, 2001) and Support Vector Machine (Bishop and Nasrabadi, 2006) to predict section headings. Mysore Gopinath et al (2018) propose a system for section titles separation. MTD represents a more recent approach, fusing text, visual, and layout information to detect section headings from scientific papers in the HierDoc dataset.…”

Section: Table Of Contents (ToC) Extraction (mentioning)

confidence: 99%

A Scalable Framework for Table of Contents Extraction from Complex ESG Annual Reports

Wang, Gui

2023

Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing


Table of contents (ToC) extraction centres on structuring documents in a hierarchical manner. In this paper, we propose a new dataset, ESGDoc, comprising 1,093 ESG annual reports from 563 companies spanning from 2001 to 2022. These reports pose significant challenges due to their diverse structures and extensive length. To address these challenges, we propose a new framework for ToC extraction, consisting of three steps: (1) Constructing an initial tree of text blocks based on reading order and font sizes; (2) Modelling each tree node (or text block) independently by considering its contextual information captured in node-centric subtree; (3) Modifying the original tree by taking appropriate action on each tree node (Keep, Delete, or Move). This construction-modelling-modification (CMM) process offers several benefits. It eliminates the need for pairwise modelling of section headings as in previous approaches, making document segmentation practically feasible. By incorporating structured information, each section heading can leverage both local and long-distance context relevant to itself. Experimental results show that our approach outperforms the previous state-of-the-art baseline with a fraction of running time. Our framework proves its scalability by effectively handling documents of any length.
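As an illustration of step (1) only, the following is a minimal sketch of how an initial tree might be built from text blocks given in reading order. It is not the authors' implementation: the `Node` class, the `build_initial_tree` function, and the nesting rule (a block becomes a child of the nearest preceding block with a strictly larger font size) are all assumptions made for this example. Steps (2) and (3) are not attempted here.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Node:
    """Hypothetical tree node holding one text block."""
    text: str
    font_size: float
    children: List["Node"] = field(default_factory=list)


def build_initial_tree(blocks: List[Tuple[str, float]]) -> Node:
    """Build a tree from (text, font_size) pairs listed in reading order.

    Assumed rule: each block is attached to the closest earlier block
    whose font size is strictly larger; equal sizes become siblings.
    """
    # A virtual root with infinite font size is never popped.
    root = Node("ROOT", float("inf"))
    stack = [root]
    for text, size in blocks:
        node = Node(text, size)
        # Drop candidates that are not strictly larger than this block.
        while stack[-1].font_size <= size:
            stack.pop()
        stack[-1].children.append(node)
        stack.append(node)
    return root


# Usage with made-up blocks: a title, two headings, and body text.
blocks = [
    ("Annual Report", 20.0),
    ("Introduction", 14.0),
    ("Body text A", 10.0),
    ("Environment", 14.0),
    ("Body text B", 10.0),
]
tree = build_initial_tree(blocks)
```

In this sketch the two 14-point headings end up as siblings under the 20-point title, and each body block hangs under the heading that precedes it. A font-size-only tree of this kind is exactly what a later correction step (the Keep, Delete, or Move actions described in the abstract) would be expected to repair.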


“…fectively also requires several additional capabilities such as reasoning over vagueness and ambiguity, understanding elements such as lists (including when they are intended to be exhaustive and when they are not (Bhatia et al, 2016)), effectively incorporating 'co-text'-aspects of web document structure such as document headers that are meaningful semantically to the content of privacy policies (Mysore Gopinath et al, 2018) and incorporating domain knowledge (for example, understanding whether information is sensitive requires background knowledge in the form of applicable regulation). Privacy policies also differ from several closely related domains, such as legal texts which are largely meant to be processed by domain experts.…”

Section: Task (mentioning)

confidence: 99%

“…There have also been efforts to analyze vague statements in privacy policies (Lebanoff and Liu, 2018), and explore how benchmarks in this domain can be constructed through crowdsourcing (Wilson et al, 2016c; Audich et al, 2018). Lastly, there has been research focused on identifying header information in privacy policies (Mysore Gopinath et al, 2018) and generating them (Gopinath et al, 2020 (Harkous et al, 2018) 120…”

Section: Other Applications (mentioning)

confidence: 99%

Breaking Down Walls of Text: How Can NLP Benefit Consumer Privacy?

Ravichander, Black, Norton et al.

2021

Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

Self Cite


Privacy plays a crucial role in preserving democratic ideals and personal autonomy. The dominant legal approach to privacy in many jurisdictions is the "Notice and Choice" paradigm, where privacy policies are the primary instrument used to convey information to users. However, privacy policies are long and complex documents that are difficult for users to read and comprehend. We discuss how language technologies can play an important role in addressing this information gap, reporting on initial progress towards helping three specific categories of stakeholders take advantage of digital privacy policies: consumers, enterprises, and regulators. Our goal is to provide a roadmap for the development and use of language technologies to empower users to reclaim control over their privacy, limit privacy harms, and rally research efforts from the community towards addressing an issue with large social impact. We highlight many remaining opportunities to develop language technologies that are more precise or nuanced in the way in which they use the text of privacy policies.
