PrivaSeer: A Privacy Policy Search Engine

Srinath, Mukund; Sundareswara, Soundarya Nurani; Giles, C. Lee; Wilson, Shomir

doi:10.1007/978-3-030-74296-6_22

Cited by 4 publications

(7 citation statements)

References 20 publications

“…Nokhbeh Zaeem and Barber [23] created a corpus of over 100,000 privacy policies, categorized into 15 website categories, utilizing the DMOZ directory. PrivaSeer [12] is a privacy policy dataset and search engine containing approximately 1.4 million website privacy policies. It was built using web crawls from 2019 and 2020, utilizing URLs from "Common Crawl" and the "Free Company Dataset".…”

Section: Privacy Policy Datasets

mentioning

confidence: 99%

“…Some scholars have previously evaluated policy comprehensibility, focusing on shorter periods or single time points [4,7,9–12]. In contrast, this study advances the literature on consumer comprehension by conducting a large-scale analysis covering an extended period and a wide range of websites.…”

mentioning

confidence: 99%

“…Wagner [4] examined length (words and sentences), passive voice, various readability formulas (Flesch Reading Ease (FRE), Coleman-Liau score (CL), and Simple Measure Of Gobbledygook (SMOG)). Srinath et al [12] reported on the length of the privacy policy and the use of vague words in their private policy corpus. Compared to Srinath et al [12], Libert et al [13] and Wagner [4], the present work is based on the dataset of Amos et al [7], which extends substantially over several years.…”

mentioning

confidence: 99%

“…Srinath et al [12] reported on the length of the privacy policy and the use of vague words in their private policy corpus. Compared to Srinath et al [12], Libert et al [13] and Wagner [4], the present work is based on the dataset of Amos et al [7], which extends substantially over several years. Furthermore, herein the length and indeterminacy are analyzed in function of the GDPR, website category, popularity level, and domain.…”

mentioning

confidence: 99%

Understanding Website Privacy Policies—A Longitudinal Analysis Using Natural Language Processing

Belcheva, Ermakova, Fabian. 2023. Information.

Privacy policies are the main method for informing Internet users of how their data are collected and shared. This study aims to analyze the deficiencies of privacy policies in terms of readability, vague statements, and the use of pacifying phrases concerning privacy. This represents the undertaking of a step forward in the literature on this topic through a comprehensive analysis encompassing both time and website coverage. It characterizes trends across website categories, top-level domains, and popularity ranks. Furthermore, studying the development in the context of the General Data Protection Regulation (GDPR) offers insights into the impact of regulations on policy comprehensibility. The findings reveal a concerning trend: privacy policies have grown longer and more ambiguous, making it challenging for users to comprehend them. Notably, there is an increased proportion of vague statements, while clear statements have seen a decrease. Despite this, the study highlights a steady rise in the inclusion of reassuring statements aimed at alleviating readers’ privacy concerns.

“…In [30], the authors developed the PrivaSeer search engine for searching and analyzing privacy policies according to the specified criteria such as readability level, completeness, and accuracy of formulations. This search engine has indexed more than 1.4 million privacy policies.…”

Section: Related Work and Their Comparative Analysis

mentioning

confidence: 99%

Privacy Policies of IoT Devices: Collection and Analysis

Kuznetsov, Novikova, Kotenko, et al. 2022. Sensors.

Currently, personal data collection and processing are widely used while providing digital services within mobile sensing networks for their operation, personalization, and improvement. Personal data are any data that identifiably describe a person. Legislative and regulatory documents adopted in recent years define the key requirements for the processing of personal data. They are based on the principles of lawfulness, fairness, and transparency of personal data processing. Privacy policies are the only legitimate way to provide information on how the personal data of service and device users is collected, processed, and stored. Therefore, the problem of making privacy policies clear and transparent is extremely important as its solution would allow end users to comprehend the risks associated with personal data processing. Currently, a number of approaches for analyzing privacy policies written in natural language have been proposed. Most of them require a large training dataset of privacy policies. In the paper, we examine the existing corpora of privacy policies available for training, discuss their features and conclude on the need for a new dataset of privacy policies for devices and services of the Internet of Things as a part of mobile sensing networks. The authors develop a new technique for collecting and cleaning such privacy policies. The proposed technique differs from existing ones by the usage of e-commerce platforms as a starting point for document search and enables more targeted collection of the URLs to the IoT device manufacturers’ privacy policies. The software tool implementing this technique was used to collect a new corpus of documents in English containing 592 unique privacy policies. The collected corpus contains mainly privacy policies that are developed for the Internet of Things and reflect the latest legislative requirements. The paper also presents the results of the statistical and semantic analysis of the collected privacy policies. 
These results could be further used by the researchers when elaborating techniques for analysis of the privacy policies written in natural language targeted to enhance their transparency for the end user.

Privacy at Scale: Introducing the PrivaSeer Corpus of Web Privacy Policies

Srinath, Wilson, Giles. 2021. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Confer

Self Cite

Organisations disclose their privacy practices by posting privacy policies on their websites. Even though internet users often care about their digital privacy, they usually do not read privacy policies, since understanding them requires a significant investment of time and effort. Natural language processing has been used to create experimental tools to interpret privacy policies, but there has been a lack of large privacy policy corpora to facilitate the creation of large-scale semi-supervised and unsupervised models to interpret and simplify privacy policies. Thus, we present the PrivaSeer Corpus of 1,005,380 English language website privacy policies collected from the web. The number of unique websites represented in PrivaSeer is about ten times larger than the next largest public collection of web privacy policies, and it surpasses the aggregate of unique websites represented in all other publicly available privacy policy corpora combined. We describe a corpus creation pipeline with stages that include a web crawler, language detection, document classification, duplicate and near-duplicate removal, and content extraction. We employ an unsupervised topic modelling approach to investigate the contents of policy documents in the corpus and discuss the distribution of topics in privacy policies at web scale. We further investigate the relationship between privacy policy domain PageRanks and text features of the privacy policies. Finally, we use the corpus to pretrain PrivBERT, a transformer-based privacy policy language model, and obtain state of the art results on the data practice classification and question answering tasks.
