In this work we find that many current redactions of PDF text are insecure due to non-redacted character positioning information. In particular, subpixel-sized horizontal shifts in redacted and non-redacted characters can be recovered and used to effectively deredact first and last names. Unfortunately, these findings affect even redactions where the text underneath the black box is removed from the PDF. We demonstrate these findings by performing a comprehensive vulnerability assessment of common PDF redaction types. We examine 11 popular PDF redaction tools, including Adobe Acrobat, and find that they leak information about redacted text. We also effectively deredact hundreds of real-world PDF redactions, including those found in OIG investigation reports and FOIA responses. To correct the problem, we have released open source algorithms to fix vulnerable redactions and reduce the amount of information leaked by nonexcising redactions (where the text underneath the redaction is copy-pastable). We have also notified the developers of the studied redaction tools, as well as the Office of Inspector General, the Free Law Project, PACER, Adobe, Microsoft, and the US Department of Justice. We are working with several of these groups to prevent our discoveries from being used for malicious purposes.
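To illustrate why leftover positioning information matters, the following is a minimal sketch, not the authors' tool. It assumes that an attacker can measure the horizontal space a redacted word occupied and knows the font's per-glyph advance widths. The width table, the function names, and the candidate list are all hypothetical and chosen only for illustration; real fonts define their own widths, and real documents add kerning and other shifts that this sketch ignores.

```python
from typing import Dict, Iterable, List

# Hypothetical glyph advance widths in 1/1000 em units (illustrative only;
# a real attack would read these from the embedded or referenced font).
GLYPH_WIDTHS: Dict[str, int] = {
    "A": 722, "B": 667, "E": 611, "J": 389,
    "b": 500, "e": 444, "i": 278, "m": 778,
    "n": 500, "o": 500, "v": 500,
}


def text_width(text: str, widths: Dict[str, int], default: int = 500) -> int:
    """Sum the advance widths of each character; unknown glyphs use `default`."""
    return sum(widths.get(ch, default) for ch in text)


def candidates_matching_width(
    names: Iterable[str],
    observed_width: int,
    widths: Dict[str, int],
    tolerance: int = 0,
) -> List[str]:
    """Keep only the names whose rendered width is within `tolerance`
    of the width measured for the redacted span."""
    return [
        name
        for name in names
        if abs(text_width(name, widths) - observed_width) <= tolerance
    ]


if __name__ == "__main__":
    dictionary = ["Ann", "Bob", "Eve", "Jim"]
    # Suppose the gap left by the redaction measures 1667 units.
    print(candidates_matching_width(dictionary, 1667, GLYPH_WIDTHS))
```

The more precisely the width can be measured (a smaller `tolerance`), the fewer dictionary entries survive the filter, which is the intuition behind subpixel shifts being enough to narrow down a redacted name.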
In the past, redaction involved the use of black or white markers or paper cut-outs to obscure content on physical paper. Today many redactions take place on digital PDF documents, and redaction is often performed by software tools. Typical redaction tools remove text from PDF documents and draw a black or white rectangle in its place, mimicking a physical redaction. This practice is thought to be secure when the redacted text is removed and cannot be "copy-pasted" from the PDF document. We find this common conception is false: existing PDF redactions can be broken by precise measurements of non-redacted character positioning information. We develop a deredaction tool for automatically finding and breaking these vulnerable redactions. We report on 11 different redaction tools, finding the majority do not remove redaction-breaking information, including some Adobe Acrobat workflows. We empirically measure the information leaks, finding some redactions leak upwards of 15 bits of information, creating a 32,768-fold reduction in the space of potential redacted texts. We demonstrate a lower bound on the impact of these leaks via a 22,120-document study, including 18,975 Office of the Inspector General (OIG) investigation reports, where we find 769 vulnerable named-entity redactions. We find leaked information reduces the contents for 164 of these redacted names to fewer than 494 possibilities from a 7-million-name dictionary. We show the impact of these findings by breaking redactions from the Epstein/Maxwell case, the Manafort case, and a released Snowden document. Moreover, we develop an efficient algorithm for locating copy-pastable redactions and find over 100,000 poorly redacted words in US court documents. Current PDF text redaction methods are insufficient for named-entity protection.
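The relationship between leaked bits and the shrinking candidate space can be checked with a few lines of arithmetic. The sketch below uses hypothetical helper names and assumes that each leaked bit halves a uniformly weighted candidate space. It takes the dictionary size as exactly 7,000,000, which is an approximation of the "7 million" figure above.

```python
import math


def reduction_factor(bits_leaked: float) -> float:
    """Factor by which the candidate space shrinks when `bits_leaked`
    bits are revealed, assuming each bit halves the space."""
    return 2.0 ** bits_leaked


def bits_from_reduction(total_candidates: int, remaining_candidates: int) -> float:
    """Bits of information implied by narrowing `total_candidates`
    down to `remaining_candidates`."""
    return math.log2(total_candidates / remaining_candidates)


if __name__ == "__main__":
    # 15 leaked bits correspond to the 32,768-fold reduction quoted above.
    print(reduction_factor(15))
    # Narrowing about 7,000,000 names to 494 implies close to 14 bits.
    print(round(bits_from_reduction(7_000_000, 494), 2))
```

Under these assumptions, narrowing a redacted name to fewer than 494 candidates out of roughly 7 million corresponds to a leak of a little under 14 bits, which is consistent with the upper range of about 15 bits reported for some redactions.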