Machine Learning and Knowledge Discovery in Databases

2020

DOI: 10.1007/978-3-030-46150-8_37

String Sanitization: A Combinatorial Approach

Giulia Bernardini et al.

Abstract: String data are often disseminated to support applications such as location-based service provision or DNA sequence analysis. This dissemination, however, may expose sensitive patterns that model confidential knowledge (e.g., trips to mental health clinics from a string representing a user's location history). In this paper, we consider the problem of sanitizing a string by concealing the occurrences of sensitive patterns, while maintaining data utility. First, we propose a time-optimal algorithm, TFS-ALGO, to…

Citation Types: Supporting 0, Mentioning 35, Contrasting 0

Cited by 6 publications

(35 citation statements)

References 24 publications

“…In this paper, we study the fundamental relation between data sanitization [1], [4], [27] (also known as knowledge hiding) and frequent pattern mining [19], [22], [25]. The objective of frequent pattern mining in strings is to obtain all patterns occurring frequently enough in a string, or in a collection of strings.…”

Section: Introduction (mentioning)

confidence: 99%

“…There may also be constraints for the mined strings (e.g., to be of fixed length k [3], [9]). In string sanitization, the privacy objective is to transform a string to ensure that a given set of sensitive patterns, modeling confidential knowledge, does not occur in the sanitized version of the string; sensitive patterns are selected based on domain expertise [4], [15], [27]. This transformation may incur some utility loss that should be minimized.…”

Section: Introduction (mentioning)

confidence: 99%

“…This transformation may incur some utility loss that should be minimized. Recent methods achieve this using combinatorial algorithms [4], [5]. Let W be the input string over Σ, k > 0 be an integer, and S be the set of sensitive length-k substrings.…”

Section: Introduction (mentioning)

confidence: 99%

“…Let W be the input string over Σ, k > 0 be an integer, and S be the set of sensitive length-k substrings. These methods construct a string X such that: (I) X contains no element of S as a substring; (II) the total order and thus the frequency of all non-sensitive length-k substrings of W is preserved in X; and (III) the length of X is minimized [4], or the edit distance between W and X is minimized [5]. These methods work by copying carefully selected substrings of W into X and separating them by a special letter # ∉ Σ.…”

Section: Introduction (mentioning)

confidence: 99%
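
Conditions (I) and (II) in the statement above can be checked mechanically. The following is a minimal sketch, not the cited authors' algorithm: the function names and the toy input (W = "ABCDE", k = 2, S = {"CD"}) are hypothetical and do not come from the paper. Condition (III), the minimization objective, is not checked here.

```python
def kmers_without_separator(text, k, sep="#"):
    """Return, in order, every length-k substring of text that does not contain sep."""
    return [text[i:i + k] for i in range(len(text) - k + 1) if sep not in text[i:i + k]]


def is_valid_sanitization(w, x, k, sensitive, sep="#"):
    """Check conditions (I) and (II) for a candidate sanitized string x.

    Assumes sep does not occur in w or in any sensitive pattern.
    """
    x_kmers = kmers_without_separator(x, k, sep)
    # (I) no sensitive length-k pattern occurs in x
    if any(p in sensitive for p in x_kmers):
        return False
    # (II) the non-sensitive length-k substrings of w occur in x in the same
    # order and with the same multiplicity
    expected = [p for p in kmers_without_separator(w, k, sep) if p not in sensitive]
    return x_kmers == expected


# Hypothetical toy input, not taken from the paper.
toy_w, toy_k, toy_s = "ABCDE", 2, {"CD"}
print(is_valid_sanitization(toy_w, "ABC#DE", toy_k, toy_s))  # candidate with one separator
print(is_valid_sanitization(toy_w, "ABCDE", toy_k, toy_s))   # unsanitized input
```

Comparing ordered lists rather than sets is what captures "total order and thus the frequency": a candidate that drops, repeats, or reorders a non-sensitive pattern is rejected.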

“…Further, let X_TR = GAC#ACC#CCC#CAT, X_MIN = GACCC#CAT and X_ED = GAC#AA#ACCC#CAT be three sanitized strings. All three strings contain no sensitive pattern and preserve the total order and thus the frequency of all nonsensitive length-3 patterns of W: X_TR is the trivial solution of interleaving the non-sensitive length-3 patterns of W with #; X_MIN is the shortest possible such string [4]; and X_ED is a string closest to W in terms of edit distance [5].…”

Section: Introduction (mentioning)

confidence: 99%
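
The claim that all three strings preserve the same ordered sequence of length-3 patterns can be verified directly from the strings quoted above, without knowing W (which the excerpt elides). This is a small sketch with a hypothetical helper name; the three strings are the only values taken from the source.

```python
def kmers_without_separator(text, k, sep="#"):
    """Return, in order, every length-k substring of text that does not contain sep."""
    return [text[i:i + k] for i in range(len(text) - k + 1) if sep not in text[i:i + k]]


# The three sanitized strings quoted in the citation statement.
x_tr = "GAC#ACC#CCC#CAT"
x_min = "GACCC#CAT"
x_ed = "GAC#AA#ACCC#CAT"

patterns_tr = kmers_without_separator(x_tr, 3)
patterns_min = kmers_without_separator(x_min, 3)
patterns_ed = kmers_without_separator(x_ed, 3)

# All three strings carry the same ordered list of length-3 patterns.
print(patterns_tr == patterns_min == patterns_ed)

# The trivial solution is that list interleaved with the separator.
print("#".join(patterns_tr) == x_tr)

# The three strings differ only in length (and in distance to the elided W).
print(len(x_tr), len(x_min), len(x_ed))
```

In X_ED the fragment AA between two separators is shorter than 3 letters, so it contributes no length-3 pattern.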

Hide and Mine in Strings: Hardness and Algorithms

et al. 2020

2020 IEEE International Conference on Data Mining (ICDM)

Self Cite

We initiate a study on the fundamental relation between data sanitization (i.e., the process of hiding confidential information in a given dataset) and frequent pattern mining, in the context of sequential (string) data. Current methods for string sanitization hide confidential patterns introducing, however, a number of spurious patterns that may harm the utility of frequent pattern mining. The main computational problem is to minimize this harm. Our contribution here is twofold. First, we present several hardness results, for different variants of this problem, essentially showing that these variants cannot be solved or even be approximated in polynomial time. Second, we propose integer linear programming formulations for these variants and algorithms to solve them, which work in polynomial time under certain realistic assumptions on the problem parameters.

String Editing Under Pattern Constraints

2022

Communications in Computer and Information Science

No abstract

Hide and Mine in Strings: Hardness, Algorithms, and Experiments

et al. 2022

IEEE Trans. Knowl. Data Eng.

Self Cite
