Abstraction of query auto completion logs for anonymity-preserving analysis

Krishnan, Unni; Billerbeck, Bodo; Moffat, Alistair; Zobel, Justin

doi:10.1007/s10791-019-09359-8

Cited by 4 publications

(7 citation statements)

References 46 publications

“…We have explored a method for generating a synthetic QAC log from an abstract QAC log, by mapping the word lengths of the abstract QAC log to those of a publicly available string collection, and applying a range of corrective techniques. Synthetic QAC formation can also be posed as a language generation problem relying on various models of QAC systems [31,33]. We have demonstrated that the synthetic log generated from a pre-existing string collection encompasses many of the properties found in the original QAC log from which the abstract QAC log was derived.…”

Section: Discussion (mentioning)

confidence: 97%

“…Query auto completion (QAC) systems offer a list of completions while users enter queries in a search interface. Users can either submit one of the completions as their query, or advance their partial query by selecting a completion and then continuing to type [33]. A detailed QAC log capturing the sequence of partial queries, along with the completions presented and the user interactions with them, is required in order to evaluate a QAC system [37,38].…”

Section: Introduction (mentioning)

confidence: 99%

“…Here we explore a framework for generating synthetic QAC logs, extending the work of Krishnan et al. [33], who suggest converting a QAC log to an abstracted format (an abstract QAC log) that records only the length of each partial query and the lengths of words used, minimizing privacy concerns but removing the possibility of performing evaluations on actual strings. Synthetic QAC log generation seeks to produce a list of plausible synthetic partial query sequences by mapping the word lengths from the abstract QAC log to strings from a publicly available dataset.…”

Section: Introduction (mentioning)

confidence: 99%
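
The abstraction described in the statement above (keeping only the length of each partial query and the lengths of its words) can be illustrated with a minimal Python sketch. The record layout and the function names `abstract_partial_query` and `abstract_log` are assumptions made for illustration; the statement does not specify the actual format used by Krishnan et al.

```python
def abstract_partial_query(partial_query: str) -> dict:
    """Reduce one partial query to lengths only.

    Hypothetical record layout: the cited work's real format may differ.
    """
    # Split on single spaces so a trailing space shows up as a
    # zero-length final word (the user has started a new word).
    words = partial_query.split(" ")
    return {
        "length": len(partial_query),
        "word_lengths": [len(word) for word in words],
    }


def abstract_log(partial_queries: list) -> list:
    """Abstract a sequence of partial queries typed in one interaction."""
    return [abstract_partial_query(query) for query in partial_queries]


if __name__ == "__main__":
    # Illustrative input only, not taken from any real log.
    for record in abstract_log(["n", "ne", "new", "new ", "new yo"]):
        print(record)
```

No characters of the original strings survive in the output, which is why evaluations on actual strings are no longer possible once a log has been abstracted this way.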

“…Error-tolerant QAC approaches [17,28,36,47] allow up to a fixed number of character mismatches to account for possible typing errors. User interactions are a key factor in implementation and evaluation of QAC systems [26,33,38] and have been captured using a wide range of models [31–33,37,38,43,44]. In particular, users are not limited to entering single characters, and can alter the partial query by selecting a completion or deleting characters already entered.…”

Section: Introduction (mentioning)

confidence: 99%

“…QAC system evaluations have typically been performed over large publicly available string collections [10,14,23], with strings taken sequentially from left-to-right to generate partial query sequences [10]. However this approach does not account for the full range of possible interactions [32,33]. In this work, we explore an approach to generation of synthetic partial query sequences that addresses this gap.…”

Section: Introduction (mentioning)

confidence: 99%
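
The baseline mentioned in the statement above, where strings are taken left to right to form partial query sequences, can be sketched as follows. This is a minimal illustration under the assumption that one character is added per step; the function name `left_to_right_prefixes` is hypothetical.

```python
def left_to_right_prefixes(query: str) -> list:
    """Return every non-empty prefix of `query`, shortest first.

    Models a user who types one character at a time and never deletes
    a character or selects a completion.
    """
    return [query[:end] for end in range(1, len(query) + 1)]


if __name__ == "__main__":
    # Illustrative string only.
    print(left_to_right_prefixes("qac log"))
```

Because each step adds exactly one character, the sequence cannot represent a jump caused by selecting a completion or a shrinking partial query caused by deletion, which is the gap the statement refers to.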

Generation of Synthetic Query Auto Completion Logs

Krishnan, Moffat, Zobel, et al. 2020

Lecture Notes in Computer Science

Self Cite

Privacy concerns can prohibit research access to large-scale commercial query logs. Here we focus on generation of a synthetic log from a publicly available dataset, suitable for evaluation of query auto completion (QAC) systems. The synthetic log contains plausible string sequences reflecting how users enter their queries in a QAC interface. Properties that would influence experimental outcomes are compared between a synthetic log and a real QAC log through a set of side-by-side experiments, and confirm the applicability of the generated log for benchmarking the performance of QAC methods.

Landscape of Automated Log Analysis: A Systematic Literature Review and Mapping Study

Korzeniowski, Goczyła 2022

IEEE Access

Logging is a common practice in software engineering to provide insights into working systems. The main uses of log files have always been failure identification and root cause analysis. In recent years, novel applications of logging have emerged that benefit from automated analysis of log files, for example, real-time monitoring of system health, understanding users' behavior, and extracting domain knowledge. Although nearly every software system produces log files, the biggest challenge in log analysis is the lack of a common standard for both the content and format of log data. This paper provides a systematic review of recent literature (covering the period between 2000 and June 2021, concentrating primarily on the last five years of this period) related to automated log analysis. Our contribution is threefold: we present an overview of various research areas in the field; we identify different types of log files that are used in research, and we systematize the content of log files. We believe that this paper serves as a valuable starting point for new researchers in the field, as well as an interesting overview for those looking for other ways of utilizing log information.

CC-News-En

Mackenzie, Benham, Petri, et al. 2020

Proceedings of the 29th ACM International Conference on Information & Knowledge Management

Self Cite

We describe a static, open-access news corpus using data from the Common Crawl Foundation, who provide free, publicly available web archives, including a continuous crawl of international news articles published in multiple languages. Our derived corpus, CC-News-En, contains 44 million English documents collected between September 2016 and March 2018. The collection is comparable in size with the number of documents typically found in a single shard of a large-scale, distributed search engine, and is four times larger than the news collections previously used in offline information retrieval experiments. To complement the corpus, 173 topics were curated using titles from Reddit threads, forming a temporally representative sampling of relevant news topics over the 583 day collection window. Information needs were then generated using automatic summarization tools to produce textual and audio representations, and used to elicit query variations from crowdworkers, with a total of 10,437 queries collected against the 173 topics. Of these, 10,089 include key-stroke level instrumentation that captures the timings of character insertions and deletions made by the workers while typing their queries. These new resources support a wide variety of experiments, including large-scale efficiency exercises and query auto-completion synthesis, with scope for future addition of relevance judgments to support offline effectiveness experiments and hence batch evaluation campaigns.
