Querylog-based Assessment of Retrievability Bias in a Large Newspaper Corpus

Traub, Myriam C.; Samar, Thaer; Ossenbruggen, Jacco van; He, Jiyin; Vries, Arjen P. de; Hardman, Lynda

doi:10.1145/2910896.2910907

Cited by 20 publications

(28 citation statements)

References 14 publications


“…BM25 induces the smallest inequality for both query sets and can therefore be considered to be the fairest model. This is in line with the findings of [7,43]. For each setup a number of documents in the collection are never retrieved by any retrieval model (r (d) = 0).…”

Section: Retrievability Bias (supporting)

confidence: 88%

Quantifying retrieval bias in Web archive search

et al. 2017

Self Cite


A Web archive usually contains multiple versions of documents crawled from the Web at different points in time. One possible way for users to access a Web archive is through full-text search systems. However, previous studies have shown that these systems can induce a bias, known as the retrievability bias, on the accessibility of documents in community-collected collections (such as TREC collections). This bias can be measured by analyzing the distribution of the retrievability scores for each document in a collection, quantifying the likelihood of a document's retrieval. We investigate the suitability of retrievability scores in retrieval systems that consider every version of a document in a Web archive as an independent document. We show that the retrievability of documents can vary for different versions of the same document and that retrieval systems induce biases to different extents. We quantify this bias for a retrieval system which is adapted to handle multiple versions of the same document. The retrieval system indexes each version of a document independently, and we refine the search results using two techniques to aggregate similar versions. The first approach is to collapse similar versions of a document based on content similarity. The second approach is to collapse all versions of the same document based on their URLs. In both cases, we found that the degree of bias is related to the aggregation level of versions of the same document. Finally, we study the effect of bias across time using the retrievability measure. Specifically, we investigate whether the number of documents crawled in a particular year correlates with the number of documents in the search results from that year. Assuming queries are not inherently temporal in nature, the analysis is based on the timestamps of documents in the search results returned using the retrieval model for all queries.
The results show a relation between the number of documents per year and the number of documents retrieved by the retrieval system from that year. We further investigated the relation between the queries' timestamps and the documents' timestamps. First, we split the queries into different time frames using a 1-year granularity. Then, we issued the queries against the retrieval system. The results show that temporal queries indeed retrieve more documents from the assumed time frame. Thus, the documents from the same time frame were preferred by the retrieval system over documents from other time frames.
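As a rough illustration of how a distribution of retrievability scores might be analyzed, the following Python sketch counts how often each document appears in the top results across a query set and then summarizes the inequality of those counts. It is a minimal sketch under stated assumptions, not the cited authors' implementation: the cumulative scoring with a rank cutoff, the use of the Gini coefficient as the inequality summary, and all names (`retrievability`, `gini`, `rankings`, `cutoff`) are assumptions that do not appear in the abstract above.

```python
from collections import Counter


def retrievability(rankings, cutoff):
    """Count, per document, how many queries return it within the top `cutoff` ranks.

    `rankings` maps a query string to its ranked list of document ids
    (hypothetical input format, assumed for this sketch).
    """
    scores = Counter()
    for ranked_docs in rankings.values():
        for doc_id in ranked_docs[:cutoff]:
            scores[doc_id] += 1
    return scores


def gini(values):
    """Gini coefficient of non-negative values: 0 means perfectly equal."""
    xs = sorted(values)
    n = len(xs)
    total = sum(xs)
    if n == 0 or total == 0:
        return 0.0
    # Standard closed form over values sorted in ascending order (1-based ranks).
    weighted = sum((i + 1) * x for i, x in enumerate(xs))
    return (2 * weighted) / (n * total) - (n + 1) / n


# Toy data, invented for illustration only.
collection = ["d1", "d2", "d3", "d4"]
rankings = {
    "q1": ["d1", "d2", "d3"],
    "q2": ["d1", "d3"],
    "q3": ["d1", "d2"],
}
scores = retrievability(rankings, cutoff=2)
# Include documents that were never retrieved, i.e. r(d) = 0.
r_values = [scores[d] for d in collection]
bias = gini(r_values)
```

In this toy setup `d4` is never retrieved, so its score is zero and it still contributes to the inequality estimate. A lower coefficient would indicate that a retrieval model spreads access more evenly over the collection.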


“…Information retrieval researchers have coined the term retrievability, which refers to how accessible a document is in a system [4]. When a system systematically favors documents with particular characteristics, such that their retrievability is higher than that of others, the system exhibits a retrievability bias [44]. In a search for "person" images, it is clear that photos of men have significantly greater retrievability as compared to photos of women.…”

Section: Analysis: Who Represents a "Person"? (mentioning)

confidence: 99%

Competent Men and Warm Women

Otterbacher

Bates

Clough

2017

Proceedings of the 2017 CHI Conference on Human Factors in Computing Systems


There is much concern about algorithms that underlie information services and the view of the world they present. We develop a novel method for examining the content and strength of gender stereotypes in image search, inspired by the trait adjective checklist method. We compare the gender distribution in photos retrieved by Bing for the query "person" and for queries based on 68 character traits (e.g., "intelligent person") in four regional markets. Photos of men are more often retrieved for "person," as compared to women. As predicted, photos of women are more often retrieved for warm traits (e.g., "emotional") whereas agentic traits (e.g., "rational") are represented by photos of men. A backlash effect, where stereotype-incongruent individuals are penalized, is observed. However, backlash is more prevalent for "competent women" than "warm men." Results underline the need to understand how and why biases enter search algorithms and at which stages of the engineering process.


“…In [7], different methods of retrieval have been evaluated for their performance under standalone and combined way. Toward corpus retrieval of newspapers, an log based approach is presented in [8]. The retrieve ability measure is computed and evaluated towards performance.…”

Section: Literature Review (mentioning)

confidence: 99%

Hierarchical Semantic Relational Coverage Measure Based Web Document Clustering Using Semantic Ontology

Selvalakshmi

Subramaniam

2019

IJITEE


The problem of web document clustering has been well studied. Web documents have been grouped based on various features, such as textual, topical, and semantic features. A number of approaches have been discussed earlier for the clustering of web documents. However, these methods do not produce promising results for web document clustering. To overcome this, an efficient hierarchical semantic relational coverage based approach is presented in this paper. The method extracts the features of a web document by preprocessing the document. The extracted features are used to compute the semantic relational coverage measure at different levels. As the documents are grouped in a hierarchical manner, the method estimates the relational coverage measure at each level of the cluster. Based on the semantic relational measure at different levels, the method estimates the topical semantic support measure. Using these two, the method computes the class weight. The estimated class weight is used to perform document clustering. The proposed method improves the performance of document clustering and reduces the false classification ratio.

