Index maintenance for time-travel text search

Anand, Avishek; Bedathur, Srikanta; Berberich, Klaus; Schenkel, Ralf

doi:10.1145/2348283.2348318

Cited by 24 publications

(13 citation statements)

References 26 publications

“…Imagine the cost of rebuilding the indexes of tens of large-scale web collections each time a new crawl ends. Alternatives for rebuilding the indexes when using the term-based partition exist, but are also more complex and less efficient than using the document-based partition [3]. The decision of partitioning the index first or after time, presents itself as a trade-off between speed and space.…”

Section: Discussion (mentioning)

confidence: 99%

“…The fast development of Information and Communication Technology had a great impact on this growth. In the last decade, the world population with access to the Internet grew more than 1 000% in some regions. Computer-based devices and mobile phones with Internet connectivity are now about 5 billion, much of which are equipped with technology that empowers people to easily create data.…”

Section: Introduction (mentioning)

confidence: 99%

A survey of web archive search architectures

Costa

Gomes

Couto

et al. 2013

Proceedings of the 22nd International Conference on World Wide Web

Web archives already hold more than 282 billion documents and users demand full-text search to explore this historical information. This survey provides an overview of web archive search architectures designed for time-travel search, i.e. full-text search on the web within a user-specified time interval. Performance, scalability and ease of management are important aspects to take in consideration when choosing a system architecture. We compare these aspects and initialize the discussion of which search architecture is more suitable for a large-scale web archive.

“…For a compact index structure, two mappings assign ids to tags and websites, which are used exclusively in all other indexes and mappings (i.e., IdTagMapping, IdUrlMapping). Even though there is much room for improvement and optimization of temporal indexes [8,9], the rather simple mappings, which can easily be constructed using a distributed data processing platform like Hadoop, already go a long way.…”

Section: Index Structures (mentioning)

confidence: 99%

Tempas

Holzmann

Anand

2016

Proceedings of the 25th International Conference Companion on World Wide Web - WWW '16 Companion

Self Cite

Limited search and access patterns over Web archives have been well documented. One of the key reasons is the lack of understanding of the user access patterns over such collections, which in turn is attributed to the lack of effective search interfaces. Current search interfaces for Web archives are (a) either purely navigational or (b) have sub-optimal search experience due to ineffective retrieval models or query modeling. We identify that external longitudinal resources, such as social bookmarking data, are crucial sources to identify important and popular websites in the past. To this extent we present Tempas, a tag-based temporal search engine for Web archives. Websites are posted at specific times of interest on several external platforms, such as bookmarking sites like Delicious. Attached tags not only act as relevant descriptors useful for retrieval, but also encode the time of relevance. With Tempas we tackle the challenge of temporally searching a Web archive by indexing tags and time. We allow temporal selections for search terms, rank documents based on their popularity and also provide meaningful query recommendations by exploiting tag-tag and tag-document co-occurrence statistics in arbitrary time windows. Finally, Tempas operates as a fairly non-invasive indexing framework. By not dealing with contents from the actual Web archive it constitutes an attractive and low-overhead approach for quick access into Web archives.

“…The focus in all of these, however, is on retrieving individual documents as opposed to analyzing a large collection of them. Indexing versioned document collections, such as web archives, has also received ample attention [7,12,22] - no existing work has looked into indexing temporal expressions contained in documents.…”

Section: Related Work (mentioning)

confidence: 99%

Important Events in the Past, Present, and Future

Abujabal

Berberich

2015

Proceedings of the 24th International Conference on World Wide Web

Self Cite

We address the problem of identifying important events in the past, present, and future from semantically-annotated large-scale document collections. Semantic annotations that we consider are named entities (e.g., persons, locations, organizations) and temporal expressions (e.g., during the 1990s). More specifically, for a given time period of interest, our objective is to identify, rank, and describe important events that happened. Our approach P2F Miner makes use of frequent itemset mining to identify events and group sentences related to them. It uses an information-theoretic measure to rank identified events. For each of them, it selects a representative sentence as a description. Experiments on ClueWeb09 using events listed in Wikipedia year articles as ground truth show that our approach is effective and outperforms a baseline based on statistical language models.
