Proceedings of the 23rd International Conference on World Wide Web 2014
DOI: 10.1145/2567948.2579045

Infrastructure for supporting exploration and discovery in web archives

Abstract: Web archiving initiatives around the world capture ephemeral web content to preserve our collective digital memory. However, unlocking the potential of web archives requires tools that support exploration and discovery of captured content. These tools need to be scalable and responsive, and to this end we believe that modern "big data" infrastructure can provide a solid foundation. We present Warcbase, an open-source platform for managing web archives built on the distributed datastore HBase. Our system provid…

Cited by 19 publications (16 citation statements)
References 18 publications
“…This is because the data ingestion, validation, periodic fixity checking, and refreshing are indeed on-demand data processing performed at compute nodes away from where the data is stored. 2) The Network Pattern, as exemplified by warcbase [11] - an open-source platform for managing web archives built on Hadoop and HBase - features a much tighter integration between data storage and processing. Typically involving a Hadoop cluster, this pattern uses a large number of interconnected nodes, each serving both as a storage and processing unit.…”
Section: Library Big Data Service Patterns
citation type: mentioning
confidence: 99%
“…In this approach, the records of interest are selected by means of the CDX records and inserted into the database to be queried through a web service. Warcbase by Lin et al. [37] follows a similar approach based on HBase, a Hadoop-based distributed database system, which is an open-source implementation of Google's Bigtable [38]. While being very efficient for lookups, major drawbacks of these database solutions are the limited flexibility as well as the additional effort to insert the records, which is expensive in terms of time and resources.…”
Section: Web Archive Data Processing
citation type: mentioning
confidence: 99%
“…Scientifically published articles on data extraction from Web archives, like ArchiveSpark, have been very limited. To the best of our knowledge, the only comparable system is Warcbase by Lin et al. [5], which will be discussed at the end of this section and serves as the baseline in our benchmarking process (see Sec. 6). There are also a number of other approaches in the area of accessing and mining Web archives including tools from industry.…”
Section: Related Work
citation type: mentioning
confidence: 99%