Proceedings of the 16th ACM/IEEE-CS on Joint Conference on Digital Libraries 2016
DOI: 10.1145/2910896.2910902

ArchiveSpark

Abstract: Web archives are a valuable resource for researchers of various disciplines. However, to use them as a scholarly source, researchers require a tool that provides efficient access to Web archive data for extraction and derivation of smaller datasets. Besides efficient access, we identify five other objectives based on practical researcher needs, such as ease of use, extensibility and reusability. Towards these objectives we propose ArchiveSpark, a framework for efficient, distributed Web archive processing that bu…

Cited by 25 publications (5 citation statements)
References 8 publications
“…The weakness of this approach, however, is its dependency on an inefficient storage format. Because ArchiveSpark [27] is the only published example we are aware of that leverages a derivative format in batch web archive analysis, we use it as a surrogate to evaluate the performance impact of WARC derivatives. The results show the performance is only comparable to more efficient formats for transactional workloads.…”
Section: Conclusion and Discussion (mentioning)
confidence: 99%
“…Querying an ArchivesUnleashedToolkit application therefore requires repeated loading of the WARC files in full. ArchiveSpark [27], on the other hand, leverages CDX to selectively load WARC. Without sophisticated I/O scheduling, however, a full disk scan can still outperform many selective disk reads bundled together.…”
Section: Related Work (mentioning)
confidence: 99%
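
The CDX-based selective loading mentioned in the statement above can be illustrated with a minimal Python sketch. It is not ArchiveSpark's API. It assumes the common 11-field CDX layout (`N b a m s k r M S V g`, where the ninth and tenth fields are the compressed record size and byte offset) and a `.warc.gz` file in which every record is its own gzip member. The names `CdxEntry`, `parse_cdx`, `select`, and `read_record` are hypothetical.

```python
import gzip
from typing import Iterable, Iterator, List, NamedTuple


class CdxEntry(NamedTuple):
    url: str
    timestamp: str
    mime: str
    status: str
    length: int  # compressed size of the record in bytes
    offset: int  # byte offset of the record inside the .warc.gz file


def parse_cdx(lines: Iterable[str]) -> Iterator[CdxEntry]:
    """Parse CDX lines, assuming the 11-field order N b a m s k r M S V g."""
    for line in lines:
        parts = line.split()
        # Skip the header line and anything that does not match the assumed layout.
        if len(parts) != 11 or not (parts[8].isdigit() and parts[9].isdigit()):
            continue
        yield CdxEntry(parts[2], parts[1], parts[3], parts[4],
                       int(parts[8]), int(parts[9]))


def select(entries: Iterable[CdxEntry], mime: str = "text/html",
           status: str = "200") -> List[CdxEntry]:
    """Filter on index metadata only; no WARC bytes are read at this stage."""
    return [e for e in entries if e.mime == mime and e.status == status]


def read_record(warc, entry: CdxEntry) -> bytes:
    """Seek to one record in a seekable binary file and decompress only that record."""
    warc.seek(entry.offset)
    return gzip.decompress(warc.read(entry.length))
```

Filtering happens on the small index first, so only the selected records are read from the archive. As the statement above notes, many such scattered reads are not guaranteed to beat a single sequential scan.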
“…The most found studies on web archiving centred around technical aspects in which the research involved a few numbers of technology tools or software. These can be referred to the study by Yu et al (2009), Chu et al (2010), Kawano (2011), Anjum (2012), Holzmann et al (2016), and Bahry et al (2019) that involved technology application as the research approach in which many of them used experimentation method. Previous studies also experimented with the best practices and technology tools to be used in any step involved in archiving the web collections.…”
Section: Web Archiving Flow Process/ Approach (mentioning)
confidence: 99%
“…Several Python libraries, such as NLTK (Bird et al, 2009), TextBlob (Loria et al, 2014), spaCy (Honnibal & Montani, 2017), and Gensim (Řehůřek & Sojka, 2011), can be used in linguistic analysis and clustering. MLlib (Meng et al, 2016), for efficient machine learning, and the ArchiveSpark (Holzmann, Goel, & Anand, 2016) library and extension of Apache Spark (2019), each facilitate processing on the Hadoop cluster. Regarding cloud computing and deep learning, we held a guest lecture and instructed students on how to install both TensorFlow (Abadi et al, 2016) and PyTorch (Paszke et al, 2019) and on how to run their code on the two ARC platforms.…”
Section: Software Resources (mentioning)
confidence: 99%
“…As mentioned above, each event has two types of files: a data file (i.e., *.warc.gz) and an index file (i.e., *.cdx). We provided a Scala script built on ArchiveSpark (Holzmann et al, 2016) so that event teams can read the two files and extract Web page payloads for further processing. By sharing the Web archive files and our code, we expect students who are interested in Web archiving and big data to learn more about the accessible WARC format, Apache Spark, and parallel processing.…”
Section: Event Collections (mentioning)
confidence: 99%