ArchiveSpark

Holzmann, Helge; Goel, Vinay; Anand, Avishek

doi:10.1145/2910896.2910902

Cited by 25 publications

(5 citation statements)

References 8 publications

“…The weakness of this approach, however, is its dependency on an inefficient storage format. Because ArchiveSpark [27] is the only published example we are aware of that leverages a derivative format in batch web archive analysis, we use it as a surrogate to evaluate the performance impact of WARC derivatives. The results show the performance is only comparable to more efficient formats for transactional workloads.…”

Section: Conclusion and Discussion (mentioning)

confidence: 99%

The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

Wang

Xie

2020

Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020

The WARC file format is widely used by web archives to preserve collected web content for future use. With the rapid growth of web archives and the increasing interest to reuse these archives as big data sources for statistical and analytical research, the speed to turn these data into insights becomes critical. In this paper we show that the WARC format carries significant performance penalties for batch processing workload. We trace the root cause of these penalties to its data structure, encoding, and addressing method. We then run controlled experiments to illustrate how severe these problems can be. Indeed, performance gain of one to two orders of magnitude can be achieved simply by reformatting WARC files into Parquet or Avro formats. While these results do not necessarily constitute an endorsement for Avro or Parquet, the time has come for the web archiving community to consider replacing WARC with more efficient web archival formats.

“…Querying an ArchivesUnleashedToolkit application therefore requires repeated loading of the WARC files in full. ArchiveSpark [27], on the other hand, leverages CDX to selectively load WARC. Without sophisticated I/O scheduling, however, a full disk scan can still outperform many selective disk reads bundled together.…”

Section: Related Work (mentioning)

confidence: 99%

The Case For Alternative Web Archival Formats To Expedite The Data-To-Insight Cycle

Wang

Xie

2020

Proceedings of the ACM/IEEE Joint Conference on Digital Libraries in 2020

“…The most found studies on web archiving centred around technical aspects in which the research involved a few numbers of technology tools or software. These can be referred to the study by Yu et al (2009), Chu et al (2010), Kawano (2011), Anjum (2012), Holzmann et al (2016), and Bahry et al (2019) that involved technology application as the research approach in which many of them used experimentation method. Previous studies also experimented with the best practices and technology tools to be used in any step involved in archiving the web collections.…”

Section: Web Archiving Flow Process/ Approach (mentioning)

confidence: 99%

Exploration of Web Contents of Selangor Royal Family using Web Archiving Tools

Mohd Nor

Saiful Bahry

Mohamed Shuhidan

et al. 2024

E-BPJ

This study aims to explore the web contents of the Selangor Royal family as an initiative to preserve content of social and cultural value. Initially, the websites’ profiles and the contents related to the Selangor Royal family were explored using the tools BuildWith and XML-Sitemaps. The selected web pages were then archived using the web archiving tools HTTrack and Conifer. The findings indicate that there are different types of websites and website owners covering the Selangor Royal family-related content. The recorded information and lifestyles of the royal family represent value in building Malaysian identity and are significant to be preserved for future discovery.

“…Several Python libraries, such as NLTK (Bird et al, 2009), TextBlob (Loria et al, 2014), spaCy (Honnibal & Montani, 2017), and Gensim (Řehůřek & Sojka, 2011), can be used in linguistic analysis and clustering. MLlib (Meng et al, 2016), for efficient machine learning, and the ArchiveSpark (Holzmann, Goel, & Anand, 2016) library and extension of Apache Spark (2019), each facilitate processing on the Hadoop cluster. Regarding cloud computing and deep learning, we held a guest lecture and instructed students on how to install both TensorFlow (Abadi et al, 2016) and PyTorch (Paszke et al, 2019) and on how to run their code on the two ARC platforms.…”

Section: Software Resources (mentioning)

confidence: 99%

“…As mentioned above, each event has two types of files: a data file (i.e., *.warc.gz) and an index file (i.e., *.cdx). We provided a Scala script built on ArchiveSpark (Holzmann et al, 2016) so that event teams can read the two files and extract Web page payloads for further processing. By sharing the Web archive files and our code, we expect students who are interested in Web archiving and big data to learn more about the accessible WARC format, Apache Spark, and parallel processing.…”

Section: Event Collections (mentioning)

confidence: 99%

Teaching Natural Language Processing through Big Data Text Summarization with Problem-Based Learning

Geissinger

Ingram

et al. 2020

Data and Information Management

Natural language processing (NLP) covers a large number of topics and tasks related to data and information management, leading to a complex and challenging teaching process. Meanwhile, problem-based learning is a teaching technique specifically designed to motivate students to learn efficiently, work collaboratively, and communicate effectively. With this aim, we developed a problem-based learning course for both undergraduate and graduate students to teach NLP. We provided student teams with big data sets, basic guidelines, cloud computing resources, and other aids to help different teams in summarizing two types of big collections: Web pages related to events, and electronic theses and dissertations (ETDs). Student teams then deployed different libraries, tools, methods, and algorithms to solve the task of big data text summarization. Summarization is an ideal problem to address learning NLP since it involves all levels of linguistics, as well as many of the tools and techniques used by NLP practitioners. The evaluation results showed that all teams generated coherent and readable summaries. Many summaries were of high quality and accurately described their corresponding events or ETD chapters, and the teams produced them along with NLP pipelines in a single semester. Further, both undergraduate and graduate students gave statistically significant positive feedback, relative to other courses in the Department of Computer Science. Accordingly, we encourage educators in the data and information management field to use our approach or similar methods in their teaching and hope that other researchers will also use our data sets and synergistic solutions to approach the new and challenging tasks we addressed.
