Web Scraping: From Tools to Related Legislation and Implementation Using Python

Nigam, Harshit; Biswas, Prantik

doi:10.1007/978-981-15-9651-3_13

Cited by 16 publications

(7 citation statements)

References 28 publications


“…[14] Retrieval of structured data through online table extraction. In such cases, a complex approach is applied that includes the detection of certain patterns across rows and columns and determining the search content [15]. Dynamic rendering usually demands more complicated technologies, which are particularly relevant for a page with JavaScript involved. To function, the headless browsers must allow for emulating user inputs and showing changing data.…”

Section: Methods For Extracting Data (mentioning)

confidence: 99%

Web Scraping Scientific Repositories: Springer and Nature for University of Basrah

Taufeeq Al-Madhhachi,

A. Mahmood

2023

J. Al-Qadisiyah Comp. Sci. Math.


This study explores the field of scientific data extraction using online scraping techniques, with a specific focus on the Springer and Nature archives within the University of Basrah's setting. This study aims to explicate the theoretical underpinnings of web scraping, emphasizing its importance in the acquisition of structured data from online sources. This study explores the many issues presented by dynamic content, CAPTCHAs, and IP blocking and proposes novel solutions for each of these obstacles. The university's research objectives were supported by a rich dataset that was carefully constructed through a painstaking approach encompassing data collection and preparation techniques. The results highlight the effectiveness of web scraping and the significant influence of preprocessing. This study not only enhances the existing body of academic research methodology but also advances the University of Basrah's pursuit of data-driven and influential scholarly pursuits.


“…One web scraping technique consists of navigating the elements of the HTML page in tree form [15], an activity facilitated by tools available for the Python programming language. Among them, the following libraries stand out: Beautiful-Soup 8, which makes it possible to interpret HTML elements as a tree; requests 9, used to make HTTP requests; and re 10, which provides operations with regular expressions (regex).…”

Section: Percurso Metodológico [Methodological Path] (unclassified)

PySol: Uma Proposta Python para Automação de Busca na SBC OpenLib

Chagas Souza,

Barros de Sales,

G. Q. Palmeira

2024

Anais Do XV Computer on the Beach - COTB'24


Conducting a systematic review is complex, but it can be simplified using computational resources. Utilizing multiple research sources is essential to cover most studies relevant to the investigated topic. In this context, SBC OpenLib (SOL), the open digital library of the Brazilian Society of Computing (SBC), is an important bibliographic source for systematic reviews in Computing-related fields, offering access to all academic and scientific content produced by the SBC. However, SOL’s limitation in its automatic search feature is the lack of flexibility in exporting results, a crucial criterion for a database to be included in a systematic review’s search strategy. Addressing this issue, this paper introduces pySol, a Python-based tool designed to automate searches and export results from the SOL database. PySol was developed using web scraping techniques to extract automatic search results from the SOL database. Its development followed Test-Driven Development (TDD) principles, resulting in over 280 unit tests and a code coverage of 93%. These indicators highlight the tool’s reliability for systematic searches. The tool intends to support the study identification stage in systematic reviews, significantly reducing the time and effort needed to search in the SOL database. The automation of this process not only eases the execution of systematic searches in this database but also enables the export of results in BibTeX format, facilitating integration with reference managers.
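As an illustration of the tree navigation and regex matching described in the citation statement above, here is a minimal sketch. It is not pySol's code, and it deliberately swaps the Beautiful Soup and requests libraries named in the statement for Python's standard-library `html.parser` and `re`, so that it runs without extra installs or network access. The `Node`, `TreeBuilder`, and `extract_papers` names and the `SAMPLE_HTML` snippet are hypothetical, invented for this example.

```python
import re
from html.parser import HTMLParser

# Elements that never have a closing tag, so the parser must not descend into them.
VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}


class Node:
    """One element in a simplified HTML tree (hypothetical helper)."""

    def __init__(self, tag, attrs=None, parent=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.parent = parent
        self.children = []
        self.text = ""

    def find_all(self, tag):
        """Return all descendants with the given tag name, depth-first."""
        found = []
        for child in self.children:
            if child.tag == tag:
                found.append(child)
            found.extend(child.find_all(tag))
        return found


class TreeBuilder(HTMLParser):
    """Builds a Node tree from the parser's start/end/data events."""

    def __init__(self):
        super().__init__()
        self.root = Node("[document]")
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs, parent=self.current)
        self.current.children.append(node)
        if tag not in VOID_TAGS:
            self.current = node

    def handle_endtag(self, tag):
        # Climb to the nearest open element with this tag, then step out of it.
        node = self.current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self.current = node.parent

    def handle_data(self, data):
        self.current.text += data


# Hypothetical search-result markup; a real page would be fetched over HTTP.
SAMPLE_HTML = """
<html><body>
  <ul class="results">
    <li><a href="/paper/101">Web scraping survey (2021)</a></li>
    <li><a href="/paper/202">Headless browsers (2023)</a></li>
  </ul>
</body></html>
"""


def extract_papers(html):
    """Walk the tree for <a> elements and pull the year out with a regex."""
    builder = TreeBuilder()
    builder.feed(html)
    year_pattern = re.compile(r"\((\d{4})\)")
    papers = []
    for link in builder.root.find_all("a"):
        match = year_pattern.search(link.text)
        papers.append({
            "href": link.attrs.get("href"),
            "title": year_pattern.sub("", link.text).strip(),
            "year": int(match.group(1)) if match else None,
        })
    return papers


if __name__ == "__main__":
    for paper in extract_papers(SAMPLE_HTML):
        print(paper)
```

In a real scraper the HTML would come from an HTTP response rather than a string literal, and a dedicated parser such as Beautiful Soup would likely handle malformed markup more robustly than this small tree builder.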


“…Several works in the literature have addressed the problem of data extraction from web pages either by accessing the databases through webpages or by APIs [9][10][11][12][13]. Furthermore, big data extraction has spread to many scientific fields, such as medicine, where the volume of medical data is exponentially increasing.…”

Section: Related Work (mentioning)

confidence: 99%

“…Many tools have been constructed to extract data for machine learning purposes. These include command-line-based methods, such as application programming interface (API) [9,10] and web scraping methods, that extract information from websites [11][12][13]. Alternatively, some researchers or companies hire people to extract data manually, which costs them time and money; therefore, a method that extracts candidate information automatically is desperately needed.…”

Section: Introduction (mentioning)

confidence: 99%

Big Data Bot with a Special Reference to Bioinformatics

Al-Omari,

Tawalbeh,

Akkam

et al. 2023

Computers, Materials & Continua


There are quintillions of data on deoxyribonucleic acid (DNA) and protein in publicly accessible data banks, and that number is expanding at an exponential rate. Many scientific fields, such as bioinformatics and drug discovery, rely on such data; nevertheless, gathering and extracting data from these resources is a tough undertaking. This data should go through several processes, including mining, data processing, analysis, and classification. This study proposes software that extracts data from big data repositories automatically and with the particular ability to repeat data extraction phases as many times as needed without human intervention. This software simulates the extraction of data from web-based (point-and-click) resources or graphical user interfaces that cannot be accessed using command-line tools. The software was evaluated by creating a novel database of 34 parameters for 1360 physicochemical properties of antimicrobial peptides (AMP) sequences (46240 hits) from various MARVIN software panels, which can be later utilized to develop novel AMPs. Furthermore, for machine learning research, the program was validated by extracting 10,000 protein tertiary structures from the Protein Data Bank. As a result, data collection from the web will become faster and less expensive, with no need for manual data extraction. The software is critical as a first step to preparing large datasets for subsequent stages of analysis, such as those using machine and deep-learning applications.
