ABSTRACT-The purpose of this article is to use the Web Scraping technique to extract data from Google Scholar (GS) through several methods. Web Scraping is a form of unstructured data mining that makes it possible to extract information from web pages, scan their HTML code, and generate data-extraction patterns. In addition, to support a deeper analysis, an algorithm was written in the R language to compare the methods' data-extraction speed and the efficiency of their output format. The article presents the tests carried out on these methods to measure data-extraction speed and to identify the best way to extract GS data in structured form.
Keywords-Web Scraping, Google Scholar, data mining, R language, data analysis.
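As a rough illustration of scanning HTML with an extraction pattern, producing structured output, and timing the extraction, here is a minimal sketch. It is written in Python with the standard library only, not in the R used by the authors, and it is not their algorithm. The sample markup and the class names `pub-row`, `pub-title`, and `pub-cites` are hypothetical and do not reflect Google Scholar's real HTML; the sketch parses an in-memory string and makes no network requests.

```python
import time
from html.parser import HTMLParser

# Hypothetical markup: these class names are invented for this sketch
# and are not Google Scholar's real HTML.
SAMPLE_HTML = """
<table>
  <tr class="pub-row"><td class="pub-title">Paper A</td><td class="pub-cites">12</td></tr>
  <tr class="pub-row"><td class="pub-title">Paper B</td><td class="pub-cites">3</td></tr>
</table>
"""


class PublicationParser(HTMLParser):
    """Collects one {title, citations} record per row marked 'pub-row'."""

    def __init__(self):
        super().__init__()
        self.records = []   # finished rows, as dicts
        self._row = None    # row currently being filled, or None
        self._field = None  # field the current <td> maps to, or None

    def handle_starttag(self, tag, attrs):
        css = dict(attrs).get("class", "")
        if tag == "tr" and css == "pub-row":
            self._row = {}
        elif tag == "td" and self._row is not None:
            if css == "pub-title":
                self._field = "title"
            elif css == "pub-cites":
                self._field = "citations"

    def handle_data(self, data):
        # Text may arrive in several chunks, so append rather than overwrite.
        if self._row is not None and self._field is not None:
            self._row[self._field] = self._row.get(self._field, "") + data

    def handle_endtag(self, tag):
        if tag == "td":
            self._field = None
        elif tag == "tr" and self._row is not None:
            # Normalize the output format: trimmed title, integer citation count.
            self._row["title"] = self._row.get("title", "").strip()
            self._row["citations"] = int(self._row.get("citations") or 0)
            self.records.append(self._row)
            self._row = None


def scrape(html):
    """Apply the extraction pattern to an HTML string and return the records."""
    parser = PublicationParser()
    parser.feed(html)
    parser.close()
    return parser.records


def timed(func, *args):
    """Run func(*args) and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = func(*args)
    return result, time.perf_counter() - start


records, seconds = timed(scrape, SAMPLE_HTML)
print(records)
print(f"extracted {len(records)} records in {seconds:.6f} s")
```

Wrapping each extraction method in the same `timed` helper is one simple way to compare methods on equal terms; the elapsed time varies by machine, so only relative differences between methods are meaningful.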
The need to measure the contribution of researchers through academic profiles is of great importance, which is why in 2018 we created an algorithm in the R language to dynamically extract data from individual and institutional public profiles in Google Scholar Citations. Although the algorithm has been very useful for automatic data extraction, making it possible to produce statistical reports and analyses from the data, it can only be used by someone who knows the R language, because of the many R functions integrated into the algorithm. In this work we present a web application that integrates the algorithm for extracting data from Google Scholar Citations while making these scripts easier to use through the R Shiny package, which integrates web components from RStudio while retaining the programming characteristics of the language. Shiny converts scripts into interactive web applications without requiring any knowledge of HTML, CSS, or JavaScript, making them easy for users to use, manipulate, and view, and allowing future updates to improve functionality. The results of the tests and tasks carried out in this work show that the extraction algorithm could be integrated into the Shiny web application without difficulty, improving extraction time (measured in seconds and minutes), because the user interacts with the web interface rather than with the R code. This allows users who are new to R and who analyze Google Scholar data to use it.