Discovering and accessing geospatial data presents a significant challenge for the Earth sciences community as massive amounts of data are being produced on a daily basis. In this article, we report a smart web-based geospatial data discovery system that mines and utilizes data relevancy from metadata and user behavior. Specifically, (1) the system enables semantic query expansion and suggestion to assist users in finding more relevant data; (2) machine-learned ranking is utilized to provide the optimal search ranking based on a number of identified ranking features that can reflect users' search preferences; (3) a hybrid recommendation module is designed to allow users to discover related data considering metadata attributes and user behavior; and (4) an integrated graphical user interface is developed to quickly and intuitively guide data consumers to the appropriate data resources. As a proof of concept, we focus on a well-defined domain, oceanography, and use oceanographic data discovery as an example. Experiments and a search example show that the proposed system can improve the scientific community's data search experience by providing query expansion, suggestion, better search ranking, and data recommendation via a user-friendly interface.
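The hybrid recommendation idea in item (3) can be illustrated with a minimal sketch. It is not the system's actual module: the Jaccard similarity, the equal default weighting, and the dataset names (`sst`, `salinity`, `wind`) are all assumptions chosen for illustration. The sketch blends a metadata-based score with a behavior-based score derived from which datasets appear in the same user sessions.

```python
def jaccard(a, b):
    """Jaccard similarity of two collections, 0.0 when both are empty."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def hybrid_scores(target, metadata, sessions, weight=0.5):
    """Rank every other dataset against `target`.

    metadata: dict mapping dataset name -> set of keywords (hypothetical schema)
    sessions: list of sets, each holding the datasets viewed in one session
    weight:   share given to metadata similarity (assumed, not from the paper)
    """
    target_sessions = {i for i, s in enumerate(sessions) if target in s}
    scores = {}
    for other in metadata:
        if other == target:
            continue
        # Content signal: overlap of metadata keywords.
        content = jaccard(metadata[target], metadata[other])
        # Behavior signal: overlap of the sessions in which each dataset appears.
        other_sessions = {i for i, s in enumerate(sessions) if other in s}
        behavior = jaccard(target_sessions, other_sessions)
        scores[other] = weight * content + (1 - weight) * behavior
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical metadata and usage sessions.
metadata = {
    "sst": {"ocean", "temperature"},
    "salinity": {"ocean", "salinity"},
    "wind": {"atmosphere", "wind"},
}
sessions = [{"sst", "salinity"}, {"sst", "wind"}, {"sst", "salinity"}]
ranked = hybrid_scores("sst", metadata, sessions)
```

With this toy input, `salinity` ranks above `wind` because it shares a keyword with `sst` and co-occurs with it in more sessions.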
Abstract. The Regional Climate Model Evaluation System (RCMES) is an enabling tool of
the National Aeronautics and Space Administration to support the United
States National Climate Assessment. As a comprehensive system for evaluating
climate models on regional and continental scales using observational
datasets from a variety of sources, RCMES is designed to yield information on
the performance of climate models and guide their improvement. Here, we
present a user-oriented document describing the latest version of RCMES, its
development process, and future plans for improvements. The main objective of
RCMES is to facilitate the climate model evaluation process at regional
scales. RCMES provides a framework for performing systematic evaluations of
climate simulations, such as those from the Coordinated Regional Climate
Downscaling Experiment (CORDEX), using in situ observations, as well as satellite and reanalysis data
products. The main components of RCMES are (1) a database of observations
widely used for climate model evaluation, (2) various data loaders to import
climate models and observations from local file systems and Earth System Grid
Federation (ESGF) nodes, (3) a versatile processor to subset and regrid
the loaded datasets, (4) performance metrics designed to assess and quantify
model skill, (5) plotting routines to visualize the performance metrics,
(6) a toolkit for statistically downscaling climate model simulations, and
(7) two installation packages to maximize convenience for users without Python
skills. The RCMES website is kept up to date with a brief explanation of
these components. Although there are other open-source software (OSS)
toolkits that facilitate analysis and evaluation of climate models, there is
a need for climate scientists to participate in the development and
customization of OSS to study regional climate change. To establish
infrastructure and to ensure software sustainability, development of RCMES is
an open, publicly accessible process enabled by leveraging the Apache
Software Foundation's OSS library, Apache Open Climate Workbench (OCW). The
OCW software that powers RCMES includes a Python OSS library for common
climate model evaluation tasks as well as a set of user-friendly interfaces
for quickly configuring a model evaluation task. OCW also allows users to
build their own climate data analysis tools, such as the statistical
downscaling toolkit provided as a part of RCMES.
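
The subset-then-score workflow described in components (3) and (4) can be sketched in a few lines. This is a plain-Python illustration only and deliberately does not use the OCW API. The function names, the nested-list grid layout, and the sample values are assumptions, and both fields are assumed to already share one grid, so the regridding step is omitted.

```python
from math import sqrt


def subset(grid, lats, lons, lat_bounds, lon_bounds):
    """Keep the cells whose coordinates fall inside the inclusive bounds."""
    rows = [i for i, lat in enumerate(lats) if lat_bounds[0] <= lat <= lat_bounds[1]]
    cols = [j for j, lon in enumerate(lons) if lon_bounds[0] <= lon <= lon_bounds[1]]
    return [[grid[i][j] for j in cols] for i in rows]


def _differences(model, obs):
    """Cell-by-cell model minus observation, flattened into one list."""
    return [m - o for mrow, orow in zip(model, obs) for m, o in zip(mrow, orow)]


def mean_bias(model, obs):
    """Average of model minus observation over all cells."""
    diffs = _differences(model, obs)
    return sum(diffs) / len(diffs)


def rmse(model, obs):
    """Root-mean-square error over all cells."""
    diffs = _differences(model, obs)
    return sqrt(sum(d * d for d in diffs) / len(diffs))


# Hypothetical 3x2 fields on a shared latitude/longitude grid.
lats, lons = [10, 20, 30], [100, 110]
model = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
obs = [[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]]

# Restrict both fields to the same region before scoring.
model_sub = subset(model, lats, lons, (20, 30), (100, 110))
obs_sub = subset(obs, lats, lons, (20, 30), (100, 110))
bias = mean_bias(model_sub, obs_sub)
error = rmse(model_sub, obs_sub)
```

Applying the same subset to both fields before computing a metric keeps the comparison restricted to the region of interest, which is the point of evaluating at regional scales.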
The volume, variety, and velocity of different data, e.g., simulation data, observation data, and social media data, are growing ever faster, posing grand challenges for data discovery. An increasing trend in data discovery is to mine hidden relationships among users and metadata from web usage logs to support the data discovery process. Web usage log mining is the process of reconstructing sessions from raw logs and finding interesting patterns or implicit linkages. The mining results play an important role in improving the quality of search-related components, e.g., ranking, query suggestion, and recommendation. Although research has been done in the data discovery domain, collecting and analyzing logs efficiently remains a challenge because (1) the volume of web usage logs continues to grow as long as users access the data; (2) the dynamic volume of logs requires on-demand computing resources for mining tasks; and (3) the mining process is compute-intensive and time-intensive. To speed up the mining process, we propose a cloud-based log-mining framework using Apache Spark and Elasticsearch. In addition, a data partition paradigm, logPartitioner, is designed to solve the data imbalance problem in data parallelism. As a proof of concept, oceanographic data search and access logs are chosen to validate the performance of the proposed parallel log-mining framework.
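Two of the ideas above, session reconstruction and balanced partitioning, can be sketched in plain Python. This is not the paper's Spark implementation, and the partitioning function is not logPartitioner itself. The 30-minute inactivity gap, the record layout, and the greedy largest-first heuristic are all assumptions made for illustration.

```python
import heapq
from collections import defaultdict
from datetime import datetime, timedelta

SESSION_GAP = timedelta(minutes=30)  # assumed inactivity threshold


def reconstruct_sessions(records, gap=SESSION_GAP):
    """Group (user_id, timestamp, request) records into per-user sessions.

    A new session starts whenever two consecutive requests from the same
    user are separated by more than `gap`.
    """
    by_user = defaultdict(list)
    for user_id, ts, request in records:
        by_user[user_id].append((ts, request))

    sessions = []
    for user_id, events in by_user.items():
        events.sort(key=lambda e: e[0])
        current = [events[0]]
        for prev, cur in zip(events, events[1:]):
            if cur[0] - prev[0] > gap:
                sessions.append((user_id, current))
                current = []
            current.append(cur)
        sessions.append((user_id, current))
    return sessions


def balanced_partitions(group_sizes, n_partitions):
    """Greedy heuristic: give the largest remaining group to the lightest partition."""
    heap = [(0, idx) for idx in range(n_partitions)]
    heapq.heapify(heap)
    assignment = {}
    for key, size in sorted(group_sizes.items(), key=lambda kv: kv[1], reverse=True):
        load, idx = heapq.heappop(heap)
        assignment[key] = idx
        heapq.heappush(heap, (load + size, idx))
    return assignment


# Hypothetical log records: (user, timestamp, requested resource).
records = [
    ("a", datetime(2020, 1, 1, 10, 0), "/search?q=sst"),
    ("b", datetime(2020, 1, 1, 10, 5), "/search?q=wind"),
    ("a", datetime(2020, 1, 1, 10, 10), "/dataset/1"),
    ("a", datetime(2020, 1, 1, 11, 0), "/search?q=salinity"),
]
sessions = reconstruct_sessions(records)

# Hypothetical record counts per log group (e.g., per day), spread over 2 workers.
assignment = balanced_partitions({"d1": 5, "d2": 4, "d3": 3, "d4": 2}, 2)
```

In this toy input, user `a` yields two sessions because of the 50-minute pause, and the four groups are split so that each partition carries seven records. Assigning work by record count rather than by group count is one simple way to avoid the skew that arises when log volume varies widely across groups.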