Scientific datasets are central to scientific research, and researchers are always looking for ways to improve their work. Data discovery and reuse has recently become a popular way for researchers to find more scientific datasets, and it is the subject of much current research. Nevertheless, most of this work is limited, because researchers only trust and reuse data obtained through their social networks (such as colleagues or collaborators). Data obtained this way has its own limitations. For instance, colleagues will only recommend data that they themselves consider helpful, yet one scientist's useful data might be another's noise.
A dataset search engine helps researchers find data outside their social networks by exploring scientific datasets in open data repositories or hubs. Such a search engine typically returns related or relevant datasets in response to a keyword query. Several existing dataset search engines already provide millions of open datasets from open-access sources. However, the performance of a dataset search engine depends heavily on the quality of the keyword queries that researchers submit. For instance, two differently worded queries with the same meaning may return different results from the same search engine.
The main question of this thesis is how to find more relevant scientific datasets from a query on a dataset search engine. The proposed solution is dataset recommendation, which adds one step after a user submits a query: "if you like this dataset, you may also like these other datasets." Users submit a keyword query to a dataset search engine and receive a list of datasets. They can then tell the search engine which of the returned datasets they like, and the search engine responds, "If you like this dataset, you would also like those other datasets," recommending further datasets.
This thesis aims to support dataset search engines with such scientific dataset recommendations using Semantic Web technologies. Its central idea is "if you like this dataset, you would also like that other dataset". The thesis sets out the challenges and possibilities of using Semantic Web technologies for scientific dataset recommendation, and explores 1) how ontologies can be used for dataset recommendation, and 2) how knowledge graphs can be used for dataset recommendation. These two questions are addressed in six chapters.
A scientific item (dataset or paper) recommendation benchmark was developed in the course of this work to provide a standard for evaluating scientific item recommendation methods. The benchmark contains a large-scale corpus of millions of datasets and papers, together with the trusted links between them. In addition, a collection of entity embeddings was created. These are scientific item embeddings obtained by applying knowledge graph embedding to the citation network of the Microsoft Academic Knowledge Graph.
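As a rough illustration of the "if you like this dataset" step, the sketch below ranks candidate datasets by the cosine similarity of their embeddings to the embedding of a dataset the user liked. This is only one plausible way to use entity embeddings for recommendation and is not presented as the method of this thesis. The function names, the dataset identifiers, and the three-dimensional toy vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def recommend(liked_id, embeddings, top_k=2):
    """Return the ids of the top_k items most similar to the liked item.

    `embeddings` maps an item id to its vector. The liked item itself is
    excluded, and ties are broken alphabetically by id.
    """
    liked = embeddings[liked_id]
    scored = [
        (cosine(liked, vec), other)
        for other, vec in embeddings.items()
        if other != liked_id
    ]
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [other for _, other in scored[:top_k]]


# Hypothetical toy embeddings; real ones would come from a trained
# knowledge graph embedding model and have many more dimensions.
toy_embeddings = {
    "dataset_a": [1.0, 0.0, 0.0],
    "dataset_b": [0.9, 0.1, 0.0],
    "dataset_c": [0.0, 1.0, 0.0],
    "dataset_d": [0.7, 0.7, 0.0],
}

print(recommend("dataset_a", toy_embeddings))
```

In this toy example, a user who liked `dataset_a` would be offered `dataset_b` first and `dataset_d` second, because their vectors point in the most similar directions.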