• A study of the discovery problem of Graph Entity Dependencies (GEDs).
• A new and efficient approach for the discovery of GEDs in property graphs.
• A minimum description length inspired definition of interestingness of GEDs to rank discovered rules.
• A thorough empirical evaluation of the proposed technique, with examples of useful rules (mined) that are relevant in data quality/management applications.
Topic modelling is important for tackling several data mining tasks in information retrieval. While seminal topic modelling techniques such as Latent Dirichlet Allocation (LDA) have been proposed, the ubiquity of social media and the brevity of its texts pose unique challenges for such traditional techniques. Several extensions, including auxiliary aggregation, self-aggregation and direct learning, have been proposed to mitigate these challenges; however, some remain. These include a lack of consistency in the topics generated and a decline in model performance in applications involving disparate document lengths. There is a recent paradigm shift towards neural topic models, which are not suited to resource-constrained environments. This paper revisits LDA-style techniques, taking a theoretical approach to analyse the relationship between word co-occurrence and topic models. Our analysis shows that topic discovery can be enhanced by altering the word co-occurrences within the corpus. Thus, we propose a novel data transformation approach, dubbed DATM, to improve topic discovery within a corpus. A rigorous empirical evaluation shows that DATM is not only powerful, but can also be used in conjunction with existing benchmark techniques to significantly improve their effectiveness and their consistency by up to two-fold.
Knowledge Graphs (KGs), one of the key trends driving the next wave of technologies, have become a new form of knowledge representation and a cornerstone for applications ranging from generic to specific industrial use cases. However, in some domains, such as law enforcement, a real and large domain-oriented KG is often unavailable due to data privacy concerns. In such domains it is necessary to generate a synthetic KG that mimics the properties of a real KG in the domain. Although a variety of graph data generators have been proposed over the last two decades to generate different kinds of networks, state-of-the-art synthetic graph data generators cannot feasibly generate realistic synthetic KGs, because KGs always contain data characteristics with specified semantics. In this work, we propose a schema-driven synthetic KG generation approach with extended graph differential dependencies (GDDx), an extension of the recently developed graph entity/differential dependencies that represent formal constraints for graph data, to enable the generation of desired graph patterns in a synthetic KG. Next, we develop an effective KG generation algorithm that employs the schema and the pre-defined GDDxs. Finally, we evaluate our synthetic KG generator and compare it with several state-of-the-art synthetic graph generators. The experimental results show that our KG generation method can generate KGs that exhibit the desired graph patterns, node attributes and degree distributions associated with each entity type in the graph's schema.