Schema profiling of document-oriented databases

Gallinucci, Enrico; Golfarelli, Matteo; Rizzi, Stefano

doi:10.1016/j.is.2018.02.007

Cited by 56 publications

(41 citation statements)

References 26 publications

“…[4] also identifies the smallest set of core attributes, but their approach is more complex and computationally expensive than the one we present here. Finally, [28] goes a step further by not finding a common schema, but trying to explain the different variants found in documents by means of association rules.…”

Section: Related Work (mentioning)

confidence: 99%

Approximating the Schema of a Set of Documents by Means of Resemblance

Abelló

Palol

Hacid

2018

J Data Semant

The WWW contains a huge amount of documents. Some of them share the same subject, but are generated by different people or even by different organizations. A semi-structured model allows to share documents that do not have exactly the same structure. However, it does not facilitate the understanding of such heterogeneous documents. In this paper, we offer a characterization and algorithm to obtain a representative (in terms of a resemblance function) of a set of heterogeneous semi-structured documents. We approximate the representative so that the resemblance function is maximized. Then, the algorithm is generalized to deal with repetitions and different classes of documents. Although an exact representative could always be found using an unlimited number of optional elements, it would cause an overfitting problem. The size of an exact representative for a set of heterogeneous documents may even make it useless. Our experiments show that, for users, it is easier and faster to deal with smaller representatives, even compensating the loss in the approximation.

“…Studying the inference of RE(&) has several practical motivations, such as schema inference. The presence of a schema for XML documents has many advantages, such as for query processing and optimization, data integration and exchange [11,30]. However, many XML documents in practice are not accompanied by a valid schema [16], making schema inference an attractive research topic [2,3,10,14,31].…”

Section: Introduction (mentioning)

confidence: 99%

Inferring Restricted Regular Expressions with Interleaving from Positive and Negative Samples

Chen

Zhang

et al. 2020

Advances in Knowledge Discovery and Data Mining

The presence of a schema for XML documents has numerous advantages. Unfortunately, many XML documents in practice are not accompanied by a schema or a valid schema. Therefore, it is essential to devise algorithms to infer schemas. The fundamental task in XML schema inference is to learn regular expressions. In this paper, we focus on learning the subclass of RE(&) called SIREs (the subclass of regular expressions with interleaving). Previous work in this direction lacks inference algorithms that support inference from positive and negative examples. We provide an algorithm to learn SIREs from positive and negative examples based on genetic algorithms and parallel techniques. Our algorithm also has better expansibility, which means that our algorithm not only supports learning with positive and negative examples, but also supports learning with positive or negative examples only. Experimental results demonstrate the effectiveness of our algorithm.

“…Recent years have witnessed an erosion of the relational DBMS predominance to the benefit of DBMSs based on alternative representation models (e.g., document-oriented and graph-based) which adopt a schemaless representation for data. Schemaless databases are preferred to relational ones for storing heterogeneous data with variable schemas and structural forms; typical schema variants within a collection consist in missing or additional fields, in different names or types for a field, and in different structures for instances [1]. The absence of a unique schema grants flexibility to operational applications but adds complexity to analytical applications, in which a single analysis often involves large sets of data with different schemas.…”

Section: Introduction (mentioning)

confidence: 99%

“…In this paper we propose an original approach to multidimensional querying and OLAP on schemaless sources, in particular on collections stored in document-oriented databases (DODs) such as MongoDB. The basic idea is to stop fighting against data heterogeneity and schema variety, and welcome it as an inherent source of information wealth in schemaless sources.…”

Section: Introduction (mentioning)

confidence: 99%