An Extensive Dataset of UML Models in GitHub

Robles, Gregório; Ho-Quang, Truong; Hebig, Regina; Chaudron, Michel R. V.; Fernández, Miguel Angel Castro

doi:10.1109/msr.2017.48

Cited by 45 publications

(40 citation statements)

References 6 publications

“…A recent and extensive study of models stored in various formats in GitHub, combining automatic processing and a lot of manual work, identified 93,596 UML models from 24,717 different repositories, of which 57,822 (61.8%) are images, the rest being files with extensions .xmi or .uml [17]. This confirms the existence of potentially useful and interesting information in repositories that is still difficult to access and reuse; and it also confirms that a large proportion of models is stored as images.…”

Section: Related Work

mentioning

confidence: 99%

Automatic Classification of Web Images as UML Static Diagrams Using Machine Learning Techniques

et al. 2020

Our purpose in this research is to develop a method to automatically and efficiently classify web images as Unified Modeling Language (UML) static diagrams, and to produce a computer tool that implements this function. The tool receives a bitmap file (in different formats) as an input and communicates whether the image corresponds to a diagram. For pragmatic reasons, we restricted ourselves to the simplest kinds of diagrams that are more useful for automated software reuse: computer-edited 2D representations of static diagrams. The tool does not require that the images are explicitly or implicitly tagged as UML diagrams. The tool extracts graphical characteristics from each image (such as grayscale histogram, color histogram and elementary geometric forms) and uses a combination of rules to classify it. The rules are obtained with machine learning techniques (rule induction) from a sample of 19,000 web images manually classified by experts. In this work, we do not consider the textual contents of the images. Our tool reaches nearly 95% of agreement with manually classified instances, improving the effectiveness of related research works. Moreover, using a training dataset 15 times bigger, the time required to process each image and extract its graphical features (0.680 s) is seven times lower.
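The feature-extraction and rule-based classification steps described in this abstract can be illustrated with a minimal sketch. It assumes an image is already decoded into a list of RGB tuples, computes a normalized grayscale histogram, and applies one hand-written rule. The function names, the bin count, and the thresholds are hypothetical and chosen for illustration; they are not the rules induced by the authors' tool.

```python
from typing import List, Sequence, Tuple

Pixel = Tuple[int, int, int]  # (R, G, B), each in 0..255


def grayscale_histogram(pixels: Sequence[Pixel], bins: int = 8) -> List[float]:
    """Return a normalized grayscale histogram with `bins` equal-width buckets."""
    if not pixels:
        raise ValueError("image has no pixels")
    counts = [0] * bins
    for r, g, b in pixels:
        # ITU-R BT.601 luma weights approximate perceived brightness.
        gray = 0.299 * r + 0.587 * g + 0.114 * b
        index = min(int(gray * bins / 256), bins - 1)
        counts[index] += 1
    total = len(pixels)
    return [count / total for count in counts]


def looks_like_diagram(
    hist: Sequence[float],
    min_background: float = 0.7,  # hypothetical threshold
    min_ink: float = 0.02,        # hypothetical threshold
) -> bool:
    """Hypothetical rule: mostly light background plus some dark line work."""
    return hist[-1] >= min_background and hist[0] >= min_ink


# A synthetic "diagram": 90% white background, 10% black lines.
diagram_pixels = [(255, 255, 255)] * 90 + [(0, 0, 0)] * 10
print(looks_like_diagram(grayscale_histogram(diagram_pixels)))
```

A real classifier of this kind would combine many such features (color histogram, detected rectangles and lines) and learn the rule thresholds from labeled images instead of fixing them by hand.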

“…First of all, UML design artifacts are prevalent in open source software repositories but have received relatively little attention from our community. Secondly, researchers have currently made available a large collection of labeled UML diagrams [8], thus facilitating other research groups to reproduce and extend the work presented here. Finally, we believe that classifying sequence and class diagrams is a natural binary classification task for low-shot learning and each type diagram has tell-tale features that should be learnable with a relatively few number of instances and also generalizable to unseen data.…”

Section: An Application Of Low-shot Learning

mentioning

confidence: 99%

“…This paper provides a proof-of-concept for the application of low-shot learning to mining software artifacts. In particular, we focus on the task of classifying unified modeling language (UML) diagrams from a recently-published, publicly-available dataset [8].…”

mentioning

confidence: 99%

Exploring the applicability of low-shot learning in mining software repositories

2019

Introduction

In the past couple of years, applications of deep learning to mining software repositories have grown in number and diversity of methods [1][2][3][4][5]. Fueled in part by easy-to-use libraries and graphics processing unit (GPU) computing, deep architectures have facilitated new avenues for research, often producing results that far surpass previous techniques. However, despite their advantages, the huge amount of labeled truth data traditionally required to train deep architectures for classification tasks, as well as the computational time required to iteratively improve such models, remains a substantial bottleneck [6]. As a result, some researchers are forced to turn away from deep architectures, despite the fact that for certain tasks (like image analysis and computer vision), deep learning consistently outperforms alternative algorithms and methodologies.

Low-shot learning refers to the practice of training machine learning models, including deep neural networks, using far fewer samples of each classification category than what is typically standard practice. In the extreme case, training data consists of only one instance for each target class, which is known as one-shot learning [7]. These approaches

Abstract

Background: Despite the well-documented and numerous recent successes of deep learning, the application of standard deep architectures to many classification problems within empirical software engineering remains problematic due to the large volumes of labeled data required for training. Here we make the argument that, for some problems, this hurdle can be overcome by taking advantage of low-shot learning in combination with simpler deep architectures that reduce the total number of parameters that need to be learned.

Findings: We apply low-shot learning to the task of classifying UML class and sequence diagrams from GitHub, and demonstrate that surprisingly good performance can be achieved by using only tens or hundreds of examples for each category when paired with an appropriate architecture. Using a large, off-the-shelf architecture, on the other hand, doesn't perform beyond random guessing even when trained on thousands of samples.

Conclusion: Our findings suggest that identifying problems within empirical software engineering that lend themselves to low-shot learning could accelerate the adoption of deep learning algorithms within the empirical software engineering community.
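The low-shot setup described above, training on only a few labeled examples per category, can be sketched as a sampling step that draws exactly k items per class from a larger labeled pool. This is an illustrative sketch only: the function name `sample_k_shot`, the file names, and the labels are hypothetical, and the cited paper's actual data pipeline is not shown on this page.

```python
import random
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Example = Tuple[str, str]  # (item identifier, class label)


def sample_k_shot(examples: Sequence[Example], k: int, seed: int = 0) -> List[Example]:
    """Draw exactly k examples per label, reproducibly for a given seed."""
    by_label: Dict[str, List[str]] = defaultdict(list)
    for item, label in examples:
        by_label[label].append(item)

    rng = random.Random(seed)
    subset: List[Example] = []
    for label in sorted(by_label):  # sorted for a deterministic order
        items = by_label[label]
        if len(items) < k:
            raise ValueError(f"label {label!r} has only {len(items)} examples")
        subset.extend((item, label) for item in rng.sample(items, k))
    return subset


# Hypothetical pool of labeled diagram images.
pool = [(f"class_{i}.png", "class") for i in range(5)]
pool += [(f"sequence_{i}.png", "sequence") for i in range(5)]
print(sample_k_shot(pool, k=2, seed=1))
```

With k = 1 this reduces to the one-shot case mentioned in the introduction; a small model would then be trained on the returned subset and evaluated on the remaining examples.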

“…In search of evidence [8] to substantiate this belief, we start from a publicly available data set of open-source software projects on GITHUB that use UML models [9], and: 1) assemble a control group of GITHUB projects not known to use UML models; 2) mine data from the GITHUB issue trackers of both sets of projects (using and not using UML models), estimating their defect rates ("bug" issue reports) as a proxy for software quality; and 3) use multivariate statistical modeling to estimate the impact of having UML models on defect proneness, while controling for confounding factors. Our results reveal a small statistically significant effect of using UML models on defect proneness, i.e., projects with UML models tend to have fewer defects.…”

Section: Does UML Modeling Associate With Lower Defect Proneness?: a

mentioning

confidence: 99%

Does UML Modeling Associate with Lower Defect Proneness?: A Preliminary Empirical Investigation

Raghuraman, Ho-Quang, Chaudron, et al. 2019

2019 IEEE/ACM 16th International Conference on Mining Software Repositories (MSR)

Self Cite
