Detecting Java software similarities by using different clustering techniques

Capiluppi, Andrea; Ruscio, Davide Di; Rocco, Juri Di; Nguyen, Phuong T.; Ajienka, Nemitari

doi:10.1016/j.infsof.2020.106279

Cited by 13 publications

(5 citation statements)

References 75 publications


“…However, their work is limited by the external library call which may fool as the similarity will largely depends on it. Another study [33] has confirmed that CrossSim may identify dissimilarity based on external API usage while internally implementing similar functionalities.…”

Section: Related Work (mentioning)

confidence: 84%

Repo2Vec: A Comprehensive Embedding Approach for Determining Repository Similarity

Rokon, Yan, Islam et al. 2021

Preprint


How can we identify similar repositories and clusters among a large online archive, such as GitHub? Determining repository similarity is an essential building block in studying the dynamics and the evolution of such software ecosystems. The key challenge is to determine the right representation for the diverse repository features in a way that: (a) it captures all aspects of the available information, and (b) it is readily usable by ML algorithms. We propose Repo2Vec, a comprehensive embedding approach to represent a repository as a distributed vector by combining features from three types of information sources. As our key novelty, we consider three types of information: (a) metadata, (b) the structure of the repository, and (c) the source code. We also introduce a series of embedding approaches to represent and combine these information types into a single embedding. We evaluate our method with two real datasets from GitHub for a combined 1013 repositories. First, we show that our method outperforms previous methods in terms of precision (93% vs 78%), with nearly twice as many Strongly Similar repositories and 30% fewer False Positives. Second, we show how Repo2Vec provides a solid basis for: (a) distinguishing between malware and benign repositories, and (b) identifying a meaningful hierarchical clustering. For example, we achieve 98% precision, and 96% recall in distinguishing malware and benign repositories. Overall, our work is a fundamental building block for enabling many repository analysis functions such as repository categorization by target platform or intention, detecting code-reuse and clones, and identifying lineage and evolution.


“…To conduct the comparison with PAM, we exploited its original source code which has been made available online by its authors. Furthermore, to facilitate future replications, we published all the artifacts together with the tools used in our evaluation in GITHUB [27].…”

Section: Discussion (mentioning)

confidence: 99%

“…This essentially means that these categories do not have much to do with similarity in API usages. Recently, attempts have been made to automatically assign a category to projects/apps [7], [43]. Among others, supervised learning techniques perform computation by exploiting labeled data, e.g., the apps and their corresponding categories specified by developers.…”

Section: Lessons Learned (mentioning)

confidence: 99%

Recommending API Function Calls and Code Snippets to Support Software Development

Nguyen, Rocco, Sipio et al. 2021

Preprint

Self Cite


Software development activity has reached a high degree of complexity, guided by the heterogeneity of the components, data sources, and tasks. The proliferation of open-source software (OSS) repositories has stressed the need to reuse available software artifacts efficiently. To this aim, it is necessary to explore approaches to mine data from software repositories and leverage it to produce helpful recommendations. We designed and implemented FOCUS as a novel approach to provide developers with API calls and source code while they are programming. The system works on the basis of a context-aware collaborative filtering technique to extract API usages from OSS projects. In this work, we show the suitability of FOCUS for Android programming by evaluating it on a dataset of 2,600 mobile apps. The empirical evaluation results show that our approach outperforms two state-of-the-art API recommenders, UP-Miner and PAM, in terms of prediction accuracy. We also point out that there is no significant relationship between the categories for apps defined in Google Play and their API usages. Finally, we show that participants of a user study positively perceive the API and source code recommended by FOCUS as relevant to the current development context.


“…A study conducted by [38] stated that the cluster sampling technique is applied to get a sample from Java software that consists of a similar system and to display the differences between the clusters. The software is grouping into the cluster using the CrossSim algorithm to observe the similarities.…”

Section: Cluster Sampling (mentioning)

confidence: 99%

Data Clutter Reduction in Sampling Technique

Jamalludin, Idrus, Idrus et al. 2022

IJACSA


Visualization is a process of converting data into its visual form as such data patterns can be extracted from the data. Data patterns are knowledge hidden behind the data. However, when data is big, it tends to overlap and clutter on visualization which distorts the data patterns. Data is overly crowded on visualization thus, it has become a challenge to extract knowledge patterns. Besides, big data is costly to visualize because it requires expensive hardware facilities due to its size. Moreover, it is timely to plot the data since it takes time for data to render on visualizations. Due to those reasons, there is a need to reduce the size of big datasets and at the same time maintain the data patterns. There are many methods of data reduction, which are preprocessing operations, dimension reduction, compression, network theory, redundancy elimination, data mining, machine learning, data filtering and sampling techniques. However, the commonly used data reduction technique is sampling technique that derives samples from data populations. Thus, sampling technique is chosen as a study for data reduction in this paper. However, the studies are scattered and are not discussed in a single paper. Consequently, the objective of this paper is to collect them in a single paper for further analysis in order to understand them in great detail. To achieve the objective, three interdisciplinary databases which are ACM Digital Library, IEEE Explore and Science Direct have been selected. From the database, a total of 48 studies have been extracted and they are from the years 2017 to 2021. Other than sampling techniques, this paper also seeks information on big data, data visualization, data clutter, and data reduction.
