Identifying bot activity in GitHub pull request and issue comments

Golzadeh, Mehdi; Decan, Alexandre; Constantinou, Eleni; Mens, Tom

doi:10.1109/botse52550.2021.00012

Cited by 18 publications

(12 citation statements)

References 12 publications

“…However, several studies are dedicated to automatically identifying bot accounts in OSS repositories, [cf. 14,73,74]. Future studies can select these bot detectors to clean their dataset before training and testing their commit message generation models.…”

Section: Removing Irrelevant Commits (mentioning)

confidence: 99%

Automatic Commit Message Generation: A Critical Review and Directions for Future Work

Zhang, Qiu, Stol et al. 2024

IEEE Trans. Software Eng.

Commit messages are critical for code comprehension and software maintenance. Writing a high-quality message requires skill and effort. To support developers and reduce their effort on this task, several approaches have been proposed to automatically generate commit messages. Despite the promising performance reported, we have identified three significant and prevalent threats in these automated approaches: 1) the datasets used to train and evaluate these approaches contain a considerable amount of 'noise'; 2) current approaches only consider commits of a limited diff size; and 3) current approaches can only generate the subject of a commit message, not the message body. The first limitation may let the models 'learn' inappropriate messages in the training stage, and also lead to inflated performance results in their evaluation. The other two threats can considerably weaken the practical usability of these approaches. Further, with the rapid emergence of large language models (LLMs) that show superior performance in many software engineering tasks, it is worth asking: can LLMs address the challenge of long diffs and whole message generation? This article first reports the results of an empirical study to assess the impact of these three threats on the performance of the state-of-the-art auto generators of commit messages. We collected commit data of the Top 1,000 most-starred Java projects in GitHub and systematically removed noisy commits with bot-submitted and meaningless messages. We then compared the performance of four approaches representative of the state-of-the-art before and after the removal of noisy messages, or with different lengths of commit diffs. We also conducted a qualitative survey with developers to investigate their perspectives on simply generating message subjects. Finally, we evaluate the performance of two representative LLMs, namely UniXcoder and ChatGPT, in generating more practical commit messages. 
The results demonstrate that generating commit messages is of great practical value, considerable work is needed to mature the current state-of-the-art, and LLMs can be an avenue worth trying to address the current limitations. Our analyses provide insights for future work to achieve better performance in practice.

“…We found that if there exist comments from others (other_comment), e.g., endusers or external developers, the pull request is more likely to be merged (Section 3.1.1). Different from Golzadeh et al [61], we validated on a much larger dataset and consider different kinds of projects instead of just Cargo ecosystem.…”

Section: Findings In Different Contexts (mentioning)

confidence: 99%

Pull Request Decisions Explained: An Empirical Overview

Zhang, Gousios et al. 2023

IEEE Trans. Software Eng.

The pull-based development model is widely used in open source projects, leading to the emergence of trends in distributed software development. One aspect that has garnered significant attention concerning pull request decisions is the identification of explanatory factors. Objective: This study builds on a decade of research on pull request decisions and provides further insights. We empirically investigate how factors influence pull request decisions and the scenarios that change the influence of such factors. Method: We identify factors influencing pull request decisions on GitHub through a systematic literature review and infer them by mining archival data. We collect a total of 3,347,937 pull requests with 95 features from 11,230 diverse projects on GitHub. Using these data, we explore the relations among the factors and build mixed effects logistic regression models to empirically explain pull request decisions. Results: Our study shows that a small number of factors explain pull request decisions, with that concerning whether the integrator is the same as or different from the submitter being the most important factor. We also note that the influence of factors on pull request decisions change with a change in context; e.g., the area hotness of pull request is important only in the early stage of project development, however it becomes unimportant for pull request decisions as projects become mature.

“…In a qualitative study limited to two OSS projects, it was found that the common, most frequent reason for rejection is unnecessary functionality [117]. In a quantitative study of 4.8K GitHub repositories and 1M comments, it was found that there are proportionally more comments, participants and comment exchanges in rejected than in accepted pull requests [114]. Another aspect of decision-making in code reviews is multi-tasking.…”

Section: MCR Themes and Contributions (mentioning)

confidence: 99%

Modern Code Reviews—Survey of Literature and Practice

Badampudi, Unterkalmsteiner, Britto 2023

ACM Trans. Softw. Eng. Methodol.

Background: Modern Code Review (MCR) is a lightweight alternative to traditional code inspections. While secondary studies on MCR exist; it is unknown whether the research community has targeted themes that practitioners consider important. Objectives: The objectives are to provide an overview of MCR research, analyze the practitioners’ opinions on the importance of MCR research, investigate the alignment between research and practice, and propose future MCR research avenues. Method: We conducted a systematic mapping study to survey state-of-the-art until and including 2021, employed the Q-Methodology to analyze the practitioners’ perception of the relevance of MCR research, and analyzed the primary studies’ research impact. Results: We analyzed 244 primary studies, resulting in five themes. As a result of the 1300 survey data points, we found that the respondents are positive about research investigating the impact of MCR on product quality and MCR process properties. In contrast, they are negative about human factor- and support systems-related research. Conclusion: These results indicate a misalignment between the state-of-the-art and the themes deemed important by most survey respondents. Researchers should focus on solutions that can improve the state of MCR practice. We provide an MCR research agenda, which can potentially increase the impact of MCR research.
