2019 26th Asia-Pacific Software Engineering Conference (APSEC)
DOI: 10.1109/apsec48747.2019.00050
Automatic Classifying Self-Admitted Technical Debt Using N-Gram IDF

Abstract: Technical Debt (TD) introduces quality problems and increases maintenance cost, since it may require improvements in the future. Several studies show that it is possible to automatically detect TD from source code comments that developers intentionally created, so-called self-admitted technical debt (SATD). Those studies proposed using a binary classification technique to predict whether a comment indicates SATD. However, SATD has different types (e.g. design SATD and requirement SATD). In this paper, we therefore …
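The abstract describes moving from binary SATD detection toward classifying SATD types using N-gram IDF features. The paper's exact pipeline is not reproduced here; as a minimal pure-Python sketch of the underlying idea, IDF weights over multi-word phrases (n-grams) can be computed like this (the toy comment corpus and the `max_n` choice are illustrative assumptions):

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_idf(comments, max_n=2):
    """Compute IDF weights for all 1..max_n grams over a comment corpus.

    IDF = log(total documents / documents containing the gram); rare
    multi-word phrases (e.g. "todo fix") get higher weight than words
    that appear in every comment.
    """
    doc_freq = Counter()
    for comment in comments:
        tokens = comment.lower().split()
        seen = set()
        for n in range(1, max_n + 1):
            seen.update(ngrams(tokens, n))
        doc_freq.update(seen)  # count each gram once per document
    total = len(comments)
    return {g: math.log(total / df) for g, df in doc_freq.items()}

# Toy corpus of code comments (assumption, for illustration only)
comments = [
    "todo fix this later",
    "todo refactor this method",
    "this works fine",
]
idf = ngram_idf(comments)
# "this" appears in all 3 comments -> IDF 0; "todo fix" in 1 -> IDF log(3)
```

Such weights would then feed a (multi-class) classifier over SATD types; the classifier itself is omitted here.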

Cited by 11 publications (7 citation statements)
References 37 publications
“…-Survey more advanced feature engineering in the active learning strategy for finding the rest of SATDs. For example, explore N-gram patterns [72] and word embeddings with deep neural networks [19]. -Explore other sampling techniques to help with unbalanced class data (one of the key characteristics for SATDs [55]).…”
Section: Discussion
confidence: 99%
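The quoted statement also points at sampling techniques for the class imbalance typical of SATD data. One simple, commonly used option (an illustrative sketch, not the cited papers' specific method) is random oversampling of the minority class:

```python
import random
from collections import Counter

def random_oversample(samples, labels, seed=0):
    """Duplicate minority-class examples until every class matches the majority count."""
    rng = random.Random(seed)
    counts = Counter(labels)
    target = max(counts.values())
    out_x, out_y = list(samples), list(labels)
    for cls, n in counts.items():
        pool = [x for x, y in zip(samples, labels) if y == cls]
        for _ in range(target - n):
            out_x.append(rng.choice(pool))  # sample with replacement
            out_y.append(cls)
    return out_x, out_y

# Toy data (assumption): 4 non-SATD comments, 1 SATD comment
x = ["c1", "c2", "c3", "c4", "s1"]
y = [0, 0, 0, 0, 1]
bx, by = random_oversample(x, y)
# After oversampling, both classes have 4 examples
```

Random oversampling is the simplest option; in practice techniques like SMOTE or class-weighted losses are common alternatives for the same problem.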
“…Recently, some studies have explored different feature engineering for identifying SATD, e.g. Wattanakriengkrai et al. [72] applied N-gram IDF as features, and Flisar and Podgorelec [19] explored how feature selection with word embeddings can help the prediction. The latest progress comes from Wang et al. [71]'s HATD and Ren et al. [55]'s tuned CNN, which utilize deep convolutional neural networks to achieve a higher F1 score than all previous solutions.…”
Section: Automatic Labeling
confidence: 99%
“…One of the threats to construct validity in the study concerns the potentially different interpretations of discussed topics between interviewees and researchers. Because we focus on SATD in this study and most …”

SATD sources and related work:
- Code Comments: [6], [7], [12], [14], [15], [38]–[66]
- Issue Trackers: [3], [12], [16]
- Commit Messages: [12]
- Pull Requests: [12]
- Automated Differentiation Between Fixed and Unfixed SATD
- Automated Tracing Between SATD in Different Sources [11], [12], [36], [37] and Code and Related Development Tasks
- Automated SATD Prioritization: [9], [67], …
Section: Threats To Validity 6.1 Construct Validity
confidence: 99%
“…2) BOW (Bag of Words): one way of extracting features from text into numbers by representing textual documents as sparse vectors of word counts [34]. 3) N-Gram: a text preprocessing model that represents text as contiguous sequences of n tokens (or characters), capturing local ordering that BOW discards.…”
Section: Variable Extraction
confidence: 99%
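As a concrete illustration of the BOW representation described in the quote (the toy documents are assumptions), each document can be encoded as a sparse vector of word counts over a shared vocabulary:

```python
from collections import Counter

def bow_vectors(docs):
    """Represent each document as a sparse vector {term_index: count} over a shared vocabulary."""
    vocab = sorted({tok for d in docs for tok in d.lower().split()})
    index = {t: i for i, t in enumerate(vocab)}
    vectors = []
    for d in docs:
        counts = Counter(d.lower().split())
        # Only non-zero entries are stored, hence "sparse"
        vectors.append({index[t]: c for t, c in counts.items()})
    return vocab, vectors

docs = ["todo fix this", "fix fix later"]
vocab, vecs = bow_vectors(docs)
# vocab: ['fix', 'later', 'this', 'todo']
# vecs[1]: {0: 2, 1: 1}  -> "fix" twice, "later" once
```

An n-gram variant would simply index tuples of adjacent tokens instead of single words, recovering some of the word order that plain BOW throws away.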