TIDIER: an identifier splitting approach using speech recognition techniques

Guerrouj, Latifa; Penta, Massimiliano Di; Antoniol, Giuliano; Guéhéneuc, Yann‐Gaël

doi:10.1002/smr.539

Cited by 40 publications

(57 citation statements)

References 30 publications

“…State-of-the-art approaches to split identifiers into separate words are the CamelCase splitter, the Samurai approach proposed by Enslen et al [11], and the recent TIDIER approach [14].…”

Section: B Background On Identifier Splitting Technique (mentioning)

confidence: 99%

“…3) TIDIER: Term IDentifier RecognizER TIDIER [14] is a novel approach to split program identifiers using high-level and domain concepts captured into multiple dictionaries. The approach is based on a thesaurus of words and abbreviations and uses a modified string-edit distance [20] between terms and words as a proxy for the distance between the terms and the concepts they represent.…”

Section: ) Camelcase Splitting Technique (mentioning)

confidence: 99%
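
The statement above describes matching identifier terms to dictionary words by string-edit distance. A minimal sketch of that idea follows, assuming plain Levenshtein distance as a stand-in: TIDIER's modified distance is not specified here, and the names `edit_distance`, `closest_word`, and the sample dictionary are hypothetical.

```python
def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance: insert, delete, substitute each cost 1.

    Assumption: this is the textbook distance, not TIDIER's modified variant.
    """
    prev = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        curr = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def closest_word(term: str, dictionary: list[str]) -> str:
    """Return the dictionary word nearest to `term` (ties broken alphabetically)."""
    return min(dictionary, key=lambda w: (edit_distance(term.lower(), w), w))


# Hypothetical dictionary, for illustration only.
SAMPLE_DICTIONARY = ["counter", "pointer", "number"]
print(closest_word("cntr", SAMPLE_DICTIONARY))
```

Here the abbreviation `cntr` maps to `counter` because only three insertions separate them; a real dictionary would need domain words and known abbreviations to be useful.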

“…Identifier splitting is one of the essential ingredients in any feature location or traceability recovery technique [1,8,21,24,27,29], since it helps disambiguate conceptual information encoded in compound (or abbreviated) identifiers. The widely adopted approach is based on the CamelCase splitting algorithm, with more sophisticated strategies, such as Samurai [11] and TIDIER [14], recently proposed in the literature.…”

Section: Introduction (mentioning)

confidence: 99%
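
The CamelCase splitting baseline named in these statements can be sketched as follows. This is one common regex-based formulation, not the exact splitter used in the citing study; the function name `camel_case_split` is hypothetical.

```python
import re

# One token is either: a run of capitals not followed by a lowercase letter
# (an acronym such as "XML"), an optionally capitalised lowercase word,
# or a run of digits. Underscores and other separators are skipped.
_TOKEN = re.compile(r"[A-Z]+(?![a-z])|[A-Z]?[a-z]+|[0-9]+")


def camel_case_split(identifier: str) -> list[str]:
    """Split an identifier on case changes, digits, and separators."""
    return _TOKEN.findall(identifier)


print(camel_case_split("getUserName"))   # ['get', 'User', 'Name']
print(camel_case_split("parseXMLFile"))  # ['parse', 'XML', 'File']
print(camel_case_split("max_value2"))    # ['max', 'value', '2']
```

A splitter of this kind relies entirely on explicit markers, so an identifier such as `filename` comes back unsplit, which is the gap the dictionary-driven approaches aim to close.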

Can Better Identifier Splitting Techniques Help Feature Location?

Dit, Guerrouj, Poshyvanyk, et al. 2011

2011 IEEE 19th International Conference on Program Comprehension

Self Cite

The paper presents an exploratory study of two feature location techniques utilizing three strategies for splitting identifiers: CamelCase, Samurai and manual splitting of identifiers. The main research question that we ask in this study is if we had a perfect technique for splitting identifiers, would it still help improve accuracy of feature location techniques applied in different scenarios and settings? In order to answer this research question we investigate two feature location techniques, one based on Information Retrieval and the other one based on the combination of Information Retrieval and dynamic analysis, for locating bugs and features using various configurations of preprocessing strategies on two open-source systems, Rhino and jEdit. The results of an extensive empirical evaluation reveal that feature location techniques using Information Retrieval can benefit from better preprocessing algorithms in some cases, and that their improvement in effectiveness while using manual splitting over state-of-the-art approaches is statistically significant in those cases. However, the results for feature location technique using the combination of Information Retrieval and dynamic analysis do not show any improvement while using manual splitting, indicating that any preprocessing technique will suffice if execution data is available. Overall, our findings outline potential benefits of putting additional research efforts into defining more sophisticated source code preprocessing techniques as they can still be useful in situations where execution information cannot be easily collected.

“…TIDIER (Madani et al, 2010;Guerrouj et al, 2011) is another approach for identifiers splitting. This algorithm is based in the Dynamic Time Warping algorithm, initially devised to compute distances in the context of speech recognition.…”

Section: Related Work (mentioning)

confidence: 99%
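
For readers unfamiliar with Dynamic Time Warping, the textbook recurrence is sketched below. This is the generic algorithm from speech recognition, not TIDIER's adaptation of it to identifiers; the function name `dtw_distance` and the default cost function are assumptions made for illustration.

```python
from typing import Callable, Sequence


def dtw_distance(x: Sequence, y: Sequence,
                 cost: Callable = lambda a, b: abs(a - b)) -> float:
    """Textbook DTW: cheapest monotone alignment of x with y.

    Each element of one sequence may be matched to one or more consecutive
    elements of the other, which is what lets DTW absorb local stretching.
    """
    inf = float("inf")
    n, m = len(x), len(y)
    # d[i][j] = cost of the best alignment of x[:i] with y[:j]
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            step = cost(x[i - 1], y[j - 1])
            d[i][j] = step + min(d[i - 1][j],      # x[i-1] repeats a match
                                 d[i][j - 1],      # y[j-1] repeats a match
                                 d[i - 1][j - 1])  # both advance
    return d[n][m]


# A stretched copy of the same signal aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))
```

Passing a character-level cost (for example, 0 for equal characters and 1 otherwise) lets the same recurrence compare strings of different lengths, which is the intuition behind applying it to abbreviated identifier terms.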

“…Typically, many combinations of techniques are used. This is not the case for other languages, like Java for example, where there is a more traditional habit to use CamelCase for example Guerrouj et al (2011). Another relevant detail about these packages is they are quite old, and different programmers have changed the code, increasing the heterogeneity of ways to create identifiers (either by composition or abbreviation).…”

Section: Experimental Validation (mentioning)

confidence: 99%

From source code identifiers to natural language terms

Carvalho, Almeida, Henriques, et al. 2015

Journal of Systems and Software

Program comprehension techniques often explore program identifiers to infer knowledge about programs. The relevance of source code identifiers as a source of information about programs is already established in the literature, as is their direct impact on future comprehension tasks. Most programming languages enforce some constraints on identifier strings (e.g., white spaces or commas are not allowed). Also, programmers often use word combinations and abbreviations to devise strings that represent single, or multiple, domain concepts in order to increase programming linguistic efficiency (convey more semantics while writing less). These strings do not always use explicit marks to distinguish the terms used (e.g., CamelCase or underscores), so techniques often referred to as hard splitting are not enough. This paper introduces Lingua::IdSplitter, a dictionary-based algorithm for splitting and expanding strings that compose multi-term identifiers. It explores the use of general programming and abbreviations dictionaries, but also a custom dictionary automatically generated from software natural language content, prone to include application domain terms and specific abbreviations. This approach was applied to two software packages, written in C, achieving an f-measure of around 90% for correctly splitting and expanding identifiers. A comparison with current state-of-the-art approaches is also presented.
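
To illustrate what dictionary-based splitting means when an identifier carries no explicit markers, here is a minimal sketch. It is a generic fewest-words segmentation by dynamic programming, not the Lingua::IdSplitter algorithm, and it does not expand abbreviations; the function name `split_by_dictionary` and the sample dictionary are hypothetical.

```python
from typing import Optional


def split_by_dictionary(identifier: str,
                        dictionary: set[str]) -> Optional[list[str]]:
    """Segment `identifier` into the fewest dictionary words, or None.

    Assumption: "fewest words" is an illustrative scoring rule only.
    """
    s = identifier.lower()
    # best[i] holds the best segmentation found for the prefix s[:i]
    best: list[Optional[list[str]]] = [None] * (len(s) + 1)
    best[0] = []
    for end in range(1, len(s) + 1):
        for start in range(end):
            piece = s[start:end]
            if best[start] is not None and piece in dictionary:
                candidate = best[start] + [piece]
                if best[end] is None or len(candidate) < len(best[end]):
                    best[end] = candidate
    return best[len(s)]


# Hypothetical dictionary, for illustration only.
WORDS = {"file", "name", "size"}
print(split_by_dictionary("filenamesize", WORDS))
```

An identifier that cannot be covered by dictionary words yields `None`, which is where abbreviation dictionaries or a custom domain dictionary, as the abstract describes, would come into play.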

Improving IR‐based traceability recovery via noun‐based indexing of software artifacts

Capobianco, Lucia, Oliveto, et al. 2012

J Software Evolu Process

One of the most successful applications of textual analysis in software engineering is the use of information retrieval (IR) methods to reconstruct traceability links between software artifacts. Unfortunately, because of the limitations of both the humans developing artifacts and the IR techniques any IR-based traceability recovery method fails to retrieve some of the correct links, while on the other hand it also retrieves links that are not correct. This limitation has posed challenges for researchers that have proposed several methods to improve the accuracy of IR-based traceability recovery methods by removing the 'noise' in the textual content of software artifacts (e.g., by removing common words or increasing the importance of critical terms). In this paper, we propose a heuristic to remove the 'noise' taking into account the linguistic nature of words in the software artifacts. In particular, the language used in software documents can be classified as a technical language, where the words that provide more indication on the semantics of a document are the nouns. The results of a case study conducted on five software artifact repositories indicate that characterizing the context of software artifacts considering only nouns significantly improves the accuracy of IR-based traceability recovery methods.
