Thesaurus-based, code-related, and software-specific query expansion techniques are the main contributions to free-form query search. However, these techniques still cannot place the most relevant query result in the first position, because they cannot infer the expansion words that represent the user's needs from a given query. In this paper, we observe that code changes can imply what users want, and we propose a novel query expansion technique with code changes (QECC). QECC exploits (changes, contexts) pairs from changed methods. Through statistical learning on these pairs, it infers the code changes for a given query. It then expands the query with those code changes and recommends query results that meet the user's actual needs. In addition, we implement InstaRec to perform QECC and evaluate it with 195 039 change commits from GitHub and our code tracker. The results show that QECC improves the precision of 3 code search algorithms (ie, IR, Portfolio, and VF) by up to 52% to 62% and outperforms the state-of-the-art query expansion techniques (ie, query expansion based on crowd knowledge and CodeHow) by 13% to 16% when the top 1 result is inspected.
KEYWORDS
code changes, code search, information retrieval, software reuse, statistical learning, query expansion
INTRODUCTION
As code repositories (eg, CodePlex,* GitHub,† and SourceForge‡) have become available,1 code search has become a common activity during software development.2,3 In particular, users are more interested in free-form query search, which allows them to type natural language keywords to define queries.4 The performance of this search depends strongly on word matches between queries and query results. However, queries and query results often do not use the same words.5 Moreover, queries are usually short. Sadowski et al reported that the average number of words per query is 1.85 for the queries submitted to Google search.6 Formulating a good query is therefore not an easy task, which motivates query expansion techniques.7,8 Earlier, WordNet9 was used to reformulate a query with synonyms from a word thesaurus. However, Lu et al10 showed that the general English-based similarity measurements of WordNet could not effectively suggest similar words