Abstract-Programmers search for code frequently utilizing syntactic queries. The effectiveness of this type of search depends on the ability of a programmer to specify a query that captures how the desired code may have been implemented, and the results often include many irrelevant matches that must be filtered manually. More semantic search approaches could address these limitations, yet the existing approaches either do not scale or require for the programmer to define complex queries. Instead, our approach to semantic search requires for the programmer to write lightweight, incomplete specifications, such as an example input and expected output of a desired function. Unlike existing approaches to semantic search, we use an SMT solver to identify programs in a repository, encoded as constraints, that match the programmer-provided specification. We instantiate the approach on subsets of the Java string library, Yahoo! Pipes mashup language, and SQL select statements, and begin to assess its effectiveness and efficiency through evaluations in each domain.
I. INTRODUCTIONToday, searching for code is a regular activity for most programmers [21]. Yet, the mechanisms to support this activity have barely evolved in the last decade, and the limitations are becoming more apparent as code repositories get richer and programmers' expertise and needs more diverse.Consider a novice Java programmer who is trying to find a snippet of code that extracts an alias from an e-mail address. The programmer turns to Google (like many others [21]) and issues a search query with the following keywords: extract alias from e-mail address in Java. As expected, the query returns millions of results. None of the top ten results (a typical IR measure to assess the precision of search engine results [5]), even provide a method for decomposing an e-mail address into parts, which is the first step towards extracting the alias. Now, if the programmer is knowledgable enough about the domain to refine the query with the term substring, then the top ten results include two relevant solutions. This illustrates what occurs in practice, where programmers must sift through many irrelevant results, especially when the desired behavior cannot be tied to source code syntax or documentation.Our work targets this limitation. The general idea is that programmers provide concrete behavioral specifications as inputs and outputs and an SMT solver identifies available source code, encoded as constraints, that matches the specifications.For example, when searching for a program that extracts the alias from an e-mail address, the input could be the string "susie@mail.com" and the output the string "susie". This form of query, while more costly than a keyword query, lets the programmer specify the desired behavior, without the need to know how to achieve a certain outcome, just what that outcome is.