Detecting similar code between two systems has various applications, such as comparing two software variants or versions or finding potential license violations. Techniques for detecting suspiciously similar code must scale to very large code corpora in terms of the resources needed, and they must have high precision because a human needs to inspect the results. This paper demonstrates how suffix trees can be used to obtain a scalable comparison. The evaluation is carried out on very large code corpora. It shows that our approach is faster than index-based techniques when the analysis is run only once. If the analysis is to be conducted multiple times, creating an index pays off. We report how much code can be filtered out from the analysis using an index-based filter. In addition, this paper proposes a method to improve precision through user feedback. A user validates a sample of the clone candidates found. An automated data mining technique then learns a decision tree on the basis of the user decisions and different code metrics. We investigate the relevance of several metrics and whether criteria learned from one application domain can be generalized to other domains.

All of the aforementioned variants of clone detection face challenges with respect to detection quality and scalability. Detection quality requires high recall and high precision in finding the relevant code. Relevance depends on the use case. In particular, inter-system and intra-system clone detection need to deal with recurring code that is similar from a lexical or syntactic point of view but is not interesting for the given task. Frequent examples of such irrelevant similar code are import statement lists, array initializers, setter/getter sequences, and sequences of pure declarations or simple assignments.

Another challenge is scalability.
Whereas intra-system clone detection searches only within one system, inter-system clone search may face a much larger code base, often larger by orders of magnitude. Fragment search may face the same problem when the code is searched in very large software repositories [3,4].

Several researchers have recently proposed index-based code search to address scalability for the search in very large code bases [13,3,4,17,18]. Index-based techniques first create an index against which the code of a subject system is later compared. The purpose of the index is to identify the code that has a chance of being similar. The code filtered out by the index is not compared. An index hit serves as a first seed of a similar code fragment. This seed is then extended by merging it with neighboring similar code fragments [13,3,4].

Creating the index can be expensive. The idea is to invest upfront in an index that is created only once but whose cost is amortized over multiple subsequent searches.

Contributions. Our conference paper introduced a way to extend traditional suffix-tree-based clone detection for inter-system clone search that scales for very large programs [19]. This approach avoids the nee...