In model-based software estimation, using the right training data is a key contributor to accurate predictions, which are crucial for the success of software projects. This study investigates the use of duration-based windows and estimation by analogy to calibrate COCOMO and assesses their estimation performance. We compare these approaches with the use of all available historical data, using the COCOMO data set of 341 projects and the NASA data set of 93 projects. The results show that the data sets contain timing information that affects estimation accuracy. Given sufficient data for calibration, using recently completed projects within short durations generates more accurate estimates than retaining all historical data or using k-nearest neighbors based on estimation by analogy. More training data spanning a long period of time may not lead to improved estimation accuracy. This study offers evidence to support the use of projects completed within recent years for training estimation models.
KEYWORDS
COCOMO, duration-based windows, estimation by analogy, k-nearest neighbors, moving windows, software estimation
1 | INTRODUCTION
Cost and effort estimation is a key activity in software project management that can affect the outcome of software projects. Inaccurate cost estimates can lead to proposal rejection, financial losses, project management problems, and overall project failure [1-3]. Thus, the software engineering research community has introduced and evaluated many approaches to making reliable cost, effort, and other related predictions [4]. The most popular approach is model-based estimation, which uses algorithms and historical data to compute estimates [5]. Effort estimation models are often built or calibrated on past projects in an organization to compute the effort estimates of new projects. Thus, the performance of such models depends heavily on the relevance of the training data. One important question is whether legacy data from past projects is useful for training estimation models.

Existing studies have proposed chronology-based approaches to splitting training data in order to investigate this challenging question [6-8]. These studies assume that, given a project p to be estimated, the model used to estimate p is built from data on projects completed before the start of p. Kitchenham et al. [7], one of the first studies to split training data chronologically, suggested using the 30 most recent projects, instead of all historical data in an organization, to train and build regression models for estimation. Song et al. [9] and Minku and Yao [10] investigated estimation methods that fetch one project at a time chronologically to build models, showing that the parameters of the best resulting models change over time.

Lokan and Mendes [6] investigated a chronological splitting method called moving windows, in which estimation models are built using windows of the n most recent projects (fixed-size windows). Lokan and Mendes [11] studied the effects of using windows of all projects completed within periods immediat...