The duration of a software project is a very important feature, closely related to its cost. Various methods and models have been proposed in order to predict not only the cost of a software project but also its duration. Since duration is essentially the random length of a time interval from a starting to a terminating event, in this paper we present a framework of statistical tools, appropriate for studying and modeling the distribution of the duration. The idea for our approach comes from the parallelism of duration to the life of an entity which is frequently studied in biostatistics by a certain statistical methodology known as survival analysis. This type of analysis offers great flexibility in modeling the duration and in computing various statistics useful for inference and estimation. As in any other statistical methodology, the approach is based on datasets of measurements on projects. However, one of the most important advantages is that we can use in our data information not only from completed projects, but also from ongoing projects. In this paper we present the general principles of the methodology for a comprehensive duration analysis and we also illustrate it with applications to known data sets. The analysis showed that duration is affected by various factors such as customer participation, use of tools, software logical complexity, user requirements volatility and staff tool skills.
Software Cost Estimation is a critical phase in the development of a software project, and over the years has become an emerging research area. A common problem in building software cost models is that the available datasets contain projects with lots of missing categorical data. The purpose of this chapter is to show how a combination of modern statistical and computational techniques can be used to compare the effect of missing data techniques on the accuracy of cost estimation. Specifically, a recently proposed missing data technique, the multinomial logistic regression, is evaluated and compared with four older methods: listwise deletion, mean imputation, expectation maximization and regression imputation with respect to their effect on the prediction accuracy of a least squares regression cost model. The evaluation is based on various expressions of the prediction error and the comparisons are conducted using statistical tests, resampling techniques and a visualization tool, the regression error characteristic curves.
scite is a Brooklyn-based organization that helps researchers better discover and understand research articles through Smart Citations–citations that display the context of the citation and describe whether the article provides supporting or contrasting evidence. scite is used by students and researchers from around the world and is funded in part by the National Science Foundation and the National Institute on Drug Abuse of the National Institutes of Health.