It is common practice for requirements traceability research to consider method call dependencies within the source code (e.g., fan-in/fan-out analyses). However, current approaches largely ignore the role of data. This paper investigates whether data dependencies relate to requirements in the same way that call dependencies do. For example, if two methods do not call one another but do access the same data, is this information relevant? We formulated several research questions and investigated them on three large software systems, covering about 120 KLOC. We found that data relationships are roughly as relevant as call dependencies to understanding how code relates to requirements traces. Most interestingly, our analyses show that data dependencies complement call dependencies. These findings have strong implications for all forms of code understanding, including trace capture, maintenance, and validation techniques (e.g., information retrieval).
Requirements traceability benefits many software engineering activities, such as change impact analysis and risk assessment. However, these activities require complete and correct traceability links, and establishing such links is not trivial, which makes traceability assessment an important field of study. In recent years, requirements traceability research has focused on using call dependencies within source code to understand how code properties contribute to the implementation of a requirement and to assess whether traceability links are correct and complete. These approaches largely ignore the data dependencies that exist within the source code. That is, methods may never call each other but may still depend on one another by sharing data. We identified five research questions and investigated them on five software systems, covering 4 to 72 KLOC. We found that data dependencies are as relevant as call dependencies for assessing requirements traceability. Even more interestingly, our analyses show that data dependencies complement call dependencies in the assessment. These findings have strong implications for code understanding, including trace capture, maintenance, and validation techniques.
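The distinction between call dependencies and data dependencies can be illustrated with a minimal sketch. The method names, field names, and helper functions below are hypothetical and are not taken from the studied systems or from the authors' tooling; the sketch only shows how two methods that never call each other can still be linked through shared data.

```python
from itertools import combinations

# Hypothetical toy system (an assumption for illustration only):
# which methods each method calls, and which fields each method reads or writes.
calls = {
    "Cart.add_item": {"Cart.recalculate"},
    "Cart.recalculate": set(),
    "Invoice.render": set(),
}
accesses = {
    "Cart.add_item": {"Cart.items"},
    "Cart.recalculate": {"Cart.items", "Cart.total"},
    "Invoice.render": {"Cart.total"},
}


def call_pairs(calls):
    """Return unordered method pairs connected by a direct call."""
    return {
        frozenset((caller, callee))
        for caller, callees in calls.items()
        for callee in callees
        if caller != callee
    }


def data_pairs(accesses):
    """Return unordered method pairs that access at least one common field."""
    pairs = {}
    for a, b in combinations(sorted(accesses), 2):
        shared = accesses[a] & accesses[b]
        if shared:
            pairs[frozenset((a, b))] = shared
    return pairs


def data_only_pairs(calls, accesses):
    """Return pairs linked by shared data but not by any direct call."""
    called = call_pairs(calls)
    return {
        pair: shared
        for pair, shared in data_pairs(accesses).items()
        if pair not in called
    }


if __name__ == "__main__":
    for pair, shared in data_only_pairs(calls, accesses).items():
        print(sorted(pair), "share", sorted(shared))
```

In this toy system, `Cart.recalculate` and `Invoice.render` never call each other, yet both access `Cart.total`, so a call-based analysis alone would miss the link between them.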
The number of software vulnerabilities is increasing year by year. In the era of big data, data-processing software with many users draws more attention from hackers, so it is essential to improve the efficiency of discovering vulnerabilities in such software. We noticed that, in vulnerability discovery, several problems of existing techniques such as fuzzing, symbolic execution, and taint analysis are related, to a greater or lesser degree, to data-processing functions. In fuzzing, the target program applies two types of sanity checks: non-critical checks (NCC) and critical checks (CC). Bypassing such sanity checks is usually challenging, which leads to low code coverage during fuzzing. In symbolic execution, constraint solvers still struggle with the constraints produced by complex algorithms. In taint analysis, over-tainting and under-tainting remain the key factors affecting the accuracy of the results. To address these problems, it is therefore necessary to identify data-processing functions. Once data-processing functions have been identified, we can identify the sanity checks, ease the solving of complex constraints, and understand how taint propagates, all of which assist software vulnerability discovery and analysis. This paper proposes DPFI (data-processing function identification), a method for identifying data-processing functions with deep neural networks. We collected 37000 functions from GitHub and applied the method to this data set with several neural networks, among which the CNN performed best, with an F1-score of 0.90. We then tested the trained model on CGC (Cyber Grand Challenge) data and on real-world software. For CGC, we obtained 448 functions from 20 programs, of which 35 were identified as data-processing functions. For real-world software such as FFmpeg, 7zip, and jpeg, precision reached 0.90 in every case and the F1-score was above 0.87.