Efficient Pairwise Document Similarity Computation in Big Datasets

Niyigena, Papias; Zhang, Zuping; Li, Weiqi; Long, Jun

doi:10.14257/ijdta.2015.8.4.07

Cited by 12 publications

(3 citation statements)

References 27 publications

“…Each row represents tokens or words extracted from each document, and every column represents a document. Each row in the matrix corresponds to a score, and every column corresponds to the amount of similarity between documents [31,32]. In the case of larger numbers of documents, such as the requirements documents for several years of a multinational company, the computations will be extremely tedious.…”

Section: Lexical-based Similarity (mentioning)

confidence: 99%
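
The matrix layout in the statement above (rows for tokens, columns for documents) and the cost of comparing every pair of documents can be illustrated with a minimal sketch. It assumes whitespace tokenization, raw term counts, and cosine similarity; the function names `term_document_matrix` and `pairwise_cosine` are hypothetical and come from neither the cited paper nor the citing one.

```python
import math
from collections import Counter


def term_document_matrix(docs):
    """Build a matrix whose rows are vocabulary terms and whose columns are documents.

    Assumption: lowercase whitespace tokenization and raw counts as cell values.
    """
    counts = [Counter(doc.lower().split()) for doc in docs]
    vocab = sorted(set().union(*counts))
    matrix = [[c[term] for c in counts] for term in vocab]
    return vocab, matrix


def pairwise_cosine(matrix):
    """Return an n_docs x n_docs matrix of cosine similarities between columns."""
    n_docs = len(matrix[0]) if matrix else 0
    cols = [[row[j] for row in matrix] for j in range(n_docs)]
    norms = [math.sqrt(sum(v * v for v in col)) for col in cols]
    sims = [[0.0] * n_docs for _ in range(n_docs)]
    # Every unordered pair is visited once, so the work grows quadratically
    # with the number of documents.
    for i in range(n_docs):
        for j in range(i, n_docs):
            dot = sum(a * b for a, b in zip(cols[i], cols[j]))
            denom = norms[i] * norms[j]
            sims[i][j] = sims[j][i] = dot / denom if denom else 0.0
    return sims


docs = [
    "the system shall log errors",
    "the system shall log warnings",
    "users can reset passwords",
]
vocab, matrix = term_document_matrix(docs)
sims = pairwise_cosine(matrix)
```

The nested loop is what makes the brute-force approach expensive on large collections: n documents require n(n-1)/2 comparisons.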

Assessing Similarity between Software Requirements: A Semantic Approach

Ahmad¹,

Faisal²

2023

IJIEEB

Research shows that the majority of projects fail to achieve their intended objectives. This can arise for a number of reasons, such as problems in managing requirements, excessive documentation of the code, or the difficulty of delivering software that includes all the requested features on time. Such failure rates could be reduced by establishing proper requirements management and a practice of reuse. The correct requirements can be identified by checking the similarity between the requirements received from the various stakeholders. A reusable software component can yield substantial savings in both time and money, but deciding whether to reuse a given component can be challenging. Comparing the requirements of a new project with those of previous projects, either before the project starts or at a later stage of development, helps identify reusable components. This paper proposes a framework (ReSim) for identifying similarities between software requirements, in an attempt to improve reusability and identify the correct requirements. A crucial component of ReSim is measuring the similarity between software requirements. Researchers have used various well-known techniques to evaluate this similarity, including the Dice, Jaccard, and cosine coefficients. In this paper, we use a recently developed hybrid method that considers not only semantic information (lexical databases, word embeddings, and corpus statistics) but also implied word order information, and that has produced significant improvements in measuring semantic similarity between words and sentences. The experiments use the PURE dataset to demonstrate the efficacy of the proposed framework. The results show that the recently developed hybrid method measures requirements similarity more accurately than Dice, Jaccard, and cosine, while cosine is a better choice than Dice, and Jaccard is more accurate than Dice. Thus, ReSim outperforms existing approaches when tested on the PURE dataset, providing the most accurate results for both functional and non-functional requirements.
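
For reference, the three baseline coefficients named in the abstract can be sketched over token sets as follows. This is only an illustration under stated assumptions: whitespace tokenization, set-based (binary) forms of each coefficient, and hypothetical function names. It does not reproduce ReSim or the hybrid semantic method the abstract describes.

```python
import math


def tokens(text):
    """Assumed tokenization: lowercase and split on whitespace."""
    return set(text.lower().split())


def jaccard(a, b):
    """|A intersect B| / |A union B|."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def dice(a, b):
    """2 * |A intersect B| / (|A| + |B|)."""
    total = len(a) + len(b)
    return 2 * len(a & b) / total if total else 0.0


def cosine(a, b):
    """Set-based cosine: |A intersect B| / sqrt(|A| * |B|)."""
    denom = math.sqrt(len(a) * len(b))
    return len(a & b) / denom if denom else 0.0


req_a = tokens("the system shall log errors")
req_b = tokens("the system shall log warnings")
scores = {
    "jaccard": jaccard(req_a, req_b),
    "dice": dice(req_a, req_b),
    "cosine": cosine(req_a, req_b),
}
```

All three coefficients depend only on token overlap, so two requirements that express the same idea in different words score low. That limitation is the motivation the abstract gives for a semantic method.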

Section: Lexical-based Similarity (mentioning)

confidence: 99%

“…An extensive set of text used for research purposes is referred to as a corpus. The use of semantic similarity in query answer systems helps users to find what they are looking for regardless of how the characters are written [29][30][31]. Similarity is also measured using ontologies.…”

Section: Semantic-based Similarity (mentioning)

confidence: 99%

Assessing Similarity between Software Requirements: A Semantic Approach

Ahmad¹,

Faisal²

2023

IJIEEB

“…Niyigena et al [17], have presented a new method to compute the pairwise document similarity in a corpus in order to reduce the time execution and save space execution resources. Their algorithm provided an efficient solution for pairwise documents similarity in a corpus.…”

Section: Related Work (mentioning)

confidence: 99%