2022
DOI: 10.2478/ijssis-2022-0002

A novel approach to capture the similarity in summarized text using embedded model

Abstract: The presence of near duplicate textual content imposes great challenges while extracting information from it. To handle these challenges, detection of near duplicates is a prime research concern. Existing research mostly uses text clustering, classification and retrieval algorithms for detection of near duplicates. Text summarization, an important tool of text mining, is not explored yet for the detection of near duplicates. Instead of using the whole document, the proposed method uses its summary as it saves …

Citations: cited by 4 publications (4 citation statements)
References: 37 publications
“…In the field of text summarization, text matching techniques are also utilized to identify duplicate text content, facilitating the removal of redundant information from summaries, which is beneficial for text mining purposes. Mishra et al [33] proposed an embedded model to examine the similarity of summary texts, effectively addressing this issue. Long-text QA is also an important challenge.…”
Section: Text Matching
Citation type: mentioning
confidence: 99%
“…After the shingle values of the documents are obtained, Cosine Similarity is used to compute the similarity of two documents based on the vectors of their shingle values. The Cosine Similarity value indicates the percentage similarity of each document, and threshold values are used to determine whether two documents show signs of plagiarism; the ranges are < 60%, 60-70%, 70-80%, and > 80% [17].…”
Section: Figure 3: Development Flowchart of the Similarity Detection Feature for Doc...
Citation type: unclassified
“…The K-Shingling value produced for each document indicates whether a given shingle is found in that document: if the shingle is found, its value is 1, and if it is not found in the document, its value is 0. Once the K-Shingling values of both documents are obtained, document similarity can be measured from the proportion of shingle values found in the pair of text documents, taken from the word vectors defined to represent the documents [17], and these vectors are then compared using Cosine Similarity. Using the formula in "(2)", the following values are obtained:…”
Section: A. Method Implementation Results
Citation type: unclassified
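The two excerpts above describe marking each shingle as present (1) or absent (0) in a document and then comparing the resulting vectors with cosine similarity. The following is only a minimal sketch of that idea, not the citing paper's implementation: word-level shingles, the shingle size k, lowercasing, and all function names are assumptions made for illustration.

```python
import math

def shingles(text, k=2):
    """Return the set of k-word shingles in text (word-level shingling is an assumption)."""
    words = text.lower().split()
    return {" ".join(words[i:i + k]) for i in range(len(words) - k + 1)}

def binary_vectors(text_a, text_b, k=2):
    """Build 0/1 vectors over the union of shingles from both documents."""
    set_a, set_b = shingles(text_a, k), shingles(text_b, k)
    vocabulary = sorted(set_a | set_b)
    vec_a = [1 if s in set_a else 0 for s in vocabulary]
    vec_b = [1 if s in set_b else 0 for s in vocabulary]
    return vec_a, vec_b

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def shingle_similarity(text_a, text_b, k=2):
    """Similarity of two documents from their binary shingle vectors."""
    return cosine_similarity(*binary_vectors(text_a, text_b, k))
```

Multiplying the returned score by 100 gives a percentage that could then be compared against threshold ranges such as those quoted in the excerpt.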
“…The phase of document selection is employed to identify the most pertinent sections in relation to the stipulated learning objective. This is accomplished through a similarity search using cosine similarity, yielding the top n-documents, with n representing a parameter defined by the user (Mishra et al, 2020). This process employs the learning goal sentence, which is vectorized using embeddings.…”
Section: Technical Parts Of the Proposed System
Citation type: mentioning
confidence: 99%
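The excerpt above describes ranking documents by cosine similarity to a vectorized learning-goal sentence and returning the top n. A rough sketch of that ranking step follows; the excerpt does not say which embedding model is used, so the vectors here are assumed to be precomputed, and the function names and the dictionary layout are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = math.sqrt(sum(x * x for x in u))
    norm_v = math.sqrt(sum(y * y for y in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)

def top_n_documents(query_vector, document_vectors, n):
    """Return the n (doc_id, score) pairs most similar to the query, best first.

    document_vectors maps a document id to its precomputed embedding
    (how the embeddings are produced is outside this sketch).
    """
    scored = [
        (doc_id, cosine_similarity(query_vector, vector))
        for doc_id, vector in document_vectors.items()
    ]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:n]
```

Here n plays the role of the user-defined parameter mentioned in the excerpt.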