2019
DOI: 10.1093/database/baz064

PubMed Text Similarity Model and its application to curation efforts in the Conserved Domain Database

Abstract: This study proposes a text similarity model to help biocuration efforts of the Conserved Domain Database (CDD). CDD is a curated resource that catalogs annotated multiple sequence alignment models for ancient domains and full-length proteins. These models allow for fast searching and quick identification of conserved motifs in protein sequences via Reverse PSI-BLAST. In addition, CDD curators prepare summaries detailing the function of these conserved domains and specific protein families, based on published p…

Cited by 13 publications (9 citation statements)
References 22 publications
“…Finally, the proposed model achieves the best results with self2self-attention on the three datasets, except for the Spearman correlation coefficient on CDD-ref. The final Pearson correlation coefficient increases to 0.661, approaching the official value of 0.678 [33] on CDD-ref. The better performance of our model on the three corpora may be because the proposed self2self-attention not only helps the model represent sentence semantics more precisely via self-attention within a single sentence, but also enhances the sentence semantic representation through cross self-attention.…”
Section: B. Performance Comparison With Other Existing Methods
confidence: 77%
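The evaluation criteria in the excerpt above — Pearson and Spearman correlation between predicted and gold similarity scores — can be sketched as follows. The score lists here are invented for illustration; the CDD-ref figures of 0.661 and 0.678 are not reproduced by this toy data:

```python
import math

def pearson(x, y):
    """Pearson correlation: covariance of x and y over the product of their std devs."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)

def spearman(x, y):
    """Spearman correlation: Pearson correlation of the ranks (no tie handling)."""
    def ranks(v):
        order = sorted(range(len(v)), key=v.__getitem__)
        r = [0.0] * len(v)
        for rank, i in enumerate(order):
            r[i] = float(rank)
        return r
    return pearson(ranks(x), ranks(y))

# Hypothetical gold vs. predicted similarity scores for five sentence pairs
gold = [0.9, 0.1, 0.5, 0.7, 0.3]
pred = [0.8, 0.2, 0.4, 0.9, 0.1]
```

Pearson rewards predictions that track the gold scores linearly, while Spearman only cares about the ranking, which is why a model can improve one metric without the other, as observed on CDD-ref.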
“…The direct reason may be that the interaction semantic information introduces a slight amount of noise in this corpus. However, on the CDD-ref corpus from the same literature [36], ISA-SNN performs better. Therefore, we compare the two datasets with respect to labeled score distribution, sentence length, and text quality.…”
Section: F. Analysis of CDD-ful/-ref
confidence: 79%
“…This phase is representative of the TF*IDF algorithm. The TF*IDF statistic, short for term frequency times inverse document frequency, can extract keywords from a document by considering both the single document and all documents in the corpus [2] [21]. A term is a promising candidate keyword for a specific document if it shows up relatively often within that document and rarely in the rest of the…”
Section: 1) Representative Algorithm: TF*IDF
confidence: 99%
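The TF*IDF weighting described in the excerpt can be sketched in a few lines. The toy corpus and whitespace-free token lists below are illustrative assumptions, not data from the cited work:

```python
import math
from collections import Counter

def tf_idf(docs):
    """Compute TF*IDF weights for each term in each tokenized document.

    TF = term count / document length; IDF = log(N / document frequency).
    A term frequent in one document but rare across the corpus scores high.
    """
    n = len(docs)
    # Document frequency: in how many documents each term appears
    df = Counter()
    for doc in docs:
        df.update(set(doc))
    weights = []
    for doc in docs:
        tf = Counter(doc)
        length = len(doc)
        weights.append({t: (c / length) * math.log(n / df[t]) for t, c in tf.items()})
    return weights

# Hypothetical corpus: "domain" appears everywhere, so its IDF (and weight) is zero
docs = [["domain", "protein", "protein"], ["domain", "motif"], ["domain", "search"]]
w = tf_idf(docs)
```

A term like "domain" that occurs in every document gets weight zero (log of 1), while "protein", concentrated in a single document, is promoted as a keyword candidate — exactly the behavior the excerpt describes.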
“…Large amounts of textual data can be collected as part of research, such as scientific literature, transcripts in the marketing and economic sectors, speeches in political discourse (e.g., presidential campaigns and inauguration speeches), and meeting transcripts [1]. The PubMed dataset of MEDLINE has also grown enormously [2]. This large amount of textual information has created the problem of finding the level of relevance between documents.…”
Section: Introduction
confidence: 99%