2013
DOI: 10.1002/minf.201300006

Structural Key Bit Occurrence Frequencies and Dependencies in PubChem and Their Effect on Similarity Searches

Abstract: Little published literature exists on the 881-bit structural keys used by PubChem for categorizing and comparing the compounds present in its database. We characterized these structural keys by examining their frequencies of occurrence within the PubChem compound database. In addition, bit dependencies, defined as the universal presence of a bit given the presence of another, were determined. We show that the vast majority of bits are rarely set and that substantial numbers of dependencies exist. A comparison …
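
The two quantities named in the abstract, bit occurrence frequency and bit dependency, can be illustrated with a minimal Python sketch. This is not the authors' code: the function names `bit_frequencies` and `bit_dependencies`, the representation of each fingerprint as a set of on-bit indices, and the four-bit toy data are all assumptions made for illustration, and nothing here reproduces the actual 881-bit PubChem keys. A dependency `(a, b)` is read as "bit `b` is set in every fingerprint that sets bit `a`", following the definition given in the abstract.

```python
def bit_frequencies(fingerprints, n_bits):
    """Fraction of fingerprints in which each bit position is set."""
    counts = [0] * n_bits
    for fp in fingerprints:
        for bit in fp:
            counts[bit] += 1
    total = len(fingerprints)
    return [c / total for c in counts]


def bit_dependencies(fingerprints):
    """Pairs (a, b), a != b, where b is set in every fingerprint that sets a."""
    # For each bit, keep the intersection of all fingerprints that contain it.
    implied = {}
    for fp in fingerprints:
        for bit in fp:
            if bit in implied:
                implied[bit] &= fp
            else:
                implied[bit] = set(fp)
    return sorted((a, b) for a, others in implied.items() for b in others if b != a)


# Hypothetical toy data: four fingerprints over four bit positions.
toy_fingerprints = [{0, 1}, {0, 1, 2}, {1, 3}, {1}]

print(bit_frequencies(toy_fingerprints, 4))  # [0.5, 1.0, 0.25, 0.25]
print(bit_dependencies(toy_fingerprints))    # [(0, 1), (2, 0), (2, 1), (3, 1)]
```

In the toy data, bit 1 is set in every fingerprint, so every other bit depends on it, while bit 1 itself implies nothing. Note that dependencies are directional: `(0, 1)` holds but `(1, 0)` does not.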

Citations: cited by 2 publications (3 citation statements)
References: 15 publications
“…Chlorinated compounds will then be represented at a rate greater than expected, where the fingerprint uniqueness requirement is not present. Simulations similar to the one presented can be used to estimate the magnitude of the effect for cases involving multiple bits, as well as those involving bit dependencies such as those previously described [11].…”
Section: Results (citation type: mentioning)
confidence: 96%
“…The NBC approach has a firm theoretical basis for the design of the fragment weights, but involves the assumption that the fragment occurrences are statistically independent, an assumption that is known to be incorrect [30,31]. It would be possible to try to relax this assumption, as has been done when the Robertson-Sparck Jones weights (such as R4) have been used in information retrieval [32].…”
Section: Discussion (citation type: mentioning)
confidence: 99%
“…Both approaches have their limitations. The NBC approach has a firm theoretical basis for the design of the fragment weights but involves the assumption that the fragment occurrences are statistically independent, an assumption that is known to be incorrect. It would be possible to try to relax this assumption, as has been done when the Robertson–Spärck Jones weights (such as R4) have been used in information retrieval. However, the resulting weights are far more complex in nature and have not proved to be any more effective in retrieving relevant documents than the basic approach that assumes independence, and there hence seems little reason to believe that things would be any different in the present context.…”
Section: Discussion (citation type: mentioning)
confidence: 99%