A major problem in the design of screening systems for substructure searches of chemical structure files is the development of a methodology for selection of an optimal set of structural characteristics to act as screens. The set chosen for a particular application will depend on the characteristics of the collection, as well as on its size and growth rate. A strategy which takes account of the disparate frequencies of the various species of fragments in a data-base by use of differential, and, in part, hierarchical levels of description is detailed. The distributions of a variety of structural characteristics, including bond-centered, atom-centered, and ring fragments in a 30,000-compound sample of the Chemical Abstracts Service Registry System are summarized. Implementation of the approach, using primarily bond-centered fragments, by means of simple and highly efficient computer programs, is detailed.
The need to provide flexible and economic searches of chemical structure files to fulfil chemists' requirements for substructure searching within more general chemical information systems poses complex problems with interesting implications both practical and theoretical in nature. Many approaches have been advocated,1 embodying a variety of viewpoints. In no respect has opinion been more varied than in the design of screening systems. These entail the selection of structural characteristics on the basis of which an approximate match between queries and potential answers is made. This stage may be followed by a more detailed search involving atom-bond-atom path tracing.
The adequacy of the selection of characteristics on the basis of which the collection is indexed is critical both to the extent to which the system can fulfil the variety of queries addressed to it and to the over-all costs of searching.
The work reported in this paper arose from the conviction that it was essential to develop a general methodology for the design of screening systems, which could then be applied with equal validity to collections differing widely both in size and composition. (The need for such a methodology is borne out by even a cursory examination of the diversity of conventional fragmentation codes,* which generally reflect both of these factors. Thus a system devised for an alkaloid file will place heavy emphasis on ring-system skeletons and on the environments of nitrogen atoms, whereas a code devised for a large collection will, of necessity, be more specific and contain a greater number of characteristics than that for a small file.) In terms of size, therefore, the assumption was made that a greater level of selectivity is required in searches of larger files than in smaller ones; if a constant proportion of structures were retrieved, searches of large files might result in impractical numbers of structures being retrieved. In terms of composition, it was assumed that the queries addressed to a collection would roughly mirror the characteristics of the file; this is again borne out by experience with fragmentation codes,3 and...
A file of structures from the Chemical Abstracts Service Registry System has been analysed to determine the distribution of a number of different types of atom-centred structural fragments. The largest fragments studied were augmented atoms containing up to five atoms and four bonds. The effects which differentiating between cyclic and acyclic bonds and varying the size of fragments have on the fragment distributions have been examined.
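The idea of an augmented atom (a central atom together with its bonds and immediate neighbours) and the effect of distinguishing cyclic from acyclic bonds can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the study: the molecule representation (a list of element symbols plus bond tuples), the fragment notation, the "r" suffix for ring bonds, and the names augmented_atoms and fragment_counts. The sketch does not enforce the limit of five atoms and four bonds.

```python
from collections import Counter

# Hypothetical bond notation: single, double, triple.
BOND_SYMBOL = {1: "-", 2: "=", 3: "#"}


def augmented_atoms(atoms, bonds, distinguish_rings=False):
    """Return one augmented-atom string per atom.

    atoms: list of element symbols, e.g. ["C", "C", "O"]
    bonds: list of (i, j, order, in_ring) tuples
    """
    neighbours = [[] for _ in atoms]
    for i, j, order, in_ring in bonds:
        symbol = BOND_SYMBOL[order]
        if distinguish_rings and in_ring:
            symbol += "r"  # mark cyclic bonds only when asked to
        neighbours[i].append((symbol, atoms[j]))
        neighbours[j].append((symbol, atoms[i]))
    # Sorting the attachments makes the description order-independent.
    return [
        element + "".join(f"({s}{e})" for s, e in sorted(attached))
        for element, attached in zip(atoms, neighbours)
    ]


def fragment_counts(molecules, distinguish_rings):
    """Count augmented atoms over a collection of molecules."""
    counts = Counter()
    for atoms, bonds in molecules:
        counts.update(augmented_atoms(atoms, bonds, distinguish_rings))
    return counts


# Two toy structures (heavy atoms only): propane and cyclopropane.
propane = (["C", "C", "C"], [(0, 1, 1, False), (1, 2, 1, False)])
cyclopropane = (
    ["C", "C", "C"],
    [(0, 1, 1, True), (1, 2, 1, True), (0, 2, 1, True)],
)

plain = fragment_counts([propane, cyclopropane], distinguish_rings=False)
ringed = fragment_counts([propane, cyclopropane], distinguish_rings=True)
```

In this toy collection the fragment C(-C)(-C) occurs four times when ring bonds are ignored, but splits into one acyclic and three cyclic occurrences when they are distinguished, which is the kind of shift in distribution the abstract refers to.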
The Cambridge Crystallographic Data Centre is concerned with the retrieval, evaluation, synthesis, and dissemination of structural data based on diffraction methods. This paper is Part I of a series describing the work of the Centre and deals with the organization of a computerized bibliographic file. Examples are given of the use of the file for bibliographic services, computer-typeset publications, and statistical analysis of trends in publications.
The occurrence of elements in general collections of chemical structures is known to have a highly skewed distribution, which must be compensated for in the design of screening systems for automatic substructure searching. Quantitative measures for the occurrence of elements and certain simple fragments are presented for a large sample of structures from the Chemical Abstracts Service Registry System. A computer technique is described which allows the distribution of fragments to be studied at differing levels of specificity. When selectively applied, the technique permits resolution of the commoner fragments into less frequent categories, producing a more even distribution. The method may be of value in the development of screening systems for substructure searching.
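The selective resolution of common fragments can be sketched as follows. This is only an illustration under stated assumptions, not the technique of the paper: each fragment occurrence is assumed to be available as a tuple of descriptors ordered from least to most specific, and the function name resolve, the threshold max_count, and the sample data are all hypothetical. A fragment is described at the coarsest level unless its count exceeds the threshold, in which case it is split at the next level of description.

```python
from collections import Counter


def resolve(occurrences, max_count, level=1):
    """Describe fragments only as specifically as their frequency requires.

    occurrences: list of descriptor tuples, coarsest descriptor first
    max_count:   a fragment occurring more often than this is split further
    Returns a dict mapping the chosen description (a tuple prefix) to its count.
    """
    counts = Counter(occ[:level] for occ in occurrences)
    result = {}
    for key, n in counts.items():
        members = [occ for occ in occurrences if occ[:level] == key]
        # Splitting is possible only if a more specific descriptor exists.
        can_split = any(len(occ) > level for occ in members)
        if n > max_count and can_split:
            result.update(resolve(members, max_count, level + 1))
        else:
            result[key] = n
    return result


# Hypothetical bond-centred occurrences: (atom pair, bond type, ring/chain).
sample = (
    [("C-C", "single", "chain")] * 6
    + [("C-C", "single", "ring")] * 3
    + [("C-C", "double", "chain")] * 2
    + [("C-O", "single", "chain")] * 2
    + [("C-N", "single", "chain")] * 1
)

screens = resolve(sample, max_count=4)
```

With this sample, the rare C-O and C-N fragments stay at the coarsest level, while the common C-C fragment is resolved by bond type and then by ring or chain membership. The largest category falls from 11 occurrences to 6, giving a more even distribution over the same 14 occurrences.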