2008
DOI: 10.1021/ci800224h

Effect of Data Standardization on Chemical Clustering and Similarity Searching

Abstract: Standardization is used to ensure that the variables in a similarity calculation make an equal contribution to the computed similarity value. This paper compares the use of seven different methods that have been suggested previously for the standardization of integer-valued or real-valued data, comparing the results with unstandardized data. Sets of structures from the MDL Drug Data Report and IDAlert databases and represented by Pipeline Pilot physicochemical parameters, molecular holograms and Molconn-Z para…

Cited by 24 publications (13 citation statements)
References 21 publications
“…Our experiments here used two sets of molecules and activity classes that we have studied previously in a comparison of standardisation methods for clustering and similarity searching [32]. These datasets .…”
Section: Datasets and Clustering Methods (mentioning)
confidence: 99%
“…In contrast, the z-score formula scales the data vector to a standardized vector having zero mean and unit variance. Some other standardization techniques derived from these two basic methods are proposed in the literature [28,29].…”
Section: Standardization Standardizing the Training Data Prior (mentioning)
confidence: 99%
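
The z-score scaling described in the statement above can be illustrated with a minimal sketch. It assumes the population standard deviation as the divisor (the quoted text does not say which variant is meant); the function name `z_score` and the sample values are hypothetical and are not taken from the cited paper.

```python
from statistics import fmean, pstdev


def z_score(values):
    """Rescale a sequence to zero mean and unit (population) variance."""
    mean = fmean(values)
    sd = pstdev(values)  # assumption: population standard deviation
    if sd == 0:
        # A constant variable has no spread to rescale; map it to zeros.
        return [0.0 for _ in values]
    return [(v - mean) / sd for v in values]


# Hypothetical descriptor values for four molecules.
descriptor = [1.2, 3.4, 2.2, 5.0]
scaled = z_score(descriptor)
```

After this transformation every variable has the same mean and spread, so no single descriptor dominates a distance or similarity calculation merely because of its units.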
“…In order to mitigate the negative effect of heavy-tailed data on a practical, even distribution of data into separate segments, some studies standardize feature values so that they all fall within the same range. This transforms the features to exhibit smaller variations and hence make them become more amenable to k-means clustering (Chu et al 2009). However, while standardization may be important in the presence of outliers that skew sample statistics and model fits, data in heavy tails are meaningful observations representing a characteristic trend in the data.…”
Section: Segmenting Users By Engagement (mentioning)
confidence: 99%
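
One common way to bring features into the same range is range (min-max) scaling, sketched below. This is an assumption for illustration: the quoted statement does not name a specific formula, and the function name `range_scale` and the sample data are hypothetical. The example also hints at the concern raised in the statement: a single heavy-tail observation compresses all other values toward zero.

```python
def range_scale(values):
    """Map a sequence linearly onto the interval [0, 1]."""
    lo, hi = min(values), max(values)
    span = hi - lo
    if span == 0:
        # A constant feature has no range; map it to zeros.
        return [0.0 for _ in values]
    return [(v - lo) / span for v in values]


# Hypothetical heavy-tailed feature: one very large observation.
activity_counts = [1, 2, 3, 1000]
scaled = range_scale(activity_counts)
# The largest value maps to 1.0, while the first three values
# end up packed very close to 0.
```

Whether such compression is acceptable depends on whether the extreme value is noise or, as the statement argues, a meaningful observation.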
“…A framework that can meet these improvements to engagement analysis is necessary but challenging to develop. A simple approach may be to identify elements related to social engagement from a system, preprocess the data with common standardization and outlier removal procedures, and then partition users through an unsupervised clustering analysis (Chu et al 2009). But this could fail when faced with data from an OSS because, based on measurements from a number of studies, features related to social elements exhibit heavy-tailed tendencies (Benevenuto et al 2009).…”
Section: Introduction (mentioning)
confidence: 99%