A new measure of compound similarity/diversity
Presented by Prof. Roberto TODESCHINI, Dr. Viviana CONSONNI
Type: Oral presentation
Track: Mathematical Chemistry
Similarity searching is a standard tool for drug design, based on the idea that, given a target structure with desired properties, similar compounds chosen from large databases should have similar properties. Similarity relationships are also gaining importance in environmental property prediction, within the framework of read-across. Read-across was proposed as a non-formalised approach in which endpoint information for one or more compounds (the source compounds) is used to predict the endpoint for another compound (the target compound) that is considered to be similar in some way (usually on the basis of structural similarity, i.e. molecular descriptors). A measure of similarity between the target structure and each of the database structures allows all the molecules to be ranked by decreasing similarity to the target. The numerical value of a similarity/diversity measure depends on three main components: a) the description of the objects (e.g. molecular descriptors), b) the weighting scheme of the description elements, and c) the selected similarity index or distance. Here, a novel similarity/diversity measure is proposed, based on a modification of the Mahalanobis distance in which the common data covariance matrix is replaced by a locally centred covariance matrix. For each pair of objects, two different distance measures can be calculated, depending on which object is taken as the centre for the covariance matrix calculation: given a target, one distance can be thought of as how the target perceives the other object, whereas the other distance is how the target is perceived by the other object. The properties of the novel distance measure are discussed by comparison with the Euclidean and classical Mahalanobis distances on some simulated data sets. Two main indices describing internal relationships between the data, called remoteness and sparseness, are also derived.
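The idea of a locally centred covariance matrix can be illustrated with a minimal Python sketch. The abstract does not state the exact formula, so the code rests on an assumption: the covariance matrix is computed from deviations about the chosen centre object rather than about the data mean, with an n - 1 normalisation, and is then used in the usual Mahalanobis quadratic form. The function names and the simulated data are hypothetical and are not taken from the presentation.

```python
import numpy as np

def locally_centred_distance(X, centre, other):
    """Distance from object `centre` to object `other` (row indices of X).

    Assumption: the covariance is built from deviations about X[centre]
    instead of the column means; the n - 1 divisor is also an assumption.
    """
    X = np.asarray(X, dtype=float)
    n = X.shape[0]
    dev = X - X[centre]              # deviations from the chosen centre
    cov = dev.T @ dev / (n - 1)      # locally centred covariance (assumed form)
    diff = X[other] - X[centre]
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

def classical_mahalanobis(X, i, j):
    """Classical Mahalanobis distance using the common covariance matrix."""
    X = np.asarray(X, dtype=float)
    cov = np.cov(X, rowvar=False)
    diff = X[j] - X[i]
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))

# Simulated data: a cloud of 50 objects plus one outlying object (index 50).
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(size=(50, 2)), [6.0, 6.0]])

d_target_view = locally_centred_distance(X, 0, 50)  # how object 0 perceives object 50
d_other_view = locally_centred_distance(X, 50, 0)   # how object 50 perceives object 0
d_classical = classical_mahalanobis(X, 0, 50)       # symmetric reference value

print(d_target_view, d_other_view, d_classical)
```

Under this assumed form the two locally centred distances for a pair of objects generally differ, which matches the asymmetry described above, whereas the classical Mahalanobis distance is the same in both directions. The remoteness and sparseness indices are not sketched here because the abstract does not define them.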