Graph kernel based on iterative graph similarity and optimal assignmentshttp://mloss.orgUpdates and additions to Graph kernel based on iterative graph similarity and optimal assignmentsenMon, 22 Sep 2008 13:42:28 -0000Graph kernel based on iterative graph similarity and optimal assignments 2008-01-15<html><p>Java package implementing a kernel for (molecular) graphs based on iterative graph similarity and optimal assignments. </p> <hr /> <p>Abstract of the publication introducing the method (<em>Journal of Chemical Information and Modeling 47(6):2280-2286</em>): </p> <h3>Kernel Approach to Molecular Similarity Based on Iterative Graph Similarity</h3> <p>Similarity measures for molecules are of basic importance in chemical, biological, and pharmaceutical applications. We introduce a molecular similarity measure defined directly on the annotated molecular graph, based on iterative graph similarity and optimal assignments. We give an iterative algorithm for the computation of the proposed molecular similarity measure, prove its convergence and the uniqueness of the solution, and provide an upper bound on the required number of iterations necessary to achieve a desired precision. Empirical evidence for the positive semidefiniteness of certain parametrizations of our function is presented. We evaluated our molecular similarity measure by using it as a kernel in support vector machine classification and regression applied to several pharmaceutical and toxicological data sets, with encouraging results. </p> <hr /> <p>Abstract of a successful application (accepted for the <em>4th German Conference on Chemoinformatics, Goslar, November 9-11, 2008</em>): </p> <h3>Virtual screening for PPAR-y ligands using the ISOAK molecular graph kernel and Gaussian processes</h3> <p>Timon Schroeter, Matthias Rupp, Katja Hansen, Klaus-Robert Mueller, Gisbert Schneider </p> <p>In this work, we use a machine learning approach employing a graph kernel, Gaussian process regression and clustered cross-validation, to virtually screen for novel ligands of the peroxisome-proliferator activated receptor gamma (PPAR-y). The PPAR family of receptors belong to the steroid-thyroid-retinoid superfamily of nuclear receptors and act as transcription factors. They play a role in the regulation of lipid and glucose metabolism in vertebrates and are linked to various human processes and diseases. [1] For this study, we used a dataset of 176 PPAR-y agonists published by Ruecker et al [2]. </p> <p>We use a graph kernel based on iterative similarity and optimal assignments (ISOAK, [3]) for non-linear Bayesian regression with Gaussian process priors (GP regression, [4]). The ISOAK kernel was developed for topological comparison of molecules and has previously been used for virtual screening with support vector machines. </p> <p>GP models can provide confidence estimates with each individual prediction, thereby allowing to assess which compounds are inside of the models domain of applicability. This feature is useful in settings like virtual screening, where a large fraction of the tested compounds is outside of the models domain of applicability. In chemoinformatics, GPs have been applied to different classification and regression tasks using either radial basis function or rational quadratic kernels based on vectorial descriptors [4,5]. They have not been used with graph kernels so far. </p> <p>A number of kernel based learning algorithms (including GPs) are capable of multiple kernel learning [6], which allows to combine heterogeneous information by using multiple kernels at the same time. In this work, we combine rational quadratic kernels for vectorial molecular descriptors (MOE 2D, CATS2D and Ghose-Crippen fragment descriptors) with the ISOAK graph kernel. </p> <p>We evaluate our methodology in different ranking and regression settings. Ranking performance is asessed using the new labloss loss-function, which measures the number of false positives within the top k predicted compounds. For this, the predicted compounds are ranked based on both predicted binding affinity and the confidence in each prediction. In the regression setting, standard loss functions like mean absolute error (MEA) and root mean squared error are used. The established linear ridge regression (LRR) and support vector regression (SVR) algorithms served as baseline methods. </p> <p>In addition to standard test/training splits and cross-validation, we employed a clustered cross-validation strategy where groups of whole clusters of compounds are left out when constructing training sets. This results in less optimistic results, but has the advantage of favoring more robust and extrapolation-capable algorithms than standard training/test splits and normal cross-validation. </p> <p>In the regression setting, both GP and SVR models perform well, yielding MAEs as low as 0.66 +- 0.08 (clustered CV) and 0.51 +-0.3 (normal CV). In the ranking setting, GPs slightly outperform SVR (labloss 0.21 +- 0.09 vs. 0.3 +- 0.08). Both non-linear algorithms clearly outperfom LRR. </p> <p>In conclusion, Gaussian process regression using both the ISOAK molecular graph kernel and rational quadratic kernel (with standard molecular descriptors) simultaneously via multiple kernel learning performs well in retrospective evaluation. A prospective evaluation study is currently in progress. </p> <p>[1] Henke B, Progr. Med. Chem. 2004, 1-53.<br /> [2] Ruecker C, Scarsi M, Meringer M, Bioorg. Med. Chem. 2006, 5178-5195.<br /> [3] Rupp M, Proschak E, Schneider G, J. Chem. Inform. Model. 2007, 2280-2286.<br /> [4] Schwaighofer A, Schroeter T, Mika S, Laub J, Laak A, Suelzle D, Ganzer U, Heinrich N, Mueller K-R, J. Chem. Inform. Model. 2007, 407-424.<br /> [5] Obrezanova O, Csanyi G, Gola J, Segall M, J. Chem. Inf. Model, 2007, 1847 -1857.<br /> [6] Sonnenburg S, Raetsch G , Schaefer C, Schoelkopf B, J. Machine Learning Research, 2006, 1531-1565<br /> </p></html>Matthias Rupp, Ewgenij Proschak, Gisbert SchneiderMon, 22 Sep 2008 13:42:28 -0000 methodsnips2008graph kernelgraph similarityoptimal assignmentstructured data