Research Areas
I work in a number of areas of cheminformatics,
including data mining and algorithm development, applications to
specific biological systems, and infrastructure projects. I've
provided more detailed descriptions of some of my current
work. I strongly believe in open source and open access, and most of
the results of my research are freely available to both academia
and industry.
Some of my previous work in this area includes development of models to predict the anti-malarial activity of artemisinin analogs (Guha & Jurs, 2004) and the activities of PDGFR inhibitors (Guha & Jurs, 2004), anti-cancer activities using the NCI DTP collection (Wang et al, 2007), cytotoxicity (Guha & Schurer, 2008), feature and set selection methods (Guha et al, 2004, Guha et al, 2007) and local versus global methods (Guha et al, 2006).
More recently I have been focusing on methodological developments that address the issue of model domain applicability. Given a QSAR model built on a collection of molecules (the training set), we use it to predict the property of a new molecule. But if the new molecule is very different from the training set, is the prediction reliable? How similar to the training set should a new molecule be to obtain a reliable prediction? What is a good measure of similarity? These questions are important because a number of efforts (such as the REACH project) are attempting to replace costly animal testing (for a variety of properties such as toxicity and skin irritation) with QSAR models. My previous work in this area focused on the development of an auxiliary model that acted as a check on the main QSAR model (Guha & Jurs, 2005). I am currently collaborating with Dr. David Stanton of Procter & Gamble to develop a new approach to this problem that takes into account similarity to the training set, but also tries to identify extra features in a new molecule that might cause it to be dissimilar to the training set. I am also investigating the use of fingerprints to compare datasets in bulk, avoiding a pairwise comparison between molecules from the training set and a prediction set.
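The two similarity ideas above can be illustrated with a minimal sketch. It is not the auxiliary-model method of Guha & Jurs (2005), nor the approach under development. It assumes fingerprints are stored as sets of "on" bit indices. The 0.5 similarity threshold, and the use of per-bit frequencies as the bulk dataset summary, are illustrative assumptions only.

```python
from typing import FrozenSet, List, Sequence

Fingerprint = FrozenSet[int]  # indices of the bits that are set


def tanimoto(a: Fingerprint, b: Fingerprint) -> float:
    """Tanimoto (Jaccard) similarity between two bit-set fingerprints."""
    shared = len(a & b)
    union = len(a) + len(b) - shared
    # Convention chosen here: two empty fingerprints have similarity 0.
    return shared / union if union else 0.0


def nearest_training_similarity(query: Fingerprint,
                                training: Sequence[Fingerprint]) -> float:
    """Similarity of the query to its closest training-set molecule.

    The training set must be non-empty.
    """
    return max(tanimoto(query, fp) for fp in training)


def in_domain(query: Fingerprint, training: Sequence[Fingerprint],
              threshold: float = 0.5) -> bool:
    """Flag a prediction as trustworthy only if the query is close enough.

    The threshold is a hypothetical value, not one taken from the text.
    """
    return nearest_training_similarity(query, training) >= threshold


def bit_frequencies(fps: Sequence[Fingerprint], n_bits: int) -> List[float]:
    """Summarise a whole dataset as the fraction of molecules setting each bit."""
    counts = [0] * n_bits
    for fp in fps:
        for bit in fp:
            counts[bit] += 1
    return [c / len(fps) for c in counts]


def bulk_distance(a: Sequence[Fingerprint], b: Sequence[Fingerprint],
                  n_bits: int) -> float:
    """Mean absolute difference of bit frequencies; no pairwise comparison."""
    fa, fb = bit_frequencies(a, n_bits), bit_frequencies(b, n_bits)
    return sum(abs(x - y) for x, y in zip(fa, fb)) / n_bits
```

The nearest-neighbour check costs one comparison per training molecule for every query. The bulk comparison reduces each dataset to a single frequency vector first, so its cost grows with the sum of the dataset sizes rather than their product.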
Some areas that I will be considering in the near future include the use of multiple data types (chemical structure, microarray data, assay data, etc.) in a single regression or classification model, and the development of descriptors for polymeric materials and mixtures.
When one talks of chemical spaces, a number of questions arise, including:
- How do we choose a suitable space from many possible spaces?
- Can we characterize the spatial distribution of molecules in a given space?
- How does the chemical space for a set of molecules affect the representation of structure-activity relationships?
At this point, two micromolar inhibitors have been identified.
In collaboration with Marlon Pierce, I am investigating techniques to parallelize the database so that it can scale from the current 17 million molecules to hundreds of millions, as well as alternative spatial indexing schemes to improve query efficiency. In addition, in collaboration with Kevin Gilbert, I am extending the current single-conformer database to a multi-conformer version. Along with various benchmarking tests, we will use the database for a variety of applications, including diversity analysis and fast docking filters.
As a contributor to the CDK, I base much of my software development on this toolkit. Currently my work on the toolkit focuses on molecular descriptor implementations, improving the build and test systems, and pharmacophore representation and searching. I recently implemented a basic pharmacophore search framework in the CDK, which currently supports only distance constraints (Guha & Van Drie, 2008). Short-term plans include adding angle constraints and excluded volumes. In the longer term, this framework will be used to develop pharmacophore discovery algorithms.
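To make the idea of a distance-constraint pharmacophore query concrete, here is a small sketch. It is not the CDK API and does not reproduce the published framework. The class names, the feature labels and the brute-force matching strategy are all assumptions made for illustration. A query is a list of feature types plus allowed distance ranges between pairs of them, and a match is an assignment of a molecule's features to query positions that satisfies every range.

```python
from dataclasses import dataclass
from itertools import permutations
from math import dist
from typing import Optional, Sequence, Tuple


@dataclass(frozen=True)
class Feature:
    """A pharmacophore feature perceived in one conformer (hypothetical type)."""
    kind: str                       # e.g. "donor", "acceptor", "aromatic"
    xyz: Tuple[float, float, float]


@dataclass(frozen=True)
class DistanceConstraint:
    """Allowed distance range between query positions i and j."""
    i: int
    j: int
    lower: float
    upper: float


def find_match(query_kinds: Sequence[str],
               constraints: Sequence[DistanceConstraint],
               features: Sequence[Feature]) -> Optional[Tuple[int, ...]]:
    """Return feature indices matching the query, or None if nothing matches.

    Brute force: tries every ordered selection of features, which is fine
    for the handful of features in a small molecule but scales poorly.
    """
    for candidate in permutations(range(len(features)), len(query_kinds)):
        # Each selected feature must have the type required at its position.
        if any(features[f].kind != kind
               for f, kind in zip(candidate, query_kinds)):
            continue
        # Every pairwise distance must fall inside its allowed range.
        if all(c.lower
               <= dist(features[candidate[c.i]].xyz, features[candidate[c.j]].xyz)
               <= c.upper
               for c in constraints):
            return candidate
    return None
```

Angle constraints and excluded volumes, mentioned as short-term plans, would be additional predicates checked at the same point as the distance test.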
As a regular R user, I do much of my algorithm development within R, and I have released a number of packages: rcdk, which integrates the CDK with R, and rpubchem, which provides an interface to PubChem from within R (Guha, 2007). These projects are ongoing.
Another aspect of my software development work is the design and development of a web service infrastructure covering cheminformatics functionality, data sources and statistical functionality (Dong et al, 2007). One of my projects in this area is the development of a scheme to allow the exchange of predictive models (using PMML along with annotations). As a result, models will be made available as XML documents and hence can be indexed by search engines, allowing us to search for models directly. The next step is to support automatic execution of models.
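As a rough illustration of what exchanging a model as an annotated XML document could look like, the sketch below serializes a linear model into a PMML-style document and then executes it by reading the document back. It is not the scheme described above. The function names, descriptor names and coefficients are hypothetical, the PMML version is an assumption, and the output has not been validated against the PMML schema.

```python
import xml.etree.ElementTree as ET
from typing import Dict

PMML_NS = "http://www.dmg.org/PMML-4_4"  # assumed PMML version


def linear_model_to_pmml(target: str, intercept: float,
                         coefficients: Dict[str, float],
                         annotation: str) -> str:
    """Serialize a linear regression as a PMML-style XML string."""
    root = ET.Element("PMML", {"version": "4.4", "xmlns": PMML_NS})
    header = ET.SubElement(root, "Header")
    # Free-text annotation that a search engine could index.
    ET.SubElement(header, "Annotation").text = annotation

    dictionary = ET.SubElement(root, "DataDictionary")
    for name in [*coefficients, target]:
        ET.SubElement(dictionary, "DataField",
                      {"name": name, "optype": "continuous",
                       "dataType": "double"})

    model = ET.SubElement(root, "RegressionModel",
                          {"functionName": "regression"})
    schema = ET.SubElement(model, "MiningSchema")
    for name in coefficients:
        ET.SubElement(schema, "MiningField", {"name": name})
    ET.SubElement(schema, "MiningField",
                  {"name": target, "usageType": "target"})

    table = ET.SubElement(model, "RegressionTable",
                          {"intercept": repr(intercept)})
    for name, coef in coefficients.items():
        ET.SubElement(table, "NumericPredictor",
                      {"name": name, "coefficient": repr(coef)})
    return ET.tostring(root, encoding="unicode")


def predict_from_pmml(document: str, inputs: Dict[str, float]) -> float:
    """Execute the exchanged model: intercept plus weighted descriptor values."""
    ns = {"p": PMML_NS}
    table = ET.fromstring(document).find(".//p:RegressionTable", ns)
    result = float(table.get("intercept"))
    for predictor in table.findall("p:NumericPredictor", ns):
        result += float(predictor.get("coefficient")) * inputs[predictor.get("name")]
    return result
```

Because the model travels as plain XML, a consumer needs only the document itself to run a prediction. A production exchange would presumably rely on a full PMML library rather than hand-written parsing like this.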