Research Areas

I work in a number of areas of cheminformatics, including data mining and algorithm development, applications to specific biological systems, and infrastructure projects. Below are more detailed descriptions of some of my current work. I strongly believe in open source and open access, and most of the results of my research are freely available to both academia and industry.


Networks in Chemistry & Biology
The focus of this project is to develop network models of PubChem bioassays and use these models to investigate promiscuity and polypharmacology. We consider assays as nodes in a graph and connect two assays based on a variety of conditions (similarity of targets, compounds active in both assays, etc.). We then investigate the properties of these graphs and are exploring ways to map them to other biological networks, such as protein-protein interaction networks and drug-target networks (Yildirim et al, 2007). Along with these analytical investigations, we are developing various network visualizations that will allow us to summarize the bioassay data in multiple ways.
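The assay-as-node idea can be sketched as follows. This is a minimal illustration, not the project's actual code: the assay IDs and compound sets are invented, and only one edge condition (shared active compounds) is shown; shared targets or other conditions would be handled the same way.

```python
from itertools import combinations

# Hypothetical toy data: assay ID -> set of compound IDs active in that assay.
assays = {
    "AID1": {"C1", "C2", "C3"},
    "AID2": {"C2", "C3", "C4"},
    "AID3": {"C9"},
}

def build_assay_graph(assays, min_shared=1):
    """Return weighted edges connecting assays that share active compounds.

    Each edge weight is simply the number of shared actives; other
    connection criteria (e.g. target similarity) would plug in the same way.
    """
    edges = {}
    for a, b in combinations(sorted(assays), 2):
        shared = assays[a] & assays[b]
        if len(shared) >= min_shared:
            edges[(a, b)] = len(shared)
    return edges

edges = build_assay_graph(assays)
```

Here AID1 and AID2 are linked (two shared actives) while AID3 remains isolated, which is exactly the kind of connectivity pattern one can then compare against other biological networks.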
Computational Toxicity
Predictive Modeling Methodologies
Projects falling within this field primarily focus on the development of Quantitative Structure-Activity Relationship (QSAR) models and methodological developments for various aspects of QSAR modeling protocols.

Some of my previous work in this area includes models to predict the anti-malarial activity of artemisinin analogs (Guha & Jurs, 2004) and the activities of PDGFR inhibitors (Guha & Jurs, 2004), models of anti-cancer activity using the NCI DTP collection (Wang et al, 2007) and of cytotoxicity (Guha & Schurer, 2008), feature and set selection methods (Guha et al, 2004; Guha et al, 2007), and comparisons of local versus global methods (Guha et al, 2006).

More recently I have been focusing on methodological developments that address the issue of the model applicability domain. Given a QSAR model built on a collection of molecules (the training set), we can use it to predict the property of a new molecule. But if the new molecule is very different from the training set, is the prediction reliable? How similar to the training set should a new molecule be to obtain a reliable prediction? What is a good measure of similarity? These questions are important because a number of efforts (such as the REACH project) are attempting to replace costly animal testing (for a variety of properties such as toxicity and skin irritation) with QSAR models. My previous work in this area focused on the development of an auxiliary model that acted as a check on the main QSAR model (Guha & Jurs, 2005). I am currently collaborating with Dr. David Stanton of Procter & Gamble to develop a new approach to this problem that takes into account similarity to the training set, but also tries to identify extra features in a new molecule that might cause it to be dissimilar to the training set. I am also investigating the use of fingerprints to compare datasets in bulk, avoiding a pairwise comparison between molecules in the training set and a prediction set.
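A very simple instance of the similarity question above can be sketched as a nearest-neighbor check against the training set. This is a toy sketch, not any of the methods cited: fingerprints are represented as plain sets of on-bits, and the 0.3 cutoff is purely illustrative.

```python
def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints, given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 0.0

def in_domain(query_fp, training_fps, cutoff=0.3):
    """Crude applicability-domain check: is the query's nearest training-set
    neighbor at least `cutoff` similar? The cutoff value is illustrative only."""
    nearest = max(tanimoto(query_fp, fp) for fp in training_fps)
    return nearest >= cutoff

training = [{1, 2, 3, 5}, {2, 3, 4}]
print(in_domain({1, 2, 3, 5}, training))  # identical to a training molecule
print(in_domain({8, 9}, training))        # shares no bits with the training set
```

Even this toy version makes the underlying trade-off concrete: the answer depends entirely on the similarity measure and the cutoff, which is precisely what the methodological work tries to put on firmer footing.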

Some areas that I will be considering in the near future include the use of multiple data types (chemical structure, microarray data, assay data, etc.) in a single regression or classification model, and the development of descriptors for polymeric materials and mixtures.

Characterizing & Exploring Chemical Spaces
Many applications in cheminformatics require one to numerically describe the structural features and properties of a molecule. We do this by calculating molecular descriptors. A set of descriptors used to characterize a collection of molecules represents a (possibly high-dimensional) space within which the molecules are embedded. These chemical spaces can then be used for a variety of purposes, such as QSAR modeling, diversity analysis, and so on.

When one talks of chemical spaces a number of questions arise, including

  • How do we choose a suitable space from many possible spaces?
  • Can we characterize the spatial distribution of molecules in a given space?
  • How does the chemical space for a set of molecules affect the representation of structure-activity relationships?
My previous work in this area has focused on methodologies to characterize chemical spaces using a density-based method (Guha et al, 2006) and an investigation of geometric hashing as a way to partition chemical spaces (Dutta et al, 2006). I am currently working with Dr. John Van Drie on methods to identify and characterize activity cliffs (Maggiora, 2006). We recently described a numerical approach to this problem (Guha & Van Drie, 2008) and are investigating a number of applications, including feature selection, characterization of chemical spaces in terms of activity cliffs, and measurement of the ability of arbitrary models (QSAR, docking, pharmacophore, etc.) to encode an SAR.
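The numerical approach to activity cliffs can be illustrated with the Structure-Activity Landscape Index (SALI) described in Guha & Van Drie (2008): the ratio of the activity difference of a pair of molecules to their structural distance. The activity and similarity values below are invented for illustration.

```python
def sali(act_i, act_j, sim_ij):
    """Structure-Activity Landscape Index for a pair of molecules:
    |activity difference| / (1 - structural similarity).
    Large values flag activity cliffs: similar structures with very
    different activities."""
    if sim_ij >= 1.0:
        return float("inf")  # identical structures with differing activities
    return abs(act_i - act_j) / (1.0 - sim_ij)

# Two molecules with pIC50 values 5.0 and 8.0:
print(sali(5.0, 8.0, 0.9))  # highly similar pair -> large SALI (a cliff)
print(sali(5.0, 8.0, 0.2))  # dissimilar pair -> small SALI
```

Computing SALI over all pairs in a dataset yields a matrix whose largest entries locate the cliffs, which is what makes the index useful for characterizing a chemical space or scoring how well a model encodes the SAR.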
Anti-malarial Drug Discovery
This is a collaboration with Prof. Jean-Claude Bradley of Drexel University. He is synthesizing compounds using the Ugi reaction, which are then tested for anti-malarial activity by Prof. Phil Rosenthal of UCSF. I am helping to generate virtual libraries of Ugi reaction products and screening them against falcipain-2 using docking. We will also develop QSAR models using inhibitors that have been described in the literature, to provide an alternative virtual screening strategy. I am also developing QSAR models to predict the methanol solubility of the reaction products, to further prioritize the synthetic efforts.

To date, two micromolar inhibitors have been identified.

Chemical Structure Databases
I am currently working on a 3D version of PubChem that is shape searchable. This project involves a number of components, including 3D structure generation, storage in an RDBMS, and shape representation and searching. The 3D structure generation is performed using code (smi23d) written by Kevin Gilbert. We employ distance moments (Ballester & Graham Richards, 2007) to represent shapes in 12-D and then use an R-tree spatial index to allow efficient shape searches.
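The 12-D representation can be sketched as follows: for each of four reference points derived from the 3D coordinates, take the first three moments of the distribution of atomic distances to that point. This is an illustrative sketch in the spirit of the distance-moment method, not the production code; using the standard deviation and the sign-preserving cube root of the third central moment is one common implementation choice.

```python
import math

def _moments(dists):
    """Mean, standard deviation, and sign-preserving cube root of the
    third central moment of a distance distribution."""
    n = len(dists)
    mean = sum(dists) / n
    var = sum((d - mean) ** 2 for d in dists) / n
    third = sum((d - mean) ** 3 for d in dists) / n
    return [mean, math.sqrt(var), math.copysign(abs(third) ** (1 / 3), third)]

def shape_descriptor(coords):
    """12-D shape descriptor: three distance-distribution moments from each
    of four reference points (the centroid, the atom closest to it, the atom
    farthest from it, and the atom farthest from that last atom)."""
    n = len(coords)
    ctd = tuple(sum(c[i] for c in coords) / n for i in range(3))
    cst = min(coords, key=lambda c: math.dist(c, ctd))
    fct = max(coords, key=lambda c: math.dist(c, ctd))
    ftf = max(coords, key=lambda c: math.dist(c, fct))
    desc = []
    for ref in (ctd, cst, fct, ftf):
        desc.extend(_moments([math.dist(c, ref) for c in coords]))
    return desc
```

Because every molecule maps to a fixed 12-D point, shape comparison reduces to a distance computation between vectors, which is what makes the R-tree spatial index applicable.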

In collaboration with Marlon Pierce I am investigating techniques to parallelize the database, so that it can scale to hundreds of millions of molecules from the current 17 million, as well as alternative spatial indexing schemes to improve query efficiency. In addition, in collaboration with Kevin Gilbert I am extending the current single-conformer database to a multi-conformer version. Along with various benchmarking tests, we will use the database for a variety of applications, including diversity analysis and fast docking filters.

Cheminformatics Software Development
I am involved in a number of software development projects, some of which derive from the research areas above, while others focus purely on development. My development process uses modern software engineering tools and techniques (including version control, unit testing, etc.), and most of my code is released in open-source form.

As a contributor to the CDK, much of my software development is based on this toolkit. Currently my work focuses on molecular descriptor implementations, improvements to the build and test systems, and pharmacophore representation and searching. I recently implemented a basic pharmacophore search framework in the CDK, which currently supports only distance constraints (Guha & Van Drie, 2008). Short-term plans include adding angle constraints and excluded volumes. Longer term, this framework will be used to develop pharmacophore discovery algorithms.
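A distance-constraint pharmacophore query boils down to a simple geometric test, sketched below. This is a language-neutral illustration of the idea rather than the CDK API: the feature coordinates and constraint bounds are invented, and in practice the features would come from a pattern-perception step on each conformer.

```python
import math

# Hypothetical feature coordinates for one conformer (Angstroms).
features = {
    "donor":    (0.0, 0.0, 0.0),
    "acceptor": (3.0, 0.0, 0.0),
    "aromatic": (0.0, 4.0, 0.0),
}

# Each constraint: (feature A, feature B, lower bound, upper bound).
query = [
    ("donor", "acceptor", 2.5, 3.5),
    ("donor", "aromatic", 3.5, 4.5),
]

def matches(features, constraints):
    """True when every pairwise distance constraint is satisfied."""
    return all(
        lo <= math.dist(features[a], features[b]) <= hi
        for a, b, lo, hi in constraints
    )

print(matches(features, query))
```

Angle constraints and excluded volumes extend this same test with angular bounds and forbidden regions, which is why they are natural next steps for the framework.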

Being a regular R user, I do much of my algorithm development within R, and I have released a number of packages: rcdk, which integrates the CDK with R, and rpubchem, which provides an interface to PubChem from within R (Guha, 2007). These projects are ongoing.

Another aspect of my software development work is the design and development of a web service infrastructure covering cheminformatics functionality, data sources, and statistical functionality (Dong et al, 2007). One of my projects in this area is the development of a scheme to allow the exchange of predictive models (using PMML along with annotations). As a result, models will be available as XML documents and hence indexed by search engines, allowing us to search for models directly. The next step is to support automatic execution of models.
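To make the exchange idea concrete, a minimal PMML document for a linear QSAR model might look like the sketch below. The field names, coefficients, and model name are invented for illustration; an annotation scheme would extend such a document with metadata (training set, descriptor provenance, etc.) beyond what core PMML records.

```xml
<PMML version="3.2" xmlns="http://www.dmg.org/PMML-3_2">
  <Header description="Hypothetical single-descriptor QSAR model"/>
  <DataDictionary numberOfFields="2">
    <DataField name="XLogP" optype="continuous" dataType="double"/>
    <DataField name="activity" optype="continuous" dataType="double"/>
  </DataDictionary>
  <RegressionModel modelName="exampleQSAR" functionName="regression">
    <MiningSchema>
      <MiningField name="XLogP"/>
      <MiningField name="activity" usageType="predicted"/>
    </MiningSchema>
    <RegressionTable intercept="0.5">
      <NumericPredictor name="XLogP" coefficient="1.2"/>
    </RegressionTable>
  </RegressionModel>
</PMML>
```

Because such a document is plain XML, it can be crawled and indexed like any web page, and a generic consumer that understands PMML could in principle execute the model directly.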