Publications & Reports

Refereed Publications

Inhibition of Ceramide Metabolism Sensitizes Human Leukemia Cells to Inhibition of BCL2-like Proteins

Casson, L.; Howell, L.; Mathews, L.A.; Ferrer, M.; Southall, N.; Guha, R.; Keller, J.M.; Thomas, C.; Varmus, H.; Siskind, L.J.; Beverly, L.J.

PLoS One, 2013, 8, e54525

The identification of novel combinations of effective cancer drugs is required for the successful treatment of cancer patients for a number of reasons. First, many ``cancer specific'' therapeutics display detrimental patient side-effects and second, there are almost no examples of single agent therapeutics that lead to cures. One strategy to decrease both the effective dose of individual drugs and the potential for therapeutic resistance is to combine drugs that regulate independent pathways that converge on cell death. BCL2-like family members are key proteins that regulate apoptosis. We conducted a screen to identify drugs that could be combined with an inhibitor of anti-apoptotic BCL2-like proteins, ABT-263, to kill human leukemia cell lines. We found that D,L-threo-1-phenyl-2-decanoylamino-3-morpholino-1-propanol (PDMP) hydrochloride, an inhibitor of glucosylceramide synthase, potently synergized with ABT-263 in the killing of multiple human leukemia cell lines. Treatment of cells with PDMP and ABT-263 led to dramatic elevation of two pro-apoptotic sphingolipids, namely ceramide and sphingosine. Furthermore, treatment of cells with the sphingosine kinase inhibitor, SKi-II, also dramatically synergized with ABT-263 to kill leukemia cells and similarly increased ceramides and sphingosine. Data suggest that synergism with ABT-263 requires accumulation of ceramides and sphingosine, as AMP-deoxynojirimycin (an inhibitor of the glycosphingolipid pathway) did not elevate ceramides or sphingosine and importantly did not sensitize cells to ABT-263 treatment. Taken together, our data suggest that combining inhibitors of anti-apoptotic BCL2-like proteins with drugs that alter the balance of bioactive sphingolipids will be a powerful combination for the treatment of human cancers.

Targeting IRAK1 as a Novel Therapeutic Approach for Myelodysplastic Syndrome

Rhyasen, G.W.; Bolanos, L.; Fang, J.; Rasch, C.; Jerez, A.; Varney, M.; Wunderlich, M.; Rigolino, C.; Mathews, L.; Ferrer, M.; Southall, N.; Guha, R.; Keller, J.; Thomas, C.; Beverly, L.J.; Agostino, C.; Oliva, E.N.; Cuzzola, M.; Maciejewski, J.P.; Mulloy, J.C.; Starczynowski, D.T.

J. Clinical Investigation, 2012, submitted

Large-Scale Screening Identifies a Novel microRNA, miR-15a-3p, which Induces Apoptosis in Human Cancer Cell Lines

Druz, A.; Chen, Y.C.; Guha, R.; Betenbaugh, M.; Martin, S.; Shiloach, J.

Nucl. Acids Res., 2012, submitted

Cisplatin Sensitivity Mediated by WEE1 and CHK1 is Mediated by miR-155 and the miR-15 Family

Pouliot, L.M.; Chen, Y.-C.; Bai, J.; Guha, R.; Martin, S.E.; Gottesman, M.M.; Hall, M.D.

Cancer Res., 2012, 72, 5945-5955

[ Abstract ] [DOI 10.1158/0008-5472.CAN-12-1400 ]

Identification of Mammalian Protein Quality Control Factors by High-throughput Cellular Imaging

Pegoraro, G.; Voss, T.C.; Martin, S.E.; Tuzmen, P.; Guha, R.; Misteli, T.

PLoS One, 2012, 7, e31684

[ Abstract ] [DOI 10.1371/journal.pone.0031684 ]

Protein Quality Control (PQC) pathways are essential to maintain the equilibrium between protein folding and the clearance of misfolded proteins. In order to discover novel human PQC factors, we developed a high-content, high-throughput cell-based assay to assess PQC activity. The assay is based on a fluorescently tagged, temperature sensitive PQC substrate and measures its degradation relative to a temperature insensitive internal control. In a targeted screen of 1591 siRNA genes involved in the Ubiquitin-Proteasome System (UPS) we identified 25 of the 33 genes encoding for 26S proteasome subunits and discovered several novel PQC factors. An unbiased genome-wide siRNA screen revealed the protein translation machinery, and in particular the EIF3 translation initiation complex, as a novel key modulator of misfolded protein stability. These results represent a comprehensive unbiased survey of human PQC components and establish an experimental tool for the discovery of genes that are required for the degradation of misfolded proteins under conditions of proteotoxic stress.

High-Throughput Screening For Genes That Prevent Excess DNA Replication In Human Cells And For Molecules That Inhibit Them

Lee, C.; Johnson, R.L.; Wichterman-Kouznetsova, J.; Guha, R.; Ferrer, M.; Tuzmen, P.; Martin, S.; Zhu, W.; DePamphilis, M.L.

Methods, 2012, 57, 234-248

[ Abstract ] [DOI 10.1016/j.ymeth.2012.03.031 ]

High-throughput screening (HTS) provides a rapid and comprehensive approach to identifying compounds that target specific biological processes as well as genes that are essential to those processes. Here we describe a HTS assay for small molecules that induce either DNA re-replication or endoreduplication (i.e. excess DNA replication) selectively in cells derived from human cancers. Such molecules will be useful not only to investigate cell division and differentiation, but they may provide a novel approach to cancer chemotherapy. Since induction of DNA re-replication results in apoptosis, compounds that selectively induce DNA re-replication in cancer cells without doing so in normal cells could kill cancers in vivo without preventing normal cell proliferation. Furthermore, the same HTS assay can be adapted to screen siRNA molecules to identify genes whose products restrict genome duplication to once per cell division. Some of these genes might regulate the formation of terminally differentiated polyploid cells during normal human development, whereas others will prevent DNA re-replication during each cell division. Based on previous studies, we anticipate that one or more of the latter genes will prove to be essential for proliferation of cancer cells but not for normal cells, since many cancer cells are deficient in mechanisms that maintain genome stability.

Cheminformatics, the Computer Science of Chemical Discovery Turning Open Source

Sterling, A.; Wegner, J.K.; Guha, R.; Bender, A.; Faulon, J.; Hastings, J.; O'Boyle, N.; Overington, J.P.; Vlijmen, H.V.; Willighagen, E.

Comm. ACM, 2012, 55, 65-75

[ Abstract ] [DOI 10.1145/2366316.2366334 ]

One of the most prominent success stories in all the sciences over the last decade has been the advance of bioinformatics: the interdisciplinary collaboration between computer scientists and molecular biologists that led to the sequencing of the human genome and other accomplishments. However, few computer scientists are familiar with a related discipline: cheminformatics, the use of computers to represent the structures of small molecules and analyze their properties. Cheminformatics has wide applicability, from drug discovery to agrochemicals and materials design. While researchers in both academia and industry have made important contributions to this field for decades, new and exciting collaborative opportunities have arisen from an ``opening'' of data and software as an effect of changing mindsets, policy changes, and chemists volunteering time for ``Open Science''. Researchers have gained access to freely available open source software packages and open databases of tens of millions of chemicals, allowing academic chemists to confront a variety of algorithmic problems whose solutions will be critical to address current challenges ranging from determining the behavior of small molecules in biological pathways to finding therapies for rare and neglected diseases. In this paper, we give a broad overview of the field of cheminformatics with a focus on open questions and challenges.

Exploring Uncharted Territories -- Predicting Activity Cliffs in Structure-Activity Landscapes

Guha, R.

J. Chem. Inf. Model., 2012, 52, 2181-2191

[ Abstract ] [DOI 10.1021/ci300047k ]

The notion of activity cliffs is an intuitive approach to characterizing structural features that play a key role in modulating biological activity of a molecule. A variety of methods have been described to quantitatively characterize activity cliffs, such as SALI and SARI. However, these methods are primarily retrospective in nature, highlighting cliffs that are already present in the dataset. The current study focuses on employing a pairwise characterization of a dataset to train a model to predict whether a new molecule will exhibit an activity cliff with one or more members of the dataset. The approach is based on predicting a value for pairs of objects rather than the individual objects themselves (and thus allows for robust models even for small structure-activity relationship datasets). We extracted structure-activity data for several ChEMBL assays and developed random forest models to predict SALI values from pairwise combinations of molecular descriptors. The models exhibited reasonable RMSEs though, surprisingly, performance on the more significant cliffs tended to be better than on the lesser ones. While the models do not exhibit very high levels of accuracy, our results indicate that they are able to prioritize molecules in terms of their ability to form activity cliffs, thus serving as a tool to prospectively identify activity cliffs.
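
The SALI values mentioned in the abstract above are pairwise quantities, which a small example may make concrete. The following is a minimal sketch, not the paper's code: it assumes the usual definition of SALI as the absolute activity difference of a pair divided by one minus the pair's structural similarity, represents fingerprints as plain sets of on-bits, and uses a hypothetical `eps` guard for identical fingerprints. The toy activities and fingerprints are invented for illustration only.

```python
from itertools import combinations


def tanimoto(a: frozenset, b: frozenset) -> float:
    """Tanimoto similarity of two fingerprints given as sets of on-bits."""
    union = len(a | b)
    return len(a & b) / union if union else 1.0


def sali(act_i: float, act_j: float, sim: float, eps: float = 1e-6) -> float:
    """SALI for one pair: |A_i - A_j| / (1 - sim).

    eps is an assumption of this sketch; it avoids division by zero
    when two molecules have identical fingerprints.
    """
    return abs(act_i - act_j) / max(1.0 - sim, eps)


def pairwise_sali(activities, fingerprints):
    """Return {(i, j): SALI} for every unordered pair of molecules."""
    return {
        (i, j): sali(activities[i], activities[j],
                     tanimoto(fingerprints[i], fingerprints[j]))
        for i, j in combinations(range(len(activities)), 2)
    }


# Toy data (hypothetical): molecules 0 and 1 are similar but differ
# strongly in activity, so they should form the steepest cliff.
activities = [6.0, 8.5, 6.2]
fingerprints = [
    frozenset({1, 2, 3, 4}),
    frozenset({1, 2, 3, 5}),
    frozenset({7, 8, 9}),
]
scores = pairwise_sali(activities, fingerprints)
steepest = max(scores, key=scores.get)
```

In this toy set the pair with the largest SALI is the structurally similar pair with a large activity gap, which is the situation the term "activity cliff" describes. A predictive model of the kind the abstract describes would be trained on such pairwise values rather than on per-molecule activities.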

Dealing with the Data Deluge: Handling the Multitude of Chemical Biology Data Sources

Guha, R.; Nguyen, D.-T.; Southall, N.; Jadhav, A.

Curr. Protocols Chem. Biol., 2012, 4, 193-209

[ Abstract ] [DOI 10.1002/9780470559277.ch110262 ]

Over the last 20 years, there has been an explosion in the amount and type of biological and chemical data that has been made publicly available in a variety of online databases. While this means that vast amounts of information can be found online, there is no guarantee that it can be found easily (or at all). A scientist searching for a specific piece of information is faced with a daunting task---many databases have overlapping content, use their own identifiers and, in some cases, have arcane and unintuitive user interfaces. In this overview, a variety of well-known data sources for chemical and biological information are highlighted, focusing on those most useful for chemical biology research. The issue of using data from multiple sources and the associated problems such as identifier disambiguation are highlighted. A brief discussion is then provided on Tripod, a recently developed platform that supports the integration of arbitrary data sources, providing users a simple interface to search across a federated collection of resources.

A Furoxan-Amodiaquine Hybrid as a Potential Therapeutic for Three Parasitic Diseases

Mott, B.T.; Cheng, C.C.; Guha, R.; Kommer, V.P.; Williams, D.L.; Vermeire, J.J.; Cappello, M.; Maloney, D.J.; Rai, G.; Jadhav, A.; Simeonov, A.; Inglese, J.; Posner, G.H.; Thomas, C.J.

Med. Chem. Comm., 2012, 3, 1505-1511

[ Abstract ] [DOI 10.1039/C2MD20238G ]

Parasitic diseases continue to have a devastating impact on human populations worldwide. Lack of effective treatments, the high cost of existing ones, and frequent emergence of resistance to these agents provide a strong argument for the development of novel therapies. Here we report the results of a hybrid approach designed to obtain a dual acting molecule that would demonstrate activity against a variety of parasitic targets. The antimalarial drug amodiaquine has been covalently joined with a nitric oxide-releasing furoxan to achieve multiple mechanisms of action. Using in vitro and ex vivo assays, the hybrid molecule shows activity against three parasites -- Plasmodium falciparum, Schistosoma mansoni, and Ancylostoma ceylanicum.

Diversity-Oriented Synthesis Yields a Novel Lead for the Treatment of Malaria

Heidebrecht, R.W.; Mulrooney, C.; Austin, C.P.; Barker, R.H.; Beaudoin, J.A.; Cheng, K.Chih-Chien.; Comer, E.; Dandapani, S.; Dick, J.; Duvall, J.R.; Ekland, E.H.; Fidock, D.A.; Fitzgerald, M.E.; Foley, M.; Guha, R.; Hinkson, P.; Kramer, M.; Lukens, A.K.; Masi, D.; Marcaurelle, L.A.; Su, X.; Thomas, C.J.; Weïwer, M.; Wiegand, R.C.; Wirth, D.; Xia, M.; Yuan, J.; Zhao, J.; Palmer, M.; Munoz, B.; Schreiber, S.

ACS Med. Chem. Lett., 2012, 3, 112-117

Here, we describe the discovery of a novel antimalarial agent using phenotypic screening of Plasmodium falciparum asexual blood-stage parasites. Screening a novel compound collection created using diversity-oriented synthesis (DOS) led to the initial hit. Structure--activity relationships guided the synthesis of compounds having improved potency and water solubility, yielding a subnanomolar inhibitor of parasite asexual blood-stage growth. Optimized compound 27 has an excellent off-target activity profile in erythrocyte lysis and HepG2 assays and is stable in human plasma. This compound is available via the molecular libraries probe production centers network (MLPCN) and is designated ML238.

Exploiting Synthetic Lethality for the Therapy of ABC Diffuse Large B Cell Lymphoma

Yang, Y.; Shaffer, A.; Emre, N.C.T.; Ceribelli, M.; Zhang, M.; Wright, G.; Xiao, W.; Powell, J.; Platig, J.; Kohlhammer, H.; Young, R.; Zhao, H.; Yang, Y.; Xu, W.; Buggy, J.; Balasubramanian, S.; Mathews, L.; Shinn, P.; Guha, R.; Ferrer, M.; Thomas, C.; Waldmann, T.; Staudt, L.

Cancer Cell, 2012, 21, 723-737

Exploring Structure-Activity Data Using the Landscape Paradigm

Guha, R.

WIREs Comput. Mol. Sci., 2012, 2, 829-841

[ Abstract ] [DOI 10.1002/wcms.1087 ]

In this article, we present an overview of the origin and applications of the activity landscape view of structure--activity relationship (SAR) data as conceived by Shanmugasundaram and Maggiora. Within this landscape, different regions exemplify different aspects of SAR trends---ranging from smoothly varying trends to discontinuous trends (also termed activity cliffs). We discuss the various definitions of landscapes and cliffs that have been proposed as well as different approaches to the numerical quantification of a landscape. We then highlight some of the landscape visualization approaches that have been developed, followed by a review of the various applications of activity landscapes and cliffs to topics in medicinal chemistry and SAR analysis.

A 1536-well Quantitative High Throughput Screen to Identify Compounds Targeting Cancer Stem Cells

Mathews, L.A.; Keller, J.M.; Goodwin, B.; Guha, R.; Shinn, P.; Mull, R.; Thomas, C.; de Kluyver, R.; Sayers, T.; Ferrer, M.

J. Biomol. Screen., 2012, 17, 1231-1242

[ Abstract ] [DOI 10.1177/1087057112458152 ]

Tumor cell subpopulations called cancer stem cells (CSCs) or tumor-initiating cells (TICs) have self-renewal potential and are thought to drive metastasis and tumor formation. Data suggest that these cells are resistant to current chemotherapy and radiation therapy treatments, leading to cancer recurrence. Therefore, finding new drugs and/or drug combinations that cause death of both the differentiated tumor cells as well as CSC populations is a critical unmet medical need. Here, we describe how cancer-derived CSCs are generated from cancer cell lines using stem cell growth media and nonadherent conditions in quantities that enable high-throughput screening (HTS). A cell growth assay in a 1536-well microplate format was developed with these CSCs and used to screen a focused collection of oncology drugs and clinical candidates to find compounds that are cytotoxic against these highly aggressive cells. A hit selection process that included potency and efficacy measurements during the primary screen allowed us to efficiently identify compounds with potent cytotoxic effects against spheroid-derived CSCs. Overall, this research demonstrates one of the first miniaturized HTS assays using CSCs. The procedures described here should enable further testing of the effect of compounds on CSCs and help determine which pathways need to be targeted to kill them.

A Survey of Quantitative Descriptions of Molecular Structure

Guha, R.; Willighagen, E.L.

Curr. Topics Med. Chem., 2012, 12, 1946-1956

[ Abstract ] [DOI 10.2174/156802612804910278 ]

Numerical characterization of molecular structure is a first step in many computational analyses of chemical structure data. These numerical representations, termed descriptors, come in many forms, ranging from simple atom counts and invariants of the molecular graph to distributions of properties, such as charge, across a molecular surface. In this article we first present a broad categorization of descriptors and then describe applications and toolkits that can be employed to evaluate them. We highlight a number of issues surrounding molecular descriptor calculations such as versioning and reproducibility and describe how some toolkits have attempted to address these problems.

Chemical Genomic Profiling for Antimalarial Therapies, Response Signatures, and Molecular Targets

Yuan, J.; Cheng, K.Chih-Chien.; Johnson, R.L.; Huang, R.; Pattaradilokrat, S.; Liu, A.; Guha, R.; Fidock, D.A.; Inglese, J.; Wellems, T.E.; Austin, C.P.; Su, X.

Science, 2011, 333, 724-729

[ Abstract ] [DOI 10.1021/ci200081k ]

Malaria remains a devastating disease largely because of widespread drug resistance. New drugs and a better understanding of the mechanisms of drug action and resistance are essential for fulfilling the promise of eradicating malaria. Using high-throughput chemical screening and genome-wide association analysis, we identified 32 highly active compounds and genetic loci associated with differential chemical phenotypes (DCPs), defined as greater than or equal to fivefold differences in half-maximal inhibitory concentration (IC50) between parasite lines. Chromosomal loci associated with 49 DCPs were confirmed by linkage analysis and tests of genetically modified parasites, including three genes that were linked to 96% of the DCPs. Drugs whose responses mapped to wild-type or mutant pfcrt alleles were tested in combination in vitro and in vivo, which yielded promising new leads for antimalarial treatments.

KNIME Workflow to Assess PAINS Filters in SMARTS Format. Comparison of RDKit and Indigo Cheminformatics Libraries

Saubern, S.; Guha, R.; Baell, J.B.

Mol. Inf., 2011, 30, 847-850

[ Abstract ] [DOI 10.1021/ci200281v ]

Open Data, Open Source and Open Standards in Chemistry: The Blue Obelisk Five Years On

O'Boyle, N.; Guha, R.; Willighagen, E.; Adams, S.E.; Alvarsson, J.; Bradley, J.C.; Filippov, I.; Hanson, R.M.; Hanwell, M.D.; Hutchison, G.R.; James, C.A.; Jeliazkova, N.; Lang, A.; Langner, K.M.; Lonie, D.C.; Lowe, D.M.; Pansanel, J.; Pavlov, D.; Spjuth, O.; Steinbeck, C.; Tenderholt, A.; Theisen, K.; Murray-Rust, P.

J. Cheminf., 2011, 3, 37

[ Abstract ] [DOI 10.1186/1758-2946-3-37 ]

Background: The Blue Obelisk movement was established in 2005 as a response to the lack of Open Data, Open Standards and Open Source (ODOSOS) in chemistry. It aims to make it easier to carry out chemistry research by promoting interoperability between chemistry software, encouraging cooperation between Open Source developers, and developing community resources and Open Standards. Results: This contribution looks back on the work carried out by the Blue Obelisk in the past 5 years and surveys progress and remaining challenges in the areas of Open Data, Open Standards, and Open Source in chemistry. Conclusions: We show that the Blue Obelisk has been very successful in bringing together researchers and developers with common interests in ODOSOS, leading to development of many useful resources freely available to the chemistry community.

Exploratory Analysis of Kinetic Solubility Measurements of a Small Molecule Library

Guha, R.; Dexheimer, T.S.; Kestranek, A.N.; Jadhav, A.; Chervenak, A.M.; Ford, M.G.; Simeonov, A.; Roth, G.P.; Thomas, C.J.

Bioorg. Med. Chem., 2011, 19, 4127-4134

[ Abstract ] [DOI 10.1016/j.bmc.2011.05.005 ]

Kinetic solubility measurements using prototypical assay buffer conditions are presented for a ∼58,000 member library of small molecules. Analyses of the data based upon physical and calculated properties of each individual molecule were performed and resulting trends were considered in the context of commonly held opinions of how physicochemical properties influence aqueous solubility. We further analyze the data using a decision tree model for solubility prediction and via a multi-dimensional assessment of physicochemical relationships to solubility in the context of specific 'rule-breakers' relative to common dogma. The role of solubility as a determinant of assay outcome is also considered based upon each compound's cross-assay activity score for a collection of publicly available screening results. Further, the role of solubility as a governing factor for colloidal aggregation formation within a specified assay setting is examined and considered as a possible cause of a high cross-assay activity score. The results of this solubility profile should aid chemists during library design and optimization efforts and represent a useful training set for computational solubility prediction.

RNAi Screening Identifies TAK1 as a Potential Target for the Enhanced Efficacy of Topoisomerase Inhibitors

Martin, S.E.; Wu, Z.H.; Gehlhaus, K.Jones.; Zhang, Y.W.; Guha, R.; Miyamoto, S.; Pommier, Y.; Caplen, N.J.

Curr. Cancer Drug Targets, 2011, 11, 976-986

[ Abstract ] [DOI 10.1002/cmdc.201100179 ]

In an effort to develop strategies that improve the efficacy of existing anticancer agents, we have conducted a siRNA-based RNAi screen to identify genes that, when targeted by siRNA, improve the activity of the topoisomerase I (Top1) poison camptothecin (CPT). Screening was conducted using a set of siRNAs corresponding to over 400 apoptosis-related genes in MDA-MB-231 breast cancer cells. During the course of these studies, we identified the silencing of MAP3K7 as a significant enhancer of CPT activity. Follow-up analysis of caspase activity and caspase-dependent phosphorylation of histone H2AX demonstrated that the silencing of MAP3K7 enhanced CPT-associated apoptosis. Silencing MAP3K7 also sensitized cells to additional compounds, including CPT clinical analogs. This activity was not restricted to MDA-MB-231 cells, as the silencing of MAP3K7 also sensitized the breast cancer cell line MDA-MB-468 and HCT-116 colon cancer cells. However, MAP3K7 silencing did not affect compound activity in the comparatively normal mammary epithelial cell line MCF10A, as well as some additional tumorigenic lines. MAP3K7 encodes the TAK1 kinase, an enzyme that is central to the regulation of many processes associated with the growth of cancer cells (e.g. NF-kB, JNK, and p38 signaling). An analysis of TAK1 signaling pathway members revealed that the silencing of TAB2 also sensitizes MDA-MB-231 and HCT-116 cells towards CPT. These findings may offer avenues towards lowering the effective doses of Top1 inhibitors in cancer cells and, in doing so, broaden their application.

Discovery of New Antimalarial Chemotypes Through Chemical Methodology and Library Development

Brown, L.E.; Chih-Chien Cheng, K.; Wei, W.; Yuan, P.; Dai, P.; Trilles, R.; Ni, F.; Yuan, J.; MacArthur, R.; Guha, R.; Johnson, R.L.; Su, X.; Dominguez, M.M.; Snyder, J.K.; Beeler, A.B.; Schaus, S.E.; Inglese, J.; Porco, J.

Proc. Nat. Acad. Sci., 2011, 108, 6775-6780

[ Abstract ] [DOI 10.1073/pnas.1017666108 ]

In an effort to expand the stereochemical and structural complexity of chemical libraries used in drug discovery, the Center for Chemical Methodology and Library Development at Boston University has established an infrastructure to translate methodologies accessing diverse chemotypes into arrayed libraries for biological evaluation. In a collaborative effort, the NIH Chemical Genomics Center determined IC50 values for Plasmodium falciparum viability for each of 2,070 members of the CMLD-BU compound collection using quantitative high-throughput screening across five parasite lines of distinct geographic origin. Three compound classes displaying either differential or comprehensive antimalarial activity across the lines were identified, and the nascent structure-activity relationships (SAR) from this experiment were used to initiate optimization of these chemotypes for further development.

Improving Usability and Accessibility of Cheminformatics Tools for Chemists Through Cyberinfrastructure and Education

Guha, R.; Wiggins, G.D.; Wild, D.J.; Baik, M.H.; Pierce, M.E.; Fox, G.C.

Cheminformatics, 2010, in press

[ Abstract ] [DOI 10.3233/CI-2010-0015 ]

Advances in Cheminformatics Methodologies and Infrastructure to Support the Data Mining of Large, Heterogeneous Chemical Datasets

Guha, R.; Gilbert, K.; Fox, G.C.; Pierce, M.; Wild, D.; Yuan, H.

Curr. Comp. Aid. Drug Des., 2010, 6, 50-67

In recent years, there has been an explosion in the availability of publicly accessible chemical information, including chemical structures of small molecules, structure-derived properties and associated biological activities in a variety of assays. These data sources present us with a significant opportunity to develop and apply computational tools to extract and understand the underlying structure-activity relationships. Furthermore, by integrating chemical data sources with biological information (protein structure, gene expression and so on), we can attempt to build up a holistic view of the effects of small molecules in biological systems. Equally important is the ability for non-experts to access and utilize state of the art cheminformatics methods and models. In this review we present recent developments in cheminformatics methodologies and infrastructure that provide a robust, distributed approach to mining large and complex chemical datasets. In the area of methodology development, we highlight recent work on characterizing structure-activity landscapes, QSAR model domain applicability and the use of chemical similarity in text mining. In the area of infrastructure, we discuss a distributed web services framework that allows easy deployment and uniform access to computational (statistics, cheminformatics and computational chemistry) methods, data and models. We also discuss the development of PubChem derived databases and highlight techniques that allow us to scale the infrastructure to extremely large compound collections, by use of distributed processing on Grids. Given that the above work is applicable to arbitrary types of cheminformatics problems, we also present some case studies related to virtual screening for anti-malarials and predictions of anti-cancer activity.

Use of Genetic Algorithm and Neural Network Approaches for Risk Factor Selection: A Case Study of West Nile Virus Dynamics in an Urban Environment

Ghosh, D.; Guha, R.

Computers, Environment and Urban Systems, 2010, 34, 189-203

[ Abstract ] [DOI 10.1016/j.compenvurbsys.2010.02.007 ]

The West Nile virus (WNV) is an infectious disease spreading rapidly throughout the United States, causing illness among thousands of birds, animals, and humans. Yet, we only have a rudimentary understanding of how the mosquito-borne virus operates in complex avian--human environmental systems coupled with risk factors. The large array of multidimensional risk factors underlying WNV incidences spans environmental, built-environment, and socioeconomic factors as well as existing mosquito abatement policies. Therefore it is essential to identify an optimal number of risk factors whose management would result in effective disease prevention and containment. Previous models built to select important risk factors assumed a priori that there is a linear relationship between these risk factors and disease incidences. However, it is difficult for linear models to incorporate the complexity of the WNV transmission network and hence identify an optimal number of risk factors objectively. This paper has two objectives. The first is to use a combination of genetic algorithm (GA) and computational neural network (CNN) approaches to build a model incorporating the non-linearity between incidences and hypothesized risk factors. Here GA is used for risk factor (variable) selection and CNN for model building, mainly because of their ability to capture complex relationships with higher accuracy than linear models. The second objective is to propose a method to measure the relative importance of the selected risk factors included in the model. The study is situated in the metropolitan area of Minnesota, which has experienced significant outbreaks from 2002 to the present.

Towards Interoperable and Reproducible QSAR Analyses: Exchange of Datasets

Spjuth, O.; Willighagen, E.L.; Guha, R.; Eklund, M.; Wikberg, J.

J. Cheminf., 2010, 2, 5

[ Abstract ] [DOI 10.1186/1758-2946-2-5 ]

Background: QSAR is a widely used method to relate chemical structures to responses or properties based on experimental observations. Much effort has been made to evaluate and validate the statistical modeling in QSAR, but these analyses treat the dataset as fixed. An overlooked but highly important issue is the validation of the setup of the dataset, which comprises addition of chemical structures as well as selection of descriptors and software implementations prior to calculations. This process is hampered by the lack of standards and exchange formats in the field, making it virtually impossible to reproduce and validate analyses and drastically constraining collaborations and re-use of data. Results: We present a step towards standardizing QSAR analyses by defining interoperable and reproducible QSAR datasets, consisting of an open XML format (QSAR-ML) which builds on an open and extensible descriptor ontology. The ontology provides an extensible way of uniquely defining descriptors for use in QSAR experiments, and the exchange format supports multiple versioned implementations of these descriptors. Hence, a dataset described by QSAR-ML makes its setup completely reproducible. We also provide a reference implementation as a set of plugins for Bioclipse which simplifies setup of QSAR datasets, and allows for exporting in QSAR-ML as well as old-fashioned CSV formats. The implementation facilitates addition of new descriptor implementations from locally installed software and remote Web services; the latter is demonstrated with REST and XMPP Web services. Conclusions: Standardized QSAR datasets open up new ways to store, query, and exchange data for subsequent analyses. QSAR-ML supports completely reproducible creation of datasets, solving the problems of defining which software components were used and their versions, and the descriptor ontology eliminates confusions regarding descriptors by defining them crisply. This makes it easy to join, extend, combine datasets and hence work collectively, but also allows for analyzing the effect descriptors have on the statistical model's performance. The presented Bioclipse plugins equip scientists with graphical tools that make QSAR-ML easily accessible for the community.

PubChem as a Source of Polypharmacology

Chen, B.; Wild, D.; Guha, R.

J. Chem. Inf. Model., 2009, 49, 2044-2055

[ Abstract ] [DOI 10.1021/ci9001876 ]

Polypharmacology provides a new way to address the issue of high attrition rates arising from lack of efficacy and toxicity. However, the development of polypharmacology is hampered by incomplete SAR data and limited resources for validating target combinations. The PubChem bioassay collection, reporting the activity of compounds in multiple assays, allows us to study polypharmacological behavior in the PubChem collection via cross-assay analysis. In this paper, we developed a network representation of the assay collection and then applied a bipartite mapping between this network and various biological networks (i.e., PPI, pathway) as well as artificial networks (i.e., drug-target network). Mapping to a drug-target network allows us to prioritize new selective compounds, while mapping to other biological networks enables us to observe interesting target pairs and their associated compounds in the context of biological systems. Our results indicate this approach could be a useful way to investigate polypharmacology in the PubChem bioassay collection.

Chemoinformatic Analysis of Drugs, Natural Products, Molecular Libraries Small Molecule Repository and Combinatorial Libraries

Singh, N.; Guha, R.; Giulianotti, M.; Houghten, R.; Medina-Franco, J.L.

J. Chem. Inf. Model., 2009, 49, 1010-1024

[ Abstract ] [DOI 10.1021/ci800426u ]

A multiple-criteria approach is presented for the comparative analysis of four recently developed combinatorial libraries against drugs, the Molecular Libraries Small Molecule Repository (MLSMR), and natural products. The compound databases were assessed in terms of physicochemical properties, scaffolds, and fingerprints. The approach enables the analysis of property space coverage, degree of overlap between collections, scaffold and structural diversity, and overall structural novelty. The degree of overlap between combinatorial libraries and drugs was assessed using the R-NN curve methodology, which measures the density of chemical space around a query molecule embedded in the chemical space of a target collection. The combinatorial libraries studied in this work exhibit scaffolds that were not observed in the drug, MLSMR, and natural products databases. The fingerprint-based comparisons indicate that these combinatorial libraries are structurally different from current drugs. The R-NN curve methodology revealed that a proportion of the molecules in the combinatorial libraries are located within the property space of the drugs. However, the R-NN analysis also showed that a significant number of molecules in several combinatorial libraries are located in sparse regions of the drug space.

Navigating Structure Activity Landscapes

Bajorath, J.; Peltason, L.; Wawer, M.; Guha, R.; Lajiness, M.S.; Van Drie, J.H.

Drug Discov. Today, 2009, 14, 698-705

[ Abstract ] [DOI 10.1016/j.drudis.2009.04.003 ]

The problem of how to systematically explore structure-activity relationships (SARs) is still largely unsolved in medicinal chemistry. Recently, data analysis tools have been introduced to navigate activity landscapes and assess structure-activity relationships on a large scale. Initial investigations reveal a surprising heterogeneity among SARs and shed light on the relationship between 'global' and 'local' SAR features. Moreover, insights are provided into the fundamental issue of why modeling tools work well in some cases, but not in others.

Pharmacophore Representation and Searching

Guha, R.; Van Drie, J.H.

CDK News, 2008, ASAP

[ Abstract ] [ Link ]

In this article we describe the design and use of a set of Java classes to represent pharmacophores and use such representations in pharmacophore searching applications.

Assessing How Well a Modeling Protocol Captures a Structure-Activity Landscape

Guha, R.; Van Drie, J.H.

J. Chem. Inf. Model., 2008, 48, 1716-1728

[ Abstract ] [DOI 10.1021/ci8001414 ]

We introduce the notion of structure-activity landscape index (SALI) curves as a way to assess a model and a modeling protocol, applied to structure-activity relationships. We start from our earlier work [J. Chem. Inf. Model., 2008, 48, 646-658], where we show how to study a structure-activity relationship pairwise, based on the notion of "activity cliffs" - pairs of molecules that are structurally similar but have large differences in activity. There, we also introduced the SALI parameter, which allows one to identify cliffs easily, and which allows one to represent a structure-activity relationship as a graph. This graph orders every pair of molecules by their activity. Here, we introduce the new idea of a SALI curve, which tallies how many of these orderings a model is able to predict. Empirically, testing these SALI curves against a variety of models, ranging over two-dimensional quantitative structure-activity relationship (2D-QSAR), three-dimensional quantitative structure-activity relationship (3D-QSAR), and structure-based design models, the utility of a model seems to correspond to characteristics of these curves. In particular, the integral of these curves, denoted as SCI and being a number ranging from -1.0 to 1.0, approaches a value of 1.0 for two literature models, which are both known to be prospectively useful.

The Structure-Activity Landscape Index: Identifying and Quantifying Activity-Cliffs

Guha, R.; Van Drie, J.H.

J. Chem. Inf. Model., 2008, 48, 646-658

[ Abstract ] [DOI 10.1021/ci7004093 ]

A new method for analyzing a structure-activity relationship is proposed. By use of a simple quantitative index, one can readily identify "structure-activity cliffs": pairs of molecules which are most similar but have the largest change in activity. We show how this provides a graphical representation of the entire SAR, in a way that allows the salient features of the SAR to be quickly grasped. In addition, the approach allows us to view the SARs in a data set at different levels of detail. The method is tested on two data sets that highlight its ability to easily extract SAR information. Finally, we demonstrate that this method is robust using a variety of computational control experiments and discuss possible applications of this technique to QSAR model evaluation.
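
The abstract does not spell out the index itself. The following is a minimal sketch, assuming the commonly cited form SALI(i, j) = |A_i - A_j| / (1 - sim(i, j)) with Tanimoto similarity computed over fingerprint bit sets. The function names, fingerprints, and activity values are invented for illustration and are not taken from the paper.

```python
from itertools import combinations


def tanimoto(fp_a, fp_b):
    """Tanimoto similarity between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    return len(fp_a & fp_b) / union if union else 1.0


def sali(act_i, act_j, sim):
    """Assumed SALI form: activity difference divided by structural dissimilarity."""
    if sim >= 1.0:
        # Identical representations: the ratio is undefined, treat as infinite.
        return float("inf")
    return abs(act_i - act_j) / (1.0 - sim)


def rank_cliffs(fingerprints, activities):
    """Return (SALI, i, j) for every pair, largest SALI (steepest cliff) first."""
    pairs = []
    for i, j in combinations(range(len(activities)), 2):
        sim = tanimoto(fingerprints[i], fingerprints[j])
        pairs.append((sali(activities[i], activities[j], sim), i, j))
    return sorted(pairs, reverse=True)


# Hypothetical data: molecules 0 and 1 are similar but differ strongly in activity.
fps = [{1, 2, 3, 4}, {1, 2, 3, 5}, {7, 8, 9}]
acts = [5.0, 8.0, 5.5]
ranked = rank_cliffs(fps, acts)
```

Here the pair (0, 1) comes out on top because it combines high similarity with a large activity gap, which is the behavior the abstract describes for an activity cliff.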

A Flexible Web Service Infrastructure for the Development and Deployment of Predictive Models

Guha, R.

J. Chem. Inf. Model., 2008, 48, 456-464

[ Abstract ] [DOI 10.1021/ci700188u ]

The development of predictive statistical models is a common task in the field of drug design. The process of developing such models involves two main steps: building the model and then deploying the model. Traditionally such models have been deployed using web page interfaces. This approach restricts the user to the specified web page, and using the model in other ways can be cumbersome. In this paper we present a flexible and generalizable approach to the deployment of predictive models, based on a web service infrastructure using R. The infrastructure described allows one to access the functionality of these models using a variety of approaches ranging from web pages to workflow tools. We highlight the advantages of this infrastructure by developing and subsequently deploying random forest models for two datasets.

On the Interpretation and Interpretability of QSAR Models

Guha, R.

J. Comp. Aid. Molec. Des., 2008, 22, 857-871

[ Abstract ] [DOI 10.1007/s10822-008-9240-5 ]

The goal of a quantitative structure-activity relationship (QSAR) model is to encode the relationship between molecular structure and biological activity or physical property. Based on this encoding, such models can be used for predictive purposes. Assuming the use of relevant and meaningful descriptors, and a statistically significant model, extraction of the encoded structure-activity relationships (SARs) can provide insight into what makes a molecule active or inactive. Such analyses by QSAR models are useful in a number of scenarios, such as suggesting structural modifications to enhance activity, explanation of outliers and exploratory analysis of novel SARs. In this paper we discuss the need for interpretation and give an overview of the factors that affect interpretability of QSAR models. We then describe interpretation protocols for different types of models, highlighting the different types of interpretations, ranging from very broad, global trends to very specific, case-by-case descriptions of the SAR, using examples from the training set. Finally, we discuss a number of case studies where workers have provided some form of interpretation of a QSAR model.

Utilizing High Throughput Screening Data for Predictive Toxicology Models: Protocols and Application to MLSCN Assays

Guha, R.; Schürer, S.C.

J. Comp. Aid. Molec. Des., 2008, 22, 367-384

[ Abstract ] [DOI 10.1007/s10822-008-9192-9 ]

Computational toxicology is emerging as an encouraging alternative to experimental testing. The Molecular Libraries Screening Center Network (MLSCN) as part of the NIH Molecular Libraries Roadmap has recently started generating large and diverse screening datasets, which are publicly available in PubChem. In this report, we investigate various aspects of developing computational models to predict cell toxicity based on cell proliferation screening data generated in the MLSCN. By capturing feature-based information in those datasets, such predictive models would be useful in evaluating cell-based screening results in general (for example from reporter assays) and could be used as an aid to identify and eliminate potentially undesired compounds. Specifically we present the results of random forest ensemble models developed using different cell proliferation datasets and highlight protocols to take into account their extremely imbalanced nature. Depending on the nature of the datasets and the descriptors employed we were able to achieve percentage correct classification rates between 70% and 85% on the prediction set, though the accuracy rate dropped significantly when the models were applied to in vivo data. In this context we also compare the MLSCN cell proliferation results with animal acute toxicity data to investigate to what extent animal toxicity can be correlated and potentially predicted by proliferation results. Finally, we present a visualization technique that allows one to compare a new dataset to the training set of the models to decide whether the new dataset may be reliably predicted.

Userscripts for the Life Sciences

Willighagen, E.L.; O'Boyle, N.; Gopalakrishnan, H.; Jiao, D.; Guha, R.; Steinbeck, C.; Wild, D.J.

BMC Bioinformatics, 2007, 8, 487

[ Abstract ] [DOI 10.1186/1471-2105-8-487 ]

The web has seen an explosion of chemistry and biology related resources in the last 15 years: thousands of scientific journals, databases, wikis, blogs and resources are available with a wide variety of types of information. There is a huge need to aggregate and organise this information. However, the sheer number of resources makes it unrealistic to link them all in a centralised manner. Instead, search engines to find information in those resources flourish, and formal languages like Resource Description Framework and Web Ontology Language are increasingly used to allow linking of resources. A recent development is the use of userscripts to change the appearance of web pages, by on-the-fly modification of the web content. This opens possibilities to aggregate information and computational results from different web resources into the web page of one of those resources.

Chemical Data Mining of the NCI Human Tumor Cell Line Database

Wang, H.; Klinginsmith, J.; Dong, X.; Lee, A.; Guha, R.; Wu, Y.; Crippen, G.; Wild, D.J.

J. Chem. Inf. Model., 2007, 47, 2063-2076

[ Abstract ] [DOI 10.1021/ci700141x ]

The NCI Developmental Therapeutics Program Human Tumor cell line data set is a publicly available database that contains cellular assay screening data for over 40 000 compounds tested in 60 human tumor cell lines. The database also contains microarray assay gene expression data for the cell lines, and so it provides an excellent information resource particularly for testing data mining methods that bridge chemical, biological, and genomic information. In this paper we describe a formal knowledge discovery approach to characterizing and data mining this set and report the results of some of our initial experiments in mining the set from a chemoinformatics perspective.

Counting Clusters Using R-NN Curves

Guha, R.; Dutta, D.; Chen, T.; Wild, D.J.

J. Chem. Inf. Model., 2007, 47, 1308-1318

[ Abstract ] [DOI 10.1021/ci600541f ]

Clustering is a common task in the field of cheminformatics. A key parameter that needs to be set for non-hierarchical clustering methods, such as k-means, is the number of clusters, k. Traditionally the value of k is obtained by performing the clustering with different values of k and selecting the value that leads to the optimal clustering. In this study we describe an approach to selecting k, a priori, based on the R-NN curve algorithm described by Guha et al. (J. Chem. Inf. Model., 2006, 46, 1713-1722), which uses a nearest neighbor technique to characterize the spatial location of compounds in arbitrary descriptor spaces. The algorithm generates a set of curves for the dataset which are then analyzed to estimate the natural number of clusters. We then performed k-means clustering with the predicted value of k as well as with similar values to check that the correct number of clusters was obtained. In addition we compared the predicted value to the number indicated by the average silhouette width as a cluster quality measure. We tested the algorithm on simulated data as well as on two chemical datasets. Our results indicate that the R-NN curve algorithm is able to determine the natural number of clusters and is in general agreement with the average silhouette width in identifying the optimal number of clusters.

A Web Service Infrastructure for Chemoinformatics

Dong, X.; Gilbert, K.; Guha, R.; Heiland, R.; Kim, J.; Pierce, M.; Fox, G.; Wild, D.J.

J. Chem. Inf. Model., 2007, 47, 1303-1307

[ Abstract ] [DOI 10.1021/ci6004349 ]

The vast increase of pertinent information available to drug discovery scientists means that there is strong demand for tools and techniques for organizing and intelligently mining this information for manageable human consumption. At Indiana University, we have developed an infrastructure of chemoinformatics web services that simplify the access to this information and the computational techniques that can be applied to it. In this paper, we describe this infrastructure, give some examples of its use, and then discuss our plans to use it as a platform for chemoinformatics application development in the future.

Ensemble Feature Selection: Consistent Descriptor Subsets for Multiple QSAR Models

Dutta, D.; Guha, R.; Chen, T.; Wild, D.J.

J. Chem. Inf. Model., 2007, 47, 989-997

[ Abstract ] [DOI 10.1021/ci600563w ]

Selecting a small subset of descriptors from a large pool to build a predictive QSAR model is an important step in the QSAR modeling process. In general subset selection is very hard to solve, even approximately, with guaranteed performance bounds. Traditional approaches employ deterministic or stochastic methods to obtain a descriptor subset that leads to an optimal model of a single type (such as linear regression or a neural network). With the development of ensemble modeling approaches, multiple models of differing types are individually developed resulting in different descriptor subsets for each model type. However it is advantageous, from the point of view of developing interpretable QSAR models, to have a single set of descriptors that can be used for different model types. In this paper, we describe an approach to the selection of a single, optimal, subset of descriptors for multiple model types. We apply this approach to three datasets, covering both regression and classification, and show that the constraint of forcing different model types to use the same set of descriptors does not lead to a significant loss in predictive ability for the individual models considered. In addition, interpretations of the individual models developed using this approach indicate that they encode similar structure-activity trends.

Chemical Informatics Functionality in R

Guha, R.

J. Stat. Soft., 2007, 18,

[ Abstract ] [ Link ]

The flexibility and scope of the R programming environment have made it a popular choice for statistical modeling and scientific prototyping in a number of fields. In the field of chemistry, R provides several tools for a variety of problems related to statistical modeling of chemical information. However, one aspect common to these tools is that they do not have direct access to the information that is available from chemical structures, such as that contained in molecular descriptors. We describe the rcdk package that provides the R user with access to the CDK, a Java framework for cheminformatics. As a result, it is possible to read in a variety of molecular formats, calculate molecular descriptors and evaluate fingerprints. In addition, we describe the rpubchem package, which allows access to the data in PubChem, a public repository of molecular structures and associated assay data for approximately 8 million compounds. Currently the package allows access to structural information as well as some simple molecular properties from PubChem. In addition the package allows access to bio-assay data from the PubChem FTP servers.

Local Lazy Regression: Making Use of the Neighborhood to Improve QSAR Predictions.

Guha, R.; Dutta, D.; Jurs, P.C.; Chen, T.

J. Chem. Inf. Model., 2006, 46, 1836-1847

[ Abstract ] [DOI 10.1021/ci060064e ]

Traditional quantitative structure-activity relationship (QSAR) models aim to capture global structure-activity trends present in a data set. In many situations, there may be groups of molecules which exhibit a specific set of features which relate to their activity or inactivity. Such a group of features can be said to represent a local structure-activity relationship. Traditional QSAR models may not recognize such local relationships. In this work, we investigate the use of local lazy regression (LLR), which obtains a prediction for a query molecule using its local neighborhood, rather than considering the whole data set. This modeling approach is especially useful for very large data sets because no a priori model need be built. We applied the technique to three biological data sets. In the first case, the root-mean-square error (RMSE) for an external prediction set was 0.94 log units versus 0.92 log units for the global model. However, LLR was able to characterize a specific group of anomalous molecules with much better accuracy (0.64 log units versus 0.70 log units for the global model). For the second data set, the LLR technique resulted in a decrease in RMSE from 0.36 log units to 0.31 log units for the external prediction set. In the third case, we obtained an RMSE of 2.01 log units versus 2.16 log units for the global model. In all cases, LLR led to a few observations being poorly predicted compared to the global model. We present an analysis of why this was observed and possible improvements to the local regression approach.
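
The idea of predicting from a query's local neighborhood rather than from one global model can be sketched as follows. This is a simplified illustration, not the authors' implementation: it assumes a single numeric descriptor, a fixed neighbor count `k`, and an ordinary least-squares line fitted to the neighbors. All names and data values are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        # All neighbors share one descriptor value: fall back to their mean activity.
        return 0.0, mean_y
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / sxx
    return slope, mean_y - slope * mean_x


def llr_predict(train_x, train_y, query_x, k=3):
    """Fit a line to the k training points nearest the query and predict from it."""
    nearest = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - query_x))[:k]
    slope, intercept = fit_line(
        [train_x[i] for i in nearest], [train_y[i] for i in nearest]
    )
    return slope * query_x + intercept


# Hypothetical data with two local trends: y = 2x near 0 and y = 50 - 3x near 10.
train_x = [0, 1, 2, 3, 10, 11, 12, 13]
train_y = [0, 2, 4, 6, 20, 17, 14, 11]
low = llr_predict(train_x, train_y, 1.5)
high = llr_predict(train_x, train_y, 11.5)
```

No model is built until a query arrives, which is why the abstract notes that the approach suits very large data sets. A single global line fitted to all eight points would capture neither local trend.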

R-NN Curves: An Intuitive Approach to Outlier Detection Using a Distance Based Method

Guha, R.; Dutta, D.; Jurs, P.C.; Chen, T.

J. Chem. Inf. Model., 2006, 46, 1713-1722

[ Abstract ] [DOI 10.1021/ci060013h ]

Libraries of chemical structures are used in a variety of cheminformatics tasks such as virtual screening and QSAR modeling and are generally characterized using molecular descriptors. When working with libraries it is useful to understand the distribution of compounds in the space defined by a set of descriptors. We present a simple approach to the analysis of the spatial distribution of the compounds in a library in general and outlier detection in particular based on counts of neighbors within a series of increasing radii. The resultant curves, termed R-NN curves, appear to follow a logistic model for any given descriptor space, which we justify theoretically for the 2D case. The method can be applied to data sets of arbitrary dimensions. The R-NN curves provide a visual method to easily detect compounds lying in a sparse region of a given descriptor space. We also present a method to numerically characterize the R-NN curves thus allowing identification of outliers in a single plot.
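
The neighbor-counting step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions: Euclidean distance, radii expressed as fractions of the largest pairwise distance, and the mean of each curve used as a crude one-number summary. That summary is a stand-in chosen for this example, not the numerical characterization from the paper, and the data points are invented.

```python
import math


def rnn_curves(points, n_steps=100):
    """For each point, the fraction of other points within each of n_steps radii."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]
    d_max = max(max(row) for row in dist)
    curves = []
    for i in range(n):
        curve = []
        for step in range(1, n_steps + 1):
            # Radii grow from 1/n_steps of the largest pairwise distance up to all of it.
            radius = d_max * (step / n_steps)
            count = sum(1 for j in range(n) if j != i and dist[i][j] <= radius)
            curve.append(count / (n - 1))
        curves.append(curve)
    return curves


# Hypothetical 2D descriptor space: four clustered points and one isolated point.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (9, 9)]
curves = rnn_curves(points)
# Crude summary: a curve that rises late (low mean) marks a point in a sparse region.
scores = [sum(curve) / len(curve) for curve in curves]
most_isolated = scores.index(min(scores))
```

For the clustered points the curve rises almost immediately, while the curve of the isolated point stays at zero until the radius is large, which is the visual signature of an outlier that the abstract refers to.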

The Blue Obelisk--Interoperability in Chemical Informatics.

Guha, R.; Howard, M.T.; Hutchison, G.R.; Murray-Rust, P.; Rzepa, H.; Steinbeck, C.; Wegner, J.; Willighagen, E.L.

J. Chem. Inf. Model., 2006, 46, 991-998

[ Abstract ] [DOI 10.1021/ci050400b ]

The Blue Obelisk Movement (http://www.blueobelisk.org/) is the name used by a diverse Internet group promoting reusable chemistry via open source software development, consistent and complementary chemoinformatics research, open data, and open standards. We outline recent examples of cooperation in the Blue Obelisk group: a shared dictionary of algorithms and implementations in chemoinformatics algorithms drawing from our various software projects; a shared repository of chemoinformatics data including elemental properties, atomic radii, isotopes, atom typing rules, and so forth; and Web services for the platform-independent use of chemoinformatics programs.

Scalable Partitioning and Exploration of Chemical Spaces using Geometric Hashing

Dutta, D.; Guha, R.; Jurs, P.C.; Chen, T.

J. Chem. Inf. Model., 2006, 46, 321-333

[ Abstract ] [DOI 10.1021/ci050403o ]

Virtual screening (VS) has become a preferred tool to augment high-throughput screening and determine new leads in the drug discovery process. The core of a VS informatics pipeline includes several data mining algorithms that work on huge databases of chemical compounds containing millions of molecular structures and their associated data. Thus, scaling traditional applications such as classification, partitioning, and outlier detection for huge chemical data sets without a significant loss in accuracy is very important. In this paper, we introduce a data mining framework built on top of a recently developed fast approximate nearest-neighbor-finding algorithm called locality-sensitive hashing (LSH) that can be used to mine huge chemical spaces in a scalable fashion using very modest computational resources. The core LSH algorithm hashes chemical descriptors so that points close to each other in the descriptor space are also close to each other in the hashed space. Using this data structure, one can perform approximate nearest-neighbor searches very quickly, in sublinear time. We validate the accuracy and performance of our framework on three real data sets of sizes ranging from 4337 to 249,071 molecules. Results indicate that the identification of nearest neighbors using the LSH algorithm is at least 2 orders of magnitude faster than the traditional k-nearest-neighbor method and is over 94% accurate for most query parameters. Furthermore, when viewed as a data-partitioning procedure, the LSH algorithm lends itself to easy parallelization of nearest-neighbor classification or regression. We also apply our framework to detect outlying (diverse) compounds in a given chemical space; this algorithm is extremely rapid in determining whether a compound is located in a sparse region of chemical space or not, and it is quite accurate when compared to results obtained using principal-component-analysis-based heuristics.
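
To make the hashing idea concrete, here is a minimal sketch of one common LSH scheme for Euclidean space, in which each hash is a quantized random projection, floor((a . v + b) / w). Whether the paper uses exactly this scheme is an assumption. The class name `LSHIndex`, its parameters, and the descriptor vectors are hypothetical, and a practical index would use several hash tables rather than one.

```python
import math
import random
from collections import defaultdict


class LSHIndex:
    """Single-table LSH index over descriptor vectors (illustrative only)."""

    def __init__(self, dim, n_hashes=4, width=4.0, seed=0):
        rng = random.Random(seed)
        self.width = width
        # Each hash function is a random Gaussian direction plus a random offset.
        self.projections = [
            ([rng.gauss(0.0, 1.0) for _ in range(dim)], rng.uniform(0.0, width))
            for _ in range(n_hashes)
        ]
        self.buckets = defaultdict(list)

    def _key(self, vector):
        """Concatenate the quantized projections into one bucket key."""
        return tuple(
            math.floor(
                (sum(a * x for a, x in zip(direction, vector)) + offset) / self.width
            )
            for direction, offset in self.projections
        )

    def add(self, label, vector):
        self.buckets[self._key(vector)].append((label, vector))

    def query(self, vector):
        """Approximate nearest neighbor: search only the query's own bucket."""
        candidates = self.buckets.get(self._key(vector), [])
        return min(candidates, key=lambda item: math.dist(item[1], vector), default=None)


index = LSHIndex(dim=2)
index.add("mol_a", (0.0, 0.0))
index.add("mol_b", (100.0, 100.0))
hit = index.query((0.0, 0.0))
```

Because only one bucket is scanned, a query touches a small fraction of the data set. The price is that a true nearest neighbor falling in a different bucket is missed, which is why the abstract reports accuracy alongside speed.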

Generating, Using and Visualizing Molecular Information in R

Guha, R.

R News, 2006, 3, 28-33

[ Abstract ] [ Link ]

Validation of the CDK Surface Area Routine

Guha, R.

CDK News, 2006, 3, 5-9

Recent Developments of the Chemistry Development Kit (CDK) - An Open-Source Java Library for Chemo- and Bioinformatics

Steinbeck, C.; Hoppe, C.; Kuhn, S.; Floris, M.; Guha, R.; Willighagen, E.L.

Curr. Pharm. Des., 2006, 12, 2110-2120

[ Abstract ] [DOI 10.2174/138161206777585274 ]

The Chemistry Development Kit (CDK) provides methods for common tasks in molecular informatics, including 2D and 3D rendering of chemical structures, I/O routines, SMILES parsing and generation, ring searches, isomorphism checking, structure diagram generation, etc. Implemented in Java, it is used both for server-side computational services, possibly equipped with a web interface, as well as for applications and client-side applets. This article introduces the CDK's new QSAR capabilities and the recently introduced interface to statistical software.

Interpreting Computational Neural Network QSAR Models: A Detailed Interpretation of the Weights and Biases

Guha, R.; Stanton, D.T.; Jurs, P.C.

J. Chem. Inf. Model., 2005, 45, 1109-1121

[ Abstract ] [DOI 10.1021/ci050110v ]

Interpreting Computational Neural Network QSAR Models: A Measure of Descriptor Importance

Guha, R.; Jurs, P.C.

J. Chem. Inf. Model., 2005, 45, 800-806

[ Abstract ] [DOI 10.1021/ci050022a ]

We present a method to measure the relative importance of the descriptors present in a QSAR model developed with a computational neural network (CNN). The approach is based on a sensitivity analysis of the descriptors. We tested the method on three published data sets for which linear and CNN models were previously built. The original work reported interpretations for the linear models, and we compare the results of the new method to the importance of descriptors in the linear models as described by a PLS technique. The results indicate that the proposed method is able to rank descriptors such that important descriptors in the CNN model correspond to the important descriptors in the linear model.

Determining the Validity of a QSAR Model--A Classification Approach

Guha, R.; Jurs, P.C.

J. Chem. Inf. Model., 2005, 45, 65-73

[ Abstract ] [DOI 10.1021/ci0497511 ]

The determination of the validity of a QSAR model when applied to new compounds is an important concern in the field of QSAR and QSPR modeling. Various scoring techniques can be applied to specific types of models. We present a technique with which we can state whether a new compound will be well predicted by a previously built QSAR model. In this study we focus on linear regression models only, though the technique is general and could also be applied to other types of quantitative models. Our technique is based on a classification method that divides regression residuals from a previously generated model into a good class and bad class and then builds a classifier based on this division. The trained classifier is then used to determine the class of the residual for a new compound. We investigated the performance of a variety of classifiers, both linear and nonlinear. The technique was tested on two data sets from the literature and a hand built data set. The data sets selected covered both physical and biological properties and also presented the methodology with quantitative regression models of varying quality. The results indicate that this technique can determine whether a new compound will be well or poorly predicted with weighted success rates ranging from 73% to 94% for the best classifier.

Using R to Provide Statistical Functionality for QSAR Modeling in CDK

Guha, R.

CDK News, 2005, 2, 7-13

[ Abstract ] [ Link ]

Using the CDK as a Backend to R

Guha, R.

CDK News, 2005, 2, 2-6

[ Abstract ] [ Link ]

Development of Linear, Ensemble, and Nonlinear Models for the Prediction and Interpretation of the Biological Activity of a Set of PDGFR Inhibitors.

Guha, R.; Jurs, P.C.

J. Chem. Inf. Comput. Sci., 2004, 44, 2179-2189

[ Abstract ] [DOI 10.1021/ci049849f ]

A QSAR modeling study has been done with a set of 79 piperazinylquinazoline analogues which exhibit PDGFR inhibition. Linear regression and nonlinear computational neural network models were developed. The regression model was developed with a focus on interpretative ability using a PLS technique. However, it also exhibits a good predictive ability after outlier removal. The nonlinear CNN model had superior predictive ability compared to the linear model with a training set error of 0.22 log(IC50) units (R² = 0.93) and a prediction set error of 0.32 log(IC50) units (R² = 0.61). A random forest model was also developed to provide an alternate measure of descriptor importance. This approach ranks descriptors, and its results confirm the importance of specific descriptors as characterized by the PLS technique. In addition the neural network model contains the two most important descriptors indicated by the random forest model.

The Development of QSAR Models To Predict and Interpret the Biological Activity of Artemisinin Analogues

Guha, R.; Jurs, P.C.

J. Chem. Inf. Comput. Sci., 2004, 44, 1440-1449

[ Abstract ] [DOI 10.1021/ci0499469 ]

This work presents the development of Quantitative Structure-Activity Relationship (QSAR) models to predict the biological activity of 179 artemisinin analogues. The structures of the molecules are represented by chemical descriptors that encode topological, geometric, and electronic structure features. Both linear (multiple linear regression) and nonlinear (computational neural network) models are developed to link the structures to their reported biological activity. The best linear model was subjected to a PLS analysis to provide model interpretability. While the best linear model does not perform as well as the nonlinear model in terms of predictive ability, the application of PLS analysis allows for a sound physical interpretation of the structure-activity trend captured by the model. On the other hand, the best nonlinear model is superior in terms of pure predictive ability, having a training error of 0.47 log RA units (R² = 0.96) and a prediction error of 0.76 log RA units (R² = 0.88).

Generation of QSAR Sets with a Self-Organizing Map.

Guha, R.; Serra, J.R.; Jurs, P.C.

J. Mol. Graph. Model., 2004, 23, 1-14

[ Abstract ] [DOI 10.1016/j.jmgm.2004.03.003 ]

A Kohonen self-organizing map (SOM) is used to classify a data set consisting of dihydrofolate reductase inhibitors with the help of an external set of Dragon descriptors. The resultant classification is used to generate training, cross-validation (CV) and prediction sets for QSAR modeling using the ADAPT methodology. The results are compared to those of QSAR models generated using sets created by activity binning and a sphere exclusion method. The results indicate that the SOM is able to generate QSAR sets that are representative of the composition of the overall data set in terms of similarity. The resulting QSAR models are half the size of those published and have comparable RMS errors. Furthermore, the RMS errors of the QSAR sets are consistent, indicating good predictive capabilities as well as generalizability.