R Packages & Functions

A collection of R packages and functions that I wrote for various things that have arisen during my research, or simply while I was bored.

Writing JCAMP-DX files | VOTables to data.frame | Object Browser | Spectral Clustering | Removing Correlated Columns | Removing Constant or Near-Constant Columns | Parsing a call Object | Correlation-Based Feature Selection | Entropy Functions | 2D Numerical Integration | Neighborhood Behavior | Rescaling a Vector | Binary Fingerprint Tools | Levenshtein Distance | Fractal Dimension | Jarvis Patrick Clustering | Binary QSAR | kNN Regression | Stochastic Proximity Embedding | Integrating R & the CDK


Writing JCAMP-DX files
As part of a project I was trying to visualize spectrum-like data, and rather than using static plots I wanted to view them in JSpecView. Since JSpecView only reads the JCAMP-DX format, the following function writes a spectrum (i.e., X,Y data) in this format. The current code is quite limited and assumes equispaced X values. The xydata argument of the bspec.write function should be a matrix or data.frame with two columns.
.chunklist <- function(vec, chunksize) {
  # Split a vector into a list of chunks of length chunksize; the final
  # chunk may be shorter
  ret <- list()
  counter <- 1
  idx <- seq(1, length(vec), by=chunksize)
  lastidx <- idx[length(idx)]
  for (i in idx) {
    if (i == lastidx && i+chunksize-1 > length(vec)) {
      ret[[counter]] <- vec[i:length(vec)]
    } else {
      ret[[counter]] <- vec[i:(i+chunksize-1)]
    }
    counter <- counter + 1
  }
  ret
}

bspec.write <- function(filename, xydata, name) {
  con <- file(description=filename, open='w')
  writeLines(paste('##TITLE= ', name, sep=''), con=con)
  writeLines('##JCAMP-DX= 4.24\n##DATA TYPE= UV/VIS SPECTRUM\n##DATA CLASS= XYDATA', con=con)
  writeLines('##XLABEL= Bit Position\n##YLABEL= Normalized Frequency', con=con)
  writeLines('##XFACTOR= 1.0\n##YFACTOR= 1.0', con=con)
  ## take the X and Y ranges from the data rather than hardcoding them
  writeLines(paste('##FIRSTX= ', xydata[1,1], '\n##LASTX= ', xydata[nrow(xydata),1], sep=''), con=con)
  writeLines(paste('##MAXY= ', max(xydata[,2]), '\n##MINY= ', min(xydata[,2]), sep=''), con=con)
  writeLines(paste('##NPOINTS= ', nrow(xydata), sep=''), con=con)
  writeLines('##XYDATA= (X++(Y..Y))', con=con)

  ## write the Y values 10 to a line, each line prefixed by the X value
  ## of its first point
  chunks <- .chunklist(round(xydata[,2],2), 10)
  i <- 1
  for (chunk in chunks) {
    writeLines(paste(xydata[i,1], paste(chunk, sep=' ', collapse=' '), sep=' '), con=con)
    i <- i + 10
  }

  writeLines('##END=', con=con)
  close(con)
}
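As a usage sketch (the filename and data are hypothetical), assuming both functions above have been sourced:
x <- 1:100
y <- runif(100)
bspec.write('spectrum.jdx', cbind(x, y), 'Test spectrum')
The resulting file can then be loaded into JSpecView.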
VOTables to data.frame
The VOTable format is an XML encoding for tabular data. The file votables.R provides a function to read a VOTable encoded XML document and create one or more data.frames from it. The syntax of the function is
vot2df(doc, isURL=FALSE, isString=FALSE)
If isString is TRUE, then the contents of doc is considered to be a string containing the XML document. Otherwise it is considered to be a filename or a URL, depending on the value of isURL. The return value is a single data.frame object if the document contains a single table entry, or a list of data.frames if there is more than one table in the document. For each table, an attribute called description is added to the resultant data.frame, containing the value of the DESCRIPTION element for that table in the document, if it exists.

Currently it will only handle tabular data corresponding to the TABLEDATA element of the VOTable schema. In addition, it will handle the following primitive data types and convert them to R types.

Update (10/20/2007): The reader can now handle the case where cells are empty. The corresponding data.frame will contain an NA at those positions.

VOTable type        R type

double, float       double
long, short, int    integer
char                character
boolean             logical
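A brief usage sketch (the filenames and URL are placeholders):
source('votables.R')
df <- vot2df('mytable.xml')                               # local file
df <- vot2df('http://example.com/table.xml', isURL=TRUE)  # remote document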
Object Browser
In many cases it's useful to have a quick view of the various objects (variables or functions or both) present in the workspace. Useful information could include the class of each object as well as any associated comments.

obj.R provides a utility function which brings up a graphical table listing all the variables (or functions, or both) in the current workspace. That is, it considers env = .GlobalEnv. A screenshot can be seen here

The code requires that you have Tcl/Tk as well as the Tktable widget installed on your system. To use the function, called obj.browse, simply source obj.R. The syntax of the function is

obj.browse(which=c('all','funcs','vars'), height=-1, width=-1)

which='all'   - Display all variables and functions. In this case the rows are 
                sorted on the basis of the 'Class' column
which='funcs' - Display only functions
which='vars'  - Display only variables

height=-1     - How many rows to display. Default is to use the Tktable default 
width=-1      - How many columns to display. Default is to use 3
Note: For Fedora Core 4 you can simply do
yum install tktable
to get the required widget. For Windows users, you'll need to install ActiveTcl, available from here
Spectral Clustering
A number of algorithms are available for spectral clustering, and an implementation is available in the kernlab package, which defines a number of kernels for evaluating the affinity matrix.

Fischer et al. (Fischer, I.; Poland, J., Amplifying the Block Matrix Structure for Spectral Clustering, in M. van Otterlo et al. (Eds.), Proceedings of the 14th Annual Machine Learning Conference of Belgium and the Netherlands, pp. 21-28, 2005) proposed an alternative method that amplifies the block structure of an affinity matrix by using asymmetric affinity matrices generated from Gaussian kernels. They then evaluate the Laplacian or a conductivity matrix, which is then decomposed using the SVD.

specclusAS is an R package that is a translation of the original Matlab code written by Fischer et al. It has functions to perform both symmetric and asymmetric spectral clustering, and also provides access to functions that compute an asymmetric affinity matrix as well as the conductivity matrix.
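For orientation, here is a minimal, generic sketch of the symmetric spectral clustering pipeline (Gaussian affinity, normalized Laplacian, top eigenvectors, k-means), in the style of Ng et al.; it is not the specclusAS code, and the asymmetric variant differs chiefly in how the affinity matrix is constructed:
spectral.sketch <- function(x, k, sigma=1) {
  d <- as.matrix(dist(x))        # pairwise Euclidean distances
  A <- exp(-d^2 / (2*sigma^2))   # Gaussian affinity matrix
  diag(A) <- 0
  Dhalf <- diag(1/sqrt(rowSums(A)))
  L <- Dhalf %*% A %*% Dhalf     # normalized Laplacian
  V <- eigen(L, symmetric=TRUE)$vectors[, 1:k]  # top k eigenvectors
  V <- V / sqrt(rowSums(V^2))    # normalize each row to unit length
  kmeans(V, centers=k)$cluster
}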

Removing Correlated Columns
A quick method to remove random members of correlated pairs of columns of a matrix or data.frame was suggested to me by Andy Liaw on the R-help mailing list. Below is a function that returns the indices of the columns to keep, such that all entries of the correlation matrix of the resultant matrix lie below a user-specified value of R2.
r2test <- function(d, cutoff=0.8) {
  if (cutoff > 1 || cutoff <= 0) {
    stop("cutoff must satisfy 0 < cutoff <= 1")
  }
  if (!is.matrix(d) && !is.data.frame(d)) {
    stop("Must supply a data.frame or matrix")
  }
  r2cut <- sqrt(cutoff)   # cutoff is on R^2, cor() returns R
  cormat <- cor(d)
  # indices of the correlated pairs, keeping each pair only once
  bad.idx <- which(abs(cormat) > r2cut, arr.ind=TRUE)
  bad.idx <- matrix(bad.idx[bad.idx[,1] > bad.idx[,2]], ncol=2)
  # randomly choose one member of each pair to drop
  drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[,1], bad.idx[,2])
  if (length(drop.idx) == 0) {
      1:ncol(d)
  } else {
      (1:ncol(d))[-unique(drop.idx)]
  }
}
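A quick usage sketch with random (hypothetical) data:
d <- as.data.frame(matrix(runif(100*20), ncol=20))
keep <- r2test(d, cutoff=0.8)
d.reduced <- d[, keep]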
Removing Constant or Near-Constant Columns
A common task before modeling data is to remove columns from a matrix or data.frame that have a constant, or near-constant, value for a certain percentage of the observations. A quick way to get rid of such columns is given below
d <- data.frame(...)
dropidx <- apply(d, 2, function(x) { length(which(x == 0))/length(x) > .8 })
d <- d[, !dropidx]
The above code will remove those columns in which more than 80% of the entries are zero.
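The snippet only tests for zeros. A slightly more general variant (a sketch, using the same 80% threshold) drops any column whose most frequent value dominates:
# drop columns where the single most common value accounts for > 80% of entries
dropidx <- apply(d, 2, function(x) max(table(x)) / length(x) > .8)
d <- d[, !dropidx]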
Parsing a call Object
R provides a number of functions to determine how a function was called, or the names of a function's parameters. However, I did not find a method to determine the values associated with a function's parameters. More specifically, I wanted to know what was supplied as the value of, say, the parameter x. If this value was a single number, say 5, I would like to know that the value of the parameter was 5. If, however, the parameter was specified by some other function or variable, I want to know the name of that variable or function, rather than its value.

parseCall() performs this job by parsing a call object to determine parameter names and associated values. The return value is a named list with the names being the names of the parameters and each element is the value that was passed to that parameter.

As an example, consider the following function, which simply returns the call to itself

foo <- function(x, second.arg, y) {
        match.call()
}
We can then do
> val <- foo(2)            
> class(val)
[1] "call"
> parseCall(val)
$x
[1] "2"

> val <- foo(x=runif(10), second.arg=12.34)
> parseCall(val)
$x
[1] "runif(10)"

$second.arg
[1] "12.34"

> val <- foo(x=function(z) {z^2}, y=ls())
> parseCall(val)
$x
[1] "function(z){z^2}"

$y
[1] "ls()"
Correlation-Based Feature Selection
A feature selection method described by Yu and Liu (Proc. 20th Intl. Conf. Mach. Learn., 2003, 856-863), based on the use of the symmetric uncertainty as a correlation measure between a given independent variable and the observed class labels. An implementation of this method can be found here. It requires the entropy functions available from here.

The current implementation only considers categorical variables. Thus if numerical variables are present, they will need to be discretized beforehand. The syntax of the function is

fcbf(x, y, thresh)
where x is the data.frame of independent variables, each of which should be a factor, and y is a factor of class labels. The value of thresh is obtained by trial and error; larger values will reduce the final number of selected variables.

The return value is a data.frame with two columns named idx & SU, which contain the column index, in the original data.frame, of the selected variables and the symmetric uncertainty value between the selected variables and the class labels respectively.
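A usage sketch (the filenames are assumptions):
source('entropy.R')
source('fcbf.R')
sel <- fcbf(x, y, thresh=0.1)   # x: all factors, y: factor of class labels
x.selected <- x[, sel$idx]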

Entropy Functions
This file contains a number of functions to evaluate the entropy of a categorical variable, as well as the joint and conditional entropies of two categorical variables. In addition, the symmetric uncertainty (1) can also be evaluated. The syntax of the current functions is
entropy(x, base=exp(1))           # Entropy of a variable
entropy.joint(x, y, base=exp(1))  # Joint entropy of two variables
entropy.cond(x, y, base=exp(1))   # Conditional entropy of X given Y
SU(x, y, base=exp(1))             # Symmetric uncertainty of X & Y
In all cases, x and y must be factors, and base indicates the base to which logarithms are calculated. By default the base is e.
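For reference, the symmetric uncertainty is SU(X,Y) = 2 [H(X) + H(Y) - H(X,Y)] / [H(X) + H(Y)]. A minimal sketch of the basic entropy calculation (assuming the standard plug-in estimator):
entropy.sketch <- function(x, base=exp(1)) {
  p <- table(x) / length(x)   # empirical probabilities of each level
  p <- p[p > 0]               # avoid log(0)
  -sum(p * log(p, base=base))
}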
2D Numerical Integration
The sfsmisc package provides the integrate.xy function, which performs 1D numerical integration. Though the adapt package does multi-dimensional integration, it requires the user to supply a function rather than tabulated data.

The code below performs a numerical integration using the trapezoidal rule. It assumes that the X and Y values at which the data values (Z) are obtained are equispaced, and the limits of integration are taken to be the minimum and maximum of the X and Y values.

integrate2D.xy <- function(x, y, z) {

    hx <- diff(x)[1]   # grid spacings; assumes equispaced x and y
    hy <- diff(y)[1]

    # composite trapezoidal sum (without the h/2 factor)
    trap <- function(xx) {
        xx[1] + 2*sum(xx[-c(1,length(xx))]) + xx[length(xx)]
    }
    # integrate over x for each fixed y, then over y
    yvals <- hx * apply(z, 2, trap) / 2
    trap(yvals) * hy/2
}
Thus an example of its use would be
> x <- seq(-1,1, length=100)
> y <- seq(0,1, length=100)
> z <- outer(x,y, function(x,y) {exp(x) + cos(y)})
> integrate2D.xy(x,y,z)
[1] 4.03341
This compares reasonably well with the exact answer, (e - 1/e) + 2 sin(1) ≈ 4.033344, and with the value obtained using the adapt package:
> f <- function(args) { exp(args[1]) + cos(args[2]) }
> adapt(2, c(-1,0), c(1,1), 100, functn=f)
       value       relerr       minpts       lenwrk        ifail
    4.033344 3.458066e-07          165           73            0

Neighborhood Behavior
Neighborhood behavior was described by Patterson et al. (J. Med. Chem., 1996, 39, 3049-3059) as a means of determining how likely it is that similar molecules will exhibit similar activities. The behavior is analysed by plotting the pairwise distances (for a given set of descriptors) against the absolute difference of the activity values for each pair of molecules. For good neighborhood behavior the points should be localized in the lower triangle of the plot.

Currently the function considers only numeric descriptors. The syntax of the function is

nbr.behavior(act, desc, dist=FALSE, do.plot=TRUE, ...)
Here
  • act is a vector of activity values
  • desc is a matrix or data.frame of descriptor values for each molecule. However if dist is set to TRUE then desc is a distance matrix (see the dist function in R).
  • dist is a boolean indicating whether desc is a data matrix or distance matrix
  • do.plot is a boolean indicating whether to plot the result or not
  • ... are parameters that are passed to the plot() function
The return value is a list with two components, x & y, containing the distances and activity differences for each pair. Note that the length of each component will be n*(n-1)/2 (since pair i,j is the same as pair j,i).

The code can be obtained here
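A usage sketch, assuming act and desc are already defined:
res <- nbr.behavior(act, desc)                   # descriptor matrix
res <- nbr.behavior(act, dist(desc), dist=TRUE)  # precomputed distances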

Rescaling a Vector
The snippet below rescales the supplied numeric vector to the range specified by the user
rescale <- function(x, newrange) {
    newmin <- min(newrange)
    newmax <- max(newrange)
    oldrange <- range(x)
    # scale factor mapping the old range onto the new one
    y <- (newmax - newmin) / (oldrange[2] - oldrange[1])
    newmin + (x - oldrange[1]) * y
}
Here, newrange should be a vector of length 2 containing the new range. You can get the function with some more error checking here. Example usage is
> x <- runif(20)
> newx <- rescale(x, c(-2,2))
Binary Fingerprint Tools
This package provides some utility functions to manipulate binary fingerprints. Currently, these are written to handle the output of the fingerprint functions of the CDK package and of MOE. However, other formats can easily be read by providing a line handler function that parses a single line; see the documentation and source for more details. The other functions (logical operations, distance calculations, etc.) are general.

A fingerprint is considered to be a numeric vector of integers, each element specifying a position in the original bit string that was set to 1. As a result, the functions need to be told the length of the original bit string. One way to avoid this is to make a 'fingerprint' object which contains all the relevant information. This will probably be done in subsequent versions of the package.
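To illustrate the representation, a small sketch (with made-up data) of a Tanimoto similarity calculation between two such fingerprints:
# positions of the bits set to 1 in two hypothetical 16-bit fingerprints
fp1 <- c(1, 5, 7, 12)
fp2 <- c(1, 7, 9, 12, 15)
# Tanimoto coefficient: |intersection| / |union| of the 'on' bits
tanimoto <- length(intersect(fp1, fp2)) / length(union(fp1, fp2))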

Updates:

  • Added the XOR function and a function to fold a fingerprint using the XOR
  • Updated to include the modified Tanimoto coefficient (based on Model 1) described in Fligner, M.A. et al., Technometrics, 2002, 44(2), 110-119.
  • Added a function to convert a fingerprint to a point on the unit hypersphere
Levenshtein Distance
A small function to evaluate the Levenshtein distance between two strings. This is a measure of the dissimilarity of two strings and is the number of edits (consisting of substitutions, insertions & deletions) required to change one string into the other. Source the file and call it as:
distance <- levdist(source, target) 
where source is the string we start with and target is the string we want to transform to.
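For reference, a minimal dynamic-programming sketch of the calculation (not the code from the file above):
levdist.sketch <- function(source, target) {
  s <- strsplit(source, '')[[1]]
  t <- strsplit(target, '')[[1]]
  # d[i+1, j+1] holds the distance between the first i characters of
  # source and the first j characters of target
  d <- matrix(0, length(s)+1, length(t)+1)
  d[,1] <- 0:length(s)
  d[1,] <- 0:length(t)
  for (i in seq_along(s)) {
    for (j in seq_along(t)) {
      cost <- if (s[i] == t[j]) 0 else 1
      d[i+1, j+1] <- min(d[i, j+1] + 1,   # deletion
                         d[i+1, j] + 1,   # insertion
                         d[i, j] + cost)  # substitution
    }
  }
  d[length(s)+1, length(t)+1]
}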
Fractal Dimension
bc.R contains the function bc, which evaluates the fractal dimension (1, 2, 3) of a JPEG image using the box counting procedure (a minimal sketch of the procedure appears after the list below). In addition to reporting the fractal dimension, it also plots log(box count) vs log(1/box size) and returns the linear model. It requires the rimage library to be installed.

A few points

  • It takes the image matrix obtained by using read.jpeg() from the rimage library. So an example of its use would be
    > source('bc.R')
    > img <- read.jpeg('image.jpeg')
    > bc(img) # default threshold value = 0.5 
  • Returns the fitted model and the fractal dimension in a list with attributes model & fdim respectively.
  • Currently, the code assumes that the image is square with side equal to a power of two. I should get this changed.
  • For best results the image should be black and white, so that thresholding is simpler. It will probably also work with a greyscale image as well, though the threshold value might need to be tweaked (via the parameter thvalue).
  • A little slow (~ 18 sec on an Athlon 1GHz, 1GB RAM machine for a 1024x1024 image using box sizes from 2x2 to 512x512)
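As mentioned above, here is a minimal sketch of the box counting procedure itself (not the bc() code), assuming a square, thresholded 0/1 image matrix whose side is a power of two:
bc.sketch <- function(img) {
  n <- nrow(img)                 # image assumed square, side a power of 2
  sizes <- 2^(1:(log2(n) - 1))   # box sizes 2, 4, ..., n/2
  counts <- sapply(sizes, function(s) {
    # group row/column indices into consecutive blocks of size s
    blocks <- split(1:n, rep(1:(n/s), each=s))
    cnt <- 0
    for (i in blocks) for (j in blocks)
      if (any(img[i, j] > 0)) cnt <- cnt + 1  # box contains a set pixel
    cnt
  })
  # slope of log(count) vs log(1/size) estimates the fractal dimension
  fit <- lm(log(counts) ~ log(1/sizes))
  list(model=fit, fdim=unname(coef(fit)[2]))
}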

[ A modified version of this function is used for the online fractal dimension calculator which will evaluate the fractal dimension of an image and display the log-log plot. The app was developed for a diffusion limited aggregation experiment for the CHEM457 course at PSU. If you're interested in the PHP code that drives the application drop me a line. ]
Stochastic Proximity Embedding
I have implemented the stochastic proximity embedding (SPE) algorithm, as described by Agrafiotis et al. (Proc. Nat. Acad. Sci., 2002, 99, 15869-15872 & J. Comp. Chem., 2003, 24, 1215-1221), as a C subroutine with R wrappers. Nearly all the parameters are configurable. You can get the package from here. To install, download the tarball and then do
R CMD INSTALL spe_1.0.tar.gz 
Currently it does not implement the connected-component method of evaluating a good neighborhood radius, so you're stuck with trial and error to get a good value. In addition, the package does not implement the pivot SPE method (J. Mol. Graph. Model., 2003, 22, 133-140).

The package provides two main functions: the SPE algorithm and a function to evaluate the Sammon stress using random sampling. You can also get a pure C implementation here.
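For orientation, a conceptual sketch of a single SPE refinement step (after Agrafiotis et al.; the actual C code also handles the neighborhood radius and the learning-rate schedule):
# X: current n x d embedding; R: matrix of target proximities;
# lambda: learning rate; eps avoids division by zero
spe.step <- function(X, R, lambda, eps=1e-10) {
  ij <- sample(nrow(X), 2)            # pick a random pair of points
  i <- ij[1]; j <- ij[2]
  d <- sqrt(sum((X[i,] - X[j,])^2))   # current embedded distance
  # move the pair apart or together, towards the target proximity
  delta <- 0.5 * lambda * (R[i,j] - d) / (d + eps) * (X[i,] - X[j,])
  X[i,] <- X[i,] + delta
  X[j,] <- X[j,] - delta
  X
}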
Jarvis Patrick Clustering
JP clustering (IEEE Transactions on Computing, 1973, C22, 1025-1034) is a fast clustering algorithm based on the kNN algorithm that requires a single pass through the data: broadly, two objects are assigned to the same cluster if they appear in each other's nearest-neighbor lists and share a sufficient number of common neighbors. It is not restricted to the detection of globular clusters and has been shown to be a good chemical clustering algorithm (REF). A nice description may be found here. The downsides are the quadratic nature of the algorithm (as it considers all pairs of objects) and the fact that the user must specify two parameters, j and k, which control the number of clusters generated; (REF) has a good discussion of the choice of these parameters.

You can download an R implementation of the algorithm here. As far as I can tell it is correct; my only doubt concerns the case where two objects pass the tests but are already in different, non-singleton clusters. My solution is to merge the two clusters into a new one, which as far as I can see is the correct approach.

Currently it's not in package form, and so it consists of two files:
  • jpc.c - a C function that evaluates all pairwise Euclidean distances
  • jpc.R - the actual R code that carries out the clustering

The R function will take a data matrix (observations in the rows, descriptors in the columns) or a distance matrix (which may contain similarities rather than distances). To use the code you need to compile the C function by typing
R CMD SHLIB jpc.c 
at the command line. Then start R and do
> dyn.load('jpc.so')
> source('jpc.R')
> clus <- jpc(dat, j=3, k=1, diss=FALSE)
If anybody is interested I'll make an R package out of it. Otherwise I'll wait till I have something more to make a package out of.
kNN Regression
This package implements the kNN regression algorithm, where an observation's predicted value is taken as the arithmetic mean of the dependent-variable values of its k nearest neighbors.
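The idea fits in a few lines; a minimal sketch (not the package code), using Euclidean distances:
# train: n x d matrix; y: length-n response; query: m x d matrix
knn.reg.sketch <- function(train, y, query, k=3) {
  apply(query, 1, function(q) {
    d <- sqrt(colSums((t(train) - q)^2))  # distances to all training rows
    mean(y[order(d)[1:k]])                # mean response of the k nearest
  })
}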
Binary QSAR
Binary QSAR is a non-linear modeling method that can be used to build models for binary categorical data (such as actives and inactives). It was described by Gao et al. (J. Chem. Inf. Comput. Sci., 1999, 39, 164-168) and Labute (Pacific Symposium on Biocomputing, 1999, 4, 444-455). The code consists of a single R function that can be called as
bqsar(x,y,sigma=.25) 
Here x is a data.frame of descriptors (independent variables), which must all be numeric, and y is a factor (with two levels, 1 & 0) indicating actives and inactives. Even though the correlations between individual descriptors may be low, the method performs better if a PCA is performed on x and the rotated x matrix is passed to this function (the PCA rotation leads to statistical independence of the descriptors, which is an assumption of this method). sigma is a smoothing parameter, and currently there is no method to optimize it (though CV should be implemented in the future). The return value is a list of probabilities for each observation to be active.
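A usage sketch following the PCA recommendation above (x and y as described):
pc <- prcomp(x)                                     # rotate the descriptors
probs <- bqsar(as.data.frame(pc$x), y, sigma=0.25)  # probabilities of being active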

The original algorithm uses histogram binning to determine the density at a given point from the estimated distribution of each descriptor (for actives and inactives separately). The implementation here uses the dpih function from the KernSmooth package. This function provides a bin width, which is then used to determine the bin boundaries by doing

bwidth <- dpih(z)
b <- seq(min(z)-0.1, max(z)+0.1+bwidth, by=bwidth)
hist(z, breaks=b)
As a result, the number of bins is determined automatically.

The code can be obtained here.

Integrating R & the CDK
The CDK provides a wide array of cheminformatics functionality. The ability to use this functionality, as well as that provided by Jmol and JChemPaint, is very useful for cheminformatics workflows in the R environment.

The rcdk package aims to integrate the CDK, Jmol and JChemPaint projects with R, allowing one to generate, manipulate and visualize molecular information in a single environment.

Requirements:

  • The rJava package. (Older versions [ 1.1, 1.0 ] used SJava and are no longer supported)
  • R 2.0.0 and higher
  • Java 1.5. If you use other versions of Java you will probably need to recompile the rcdk Java sources and rebuild the rcdk package.


Installation

The code is currently available in two forms.

The R package contains the final rcdk.jar file, which provides a number of Java utility methods to access the CDK API. It can be installed using the normal procedure for installing R packages.

To compile the rcdk.jar library you will require Ant, in addition to the components listed above (excluding R and rJava). To generate rcdk.jar, untar the source tarball and then do

cd rcdk
ant clean jar
This will create the rcdk.jar file. Note that for successful compilation you will need the CDK (including JChemPaint) and Jmol libraries, and any other dependencies they require, in the CLASSPATH. Once you have the jar file you can start using the functions in rcdk.R, visual.R and desc.R (found under package/rcdk/R/).

However, it is much more convenient to simply use the R package. So if you recompile the jar file, copy it to package/rcdk/inst/cont/ and then, in the package directory, do

R CMD build rcdk
R CMD INSTALL rcdk*.gz


Usage

After starting up R load the rcdk package

library(rcdk)
Once this is done you can do a variety of things:
# load multiple structure files
moleculelist <- load.molecules(c('data/mol1.sdf','data/mol2.xyz','data/mol3.hin'))

# draw a 2D structure
molecule <- draw.molecule()

# view 2D structures as a grid
view.molecule.2d(moleculelist)

# view a single 2D structure
view.molecule.2d(moleculelist[[1]])

# get fingerprints
get.fingerprint(moleculelist[[1]])

# generate SMILES for a set of molecules
lapply(moleculelist, get.smiles)

# view a 3D structure
view.molecule.3d('structure.xyz')

# view a 3D structure and supply a Jmol script
view.molecule.3d('structure.xyz', 'select *; spacefill on;')

CAVEATS:

  • Loading lots of molecules into memory will require a lot of RAM

Further Possibilities:

  • Since the viewer can handle String representations of molecules, R's database capabilities make it very easy to view molecular structures stored in databases (which would be nice from a QSAR methodology point of view).
  • Add more helper functions to make common tasks easy (in progress)