R Packages & Functions
Writing JCAMP-DX files | VOTables to data.frame | Object Browser | Spectral Clustering | Removing Correlated Columns | Removing Constant or Near-Constant Columns | Parsing a call Object | Correlation-Based Feature Selection | Entropy Functions | 2D Numerical Integration | Neighborhood Behavior | Rescaling a Vector | Binary Fingerprint Tools | Levenshtein Distance | Fractal Dimension | Jarvis Patrick Clustering | Binary QSAR | kNN Regression | Stochastic Proximity Embedding | Integrating R & the CDK
chunklist <- function(vec, chunksize) { ret <- list() counter <- 1 idx <- seq(1, length(vec), by=chunksize) lastidx <- idx[length(idx)] for (i in idx) { if (i == lastidx && i+chunksize-1 > length(vec)) { ret[[counter]] <- vec[i:length(vec)] } else { ret[[counter]] <- vec[i:(i+chunksize-1)] } counter <- counter +1 } ret } bspec.write <- function(filename, xydata, name) { con = file(description=filename, open='w') writeLines(paste('##TITLE= ',name,sep=''), con=con) writeLines('##JCAMP-DX= 4.24\n##DATA TYPE= UV/VIS SPECTRUM\n##DATA CLASS= XYDATA', con=con) writeLines('##XLABEL= Bit Position\n##YLABEL= Normalized Frequency', con=con) writeLines('##XFACTOR= 1.0\n##YFACTOR= 1.0', con=con) writeLines('##FIRSTX= 1.0\n##LASTX= 1052.0', con=con) writeLines('##MAXY= 1.0\n##MINY= 0.0', con=con) writeLines(paste('##NPOINTS= ',nrow(xydata),sep=''), con=con) writeLines('##XYDATA= (X++(Y..Y))', con=con) chunks <- .chunklist(round(xydata[,2],2), 10) i <- 1 for (chunk in chunks) { writeLines(paste(i, paste(chunk, sep=' ', collapse=' '), sep=' '), con=con) i <- i+10 } writeLines("##END=", con=con) close(con) }
vot2df( doc, isURL=FALSE, isString=FALSE)If isString is TRUE then the contents of doc is considered to be string containing the XML document. Otherwise it is considered a filename or url depending on the value of isURL. The return value is a single data.frame object if the document contains a single table entry or a list of tables if there is more than one table in the document. For each table it will add an attribute called description to the resultant data.frame containing the value of the DESCRIPTION element for that table in the document, if it exists.
Currently it will only handle tabular data corresponding to the TABLEDATA element of the VOTable schema. In addition it will handle the following primitive data types and convert them to R types
Update (10/20/2007): The reader can now handle the case where cells are empty. The corresponding data.frame will contain an NA at those positions
VOTable type | R type |
double, float long, short | double |
int | integer |
char | character |
boolean | logical |
obj.R provides a useful utility function which brings up a graphical table that basically lists all the variables (or functions or both) in the current workspace. That it is, it considers env = .GlobalEnv. A screenshot can be seen here
The code requires that you have Tcl/TK as well as the Tktable widget installed on your system. To use the function called obj.browse simply source obj.R. The synatx of the function is
obj.browse(which=c('all','funcs','vars'), height=-1, width=-1) which='all' - Display all variables and functions. In this case the rows are sorted on the basis of the 'Class' column which='funcs' - Display only functions which='vars' - Display only variables height=-1 - How many rows to display. Default is to use the Tktable default width=-1 - How many columns to display. Default is to use 3Note: For Fedora Core 4 you can simply do
yum install tktableto get the required widget. For Windows users you'll need to install ActiveTcl available from here
Fischer et al. (Fischer, I.,; Poland, J.,Amplifying the Block Matrix Structure for Spectral Clustering, In M. van Otterlo et al. (Editors), Proceedings of the 14th Annual Machine Conference of Belgium and the Netherlands, pp. 21-28, 2005) proposed an alternative method which amplifies the block structure of an affinity matrix by using asymmetric affinity matrices generated from Gaussian kernels. They then evaluate the Laplacian or a conductivity matrix which is then decomposed using the SVD.
specclusAS is an R package that is a translation of the original Matlab code written by Fischer et al. and has functions to perform both symmetric and asymmetric spectral clustering and also provides access to functions that can compute an asymmetric affinity matrix as well as the conductivity matrix.
r2test <- function(d, cutoff=0.8) { if (cutoff > 1 || cutoff <= 0) { stop(" 0 <= cutoff < 1") } if (!is.matrix(d) && !is.data.frame(d)) { stop("Must supply a data.frame or matrix") } r2cut = sqrt(cutoff); cormat <- cor(d); bad.idx <- which(abs(cormat)>r2cut,arr.ind=T); bad.idx <- matrix( bad.idx[bad.idx[,1] > bad.idx[,2]], ncol=2); drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[,1], bad.idx [,2]); if (length(drop.idx) == 0) { 1:ncol(d) } else { (1:ncol(d))[-unique(drop.idx)] } }
d <- data.frame(...) dropidx <- apply(d, 2, function(x) { length(which(x == 0))/length(x) > .8 }) d <- d[, !dropidx]The above code will remove those columns which contain more than 80% zero's.
parseCall() performs this job by parsing a call object to determine parameter names and associated values. The return value is a named list with the names being the names of the parameters and each element is the value that was passed to that parameter.
As an example consider the following function, which simply returns the call to itself
foo <- function(x, second.arg, y) { match.call() }We can then do
> val <- foo(2) > class(val) [1] "call" > parseCall(val) $x [1] "2" > val <- foo(x=runif(10), second.arg=12.34) > parseCall(val) $x [1] "runif(10)" $second.arg [1] "12.34" > val <- foo(x=function(z) {z^2}, y=ls()) > parseCall(val) $x [1] "function(z){z^2}" $y [1] "ls()"
The current implementation only considers categorical variables. Thus if numerical variables are present, they will need to be discretized beforehand. The syntax of the function is
fcbf(x, y, thresh)where x is the data.frame of independent variables, each of which should be a factor and y is a factor of class labels. The value of thresh is obtained by trial and error and larger values will reduce the final number of selected variables.
The return value is a data.frame with two columns named idx & SU, which contain the column index, in the original data.frame, of the selected variables and the symmetric uncertainty value between the selected variables and the class labels respectively.
entropy(x, base=exp(1)) # Entropy of a variable entropy.joint(x, y, base=exp(1)) # Joint entropy of two variables entropy.cond(x, y, base=exp(1)) # Conditional entropy of X given Y SU(x, y, base=exp(1)) # Symmetric uncertainty of X & YIn all cases, x and y must be factors and the base indicates the base to which logarithms are calculated. By default the base is e.
The code below performs a numerical integration using the trapezoidal rule. It assumes that the X and Y values at which the data values (Z) are obtained are equispaced. Also the limits of integration are the minimum and maximum of the X and Y values.
integrate2D.xy <- function(x, y, z) { hx <- diff(x)[1] hy <- diff(y)[1] trap <- function(xx) { xx[1] + 2*sum(xx[-c(1,length(xx))]) + xx[length(xx)] } yvals <- hx * apply(z,2, trap) / 2 trap(yvals) * hy/2 }Thus an example of its use would be
> x <- seq(-1,1, length=100) > y <- seq(0,1, length=100) > z <- outer(x,y, function(x,y) {exp(x) + cos(y)}) > integrate2D.xy(x,y,z) [1] 4.03341This compares reasonable well with the value obtained by using the adapt package
> f <- function(args) { exp(args[1]) + cos(args[2]) } > adapt(2, c(-1,0), c(1,1), 100, functn=f) value relerr minpts lenwrk ifail 4.033344 3.458066e-07 165 73 0
Currently the function considers only numeric descriptors. The syntax of the function is
nbr.behavior(act, desc, dist=FALSE, do.plot=TRUE, ...)Here
- act is a vector of activity values
- desc is a matrix or data.frame of descriptor values for each molecule. However if dist is set to TRUE then desc is a distance matrix (see the dist function in R).
- dist is a boolean indicating whether desc is a data matrix or distance matrix
- do.plot is a boolean indicating whether to plot the result or not
- ... are parameters that are passed to the plot() function
The code can be obtained here
rescale <- function(x, newrange) { newmin <- min(newrange) newmax <- max(newrange) oldrange <- range(x) y <- (newmax - newmin) / (oldrange[2] - oldrange[1]) newmin + (x - oldrange[1]) * y }Here, newrange should be a vector of length 2 containing the new range. You can get the function with some more error checking here. Example usage is
> x <- runif(20) > newx <- rescale(x, c(-2,2))
A fingerprint is considered to be a numeric vector of integers, each element specifying the position in the original bit string that was set to 1. As a result, the functions require to be told the length of the original bitstring. One way to avoid this is to make a 'fingerprint' object which contains all the relevent information. This will probably be done in subsequent versions of the package
Updates:
- Added the XOR function and a funtion to fold a fingerprint using the XOR
- Updated to include the modified Tanimoto coefficient (based on Model 1) described in Fligner, M.A. et al., Technometrics, 2002, 44(2), 110-119.
- Added a function to convert a fingerprint to a point on the unit hypersphere
distance <- levdist(source, target)where source is the string we start with and target is the string we want to transform to.
A few points
- It takes the image matrix obtained by using read.jpeg() from the rimage library. So an example of its use
would be
> source('bc.R') > img <- read.jpeg('image.jpeg') > bc(img) # default threshold value = 0.5
- Returns the fitted model and the fractal dimension in a list with attributes model & fdim respectively.
- Currently, the code assumes that the image is square with side equal to a power of two. I should get this changed.
- For best results the image should be black and white, so that thresholding is simpler. It will probably also work with a greyscale image as well, though the threshold value might need to be tweaked (via the parameter thvalue).
- A little slow (~ 18 sec on an Athlon 1GHz, 1GB RAM machine for a 1024x1024 image using box sizes from 2x2 to 512x512)
[ A modified version of this function is used for the online fractal dimension calculator which will evaluate the fractal dimension of an image and display the log-log plot. The app was developed for a diffusion limited aggregation experiment for the CHEM457 course at PSU. If you're interested in the PHP code that drives the application drop me a line. ]
R CMD INSTALL spe_1.0.tar.gzCurrently it does not implement the connected component method of evaluating a good neighborhood radius, so you're stuck with trial and error to get a good value of the neighborhood radius. In addition the package does not implement the pivot SPE method (J. Mol. Graph. Model., 2003,22, 133-140).
The package provides two main functions: the SPE algorithm and a function to evaluate the Sammon stress using random sampling. You can also get a pure C implementation here
You can download an R implementation of the algorithm here. As far as I can see its correct - my only doubt is when we have two objects that pass the tests but are already in different clusters (which are not singeltons) . My solution is to merge the two clusters to give a new one. As far as I can see this is the correct approach.
Currently its not in package form and so it contains two files:
- jpc.c - a C function that evaluates all pairwise Euclidean distances
- jpc.R - the actual R code that carries out the clustering
The R function will take a data matrix (observations in the rows and descriptors in the columns) or a distance matrix (which may be similarities rather than distances). To use the code you need to compile the C funtion by typing
R CMD SHLIB jpc.cat the command line. Then start R and do
>dyn.load('jpc.so') >source('jpc.R') >clus <- jpc(dat, j=3, k=1, diss=FALSE)If anybody seems interested I'll make an R package out of it. Otherwise I'll wait till I have something more to make a package out of.
bqsar(x,y,sigma=.25)Here x is a data.frame of descriptors (independent variables) which must be all numeric and y is a factor (with two levels: 1 & 0) indicating actives and inactives. Even though the correlations between individual descriptor maybe low the method performs better if a PCA is performed on x and the rotated x matrix is passed to this function (the PCA rotation leads to statistical independence of the descriptors which is an assumption of this method). Sigma is a smoothing parameter and currently there is no method to optimize it (though CV should be implemented in the future). The return value is a list of probabilities for each observation to be active.
The original algorithm uses histogram binning to determine the density at a given point from the estimated distribution of each descriptor (for actives and inactive seperately). The implementation here uses the dpih function from the KernSmooth package. This function provides a bin width which is then used to determine the bin boundaries by doing
bwidth <- dpih(z) b <- seq(min(z)-0.1, max(z)+0.1+bwidth, by=bwidth) hist(x, breaks=b)As a result, the number of bins is determined automatically.
The code can be obtained here.
The rcdk package aims to integrate the CDK, Jmol
and JChemPaint projects with R, allowing one to generate, manipulate and visualize
molecular information in a single environment.
Requirements:
- The rJava package. (Older versions [ 1.1, 1.0 ] used SJava and are no longer supported)
- R 2.0.0 and higher
- Java 1.5 If you use other versions of Java you will probably need to recompile the rcdk Java sources and rebuild the rcdk package.
Installation
The code is currently available in two forms.
- Java sources + R package source: rcdk.tgz
- R package: rcdk_2.6.1.tar.gz
To compile the rcdk.jar library you will require Ant in addition to the components listed above (excluding R and rJava). To generate rcdk.jar, untar the source tarball and then
cd rcdk ant clean jarThis will create the rcdk.jar file. Note that for successful compilation you will need the CDK (including the JChemPaint) and Jmol libraries and any other dependencies they require in the CLASSPATH. Once you have the jar file you can then start using the functions in rcdk.R, visual.R and desc.R (found under package/rcdk/R/.
However it is much more convenient to simply use the R package. So if you recompile the jar file, copy it to package/rcdk/inst/cont/ and then in the package directory do
R CMD build rcdk R CMD INSTALL rcdk*.gz
Usage
After starting up R load the rcdk package
library(rcdk)Once this is done you can do a variety of things:
# load multiple structure files moleculelist <- load.molecules(c('data/mol1.sdf','data/mol2.xyz','data/mol3.hin')) # draw a 2D structure molecule <- draw.molecule() # view 2D structures as a grid view.molecule.2d(moleculelist) # view a single 2D structure view.molecule.2d(moleculelist[[1]]) # get fingerprints get.fingerprint(moleculelist[[1]]) # generate SMILES for a set of molecules lapply(moleculelist, get.smiles) # view a 3D structure view.molecule.3d('structure.xyz') # view a 3D structure and supply a Jmol script view.molecule.3d('structure.xyz', 'select *; spacefill on;')
CAVEATS:
- Loading lots of molecules into memory will require a lot of RAM
Further Possibilities:
- Since the viewer can handle String representations of molecules, R's database capabilities make it very easy to view molecular structures stored in databases (which would be nice from a QSAR methodology point of view).
- Add more helper functions to make common tasks easy (in progress)