## R Packages & Functions

Writing JCAMP-DX files | VOTables to data.frame | Object Browser | Spectral Clustering | Removing Correlated Columns | Removing Constant or Near-Constant Columns | Parsing a call Object | Correlation-Based Feature Selection | Entropy Functions | 2D Numerical Integration | Neighborhood Behavior | Rescaling a Vector | Binary Fingerprint Tools | Levenshtein Distance | Fractal Dimension | Jarvis Patrick Clustering | Binary QSAR | kNN Regression | Stochastic Proximity Embedding | Integrating R & the CDK

*xydata*in the

*bspec.write*function should be a matrix or data.frame with two columns.

chunklist <- function(vec, chunksize) { ret <- list() counter <- 1 idx <- seq(1, length(vec), by=chunksize) lastidx <- idx[length(idx)] for (i in idx) { if (i == lastidx && i+chunksize-1 > length(vec)) { ret[[counter]] <- vec[i:length(vec)] } else { ret[[counter]] <- vec[i:(i+chunksize-1)] } counter <- counter +1 } ret } bspec.write <- function(filename, xydata, name) { con = file(description=filename, open='w') writeLines(paste('##TITLE= ',name,sep=''), con=con) writeLines('##JCAMP-DX= 4.24\n##DATA TYPE= UV/VIS SPECTRUM\n##DATA CLASS= XYDATA', con=con) writeLines('##XLABEL= Bit Position\n##YLABEL= Normalized Frequency', con=con) writeLines('##XFACTOR= 1.0\n##YFACTOR= 1.0', con=con) writeLines('##FIRSTX= 1.0\n##LASTX= 1052.0', con=con) writeLines('##MAXY= 1.0\n##MINY= 0.0', con=con) writeLines(paste('##NPOINTS= ',nrow(xydata),sep=''), con=con) writeLines('##XYDATA= (X++(Y..Y))', con=con) chunks <- .chunklist(round(xydata[,2],2), 10) i <- 1 for (chunk in chunks) { writeLines(paste(i, paste(chunk, sep=' ', collapse=' '), sep=' '), con=con) i <- i+10 } writeLines("##END=", con=con) close(con) }

vot2df( doc, isURL=FALSE, isString=FALSE)If

*isString*is TRUE then the contents of

*doc*is considered to be string containing the XML document. Otherwise it is considered a filename or url depending on the value of

*isURL*. The return value is a single data.frame object if the document contains a single table entry or a list of tables if there is more than one table in the document. For each table it will add an attribute called

*description*to the resultant data.frame containing the value of the

*DESCRIPTION*element for that table in the document, if it exists.

Currently it will only handle tabular data corresponding to the *TABLEDATA* element
of the VOTable schema. In addition it will handle the following primitive data types and
convert them to R types

**Update** (10/20/2007): The reader can now handle the case where cells are empty. The
corresponding data.frame will contain an NA at those positions

VOTable type |
R type |

double, float long, short | double |

int | integer |

char | character |

boolean | logical |

obj.R provides a useful utility function which brings up
a graphical table that basically lists all the variables (or functions or both)
in the current workspace. That it is, it considers *env = .GlobalEnv*. A
screenshot can be seen here

The code requires that you have Tcl/TK as well as the Tktable widget installed on your system. To use the function
called *obj.browse* simply source obj.R. The synatx of
the function is

obj.browse(which=c('all','funcs','vars'), height=-1, width=-1) which='all' - Display all variables and functions. In this case the rows are sorted on the basis of the 'Class' column which='funcs' - Display only functions which='vars' - Display only variables height=-1 - How many rows to display. Default is to use the Tktable default width=-1 - How many columns to display. Default is to use 3

**Note**: For Fedora Core 4 you can simply do

yum install tktableto get the required widget. For Windows users you'll need to install ActiveTcl available from here

Fischer et al. (Fischer, I.,; Poland, J.,*Amplifying the Block Matrix Structure for Spectral Clustering*,
In M. van Otterlo et al. (Editors), Proceedings of the 14th Annual Machine Conference of Belgium and the Netherlands, pp. 21-28, 2005)
proposed an alternative method which amplifies the block structure
of an affinity matrix by using asymmetric affinity matrices generated from Gaussian kernels. They then evaluate
the Laplacian or a conductivity matrix which is then decomposed using the
SVD.

specclusAS is an R package that is a translation of the original Matlab code written by Fischer et al. and has functions to perform both symmetric and asymmetric spectral clustering and also provides access to functions that can compute an asymmetric affinity matrix as well as the conductivity matrix.

^{2}.

r2test <- function(d, cutoff=0.8) { if (cutoff > 1 || cutoff <= 0) { stop(" 0 <= cutoff < 1") } if (!is.matrix(d) && !is.data.frame(d)) { stop("Must supply a data.frame or matrix") } r2cut = sqrt(cutoff); cormat <- cor(d); bad.idx <- which(abs(cormat)>r2cut,arr.ind=T); bad.idx <- matrix( bad.idx[bad.idx[,1] > bad.idx[,2]], ncol=2); drop.idx <- ifelse(runif(nrow(bad.idx)) > .5, bad.idx[,1], bad.idx [,2]); if (length(drop.idx) == 0) { 1:ncol(d) } else { (1:ncol(d))[-unique(drop.idx)] } }

d <- data.frame(...) dropidx <- apply(d, 2, function(x) { length(which(x == 0))/length(x) > .8 }) d <- d[, !dropidx]The above code will remove those columns which contain more than 80% zero's.

*x*. If this value was a single number, say 5, I would like to know that the value of the parameter was 5. If however, the parameter was specified by some other function or variable I want to know the name of the variable or function, rather than its value.

parseCall() performs this job by parsing a *call*
object to determine parameter names and associated values. The return value is a named
list with the names being the names of the parameters and each element is the value that
was passed to that parameter.

As an example consider the following function, which simply returns the call to itself

foo <- function(x, second.arg, y) { match.call() }We can then do

> val <- foo(2) > class(val) [1] "call" > parseCall(val) $x [1] "2" > val <- foo(x=runif(10), second.arg=12.34) > parseCall(val) $x [1] "runif(10)" $second.arg [1] "12.34" > val <- foo(x=function(z) {z^2}, y=ls()) > parseCall(val) $x [1] "function(z){z^2}" $y [1] "ls()"

*Proc. 20*, 2003, 865-853) based on the use of the symmetric uncertainty as a correlation function between a given independent variable and the observed class labels. An implementation of this method can be found here. It requires the entropy functions available from here.

^{th}Intl. Conf. Mach. Learn.The current implementation only considers categorical variables. Thus if numerical variables are present, they will need to be discretized beforehand. The syntax of the function is

fcbf(x, y, thresh)where

*x*is the data.frame of independent variables, each of which should be a factor and

*y*is a factor of class labels. The value of

*thresh*is obtained by trial and error and larger values will reduce the final number of selected variables.

The return value is a data.frame with two columns named idx & SU, which contain the column index, in the original data.frame, of the selected variables and the symmetric uncertainty value between the selected variables and the class labels respectively.

entropy(x, base=exp(1)) # Entropy of a variable entropy.joint(x, y, base=exp(1)) # Joint entropy of two variables entropy.cond(x, y, base=exp(1)) # Conditional entropy of X given Y SU(x, y, base=exp(1)) # Symmetric uncertainty of X & YIn all cases,

*x*and

*y*must be factors and the base indicates the base to which logarithms are calculated. By default the base is

*e*.

*integrate.xy*which performs 1D numerical integration. Though the adapt package does multi-dimensional integration it requires the user to specify a function.

The code below performs a numerical integration using the trapezoidal rule. It assumes that the X and Y values at which the data values (Z) are obtained are equispaced. Also the limits of integration are the minimum and maximum of the X and Y values.

integrate2D.xy <- function(x, y, z) { hx <- diff(x)[1] hy <- diff(y)[1] trap <- function(xx) { xx[1] + 2*sum(xx[-c(1,length(xx))]) + xx[length(xx)] } yvals <- hx * apply(z,2, trap) / 2 trap(yvals) * hy/2 }Thus an example of its use would be

> x <- seq(-1,1, length=100) > y <- seq(0,1, length=100) > z <- outer(x,y, function(x,y) {exp(x) + cos(y)}) > integrate2D.xy(x,y,z) [1] 4.03341This compares reasonable well with the value obtained by using the adapt package

> f <- function(args) { exp(args[1]) + cos(args[2]) } > adapt(2, c(-1,0), c(1,1), 100, functn=f) value relerr minpts lenwrk ifail 4.033344 3.458066e-07 165 73 0

*J. Med. Chem.*,

**1996**,

*39*, 3049-3059) as a means of determining how likely similar molecules would exhibit similar activities. The behavior is analysed by generating a plot of pairwise distances (for a given set of descriptors) versus the absolute difference of activity values for each pair of molecules. For good neighborhood behavior the plot should have points localized in the lower triangle.

Currently the function considers only numeric descriptors. The syntax of the function is

nbr.behavior(act, desc, dist=FALSE, do.plot=TRUE, ...)Here

*act*is a vector of activity values*desc*is a matrix or data.frame of descriptor values for each molecule. However if*dist*is set to TRUE then*desc*is a distance matrix (see the dist function in R).*dist*is a boolean indicating whether*desc*is a data matrix or distance matrix*do.plot*is a boolean indicating whether to plot the result or not*...*are parameters that are passed to the plot() function

The code can be obtained here

rescale <- function(x, newrange) { newmin <- min(newrange) newmax <- max(newrange) oldrange <- range(x) y <- (newmax - newmin) / (oldrange[2] - oldrange[1]) newmin + (x - oldrange[1]) * y }Here,

*newrange*should be a vector of length 2 containing the new range. You can get the function with some more error checking here. Example usage is

> x <- runif(20) > newx <- rescale(x, c(-2,2))

A fingerprint is considered to be a numeric vector of integers, each element specifying the position in the original bit string that was set to 1. As a result, the functions require to be told the length of the original bitstring. One way to avoid this is to make a 'fingerprint' object which contains all the relevent information. This will probably be done in subsequent versions of the package

Updates:

- Added the XOR function and a funtion to fold a fingerprint using the XOR
- Updated to include the modified Tanimoto coefficient (based on Model 1) described in Fligner, M.A. et al., Technometrics, 2002, 44(2), 110-119.
- Added a function to convert a fingerprint to a point on the unit hypersphere

distance <- levdist(source, target)where

*source*is the string we start with and

*target*is the string we want to transform to.

*bc*which evaluates the fractal dimension (1, 2, 3) of a a JPEG image using the box counting procedure. In addition to reporting the fractal dimension it also plots log(box count) vs log(1/box size) and returns the linear model. It requires the rimage library to be installed.

A few points

- It takes the image matrix obtained by using
*read.jpeg()*from the rimage library. So an example of its use would be> source('bc.R') > img <- read.jpeg('image.jpeg') > bc(img) # default threshold value = 0.5

- Returns the fitted model and the fractal dimension in a list with attributes
*model*&*fdim*respectively. - Currently, the code assumes that the image is square with side equal to a power of two. I should get this changed.
- For best results the image should be black and white, so that thresholding is simpler. It will probably
also work with a greyscale image as well, though the threshold value might need to be tweaked (via the parameter
*thvalue*). - A little slow (~ 18 sec on an Athlon 1GHz, 1GB RAM machine for a 1024x1024 image using box sizes from 2x2 to 512x512)

**[**A modified version of this function is used for the online fractal dimension calculator which will evaluate the fractal dimension of an image and display the log-log plot. The app was developed for a diffusion limited aggregation experiment for the CHEM457 course at PSU. If you're interested in the PHP code that drives the application drop me a line.

**]**

*Proc. Nat. Acad. Sci.*, 2002,

**99**, 15869-15872 &

*J. Comp. Chem.*, 2003,

**24**, 1215-1221) as a C subroutine with R wrappers. Nearly all the parameters are configurable. You can get the package from here. To install, download the tgz and then do

R CMD INSTALL spe_1.0.tar.gzCurrently it does not implement the connected component method of evaluating a good neighborhood radius, so you're stuck with trial and error to get a good value of the neighborhood radius. In addition the package does not implement the pivot SPE method (

*J. Mol. Graph. Model.*, 2003,

**22**, 133-140).

The package provides two main functions: the SPE algorithm and a function to evaluate the Sammon stress using random sampling. You can also get a pure C implementation here

*IEEE Transactions on Computing*, 1973,

**C22**, 1025-1034) is fast clustering algorithm based on the

*k*NN algorithm that requires a single pass through the data. It is not restricted to the detection of globular clusters and has been shown to be good chemical clustering algorithm (REF). A nice description may be found here. One downside to the algorithm is that it requires the user to specify two paramters - j and k - which control the number of clusters generated. (REF) has a good discussion of the choice of these parameters. A few downsides are the quadratic nature of the algorithm (as it consider all pairs of clusters) and the two parameters.

You can download an R implementation of the algorithm here. As far as I can see its correct - my only doubt is when we have two objects that pass the tests but are already in different clusters (which are not singeltons) . My solution is to merge the two clusters to give a new one. As far as I can see this is the correct approach.

Currently its not in package form and so it contains two files:

- jpc.c - a C function that evaluates all pairwise Euclidean distances
- jpc.R - the actual R code that carries out the clustering

The R function will take a data matrix (observations in the rows and descriptors in the columns) or a distance matrix (which may be similarities rather than distances). To use the code you need to compile the C funtion by typing

R CMD SHLIB jpc.cat the command line. Then start R and do

>dyn.load('jpc.so') >source('jpc.R') >clus <- jpc(dat, j=3, k=1, diss=FALSE)If anybody seems interested I'll make an R package out of it. Otherwise I'll wait till I have something more to make a package out of.

*J. Chem. Inf. Comput. Sci.*,

**1999**,

*39*, 164-168) and Labute (

*Pacific Symposium on Biocomputing*,

**1999**,

*4*, 444-455). The code consists of a single R function that can be called as

bqsar(x,y,sigma=.25)Here x is a data.frame of descriptors (independent variables) which must be all numeric and y is a factor (with two levels: 1 & 0) indicating actives and inactives. Even though the correlations between individual descriptor maybe low the method performs better if a PCA is performed on x and the rotated x matrix is passed to this function (the PCA rotation leads to statistical independence of the descriptors which is an assumption of this method). Sigma is a smoothing parameter and currently there is no method to optimize it (though CV should be implemented in the future). The return value is a list of probabilities for each observation to be active.

The original algorithm uses histogram binning to determine the density at a given point from the estimated distribution of each descriptor (for actives and inactive seperately). The implementation here uses the dpih function from the KernSmooth package. This function provides a bin width which is then used to determine the bin boundaries by doing

bwidth <- dpih(z) b <- seq(min(z)-0.1, max(z)+0.1+bwidth, by=bwidth) hist(x, breaks=b)As a result, the number of bins is determined automatically.

The code can be obtained here.

The **rcdk** package aims to integrate the CDK, Jmol
and JChemPaint projects with R, allowing one to generate, manipulate and visualize
molecular information in a single environment.

**Requirements:**

- The rJava package. (Older versions [ 1.1, 1.0 ] used SJava and are no longer supported)
- R 2.0.0 and higher
- Java 1.5 If you use other versions of Java you will probably need to recompile the rcdk Java sources and rebuild the rcdk package.

**Installation**The code is currently available in two forms.

- Java sources + R package source: rcdk.tgz
- R package: rcdk_2.6.1.tar.gz

To compile the rcdk.jar library you will require Ant in addition to the components listed above (excluding R and rJava). To generate rcdk.jar, untar the source tarball and then

cd rcdk ant clean jarThis will create the rcdk.jar file. Note that for successful compilation you will need the CDK (including the JChemPaint) and Jmol libraries and any other dependencies they require in the CLASSPATH. Once you have the jar file you can then start using the functions in rcdk.R, visual.R and desc.R (found under

*package/rcdk/R/*.

However it is much more convenient to simply use the R package. So if you recompile
the jar file, copy it to *package/rcdk/inst/cont/* and then in the
*package* directory do

R CMD build rcdk R CMD INSTALL rcdk*.gz

**Usage**After starting up R load the rcdk package

library(rcdk)Once this is done you can do a variety of things:

# load multiple structure files moleculelist <- load.molecules(c('data/mol1.sdf','data/mol2.xyz','data/mol3.hin')) # draw a 2D structure molecule <- draw.molecule() # view 2D structures as a grid view.molecule.2d(moleculelist) # view a single 2D structure view.molecule.2d(moleculelist[[1]]) # get fingerprints get.fingerprint(moleculelist[[1]]) # generate SMILES for a set of molecules lapply(moleculelist, get.smiles) # view a 3D structure view.molecule.3d('structure.xyz') # view a 3D structure and supply a Jmol script view.molecule.3d('structure.xyz', 'select *; spacefill on;')

*CAVEATS:*

- Loading lots of molecules into memory will require a lot of RAM

*Further Possibilities:*

- Since the viewer can handle String representations of molecules, R's database capabilities make it very easy to view molecular structures stored in databases (which would be nice from a QSAR methodology point of view).
- Add more helper functions to make common tasks easy (in progress)