Hadoop Programs

Sources for the programs described on this page can be found here; the latest version of the code is always available from my GitHub repository. Some of the sources there may be incomplete; those programs are not described on this page. To compile some of the sources you will need a local installation of the CDK. All code is released under the LGPL.
Pharmacophore Searches
This is a simple Hadoop program that performs pharmacophore searches over a set of 3D structures. It is based on the CDK pharmacophore classes, and a test case on EC2 has been described here. The code was written against Hadoop 0.18.3.

mr-psearch.jar contains all the necessary classes (including a recent version of the CDK) to run the program. It requires three inputs: the path to the SD file containing the 3D structures, the output path, and a pharmacophore definition file. An example definition file can be found here. With these in hand, an example run might look like

hadoop fs -copyFromLocal structures.sdf input
hadoop jar mr-psearch.jar input output d1.xml
After the above invocation, the output can be viewed with
hadoop fs -cat output/part-00000
and you should see a list of molecule titles, each followed by the digit 1 (indicating a match). Since the code implements the Tool interface, you can pass the usual generic Hadoop options (such as -D property=value) on the command line. Thus, to run the program in "local" mode you could do
hadoop jar mr-psearch.jar -Dmapred.job.tracker=local input output d1.xml
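For a quick check of a run like the ones above, the match lines can be counted rather than read by eye. The following is a minimal sketch and not part of mr-psearch.jar: count_matches is a hypothetical helper name, and it assumes that the match flag is the last whitespace-separated field on each output line. That is consistent with the title-then-1 layout described above, but has not been verified against the program itself.

```shell
# count_matches (hypothetical helper): reads result lines on stdin and
# prints how many of them end in the match flag "1".
# Assumption: the flag is the last whitespace-separated field on each line.
count_matches() {
    awk '$NF == "1" { n++ } END { print n + 0 }'
}
```

With the function defined in the current shell, hadoop fs -cat output/part-00000 | count_matches prints the number of matching molecules.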
Unlike most MapReduce programs, this program does not actually need a reduce class (even though one is implemented). This is because the output of the map stage will always have unique keys, so there are never multiple values to aggregate for a given key. As a result, the run time can be improved by setting the number of reduce tasks to 0. At the same time, one can request a larger number of map tasks (Hadoop treats this value as a hint):
hadoop jar mr-psearch.jar -Dmapred.reduce.tasks=0 -Dmapred.map.tasks=10 input output d1.xml
Note that with zero reduce tasks, the output will consist of multiple files (one per map task) rather than the single file generated by the reduce stage.
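Once the output directory has been copied to the local file system, the part files from a map-only run can be stitched back together. The sketch below is one way to do that. merge_parts is a hypothetical helper name and results is an assumed local directory name; neither comes from the original programs.

```shell
# merge_parts (hypothetical helper): print every part-* file found in the
# local directory given as $1, in file-name order.
# Assumption: the output was first copied locally, for example with
#   hadoop fs -copyToLocal output results
merge_parts() {
    for f in "$1"/part-*; do
        # Skip the unexpanded pattern when no part files exist.
        if [ -f "$f" ]; then
            cat "$f"
        fi
    done
}
```

Running merge_parts results > matches.txt then gives a single local file. Hadoop's own hadoop fs -getmerge output matches.txt should achieve the same result directly from HDFS.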
Substructure Searching
This program is quite similar to the pharmacophore search program, but performs 2D substructure searches on a collection of SMILES strings. In addition to the input and output paths, it requires a SMARTS pattern. mr-subsearch.jar contains all the required classes for this application. An example run would be
hadoop jar mr-subsearch.jar input output "[R3]"
Note that input should be a SMILES file. If molecule titles are provided, the output will list the titles of the matching molecules. If no titles are provided in the input file, the SMILES themselves will be output for the matching molecules.
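Whether titles or SMILES appear in the output therefore depends on the input file, which can be checked before a run. The following sketch assumes the common layout of one molecule per line, with the SMILES first and an optional title after whitespace. That layout and the helper name count_untitled are assumptions, not something the original code documents.

```shell
# count_untitled (hypothetical helper): reads a SMILES file on stdin and
# prints the number of lines that carry a SMILES string but no title.
# Assumption: one molecule per line, "SMILES<whitespace>title".
count_untitled() {
    awk 'NF == 1 { n++ } END { print n + 0 }'
}
```

For example, count_untitled < molecules.smi (molecules.smi being a placeholder file name) prints 0 when every molecule has a title.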

As with the pharmacophore search, run time can be sped up by specifying -Dmapred.reduce.tasks=0 on the command line.