Python Programs, Scripts and Snippets

Hadoop Programs

Sources for the programs described on this page can be found here. The latest version of the code is always available from my GitHub repository. Some of the sources may be incomplete; those are not described here. To compile some of the sources you will need a local installation of the CDK. All code is released under the LGPL.

Pharmacophore Searches

This is a simple Hadoop program that performs pharmacophore searches over a set of 3D structures. It is based on the CDK pharmacophore classes, and a test case on EC2 has been described here. The code was written against Hadoop 0.18.3.

mr-psearch.jar contains all the classes needed to run the program (including a recent version of the CDK). It requires three inputs: the path to the SD file containing the 3D structures, the output path, and a pharmacophore definition file. An example definition file can be found here. With these in hand, an example run might look like

hadoop fs -copyFromLocal structures.sdf input
hadoop jar mr-psearch.jar input output d1.xml

Based on the above invocation, the output can be viewed by doing

hadoop fs -cat output/part-00000

and you should see a list of molecule titles, each followed by the digit 1 (indicating a match). Since the code implements the Tool interface, you can pass the usual generic Hadoop options (such as -D settings) on the command line. Thus, to run the program in "local" mode you could do

hadoop jar mr-psearch.jar -Dmapred.job.tracker=local input output d1.xml
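As a quick check on the output file shown earlier, the matches can be counted with a small shell helper. This is only a sketch: it assumes each output line is a molecule title and the digit 1 separated by a tab (the default for Hadoop text output, but not confirmed for this program), and count_matches is a made-up name, not something shipped in mr-psearch.jar.

```shell
# Hypothetical helper: count lines on stdin whose last tab-separated field is 1.
# Assumes "title<TAB>1" lines; adjust -F if the separator differs.
count_matches() {
    awk -F'\t' '$NF == "1" { n++ } END { print n + 0 }'
}

# Intended use (requires a working Hadoop installation):
#   hadoop fs -cat output/part-00000 | count_matches
```

Because the helper only reads standard input, it works equally well on a local copy of the output file.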

Unlike most MapReduce programs, this program does not actually need a reduce class (even though one is implemented). This is because the output of the map stage will always have unique keys, so there is no need to aggregate multiple values for a given key. As a result, the run time can be improved by setting the number of reduce tasks to 0. At the same time, one can request a larger number of map tasks (Hadoop treats this value as a hint):

hadoop jar mr-psearch.jar -Dmapred.reduce.tasks=0 -Dmapred.map.tasks=10 input output d1.xml

Note that with zero reduce tasks, the output consists of multiple files (one per map task) rather than the single file generated by the reduce stage.
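With a map-only run it can be handy to stitch the pieces back into one file. The following is a rough sketch rather than part of the program: it assumes the output directory has first been copied to the local disk (for instance with hadoop fs -copyToLocal), that the pieces follow the part-NNNNN naming seen above, and merge_parts and local_output are illustrative names.

```shell
# Hypothetical helper: concatenate all part-* files from a local copy of the
# job output directory ($1) into a single file ($2).
merge_parts() {
    # The shell expands part-* in sorted order, so part-00000 comes first.
    cat "$1"/part-* > "$2"
}

# Intended use (requires a working Hadoop installation):
#   hadoop fs -copyToLocal output local_output
#   merge_parts local_output results.txt
```

Since the map output keys are unique, simple concatenation is enough here; no further aggregation of the merged file is needed.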

Substructure Searching