Hadoop Programs
Pharmacophore Searches
This is a simple Hadoop program that performs pharmacophore searches over a set of 3D structures. It is based on the CDK pharmacophore classes, and a test case on EC2 has been described here. The code was written against Hadoop 0.18.3.
mr-psearch.jar contains all the classes needed to run the program (including a recent version of the CDK). It requires three inputs: the path to the SD file containing the 3D structures, the output path, and a pharmacophore definition file. An example definition file can be found here. With these in hand, an example run might look like
    hadoop fs -copyFromLocal structures.sdf input
    hadoop jar mr-psearch.jar input output d1.xml

Based on the above invocation, the output can be viewed by doing
    hadoop fs -cat output/part-00000

and you should see a list of molecule titles, each followed by the digit 1 (indicating a match). Since the code implements the Tool interface, you can pass the usual generic Hadoop options (such as -D property=value) on the command line. Thus, to run the program in "local" mode you could do
    hadoop jar mr-psearch.jar -Dmapred.job.tracker=local input output d1.xml

Unlike most MapReduce programs, this one does not actually need a reduce class (even though one is implemented). The output of the map stage always has unique keys, so there are never multiple values to aggregate for a given key. As a result, the run time can be improved by setting the number of reduce tasks to 0. At the same time, one can request a larger number of map tasks:
    hadoop jar mr-psearch.jar -Dmapred.reduce.tasks=0 -Dmapred.map.tasks=10 input output d1.xml

Note that with zero reduce tasks the output will consist of multiple files (one per map task) rather than the single file generated by the reduce stage.
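
When the job runs without a reduce stage, the matches end up spread over several part files, so it can be handy to gather them into one list. The following is a minimal sketch of one way to do that, not part of mr-psearch.jar. It assumes the output directory has already been copied to the local filesystem (for example with hadoop fs -copyToLocal), that the files are named part-*, and that each record is a molecule title and the digit 1 separated by a tab. The separator is a guess based on Hadoop's default text output and may need adjusting. The class name MergeParts and the method matchedTitles are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical helper: gathers matching molecule titles from a local copy
// of the job output directory. Not part of mr-psearch.jar.
public class MergeParts {

    // Reads every part-* file in outputDir (in name order) and returns the
    // titles of all records whose value is 1.
    static List<String> matchedTitles(Path outputDir) throws IOException {
        List<Path> parts = new ArrayList<>();
        try (DirectoryStream<Path> stream =
                 Files.newDirectoryStream(outputDir, "part-*")) {
            for (Path p : stream) {
                parts.add(p);
            }
        }
        Collections.sort(parts); // part-00000, part-00001, ...

        List<String> titles = new ArrayList<>();
        for (Path part : parts) {
            for (String line : Files.readAllLines(part, StandardCharsets.UTF_8)) {
                // Assumed record layout: <title><TAB>1
                int tab = line.lastIndexOf('\t');
                if (tab < 0) {
                    continue; // skip blank or unexpected lines
                }
                if ("1".equals(line.substring(tab + 1).trim())) {
                    titles.add(line.substring(0, tab));
                }
            }
        }
        return titles;
    }

    public static void main(String[] args) throws IOException {
        // Defaults to a local directory named "output" if no argument is given.
        Path dir = Paths.get(args.length > 0 ? args[0] : "output");
        for (String title : matchedTitles(dir)) {
            System.out.println(title);
        }
    }
}
```

Compiled with javac and run as java MergeParts output, this prints one matching title per line. Because the map output keys are unique, simple concatenation is enough here and no further grouping is needed.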