## Python Programs, Scripts and Snippets

This page is a collection of various programs and scripts I've written in Python. I've also included code snippets which show off some algorithm or concept which I finally got working :) Most things might not be very useful, and this page is as much for me as for other people. On the other hand you might find something that might help you out of a tight spot!
Using the CDK from Python | CATS Descriptors | Extracting pages from a PDF file | Spoofing the HTTP referer | Cheminformatics | Misc | PyBibTeX | PDF to HTML | Cellular Automata | Hydrophobicity Surfaces | KMail to Evolution | DNA representation

< name="splitob">

Splitting Molecules in OB
I've been dealing with a collection of SMILES, some of which are salts, some of which are non-covalent dimers and so on. The StripSalts() method is not a good solution for getting the largest fragment, such that it is neutral with the appropriate hydrogens added. Hence the following script which makes use of the Separate method.
from openbabel import *
import sys

ifile = sys.argv[1]
ofile = sys.argv[2]

conv = OBConversion()
conv.SetInAndOutFormats('smi', 'smi')
conv.OpenInAndOutFiles(ifile,ofile)

mol = OBMol()
nmol = 1

while ok:
frags = mol.Separate()
maxFrag = None
maxNAtom = -9999
for frag in frags:
if frag.NumAtoms() > maxNAtom:
maxNAtom = frag.NumAtoms()
maxFrag = OBMol(frag)

maxFrag.ConvertDativeBonds()
for atom in OBMolAtomIter(maxFrag):
atom.SetFormalCharge(0)
conv.Write(maxFrag)

mol = OBMol()

Cheminformatics
pathlen.py: Calculates all the paths of length N in a graph. The snippet includes a basic graph class representing an undirected graph. This was written as a prototype for a program which would calculate the number of paths (length N) in a molecular graph. dft.py is a class that will do a depth first traversal on a graph. It uses a slightly modified version of the Graph class from pathlen.py (since now vertices can be initialised with data). Currently the only thing it does on reaching a vertex is to print that it did :) breadthfirst.py is an implementation of a breadth first traversal and uses a slightly more useful implementation of the Graph class along with a very simple Queue class.

ringset.py is a snippet that will find the smallest ring in a graph that contains a specified vertex. It is based on a breadth first search and is described by Figueras in J. Chem. Inf. Comput. Sci., 1996, 36, 986-991

An implementation of the Figueras algorithm to determine the SSSR maybe found here. The main file is fig.py which contains the actual SSSR search algorithm alongwith the ring search algorithm described above. It utilizes graph, queue and set classes which are included in the tarball. The test example is the graph corresponding to cubane.

HinFile.py: A class which holds the details of a Hyperchem HIN files. Pretty much documented and contains several helper functions to manipulate the molecule data.
reorder.py is a program that can reorder a HIN file such that all the hydrogens come after the heavy atoms. This is required by ADAPT. Currently only handles HIN files which contain 1 molecule (ie only one set of mol - endmol markers)
splitsdf.py can be used to split a large SDF file containing multiple structures into individual SDF files each containing a single structure.

setbin.py and setbinmod.py: The latter provides function which are used by setbin.py to generate sets in QSAR studies (check out the ADAPT menu item, alongside) using the activity ranked binning method. setbin.py is a replacement for the original setbin program in the ADAPT suite. setbinmod.py is also used in ksomsetbin.py which uses the results from a KSOM run to generate QSAR sets (using the binning algorithm)

morgan.py: An implementation of the Morgan algorithm to get a canonical numbering of the atoms in a molecule. Currently it only takes the Hyperchem HIN file format as input. It requires the HinFile class to run.

Currently it is very simplistic:

• If two or more atoms have the same EC values, they are simply numbered serially (ie, ordering via atom type or bond type is ignored).
• Oscillating EC values and atoms with isospectral points are not handled
• The script considers the whole molecule (i.e., hydrogens are included)
The algorithm was implemented from the discussion given in Cheminformatics, A Textbook, Gasteiger & Engel (Eds.)

cml101.py: This is a very basic CML parser. Its more a proof of concept (the concept being, I could parse CML :) and uses a SAX parser. CML looks pretty interesting as a way to store molecule information independently of vendor formats. It parses CML 1.01 and has no error handling, so unless its supplied a correct CML file, it will probably crash. The information from a <molecule> element is stored in a Molecule class - read the source for more info on how the info is stored in the program.
I hope to fiddle some more with CML and the Python XML tools, so more code (and even a CML section at some time) should pop up

Miscellaneous
tess.py performs the tesselation of a unit sphere using the method of recursive subdivision. It can start from an initial tetrahedral, octahedral or icosahedral representation and the level of subdivision (i.e., number of times it will recurse) can be specified.

gengrid.py: This code snippet will generate a set of points on an N dimensional grid. It's useful as a brute force method of parameter optimization. It requires you to pass the axes which are defined as lists of numbers (basically each axis corresponds to a single parameter and the values of the axis correspond to the possible values of the parameter). Thus in the 2D case if our axes are defined as

x = [1,2,3]
y = [1,2,3]

The the code will return 9 points: (1,1) , (1,2) ... , (3,2) , (3,1). For the 3D case we would get 27 points. The problem with the code is that it runs in exponential time so for 6 axes of length 5 or more, it can take more than a few minutes to generate the grid. The use of copy.deepcopy() also seems very inelegant.

NOTE: gengrid2.py is a modification of gengrid.py suggested by Simon Place which generates a list comprehension expression which is then exec'ed. The resulting code is much faster (though still exponential) and is more elegant.

xbel2netscape.py: Converts Galeon bookmarks (output in XBEL format) to plain HTML. Uses the expat parser and is pretty simple. I need to fiddle around a little more with XML parsing.

run.py: A small script which provides a pop up text box (uses PyGTK) to run a program, without having to open up a terminal.

Exam Score System: This is a set of Python and shell scripts which manage the storage of test scores in a MySQL database. Its a pretty plain application, but shows how to get the values out of a HTML form, store and retrieve data from a MySQL database, and display the results as HTML.

Email -> HTML Email: A small function that uses the Python email module to parse a mbox style mailbox (KMail and mutt use this format) and generates an HTML page with the individual mails formatted. Converts links and email addresses to HTML links. Currently skips multipara mails & mails that the email module chokes on.

PyBibTeX - BibTeX Reference Manager
This is a PyGTK based program that is modeled on Endnote. Currently it has very limited functionality and simply reads in a .bib file and displays the author, title and year for each reference. Future work involves addition of BibTeX entries, multiple libraries, export to various formats, better packaging. NOTE: Its very very alpha stage work. It's essentially a proof of concept (the concept being the connection to VIM, see below). There are'nt even any README's!!

Install: Extract the tarball and run bibgui.py

Currently an interesting feature is the ability to be used from within a VIM session. This is possible with vim.py, a Python script that can be run by doing :pyfile vim.py on a visible buffer. It assumes that the VIM program has Python support compiled in. The procedure to use it is:

1. Start up PyBibTeX
3. Start up VIM with your document
4. Set the cursor at the point you want to insert a \cite{}
5. Select the citation(s) from the PyBibTeX GUI
6. Back in VIM go to normal mode and do :pyfile vim.py (assuming vim.py is in your current working directory).
As you can see its not a very efficient way to do things. What I would like is to have a toolbar button that can be clicked and the selected citations inserted.

The connection between VIM and PyBibTeX is managed through a Unix domain socket. Thus this setup could concievably be used to interface PyBibTeX (if it gets to be usable!) to other word processing packages (OOfice, KWord, AbiWord)

A small script used to convert a PDF file to a series of graphic images (currently jpeg) and create a slideshow out of them. I use it to convert my PDF presentations made with Prosper to HTML form. It requires the program 'convert' (from the ImageMagick suite) to be present in the users $PATH Tested with ImageMagick 5.3.8, Ghostscript 6.5.1 and Python 2.2 & 1.5.2 • Update (15/11/2002): I've updated the code to allow user defined image format and output directory. It also does some error checking. • Update (16/11/2002): A small update to change the page numbering scheme. In addition it will not barf on PDF's having a single page. • Update (17/11/2002): I got a report that the version of convert in ImageMagick 5.4.7 was'nt converting the PDF properly - it only converted the first page and then exited. So the code now uses GhostScript directly to do the PDF conversion. One thing is, that now the text from the PDF file, seems a little jagged. I'm not sure how I can get GhostScript to get me better looking text. If anybody nows the proper GhostScript directive, please let me know. (Note, the script requires the program mogrify from the ImageMagick suite, since GS cannot output GIF; thus the jpeg or png files from GS must be converted to GIF's by mogrify. • Update (18/11/2002): The script now has 2 new options. The -m switch can be used to specify which conversion method to use - ImageMagick's convert or direct GhostScript. So if you have a working convert (5.3.8 works) then use convert to get smoother text in the graphic output. If your version of convert only produces one page from the PDF, then use GhostScript as the conversion method. The default is ImageMagicks convert Using the -x and -y switches you can also specify the width and height respectively of the main slide image. • Update (21/11/2002): added an option to allow control of the form of the index (or TOC) page. Now with the -t or --toc option you can have images, text or both on the index page. Requires the LaTeX source of the PDF presentation to be present. Right now it just gets the slide title out of the source file. Furthermore, it assumes that the the whole presentation is contained in a single source file. • Update (22/11/2002): Fixed a small bug which caused a crash when parsing slide titles. Currently its fixed - partially. If your slide title has things like \emph{} etc in it the extracted title will not be correct. I need to read up a little more on regexes :( Also fixed a Python 2.2 idiom so that it works with Python 1.5.2 • Update (26/11/2002): Thanks to Alain Bannay for sending in a patch to take into account slides generated using the \part command as well an improved regex to get slide titles of the form \begin{slide}{\command{Title}} etc (ie multiply nested LaTeX commands). • Update (06/07/2003): Thanks to Francesco Potorti for pointing out a typo. Also removed some debug output from the TOC generation routine • Update (28/10/2003): Thanks to Ki Joo Kim for pointing out that the slides are sorted according to filenames so slide10.jpg comes before slide2.jpg. This is corrected so that slides are in the proper order He also suggested a few more parameters for GS to improve PNG output • Update (11/04/2004): Thanks to Paolo Costa for pointing out a problem in name generation when using ImageMagick to convert PDF's to other graphic formats. • Update (17/11/2005) - Thanks to Caner Kazanci for pointing out an error when Imagemagick was used to convert the PDF. The resultant images were misnamed and not recognized by subsequent code to actually put the HTML version together A cellular automata program that uses PyGTK to display the grid of changing cells. Its my first stab at CA's and is written so that you can easily write new rulesets. You need the file CARules.py - which contains the various rulesets. It contains instructions on how to write new rule functions. Good Features: • User programmable rule sets • The grid can be randomly initialized with a user defined percentage of living and dead cells, or the initial state can be loaded from a file • Arbitrary sized grid, depending only on memory available Bad Features: • Currently the rules cannot have any memory of the previous states of the grid. • Redraws the grid for each generation - could/should be optimized. • There is no way to save a given initial grid UPDATE (23/04/04) hydrosurf.py and color_bh.py have been modified to transform the hydrophobicity values onto [-1,1]. Thus multiple molecules are now colored on the same scale (blue corresponds to -1 and red to 1, using a blue to red color scale) This program is used to combine the data from Brian Mattioni's exlogp.perl script and the actual molecule file (a Hyperchem HIN file in this case) to produce a PDB file with the hydrophobicity values in the B-factor column. The resultant PDB file can be viewed in PyMOL - more importantly PyMOL can use the B-factor column to color code a molecular surface. Thus we can view hydrophobicity surfaces. To color a molecule using a the B-factor column you need a Python script color_bh.py which is a slightly modified version of color_b.py. My modified version fixes the minimum and maximum B values to to -1 and 1 and scales the calculated atomic hydrophobicity values to this scale There are two ways to color code molecules with hydrophobicity data. Firstly, you can generate the required PDB file using pdbsurf.py and then load it into PyMOL and invoke color_bh as shown below. The steps to view a hydrophobic surface are: 1. ./pdbsurf.py -d hydro.out -m mol.hin 2. ./pymol 3. Load the PDB file by typing: load mol.pdb 4. Load the coloring script: run color_b.py 5. To color code the surface: color_bh('(all)', ramp=0, rainbow=1,nbins=2) Some points to note: • To run exlogp.perl simply place it in the directory containing the HIN files and run the program • In step 1, hydro.out is the output file produced by exlogp.perl and mol.hin is the HIN file that exlogp.perl used. The resultant PDB file is called mol.pdb • Make sure that color_bh.py is in the directory that you start PyMOL in. • In step 5, nbins controls how many colors it will use. • ramp = 0 indicates that the binning should be between equal B (ie hydrophobicity in this case) values, rather than equal number of atoms in each bin (this screwed up previous versions!). However, you can automate the above by using hydrosurf.py. This is a Python script that can be run from within PyMOL and handles the conversion to PDB and coloring of the resultant molecule automatically. Requires the color_bh.py script as before. To use it load up PyMOL and then do 1. run hydrosurf.py 2. hydrosurf('data.dat','mol.hin',0,2) data.dat and mol.hin are your hydropbobicity data file and HIN file respectively. The 0 is to color the molecule using hues from blue to magenta; if you want colors of the rainbow use 1. The 2 is the number of colors to use - it will draw a colored (coded by the hydrophobicity values) line image of the molecule. Use PyMOL to convert it to a surface view (show surface). You can see movies of propanol and 1-propane thiol color coded using hydrophobicity values. Using Pipes in Python The following snippet is an example of using popen3() to control an interactive program: import os,time o,i,e = os.popen3("/usr/local/adapt/bin/descmng -g -f qnetin") o.write("\n\n%s\n" % (desc)) o.close() time.sleep(2)  The call to sleep() is useful in some cases where the called program (like gnuplot) might take a second or two to do its work. This program converts a KMail mail directory (in Maildir format) to the Evolution mbox fomat, maintaining folder structure. You can specify that certain KMail folders can be ignored (the default being 'trash','inbox','outbox' and 'sent-mail') The code effectively only parses messages in Maildir format and simply copies mbox style folders to the corresponding Evolution directory. When it faces a Maildir message that it cannot parse it will log the message filename to mail.log Th script uses the email class in Python 2.2 and so won't work with 1.5.2 (unless you patched it). As a result, if the email class can't parse a message theres nothing that I can do! Finally, if you have converted a *large* mail store then Evolution will take some time to initially load and display the messages. This is because this script does not do any indexing. Hence Evolution must create indices the first time it loads the new folders. The simplest example is: kmail2evo.py  This will convert all KMail mailstore under$HOME/Mail to and Evolution mbox format under $HOME/evolution/local and will maintain the whole folder heirarchy that you had in KMail Before writing this I looked around for converters - the only thing I could find was CEIConvert. However it did'nt seem to handle my heirarchy properly - though a nice feature is that it does generate indices. Update (16/05/2005): Thanks to Achim Vierheilig for providing an update which allows a KMail mail store to be converted to a Thunderbird mail store. Update (11/09/2003): Thanks to David Fenyes for updating the code so that it avoids adding the 'From' and 'Date' portions of the parsed message if they're not strings. DNA Representations The program dna_2d.py converts a sequence of bases to the simple 2D representation described by Nandy et al. (Curr. Sci.,1994,66,309). It can output the coordinates of the 2D representation as well generate graphs using Gnuplot. Spoofing the HTTP Referer header In my attempts to combat referer spam I use the URL rewriting engine provided by Apache (mod_rewrite). This method uses regxes to handle incoming requests whose Referer header or User-Agent header (amongst other headers) indicate that its a spammer (or a leacher). This script allows you to connect to your webserver with a spoofed Referer header and/or spoofed User-Agent to check that your regexes are working. Usage is python refcheck.py -h www.someserver.com -r http://www.somereferer.com \ -u "A User Agent" The host name is mandatory and if User-Agent and Referer headers are not specified they will be empty. The path that the script requests by default is '/' but this can be changed on the command line with the -p switch. The output is simply the status code and associated reason from the response. Extracting pages from a PDF file This program allows you to extract specific pages or ranges of pages from a PDF file and dumps the extracted pages to another PDF file. The program is simply a wrapper around pdflatex and the pdfpages package. Currently, only the page specification of the pdfpages package is supported, though I probably should add the other options as well. No error checking is carried out on the page specification and so if an invalid specification is supplied the program will hang as pdftex will not exit back to the command line. Example usage is extractpdfpages.py -p "3-10,15-25" file.pdf The extracted pages are placed in a file called extract.pdf by default, though the output file can be specified on the command line. An alternative method that does not need pdflatex is to use the ReportLab toolkit. However it seems that the ability to extract specific pages from a PDF is only available in the for-fee version of the library. Another option which I don't really know much about is the pdftk toolkit which is Java based (but Jython would come to the rescue in this case). CATS Descriptors The CATS family of descriptors were described by Schneider [1, 2] . The descriptors are pharmacophoric descriptors and come in three variants. CATS2D are topological pharmacophoric decriptors, CATS3D uses the same pharmacophores as CATS2D but considers 3D Euclidean distances and finally SURFCAT extends CATS3D to take into account surfaces. cats2d.py is an implementation of the CATS2D descriptors in Python and uses the OpeneEye toolkit. The pharmacophore definitions are based on those described by Renner, S. et al. (in Pharmacophores and Pharmacophore Searches, Langer, T. and Hoffmann, R.D. (Eds.), Wiley-VCH, 2006), with some changes. The H-bond donor and H-bond acceptor patterns were taken from the Daylight web site and for the "carbon atom adjacent only to another carbon atom" I used the following SMARTS: [C;D2;$(C(=C)(=C))], [C;D3;$(C(=C)(C)(C))], [C;D4;$(C(C)(C)(C)(C))],
[C;D3;H1;$(C(C)(C)(C))], [C;D2;H2;$(C(C)(C))]

It is easy to replace the SMARTS for any of the given pharmacophore groups (H-bond donor, acceptor, positive, negative and lipophilic) within the code. The maximum topological path length is 9 as described by Renner, S. et al. but this can be changed on the command line. The program can caluclate three variants of the CATS2D descriptors: the raw versions, scaled by the number of heavy atoms and scaled by the co-occurence of the individual PPP's. By default the raw descriptors (no scaling) are calculated, but this can be changed on the command line
Using the CDK from Python
The CDK is a Java project that provides a wide variety of functionality for cheminformatics. A number of examples of its use in Java programs can be found here.

A recent article in CDK News (O' Boyle, N.; CDK News, 2005, 2(2), 40-42) described the use of the CDK libraries in Python using Jython.

An example of the use of Jython to allow the inclusion of Java libraries in a Python program is available here. The function of this program is to evaluate the total surface area for a set of molecules specified in a SD file. Since the CDK surface area algorithm is numerical it uses a probe radius and a tesselation level, both of which can be specified on the command line. An example of its usage is:

jython calcSA.jy -r 1,2,3 -t 3,4 structures.sdf