Python Programs, Scripts and Snippets
Using the CDK from Python | CATS Descriptors | Extracting pages from a PDF file | Spoofing the HTTP referer | Cheminformatics | Misc | PyBibTeX | PDF to HTML | Cellular Automata | Hydrophobicity Surfaces | KMail to Evolution | DNA representation
< name="splitob">
from openbabel import * import sys ifile = sys.argv[1] ofile = sys.argv[2] conv = OBConversion() conv.SetInAndOutFormats('smi', 'smi') conv.OpenInAndOutFiles(ifile,ofile) mol = OBMol() ok = conv.Read(mol) nmol = 1 while ok: frags = mol.Separate() maxFrag = None maxNAtom = -9999 for frag in frags: if frag.NumAtoms() > maxNAtom: maxNAtom = frag.NumAtoms() maxFrag = OBMol(frag) maxFrag.ConvertDativeBonds() for atom in OBMolAtomIter(maxFrag): atom.SetFormalCharge(0) maxFrag.AddHydrogens() conv.Write(maxFrag) mol = OBMol() ok = conv.Read(mol)
ringset.py is a snippet that will find the smallest ring in a graph that contains a specified vertex. It is based on a breadth first search and is described by Figueras in J. Chem. Inf. Comput. Sci., 1996, 36, 986-991
An implementation of the Figueras algorithm to determine the SSSR maybe found here. The main file is fig.py which contains the actual SSSR search algorithm alongwith the ring search algorithm described above. It utilizes graph, queue and set classes which are included in the tarball. The test example is the graph corresponding to cubane.
HinFile.py: A class which holds the
details of a Hyperchem HIN
files. Pretty much documented and contains several helper
functions to manipulate the molecule data.
reorder.py is a program that can
reorder a HIN file such that all the hydrogens come after the
heavy atoms. This is required by ADAPT. Currently only handles
HIN files which contain 1 molecule (ie only one set of mol -
endmol markers)
splitsdf.py can be used to split a large
SDF file containing multiple structures into individual SDF files
each containing a single structure.
setbin.py and setbinmod.py: The latter provides function which are used by setbin.py to generate sets in QSAR studies (check out the ADAPT menu item, alongside) using the activity ranked binning method. setbin.py is a replacement for the original setbin program in the ADAPT suite. setbinmod.py is also used in ksomsetbin.py which uses the results from a KSOM run to generate QSAR sets (using the binning algorithm)
morgan.py: An implementation of the Morgan algorithm to get a canonical numbering of the atoms in a molecule. Currently it only takes the Hyperchem HIN file format as input. It requires the HinFile class to run.
Currently it is very simplistic:
- If two or more atoms have the same EC values, they are simply numbered serially (ie, ordering via atom type or bond type is ignored).
- Oscillating EC values and atoms with isospectral points are not handled
- The script considers the whole molecule (i.e., hydrogens are included)
cml101.py: This is a very basic
CML parser. Its more a proof of
concept (the concept being, I could parse CML :) and uses a SAX parser.
CML looks pretty interesting as a way to store molecule
information independently
of vendor formats. It parses CML 1.01 and has no error handling, so unless
its supplied a correct CML file, it will probably crash. The information from a
<molecule> element is stored in a Molecule class - read the source for more
info on how the info is stored in the program.
I hope to fiddle some more with CML and the Python XML tools, so more code
(and even a CML section at some time) should pop up
gengrid.py: This code snippet will generate a set of points on an N dimensional grid. It's useful as a brute force method of parameter optimization. It requires you to pass the axes which are defined as lists of numbers (basically each axis corresponds to a single parameter and the values of the axis correspond to the possible values of the parameter). Thus in the 2D case if our axes are defined as
y = [1,2,3]
The the code will return 9 points: (1,1) , (1,2) ... , (3,2) , (3,1). For the 3D case we would get 27 points. The problem with the code is that it runs in exponential time so for 6 axes of length 5 or more, it can take more than a few minutes to generate the grid. The use of copy.deepcopy() also seems very inelegant.
NOTE: gengrid2.py is a modification of gengrid.py suggested by Simon Place which generates a list comprehension expression which is then exec'ed. The resulting code is much faster (though still exponential) and is more elegant.
xbel2netscape.py: Converts Galeon bookmarks (output in XBEL format) to plain HTML. Uses the expat parser and is pretty simple. I need to fiddle around a little more with XML parsing.
run.py: A small script which provides a pop up text box (uses PyGTK) to run a program, without having to open up a terminal.
Exam Score System: This is a set of Python and shell scripts which manage the storage of test scores in a MySQL database. Its a pretty plain application, but shows how to get the values out of a HTML form, store and retrieve data from a MySQL database, and display the results as HTML.
Email -> HTML Email: A small function that uses the Python email module to parse a mbox style mailbox (KMail and mutt use this format) and generates an HTML page with the individual mails formatted. Converts links and email addresses to HTML links. Currently skips multipara mails & mails that the email module chokes on.
Install: Extract the tarball and run bibgui.py
Currently an interesting feature is the ability to be used from within a VIM session. This is possible with vim.py, a Python script that can be run by doing :pyfile vim.py on a visible buffer. It assumes that the VIM program has Python support compiled in. The procedure to use it is:
- Start up PyBibTeX
- Load a .bib file
- Start up VIM with your document
- Set the cursor at the point you want to insert a \cite{}
- Select the citation(s) from the PyBibTeX GUI
- Back in VIM go to normal mode and do :pyfile vim.py (assuming vim.py is in your current working directory).
The connection between VIM and PyBibTeX is managed through a Unix domain socket. Thus this setup could concievably be used to interface PyBibTeX (if it gets to be usable!) to other word processing packages (OOfice, KWord, AbiWord)
Tested with ImageMagick 5.3.8, Ghostscript 6.5.1 and Python 2.2 & 1.5.2
- Update (15/11/2002): I've updated the code to allow user defined image format and output directory. It also does some error checking.
- Update (16/11/2002): A small update to change the page numbering scheme. In addition it will not barf on PDF's having a single page.
-
Update (17/11/2002): I got a report that the version of
convert in ImageMagick 5.4.7 was'nt converting the PDF properly
- it only converted the first page and then exited. So the code
now uses GhostScript directly to do the PDF conversion.
One thing is, that now the text from the PDF file, seems a little jagged. I'm not sure how I can get GhostScript to get me better looking text. If anybody nows the proper GhostScript directive, please let me know. (Note, the script requires the program mogrify from the ImageMagick suite, since GS cannot output GIF; thus the jpeg or png files from GS must be converted to GIF's by mogrify. -
Update (18/11/2002): The script now has 2 new
options. The -m switch can be used to specify which
conversion method to use - ImageMagick's convert or
direct GhostScript. So if you have a working convert (5.3.8
works) then use convert to get smoother text in the graphic
output. If your version of convert only produces one page
from the PDF, then use GhostScript as the conversion method.
The default is ImageMagicks convert
Using the -x and -y switches you can also specify the width and height respectively of the main slide image.
- Update (21/11/2002): added an option to allow control of the form of the index (or TOC) page. Now with the -t or --toc option you can have images, text or both on the index page. Requires the LaTeX source of the PDF presentation to be present. Right now it just gets the slide title out of the source file. Furthermore, it assumes that the the whole presentation is contained in a single source file.
- Update (22/11/2002): Fixed a small bug which caused a crash when parsing slide titles. Currently its fixed - partially. If your slide title has things like \emph{} etc in it the extracted title will not be correct. I need to read up a little more on regexes :( Also fixed a Python 2.2 idiom so that it works with Python 1.5.2
- Update (26/11/2002): Thanks to Alain Bannay for sending in a patch to take into account slides generated using the \part command as well an improved regex to get slide titles of the form \begin{slide}{\command{Title}} etc (ie multiply nested LaTeX commands).
- Update (06/07/2003): Thanks to Francesco Potorti for pointing out a typo. Also removed some debug output from the TOC generation routine
- Update (28/10/2003): Thanks to Ki Joo Kim for pointing out that the slides are sorted according to filenames so slide10.jpg comes before slide2.jpg. This is corrected so that slides are in the proper order He also suggested a few more parameters for GS to improve PNG output
- Update (11/04/2004): Thanks to Paolo Costa for pointing out a problem in name generation when using ImageMagick to convert PDF's to other graphic formats.
- Update (17/11/2005) - Thanks to Caner Kazanci for pointing out an error when Imagemagick was used to convert the PDF. The resultant images were misnamed and not recognized by subsequent code to actually put the HTML version together
Good Features:
- User programmable rule sets
- The grid can be randomly initialized with a user defined percentage of living and dead cells, or the initial state can be loaded from a file
- Arbitrary sized grid, depending only on memory available
- Currently the rules cannot have any memory of the previous states of the grid.
- Redraws the grid for each generation - could/should be optimized.
- There is no way to save a given initial grid
This program is used to combine the data from Brian Mattioni's exlogp.perl script and the actual molecule file (a Hyperchem HIN file in this case) to produce a PDB file with the hydrophobicity values in the B-factor column. The resultant PDB file can be viewed in PyMOL - more importantly PyMOL can use the B-factor column to color code a molecular surface. Thus we can view hydrophobicity surfaces.
To color a molecule using a the B-factor column you need a Python script color_bh.py which is a slightly modified version of color_b.py. My modified version fixes the minimum and maximum B values to to -1 and 1 and scales the calculated atomic hydrophobicity values to this scale
There are two ways to color code molecules with hydrophobicity data. Firstly, you can generate the required PDB file using pdbsurf.py and then load it into PyMOL and invoke color_bh as shown below. The steps to view a hydrophobic surface are:
- ./pdbsurf.py -d hydro.out -m mol.hin
- ./pymol
- Load the PDB file by typing: load mol.pdb
- Load the coloring script: run color_b.py
- To color code the surface: color_bh('(all)', ramp=0, rainbow=1,nbins=2)
- To run exlogp.perl simply place it in the directory containing the HIN files and run the program
- In step 1, hydro.out is the output file produced by exlogp.perl and mol.hin is the HIN file that exlogp.perl used. The resultant PDB file is called mol.pdb
- Make sure that color_bh.py is in the directory that you start PyMOL in.
- In step 5, nbins controls how many colors it will use.
- ramp = 0 indicates that the binning should be between equal B (ie hydrophobicity in this case) values, rather than equal number of atoms in each bin (this screwed up previous versions!).
- run hydrosurf.py
- hydrosurf('data.dat','mol.hin',0,2)
You can see movies of propanol and 1-propane thiol color coded using hydrophobicity values.
import os,time o,i,e = os.popen3("/usr/local/adapt/bin/descmng -g -f qnetin") o.write("\n\n%s\n" % (desc)) o.close() time.sleep(2)The call to sleep() is useful in some cases where the called program (like gnuplot) might take a second or two to do its work.
The code effectively only parses messages in Maildir format and simply copies mbox style folders to the corresponding Evolution directory. When it faces a Maildir message that it cannot parse it will log the message filename to mail.log Th script uses the email class in Python 2.2 and so won't work with 1.5.2 (unless you patched it). As a result, if the email class can't parse a message theres nothing that I can do! Finally, if you have converted a *large* mail store then Evolution will take some time to initially load and display the messages. This is because this script does not do any indexing. Hence Evolution must create indices the first time it loads the new folders. The simplest example is:
kmail2evo.pyThis will convert all KMail mailstore under $HOME/Mail to and Evolution mbox format under $HOME/evolution/local and will maintain the whole folder heirarchy that you had in KMail Before writing this I looked around for converters - the only thing I could find was CEIConvert. However it did'nt seem to handle my heirarchy properly - though a nice feature is that it does generate indices.
Update (16/05/2005): Thanks to Achim Vierheilig for providing
an update which allows a KMail mail store to be converted to a
Thunderbird mail store.
Update (11/09/2003): Thanks to David Fenyes for
updating the code so that it avoids adding the 'From' and
'Date' portions of the parsed message if they're not strings.
python refcheck.py -h www.someserver.com -r http://www.somereferer.com \ -u "A User Agent"The host name is mandatory and if User-Agent and Referer headers are not specified they will be empty. The path that the script requests by default is '/' but this can be changed on the command line with the -p switch. The output is simply the status code and associated reason from the response.
Example usage is
extractpdfpages.py -p "3-10,15-25" file.pdfThe extracted pages are placed in a file called extract.pdf by default, though the output file can be specified on the command line.
An alternative method that does not need pdflatex is to use the ReportLab toolkit. However it seems that the ability to extract specific pages from a PDF is only available in the for-fee version of the library. Another option which I don't really know much about is the pdftk toolkit which is Java based (but Jython would come to the rescue in this case).
cats2d.py is an implementation of the CATS2D descriptors in Python and uses the OpeneEye toolkit. The pharmacophore definitions are based on those described by Renner, S. et al. (in Pharmacophores and Pharmacophore Searches, Langer, T. and Hoffmann, R.D. (Eds.), Wiley-VCH, 2006), with some changes.
The H-bond donor and H-bond acceptor patterns were taken from the Daylight web site and for the "carbon atom adjacent only to another carbon atom" I used the following SMARTS:
[C;D2;$(C(=C)(=C))], [C;D3;$(C(=C)(C)(C))], [C;D4;$(C(C)(C)(C)(C))], [C;D3;H1;$(C(C)(C)(C))], [C;D2;H2;$(C(C)(C))]It is easy to replace the SMARTS for any of the given pharmacophore groups (H-bond donor, acceptor, positive, negative and lipophilic) within the code. The maximum topological path length is 9 as described by Renner, S. et al. but this can be changed on the command line. The program can caluclate three variants of the CATS2D descriptors: the raw versions, scaled by the number of heavy atoms and scaled by the co-occurence of the individual PPP's. By default the raw descriptors (no scaling) are calculated, but this can be changed on the command line
A recent article in CDK News (O' Boyle, N.; CDK News, 2005, 2(2), 40-42) described the use of the CDK libraries in Python using Jython.
An example of the use of Jython to allow the inclusion of Java libraries in a Python program is available here. The function of this program is to evaluate the total surface area for a set of molecules specified in a SD file. Since the CDK surface area algorithm is numerical it uses a probe radius and a tesselation level, both of which can be specified on the command line. An example of its usage is:
jython calcSA.jy -r 1,2,3 -t 3,4 structures.sdf