This script is a Python implementation of the CE algorithm pioneered by Drs. Shindyalov and Bourne (See References). It is a fast, accurate structure-based protein alignment algorithm. There are a few changes from the original code (See Notes), and "fast" depends on your machine and the implementation. That is, on my machine --- a relatively fast 64-bit machine --- I can align two 400+ amino acid structures in about 0.300 s with the C++ implementation. In Python however, two 165 amino acid proteins took about 35 seconds!
When coupled to the Kabsch algorithm, this should be able to align any two protein structures, using just the alpha carbon coordinates.
Comparison to PyMol
Why should you use this?
PyMol's structure alignment algorithm is fast and robust. However, its first step is to perform a sequence alignment of the two selections. Thus, proteins in the twilight zone or those having a low sequence identity, may not align well. Because CE is a structure-based alignment, this is not a problem. Look at the following example. The image at LEFT was the result of CE-aligning two proteins (1C0M to 1BCO). The result is 88 aligned (alpha carbons) residues (not atoms) at 2.78 Angstroms. The image on the RIGHT shows the results from PyMol's align command: an alignment of 221 atoms (not residues) at an RMSD of 15.7 Angstroms. To make the alignment easier to see, cealign (actually the Kabsch code) colors the aligned residues differently.
CEAlign has the semantic, and syntactic formalism of
cealign MASTER, TARGET
where a post-condition of the algorithm is that the coordinates of the MASTER protein are unchanged. This allows for easier multi-protein alignments. For example,
cealign 1AUE, 1BZ4 cealign 1AUE, 1B68 cealign 1AUE, 1A7V cealign 1AUE, 1CPR
will superimpose all the TARGETS onto the MASTER.
cealign 1cll and i. 42-55, 1ggz and c. A cealign 1kao, 1ctq cealign 1fao, 1eaz
Multiple Structure Alignments
To use cealign to do a multiple structure alignment, simple load all your proteins and execute the following command:
for x in cmd.get_names("*"): cealign("MASTER", x)
where MASTER is the protein to align all others to. For example, load the following proteins: 1A15, 1EOT, 1ESR, 1F9R, 1G2S, 1NR4, 1QE6. Now, execute the command,
for x in cmd.get_names("*"): cealign("1A15", x)
note: Windows installer coming soon.
- Python 2.4+ with distutils
- C compiler
- uncompress the distribution file cealign-VERSION.tgz
- cd cealign-VERSION
- sudo python setup.py install
- insert "run DIR_TO_CEALIGN/cealign.py" and "run DIR_TO_CEALIGN/qkabsch.py" into your .pymolrc file, or just run the two Python scripts by hand.
- load some molecules
- run, cealign molecule1, molecule2
Please unpack and read the documentation. All comments/questions should be directed to Jason Vertrees (javertre _at_ utmb ...dot... edu)
CE Align v0.2 released.
Found a "feature" I don't like; so, I fixed it. The new version of cealign has the formal syntax of
cealign MASTER, TARGET
and cealign is now guaranteed not to change the coordinates of the MASTER protein. This is useful is you want to align 10 structures on top of one. Before, cealign would center the two molecules; now it just overlaps the TARGET onto the MASTER.
CE Align V0.1 released.
The first version of the C-module code is complete. I fixed handling (multiple) missing residues, the centering problem, and the problem of multiple chains. I'll package and provide the code soon.
Trying to remedy missing residues. If a user's selections are protA and i. 10-20 and prot2 and i. 10-20, and if prot2 is missing residue 14, the SVD is undefined/inappropriate. I have to weed out residues that don't have partners in the PDB file. Alignments do this implicitly since the XYZ values it sees are only the ones with coordinates. Also, CE only works on individual chains. If someone can find a consistent method to map residues and chains to ints and then back to residues and chains -- that might work. Ha!
If more than a week lapses after this comment, I'll just wrap up the code and post the first version. There seems to be some interest in this plugin, so the more eyes the easier it may be to fix the bugs. I will also need testers for the Mac and Windows editions.
Yeah! The C code that plugs into PyMol has been completed. It's a little slower than the plain C++ code I wrote, but that's what you get when passing data from PyMol to Python to C, fiddle with it, pass it back to Python to PyMol for some more quick math. The alignment times for the two proteins mentioned below (1B50 and 1C0M) on my machine with the new C module is about 1-3 second (with a full CPU load for other intensive tasks running in the background; this shows great improvement over the pure Python alignment times). Once the code is cleaned up (and I'm not too embarrassed to post it) and some bugs are worked out, I'll post it. The current bugs are:
- Some alignments don't center right
- Missing residues cause problems
- Memory leaks galore, I'm sure
The code consists of:
Also, I provide the option of aligning based solely upon RMSD or upon the better CE-Score. See the References for information on the CE Score.
Post your problems/solutions here.
LinAlg Module Not Found
Problem: Running CE Align gives the following error message:
run qkabsch.py Traceback (most recent call last): File "/usr/lib/python2.4/site-packages/pymol/parser.py", line 285, in parse parsing.run_file(exp_path(args[nest]),pymol_names,pymol_names) File "/usr/lib/python2.4/site-packages/pymol/parsing.py", line 407, in run_file execfile(file,global_ns,local_ns) File "qkabsch.py", line 86, in ? import numpy File "/usr/lib/python2.4/site-packages/numpy/__init__.py", line 40, in ? import linalg ImportError: No module named linalg
Solution: You do not have the linear algebra module installed (or Python can't find it) on your machine. One workaround is to install Scientific Python. Another is to reinstall the Numpy package from source, ensuring that you have the necessary requirements for the linear algebra module (linpack, lapack, fft, etc.).
NumPy Module Not Found
Problem: Running CE Align gives the following error message:
run qkabsch.py Traceback (most recent call last): File "/home/local/warren/MacPyMOL060530/build/Deployment/MacPyMOL.app/pymol/modules/pymol/parser.py", line 297, in parse File "/home/local/warren/MacPyMOL060530/build/Deployment/MacPyMOL.app/pymol/modules/pymol/parsing.py", line 408, in run_file File "qkabsch.py", line 86, in ? import numpy ImportError: No module named numpy
Solution: This problem occurs under Apple Mac OS X if (a) the Apple's python executable on your machine (/usr/bin/python, currently version 2.3.5) is superseded by Fink's python executable (/sw/bin/python, currently version 2.5) and (b) you are using precompiled versions of PyMOL (MacPyMOL, PyMOLX11Hybrid or PyMOL for Mac OS X/X11). These executables ignore Fink's python and instead use Apple's - so, in order to run CE Align, one must install NumPy (as well as CE Align itself) using Apple's python. To do so, first download the Numpy source code archive (currently version 1.0.1), unpack it, change directory to numpy-1.0.1 and specify the full path to Apple's python executable during installation: sudo /usr/bin/python setup.py install | tee install.log. Then, donwload the CE Align source code archive (currently version 0.2), unpack it, change directory to cealign-0.2 and finally install CE Align as follows: sudo /usr/bin/python setup.py install | tee install.log. Luca Jovine 05:11, 25 January 2007 (CST).
Text taken from PubMed and formatted for the wiki. The first reference is the most important for this code.
- Shindyalov IN, Bourne PE. Protein structure alignment by incremental combinatorial extension (CE) of the optimal path. Protein Eng. 1998 Sep;11(9):739-47. PMID: 9796821 [PubMed - indexed for MEDLINE]
- Jia Y, Dewey TG, Shindyalov IN, Bourne PE. A new scoring function and associated statistical significance for structure alignment by CE. J Comput Biol. 2004;11(5):787-99. PMID: 15700402 [PubMed - indexed for MEDLINE]
- Pekurovsky D, Shindyalov IN, Bourne PE. A case study of high-throughput biological data processing on parallel platforms. Bioinformatics. 2004 Aug 12;20(12):1940-7. Epub 2004 Mar 25. PMID: 15044237 [PubMed - indexed for MEDLINE]
- Shindyalov IN, Bourne PE. An alternative view of protein fold space. Proteins. 2000 Feb 15;38(3):247-60. PMID: 10713986 [PubMed - indexed for MEDLINE]
The CEAlign and all its subprograms that I wrote, are released under the open source Free BSD License (BSDL).