Computational Molecular Biology: An Introduction
Peter Clote and Rolf Backofen, John Wiley & Sons, Inc.
Source Code for Selected Algorithms
Unless otherwise specified, all programs were written by P. Clote,
or in some cases by students working with P. Clote.
Most of the programs
are written C, though some are in Java and C++.
Table of Contents of Computational Molecular Biology:
An Introduction is
here
Typographic, etc. corrections to Computational Molecular Biology:
An Introduction are here .
Rasmol files
 BetaStrand (with Rasmol)
 ccp.ras
Rasmol script written by Rolf Backofen, to view the
face centered cubic (also called cubic close packed) 3dimensional
lattice, where every lattice point has 12 nearest neighbors.
 angles
Description of angles involved in the FCC (face centered cubic),
or equivalently ccp (cubic close packed) lattice.
 Data
Some postscript files of proteins, and some tRNA sequences.
Protein data comes from Brookhaven Protein Database (PDB) and
sample tRNA sequences come from TIGR.
 READMEgnuplot
How to plot data with gnuplot and capture the output in a postscript
file.
Simulations
 evolution.c
Implementation of Dawkin's program, described in
his book "The Blind Watchmaker", to
simulate evolution of strings and count the number of
generations necessary to converge to a given target string.
Start with some string s, and form a number of
mutant copies of s as follows. Each character of s is changed,
with probability r, to a randomly chosen character from the
alphabet of size A.
From the copies of s, choose one that agrees with
the target t in the greatest number of positions, and replace s
by this string. The process stops when s is the same as t.
Erich Bach gave an interesting probabilistic analysis of
Dawkin's program.
Back to Table of Contents
Probability

histoneMutation.c
It is estimated that the nucleotide
substitution rate lambda per site per year
for nuclear DNA of higher primates is 1.3 x 10^9.
Histone H4 consists of 105 amino acids, hence
315 nucleotides.
Assuming that nucleotide substitutions occur uniformly
across the 315 sites of histone H4,
what is the least number of nucleotide substitutions, where
with probability at least 0.5 a substitution has occurred
at the same site?
Using lambda, estimate the amount of time which has elapsed
for this to occur.

funccmp.c
Finds first point where asymptotically f dominates g.
Assists in getting a feel for "bigO" notation.

normalDist.c
Illustrates how to generate a sequence of real values according
to a normal distribution with certain mean and variance.

uniformDist.c
Illustrates how to generate a sequence of pseudorandom integers
using the linear congruential method, and how to measure the
expectation and variance of the values produced.
Random Walks

randomWalk.c
It is an open problem to characterize the distribution of the
root mean square deviation of normed distance matrix of a random
selfavoiding walk from normed distance matrix of a fixed protein
as given by PDB data. (Selfavoiding walks on face centered cubic
lattice in 3 dimensions.) This distribution resembles the easier problem
of generating 2 random walks in 2 dimensions on a square lattice, where
each coordinate increment DeltaX and DeltaY are chosen uniformly from
0,1,1. This program generates such walks.

randomWalkDistance.c
Generates data on average distance between 2 random walks on 2 dimensional
lattice.

randomWalkDistanceRSMD.c
Additionally generates data on root mean square deviation between
2 random walks on 2 dimensional lattice.
Expected RMSD between random coils on cubic lattice
Random (optionally selfavoiding) walks on 3dimensional cubic lattice.

README
Sebastian Will's README file explaining how to compile
and run the program to generate walks on 3dim cubic lattice.

Makefile
Sebastian Will's Makefile to compile the programs walks and hist.

hist.c
Sebastian Will's program to produce histogram data,
which can be used with gnuplot to produce graphics for a distribution.

walks.c
Sebastian Will's program to produce many random walks on
the 3dimensional cubic lattice.
Back to Table of Contents
Combinatorial Optimization
Elementary methods Here we illustrate
basic concepts with simple programs for gradient descent,
MonteCarlo with simulated annealing, genetic algoritms,
and Hopfield nets.

gradientDescent.c
Gradient Descent to find maximum likelihood probability
(p = 0.5) of 10 heads in observation sequence of 20 coin flips.
Illustrates dependence of gradient descent on initial parameters.

MonteCarlo.java
A Java application implementing MonteCarlo with simulated
annealing.

genetic.c
Genetic algorithm to determine the maximum of a function.
Uses somewhat slick <<,>> operations for efficiency.
Written by Rich Fiorito, modified by Peter Clote.

Genetic.java
Translation of previous program into Java by
Peter Clote.

function.ps.gz
Gzipped ps file of a difficult function for which the
maximum is found by the previous genetic algorithms.

hopfield.c
Implementation of Hopfield network for traveling salesman
problem.

grid.c
Peter Clote showed that under certain conditions, a
particular upper bound for the expected runtime of MonteCarlo
simulation program for protein folding should decrease
if there is a larger gap between the lowest and second
lowest energy levels, an observation first reported by
Sali, Shakhnovich and Karplus in an article in Nature 1994.
The current program attempts to calibrate Alistair
Sinclair's notion of PHI (conductance) with other
parameters.
MonteCarlo simulated annealing
This program, written by S. Schoenauer and P. Clote,
uses MonteCarlo with simulated annealing to try to find
optimal (artificial) genetic codes.
 README
 Makefile
 general.h
 general.c
 montecarlo.c
 msd.mat
Mean square difference of Woese polar requirements
 wac.mat
WAC matrix  see work of Altman
 run
csh script
Lagrange multipliers and optimality of genetic code
This program, written by P. Clote and R. Backofen,
uses the method of Lagrange multipliers to estimate how optimized
the standard genetic code is, an approach first taken by DiGiulio.
Some of the code was adapted from the Monte Carlo program by
S. Schoenauer and P. Clote.
 Makefile
 general.h
 general.c
 digiulio.c
 createPolarReq.c
This creates a distance matrix, whose i,jth entry is the
square of the square of the
difference between polar requirement of the ith
amino acid and polar requirement of the jth amino acid
(X_i  X_j)^2.
 pol.vec
Solving linear equations using LEDA
 Makefile
 solve.cc
 EXAMPLE
 M
 B
Genetic Algorithm for optimality of genetic code
This Java program, written by Alessandro Macri, vastly outperforms the
previous program of SchoenauerClote in finding optimal genetic codes.
 README
 Results
Fixed Stops
 Results
 Bio.java
 Evolution.java
 Gen.java
 Aminos.java
 no_ala_ser.amino
Back to Table of Contents
Entropy

freqDNA.c
This computes the mono and dinucleotide frequences of a genome
file, assuming only nucleotides A,C,G,T.

freqAllDNA.c
This program allows codes for incomplete data
(e.g. Y for purine, U for purine, N for undetermined, etc.)

entropy.c
This computes the mono and dinucleotide entropy of a genome
file, assuming nucleotides A,C,G,T,N,S,K,W,R,Y (which appear in
M. jannaschii genome). A complicated exact computation is
made  for instance, given incomplete data such as R (purine),
a weighted contribution of 0.5 to A and 0.5 to G is made. It
would have been much easier, and since incomplete data is rare,
better to use a random number generator to convert incomplete data
to A,C,G,T first.

entropyOutputMjannaschii
Output of entropy.c on M. jannaschii genome fragment.

entropyPlot0.c
Program to compute entropy plot of nucleotide and dinucleotide
entropies of window contents of a genome (length of window set
in a constant) and to shift the window by one position in the genome,
etc. Thus one obtains an entropy plot of windows of the genome as
a function of position.

entropyPlot1.c
Program to perform same computation as previous program, but
uses command line argument for window size and handles inaccurate
data (e.g. N,R,Y, etc.).

nuclEntropyMjannaschii
Output of entropyPlot1.c on M. jannaschii, where
nucleotide entropy of window of length 100 is computed, and then
window is advanced one position in the genome.

dinuclEntropyMjannaschii
Output of entropyPlot1.c on M. jannaschii, where
nucleotide entropy of window of length 100 is computed, and then
window is advanced one position in the genome.

nuclEntropyMjannaschii.ps
Postscript output of gnuplot applied to nuclEntropyMjannaschii

dinuclEntropyMjannaschii.ps
Postscript output of gnuplot applied to dinuclEntropyMjannaschii

bacterialEntropyData
Mononucleotide entropy, dinucleotide entropy and relative abundancy
measures of certain bacterial genomes, generated by S. Will.
Back to Table of Contents
JensenShannon divergence

jensenShannon.c
This computes the JensenShannon divergence
D(U,V) = H(W)  k/n*H(U)  (nk)/n*H(V)
for a segmentation of the whole genome W (for instance of
M. jannaschii) into left segment U and right segment V.
For each position i between 0 and genome length n, a putative
segmentation is made at position i. In a segmentation algorithm,
one would recursively make such segmentations, in each case segmenting
at a position of maximum divergence.

generateRandomNuclSeq.c
This simple program generates a random DNA nucleotide sequence. The
previous program can then be run on the output of this program to
compare divergence plots with real genomic data. As you see after
running these programs, for generating random nucleotide sequences,
it would be far better to build your own (linear congruential)
random number generator, since the divergence plot of the random
data generated from this program shows strong correlations.

divergenceMJannaschiiOutput
Output of jensenShannon.c on M. jannaschii
genome, where stepsize was 100 (i.e. allow segmentations only
every 100 nucleotides).

divergenceMJannaschii.ps
Postscript file of a graphical file produced using gnuplot on
the output of jensenShannon.c on M. jannaschii.
Back to Table of Contents
Global Sequence Alignment

dotter0.c
This program outputs a dot plot for 2 sequences, one of the first
approaches made by biologists towards identifying local
and global alignments between 2 sequences, where diagonals
were identified by eye. The output of this program is
a textfile of coordinates of matches. Then using gnuplot,
one can plot the output using the command plot "out" with points
assuming that the output file of coordinates is called out.

dotter.c
This program outputs a dot plot directly for 2 sequences, without using
gnuplot.

A robust, professional version of the dot plotter is known as
DOTTER and can be obtained from
Karolinska Institutet.

ptwiseMut0.c
This program, when given a nucleotide sequence, generates
pointwise mutations with certain probabilities (insertion,
deletion, substitution), and keeps track of which position
was mutated, so as to produce the "correct" alignment, given
the history. The historically correct alignment can then be
compared with the results from sequence alignment algorithms.

ptwiseMut1.c
Improvement.

ptwiseMut2.c
Improvement.

ptwiseMutInf.c
Infinite loop, allowing user to exit by entering
controlC.

gotoh.c
This is an implementation of Gotoh's O(n^2) algorithm for
global sequence alignment, assuming an AFFINE gap penalty
G(i) = u+v*i. This implementation requires
the first and last letters of each sequence to be identical.
Thanks to Tomas LozanoPerez for fixing a bug concerning the
traceback.

multipleGotoh.c
The program multipleGotoh.c
produces an n x n distance matrix, given n different nucleotide sequences.
The output of multipleGotoh.c is then used as input
for the phylogeny programs, described later. Only the distance
matrices are computed  no alignment using tracebacks is made.

smithWaterman.c
This is an implementation of both global (NeedlemanWunsch)
and local (SmithWaterman) sequence alignment using
similarity matrices, assuming
a LINEAR gap penalty G(i) = i*Delta. This program uses tracebacks and the
runtime is O(n^2).
Dynamic Programming Sequence Alignment Package
Sequence alignment algorithms based on dynamic programming share a
high degree of similarity. Thus, one can write an environment for
these algorithms that allows to implement new alignment algorithms
within very short time. All algorithms do a back trace through a
trace matrix to calculate the alignment and visualize the computed
trace. The implementation is done in C++. The package is written
by Rolf Backofen and Sebastian Will.
The following alignment algorithms are implemented.

NeedelmanWunsch. Alignment
in O(n^2) for linear gap cost, global alignment implemented.

NeedlmanWunsch
Local. Local NeedlemanWunsch Alignment algorithm.

SmithWaterman. Aligment in O(n^3)
for arbitrary gap cost functions, the program implements
affine and logarithmic gap costs, is similarity based and
allows to calculate global or local alignments.

Gotoh. Alignment in O(n^2) with affine
gap cost, program does global alignment.

NeedlmanWunsch
Wraparound. NeedlemanWunsch Alignment algorithm with
Wraparound for alignment with tandem repeats.
The library consists of the following files.
 basicdef.hh
 matrices.hh
 printarray.hh
 trace.hh
 tracearrow.hh
 basicdef.cc
 matrices.cc
 printarray.cc
 trace.cc
 tracearrow.cc
An example makefile to produce the Waterman alignment alorithm is
Clustering algorithms (UPGMA)
The following suite of programs implements variants of
the UPGMA (unweighted and weighted form, Farris transformed
distance) clustering algorithm for phylogeny trees. The most
inclusive and most efficient version is phy5.c . Start
by reading README . The program multipleGotoh.c
produces an nxn distance matrix, given n different nucleotide sequences.
The output of multipleGotoh.c is then used as input
for the phylogeny programs. The program phy0.c
is rather inefficient, since it uses linked lists and pointers
for performing clustering. Nevertheless, it is illustrative.

README
README file

multipleGotoh.c
Produces nxn distance matrix for n nucleotide sequences.

phy0.c

phy1.c

phy2.c

phy3.c

phy4.c

phy5.c
Back to Table of Contents
Generating Observables and HMM

markov.c
This program allows a user to
to iteratively compute the powers of a given matrix,
thus computing the stationary probabilities for the
underlying Markov chain.

README
Explanation of how to use the programs to generate a sequence
of observables according to certain initial, emission and
transmission probabilities.

genObserve0.c
Generates a sequence of observables according to certain initial,
emission and transmission probabilities.

genObserve.c
Improved version of last program.

p
Initial probabilities given as a table (PI).

e
Emission probabilities given as a table.

t
Transmission probabilities given as a table.

hmm0.c
Ralph Matthes' implementation of BaumWelch and Viterbi for
hidden Markov models.

hmm1.c
Improvement.

hmm2.c
Improvement.
Back to Table of Contents
RNA secondary structure

enumRNAsecStr.c
Enumeration of all possible RNA secondary structures,
considered as wellbalanced parenthesis expressions.

rnaCombinatorics.c
Simple recursive program to compute
the number of secondary structures on
1,...,n without any condition on base pairing. Namely
the program computes the function S(n), where
S(1)=1=S(2)
S(n+1)=S(n) + sum_{k=0}^{n2} S(k)*S(nk1)

rnaCombinatoricsNonRec.c
O(n^2) algorithm for previous problem using auxilliary array.
In previous algorithm, one cannot even compute S(40), which is
immediately done by the present algorithm.

rnaMaxNumBasePairs.c
Implementation of the NussinovJacobson
algorithm, using dynamic programming and backtracking,
to determine the optimal RNA secondary structure by
finding the structure with maximum number of base pairs.
Here base pair means WatsonCrick or GU base pair
(this can be varied in the code).

rnaSecStr.c
Secondary structure prediction algorithm using NussinovJacobson
algorithm. This implementation is more modular, has a threshold
value for which ji >= THRESHOLD is required for there to be
a base pairing between i,j, and has different energy values for
AU, GC and GU bonds.
Protein Folding
 ungerMoult.c
Matthias Schubert and Dominik Buchetmann implemented a genetic
algorithm (with hybrid Metropolis steps), following UngerMoult,
to fold a protein on a 2 dimensional lattice to maximize
the number of HH contacts at unit distance.
 SipplScore
Program written by Peter Clote and Sebastian Will to
adapt Sippl's method of amino acid pair potentials to
a window scoring method for finding motifs similar to those
from a given nucleotide sequence training set.
 HydrophobicData
The following data was generated for a paper by R. Backofen, S. Will
and P. Clote on quantifying hydrophobicity as a force in protein
folding, using a database of ancient protein fragments assembled
by the laboratory of W. Gilbert.
 MoCaPro  Monte Carlo Protein Folder
Program written by Sebastian Will to fold proteins on the cubic
lattice to their optimum. This program was written to mimic a program
by Sali, Shaknovich, and Karplus used in their paper
"How does a protein fold?". MoCaPro was extended in many ways
and has improved flexibility.
 Genetic Algorithm Protein Folder
Program written by Rolf Backofen and Sebastian Will to search for
optimal folds of lattice proteins on *arbitrary* lattices by genetic
algorithms (e.g., UngerMoult style ga). The program uses
automorphism groups to achieve its generality. The program is used
and described in Rolf Backofen, Peter Clote, and Sebastian Will. Algorithmic approach to quantifying the hydrophobic force
contribution in protein folding. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, and Teri
E. Klein, editors, Pacific Symposium on Biocomputing (PSB 2000), volume 5, pages 92103,
2000.
Back to Table of Contents
Back to Table of Contents
Software

Software catalogue Biocat of European Bioinformatics
Institute.

FASTA and TFASTA.

LFASTA.

SSEARCH.

ALIGN.
 BLAST at National Center for Biotechnology
Information.
 Joseph Felsenstein's compendium of phylogeny programs.
 Don Gilbert's program Philodendron
for drawing phylogenetic trees:
http://iubio.bio.indiana.edu/soft/molbio/java/source/
Databanks

Protein Data Base (PDB)
of Brookhaven National Laboratory. Look at the
browser.
Back to Table of Contents
XXXXXX
Back to Table of Contents
Exercises 97/98
These gzipped ps files are of a set of problems given
in the tutorial section of the course Computational
Biology given at the University of Munich in
Wintersemester 199798. Ralph Matthes was responsible
for the tutorials.
WARNING
The tutorials are in German.

Directory of Exercises
Exercises 99/00
A set of problems together with solutions given
in the tutorial section of the course Computational
Biology at the University of Munich in
Wintersemester 19992000. The programs are written by Sebastian Will
unless otherwise specified.
Note that the tutorials are in German.

Programs written for the Tutorial
 Small example program to introduce C
 StarCalcSheet to calculate a distance of sequences a la Felsenstein
 A dotter variation
 Generation of random DNA demonstrating fortune wheel principle
 Gather some statistical data of a DNA sequence
 Modification of Peter Clote's phy.c to handle the special case of quartet trees
 Modification of Peter Clote's pointwise mutation program (which computes historical alignments) to be more instructive. Translated to C++ using STL.
 Program that shows another approach to the historical alignment problem of the previous program

Directory of Exercises

Solutions to Exercises
 Exercise 2: Historical Alignment
 Exercise 3: Compute local alignment of two sequences
 Exercise 4: Estimate length of guide RNA (simulates markov process)
 Exercise 5: Approximate a continuous Markov process by a discrete Markov chain
 Exercise 6: Quartet Puzzling a la v.Haeseler et al.
 Exercise 7:Calculate a distance of sequences a la Felsenstein
Back to Table of Contents
clote@informatik.unimuenchen.de, 19981709