Computational Molecular Biology: An Introduction

Peter Clote and Rolf Backofen, John Wiley & Sons, Inc.

Source Code for Selected Algorithms

Unless otherwise specified, all programs were written by P. Clote, or in some cases by students working with P. Clote.
Most of the programs are written C, though some are in Java and C++.
Table of Contents of Computational Molecular Biology: An Introduction is here
Typographic, etc. corrections to Computational Molecular Biology: An Introduction are here .

Table of Contents

  • Chapter 1 - Molecular Biology
  • Chapter 2 - Math Primer
  • Chapter 3 - Sequence Alignment
  • Chapter 4 - All about Eve
  • Chapter 5 - Hidden Markov Models
  • Chapter 6 - Structure prediction
  • Internet URL's in Bioinformatics
  • Bioinformatics Resources Page
  • Bibliography
  • Exercises

  • Chapter 1 - Molecular Biology

    Rasmol files

    1. Beta-Strand (with Rasmol)
    2. ccp.ras Rasmol script written by Rolf Backofen, to view the face centered cubic (also called cubic close packed) 3-dimensional lattice, where every lattice point has 12 nearest neighbors.
    3. angles Description of angles involved in the FCC (face centered cubic), or equivalently ccp (cubic close packed) lattice.
    4. Data Some postscript files of proteins, and some tRNA sequences. Protein data comes from Brookhaven Protein Database (PDB) and sample tRNA sequences come from TIGR.
    5. README-gnuplot How to plot data with gnuplot and capture the output in a postscript file.

    Simulations

    1. evolution.c Implementation of Dawkin's program, described in his book "The Blind Watchmaker", to simulate evolution of strings and count the number of generations necessary to converge to a given target string. Start with some string s, and form a number of mutant copies of s as follows. Each character of s is changed, with probability r, to a randomly chosen character from the alphabet of size A. From the copies of s, choose one that agrees with the target t in the greatest number of positions, and replace s by this string. The process stops when s is the same as t. Erich Bach gave an interesting probabilistic analysis of Dawkin's program.

    Back to Table of Contents


    Chapter 2 - Math Primer

    Probability

    1. histoneMutation.c It is estimated that the nucleotide substitution rate lambda per site per year for nuclear DNA of higher primates is 1.3 x 10^-9. Histone H4 consists of 105 amino acids, hence 315 nucleotides. Assuming that nucleotide substitutions occur uniformly across the 315 sites of histone H4, what is the least number of nucleotide substitutions, where with probability at least 0.5 a substitution has occurred at the same site? Using lambda, estimate the amount of time which has elapsed for this to occur.
    2. funccmp.c Finds first point where asymptotically f dominates g. Assists in getting a feel for "big-O" notation.
    3. normalDist.c Illustrates how to generate a sequence of real values according to a normal distribution with certain mean and variance.
    4. uniformDist.c Illustrates how to generate a sequence of pseudo-random integers using the linear congruential method, and how to measure the expectation and variance of the values produced.

    Random Walks

    1. randomWalk.c It is an open problem to characterize the distribution of the root mean square deviation of normed distance matrix of a random self-avoiding walk from normed distance matrix of a fixed protein as given by PDB data. (Self-avoiding walks on face centered cubic lattice in 3 dimensions.) This distribution resembles the easier problem of generating 2 random walks in 2 dimensions on a square lattice, where each coordinate increment DeltaX and DeltaY are chosen uniformly from 0,1,-1. This program generates such walks.
    2. randomWalkDistance.c Generates data on average distance between 2 random walks on 2 dimensional lattice.
    3. randomWalkDistanceRSMD.c Additionally generates data on root mean square deviation between 2 random walks on 2 dimensional lattice.

    Expected RMSD between random coils on cubic lattice

    Random (optionally self-avoiding) walks on 3-dimensional cubic lattice.

    1. README Sebastian Will's README file explaining how to compile and run the program to generate walks on 3-dim cubic lattice.
    2. Makefile Sebastian Will's Makefile to compile the programs walks and hist.
    3. hist.c Sebastian Will's program to produce histogram data, which can be used with gnuplot to produce graphics for a distribution.
    4. walks.c Sebastian Will's program to produce many random walks on the 3-dimensional cubic lattice.

    Back to Table of Contents


    Combinatorial Optimization

    Elementary methods Here we illustrate basic concepts with simple programs for gradient descent, Monte-Carlo with simulated annealing, genetic algoritms, and Hopfield nets.

    1. gradientDescent.c Gradient Descent to find maximum likelihood probability (p = 0.5) of 10 heads in observation sequence of 20 coin flips. Illustrates dependence of gradient descent on initial parameters.
    2. MonteCarlo.java A Java application implementing Monte-Carlo with simulated annealing.
    3. genetic.c Genetic algorithm to determine the maximum of a function. Uses somewhat slick <<,>> operations for efficiency. Written by Rich Fiorito, modified by Peter Clote.
    4. Genetic.java Translation of previous program into Java by Peter Clote.
    5. function.ps.gz Gzipped ps file of a difficult function for which the maximum is found by the previous genetic algorithms.
    6. hopfield.c Implementation of Hopfield network for traveling salesman problem.
    7. grid.c Peter Clote showed that under certain conditions, a particular upper bound for the expected runtime of Monte-Carlo simulation program for protein folding should decrease if there is a larger gap between the lowest and second lowest energy levels, an observation first reported by Sali, Shakhnovich and Karplus in an article in Nature 1994. The current program attempts to calibrate Alistair Sinclair's notion of PHI (conductance) with other parameters.

    Monte-Carlo simulated annealing This program, written by S. Schoenauer and P. Clote, uses Monte-Carlo with simulated annealing to try to find optimal (artificial) genetic codes.

    1. README
    2. Makefile
    3. general.h
    4. general.c
    5. monte-carlo.c
    6. msd.mat Mean square difference of Woese polar requirements
    7. wac.mat WAC matrix -- see work of Altman
    8. run csh script

    Lagrange multipliers and optimality of genetic code This program, written by P. Clote and R. Backofen, uses the method of Lagrange multipliers to estimate how optimized the standard genetic code is, an approach first taken by DiGiulio. Some of the code was adapted from the Monte Carlo program by S. Schoenauer and P. Clote.

    1. Makefile
    2. general.h
    3. general.c
    4. digiulio.c
    5. createPolarReq.c This creates a distance matrix, whose i,j-th entry is the square of the square of the difference between polar requirement of the i-th amino acid and polar requirement of the j-th amino acid (X_i - X_j)^2.
    6. pol.vec

      Solving linear equations using LEDA

    7. Makefile
    8. solve.cc
    9. EXAMPLE
    10. M
    11. B

    Genetic Algorithm for optimality of genetic code This Java program, written by Alessandro Macri, vastly outperforms the previous program of Schoenauer-Clote in finding optimal genetic codes.

    1. README
    2. Results

    Fixed Stops

    1. Results
    2. Bio.java
    3. Evolution.java
    4. Gen.java
    5. Aminos.java
    6. no_ala_ser.amino

    Back to Table of Contents


    Entropy

    1. freqDNA.c This computes the mono- and dinucleotide frequences of a genome file, assuming only nucleotides A,C,G,T.
    2. freqAllDNA.c This program allows codes for incomplete data (e.g. Y for purine, U for purine, N for undetermined, etc.)
    3. entropy.c This computes the mono- and dinucleotide entropy of a genome file, assuming nucleotides A,C,G,T,N,S,K,W,R,Y (which appear in M. jannaschii genome). A complicated exact computation is made -- for instance, given incomplete data such as R (purine), a weighted contribution of 0.5 to A and 0.5 to G is made. It would have been much easier, and since incomplete data is rare, better to use a random number generator to convert incomplete data to A,C,G,T first.
    4. entropyOutputMjannaschii Output of entropy.c on M. jannaschii genome fragment.
    5. entropyPlot0.c Program to compute entropy plot of nucleotide and dinucleotide entropies of window contents of a genome (length of window set in a constant) and to shift the window by one position in the genome, etc. Thus one obtains an entropy plot of windows of the genome as a function of position.
    6. entropyPlot1.c Program to perform same computation as previous program, but uses command line argument for window size and handles inaccurate data (e.g. N,R,Y, etc.).
    7. nuclEntropyMjannaschii Output of entropyPlot1.c on M. jannaschii, where nucleotide entropy of window of length 100 is computed, and then window is advanced one position in the genome.
    8. dinuclEntropyMjannaschii Output of entropyPlot1.c on M. jannaschii, where nucleotide entropy of window of length 100 is computed, and then window is advanced one position in the genome.
    9. nuclEntropyMjannaschii.ps Postscript output of gnuplot applied to nuclEntropyMjannaschii
    10. dinuclEntropyMjannaschii.ps Postscript output of gnuplot applied to dinuclEntropyMjannaschii
    11. bacterialEntropyData Mononucleotide entropy, dinucleotide entropy and relative abundancy measures of certain bacterial genomes, generated by S. Will.

    Back to Table of Contents


    Jensen-Shannon divergence

    1. jensenShannon.c This computes the Jensen-Shannon divergence D(U,V) = H(W) - k/n*H(U) - (n-k)/n*H(V) for a segmentation of the whole genome W (for instance of M. jannaschii) into left segment U and right segment V. For each position i between 0 and genome length n, a putative segmentation is made at position i. In a segmentation algorithm, one would recursively make such segmentations, in each case segmenting at a position of maximum divergence.
    2. generateRandomNuclSeq.c This simple program generates a random DNA nucleotide sequence. The previous program can then be run on the output of this program to compare divergence plots with real genomic data. As you see after running these programs, for generating random nucleotide sequences, it would be far better to build your own (linear congruential) random number generator, since the divergence plot of the random data generated from this program shows strong correlations.
    3. divergenceMJannaschiiOutput Output of jensenShannon.c on M. jannaschii genome, where stepsize was 100 (i.e. allow segmentations only every 100 nucleotides).
    4. divergenceMJannaschii.ps Postscript file of a graphical file produced using gnuplot on the output of jensenShannon.c on M. jannaschii.

    Back to Table of Contents


    Chapter 3 - Sequence Alignment

    Global Sequence Alignment

    1. dotter0.c This program outputs a dot plot for 2 sequences, one of the first approaches made by biologists towards identifying local and global alignments between 2 sequences, where diagonals were identified by eye. The output of this program is a textfile of coordinates of matches. Then using gnuplot, one can plot the output using the command plot "out" with points assuming that the output file of coordinates is called out.
    2. dotter.c This program outputs a dot plot directly for 2 sequences, without using gnuplot.
    3. A robust, professional version of the dot plotter is known as DOTTER and can be obtained from Karolinska Institutet.
    4. ptwiseMut0.c This program, when given a nucleotide sequence, generates pointwise mutations with certain probabilities (insertion, deletion, substitution), and keeps track of which position was mutated, so as to produce the "correct" alignment, given the history. The historically correct alignment can then be compared with the results from sequence alignment algorithms.
    5. ptwiseMut1.c Improvement.
    6. ptwiseMut2.c Improvement.
    7. ptwiseMutInf.c Infinite loop, allowing user to exit by entering control-C.
    8. gotoh.c This is an implementation of Gotoh's O(n^2) algorithm for global sequence alignment, assuming an AFFINE gap penalty G(i) = u+v*i. This implementation requires the first and last letters of each sequence to be identical. Thanks to Tomas Lozano-Perez for fixing a bug concerning the traceback.
    9. multipleGotoh.c The program multipleGotoh.c produces an n x n distance matrix, given n different nucleotide sequences. The output of multipleGotoh.c is then used as input for the phylogeny programs, described later. Only the distance matrices are computed -- no alignment using tracebacks is made.
    10. smithWaterman.c This is an implementation of both global (Needleman-Wunsch) and local (Smith-Waterman) sequence alignment using similarity matrices, assuming a LINEAR gap penalty G(i) = i*Delta. This program uses tracebacks and the runtime is O(n^2).

    Dynamic Programming Sequence Alignment Package

    Sequence alignment algorithms based on dynamic programming share a high degree of similarity. Thus, one can write an environment for these algorithms that allows to implement new alignment algorithms within very short time. All algorithms do a back trace through a trace matrix to calculate the alignment and visualize the computed trace. The implementation is done in C++. The package is written by Rolf Backofen and Sebastian Will.
    The following alignment algorithms are implemented.

    1. Needelman-Wunsch. Alignment in O(n^2) for linear gap cost, global alignment implemented.
    2. Needlman-Wunsch Local. Local Needleman-Wunsch Alignment algorithm.
    3. Smith-Waterman. Aligment in O(n^3) for arbitrary gap cost functions, the program implements affine and logarithmic gap costs, is similarity based and allows to calculate global or local alignments.
    4. Gotoh. Alignment in O(n^2) with affine gap cost, program does global alignment.
    5. Needlman-Wunsch Wraparound. Needleman-Wunsch Alignment algorithm with Wraparound for alignment with tandem repeats.
    The library consists of the following files.
    1. basicdef.hh
    2. matrices.hh
    3. printarray.hh
    4. trace.hh
    5. tracearrow.hh
    6. basicdef.cc
    7. matrices.cc
    8. printarray.cc
    9. trace.cc
    10. tracearrow.cc
    An example makefile to produce the Waterman alignment alorithm is

    Chapter 4 - All about Eve

    Clustering algorithms (UPGMA) The following suite of programs implements variants of the UPGMA (unweighted and weighted form, Farris transformed distance) clustering algorithm for phylogeny trees. The most inclusive and most efficient version is phy5.c . Start by reading README . The program multipleGotoh.c produces an nxn distance matrix, given n different nucleotide sequences. The output of multipleGotoh.c is then used as input for the phylogeny programs. The program phy0.c is rather inefficient, since it uses linked lists and pointers for performing clustering. Nevertheless, it is illustrative.

    1. README README file
    2. multipleGotoh.c Produces nxn distance matrix for n nucleotide sequences.
    3. phy0.c
    4. phy1.c
    5. phy2.c
    6. phy3.c
    7. phy4.c
    8. phy5.c

    Back to Table of Contents


    Chapter 5 - Hidden Markov Models

    Generating Observables and HMM

    1. markov.c This program allows a user to to iteratively compute the powers of a given matrix, thus computing the stationary probabilities for the underlying Markov chain.
    2. README Explanation of how to use the programs to generate a sequence of observables according to certain initial, emission and transmission probabilities.
    3. genObserve-0.c Generates a sequence of observables according to certain initial, emission and transmission probabilities.
    4. genObserve.c Improved version of last program.
    5. p Initial probabilities given as a table (PI).
    6. e Emission probabilities given as a table.
    7. t Transmission probabilities given as a table.
    8. hmm0.c Ralph Matthes' implementation of Baum-Welch and Viterbi for hidden Markov models.
    9. hmm1.c Improvement.
    10. hmm2.c Improvement.

    Back to Table of Contents


    Chapter 6 - Structure prediction

    RNA secondary structure

    1. enumRNAsecStr.c Enumeration of all possible RNA secondary structures, considered as well-balanced parenthesis expressions.
    2. rnaCombinatorics.c Simple recursive program to compute the number of secondary structures on 1,...,n without any condition on base pairing. Namely the program computes the function S(n), where

      S(1)=1=S(2)
      

      S(n+1)=S(n) + sum_{k=0}^{n-2} S(k)*S(n-k-1)

    3. rnaCombinatoricsNonRec.c O(n^2) algorithm for previous problem using auxilliary array. In previous algorithm, one cannot even compute S(40), which is immediately done by the present algorithm.

    4. rnaMaxNumBasePairs.c Implementation of the Nussinov-Jacobson algorithm, using dynamic programming and backtracking, to determine the optimal RNA secondary structure by finding the structure with maximum number of base pairs. Here base pair means Watson-Crick or GU base pair (this can be varied in the code).

    5. rnaSecStr.c Secondary structure prediction algorithm using Nussinov-Jacobson algorithm. This implementation is more modular, has a threshold value for which |j-i| >= THRESHOLD is required for there to be a base pairing between i,j, and has different energy values for AU, GC and GU bonds.

    Protein Folding

    1. ungerMoult.c Matthias Schubert and Dominik Buchetmann implemented a genetic algorithm (with hybrid Metropolis steps), following Unger-Moult, to fold a protein on a 2 dimensional lattice to maximize the number of H-H contacts at unit distance.
    2. SipplScore Program written by Peter Clote and Sebastian Will to adapt Sippl's method of amino acid pair potentials to a window scoring method for finding motifs similar to those from a given nucleotide sequence training set.
    3. HydrophobicData The following data was generated for a paper by R. Backofen, S. Will and P. Clote on quantifying hydrophobicity as a force in protein folding, using a database of ancient protein fragments assembled by the laboratory of W. Gilbert.
    4. MoCaPro - Monte Carlo Protein Folder Program written by Sebastian Will to fold proteins on the cubic lattice to their optimum. This program was written to mimic a program by Sali, Shaknovich, and Karplus used in their paper "How does a protein fold?". MoCaPro was extended in many ways and has improved flexibility.
    5. Genetic Algorithm Protein Folder Program written by Rolf Backofen and Sebastian Will to search for optimal folds of lattice proteins on *arbitrary* lattices by genetic algorithms (e.g., Unger-Moult style ga). The program uses automorphism groups to achieve its generality. The program is used and described in Rolf Backofen, Peter Clote, and Sebastian Will. Algorithmic approach to quantifying the hydrophobic force contribution in protein folding. In Russ B. Altman, A. Keith Dunker, Lawrence Hunter, and Teri E. Klein, editors, Pacific Symposium on Biocomputing (PSB 2000), volume 5, pages 92--103, 2000.

    Back to Table of Contents


    Back to Table of Contents


    Internet URL's in Bioinformatics

    Software

    1. Software catalogue Biocat of European Bioinformatics Institute.
    2. FASTA and TFASTA.
    3. LFASTA.
    4. SSEARCH.
    5. ALIGN.
    6. BLAST at National Center for Biotechnology Information.
    7. Joseph Felsenstein's compendium of phylogeny programs.
    8. Don Gilbert's program Philodendron for drawing phylogenetic trees: http://iubio.bio.indiana.edu/soft/molbio/java/source/

    Databanks

    1. Protein Data Base (PDB) of Brookhaven National Laboratory. Look at the browser.

    Back to Table of Contents


    Bibliography

    XXXXXX

    Back to Table of Contents


    Exercises to a computational biology course at the University of Munich

    Exercises 97/98

    These gzipped ps files are of a set of problems given in the tutorial section of the course Computational Biology given at the University of Munich in Wintersemester 1997-98. Ralph Matthes was responsible for the tutorials.
    WARNING The tutorials are in German.

    1. Directory of Exercises

    Exercises 99/00

    A set of problems together with solutions given in the tutorial section of the course Computational Biology at the University of Munich in Wintersemester 1999-2000. The programs are written by Sebastian Will unless otherwise specified.
    Note that the tutorials are in German.

    1. Programs written for the Tutorial
      1. Small example program to introduce C
      2. StarCalc-Sheet to calculate a distance of sequences a la Felsenstein
      3. A dotter variation
      4. Generation of random DNA demonstrating fortune wheel principle
      5. Gather some statistical data of a DNA sequence
      6. Modification of Peter Clote's phy.c to handle the special case of quartet trees
      7. Modification of Peter Clote's pointwise mutation program (which computes historical alignments) to be more instructive. Translated to C++ using STL.
      8. Program that shows another approach to the historical alignment problem of the previous program
    2. Directory of Exercises
    3. Solutions to Exercises
      1. Exercise 2: Historical Alignment
      2. Exercise 3: Compute local alignment of two sequences
      3. Exercise 4: Estimate length of guide RNA (simulates markov process)
      4. Exercise 5: Approximate a continuous Markov process by a discrete Markov chain
      5. Exercise 6: Quartet Puzzling a la v.Haeseler et al.
      6. Exercise 7:Calculate a distance of sequences a la Felsenstein

    Back to Table of Contents


    clote@informatik.uni-muenchen.de, 1998-17-09