Computational Molecular Biology: An Introduction

Peter Clote and Rolf Backofen

John Wiley & Sons, Ltd., August 2000
ISBN 0-471-87251-2 (hardback), ISBN 0-471-87252-0 (paperback)

The text is based on the authors' courses given every semester at the University of Munich since Fall 1996. (Peter Clote previously held the Gerhard-Gentzen Chair of Theoretical Computer Science at the University of Munich.)

This text is used in Fall 2000-01 in the course 18.417 Computational Molecular Biology, given by Peter Clote, as Visiting Professor in the Dept of Mathematics, M.I.T.


A list of corrections is given in the file Errata .

Key features:

  1. Web site with C source code for prototype implementations of algorithms.
  2. Detailed primer on biology and mathematics, for largely self-contained introduction to mainstream bioinformatics:
  3. Various algorithms either not covered or with different coverage in current texts, such as:
====================

Chapter 1 -- Molecular Biology
   1.1 Some Organic Chemistry.................................3
   1.2 Small Molecules........................................4
   1.3 Sugars.................................................6
   1.4 Nucleic Acids..........................................6
      1.4.1 Nucleotides.......................................6
      1.4.2 DNA...............................................8
      1.4.3 RNA..............................................13
   1.5 Proteins..............................................14
      1.5.1 Amino Acids......................................14
      1.5.2 Protein Structure................................15
   1.6 From DNA to Proteins..................................17
      1.6.1 Amino Acids and Proteins.........................17
      1.6.2 Transcription and Translation....................19
   1.7 Exercises.............................................21
   Acknowledgments and References............................22


Chapter 2 -- Math Primer
   2.1 Probability...........................................23
      2.1.1 Random Variables.................................25
      2.1.2 Some Important Probability Distributions.........27
         Binomial Distribution
         Geometric Distribution
         Poisson Distribution
         Normal Distribution
         Hypergeometric Distribution
         Exponential Distribution
         Boltzmann Distribution
      2.1.3 Markov Chains.................................. .38
      2.1.4 Metropolis--Hastings Algorithm...................43
            Annealing Schedule...............................46
      2.1.5 Markov Random Fields and Gibbs Sampler...........47
      2.1.6 Maximum Likelihood...............................52
   2.2 Combinatorial Optimization............................53
      2.2.1 Lagrange Multipliers.............................53
      2.2.2 Gradient Descent.................................54
      2.2.3 Heuristics Related to Simulated Annealing........54
      2.2.4 Applications of Monte--Carlo.....................55
         Optimality of the Genetic Code......................55
      2.2.5 Genetic Algorithms...............................60
   2.3 Entropy and Applications to Molecular Biology.........61
      2.3.1 Information Theoretic Entropy....................62
      2.3.2 Shannon Implies Boltzmann........................63
      2.3.3 Simple Statistical Genomic Analysis..............66
      2.3.4 Genomic Segmentation Algorithm...................69
   2.4 Exercises.............................................72
   2.5 Appendix: Modification of Bezout's Lemma..............77
   Acknowledgments and References............................79


Chapter 3 -- Sequence Alignment
   3.1 Motivating Example....................................83
   3.2 Scoring Matrices......................................84
      Score Matrices Based on a Statistical Model............85
      PAM and Amino Acid Pair Probabilities..................86
   3.3 Global Pairwise Sequence Alignment....................88
      3.3.1 Distance Methods.................................88
      3.3.2 Alignment with Tandem Duplication................99
      3.3.3 Similarity Methods..............................110
   3.4 Multiple Sequence Alignment..........................111
      3.4.1 Dynamic Programming.............................112
      3.4.2 Gibbs Sampler...................................112
      3.4.3 Maximum-Weight Trace............................114
      3.4.4 Hidden Markov Models............................117
      3.4.5 Steiner Sequences...............................117
   3.5 Genomic Rearrangements...............................118
   3.6 Locating Cryptogenes and Guide RNA...................120
      3.6.1 Anchor and Periodicity Rules....................122
      3.6.2 Search for Cryptogenes..........................122
   3.7 Expected Length of gRNA in Trypanosomes..............123
   3.8 Exercises............................................128
   3.9 Appendix: Maximum-Likelihood ........................132
   Acknowledgements and References..........................133


Chapter 4 -- All About EVE
   4.1 Introduction.........................................135
   4.2 Rate of Evolutionary Change..........................137
      4.2.1 Amino Acid Sequences............................137
      4.2.2 Nucleotide Sequences............................139
   4.3 Clustering Methods...................................144
      4.3.1 Ultrametric Trees...............................147
      4.3.2 Additive Metric.................................152
      4.3.3 Estimating Branch Lengths.......................156
   4.4 Maximum Likelihood...................................157
      4.4.1 Likelihood of a Tree............................159
      4.4.2 Recursive Definition for the Likelihood.........160
      4.4.3 Optimal Branch Lengths for Fixed Topology.......162
      4.4.4 Determining the Topology........................166
   4.5 Quartet Puzzling.....................................166
      4.5.1 Quartet Puzzling Step...........................169
      4.5.2 Majority Consensus Tree.........................170
   4.6 Exercises............................................171
   Acknowledgments and References...........................173
   
   
   
Chapter 5 -- Hidden Markov Models
   5.1 Likelihood and Scoring a Model.......................177
   5.2 Re-estimation of Parameters..........................180
      5.2.1 Baum--Welch Method..............................181
      5.2.2 EM and Justification of the Baum--Welch Method..184
      5.2.3 Baldi--Chauvin Gradient Descent.................187
      5.2.4 Mamitsuka's MA Algorithm........................191
   5.3 Applications.........................................193
      5.3.1 Multiple Sequence Alignment.....................193
      5.3.2 Protein Motifs..................................194
      5.3.3 Eukaryotic DNA Promotor Regions.................195
   5.4 Exercises............................................197
   Acknowledgments and References...........................198
   
   
Chapter 6 -- Structure Prediction
   6.1 RNA Secondary Structure..............................202
   6.2 DNA Strand Separation................................213
   6.3 Amino Acid Pair Potentials...........................223
   6.4 Lattice Models of Proteins...........................228
      6.4.1 Monte--Carlo and the Heteropolymer Protein Model231
         Local Move Set.....................................231
         Pivot Moves........................................232
      6.4.2 Genetic Algorithm for Folding in the HP Model...233
   6.5 Hart and Istrail's Approximation Algorithm...........234
      6.5.1 Performance.....................................234
      6.5.2 Lower Bound.....................................236
      6.5.3 Block Structure, Folding Point, and Balanced Cut239
   6.6 Constraint-Based Structure Prediction................243
   6.7 Protein Threading....................................246
      6.7.1 Definition......................................246
      6.7.2 A Branch-and-Bound Algorithm....................249
      6.7.3 NP-hardness.....................................258
   6.8 Exercises............................................259
   Acknowledgments and References...........................261
   
   
Appendix
   A Mathematical Background................................263
      A.1 Asymptotic complexity.............................263
      A.2 Units of Measurement..............................263
      A.3 Lagrange Multipliers..............................264
   B Resources..............................................265
      B.1 Web Sites.........................................265
      B.2 The PDB Format....................................266
   References...............................................269