BI507/MC615
Computational Molecular Biology
Tu Th 1:30-3:00 in Higgins 425


Course description | Intended audience, prerequisites, and course work | What is computational biology? | Text | Tentative Syllabus | Lectures, Class Notes and Data | Source code of programs | Grading policy | Homework

Course description

Introduction to computational molecular biology, with a focus on the development and implementation of efficient algorithms for problems generally related to genomics. Sample topics include sequence homology and alignment, phylogenetic tree construction methods (``All about Eve''), hidden Markov models and their applications (e.g. multiple sequence alignment, recognition of genes and promoter sequences), RNA secondary structure prediction, protein structure determination on lattice models, and the determination of DNA strand separation sites in duplication and replication events. The course will present all necessary concepts from molecular biology and probability theory, but requires good algorithm development and programming skills.

Course work involves programming on the Linux platform in C/C++ (preferred in bioinformatics) or in Java, together with some scripting in Python or Perl. Programming experience in at least one of C/C++, Java, Python, or Perl is required, though we will give a brief introduction to Python and its applications in bioinformatics. Those unfamiliar with Unix are expected to learn the essentials on their own within the first week (cf. Chapters 3, 4, and 5 of Developing Bioinformatics Computer Skills).

Topics will be among the following, though new topics may be introduced during the course.

  1. Overview of molecular biology for non-biologists: nucleic acids, proteins, shotgun sequencing, PCR, physical maps, double digest problem.
  2. Combinatorial optimization techniques (genetic algorithms, Monte Carlo, simulated annealing), probability theory, maximum likelihood, Markov chains, Shannon entropy and applications (such as Logo plots and DNA segmentation algorithms).
  3. Sequence alignment: BLAST, FASTA, Smith-Waterman, global and local alignment.
  4. ``All about Eve'', the mitochondrial Eve hypothesis, clustering algorithms, phylogenetic trees, maximum likelihood, quartet puzzling, synteny.
  5. Hidden Markov models and applications to detection of promoter sequences in eukaryotic DNA and to multiple sequence alignment.
  6. Motif detection, weight matrices, Gibbs samplers, support vector machines.
  7. RNA secondary structure determination using dynamic programming, ``neutral networks'' and mathematical evolution theory.
  8. Protein structure determination on lattice models, a computational approach to the question of ``How optimal is the genetic code?'', DNA strand separation in duplication and replication events.


Intended audience, prerequisites, and course work

Advanced undergraduate students and graduate students in mathematics, computer science, biology, chemistry, and physics. Course material could also prove useful for graduate finance or management students interested in acquiring scientific and technical expertise in bioinformatics to inform investment strategies.

Prerequisites are

The course is based on Computational Molecular Biology: An Introduction, by P. Clote and R. Backofen, Wiley & Sons, Inc. (August 2000), and will give a largely self-contained presentation of the necessary notions from probability/statistics and biology. Students who do not program in one of C/C++, Java, Python, or Perl, but who have good programming skills (e.g. in Fortran), should speak with the instructor.

The course grade will depend on homework (mostly implementation of algorithms), a take-home midterm, and a 2.5-hour final examination, which will be held on Thursday, May 9 at 12:30 P.M. in Fulton 415.


What is Computational Biology?

In the past, biologists generally grouped living organisms into two distinct life forms or domains: Prokarya and Eukarya.

Methanococcus jannaschii is a methanogenic archaebacterium, first collected in 1982 by the Woods Hole submersible Alvin near white smokers at a hot spot on the sea floor of the Pacific Ocean, at a depth of 2600 meters. In August 1996, the 1.66 megabase pair genome of M. jannaschii was published by Bult et al. in Science, where it was asserted that more than 56% of its 1738 genes are completely new, unlike any genes in existing databases. The complete DNA sequence consists of over 1.6 million characters; a small initial portion is given as follows.

TACATTAGTGTTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAAATTTCT
TATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGCTGGTTTGATTGTTTA
GAATATTTAACTTAATCAAATTATTTGAATTTTTGAAAATTAGGATTAATTAGGTAAGTA
AATAAAATTTCTCTAACAAATAAGTTAAATTTTTAAATTTAAGGAGATAAAAATACTCTG
TTTTATTATGGAAAGAAAGATTTAAATACTAAAGGGTTTATATTATGAAGTAGTTACTTA
CCCTTAGAAAAATATGGTATAGAAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT

Analysis of the DNA sequence of M. jannaschii provided solid evidence for a startling hypothesis advanced two decades earlier by Carl Woese: there is a third domain of life called Archaea, which is distinct from Prokarya and Eukarya.

How can one determine the (hypothetical) genes of M. jannaschii from its 1.66 megabase pair genome? Obviously this must be done by a computer program, but if the majority of the (hypothetical) genes in this new life form have no homology to known genes, then how does the program work?

The TIGR group of Bult et al. used the commercial software GeneMark, which implements a fifth-order Markov model. We'll study Markov chains and important machine learning algorithms for recognizing (inexact) patterns. In particular, we'll study the theory of hidden Markov models (HMMs) and then implement them; HMMs are currently used in determining genes, intron/exon splice sites, parts of the genome wrapped around nucleosomes, etc.
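
To make the idea of a k-th order Markov model concrete, here is a minimal Python sketch. It is not the gene finder's actual implementation: it only estimates P(next base | preceding k bases) from a training sequence and scores a query sequence by log-likelihood. The function names train_markov and log_likelihood, and the add-one pseudocount, are assumptions made for illustration.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(seq, k, pseudocount=1.0):
    """Estimate P(base | preceding k bases) from seq, with additive smoothing.

    Returns a function prob(context, base). Hypothetical helper for illustration.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for i in range(k, len(seq)):
        context = seq[i - k:i]          # the k bases preceding position i
        counts[context][seq[i]] += 1.0

    def prob(context, base):
        ctx_counts = counts.get(context, {})
        total = sum(ctx_counts.values()) + pseudocount * len(ALPHABET)
        return (ctx_counts.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, k, prob):
    """Sum of log P(seq[i] | seq[i-k:i]) over all positions with a full context."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))

# Toy usage: a first-order model trained on an alternating sequence
# should prefer alternating queries over homopolymers.
model = train_markov("AT" * 8, k=1)
score_alternating = log_likelihood("ATAT", 1, model)
score_homopolymer = log_likelihood("AAAA", 1, model)
```

A gene finder in this spirit would train one such model on known coding regions and another on non-coding regions, then compare the two log-likelihoods over a sliding window; setting k=5 gives a fifth-order model.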

Sequence similarity between the new genes of M. jannaschii and those in existing databases was determined by programs. How do these programs work?

We'll consider software such as BLAST, developed by Altschul et al. and available on the net, as well as dynamic programming algorithms such as Smith-Waterman.
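
As a rough illustration of the dynamic programming idea behind Smith-Waterman, the following Python sketch computes only the best local alignment score (no traceback) using a linear gap penalty. The function name smith_waterman_score and the scoring values (match=2, mismatch=-1, gap=-2) are assumptions chosen for the example, not values prescribed by the course.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score of strings a and b.

    Uses a linear gap penalty and keeps only two rows of the DP table.
    """
    prev = [0] * (len(b) + 1)
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1]
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0, so an
            # alignment can start afresh at any position.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best

# Toy usage: the shared substring "ACG" gives 3 matches.
example_score = smith_waterman_score("TTACGTTT", "GGACGGG")
```

The clamp at zero is what distinguishes local from global alignment; recovering the aligned substrings themselves would require storing the full table and tracing back from the best-scoring cell.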

Computational biology is a new field, rapidly expanding, and concerns itself with the development of algorithms for


Required Texts

  1. Computational Molecular Biology: An Introduction, by P. Clote and R. Backofen, Wiley & Sons, Inc. (August 2000), ISBN 1-56592-664-1.
  2. Fundamental Concepts of Bioinformatics, by Dan E. Krane, Michael L. Raymer, Elaine Nicpon Marieb, Benjamin/Cummings Publisher (2002).

Suggested Texts (not required)

Additional references (for your reference, do not purchase)

  1. All you need to know about DNA, Genes and Genetic Engineering: A Concise, Comprehensive Outline, by Gordon R. Carter and Stephen M. Boyle, Charles C. Thomas Publisher, Ltd., Springfield, Illinois, 1998
  2. Introduction to Computational Biology, Michael Waterman, Chapman & Hall, London, 1995
  3. Molecular Biology of the Gene, J.D. Watson et al., 3rd edition, Benjamin/Cummings Publishing Co, 1987
  4. Introduction to Computational Molecular Biology, J. Setubal and J. Meidanis, PWS Publishing Co, 1997
  5. Introduction to Protein Structure, C. Branden and J. Tooze, Garland Pub, NY, 1991
  6. Molecular Evolutionary Genetics, M. Nei, Columbia University Press 1987
  7. Genes V, B. Lewin


Grading Policy

Homework, class participation 30%
Midterm 30%
Final Exam 40%

The grading policy is subject to change. Any change will be announced clearly and well in advance.

