Introduction to computational molecular biology, with a focus on the development and implementation of efficient algorithms for problems generally related to genomics. Sample topics include sequence homology and alignment, phylogenetic tree construction methods ("All about Eve"), hidden Markov models and their applications (e.g., multiple sequence alignment, recognition of genes and promoter sequences), RNA secondary structure prediction, protein structure determination on lattice models, and the determination of DNA strand separation sites in duplication and replication events. The course will present all necessary concepts from molecular biology and probability theory, but requires good algorithm development and programming skills.
Course work involves programming on a Linux platform in C/C++ (preferred in bioinformatics) or in Java, together with some scripting in Python or Perl. Programming experience in one of the languages C/C++, Java, Python, or Perl is required, though we will give a brief introduction to Python and its applications in bioinformatics. Those unfamiliar with Unix are expected to learn essential Unix on their own within the first week (cf. Chapters 3, 4, and 5 of Developing Bioinformatics Computer Skills).
The course is intended for advanced undergraduate students and graduate students in mathematics, computer science, biology, chemistry, and physics. Course material could also prove useful for graduate finance or graduate management students interested in acquiring scientific and technical expertise in bioinformatics to inform investment strategies.
Prerequisites are
The course grade will depend on homework (mostly implementation of algorithms), a take-home midterm and a 2.5-hour final examination, on Thursday, May 9 at 12:30 P.M. in Fulton 415.
In the past, biologists generally grouped living organisms into two distinct life forms or domains:
Methanococcus jannaschii is a methanogenic archaebacterium, first collected in 1982 by the Woods Hole submersible Alvin near white smokers at a hot spot on the floor of the Pacific Ocean, at a depth of 2600 meters. In August 1996, the 1.66 megabase pair genome of M. jannaschii was published by Bult et al. in Science, where it was asserted that more than 56% of its 1738 genes are completely new, unlike any genes in existing databases. The DNA sequence consists of over 1.6 million characters; a small initial portion is given as follows.
TACATTAGTGTTTATTACATTGAGAAACTTTATAATTAAAAAAGATTCATGTAAATTTCT TATTTGTTTATTTAGAGGTTTTAAATTTAATTTCTAAGGGTTTGCTGGTTTGATTGTTTA GAATATTTAACTTAATCAAATTATTTGAATTTTTGAAAATTAGGATTAATTAGGTAAGTA AATAAAATTTCTCTAACAAATAAGTTAAATTTTTAAATTTAAGGAGATAAAAATACTCTG TTTTATTATGGAAAGAAAGATTTAAATACTAAAGGGTTTATATTATGAAGTAGTTACTTA CCCTTAGAAAAATATGGTATAGAAAAGCTTAAATATTAAGAGTGATGAAGTATATTATGT
Analysis of the DNA sequence of M. jannaschii provided solid evidence for a startling hypothesis advanced two decades earlier by Carl Woese: there is a third domain of life called Archaea, which is distinct from Prokarya and Eukarya.
How can one determine the (hypothetical) genes of M. jannaschii from its 1.66 megabase pair genome? Obviously this must be done by a computer program, but if the majority of the (hypothetical) genes in this new life form have no homology to known genes, then how does the program work?
The TIGR group of Bult et al. used the commercial software GeneMark, which implements a fifth-order Markov model. We'll study Markov chains and important machine learning algorithms for recognizing (inexact) patterns. In particular, we'll study the theory of hidden Markov models (HMMs) and then implement them; HMMs are currently used in determining genes, intron/exon splice sites, parts of the genome wrapped around nucleosomes, etc.
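To give a flavor of how a Markov model can score a stretch of DNA, here is a minimal Python sketch of a k-th order Markov chain with a log-odds test. It is an illustration only and not GeneMark's actual method: the function names (train_markov, log_likelihood, log_odds), the add-one smoothing, the toy training strings, and the choice k = 2 are all assumptions made for this example.

```python
import math
from collections import defaultdict

ALPHABET = "ACGT"

def train_markov(seqs, k, pseudocount=1.0):
    """Estimate P(next base | previous k bases) from training sequences.

    Returns a function prob(context, base). Pseudocounts (an assumption of
    this sketch) keep unseen transitions from getting probability zero.
    """
    counts = defaultdict(lambda: defaultdict(float))
    for s in seqs:
        for i in range(k, len(s)):
            counts[s[i - k:i]][s[i]] += 1

    def prob(context, base):
        seen = counts.get(context, {})
        total = sum(seen.values()) + pseudocount * len(ALPHABET)
        return (seen.get(base, 0.0) + pseudocount) / total

    return prob

def log_likelihood(seq, prob, k):
    """Log-probability of seq under the chain, ignoring the first k bases."""
    return sum(math.log(prob(seq[i - k:i], seq[i])) for i in range(k, len(seq)))

def log_odds(seq, model, background, k):
    """Positive values mean seq looks more like 'model' than 'background'."""
    return log_likelihood(seq, model, k) - log_likelihood(seq, background, k)

# Toy training data, invented for illustration (not real genes).
K = 2
coding = train_markov(["ATGATGATGATG"], K)
background = train_markov(["AAAATTTTAAAATTTT"], K)

for candidate in ("ATGATGATG", "AAAATTTT"):
    score = log_odds(candidate, coding, background, K)
    label = "coding-like" if score > 0 else "background-like"
    print(candidate, label)
```

A real gene finder would train on known coding and non-coding regions, use a higher order (such as the fifth-order model mentioned above), and slide the log-odds test along the genome in each reading frame.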
Sequence similarity between the new genes of M. jannaschii and those in existing databases was determined by programs. How do these programs work?
We'll consider software such as BLAST, developed by Altschul et al. and available on the net, as well as dynamic programming algorithms such as Smith-Waterman.
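As a preview of the dynamic programming approach, the following is a minimal Python sketch of Smith-Waterman local alignment with a linear gap penalty. The scoring values (match = 2, mismatch = -1, gap = -2) and the function name smith_waterman are assumptions chosen for illustration; practical tools use substitution matrices and affine gap penalties.

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-2):
    """Return (best_score, aligned_a, aligned_b) for one best local alignment."""
    rows, cols = len(a) + 1, len(b) + 1
    # H[i][j] = best score of a local alignment ending at a[i-1], b[j-1].
    H = [[0] * cols for _ in range(rows)]
    best, best_pos = 0, (0, 0)
    for i in range(1, rows):
        for j in range(1, cols):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The 0 option lets an alignment restart anywhere (local, not global).
            H[i][j] = max(0, diag, H[i - 1][j] + gap, H[i][j - 1] + gap)
            if H[i][j] > best:
                best, best_pos = H[i][j], (i, j)

    # Trace back from the highest-scoring cell until a zero cell is reached.
    i, j = best_pos
    out_a, out_b = [], []
    while i > 0 and j > 0 and H[i][j] > 0:
        step = match if a[i - 1] == b[j - 1] else mismatch
        if H[i][j] == H[i - 1][j - 1] + step:
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return best, "".join(reversed(out_a)), "".join(reversed(out_b))

# Two made-up sequences sharing the substring ACGT.
print(smith_waterman("TTTACGTAAA", "GGGACGTCCC"))  # (8, 'ACGT', 'ACGT')
```

The table has (len(a) + 1) x (len(b) + 1) cells, so time and memory grow with the product of the two lengths. That cost is one reason heuristic tools such as BLAST are used for database-scale searches.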
Computational biology is a new, rapidly expanding field concerned with the development of algorithms for
Suggested Texts (not required)
Homework, class participation: 30%
Midterm: 30%
Final Exam: 40%
The grading policy is subject to change. Any change will be announced clearly and well in advance.