EECS500 Fall 2014 Distinguished Lecture

Tandy Warnow
New HMM-based methods for Ultra-large Alignment and Phylogeny Estimation
University of Illinois at Urbana-Champaign
White 411
September 16, 2014

Multiple sequence alignment of datasets containing many thousands of sequences is a challenging problem with applications in phylogeny estimation, protein structure and function prediction, taxon identification of metagenomic data, etc. However, few methods can analyze large datasets, and none have been shown to have good accuracy on datasets with more than about 10,000 sequences, especially if the sequences have evolved under high rates of evolution. In this talk, I will present a new method to obtain highly accurate estimates of large-scale multiple sequence alignments and phylogenies. The basic idea is to use a family of Hidden Markov Models (HMMs) to represent a "seed alignment", and then align all the remaining sequences to the seed alignment. Our method, UPP, returns very accurate alignments, and trees estimated on these alignments are also very accurate, even on datasets with as many as 1,000,000 sequences. Furthermore, UPP is both fast and very scalable: the analysis of the 1-million taxon dataset took only 24 hours using 12 cores and small amounts of memory. Finally, this "HMM Family" technique can also be used for other machine learning problems, including taxon identification of metagenomic data.


Tandy Warnow is the Founder Professor of Bioengineering and Computer Science at the University of Illinois at Urbana-Champaign.  Her research combines mathematics, computer science, and statistics to develop improved models and algorithms for reconstructing complex and large-scale evolutionary histories in both biology and historical linguistics.

Tandy received her PhD in Mathematics at UC Berkeley under the direction of Gene Lawler, and did postdoctoral training with Simon Tavaré and Michael Waterman at USC. She received the National Science Foundation Young Investigator Award in 1994, the David and Lucile Packard Foundation Award in Science and Engineering in 1996, a Radcliffe Institute Fellowship in 2006, and a Guggenheim Foundation Fellowship for 2011. Her current research focuses on phylogeny and alignment estimation for very large datasets (10,000 to 500,000 sequences), estimating species trees from collections of gene trees, and metagenomics.