An SVD-Based Phylogenetic Method for Creating Comprehensive Whole Genomes

Michael Berry
University of Tennessee

As whole genome sequences continue to grow in number and complexity, effective methods are required for comparing and categorizing both the genes and the species represented within extremely large datasets. Current methods have generally used incomplete (and likely insufficient) subsets of the available data, even as additional data become available at a rapid rate. We have developed an accurate and efficient method for producing robust gene and species phylogenies from very large whole genome protein datasets. The method relies on multidimensional protein vector definitions supplied by the singular value decomposition (SVD) of large sparse data matrices in which each protein is uniquely represented as a vector of overlapping tetrapeptide frequencies. Over 134,000 proteins from 53 complete prokaryotic genomes and one mitochondrion were represented in definition spaces constructed from the 500 to 600 largest singular triplets. Quantitative pairwise estimates of species similarity are obtained by summing the protein vectors to form species vectors, then determining the cosines of the angles between species vectors. These vector-derived similarity measures are converted into evolutionary distance measures, and evolutionary trees are produced from the resulting distance matrices. Although these trees confirmed many accepted prokaryotic relationships, several novel relationships were also noted. In addition, we provide evidence that each of the SVD-derived basis vectors represents a particular conserved protein motif composed of sets of correlated peptides. Each "copep" motif is precisely defined as a particular linear combination of all 160,000 possible tetrapeptides (20^4, one for each ordered sequence of four amino acids). This analysis represents the most detailed simultaneous comparison of prokaryotic genes and species available to date.
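
The pipeline described above (tetrapeptide vectors, SVD, summed species vectors, cosines) can be illustrated with a minimal Python sketch. It is a toy illustration under stated assumptions, not the authors' implementation: the sequences, species labels, and function names are hypothetical, raw counts stand in for frequencies, only tetrapeptides that actually occur are given rows (rather than all 160,000), and a dense `numpy.linalg.svd` is swapped in for the sparse truncated SVD used at full scale. The conversion from similarity to evolutionary distance is omitted because the abstract does not specify it.

```python
from collections import Counter

import numpy as np


def tetrapeptide_counts(seq):
    """Count overlapping 4-residue windows in one protein sequence."""
    return Counter(seq[i:i + 4] for i in range(len(seq) - 3))


def build_matrix(proteins):
    """Rows are observed tetrapeptides, columns are proteins."""
    counts = [tetrapeptide_counts(seq) for seq in proteins]
    vocab = sorted(set().union(*counts))
    row = {tetra: i for i, tetra in enumerate(vocab)}
    matrix = np.zeros((len(vocab), len(proteins)))
    for j, count in enumerate(counts):
        for tetra, n in count.items():
            matrix[row[tetra], j] = n
    return matrix, vocab


def protein_vectors(matrix, k):
    """Coordinates of each protein on the k largest singular triplets."""
    _, s, vt = np.linalg.svd(matrix, full_matrices=False)
    return (s[:k, None] * vt[:k]).T  # shape: (n_proteins, k)


def species_cosines(vectors, species_of):
    """Sum protein vectors per species, then take pairwise cosines."""
    names = sorted(set(species_of))
    labels = np.array(species_of)
    sums = np.array([vectors[labels == name].sum(axis=0) for name in names])
    unit = sums / np.linalg.norm(sums, axis=1, keepdims=True)
    return names, unit @ unit.T


# Hypothetical toy data: six short made-up sequences from three "species".
proteins = [
    "MKTAYIAKQRQISF", "MKTAYIAKQLGISF",  # sp_A
    "MKTAYLAKQRQISW", "MKSAYIAKQRQLSF",  # sp_B
    "GSHWLPVNDECTRY", "GSHWLPVNEECTRF",  # sp_C
]
species_of = ["sp_A", "sp_A", "sp_B", "sp_B", "sp_C", "sp_C"]

matrix, vocab = build_matrix(proteins)
vectors = protein_vectors(matrix, k=3)  # the abstract reports 500 to 600
names, cosines = species_cosines(vectors, species_of)
print(names)
print(np.round(cosines, 3))
```

For a dataset of the size reported in the abstract, the dense call would not be practical; a sparse truncated solver such as `scipy.sparse.linalg.svds` is one plausible substitute, though the abstract does not say which solver was used.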