Genomic sequencing and structural genomics produced a vast amount of sequence and structural data, creating an opportunity for structure-function analysis in silico [Radivojac P, et al. (2013) Nat Methods 10(3):221-227]. Unfortunately, only a few large experimental datasets exist to serve as benchmarks for function-related predictions. Furthermore, currently there are no reliable means to predict the extent of functional similarity among proteins. Here, we quantify structure-function relationships among three phylogenetic branches of the matrix metalloproteinase (MMP) family by comparing their cleavage efficiencies toward an extended set of phage peptide substrates that were selected from ∼64 million peptide sequences (i.e., a large unbiased representation of substrate space). The observed second-order rate constants [k_obs] across the substrate space provide a distance measure of functional similarity among the MMPs. These functional distances directly correlate with MMP phylogenetic distance. There is also a remarkable and near-perfect correlation between the MMP substrate preference and sequence identity of 50-57 discontinuous residues surrounding the catalytic groove. We conclude that these residues represent the specificity-determining positions (SDPs) that allowed for the expansion of MMP proteolytic function during evolution. A transmutation of only a few selected SDPs proximal to the bound substrate peptide, and contributing the most to selectivity among the MMPs, is sufficient to enact a global change in the substrate preference of one MMP to that of another, indicating the potential for the rational and focused redesign of cleavage specificity in MMPs.

protease | specificity-determining positions | MMPs

A paramount objective of biological research is to understand how sequence encodes function. Previously, functional regions in proteins were identified using large-scale mutagenesis (e.g., alanine scanning) (1).
More recently, insights have largely been gained from computational approaches that compare the sequences and structures of large protein sets across multiple genomes. The vast increase in the number of available sequences makes it possible to assess homology between sequences from genome projects and proteins of known structure and function and, as a result, to identify functional similarities in silico (2-4). Because the global fold of most ordered proteins can be reliably predicted (5, 6) and because the catalytic residues of most classes of enzymes are either known or can be inferred (7, 8), protein sequences can now be directly used to elucidate and classify major protein functions (9-12), and are even being extended to predict enzyme substrates (13).

However, such classifications fail to explain the specialization and expansion of function that is required for organismal plasticity, complexity, and adaptability, all of which are normally driven by gene duplications and subsequent divergence. Computational approaches aimed at identifying functional distinctions across protein families have pri...