The nature of semistructured data in web collections is evolving. Even when XML web documents are valid with regard to a schema, the actual structure of such documents exhibits significant variations across collections for several reasons: an XML schema may be very lax (e.g., to accommodate the flexibility needed to represent collections of documents in RSS feeds), a schema may be large and different subsets may be used for different documents (e.g., this is common in industry standards like UBL), or open content models may allow arbitrary schemas to be mixed (e.g., RSS extensions like those used for podcasting). A schema alone may not provide sufficient information for many data management tasks that require knowledge of the actual structure of the collection.

Web applications (such as processing RSS feeds or web service messages) rely on XPath-based data manipulation tools. Web developers need to use XPath queries effectively on increasingly large web collections containing hundreds of thousands of XML documents. Even when tasks only need to deal with a single document at a time, developers benefit from understanding the behaviour of XPath expressions across multiple documents (e.g., what will a query return when run over the thousands of hourly feeds collected during the last few months?). Dealing with the (highly variable) structure of such web collections poses additional challenges.

First and foremost, I wish to express my utmost gratitude to Alberto Mendelzon and Renée Miller, both of whom have been a source of inspiration for many years. They were also the Alpha and the Omega of my graduate studies: Alberto was my advisor during my master's and the early years of my PhD, and also the first one to suggest the idea of a framework upon which this thesis is based; Renée supervised the final stages of the work and made sure that all the pieces fitted together. Both provided me with support and guidance well beyond the call of duty.
They really made this work possible.

Special thanks go to José María Turull Torres and Alejandro Vaisman, great friends and mentors. José María introduced me to the fascinating world of scientific research and encouraged me to pursue graduate studies. I could never thank him enough for his guidance in the first steps of my research career. Alejandro's encouragement and insight were always invaluable, especially during the most difficult times of my PhD (he was my thesis advisor in disguise for many years). I consider them my academic role models for their integrity and professionalism.

I would like to thank the members of my PhD committee, Kelly Lyons, Thodoros Topaloglou, and John Mylopoulos, for their insightful comments, and Frank Tompa for his thorough external appraisal. I also wish to acknowledge the contribution of Mariano Consens, who helped in the development of some core ideas of this thesis.

I am deeply indebted to the administrative staff of the Department of Computer Science for assisting me in so many different ways. Joan Allen and Linda Chow deserve a special mention. I...