This paper gives an overview of the Turn Data Management Platform (DMP). We explain the purpose of this type of platform and show how it is positioned in the current digital advertising ecosystem. We also provide a detailed description of the key components of Turn DMP, covering (1) data ingestion and integration, (2) data warehousing and analytics, and (3) real-time data activation. For each component, we discuss the main technical and research challenges as well as alternative design choices. One of the main goals of this paper is to highlight the central role that data management plays in shaping this fast-growing, multi-billion-dollar industry.
We propose a new similarity-based recommendation system for large-scale dynamic marketplaces. Our solution consists of an offline process, which generates long-term cluster definitions grouping short-lived item listings, and an online system, which uses these clusters first to focus on the important similarity dimensions and then to trade off further similarity against other quality factors such as seller trustworthiness. Our system generates these clusters from several hundred million item listings using a large Hadoop MapReduce-based system. The clusters are learned using user queries as the main information source and are therefore biased towards how users conceptually group items. Our system is deployed at large scale on several eBay sites and has improved user-engagement and business metrics compared to the previous system. We show that using user queries helps capture similarity better. We also present experiments demonstrating that adapting the ranking function, which controls the trade-off between similarity and quality, to a specific context improves recommendation performance.
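As a rough illustration of the online step, the sketch below restricts candidate listings to the seed item's cluster and then ranks them by a weighted trade-off between similarity and a quality signal such as seller trustworthiness. The data layout, the linear scoring form, and the weight value are assumptions made for illustration only, not the deployed eBay ranking function.

```python
# Hypothetical sketch of the online ranking step: candidates are limited to the
# seed item's offline-learned cluster, then ranked by a similarity/quality
# trade-off. All names and the linear score are illustrative assumptions.
from dataclasses import dataclass


@dataclass
class Listing:
    listing_id: str
    cluster_id: str   # long-term cluster learned offline from user queries
    similarity: float  # similarity to the seed item, in [0, 1]
    quality: float     # e.g. seller trustworthiness, in [0, 1]


def rank_recommendations(seed_cluster: str, candidates: list[Listing],
                         similarity_weight: float = 0.7) -> list[Listing]:
    """Filter to the seed item's cluster, then trade off similarity vs. quality."""
    in_cluster = [c for c in candidates if c.cluster_id == seed_cluster]
    score = lambda c: similarity_weight * c.similarity + (1 - similarity_weight) * c.quality
    return sorted(in_cluster, key=score, reverse=True)


# Example: lowering similarity_weight is one way to "adapt the ranking function
# to a specific context" where quality matters more.
candidates = [
    Listing("a", "shoes", similarity=0.9, quality=0.3),
    Listing("b", "shoes", similarity=0.7, quality=0.9),
    Listing("c", "bags", similarity=0.95, quality=0.8),  # filtered out: wrong cluster
]
print([l.listing_id for l in rank_recommendations("shoes", candidates)])
```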
We consider the coverage testing problem: given a document and a corpus with a limited query interface, determine whether the corpus contains a near-duplicate of the document. This problem has applications in competitive coverage testing for search engines. To solve it, we propose approaches that work in three main steps: generate a query signature from the document, query the corpus using the signature and scrape the returned results, and validate the similarity between the input document and the returned results. We discuss techniques to control and bound the performance of these methods. We perform large-scale experimental validation and show that these methods perform well across different search engine corpora and documents in multiple languages. They are also robust to variations in performance parameters.
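The three-step pipeline lends itself to a small sketch. In the version below, the query signature is built from the document's rarest terms, the corpus query is left as a stub (it depends on the limited search interface being tested), and validation uses word-shingle Jaccard similarity; these concrete choices are stand-ins for the paper's actual techniques rather than a faithful reimplementation.

```python
# Minimal sketch of the three steps: signature generation, corpus query (stub),
# and near-duplicate validation. Helper names and thresholds are assumptions.
import re
from collections import Counter


def query_signature(document: str, num_terms: int = 8) -> str:
    """Step 1: pick the least frequent terms in the document as a query."""
    terms = re.findall(r"[a-z]+", document.lower())
    rarest = [t for t, _ in sorted(Counter(terms).items(), key=lambda kv: kv[1])]
    return " ".join(rarest[:num_terms])


def search_corpus(query: str) -> list[str]:
    """Step 2 (stub): submit the query to the limited interface, scrape results."""
    raise NotImplementedError("depends on the search engine corpus being tested")


def shingles(text: str, k: int = 4) -> set[tuple[str, ...]]:
    words = re.findall(r"[a-z]+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def is_near_duplicate(doc: str, result: str, threshold: float = 0.6) -> bool:
    """Step 3: validate similarity between the input document and a result."""
    a, b = shingles(doc), shingles(result)
    return bool(a and b) and len(a & b) / len(a | b) >= threshold
```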
A website can regulate search engine crawler access to its content using the robots exclusion protocol, specified in its robots.txt file. The rules in the protocol let the site permit or disallow access to part or all of its content for particular crawlers, resulting in a favorable or unfavorable bias towards some of them. A 2007 survey of the robots.txt usage of about 7,593 sites found some evidence of such biases, and the news led to widespread discussions on the web. In this paper, we report on our survey of about 6 million sites. Our survey corrects the shortcomings of the previous one and shows no significant preference towards any particular search engine.
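For illustration, the sketch below shows one way per-crawler bias could be read off a robots.txt file using the Python standard-library parser: for each crawler's user-agent, check whether the site root is fetchable. The crawler list and the root-fetchability criterion are simplifying assumptions, not the survey's methodology.

```python
# Illustrative check of per-crawler access in a robots.txt file. The user-agent
# list and the "root fetchable" criterion are assumptions for this example.
from urllib.robotparser import RobotFileParser

CRAWLERS = ["Googlebot", "Slurp", "msnbot", "*"]  # assumed example user-agents


def crawler_access(robots_txt: str, site: str = "http://example.com/") -> dict[str, bool]:
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {ua: parser.can_fetch(ua, site) for ua in CRAWLERS}


# Example: a file that favors Googlebot over all other crawlers.
sample = """
User-agent: Googlebot
Disallow:

User-agent: *
Disallow: /
"""
print(crawler_access(sample))
# {'Googlebot': True, 'Slurp': False, 'msnbot': False, '*': False}
```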