Large-scale Web crawls have emerged as the state of the art for studying characteristics of the Web. In particular, they are a core tool for online tracking research. Web crawling is an attractive approach to data collection, as crawls can be run at relatively low infrastructure cost and do not require handling sensitive user data such as browsing histories. However, the biases introduced by using crawls as a proxy for human browsing data have not been well studied. Crawls may fail to capture the diversity of user environments, and the snapshot view of the Web presented by one-time crawls does not reflect its constantly evolving nature, which hinders reproducibility of crawl-based studies. In this paper, we quantify the repeatability and representativeness of Web crawls in terms of common tracking and fingerprinting metrics, considering both variation across crawls and divergence from human browser usage. We quantify baseline variation of simultaneous crawls, then isolate the effects of time, cloud vs. residential IP addresses, and operating system. This provides a foundation to assess the agreement between crawls visiting a standard list of high-traffic websites and actual browsing behaviour measured from an opt-in sample of over 50,000 users of the Firefox Web browser. Our analysis reveals differences between how sites treat stateless crawling infrastructure and generally stateful human browsing, showing, for example, that crawlers tend to experience higher rates of third-party activity than human browser users when loading pages from the same domains.
In this work, we propose a 2D-PCA based face recognizer as a semi-automatic tool to help index people in historical photographs. In the proposed recognizer, we cope with the scarcity of training samples and the lack of precision of the detector using a two-stage training scheme. The first stage uses an external face database to compute an average face, which serves as a reference both at the second training stage and at the recognition step. We also add an auxiliary distance measure, which we call relative distance, to reorder the results generated by the original Euclidean-based distance measure for 2D-PCA. Experimental results using the ORL database as the external face database and a real collection of historical photographs have shown the viability of the proposed tool. These experiments also indicated that both proposed improvements indeed increased recognition rates.
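The core of the recognizer described above can be illustrated with a minimal sketch of standard 2D-PCA (following the usual formulation on image matrices). This is not the authors' exact pipeline: the two-stage training against an external database and the relative-distance re-ranking are omitted, and all array shapes here are illustrative assumptions.

```python
import numpy as np

def train_2dpca(images, d):
    """Fit 2D-PCA on a stack of face images of shape (M, h, w).

    Returns the average face (the reference image) and the top-d
    projection axes of the image covariance matrix.
    """
    mean_face = images.mean(axis=0)           # average face used as reference
    centered = images - mean_face
    # Image covariance matrix G (w x w): mean of A_i^T A_i over the set
    G = np.einsum('nij,nik->jk', centered, centered) / len(images)
    eigvals, eigvecs = np.linalg.eigh(G)      # eigh returns ascending order
    X = eigvecs[:, ::-1][:, :d]               # keep the top-d eigenvectors
    return mean_face, X

def project(image, X):
    """Project an (h, w) image onto the 2D-PCA axes -> (h, d) features."""
    return image @ X

def distance(Y1, Y2):
    """Standard 2D-PCA distance: sum of column-wise Euclidean norms."""
    return np.linalg.norm(Y1 - Y2, axis=0).sum()
```

At recognition time, a probe image would be projected with `project` and matched to the gallery identity minimizing `distance`; the paper's relative distance would then reorder this Euclidean-based ranking.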
This work presents a system that classifies images collected from the World Wide Web. The system separates images into two semantic classes: photographs and graphics. Photographs are images that depict natural scenes, such as people, faces, flowers, animals, landscapes, and cities. Graphics are logos, drawings, icons, maps, and backgrounds, frequently generated by computer. To perform this classification, we use metrics based on the differences between the two image types; each metric returns a numerical value that points toward one of the two classes. The classification itself relies on a supervised, rule-generating technique: the ID3 method, which induces a decision tree used to classify the images.
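The classification step described above can be sketched with a small ID3 implementation. Since ID3 operates on categorical attributes, this sketch assumes the numerical metrics have already been discretized (e.g. into "low"/"high" buckets); the attribute values and the tiny example dataset below are hypothetical, not taken from the paper.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Information gain of splitting on attribute index attr."""
    total = len(labels)
    remainder = 0.0
    for value in set(e[attr] for e in examples):
        subset = [l for e, l in zip(examples, labels) if e[attr] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attrs):
    """Induce a decision tree; leaves are class labels."""
    if len(set(labels)) == 1:
        return labels[0]                      # pure node
    if not attrs:
        return Counter(labels).most_common(1)[0][0]
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))
    tree = {'attr': best, 'branches': {}}
    for value in set(e[best] for e in examples):
        idx = [i for i, e in enumerate(examples) if e[best] == value]
        tree['branches'][value] = id3([examples[i] for i in idx],
                                      [labels[i] for i in idx],
                                      [a for a in attrs if a != best])
    return tree

def classify(tree, example):
    """Walk the tree until a leaf label is reached."""
    while isinstance(tree, dict):
        tree = tree['branches'][example[tree['attr']]]
    return tree
```

For instance, with examples whose first attribute is a discretized colour-count metric, the induced tree splits on the attribute with the highest information gain and classifies unseen images as "photograph" or "graphic".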