Common benchmark data sets, standardized performance metrics, and baseline algorithms have had a considerable impact on research and development in a variety of application domains. These resources provide both consumers and developers of technology with a common framework to objectively compare the performance of different algorithms and algorithmic improvements. In this paper, we present such a framework for evaluating object detection and tracking in video, specifically for face, text, and vehicle objects. This framework includes the source video data, ground-truth annotations (along with guidelines for annotation), performance metrics, evaluation protocols, and tools including scoring software and baseline algorithms. For each detection and tracking task and supported domain, we developed a 50-clip training set and a 50-clip test set. Each data clip is approximately 2.5 minutes long and has been completely spatially/temporally annotated at the I-frame level. Each task/domain therefore has an associated annotated corpus of approximately 450,000 frames. The scope of such annotation is unprecedented and was designed to begin to support the quantities of data needed for robust machine learning approaches, as well as statistically significant comparisons of algorithm performance. The goal of this work was to systematically address the challenges of object detection and tracking through a common evaluation framework that permits a meaningful objective comparison of techniques, provides the research community with sufficient data for the exploration of automatic modeling techniques, encourages the incorporation of objective evaluation into the development process, and contributes lasting resources of a scale and magnitude that will serve the computer vision research community for years to come.
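As a rough consistency check, the figure of approximately 450,000 annotated frames per task/domain follows directly from the clip counts and durations quoted above; the sketch below assumes 30 fps source video, which the abstract does not state explicitly.

```python
# Back-of-the-envelope check of the annotated-corpus size per task/domain.
# Assumption (not stated in the abstract): source video at roughly 30 fps.
clips_per_set = 50      # clips in the training set (same size for the test set)
sets_per_task = 2       # training set + test set
clip_minutes = 2.5      # approximate clip length in minutes
fps = 30                # assumed frame rate

frames_per_task = clips_per_set * sets_per_task * clip_minutes * 60 * fps
print(f"~{frames_per_task:,.0f} frames per task/domain")  # ~450,000
```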
Abstract-Perceptual organization offers an elegant framework to group low-level features that are likely to come from a single object. We offer a novel strategy to adapt this grouping process to objects in a domain. Given a set of training images of objects in context, the associated learning process decides on the relative importance of the basic salient relationships such as proximity, parallelness, continuity, junctions, and common region toward segregating the objects from the background. The parameters of the grouping process are cast as probabilistic specifications of Bayesian networks that need to be learned. This learning is accomplished using a team of stochastic automata in an N-player cooperative game framework. The grouping process, which is based on graph partitioning, is able to form large groups from relationships defined over a small set of primitives and is fast. We statistically demonstrate the robust performance of the grouping and the learning frameworks on a variety of real images. Among the interesting conclusions are the significant role of photometric attributes in grouping and the ability to form large salient groups from a set of local relations, each defined over a small number of primitives.
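To make the learning setup more concrete, the following is a minimal sketch of a team of learning automata playing an N-player cooperative game under a linear reward-inaction update, one automaton per salient relation. The discretized candidate weights and the grouping_quality payoff are hypothetical placeholders for illustration, not the paper's actual parameterization or evaluation function.

```python
import random

# One automaton per salient relation; each maintains a probability vector over
# a small, discretized set of candidate relation weights (an assumption made
# here purely for illustration).
RELATIONS = ["proximity", "parallelness", "continuity", "junctions", "common_region"]
CANDIDATE_WEIGHTS = [0.0, 0.25, 0.5, 0.75, 1.0]
LEARNING_RATE = 0.05

def grouping_quality(weights):
    """Hypothetical placeholder: score the grouping produced with these relation
    weights against ground-truth object regions on the training images."""
    return random.random()  # stand-in for an evaluation on real data, in [0, 1]

probs = {r: [1.0 / len(CANDIDATE_WEIGHTS)] * len(CANDIDATE_WEIGHTS) for r in RELATIONS}

for _ in range(1000):
    # Each automaton independently samples an action (a candidate weight).
    choices = {r: random.choices(range(len(CANDIDATE_WEIGHTS)), probs[r])[0]
               for r in RELATIONS}
    # All automata receive the same (common) payoff from the environment.
    payoff = grouping_quality({r: CANDIDATE_WEIGHTS[i] for r, i in choices.items()})

    # Linear reward-inaction update: shift probability mass toward the action
    # just taken, in proportion to the common payoff.
    for r, a in choices.items():
        p = probs[r]
        probs[r] = [pi + LEARNING_RATE * payoff * ((1.0 if i == a else 0.0) - pi)
                    for i, pi in enumerate(p)]

best = {r: CANDIDATE_WEIGHTS[max(range(len(p)), key=p.__getitem__)]
        for r, p in probs.items()}
print(best)  # most probable relation weights after training
```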
Abstract-In recent years, one of the effective engines for perceptual organization of low-level image features has been based on the partitioning of a graph representation that captures Gestalt-inspired local structures, such as similarity, proximity, continuity, parallelism, and perpendicularity, over the low-level image features. Mainly motivated by computational efficiency considerations, this graph partitioning process is usually implemented as a recursive bipartitioning process, where, at each step, the graph is broken into two parts based on a partitioning measure. We concentrate on three such measures, namely, the minimum [41], average [28], and normalized [32] cuts. The minimum cut partition seeks to minimize the total link weight cut. The average cut measure is proportional to the total link weight cut, normalized by the sizes of the partitions. The normalized cut measure is normalized by the product of the total connectivity (valencies) of the nodes in each partition. We provide theoretical and empirical insight into the nature of the three partitioning measures in terms of the underlying image statistics. In particular, we consider for what kinds of image statistics optimizing a measure, irrespective of the particular algorithm used, would result in correct partitioning. Is the quality of the groups significantly different for each cut measure? Are there classes of images for which grouping by partitioning does not work well? Another question of interest is whether the recursive bipartitioning strategy can separate out groups corresponding to K objects from each other. In the analysis, we draw from probability theory and the rich body of work on stochastic ordering of random variables. Our major conclusion is that optimization of none of the three measures is guaranteed to result in the correct partitioning of K objects, in the strict stochastic order sense, for all image statistics. Qualitatively speaking, under very restrictive conditions, when the average interobject feature affinity is very weak compared to the average intraobject feature affinity, the minimum cut measure is optimal. The average cut measure is optimal for graphs whose partition width is less than the mode of the distribution of all possible partition widths. The normalized cut measure is optimal for a more restrictive subclass of graphs whose partition width is less than the mode of the partition width distributions and whose interobject links are six times weaker than the intraobject links. Rigorous empirical evaluation on 50 real images indicates that, in practice, the quality of the groups generated using minimum, average, or normalized cuts is statistically equivalent for object recognition, i.e., the best, the mean, and the variation of the qualities are statistically equivalent. We also find that, for certain image classes, such as aerial images and scenes with man-made objects in man-made surroundings, the performance of grouping by partitioning is the worst, irrespective of the cut measure.
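For reference, the three partitioning measures can be written down concretely. The sketch below uses the commonly cited formulations of the average cut and the normalized cut over a symmetric affinity matrix; the exact normalizations should be checked against the original definitions in [41], [28], and [32].

```python
import numpy as np

# The three bipartitioning measures over a weighted undirected graph given as a
# symmetric affinity matrix W; A and B index the two sides of a candidate cut.

def cut(W, A, B):
    """Total weight of links crossing the partition (the minimum cut objective)."""
    return W[np.ix_(A, B)].sum()

def average_cut(W, A, B):
    """Cut weight normalized by the sizes of the two partitions."""
    c = cut(W, A, B)
    return c / len(A) + c / len(B)

def normalized_cut(W, A, B):
    """Cut weight normalized by the total connectivity (valency) of each side."""
    c = cut(W, A, B)
    assoc_A = W[A, :].sum()  # total weight of links touching nodes in A
    assoc_B = W[B, :].sum()
    return c / assoc_A + c / assoc_B

# Toy example: two tightly connected triples joined by a single weak link.
W = np.zeros((6, 6))
W[:3, :3] = 1.0
W[3:, 3:] = 1.0
np.fill_diagonal(W, 0.0)
W[2, 3] = W[3, 2] = 0.1      # weak interobject link
A, B = [0, 1, 2], [3, 4, 5]
print(cut(W, A, B), average_cut(W, A, B), normalized_cut(W, A, B))
```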