2013 IEEE International Conference on Computer Vision
DOI: 10.1109/iccv.2013.369
Segmentation Driven Object Detection with Fisher Vectors

Abstract: We present an object detection system based on the Fisher vector (FV) image representation computed over SIFT and color descriptors. For computational and storage efficiency, we use a recent segmentation-based method to generate class-independent object detection hypotheses, in combination with data compression techniques. Our main contribution is a method to produce tentative object segmentation masks to suppress background clutter in the features. Re-weighting the local image features based on these masks is…
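The abstract's core idea is to re-weight local image features by a tentative soft segmentation mask before pooling, so that background clutter contributes less to the image representation. A minimal illustrative sketch, assuming per-descriptor pixel locations and a soft mask in [0, 1] (the function name and the use of simple weighted average pooling are my simplification, not the paper's actual Fisher vector pipeline):

```python
import numpy as np

def mask_weighted_pool(descriptors, positions, mask):
    """Weighted average-pool local descriptors by a soft foreground mask.

    descriptors: (N, D) array of local features (e.g. SIFT)
    positions:   (N, 2) integer (row, col) location of each descriptor
    mask:        (H, W) soft segmentation mask with values in [0, 1]
    """
    # Look up each descriptor's mask value: background locations get weight ~0.
    w = mask[positions[:, 0], positions[:, 1]]
    # Normalize so the weights sum to 1 (epsilon guards an all-zero mask).
    w = w / (w.sum() + 1e-8)
    # Weighted pooling: foreground descriptors dominate the result.
    return (w[:, None] * descriptors).sum(axis=0)
```

In the paper the same per-descriptor weights would modulate the contribution of each local feature to the Fisher vector statistics rather than a plain average; the mask lookup and normalization step are the part this sketch illustrates.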



Cited by 102 publications (90 citation statements)
References 31 publications
“…Here, spatiotemporal oriented features serve as primitives, owing to their strong performance in previous evaluations of dynamic scene recognition features. Recently, the Fisher vector representation [19] has shown state-of-the-art results for a variety of visual tasks (e.g., [5,14,18,23]); in contrast, here it is found that locality-constrained linear coding [27] performs particularly well for dynamic scenes.…”
Section: V(x, y, t)
confidence: 57%
“…As in the original publications, pooling is performed by taking the average (VQ) or maximum (LLC) of the encoded features. For the proposed dynamic pooling, let V_w and V_h denote the width and height of the spacetime volume in the filtering process (5); then the integration region is set to R_x = V_w/4, R_y = V_h/4, and R_t is set to the temporal support of the largest filter used in (5). A 25 × K = 5000-dimensional feature vector is generated by the dynamic spacetime pyramid, which also uses a hierarchical 3-level pyramid with the finest grid size of 4 × 4 for embedding geometry in 20 of the 25 channels.…”
Section: Implementation Summary
confidence: 99%
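The integration-region rule described in the quotation above (spatial extents a quarter of the spacetime volume's width and height, temporal extent equal to the largest filter's support) can be sketched as a tiny helper. The function name and signature are hypothetical; V_w, V_h, and the filter support are the quantities the quotation names:

```python
def integration_region(V_w, V_h, largest_filter_support):
    """Integration-region sizes for the dynamic pooling described above:
    spatial extents are 1/4 of the spacetime volume's width and height,
    and the temporal extent matches the largest filter's support."""
    R_x = V_w / 4
    R_y = V_h / 4
    R_t = largest_filter_support
    return R_x, R_y, R_t
```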
“…Analogously, traditional supervised methods for learning models of object classes from still images (Cootes et al. 1998; Felzenszwalb and Huttenlocher 2005; Bourdev and Malik 2009; Felzenszwalb et al. 2010; Girshick et al. 2014) do not easily transfer to videos, as they require expensive location annotations. The alignments recovered by our method could potentially replace the manual correspondences needed by most popular methods for learning object classes (Dalal and Triggs 2005; Felzenszwalb et al. 2010; Viola et al. 2005; Cinbis et al. 2013; Girshick et al. 2014), including those requiring part-level annotations (Felzenszwalb and Huttenlocher 2005; Bourdev and Malik 2009; Azizpour and Laptev 2012). They can also enable annotating large collections with little manual effort via knowledge transfer (Vezhnevets and Ferrari 2014; Kuettel et al. 2012; Lampert et al. 2009; Fei-Fei et al. 2007; Malisiewicz et al. 2011).…”
Section: Introduction
confidence: 91%
“…Best results are achieved with strongly supervised training [5,9,15,23], where object locations have to be annotated with bounding boxes. However, the annotation process is difficult, time-consuming and error-prone, especially when the objects are small.…”
Section: Introduction
confidence: 99%