To provide content-based search over visual media, including images and video, such media are typically accessed via manually or automatically assigned concepts or tags, or sometimes via image-to-image similarity, depending on the use case. While great progress has been made in recent years in automatic concept detection using machine learning, there remains a mismatch between the semantics of the concepts we can automatically detect and the semantics of the words used in, for example, a user's query. In this paper we report on a large collection of images from wearable cameras gathered as part of the Kids'Cam project, which have been both manually annotated from a vocabulary of 83 concepts and automatically annotated from a vocabulary of 1,000 concepts. This collection allows us to explore how language, in the form of two distinct concept vocabularies or spaces (one manually assigned and thus forming a ground truth), is used to represent images, in our case images taken with wearable cameras. It also allows us to discuss, in general terms, concept mismatches in visual media that derive from mismatches in language. We report the data processing we have completed on this collection and some of our initial experimentation in mapping between the two concept vocabularies.
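As an illustration of the kind of vocabulary mapping involved, the sketch below aligns a small manually assigned vocabulary with a larger automatic one using simple lexical (token-overlap) similarity between concept labels. The concept labels, threshold, and similarity measure are hypothetical illustrations and are not the Kids'Cam vocabularies or the mapping technique reported in the paper.

```python
# Minimal sketch: mapping a manually assigned concept vocabulary onto a larger
# automatically detected one using token-overlap (Jaccard) similarity between labels.
# The vocabularies and threshold below are hypothetical, not the Kids'Cam data.

def tokens(label: str) -> set[str]:
    """Lower-cased word tokens of a concept label."""
    return set(label.lower().replace("_", " ").split())

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two token sets."""
    return len(a & b) / len(a | b) if a | b else 0.0

def map_vocabularies(manual_vocab, auto_vocab, threshold=0.3):
    """For each manual concept, find the best-matching automatic concept,
    or None when no automatic concept is similar enough."""
    mapping = {}
    for m in manual_vocab:
        best = max(auto_vocab, key=lambda a: jaccard(tokens(m), tokens(a)))
        score = jaccard(tokens(m), tokens(best))
        mapping[m] = (best, score) if score >= threshold else (None, score)
    return mapping

if __name__ == "__main__":
    manual = ["dog", "school bag", "fast food"]                  # hypothetical manual concepts
    automatic = ["golden_retriever", "backpack", "cheeseburger",
                 "school_bus", "dog_sled"]                       # hypothetical automatic concepts
    for m, (a, s) in map_vocabularies(manual, automatic).items():
        print(f"{m!r:15} -> {a!r:20} (similarity {s:.2f})")
```

The "fast food" concept in this toy example finds no acceptable match, which mirrors the broader point: concepts in one vocabulary often have no direct counterpart in the other, so purely lexical alignment is rarely sufficient on its own.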