Content-oriented multimedia document understanding through cross-media correlation

Lü, Tong; Jin, Yukang; Su, Feng; Shivakumara, Palaiahnakote; Tan, Chew Lim

doi:10.1007/s11042-014-2044-9

Cited by 11 publications

(6 citation statements)

References 39 publications


“…A more systematical and detailed introduction to the discussed techniques may be found in the references for image processing and machine vision in [48][49][50][51]. …”

Section: Discussion (mentioning)

confidence: 99%

“…Essentially, multimodal video data originated from the same source tend to be correlated [1,2,48]. It means that different modalities can take a complementary role on solving video content analysis tasks, and the presence of one modality can help understand certain semantics of others.…”

Section: Discussion (mentioning)

confidence: 99%

“…To evaluate scene digit recognition algorithms, Netzer et al [46] build the Street View House Numbers (SVHN) dataset to detect and read house-number signs in street view scenes. The SVHN dataset is obtained from a large number of street view scene images [47] using a combination of automated algorithms and the Amazon Mechanical Turk framework [48], which is used to localize and transcribe the single digits. A very large set of images from urban areas in various countries are downloaded.…”

Section: Scene Text Datasets (mentioning)

confidence: 99%


Video Text Detection

Palaiahnakote, Tan, et al. 2014

Advances in Computer Vision and Pattern Recognition

Self Cite


“…For example, the latent semantic analysis derived from language processing is proposed as an interesting solution to learn overlapped audio events [47]. Lu et al [22] propose a multimodal correlation network, in which audio-to-audio retrievals can be improved by incorporating visual image information. However, this area requires more studying to apply the techniques efficiently into auditory scene understanding.…”

Section: Related Work (mentioning)

confidence: 99%

“…In this paper, we propose a novel audio event recognition framework for acoustic scene understanding based on our previous work on sound classification [3,19], audio summarization [20,21] and audio-visual correlation [4,22]. The term auditory scene here refers to the acoustic modeling of a specific location or site such as home, bus station, restaurant and shopping mall, which is similar to what an image of the same location provides visually.…”

mentioning

confidence: 99%

Context-based environmental audio event recognition for scene understanding

Lü, Wang 2014

Multimedia Systems

Self Cite


To the best of our knowledge, this is the first work that models event correlations as scene context for robust audio event detection in complex and noisy environments. According to a recent report, the mean accuracy of human listeners on the acoustic scene classification task is only around 71% on data collected in office environments from the DCASE dataset. None of the existing methods performs well on all scene categories, and the average of the best accuracies of 11 recent methods is 53.8%. The proposed method achieves an average accuracy of 62.3% on the same dataset. Additionally, we create a 10-CASE dataset by manually collecting 5,250 audio clips covering 10 scene types and 21 event categories. Our experimental results on 10-CASE show that the proposed method achieves an improved average accuracy of 78.3%, and that the average accuracy of audio event recognition can be effectively improved by capturing dominant audio sources and inferring non-dominant events from the dominant ones through acoustic context modeling. In future work, exploring the interactions between acoustic scene recognition and audio event detection, and incorporating other modalities to improve accuracy, will be needed to further advance the proposed framework.


Ultrasound-elastic-image-assisted diagnosis of pulmonary nodules based on genetic algorithm

Dong, Hua, et al. 2020

Neural Comput & Applic
