Multimedia data are semi-structured or unstructured information elements whose primary content is used, individually or collectively, for communication. Multimedia data mining identifies, organizes, and retrieves relevant features from a collection of media in order to discover informative patterns and relationships for knowledge acquisition. Computer Vision (CV)-based systems have become increasingly popular in recent years, owing to the growing number and complexity of datasets. In CV, finding meaningful images in a large dataset is a difficult task. Traditional search engines retrieve images based on text such as captions and metadata, but this strategy can return many irrelevant results, not to mention the time, effort, and money required to tag images with this textual data. In this paper, we propose a pipelined, deep-learning-oriented framework for multimedia web data mining that takes as input feature maps extracted from image content in planar projection. Color, texture, shape, and other high-level properties of images are represented as numerical feature vectors. The technique builds on the following general computer vision tasks: image segmentation, image classification, and object detection, among others. To demonstrate its computational efficiency and validate its statistical behaviour, we also present an experimental evaluation on a standard multimedia dataset. The results are then compared with several significant existing approaches in terms of various statistical measures.

Summary: A deep learning method for multimedia mining based on image content features is presented. It is applied to various computer vision tasks, such as segmentation, classification, and object detection. It was tested on a standard multimedia dataset.
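As a rough illustration of representing image content as a numerical feature vector and retrieving by similarity, the sketch below uses a hand-crafted color histogram and cosine similarity. This is a stand-in chosen for brevity, not the deep-learning feature maps of the proposed framework; the function names (`color_histogram`, `cosine_similarity`, `rank_by_similarity`), the bin count, and the toy images are all hypothetical.

```python
import math

def color_histogram(pixels, bins=4):
    """Turn a list of (r, g, b) pixels into a normalized joint color histogram.

    Each channel (0-255) is quantized into `bins` levels, giving a feature
    vector of length bins**3 whose entries sum to 1.
    """
    hist = [0.0] * (bins ** 3)
    for r, g, b in pixels:
        idx = (r * bins // 256) * bins * bins + (g * bins // 256) * bins + (b * bins // 256)
        hist[idx] += 1.0
    total = sum(hist) or 1.0  # avoid division by zero for an empty image
    return [h / total for h in hist]

def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def rank_by_similarity(query_pixels, collection):
    """Rank images in `collection` (name -> pixel list) by similarity to the query."""
    q = color_histogram(query_pixels)
    scores = [(name, cosine_similarity(q, color_histogram(px)))
              for name, px in collection.items()]
    return sorted(scores, key=lambda s: s[1], reverse=True)

# Toy "images": flat pixel lists standing in for real image data (assumption).
reddish = [(250, 10, 10)] * 90 + [(10, 10, 250)] * 10
bluish = [(10, 10, 250)] * 100
query = [(240, 20, 20)] * 100

ranking = rank_by_similarity(query, {"reddish": reddish, "bluish": bluish})
print(ranking[0][0])  # name of the image most similar to the query
```

Retrieval here depends only on pixel content, so no captions or metadata tags are needed. In a pipeline of the kind the abstract describes, the histogram would presumably be replaced by feature vectors produced by a trained network, while the similarity ranking step could stay the same.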