Building a web-scale image similarity search system

Batko, Michal; Falchi, Fabrizio; Lucchese, Claudio; Novák, David; Perego, Raffaele; Rabitti, Fausto; Sedmidubský, Jan; Zezula, Pavel

doi:10.1007/s11042-009-0339-z

Cited by 91 publications

(45 citation statements)

References 14 publications


“…copying data to Hadoop Distributed File Systems), as it is only optimized for massive batches of queries. Distributed tree-based systems have also been studied for horizontal index partitioning for CBMI Aly et al [2], Batko et al [6], but the effectiveness of sub-tree based index partitioning is reduced when the dimensionality of the vectors to index increases [36], meaning that more nodes need to be queried. Effective partitioning of the search space is a key part of approximate nearest neighbour algorithms.…”

Section: Related Work (mentioning)

confidence: 99%

“…Figure 1 (a) shows a single assignment technique, where each document is assigned to a single partition (e.g. [6]). Figure 1 (b) shows a random assignment technique, where documents are assigned to a single partition randomly, and queries are assigned to all partitions.…”

Section: Space Partitioning Codebooks (mentioning)

confidence: 99%

“…[29], or based on existing partitions of single node algorithms, e.g. [6]. One of the works that goes towards our partitioning goals is by Ji et al [20].…”

Section: Introduction (mentioning)

confidence: 99%


Balanced Search Space Partitioning for Distributed Media Redundant Indexing

Mourão

Magalhães

2017

Proceedings of the 2017 ACM on International Conference on Multimedia Retrieval


This paper addresses the problem of balanced, redundant indexing of media information. Our goal is to partition and distribute the search index, taking advantage of the distributed systems properties: balanced load across nodes, redundancy on node down and efficient node usage under concurrent querying. We follow an information compression approach to solve this problem and propose to represent data with overcomplete codebooks, where each document is represented by only a few codewords and an indexing node is responsible for several codewords. Quantization algorithms are designed to fit the original data as best as possible, leading to bias towards codewords that fit the principal directions of data. In this paper, we propose the balanced KSVD (B-KSVD) algorithm, that distributes the allocation of data across a balanced number of codewords, according to the global distribution of data. Indexing experiments showed that B-KSVD can achieve 38% 1-recall by inspecting only 1% of the full index, distributed over 10 partitions. Traditional methods based on k-means need to either use larger codebooks or to inspect a larger portion of the index to achieve the same retrieval performance.


“…Luo et al [24] fused information extracted from both a Flickr data set and a set of satellite images, in order to detect events. Batko et al [25] used MPEG-7 visual features and search into a set of over 50M photos from Flickr. Seah et al [26] created visual summaries on the results of visual queries on a data set of Flickr images that in contrast to previous works, e.g., the one of [27], they attempted to generate concept-preserving summaries.…”

Section: Shall We Consider Visual Characteristics? (mentioning)

confidence: 99%

A Survey of Geo-tagged Multimedia Content Analysis within Flickr

Spyrou

Mylonas

2014

Lecture Notes in Computer Science


Abstract. Our survey paper attempts to investigate how recent and undoubted emerge in enriched, geo-tagged social networks' multimedia content sharing works to the benefit of their users and whether it could be handled in a formal way, in order to capture the meaningful semantics rising from this newly introduced user experience. It further specializes its focus by providing an overview of current state-of-the-art techniques with respect to geo-tagged content access, processing and manipulation within the popular Flickr social network. In this manner it explores the role of information retrieval, integration and extraction from the technical point of view, coupled together with human social network activities, like, for instance, localization and recommendations based on pre-processed collaborative geo-tagged photos, resulting into more efficient, optimized search results.


“…This is typically beyond the capabilities of classic exact match or keyword search techniques and thus the use of various similarity search technologies increases significantly in current applications. A considerable research effort has been invested in this topic resulting in both theoretical background [24] and large-scale practical results [17,3]. …”

mentioning

confidence: 99%

Efficiency and security in similarity cloud services

Kozák

Zezula

2013

Proc. VLDB Endow.

Self Cite


With growing popularity of cloud services, the trend in the industry is to outsource the data to a 3rd party system that provides searching in the data as a service. This approach naturally brings privacy concerns about the (potentially sensitive) data. Recently, quite extensive research of outsourcing classic exact-match or keyword search has been done. However, not much attention has been paid to the outsourcing of the similarity search, which becomes more and more important in information retrieval applications. In this work, we propose to the research community a model of outsourcing similarity search to the cloud environment (so called similarity cloud). We establish privacy and efficiency requirements to be laid down for the similarity cloud with an emphasis on practical use of the system in real applications; this requirement list can be used as a general guideline for practical system analysis and we use it to analyze current existing approaches. We propose two new similarity indexes that ensure data privacy and thus are suitable for search systems outsourced in a cloud. The balance of the first proposed technique EM-Index is more on the efficiency side while the other (DSH Index) shifts this balance more to the privacy side.

MOTIVATION

With the rapid growth of the volume and diversity of digital data produced by all kinds of commercial, scientific and leisure-time applications, the retrieval in large data sets became one of the key IT tasks nowadays. The complex data types, such as multimedia or various sensor data, introduce a natural requirement to be searched not only by their metadata but also by the content of the data itself. This is typically beyond the capabilities of classic exact match or keyword search techniques and thus the use of various similarity search technologies increases significantly in current applications. A considerable research effort has been invested in this topic resulting in both theoretical background [24] and large-scale practical results [17,3].

