Kernelized Locality-Sensitive Hashing Page

 

Brian Kulis (1) and Kristen Grauman (2)
(1) UC Berkeley EECS and ICSI, Berkeley, CA
(2) University of Texas, Department of Computer Sciences, Austin, TX


Introduction


Fast indexing and search for large databases is critical to content-based image and video retrieval, particularly given the ever-increasing availability of visual data in a variety of domains, such as scientific image data, community photo collections on the Web, news photo collections, or surveillance archives. The most basic but essential task in image search is the "nearest neighbor" problem: take a query image and accurately find the examples that are most similar to it within a large database. A naive solution entails searching over all n database items and sorting them according to their similarity to the query, but this becomes prohibitively expensive when n is large or when the individual similarity evaluations are themselves expensive to compute. For vision applications, this complexity is amplified by the fact that the most effective representations are often high-dimensional or structured, and the best-known distance functions can require considerable computation to compare even a single pair of objects.

To make large-scale search practical, vision researchers have recently explored approximate similarity search techniques, most notably locality-sensitive hashing (Indyk and Motwani 1998, Charikar 2002), where a predictable loss in accuracy is sacrificed in order to allow fast queries even for high-dimensional inputs. In spite of hashing's success for visual similarity search tasks, existing techniques have some important restrictions. Current methods generally assume that the data to be hashed comes from a multidimensional vector space, and require that the underlying embedding of the data be explicitly known and computable. For example, LSH relies on random projections of the input vectors; spectral hashing (Weiss et al., NIPS 2008) assumes vectors with a known probability distribution.

This is a problematic limitation, given that many recent successful vision results employ kernel functions for which the underlying embedding is known only implicitly (i.e., only the kernel function is computable). It is thus far impossible to apply LSH and its variants to search data with a number of powerful kernels, including many kernels designed specifically for image comparisons, as well as some basic, widely used functions such as the Gaussian RBF. Further, since visual representations are often most naturally encoded with structured inputs (e.g., sets, graphs, trees), the lack of fast search methods with performance guarantees for flexible kernels is inconvenient.

In this work, we present an LSH-based technique for performing fast similarity searches over arbitrary kernel functions. The problem is as follows: given a kernel function and a database of n objects, how can we quickly find the item most similar to a query object in terms of the kernel function? Like standard LSH, our hash functions involve computing random projections; however, unlike standard LSH, these random projections are constructed using only the kernel function and a sparse set of examples from the database itself. Our main technical contribution is to formulate the random projections necessary for LSH in kernel space. Our construction relies on an appropriate use of the central limit theorem, which allows us to approximate a random vector using items from our database. The resulting scheme, which we call kernelized LSH (KLSH), generalizes LSH to scenarios in which the feature space embeddings are either unknown or incomputable.

Method


The main idea behind our approach is to construct a random hyperplane hash function, as in standard LSH, but to perform all computations purely in kernel space. The construction is based on the central limit theorem, which lets us compute an approximate random vector using items from the database. The central limit theorem states that, under very mild conditions, the mean of a set of objects drawn from some underlying distribution will be Gaussian distributed in the limit as more objects are included in the set. Since LSH requires a random vector from a particular Gaussian distribution, namely a zero-mean, identity-covariance Gaussian, we can use the central limit theorem, along with an appropriate mean-shift and whitening, to form an approximate random vector from a zero-mean, identity-covariance Gaussian. By performing this construction appropriately, the algorithm can be applied entirely in kernel space, and can also be applied efficiently over very large data sets.
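To make the construction concrete, the following is a minimal NumPy sketch of building KLSH hash functions from an arbitrary kernel: sample p database items, whiten with the inverse square root of their centered kernel matrix, and average a random subset of t items per bit. The function names, the default values of the number of bits and t, and the simplified handling of query centering are illustrative assumptions; this is a sketch of the idea, not the released MATLAB implementation.

    import numpy as np

    def klsh_hash_functions(kernel, sample, num_bits=32, t=30, rng=None):
        """Build KLSH hash directions from a kernel and a sample of p database items.

        kernel(A, B) must return the Gram matrix between the items in A and B.
        Each hash bit averages a random subset of t sampled items; by the central
        limit theorem this approximates a Gaussian random direction, and the
        K^{-1/2} factor whitens it toward identity covariance.
        """
        rng = np.random.default_rng() if rng is None else rng
        p = len(sample)
        t = min(t, p)

        # Centered kernel matrix over the sample (simulates zero-mean data).
        K = kernel(sample, sample)
        H = np.eye(p) - np.ones((p, p)) / p
        Kc = H @ K @ H

        # K^{-1/2} via eigendecomposition; tiny or negative eigenvalues are dropped.
        evals, evecs = np.linalg.eigh(Kc)
        inv_sqrt = np.where(evals > 1e-8, 1.0 / np.sqrt(np.clip(evals, 1e-8, None)), 0.0)
        K_inv_sqrt = evecs @ np.diag(inv_sqrt) @ evecs.T

        W = np.zeros((num_bits, p))
        for b in range(num_bits):
            subset = rng.choice(p, size=t, replace=False)
            e_s = np.zeros(p)
            e_s[subset] = 1.0 / t
            W[b] = K_inv_sqrt @ e_s        # weights over the sampled items
        return W

    def klsh_hash(kernel, sample, W, items):
        """Hash arbitrary items: sign of the kernelized projection onto each direction."""
        K_query = kernel(items, sample)                # n x p kernel values
        return (K_query @ W.T > 0).astype(np.uint8)    # n x num_bits binary codes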

Once we have computed the hash functions, we use standard LSH techniques to retrieve nearest neighbors of a query from the database in sublinear time. In particular, we employ the method of Charikar to obtain a small set of candidate approximate nearest neighbors, and these candidates are then sorted using the kernel function to yield the final list of hashed nearest neighbors.
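Continuing the sketch above, the query stage could look like the following. A brute-force Hamming scan over the stored binary codes stands in here for Charikar's sublinear search, and the shortlist size is an arbitrary illustrative value; only the shortlist is ranked with the (possibly expensive) kernel.

    import numpy as np

    def klsh_query(kernel, sample, W, database, db_codes, query, k=10, num_candidates=300):
        """Approximate k-NN: Hamming search over the codes, then exact kernel ranking.

        db_codes are precomputed with klsh_hash(kernel, sample, W, database).
        """
        q_code = klsh_hash(kernel, sample, W, [query])[0]
        hamming = np.count_nonzero(db_codes != q_code, axis=1)
        shortlist = np.argsort(hamming)[:num_candidates]            # small candidate set
        sims = kernel([query], [database[i] for i in shortlist])[0]
        return shortlist[np.argsort(-sims)][:k]                      # rank by kernel similarity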

There are some limitations to the method. The random vector constructed by the KLSH routine is only approximately random; general bounds on the central limit theorem are unknown, so it is not clear how many database objects are required to get a sufficiently random vector for hashing. Further, we implicitly assume that the objects from the database selected to form the random vectors span the subspace from which the queries are drawn. That said, in practice the method is robust to the number of database objects chosen for the construction of the random vectors, and behaves comparably to standard LSH on non-kernelized data.

Experimental Results


80 Million Tiny Images. We ran KLSH over the 80 million images in the Tiny Images data set. We used the Gist features extracted from these images, and applied nearest neighbor search on top of a Gaussian kernel.

The top left image in each set is the query. The remainder of the top row shows the top nearest neighbor using a linear scan (with the Gaussian kernel), and the second row shows the nearest neighbor using KLSH. Note that, with this data set, the hashing technique searched less than 1 percent of the database, and nearest neighbors were extracted in approximately 0.57 seconds (versus 45 seconds for a linear scan). Typically the hashing results appear qualitatively similar to (or match exactly) the linear scan results.

We can see quantitatively how the results of the nearest neighbors extracted from KLSH compare to the linear scan nearest neighbors in the above plot. It shows, for 10, 20, and 30 hashing nearest neighbors, how many linear scan nearest neighbors are required to cover the hashing nearest neighbors.
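For reference, a Gaussian (RBF) kernel of the kind used in this experiment can be plugged directly into the sketches above. The bandwidth sigma and the usage values below are illustrative defaults and hypothetical names, not the settings used for the Tiny Images experiment.

    import numpy as np

    def gaussian_kernel(X, Y, sigma=1.0):
        """k(x, y) = exp(-||x - y||^2 / (2 * sigma^2)) between all rows of X and Y."""
        X = np.atleast_2d(np.asarray(X, dtype=float))
        Y = np.atleast_2d(np.asarray(Y, dtype=float))
        sq_dists = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(axis=-1)
        return np.exp(-sq_dists / (2.0 * sigma ** 2))

    # Hypothetical usage: W = klsh_hash_functions(gaussian_kernel, gist_sample, num_bits=128, t=30)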

Flickr Scene Recognition. We performed a similar experiment with a set of Flickr images containing tourist photos from a set of landmarks. Here, we applied a chi-squared kernel on top of SIFT features for the nearest neighbor search. Note that these results did not appear in the conference paper.

We can also measure how the accuracy of a k-nearest neighbor classifier with KLSH approaches the accuracy of a linear scan k-NN classifier on this data set. The above plot shows that, as epsilon decreases, the hashing accuracy approaches the linear scan accuracy.
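For reference, one commonly used exponentiated form of the chi-squared kernel for nonnegative histogram features (such as bag-of-words SIFT histograms) is sketched below. Whether the experiment used this exact form, and the value of gamma, are not specified here, so treat both as assumptions.

    import numpy as np

    def chi_squared_kernel(X, Y, gamma=1.0, eps=1e-10):
        """k(x, y) = exp(-gamma * sum_i (x_i - y_i)^2 / (x_i + y_i)) over rows of X and Y."""
        X = np.atleast_2d(np.asarray(X, dtype=float))
        Y = np.atleast_2d(np.asarray(Y, dtype=float))
        diff = X[:, None, :] - Y[None, :, :]
        total = X[:, None, :] + Y[None, :, :] + eps     # eps avoids division by zero
        chi2 = ((diff ** 2) / total).sum(axis=-1)
        return np.exp(-gamma * chi2)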

Object Recognition on Caltech-101. We applied our method to the Caltech-101 data set for object recognition, since several recently proposed kernel functions for images have shown very good performance for object recognition but have unknown or very complex feature embeddings. This data set also allowed us to test how changes in parameters affect hashing results.

The parameters p, t, and the number of hash bits only affect hashing accuracy marginally. The main parameter of interest is epsilon, a parameter from standard LSH which trades off speed for accuracy.

Local Patch Indexing with the Photo Tourism Data Set. Finally, we applied KLSH to 100,000 image patches from the Photo Tourism data set. We compared a standard Euclidean distance function (linear scan and hashing) with a learned kernel (linear scan and hashing). The particular learned kernel we used has no simple, explicit feature embedding (see the paper for details), but its linear scan retrieval results are significantly better than the baseline Euclidean distance, providing another example where KLSH is useful for retrieval. The results indicate that the hashing schemes do not degrade retrieval performance considerably on this data set.

Summary. We have shown that hashing can be performed over arbitrary kernels to yield significant speed-ups for similarity searches with little loss in accuracy. In experiments, we have applied KLSH over several kernels, and over several domains:

  • Gaussian kernel (Tiny Images)
  • Chi-squared kernel (Flickr)
  • Correspondence kernel (Caltech-101)
  • Learned kernel (Photo Tourism)

Code


The code is available here. NOTE: the code was updated July 5, 2010 and September 23, 2010 to correct bugs in createHashTable.m. Please use the most recent version.

Paper


  • Kernelized Locality-Sensitive Hashing for Scalable Image Search
    Brian Kulis & Kristen Grauman
    In Proc. 12th International Conference on Computer Vision (ICCV), 2009.
    [pdf]

    Also see the following related papers, which apply LSH to learned Mahalanobis metrics:

  • Fast Similarity Search for Learned Metrics
    Brian Kulis, Prateek Jain, & Kristen Grauman
    IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 31, no. 12, pp. 2143--2157, 2009.
    [pdf]
  • Fast Image Search for Learned Metrics
    Prateek Jain, Brian Kulis, & Kristen Grauman
    In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2008.
    [pdf]

Source: http://web.cse.ohio-state.edu/~kulis/klsh/klsh.htm