Learning to Hash with its Application to Big Data Retrieval and Mining

http://cs.nju.edu.cn/lwj/conf/CIKM14Hash.htm

Overview

Nearest neighbor (NN) search plays a fundamental role in machine learning and related areas such as information retrieval and data mining. Hence, there has been increasing interest in NN search over massive (large-scale) data sets in this big data era. In many real applications, it is not necessary for an algorithm to return the exact nearest neighbors for every possible query. Consequently, approximate nearest neighbor (ANN) search algorithms, which offer improved speed and memory savings, have received more and more attention from researchers in recent years.

Due to its low storage cost and fast query speed, hashing has been widely adopted for ANN search in large-scale datasets. The essential idea of hashing is to map the data points from the original feature space into binary codes in the hashcode space while preserving the similarities between pairs of data points. The advantage of the binary-code representation over the original feature-vector representation is twofold. Firstly, each dimension of a binary code can be stored using only one bit, while several bytes are typically required for one dimension of the original feature vector, leading to a dramatic reduction in storage cost. Secondly, with binary codes, all the data points within a specific Hamming distance of a given query can be retrieved in constant or sub-linear time, regardless of the total size of the dataset. Hence, hashing has become one of the most effective methods for big data retrieval and mining.

To obtain effective hash codes, most methods adopt machine learning techniques to learn the hashing functions. Hence, learning to hash, which aims to design effective machine learning methods for hashing, has recently become a very hot research topic with wide applications in many big data areas. This tutorial will provide a systematic introduction to learning to hash, including the motivation, models, learning algorithms, and applications. Firstly, we will introduce the challenges faced when performing retrieval and mining on big data, which motivate the adoption of hashing. Secondly, we will give comprehensive coverage of the foundations and recent developments of learning to hash, including unsupervised hashing, supervised hashing, multimodal hashing, etc. Thirdly, quantization methods, which are used in many hashing methods to turn real values into binary codes, will be presented. Fourthly, a large variety of applications of hashing will also be introduced, including image retrieval, cross-modal retrieval, recommender systems, and so on.

References

[1] Peichao Zhang, Wei Zhang, Wu-Jun Li, Minyi Guo. Supervised Hashing with Latent Factor Models. To Appear in Proceedings of the 37th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2014.

[2] Dongqing Zhang, Wu-Jun Li. Large-Scale Supervised Multimodal Hashing with Semantic Correlation Maximization. To Appear in Proceedings of the Twenty-Eighth AAAI Conference on Artificial Intelligence (AAAI), 2014.

[3] Ling Yan, Wu-Jun Li, Gui-Rong Xue, Dingyi Han. Coupled Group Lasso for Web-Scale CTR Prediction in Display Advertising. Proceedings of the 31st International Conference on Machine Learning (ICML), 2014.

[4] Weihao Kong, Wu-Jun Li. Isotropic Hashing. Proceedings of the 26th Annual Conference on Neural Information Processing Systems (NIPS), 2012.

[5] Weihao Kong, Wu-Jun Li, Minyi Guo. Manhattan Hashing for Large-Scale Image Retrieval. Proceedings of the 35th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2012.

[6] Weihao Kong, Wu-Jun Li. Double-Bit Quantization for Hashing. Proceedings of the Twenty-Sixth AAAI Conference on Artificial Intelligence (AAAI), 2012.

Slides & Outline

TBD (To Be Determined)

Presenter

  Wu-Jun Li

Dr. Wu-Jun Li is currently an associate professor of the Department of Computer Science and Technology at Nanjing University, P. R. China. From 2010 to 2013, he was a faculty member of the Department of Computer Science and Engineering at Shanghai Jiao Tong University, P. R. China. He received his PhD degree from the Department of Computer Science and Engineering at Hong Kong University of Science and Technology in 2010. Before that, he received his M.Eng. degree and B.Sc. degree from the Department of Computer Science and Technology, Nanjing University, in 2006 and 2003, respectively. His main research interests include machine learning and pattern recognition, especially statistical relational learning and big data machine learning (big learning). In these areas he has published more than 30 peer-reviewed papers, most of them in prestigious journals such as TKDE and at top conferences such as AAAI, AISTATS, CVPR, ICML, IJCAI, NIPS, and SIGIR. He has served as a PC member of ICML'14, IJCAI'13/'11, NIPS'14, SDM'14, UAI'14, etc.
