Reading notes on a survey article (Peng et al.): Cross-media analysis and reasoning: advances and directions

Survey article

Cross-media analysis and reasoning: advances and directions

Yu-xin PENG et al.

Front Inform Technol Electron Eng (Journal of Zhejiang University, English edition), 2017, 18(1):44-57

The article mainly covers seven topics:

(1) theory and model for cross-media uniform representation;

(2) cross-media correlation understanding and deep mining;

(3) cross-media knowledge graph construction and learning methodologies;

(4) cross-media knowledge evolution and reasoning;

(5) cross-media description and generation;

(6) cross-media intelligent engines;

(7) cross-media intelligent applications.

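Personally, I find the first part the most important. It goes over the key methods and models in the development of cross-media research, though only in broad strokes. Another overview article covers the specific methods, datasets, accuracy figures, and so on (I plan to read it next week). Below I summarize the key points of the first five parts as I understood them (the last two parts are mostly about research directions and significance):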

  1. theory and model for cross-media uniform representation

The authors argue that for cross-media data lying in heterogeneous spaces, two questions need attention:

  1. how to build the shared space.
  2. how to project data into it.

The article summarizes a number of models and methods:

CCA (Rasiwasia et al., 2010). It learns a commonly shared space by maximizing the correlation between pairwise co-occurring heterogeneous data and performs projection by linear functions.
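To make the CCA entry concrete for myself, here is a minimal NumPy sketch of linear CCA on synthetic paired data. It is not the implementation from Rasiwasia et al.; the function name `fit_cca`, the ridge term `reg`, and the toy "image" and "text" features are all my own assumptions for illustration. It answers both questions above in the simplest way: the shared space is spanned by the canonical directions, and each modality is projected into it by a linear map.

```python
import numpy as np

def fit_cca(X, Y, n_components=2, reg=1e-6):
    """Linear CCA via whitening and SVD. X is (n, dx), Y is (n, dy), rows are pairs.

    `reg` is a small ridge term (an assumption of this sketch) that keeps the
    covariance matrices invertible.
    """
    mean_x, mean_y = X.mean(axis=0), Y.mean(axis=0)
    Xc, Yc = X - mean_x, Y - mean_y
    n = X.shape[0]
    Cxx = Xc.T @ Xc / (n - 1) + reg * np.eye(X.shape[1])
    Cyy = Yc.T @ Yc / (n - 1) + reg * np.eye(Y.shape[1])
    Cxy = Xc.T @ Yc / (n - 1)

    def inv_sqrt(C):
        # Inverse matrix square root of a symmetric positive definite matrix.
        w, V = np.linalg.eigh(C)
        return V @ np.diag(w ** -0.5) @ V.T

    Kx, Ky = inv_sqrt(Cxx), inv_sqrt(Cyy)
    # Singular values of the whitened cross-covariance are the canonical correlations.
    U, s, Vt = np.linalg.svd(Kx @ Cxy @ Ky)
    Wx = Kx @ U[:, :n_components]      # linear projection for modality X
    Wy = Ky @ Vt.T[:, :n_components]   # linear projection for modality Y
    return mean_x, mean_y, Wx, Wy, s[:n_components]

# Synthetic paired data: both "modalities" are noisy linear views of a shared latent z.
rng = np.random.default_rng(0)
n = 500
z = rng.normal(size=(n, 2))
image_feats = z @ rng.normal(size=(2, 6)) + 0.1 * rng.normal(size=(n, 6))
text_feats = z @ rng.normal(size=(2, 4)) + 0.1 * rng.normal(size=(n, 4))

mean_x, mean_y, Wx, Wy, corrs = fit_cca(image_feats, text_feats)
img_proj = (image_feats - mean_x) @ Wx   # images in the shared space
txt_proj = (text_feats - mean_y) @ Wy    # texts in the shared space
print("canonical correlations:", corrs)
```

Once both modalities live in the same space, cross-media retrieval reduces to nearest-neighbor search between `img_proj` and `txt_proj`.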

Deep CCA (Andrew et al., 2013) extended CCA using a deep learning technique to learn the correlations more comprehensively than those using CCA and kernel CCA.

MMD (Yang et al., 2008): the multimedia document (MMD) model. Each MMD is a set of media objects of different modalities that carry the same semantics. The distances between MMDs are related to each modality.

RBF network (Daras et al., 2012): a radial basis function (RBF) network, used to address the problem of missing modalities.

Topic models:

LDA (Roller and Schulte im Walde 2013) integrated visual features into latent Dirichlet allocation (LDA) and proposed a multimodal LDA model to learn representations for textual and visual data.

M3R (Wang Y et al. 2014) the multimodal mutual topic reinforce model. It seeks to discover mutually consistent semantic topics via appropriate interactions between model factors. These schemes represent data as topic distributions, and similarities are measured by the likelihood of observed data in terms of latent topics.

PFAR (Mao et al., 2013): parallel field alignment retrieval, a manifold-based model that treats cross-media retrieval as a manifold alignment problem using parallel fields.

Deep learning:

Autoencoder model (Ngiam et al., 2011): learns uniform representations for speech audio coupled with videos of the lip movements.

Deep restricted Boltzmann machine (Srivastava and Salakhutdinov, 2012): learns joint representations for multimodal data.

Deep CCA (Andrew et al. 2013) a deep extension of the traditional CCA method.

DT-RNNs (Socher et al., 2014): dependency tree recursive neural networks, which use dependency trees to embed sentences into a vector space in order to retrieve the images described by those sentences.

Autoencoders: Feng et al. (2014) and Wang W et al. (2014) applied autoencoders to cross-modality retrieval.

Multimodal deep learning scheme (Wang et al., 2015): learns accurate and compact multimodal representations for multimodal data. This method facilitates efficient similarity search and other related applications on multimodal data.

ICMAE (Zhang et al., 2014a): an attribute discovery approach, named the independent component multimodal autoencoder (ICMAE), which can learn a shared high-level representation to identify attributes from a set of image and text pairs. Zhang et al. (2016) further proposed to learn image-text uniform representation from web social multimedia content, which is noisy, sparse, and diverse under weak supervision.

Deep-SM (Wei et al., 2017): a deep semantic matching (deep-SM) method that uses a convolutional neural network and a fully connected network to map images and texts into their label vectors, achieving state-of-the-art accuracy.

CMDN (Peng et al., 2016a): the cross-media multiple deep network (CMDN) is a hierarchical structure with multiple deep networks, and can simultaneously preserve intra-media and inter-media information to further improve the retrieval accuracy.
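A rough sketch of the semantic matching idea behind Deep-SM, as I understand it from the one-sentence description: once images and texts are both mapped to label vectors, they can be compared directly in that label space. The networks themselves are not implemented here. The logits below are made-up stand-ins for network outputs, and ranking by cosine similarity is my own assumption, since the survey does not say which similarity measure is used.

```python
import numpy as np

def softmax(scores):
    """Row-wise softmax turning raw scores into label-probability vectors."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=1, keepdims=True)

def rank_by_label_similarity(query_probs, gallery_probs):
    """Return gallery indices sorted from most to least similar (cosine similarity)."""
    q = query_probs / np.linalg.norm(query_probs)
    g = gallery_probs / np.linalg.norm(gallery_probs, axis=1, keepdims=True)
    return np.argsort(-(g @ q))

# Hypothetical network outputs over three labels: (cat, dog, car).
image_logits = np.array([[4.0, 1.0, 0.0]])   # one query image, mostly "cat"
text_logits = np.array([[0.0, 0.5, 3.0],     # text 0: mostly "car"
                        [3.5, 0.5, 0.0],     # text 1: mostly "cat"
                        [0.5, 3.0, 0.0]])    # text 2: mostly "dog"

image_probs = softmax(image_logits)
text_probs = softmax(text_logits)

# Image-to-text retrieval: rank all texts for the single query image.
ranking = rank_by_label_similarity(image_probs[0], text_probs)
print(ranking)  # text 1 (the "cat" text) comes first
```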

I looked up Deep-SM (Wei et al., 2017), mentioned in this part. It comes from the paper "Cross-Modal Retrieval With CNN Visual Features: A New Baseline", which I plan to find time to read next.

  2. cross-media correlation understanding and deep mining;

Basically, existing studies construct correlation learning on cross-media data with representation learning, metric learning, and matrix factorization, which are usually performed in a batch learning fashion and can capture only the first-order correlations among data objects. How to develop more effective learning mechanisms to capture the high-order correlations and adapt to the evolution that naturally exists among heterogeneous entities and heterogeneous relations is the key research issue for future studies in cross-media correlation understanding.

  3. cross-media knowledge graph construction and learning methodologies;

Example application of knowledge graphs: The Knowledge Graph released by Google in 2012 (Singhal, 2012) provided a next-generation information retrieval service with ontology-based intelligent search based on free-style user queries. Similar techniques, e.g., Safari, were developed based on achievements in entity-centric search (Lin et al., 2012).

  4. cross-media knowledge evolution and reasoning;

Reinforcement learning and transfer learning can be helpful for constructing more complex intelligent reasoning systems (Lazaric, 2012). Furthermore, lifelong learning (Lazer et al., 2014) is the key capability of advanced intelligence systems.

Example application: Google DeepMind has constructed a machine intelligence system based on a reinforcement learning algorithm (Gibney, 2015). AlphaGo, developed by Google DeepMind, was the first computer Go program to beat a top professional human Go player. It even beat the world champion Lee Sedol in a five-game match.

Visual question answering (VQA) can be regarded as a good example of cross-media reasoning (Antol et al., 2015). VQA aims to provide natural language answers for questions given in the form of a combination of an image and natural language.

  5. cross-media description and generation;

Existing studies on visual content description can be divided into three groups.

1. The first group, based on language generation, first understands images in terms of objects, attributes, scene types, and their correlations, and then connects these semantic understanding outputs to generate a sentence description using natural language generation techniques.

2. The second group covers retrieval-based methods, retrieving content that is similar to a query and transferring the descriptions of the similar set to the query.

3. The third group is based on deep neural networks, employing the CNN-RNN codec framework, where the convolutional neural network (CNN) is used to extract features from images, and the recursive neural network (RNN) (Socher et al., 2011) or its variant, the long short-term memory network (LSTM) (Hochreiter and Schmidhuber, 1997), is used to encode and decode language models.
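Of the three groups, the retrieval-based one is the easiest to sketch. Below is a toy version of "retrieve similar content and transfer its descriptions to the query", assuming image features have already been extracted. The feature vectors, captions, the function name `transfer_description`, and the choice of Euclidean distance are all hypothetical and not taken from the survey.

```python
import numpy as np

def transfer_description(query_feat, gallery_feats, gallery_captions, k=1):
    """Return the captions of the k gallery items closest to the query (Euclidean)."""
    dists = np.linalg.norm(gallery_feats - query_feat, axis=1)
    nearest = np.argsort(dists)[:k]
    return [gallery_captions[i] for i in nearest]

# Made-up 2-D features for three already-described images.
gallery_feats = np.array([[0.9, 0.1],
                          [0.1, 0.9],
                          [0.5, 0.5]])
gallery_captions = ["a dog running on grass",
                    "a boat on a lake",
                    "a dog beside a lake"]

# An undescribed query image whose feature is close to the first gallery item.
query_feat = np.array([0.85, 0.2])
borrowed = transfer_description(query_feat, gallery_feats, gallery_captions, k=2)
print(borrowed)  # → ['a dog running on grass', 'a dog beside a lake']
```

A real system would still need to merge or re-rank the borrowed captions. This sketch stops at the retrieval step.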

Time: 2024-08-10 21:27:46
