The State of the Art in Internet-Oriented Text Information Processing, Speech, and Music Search Technologies [notes made while collecting material; not thoroughly organized]


  • Speech recognition:

Key Words:

Distributed Speech Recognition (DSR: the recognition back end of an embedded speech recognition system is hosted on a server; "distributed" refers not to distributed servers, but to the distributed relationship between the terminal and the server [8])

Network Speech Recognition (NSR: the emphasis is on the network; the terminal transmits the speech signal efficiently in real time, and the server does the processing [9]). Nowadays, terminal speech signals are generally processed by a server or the cloud.

Emotion Speech Recognition (ESR), Spoken Information Retrieval, Speech Recognition, Spoken Term Detection, Speaker Recognition, Voice Control, Language Modeling, Speech Signal Processing / Speech Processing, Speech Enhancement, Robust Speech Recognition, Feature Compensation, Model Compensation, Automatic Speech Recognition (ASR), Speech Separation, Signal Analysis, Acoustic Speech Recognition Systems, Voice Activity Detection (VAD)

A Survey of Speech Recognition Technology [1]:

A speech recognition system comprises: an acoustic model of speech (training/learning) and pattern matching (recognition algorithms) | a language model and language processing

Acoustic models: dynamic time warping (DTW), hidden Markov models (HMM), artificial neural networks (ANN)

Language models: rule-based models, statistical models
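Of the acoustic-model approaches listed above, DTW is the easiest to illustrate. The sketch below aligns two 1-D feature sequences of different lengths and returns their cumulative alignment cost; it is a toy under the assumption of scalar features, whereas real ASR front ends use multi-dimensional cepstral vectors.

```python
# Minimal dynamic time warping (DTW) sketch: aligns two feature
# sequences of possibly different lengths and returns the cumulative
# alignment cost. Toy scalar features for illustration only.

def dtw_distance(a, b):
    n, m = len(a), len(b)
    INF = float("inf")
    # cost[i][j] = best cumulative cost aligning a[:i] with b[:j]
    cost = [[INF] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])              # local distance
            cost[i][j] = d + min(cost[i - 1][j],      # step in a only
                                 cost[i][j - 1],      # step in b only
                                 cost[i - 1][j - 1])  # step in both
    return cost[n][m]
```

A time-stretched version of the same sequence (the situation DTW is designed for) still aligns with zero cost.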

The main difficulties in current research are: (1) Poor adaptability of speech recognition systems, chiefly a strong dependence on the acoustic environment. (2) Recognition in high-noise environments is hard to advance, because human pronunciation changes greatly under noise (the voice rises, speaking rate slows, and pitch and formants shift), so new methods of signal analysis and processing must be found. (3) How to quantify and model knowledge from linguistics, physiology, and psychology and apply it effectively to speech recognition is also an open difficulty. (4) Our understanding of human auditory perception, of knowledge accumulation and learning mechanisms, and of the control mechanisms of the brain and nervous system is still very limited, which is bound to hinder further progress in speech recognition.

Current research hotspots in speech recognition include: robust speech recognition, speech input devices, refinement of acoustic HMMs, speaker adaptation techniques, large-vocabulary keyword recognition, efficient recognition (search) algorithms, confidence evaluation algorithms, applications of ANNs, language modeling, and deeper natural language understanding.

Speaker Adaptation (SA); Speaker-Independent (SI); Speaker-Dependent (SD) 『SA + SI』

Adaptation: batch, online, instantaneous | supervised or unsupervised

An Overview of Noise-Robust Automatic Speech Recognition[2]:

Historically, ASR applications have included voice dialing, call routing, interactive voice response, data entry and dictation, voice command and control, structured document creation (e.g., medical and legal transcriptions), appliance control by voice, computer-aided language learning, content-based spoken audio search, and robotics.

More recently, with the exponential growth of big data and computing power, ASR technology has advanced to the stage where more challenging applications are becoming a reality. Examples are voice search and interactions with mobile devices (e.g., Siri on iPhone, Bing voice search on winPhone, and Google Now on Android), voice control in home entertainment systems (e.g., Kinect on xBox), and various speech-centric information processing applications capitalizing on downstream processing of ASR outputs.


  • Music Search:

Key Words:

Speech Transcription, Multimedia Information Retrieval, Music Search, Search Engine, Mobile Internet, Music Retrieval, Audio Information Retrieval, Audio Mining, Adaptive Music Retrieval, Music Information Retrieval, Content-based Retrieval, Music Cognition, Music Creation, Music Database Retrieval, Query By Example (QBE), Query By Humming (QBH), Query By Voice (QBV), Audio-visual Speech Recognition, Speech-reading, Multimodal Database, Optical Music Recognition, Instrument Identification, Context-aware Music Retrieval (Content-based Music Retrieval), Music Recommendation, Commercial Music Recommenders, Contextual Music Recommendation and Retrieval

Research methods: fuzzy systems, neural networks, expert systems, genetic algorithms

Multi-version (cover song) music identification: feature extraction, key invariance, tempo invariance, structure invariance, similarity computation
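The key-invariance idea above can be sketched with 12-bin chroma (pitch-class) profiles: comparing one profile against all 12 circular shifts of the other makes the match robust to transposition. The profiles here are invented toy data; real cover-song systems compare frame-level chroma sequences, not single histograms.

```python
# Key invariance sketch for cover-song matching: a 12-bin chroma
# profile is compared against all 12 semitone transpositions
# (circular shifts) of the other song's profile, so a transposed
# cover still matches. Toy data; real systems use chroma sequences.

def shift(chroma, k):
    """Rotate a 12-bin chroma profile by k semitones."""
    return chroma[-k:] + chroma[:-k] if k else list(chroma)

def key_invariant_distance(c1, c2):
    """Smallest L1 distance between c1 and any transposition of c2."""
    return min(
        sum(abs(x - y) for x, y in zip(c1, shift(c2, k)))
        for k in range(12)
    )
```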

MIDI (Musical Instrument Digital Interface) format, WAVE (Waveform Audio File Format) format 『research generally focuses on the MIDI format』

Feature Extraction:

Time Domain 『ACF (autocorrelation function), AMDF (average magnitude difference function), SIFT (simple inverse filter tracking)』

Frequency Domain 『harmonic product spectrum, cepstrum』

Big Data for Musicology[4]:

Automatic Music Transcription (AMT, the process of converting an acoustic musical signal into some form of musical notation)

The most popular approach is parallelisation with Map-Reduce, using the Hadoop framework.
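The Map-Reduce pattern mentioned above can be sketched in-process: map each document to key-value pairs, shuffle by key, then reduce each group. The toy term-counting corpus is hypothetical; an actual deployment would run on Hadoop (e.g., via Hadoop Streaming) rather than in one process.

```python
# In-process sketch of the Map-Reduce pattern: map each document to
# (term, 1) pairs, shuffle pairs into per-key groups, and reduce each
# group by summation. Illustrative stand-in for a Hadoop job.

from collections import defaultdict

def map_phase(doc):
    """Mapper: emit a (term, 1) pair for every token in the document."""
    return [(term, 1) for term in doc.split()]

def shuffle(pairs):
    """Group emitted values by key, as the framework's shuffle does."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def map_reduce(docs):
    """Run map, shuffle, and a summing reduce over a list of documents."""
    pairs = [p for doc in docs for p in map_phase(doc)]
    return {key: sum(values) for key, values in shuffle(pairs).items()}
```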

Modeling Concept Dynamics for Large Scale Music Search[5]:

DMCM (Dynamic Musical Concept Mixture)

SMCH  (Stochastic Music Concept Histogram)

The music preprocessing layer extracts multiple acoustic features and maps them into an audio word from a precomputed codebook.

The concept dynamics modeling layer derives from the underlying audio words a Stochastic Music Concept Histogram (SMCH), essentially a probability distribution over the high-level concepts.
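A rough sketch of that preprocessing-plus-histogram pipeline: each feature vector is quantized to its nearest codeword ("audio word"), and the normalized word histogram then serves as a probability distribution, loosely analogous to the SMCH idea. The codebook below is a made-up example; the paper precomputes its codebook from training data.

```python
# Audio-word quantization sketch: map each acoustic feature vector to
# its nearest codeword in a (here invented) codebook, then normalize
# the word counts into a probability distribution.

def nearest_word(feature, codebook):
    """Index of the codeword closest to `feature` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((f - c) ** 2 for f, c in zip(feature, codebook[i])),
    )

def word_histogram(features, codebook):
    """Normalized histogram of audio-word occurrences."""
    counts = [0] * len(codebook)
    for f in features:
        counts[nearest_word(f, codebook)] += 1
    total = sum(counts)
    return [c / total for c in counts]
```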

Other techniques:

Wang J C, Shih Y C, Wu M S, et al. Colorizing tags in tag cloud: a novel query-by-tag music search system[C]// Proceedings of the 19th ACM International Conference on Multimedia. ACM, 2011. 【Not closely tied to cloud computing; the emphasis is on clustering and classification. It suits my own taste, and it is fun!】

Acoustic model training


  • Text Processing:

Key Words:

text classification, maximum entropy model

Maximum entropy: it reflects a plain principle of how humans understand the world, namely that when nothing is known about an event, one should choose the model whose distribution is as uniform as possible [16]
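That principle can be checked numerically: among distributions over the same outcomes, the uniform one has maximal Shannon entropy, so it is the least-committed model when nothing else is known.

```python
# Numeric illustration of the maximum-entropy principle: the uniform
# distribution maximizes Shannon entropy over a fixed outcome set.

import math

def entropy(p):
    """Shannon entropy in bits of a discrete distribution."""
    return -sum(x * math.log2(x) for x in p if x > 0)
```

For four outcomes, the uniform distribution reaches the 2-bit maximum, while any skewed distribution falls below it.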

Cases of text processing on cloud platforms:

【Graph data processing】

GraphLab: an open-source distributed computing system proposed by CMU, a new parallel framework oriented toward machine learning [6]

Pregel: a distributed graph-computation framework proposed by Google, suited to complex machine learning

Horton: developed by Microsoft for graph data matching

By contrast, there is currently no distributed parallel computing framework designed specifically for speech recognition and speech mining. At present, existing general-purpose open-source frameworks such as Hadoop and Sector/Sphere are used for speech recognition, but the results are not as striking as GraphLab's on graph data. In other words, a framework designed for speech recognition and speech mining is worth building!

Multimedia Mining:

Image Mining, Video Mining, Audio Mining, Text Mining


  • Audio Mining:

To mine audio data, one could convert it into text using speech transcription techniques. Audio data could also be mined directly by using audio information processing techniques and then mining selected audio data. [10]

Either convert the audio to a text signal and then do text mining, or process the acoustic signal directly and then mine the useful audio data.

The text-based approach, also known as large-vocabulary continuous speech recognition (LVCSR), converts speech to text and then identifies words in a dictionary that can contain several hundred thousand entries. If a word or name is not in the dictionary, the LVCSR system will choose the most similar word it can find. [11]
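That nearest-word fallback can be sketched with Levenshtein (edit) distance over a toy dictionary; real LVCSR systems work with pronunciation lexicons and phone-level similarity rather than raw spelling, so this is an illustrative simplification.

```python
# Out-of-vocabulary fallback sketch: pick the dictionary entry most
# similar to an unrecognized token, by Levenshtein (edit) distance.

def edit_distance(a, b):
    """Minimum number of insertions, deletions, substitutions a -> b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def closest_word(token, dictionary):
    """The dictionary word with the smallest edit distance to `token`."""
    return min(dictionary, key=lambda w: edit_distance(token, w))
```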


NLP: Natural Language Processing

Part-of-speech tagging (POS)

Named entity tagging (NE), from 《立委随笔:机器学习和自然语言处理》 [12]

word accuracy, hit and miss rates, response time, efficiency, precision and system compatibility

WIKI:

Document retrieval / text retrieval: form-based 『suffix tree』, content-based 『inverted index』 [13]

Full-text search / free-text search: a search engine examines all of the words in every stored document

"Text retrieval is a critical area of study today, since it is the fundamental basis of all internet search engines."[14]

String searching: string matching

Indexing[14]

When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning." This is what some tools, such as grep, do when searching.

However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.

The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive."
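The indexing stage just described (tokenize, drop stop words, stem, record which documents contain each term) can be sketched as follows. The stop list and the suffix-stripping rule are crude stand-ins for real language-specific components such as a Porter stemmer.

```python
# Minimal inverted index: lowercase terms, drop a small stop-word
# list, apply a crude suffix-stripping "stemmer", and map each
# remaining term to the set of documents containing it.

STOP_WORDS = {"the", "and", "a", "of"}

def stem(term):
    """Toy stemmer: strip a few common suffixes from longer words."""
    for suffix in ("ing", "es", "s"):
        if term.endswith(suffix) and len(term) > len(suffix) + 2:
            return term[: -len(suffix)]
    return term

def build_index(docs):
    """Build {term: {doc_id, ...}} over a list of document strings."""
    index = {}
    for doc_id, text in enumerate(docs):
        for term in text.lower().split():
            term = stem(term)
            if term not in STOP_WORDS:
                index.setdefault(term, set()).add(doc_id)
    return index
```

At query time only this mapping is consulted, never the original documents, which is exactly the index-versus-serial-scanning trade-off described above.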



MIR[15]

Music Information Retrieval: MIR uses audio signal analysis to extract meaningful features of music.

Recommender systems : few are based upon MIR techniques, instead making use of similarity between users or laborious data compilation.

Track separation and instrument recognition

Automatic music transcription : converting an audio recording into symbolic notation

"multi-pitch detection, onset detection, duration estimation, instrument identification, and the extraction of rhythmic information"

Automatic categorization

Music generation


Contextual music information retrieval and recommendation: State of the art and challenges[3]:

Temporal levels of content-based music analysis:

event-scale information (i.e., transcribing individual notes or chords): instrument detection, QBE, QBH; describes the music itself

phrase-level information (i.e., analyzing note sequences for periodicities): analyzes longer temporal excerpts; tempo detection, playlist sequencing, music summarization

piece-level information (i.e., analyzing longer excerpts of audio tracks): a more abstract representation of a music track, closer to the user's perception of music; used for genre detection and content-based music recommenders

Four levels of retrieval tasks 『research concentrates mainly on the genre, work, and instance levels』:

genre level: searching for rock songs is a task at a genre level

artist level: looking for artists similar to Björk is clearly a task at an artist level

work level: finding cover versions of the song “Let it Be” by The Beatles is a task at a work level

instance level: identifying a particular recording of Mahler’s fifth symphony is a task at an instance level

Content-based music information retrieval: QBE, QBH, genre classification

Music recommendation: collaborative filtering (CF), content-based approaches 『content-based methods are rarely used alone for music recommendation; the two approaches can be combined』
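A toy user-based CF sketch: an unseen track is scored for a user by averaging other users' ratings, weighted by cosine similarity over co-rated tracks. The ratings matrix is invented; practical recommenders add rating normalization and neighborhood selection on top of this.

```python
# User-based collaborative filtering sketch: predict a user's rating
# for an unseen track as a similarity-weighted average of other
# users' ratings, with similarity measured on co-rated tracks.

import math

def cosine(u, v, common):
    """Cosine similarity between two rating dicts over shared tracks."""
    num = sum(u[t] * v[t] for t in common)
    den = (math.sqrt(sum(u[t] ** 2 for t in common))
           * math.sqrt(sum(v[t] ** 2 for t in common)))
    return num / den if den else 0.0

def predict(ratings, user, track):
    """Weighted-average prediction of `user`'s rating for `track`."""
    num = den = 0.0
    for other, r in ratings.items():
        if other == user or track not in r:
            continue
        common = set(ratings[user]) & set(r)
        if not common:
            continue
        sim = cosine(ratings[user], r, common)
        num += sim * r[track]
        den += abs(sim)
    return num / den if den else 0.0
```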

Contextual and social music retrieval and recommendation: Environment-related context(season, temperature, time, weather conditions), User-related context(Activity, Demographical, Emotional state), Multimedia context(Text, Images)

Emotion recognition in music: ML

Music and the social web: tag acquisition 『usable for MIR and music recommendation』



Other Key Words: activity recognition, computational data mining, raw audio, clustering, classification, regression, support vector machines, KDD (Knowledge Discovery in Databases)

References:

[1] 邢铭生, 朱浩, 王宏斌. 语音识别技术综述[J]. 科协论坛, 2010, (3):62-63. DOI:10.3969/j.issn.1007-3973.2010.03.033.

[2] Li J, Deng L, Gong Y, et al. An overview of noise-robust automatic speech recognition[J]. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2014, 22(4): 745-777.

[3] Kaminskas M, Ricci F. Contextual music information retrieval and recommendation: State of the art and challenges[J]. Computer Science Review, 2012, 6:89–119.

[4] Weyde, Tillman, et al. "Big Data for Musicology." Proceedings of the 1st International Workshop on Digital Libraries for Musicology. ACM, 2014.

[5] Shen J, Pang H H, Wang M, et al. Modeling concept dynamics for large scale music search[J]. Research Collection School of Information Systems, 2012:455-464.

[6] Low Y, Gonzalez J, Kyrola A, et al. GraphLab: A Distributed Framework for Machine Learning in the Cloud[J]. Eprint Arxiv, 2011.

[7] Bhatt C A, Kankanhalli M S. Multimedia data mining: state of the art and challenges[J]. Multimedia Tools & Applications, 2011, 51(1): 35-76.

[8] 姜干新. 基于HMM的分布式语音识别系统的研究与应用[D]. 浙江大学计算机科学与技术学院, 2010.

[9] Shahzad Hussain. Web Based Network Speech Recognition[D]. Tampereen teknillinen yliopisto - Tampere University of Technology, 2013.

[10] Kamde P M, Algur S P. A Survey on Web Multimedia Mining[J]. International Journal of Multimedia & Its Applications, 2011, 3(3).

[11] Bhatt C A, Kankanhalli M S. Multimedia data mining: state of the art and challenges[J]. Multimedia Tools & Applications, 2011, 51(1):35-76.

[12] 李维. 立委随笔:机器学习和自然语言处理. http://blog.sciencenet.cn/blog-362400-294037.html , 2010-2-13.

[13] Wikipedia. https://en.wikipedia.org/wiki/Document_retrieval , 22 June 2015.

[14] Wikipedia. https://en.wikipedia.org/wiki/Full_text_search , 13 June 2015.

[15] Wikipedia. https://en.wikipedia.org/wiki/Music_information_retrieval , 14 June 2015.

[16] 李荣陆, 王建会, 陈晓云,等. 使用最大熵模型进行中文文本分类[J]. 计算机研究与发展, 2005, 42(1):94-101. DOI:10.1007/978-3-540-24655-8_63.

From Wiz Note (为知笔记)

之前一段时间读到了这篇博客,其中描述了作者如何用java实现国外著名音乐搜索工具shazam的基本功能.其中所提到的文章又将我引向了关于shazam的一篇论文及另外一篇博客.读完之后发现其中的原理并不十分复杂,但是方法对噪音的健壮性却非常好,出于好奇决定自己用python自己实现了一个简单的音乐搜索工具—— Song Finder, 它的核心功能被封装在SFEngine 中,第三方依赖方面只使用到了 scipy. 工具demo 这个demo在ipython下展示工具的使用,本项目名称为Song