
  • Speech recognition:

Key Words:

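Distributed Speech Recognition (DSR: the recognition function of an embedded speech recognition system is hosted on a server [this does not mean distributed servers; it means the terminal and the server stand in a distributed relationship [8]])

Network Speech Recognition (NSR: the emphasis is on the network; the terminal transmits the speech signal efficiently and in real time, and the server processes it [9]). Nowadays the terminal's speech signal is generally processed by a server or the cloud.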

Emotion Speech Recognition (ESR), Spoken Information Retrieval, Speech Recognition, Spoken Term Detection, Speaker Recognition, Voice Control, Language Modeling, Speech Signal Processing / Speech Processing, Speech Enhancement, Robust Speech Recognition, Feature Compensation, Model Compensation, Automatic Speech Recognition (ASR), Speech Separation, Signal Analysis, Acoustic Speech Recognition Systems, Voice Activity Detection (VAD)


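Speech recognition system: acoustic model of speech (training/learning), pattern matching (recognition algorithm) | language model, language processing

Acoustic models: dynamic time warping (DTW), hidden Markov model (HMM), artificial neural network (ANN)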
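A minimal sketch of the DTW idea, assuming one-dimensional feature sequences and an absolute-difference local cost (real recognizers compare multi-dimensional feature frames). The function name and sample data are illustrative, not from any cited source.

```python
def dtw_distance(a, b):
    """Dynamic time warping distance between two 1-D feature sequences."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j]
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])
            # allowed steps: advance in a, advance in b, or advance in both
            cost[i][j] = local + min(cost[i - 1][j],
                                     cost[i][j - 1],
                                     cost[i - 1][j - 1])
    return cost[n][m]


# Hypothetical data: a template and the same contour spoken twice as slowly.
template = [1.0, 2.0, 3.0, 2.0, 1.0]
slow_utterance = [1.0, 1.0, 2.0, 2.0, 3.0, 3.0, 2.0, 2.0, 1.0, 1.0]
other = [3.0, 3.0, 1.0, 1.0, 3.0]

print(dtw_distance(template, slow_utterance))
print(dtw_distance(template, other))
```

The time-stretched utterance aligns with the template at zero cost, which is the property that makes DTW usable for template matching across speaking rates.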




Speaker adaptation (SA); speaker-independent (SI); speaker-dependent (SD) 『SA+SI』

Adaptation: batch, online, instantaneous | supervised, unsupervised

An Overview of Noise-Robust Automatic Speech Recognition[2]:

Historically, ASR applications have included voice dialing, call routing, interactive voice response, data entry and dictation, voice command and control, structured document creation (e.g., medical and legal transcriptions), appliance control by voice, computer-aided language learning, content-based spoken audio search, and robotics.

More recently, with the exponential growth of big data and computing power, ASR technology has advanced to the stage where more challenging applications are becoming a reality. Examples are voice search and interactions with mobile devices (e.g., Siri on iPhone, Bing voice search on winPhone, and Google Now on Android), voice control in home entertainment systems (e.g., Kinect on xBox), and various speech-centric information processing applications capitalizing on downstream processing of ASR outputs.

  • Music Search:

Key Words:

Speech Transcription, Multimedia Information Retrieval, Music Search, Search Engine, Mobile Internet, Music Retrieval, Audio Information Retrieval, Audio Mining, Adaptive Music Retrieval, Music Information Retrieval, Content-based Retrieval, Music Cognition, Music Creation, Music Database Retrieval, Query By Example (QBE), Query By Humming (QBH), Query By Voice (QBV), Audio-visual Speech Recognition, Speech-reading, Multimodal Database, Optical Music Recognition, Instrument Identification, Context-aware Music Retrieval (Content Based Music Retrieval), Music Recommendation, Commercial music recommenders, Contextual music recommendation and retrieval

Research methods: Fuzzy system, Neural network, Expert system, Genetic algorithm

Multi-version music recognition techniques: feature extraction, key invariance, tempo invariance, structure invariance, similarity computing
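One common way to get key invariance can be sketched as follows, assuming each recording is summarized by a 12-bin chroma vector (one bin per pitch class). Under that assumption a transposition is a circular shift, so the comparison takes the best similarity over all 12 shifts. The chroma representation and all function names here are assumptions for illustration, not taken from these notes' sources.

```python
def rotate(chroma, k):
    """Transpose a 12-bin chroma vector up by k semitones (circular shift)."""
    chroma = list(chroma)
    k %= 12
    return chroma[-k:] + chroma[:-k] if k else chroma


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(u, v))
    norm_u = sum(x * x for x in u) ** 0.5
    norm_v = sum(y * y for y in v) ** 0.5
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def key_invariant_similarity(a, b):
    """Best cosine similarity between a and any of the 12 transpositions of b."""
    return max(cosine(a, rotate(b, k)) for k in range(12))


# Hypothetical example: a C major triad (C, E, G) and the same chord in D.
c_major = [1, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0]
d_major = rotate(c_major, 2)

print(cosine(c_major, d_major))                    # plain comparison
print(key_invariant_similarity(c_major, d_major))  # transposition-aware
```

The plain cosine treats the two chords as unrelated because no pitch classes overlap, while the shifted comparison recognizes them as the same pattern in a different key.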

MIDI (Musical Instrument Digital Interface) format, WAVE (Waveform Audio File Format) format 『research generally works with the MIDI format』

Feature Extraction:

Time Domain 『ACF (Autocorrelation function), AMDF (Average magnitude difference function), SIFT (Simplified inverse filter tracking)』

Frequency Domain 『Harmonic product spectrum, Cepstrum』
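As an illustration of the time-domain ACF approach, the sketch below estimates the fundamental frequency of a synthetic tone by picking the lag with the largest autocorrelation. The search range (50 to 500 Hz), the sample rate, and the function name are assumptions chosen for the example.

```python
import math


def acf_pitch(signal, sample_rate, fmin=50.0, fmax=500.0):
    """Estimate the fundamental frequency (Hz) from the autocorrelation peak."""
    min_lag = int(sample_rate / fmax)   # shortest period considered
    max_lag = int(sample_rate / fmin)   # longest period considered
    best_lag, best_value = min_lag, float("-inf")
    for lag in range(min_lag, max_lag + 1):
        # unnormalized autocorrelation at this lag
        value = sum(signal[n] * signal[n + lag]
                    for n in range(len(signal) - lag))
        if value > best_value:
            best_lag, best_value = lag, value
    return sample_rate / best_lag


# Hypothetical test signal: 0.1 s of a 200 Hz sine sampled at 8 kHz.
sample_rate = 8000
tone = [math.sin(2 * math.pi * 200 * n / sample_rate) for n in range(800)]

estimated_f0 = acf_pitch(tone, sample_rate)
print(estimated_f0)
```

AMDF follows the same lag-search structure but minimizes the mean absolute difference between the signal and its shifted copy instead of maximizing the product sum.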

Big Data for Musicology[4]:

Automatic Music Transcription (AMT, the process of converting an acoustic musical signal into some form of musical notation)

The most popular approach is parallelisation with Map-Reduce, using the Hadoop framework.

Modeling Concept Dynamics for Large Scale Music Search[5]:

DMCM (Dynamic Musical Concept Mixture)

SMCH  (Stochastic Music Concept Histogram)

The music preprocessing layer extracts multiple acoustic features and maps them into an audio word from a precomputed codebook.

The concept dynamics modeling layer derives from the underlying audio words a Stochastic Music Concept Histogram (SMCH), essentially a probability distribution over the high-level concepts.
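The preprocessing step described above (feature vector to audio word via a precomputed codebook) can be sketched as nearest-codeword quantization. This is only an illustration under stated assumptions: the codebook, the feature values, and the function names are made up, and the sketch stops at a normalized histogram of audio words. The step from audio words to concept probabilities in [5] is not reproduced here.

```python
from collections import Counter


def nearest_codeword(feature, codebook):
    """Index of the codebook entry closest to `feature` (squared Euclidean)."""
    return min(
        range(len(codebook)),
        key=lambda i: sum((f - c) ** 2 for f, c in zip(feature, codebook[i])),
    )


def audio_word_histogram(features, codebook):
    """Quantize each feature vector to an audio word and normalize the counts."""
    words = [nearest_codeword(f, codebook) for f in features]
    counts = Counter(words)
    return [counts[i] / len(words) for i in range(len(codebook))]


# Hypothetical 2-D codebook and per-frame features.
codebook = [(0.0, 0.0), (1.0, 1.0), (5.0, 5.0)]
frames = [(0.1, -0.1), (0.9, 1.2), (1.1, 0.8), (4.7, 5.2)]

histogram = audio_word_histogram(frames, codebook)
print(histogram)
```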


Wang J C, Shih Y C, Wu M S, et al. Colorizing tags in tag cloud: a novel query-by-tag music search system[C]// Proceedings of the 19th ACM international conference on Multimedia. ACM, 2011. 【Not closely tied to cloud computing; the focus is on clustering and classification. It matches my own taste and is a lot of fun!】


  • Text Processing:

Key Words:

text classification, maximum entropy model








Multimedia Mining:

Image Mining, Video Mining, Audio Mining, Text Mining

  • Audio Mining:

To mine audio data, one could convert it into text using speech transcription techniques. Audio data could also be mined directly by using audio information processing techniques and then mining selected audio data. [10]

That is, either convert the audio to text and then apply text mining, or process the audio signal directly and then mine the useful audio data.

The text-based approach, also known as large-vocabulary continuous speech recognition (LVCSR), converts speech to text and then identifies words in a dictionary that can contain several hundred thousand entries. If a word or name is not in the dictionary, the LVCSR system will choose the most similar word it can find. [11]

NLP: Natural Language Processing

Part-of-speech tagging (POS)

Named entity tagging (NE); from 《立委随笔:机器学习和自然语言处理》 ("Liwei's Essays: Machine Learning and Natural Language Processing") [12]

word accuracy, hit and miss rates, response time, efficiency, precision and system compatibility


Document retrieval / text retrieval: form based 『suffix tree』, content based 『inverted index [13]』

Full-text search / free-text search: a search engine examines all of the words in every stored document

"Text retrieval is a critical area of study today, since it is the fundamental basis of all internet search engines." [14]

String Searching : string matching


When dealing with a small number of documents, it is possible for the full-text-search engine to directly scan the contents of the documents with each query, a strategy called "serial scanning." This is what some tools, such as grep, do when searching.

However, when the number of documents to search is potentially large, or the quantity of search queries to perform is substantial, the problem of full-text search is often divided into two tasks: indexing and searching. The indexing stage will scan the text of all the documents and build a list of search terms (often called an index, but more correctly named a concordance). In the search stage, when performing a specific query, only the index is referenced, rather than the text of the original documents.

The indexer will make an entry in the index for each term or word found in a document, and possibly note its relative position within the document. Usually the indexer will ignore stop words (such as "the" and "and") that are both common and insufficiently meaningful to be useful in searching. Some indexers also employ language-specific stemming on the words being indexed. For example, the words "drives", "drove", and "driven" will be recorded in the index under the single concept word "drive."
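The indexing and searching stages described above can be sketched as a small inverted index. This is a toy under stated assumptions: the stop-word set is a tiny sample, and stemming is replaced by a hard-coded lookup table covering only the "drive" example, where a real indexer would use a proper stemmer. All names are illustrative.

```python
import re
from collections import defaultdict

STOP_WORDS = {"the", "and", "a", "of", "to"}            # tiny sample list
STEMS = {"drives": "drive", "drove": "drive", "driven": "drive"}  # toy stemmer


def tokenize(text):
    """Lowercase the text and split it into alphabetic words."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(documents):
    """Indexing stage: map each term to {doc_id: [word positions]}."""
    index = defaultdict(dict)
    for doc_id, text in documents.items():
        for position, word in enumerate(tokenize(text)):
            if word in STOP_WORDS:
                continue
            term = STEMS.get(word, word)
            index[term].setdefault(doc_id, []).append(position)
    return index


def search(index, query):
    """Search stage: ids of documents containing every non-stop query term."""
    terms = [STEMS.get(w, w) for w in tokenize(query) if w not in STOP_WORDS]
    if not terms:
        return set()
    results = set(index.get(terms[0], {}))
    for term in terms[1:]:
        results &= set(index.get(term, {}))
    return results


documents = {
    1: "She drives the car and the truck",
    2: "He drove to the station",
    3: "The station is closed",
}
index = build_index(documents)
print(search(index, "driven"))
print(search(index, "drove station"))
```

Only the index is consulted at query time; the original document text is never rescanned, which is the difference from the serial scanning that grep performs.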


Music Information Retrieval: MIR uses audio signal analysis to extract meaningful features of music.

Recommender systems: few are based upon MIR techniques, instead making use of similarity between users or laborious data compilation.

Track separation and instrument recognition

Automatic music transcription: converting an audio recording into symbolic notation

"multi-pitch detection, onset detection, duration estimation, instrument identification, and the extraction of rhythmic information"

Automatic categorization

Music generation

Contextual music information retrieval and recommendation: State of the art and challenges[3]:

Levels of music content information 『useful for / not useful for』:

event-scale information (i.e., transcribing individual notes or chords): useful for instrument detection, QBE, QBH; not useful for describing music

phrase-level information (i.e., analyzing note sequences for periodicities): analyzes longer temporal excerpts; useful for tempo detection, playlist sequencing, music summarization

piece-level information (i.e., analyzing longer excerpts of audio tracks): a more abstract representation of a music track, the user’s perception of music; used for genre detection, content-based music recommenders

Four levels of retrieval tasks: 『research concentrates mainly on the genre level, work level, and instance level』

genre level: searching for rock songs is a task at a genre level

artist level: looking for artists similar to Björk is clearly a task at an artist level

work level: finding cover versions of the song “Let it Be” by The Beatles is a task at a work level

instance level: identifying a particular recording of Mahler’s fifth symphony is a task at an instance level

Content-based music information retrieval: QBE, QBH, Genre Classification

Music recommendation: Collaborative filtering (CF), Content-based approach 『rarely used for music recommendation; the two approaches can be combined』

Contextual and social music retrieval and recommendation: Environment-related context (season, temperature, time, weather conditions), User-related context (Activity, Demographical, Emotional state), Multimedia context (Text, Images)

Emotion recognition in music: ML

Music and the social web: Tag acquisition 『can be used for MIR and music recommendation』

Other Key Words: Activity recognition, computational data mining, raw audio, Clustering, Classification, regression, vector machines, KDD (Knowledge Discovery in Databases)
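Collaborative filtering (CF), listed above under music recommendation, relies on similarity between users rather than on audio content. A minimal user-based sketch, assuming explicit ratings on a 1 to 5 scale and cosine similarity over co-rated tracks; the user names, tracks, and function names are hypothetical.

```python
import math


def similarity(a, b):
    """Cosine similarity over the tracks both users have rated."""
    common = set(a) & set(b)
    if not common:
        return 0.0
    dot = sum(a[t] * b[t] for t in common)
    norm_a = math.sqrt(sum(a[t] ** 2 for t in common))
    norm_b = math.sqrt(sum(b[t] ** 2 for t in common))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def predict(ratings, user, track):
    """Similarity-weighted average of other users' ratings for `track`."""
    weighted_sum = weight_total = 0.0
    for other, other_ratings in ratings.items():
        if other == user or track not in other_ratings:
            continue
        weight = similarity(ratings[user], other_ratings)
        weighted_sum += weight * other_ratings[track]
        weight_total += weight
    return weighted_sum / weight_total if weight_total else None


# Hypothetical ratings: alice's taste matches bob's, not carol's.
ratings = {
    "alice": {"song_a": 5, "song_b": 1},
    "bob": {"song_a": 5, "song_b": 1, "song_c": 5},
    "carol": {"song_a": 1, "song_b": 5, "song_c": 1},
}

print(predict(ratings, "alice", "song_c"))
```

The prediction for alice leans toward bob's rating because their past ratings agree; no audio feature is involved, which is why the notes treat CF and content-based approaches as complementary.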


