Notes on Commonly Used Datasets

A record of the datasets I use most often.

TIMIT
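I no longer remember where I downloaded it from, and I have not found a good link online.
TIMIT, in full The DARPA TIMIT Acoustic-Phonetic Continuous Speech Corpus, is an acoustic-phonetic continuous speech corpus built jointly by Texas Instruments (TI), the Massachusetts Institute of Technology (MIT), and the Stanford Research Institute (SRI). The speech is sampled at 16 kHz. The corpus contains 6,300 sentences in total: 630 speakers from eight major dialect regions of the United States each read 10 given sentences. All sentences were manually segmented and labeled at the phone level. About 70% of the speakers are male, and most speakers are white adults.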
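
Because the phone-level labels are expressed against 16 kHz audio, a small helper that converts sample offsets into seconds is often handy. The sketch below assumes the label files use one `<start_sample> <end_sample> <phone>` triple per line (the layout I recall for TIMIT's `.PHN` files; check it against your copy). The names `PhoneSegment` and `parse_phn` and the example labels are made up for illustration.

```python
from typing import List, NamedTuple

SAMPLE_RATE_HZ = 16_000  # TIMIT audio is sampled at 16 kHz


class PhoneSegment(NamedTuple):
    start_s: float  # segment start, in seconds
    end_s: float    # segment end, in seconds
    phone: str      # phone label


def parse_phn(text: str, sample_rate: int = SAMPLE_RATE_HZ) -> List[PhoneSegment]:
    """Parse '<start_sample> <end_sample> <phone>' lines into segments in seconds."""
    segments = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 3:
            continue  # skip blank or malformed lines
        start, end, phone = int(parts[0]), int(parts[1]), parts[2]
        segments.append(PhoneSegment(start / sample_rate, end / sample_rate, phone))
    return segments


# Made-up example content, not copied from the corpus.
example = "0 3200 h#\n3200 4800 sh\n4800 8000 iy\n"
for seg in parse_phn(example):
    print(f"{seg.phone}: {seg.start_s:.3f}s to {seg.end_s:.3f}s")
```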

THCHS30
THCHS30 is an open speech dataset released by Dong Wang, Xuewei Zhang, and Zhiyong Zhang that can be used to develop Chinese speech recognition systems.

CSTR VCTK Corpus

The dataset used by Google's WaveNet.
This CSTR VCTK Corpus includes speech data uttered by 109 native speakers of English with various accents. Each speaker reads out about 400 sentences, most of which were selected from a newspaper plus the Rainbow Passage and an elicitation paragraph intended to identify the speaker's accent. The newspaper texts were taken from The Herald (Glasgow), with permission from Herald & Times Group. Each speaker reads a different set of the newspaper sentences, where each set was selected using a greedy algorithm designed to maximise the contextual and phonetic coverage. The Rainbow Passage and elicitation paragraph are the same for all speakers. The Rainbow Passage can be found in the International Dialects of English Archive: (http://web.ku.edu/~idea/readings/rainbow.htm). The elicitation paragraph is identical to the one used for the speech accent archive (http://accent.gmu.edu). The details of the speech accent archive can be found at http://www.ualberta.ca/~aacl2009/PDFs/WeinbergerKunath2009AACL.pdf

All speech data was recorded using an identical recording setup: an omni-directional head-mounted microphone (DPA 4035), 96kHz sampling frequency at 24 bits and in a hemi-anechoic chamber of the University of Edinburgh. All recordings were converted into 16 bits, were downsampled to 48 kHz based on STPK, and were manually end-pointed. This corpus was recorded for the purpose of building HMM-based text-to-speech synthesis systems, especially for speaker-adaptive HMM-based speech synthesis using average voice models trained on multiple speakers and speaker adaptation technologies.

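VoxForge (open-source recognition corpus)

VoxForge was set up to collect transcribed speech for use with free and open-source speech recognition engines (on Linux/Unix, Windows, and Mac).
We make all submitted audio files available under the GPL license and build acoustic models from them for use with open-source speech recognition engines such as CMU Sphinx, ISIP, Julius (GitHub), and HTK (note: HTK has distribution restrictions).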

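OpenSLR

OpenSLR is a site that hosts speech and language resources. The audiobook dataset LibriSpeech, for example, is hosted there.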

OpenSLR is a site devoted to hosting speech and language resources, such as training corpora for speech recognition, and software related to speech recognition. We intend to be a convenient place for anyone to put resources that they have created, so that they can be downloaded publicly.

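The following is excerpted from: http://www.cnblogs.com/AriesQt/articles/6742721.html

From the paper by Zhang et al., 2015. This is a large collection made up of eight text classification datasets, and it is the most commonly used collection for new text classification baselines. Sample sizes range from 120K to 3.6M, and the tasks range from binary to 14-class problems. The datasets come from DBPedia, Amazon, Yelp, Yahoo!, Sogou, and AG.

Link: https://drive.google.com/drive/u/0/folders/0Bz8a_Dbh9Qhbfll6bVpmNUtUcFdjYmF2SEpmZUZUcVNiMUw1TWN6RDV3a0JHT3kxLVhVR2M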

WikiText

A large language modeling corpus drawn from high-quality Wikipedia articles. Maintained by Salesforce MetaMind.

Link: http://metamind.io/research/the-wikitext-long-term-dependency-language-modeling-dataset/

Question Pairs

The first dataset released by Quora, with labels marking duplicate (semantically equivalent) question pairs.

Link: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs

SQuAD

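The Stanford Question Answering Dataset: a broadly applicable question answering and reading comprehension dataset. Every answer is a span, that is, a segment of text.

Link: https://rajpurkar.github.io/SQuAD-explorer/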
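
To make the "answer as a span" idea concrete, here is a minimal sketch. It assumes each answer record carries the answer string in a `text` field and its character offset into the passage in an `answer_start` field, which is how I understand the SQuAD JSON files to be laid out. The helper name `extract_span` and the example passage are made up for illustration.

```python
from typing import Any, Dict


def extract_span(context: str, answer: Dict[str, Any]) -> str:
    """Return the slice of `context` that the answer record points at."""
    start = answer["answer_start"]          # character offset into the passage
    end = start + len(answer["text"])       # span length equals the answer length
    return context[start:end]


# Made-up passage and answer record, shaped like a SQuAD entry.
context = "LibriSpeech is an audiobook dataset organized by book chapter."
answer = {"text": "book chapter", "answer_start": context.index("book chapter")}

span = extract_span(context, answer)
print(span)
# A consistent record recovers exactly the answer text.
print(span == answer["text"])
```

Comparing the recovered span with the stored `text` is a cheap sanity check to run after any preprocessing that might shift character offsets.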

CMU Q/A Dataset

Manually created factoid question/answer pairs, with difficulty ratings, based on Wikipedia articles.

Link: http://www.cs.cmu.edu/~ark/QA-data/

Maluuba Datasets

Sophisticated, manually created datasets for NLP research.

Link: https://datasets.maluuba.com/

Billion Words

A large, general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec or GloVe.

Link: http://www.statmt.org/lm-benchmark/

Common Crawl

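A petabyte-scale crawl of the web. Most often used to learn word embeddings. Freely available from Amazon S3. Because it is a crawl of the World Wide Web, it is also useful as a general web dataset.

Link: http://commoncrawl.org/the-data/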

bAbi

A synthetic reading comprehension and question answering dataset from Facebook AI Research (FAIR).

Link: https://research.fb.com/projects/babi/

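The Children's Book Test

A baseline of (question + context, answer) pairs extracted from children's books available through Project Gutenberg (a project that shares legally free digital books). Useful for question answering, reading comprehension, and factoid lookup.

Link: https://research.fb.com/projects/babi/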

Stanford Sentiment Treebank

A standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence's parse tree.

Link: http://nlp.stanford.edu/sentiment/code.html

20 Newsgroups

One of the classic text classification datasets. It is typically used as a benchmark for pure classification or for validating IR/indexing algorithms.

Link: http://qwone.com/~jason/20Newsgroups/

Reuters

An older, purely classification-based dataset with text from the Reuters newswire. Commonly used in tutorials.

Link: https://archive.ics.uci.edu/ml/datasets/Reuters-21578+Text+Categorization+Collection

IMDB

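An older, relatively small dataset for sentiment classification. It has fallen out of favor as a benchmark in the literature, giving way to larger datasets.

Link: http://ai.stanford.edu/~amaas/data/sentiment/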

UCI’s Spambase

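An older, classic spam email dataset from the UCI Machine Learning Repository. Because of the details of how the dataset was curated, it can be an interesting baseline for learning personalized spam filtering.

Link: https://archive.ics.uci.edu/ml/datasets/Spambase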

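Speech

Most speech recognition datasets are proprietary: the data is highly valuable to the companies that own it. Most of the public datasets in this field are quite old.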

2000 HUB5 English

English-only speech data. Its most recent notable use was in Baidu's Deep Speech paper.

Link: https://catalog.ldc.upenn.edu/LDC2002T43

LibriSpeech

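An audiobook dataset containing both text and speech. Nearly 500 hours of clean speech from multiple readers and multiple audiobooks, organized by book chapter.

Link: http://www.openslr.org/12/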
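
Since the corpus is organized by reader and chapter, utterance identifiers can be split back into those parts. The sketch below assumes transcript lines of the form `<speaker>-<chapter>-<index> <TEXT>`, which is the layout I recall from the per-chapter `.trans.txt` files; verify it against your download. The function name `parse_transcript_line` and the example line are made up for illustration.

```python
from typing import Dict, Union


def parse_transcript_line(line: str) -> Dict[str, Union[str, int]]:
    """Split '<speaker>-<chapter>-<index> <TEXT>' into its fields."""
    utt_id, _, text = line.strip().partition(" ")
    speaker, chapter, index = utt_id.split("-")
    return {
        "utt_id": utt_id,
        "speaker": speaker,    # reader ID
        "chapter": chapter,    # chapter ID
        "index": int(index),   # utterance number within the chapter
        "text": text,
    }


# Made-up example line, not copied from the corpus.
record = parse_transcript_line("19-198-0001 THIS IS A MADE UP TRANSCRIPT\n")
print(record["speaker"], record["chapter"], record["index"])
print(record["text"])
```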

VoxForge

A clean speech dataset of accented English. Useful when you need robust recognition across different accents or intonations.

Link: http://www.voxforge.org/

TIMIT

An English-only speech recognition dataset.

Link: https://catalog.ldc.upenn.edu/LDC93S1

CHIME

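A noisy speech recognition challenge dataset. It contains real, simulated, and clean recordings. Real: nearly 9,000 recordings of four speakers in four different noisy environments. Simulated: generated by combining multiple environments with speech. Clean: clear recordings without noise.

Link: http://spandh.dcs.shef.ac.uk/chime_challenge/data.html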
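
As a rough illustration of what "simulated" noisy speech means, the sketch below adds a noise signal to a clean signal at a chosen signal-to-noise ratio. This is plain additive mixing, a common simplification, and not the challenge's actual simulation procedure, which I have not reproduced here. The function names and the toy signals are made up for illustration.

```python
import math
from typing import List, Sequence


def mean_power(signal: Sequence[float]) -> float:
    """Average squared amplitude of a signal."""
    return sum(v * v for v in signal) / len(signal)


def mix_at_snr(speech: Sequence[float], noise: Sequence[float], snr_db: float) -> List[float]:
    """Scale `noise` so that speech-to-noise power equals `snr_db`, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = mean_power(noise)
    if noise_power == 0:
        raise ValueError("noise signal is silent")
    # SNR(dB) = 10 * log10(P_speech / P_noise), solved for the noise power.
    target_noise_power = mean_power(speech) / (10 ** (snr_db / 10))
    gain = math.sqrt(target_noise_power / noise_power)
    return [s + gain * n for s, n in zip(speech, noise)]


# Toy signals: a sine wave as "speech" and an alternating sequence as "noise".
speech = [math.sin(2 * math.pi * 5 * t / 100) for t in range(100)]
noise = [0.3 if t % 2 == 0 else -0.3 for t in range(100)]
mixed = mix_at_snr(speech, noise, snr_db=10.0)

# Recover the achieved SNR from the mixture as a sanity check.
residual = [m - s for m, s in zip(mixed, speech)]
achieved = 10 * math.log10(mean_power(speech) / mean_power(residual))
print(f"achieved SNR: {achieved:.2f} dB")
```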

TED-LIUM

Audio transcriptions of TED talks. It contains 1,495 TED talk recordings along with their full-text transcripts.

Link: http://www-lium.univ-lemans.fr/en/content/ted-lium-corpus

Time: 2024-10-15 10:25:57