Open Data for Deep Learning

Here you’ll find an organized list of interesting, high-quality datasets for machine learning research. We welcome your contributions to help curate this list! You can find other lists of such datasets on Wikipedia, for example.

Recent Additions

Natural-Image Datasets

  • MNIST: handwritten digits: The most commonly used sanity check. Dataset of 28x28, centered, B&W handwritten digits. It is an easy task: just because something works on MNIST doesn’t mean it works in general. (See the loading sketch after this list.)
  • CIFAR10 / CIFAR100: 32x32 color images with 10 / 100 categories. Not commonly used anymore, though once again, can be an interesting sanity check.
  • Caltech 101: Pictures of objects belonging to 101 categories.
  • Caltech 256: Pictures of objects belonging to 256 categories.
  • STL-10 dataset: An image recognition dataset for developing unsupervised feature learning, deep learning, and self-taught learning algorithms. Like CIFAR-10 with some modifications.
  • The Street View House Numbers (SVHN): House numbers from Google Street View. Think of this as recurrent MNIST in the wild.
  • NORB: Binocular images of toy figurines under various illumination and pose.
  • Pascal VOC: Generic image segmentation / classification: not terribly useful for building real-world image annotation, but great for baselines.
  • Labelme: A large dataset of annotated images.
  • ImageNet: The de-facto image dataset for new algorithms. Many image API companies have labels from their REST interfaces that are suspiciously close to the 1000-category WordNet hierarchy from ImageNet.
  • LSUN: Scene understanding with many ancillary tasks (room layout estimation, saliency prediction, etc.) and an associated competition.
  • MS COCO: Generic image understanding / captioning, with an associated competition.
  • COIL 20: Different objects imaged at every angle in a 360-degree rotation.
  • COIL100: Different objects imaged at every angle in a 360-degree rotation.
  • Google’s Open Images: A collection of 9 million URLs to images “that have been annotated with labels spanning over 6,000 categories” under Creative Commons.
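
Not part of the original list, but as a quick starting point, here is a minimal sketch of downloading two of the sanity-check datasets above (MNIST and CIFAR-10) with torchvision, assuming PyTorch and torchvision are installed:

```python
# Minimal sketch: pull MNIST and CIFAR-10 via torchvision (assumes torchvision is installed).
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()

# 60,000 training digits, 28x28 grayscale, 10 classes
mnist_train = datasets.MNIST(root="./data", train=True, download=True, transform=to_tensor)

# 50,000 training images, 32x32 color, 10 classes
cifar_train = datasets.CIFAR10(root="./data", train=True, download=True, transform=to_tensor)

image, label = mnist_train[0]
print(image.shape, label)                  # torch.Size([1, 28, 28]) 5
print(len(mnist_train), len(cifar_train))  # 60000 50000
```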

Geospatial data

  • OpenStreetMap: Vector data for the entire planet under a free license. It contains (an older version of) the US Census Bureau’s data. (See the sketch after this list.)
  • Landsat8: Satellite shots of the entire Earth’s surface, updated every several weeks.
  • NEXRAD: Doppler radar scans of atmospheric conditions in the US.
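
As an illustration for the OpenStreetMap entry above, here is a minimal sketch using the osmnx package, which pulls OSM extracts through the Overpass API; the place name is only an example, and osmnx is our assumption rather than something the original page recommends:

```python
# Minimal sketch: fetch an OpenStreetMap street network with osmnx (assumes osmnx is installed).
import osmnx as ox

# Download the drivable street network for an example place via the Overpass API.
graph = ox.graph_from_place("Berkeley, California, USA", network_type="drive")

print(len(graph.nodes), len(graph.edges))  # node and edge counts of the resulting road graph
```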

Artificial Datasets

Facial Datasets

Video Datasets

  • Youtube-8M: A large and diverse labeled video dataset for video understanding research.

Text Datasets

  • 20 newsgroups: Classification task, mapping word occurrences to newsgroup ID. One of the classic datasets for text classification; usually useful as a benchmark for either pure classification or as a validation of any IR / indexing algorithm. (See the loading sketch after this list.)
  • Reuters News dataset: (Older) purely classification-based dataset with text from the newswire. Commonly used in tutorials.
  • Penn Treebank: Used for next word prediction or next character prediction.
  • UCI’s Spambase: (Older) classic spam email dataset from the famous UCI Machine Learning Repository. Due to details of how the dataset was curated, this can be an interesting baseline for learning personalized spam filtering.
  • Broadcast News: Large text dataset, classically used for next word prediction.
  • Text Classification Datasets: From Zhang et al., 2015: an extensive set of eight datasets for text classification. These are the benchmark for new text classification baselines. Sample sizes range from 120K to 3.6M, spanning binary to 14-class problems. Datasets from DBPedia, Amazon, Yelp, Yahoo! and AG.
  • WikiText: A large language modeling corpus from quality Wikipedia articles, curated by Salesforce MetaMind.
  • SQuAD: The Stanford Question Answering Dataset: a broadly useful question answering and reading comprehension dataset, where the answer to every question is posed as a segment of text.
  • Billion Words dataset: A large general-purpose language modeling dataset. Often used to train distributed word representations such as word2vec.
  • Common Crawl: Petabyte-scale crawl of the web, most frequently used for learning word embeddings. Available for free from Amazon S3. Can also be useful as a network dataset, since it is a crawl of the WWW.
  • Google Books Ngrams: Successive words from Google books. Offers a simple method to explore when a word first entered wide usage.
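
As a quick illustration (not from the original page), here is a minimal sketch that fetches the 20 Newsgroups dataset above with scikit-learn and turns it into a TF-IDF matrix, assuming scikit-learn is installed:

```python
# Minimal sketch: fetch 20 Newsgroups with scikit-learn and build a TF-IDF document-term matrix.
from sklearn.datasets import fetch_20newsgroups
from sklearn.feature_extraction.text import TfidfVectorizer

# Strip headers, footers, and quotes so a classifier cannot cheat on metadata.
train = fetch_20newsgroups(subset="train", remove=("headers", "footers", "quotes"))

vectorizer = TfidfVectorizer(max_features=20000)
X = vectorizer.fit_transform(train.data)  # sparse matrix: one row per post, one column per term

print(X.shape, len(train.target_names))   # (11314, 20000) 20
```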

Question answering

  • Maluuba News QA Dataset: 120K Q&A pairs on CNN news articles.
  • Quora Question Pairs: first dataset release from Quora containing duplicate / semantic similarity labels.
  • CMU Q/A Dataset: Manually-generated factoid question/answer pairs with difficulty ratings from Wikipedia articles.
  • Maluuba goal-oriented dialogue: Procedural conversational dataset where the dialogue aims at accomplishing a task or taking a decision. Often used to work on chat bots.
  • bAbi: Synthetic reading comprehension and question answering datasets from Facebook AI Research (FAIR).
  • The Children’s Book Test: Baseline of (Question + context, Answer) pairs extracted from Children’s books available through Project Gutenberg. Useful for question-answering (reading comprehension) and factoid look-up.

Sentiment

  • Multidomain sentiment analysis dataset: An older, academic dataset.
  • IMDB: An older, relatively small dataset for binary sentiment classification. Has fallen out of favor as a benchmark in the literature in favor of larger datasets. (See the loading sketch after this list.)
  • Stanford Sentiment Treebank: Standard sentiment dataset with fine-grained sentiment annotations at every node of each sentence’s parse tree.
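
A minimal sketch of loading the IMDB dataset above through the copy bundled with Keras, assuming TensorFlow is installed; note that reviews arrive pre-tokenized as integer word indices rather than raw text:

```python
# Minimal sketch: load the IMDB binary-sentiment dataset bundled with Keras.
from tensorflow.keras.datasets import imdb

# Keep only the 10,000 most frequent words; each review is a list of integer word indices.
(x_train, y_train), (x_test, y_test) = imdb.load_data(num_words=10000)

print(len(x_train), len(x_test))    # 25000 25000
word_index = imdb.get_word_index()  # maps words to the integer indices used above
```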

Recommendation and ranking systems

  • Movielens: Movie ratings dataset from the Movielens website, in various sizes ranging from demo to mid-size. (See the sketch after this list.)
  • Million Song Dataset: Large, metadata-rich, open source dataset on Kaggle that can be good for people experimenting with hybrid recommendation systems.
  • Last.fm: Music recommendation dataset with access to underlying social network and other metadata that can be useful for hybrid systems.
  • Book-Crossing dataset: From the Book-Crossing community. Contains 278,858 users providing 1,149,780 ratings about 271,379 books.
  • Jester: 4.1 million continuous ratings (-10.00 to +10.00) of 100 jokes from 73,421 users.
  • Netflix Prize: Netflix released an anonymized version of their movie rating dataset; it consists of 100 million ratings from 480,000 users, each of whom rated between 1 and all of the 17,770 movies. The first major Kaggle-style data challenge. Only available unofficially, as privacy issues arose.
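
A minimal sketch for getting started with the MovieLens ratings above, assuming the “ml-latest-small” archive from the MovieLens site has been downloaded and extracted next to the script (the local path is an assumption):

```python
# Minimal sketch: explore MovieLens ratings with pandas (assumes ml-latest-small is extracted locally).
import pandas as pd

# Columns in ml-latest-small/ratings.csv: userId, movieId, rating, timestamp
ratings = pd.read_csv("ml-latest-small/ratings.csv")

print(ratings["rating"].describe())  # overall rating distribution

# Average rating per movie, restricted to movies with at least 50 ratings.
stats = ratings.groupby("movieId")["rating"].agg(["mean", "count"])
print(stats[stats["count"] >= 50].sort_values("mean", ascending=False).head())
```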

Networks and Graphs

  • Amazon Co-Purchasing: Amazon review data crawled from the “users who bought this also bought…” section of Amazon, as well as review data for related products. Good for experimenting with recommendation systems in networks. (See the sketch after this list.)
  • Friendster Social Network Dataset: Before its pivot to a gaming website, Friendster released anonymized data in the form of friends lists for 103,750,348 users.
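
A minimal sketch of loading a co-purchasing graph such as the Amazon dataset above with networkx; the filename is a placeholder for whichever SNAP-style edge-list file you download:

```python
# Minimal sketch: load a directed co-purchasing edge list with networkx.
# "amazon-copurchases.txt" is a placeholder filename, not the actual distribution name.
import networkx as nx

graph = nx.read_edgelist("amazon-copurchases.txt", create_using=nx.DiGraph, comments="#", nodetype=int)

print(graph.number_of_nodes(), graph.number_of_edges())

# Products that trigger many "also bought" links have high out-degree.
print(sorted(graph.out_degree, key=lambda pair: pair[1], reverse=True)[:5])
```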

Speech Datasets

  • 2000 HUB5 English: English-only speech data used most recently in the Deep Speech paper from Baidu.
  • LibriSpeech: Audiobook dataset of text and speech. Nearly 500 hours of clean speech from various audiobooks read by multiple speakers, organized by book chapter and containing both the text and the speech. (See the loading sketch after this list.)
  • VoxForge: Clean speech dataset of accented English. Useful for instances in which you expect to need robustness to different accents or intonations.
  • TIMIT: English-only speech recognition dataset.
  • CHIME: Noisy speech recognition challenge dataset. It contains real, simulated, and clean voice recordings: the real data consists of actual recordings of 4 speakers across nearly 9,000 recordings in 4 noisy locations; the simulated data is generated by combining multiple environments with speech utterances; and the clean data consists of non-noisy recordings.
  • TED-LIUM: Audio transcriptions of TED talks. 1,495 TED talk audio recordings along with full text transcriptions of those recordings.
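
A minimal sketch of downloading a small LibriSpeech split with torchaudio, assuming torchaudio is installed; the “dev-clean” subset is the smallest convenient starting point:

```python
# Minimal sketch: download the LibriSpeech "dev-clean" split with torchaudio.
import os
import torchaudio

os.makedirs("./data", exist_ok=True)
dev_clean = torchaudio.datasets.LIBRISPEECH("./data", url="dev-clean", download=True)

# Each item pairs a waveform with its transcript and speaker/chapter/utterance IDs.
waveform, sample_rate, transcript, speaker_id, chapter_id, utterance_id = dev_clean[0]
print(sample_rate, waveform.shape, transcript)
```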

Symbolic Music Datasets

Miscellaneous Datasets

Health & Biology Data

Government & statistics data

Thanks to deeplearning.net and Luke de Oliveira for many of these links and dataset descriptions. Any suggestions of open data sets we should include for the Deeplearning4j community are welcome!

https://deeplearning4j.org/opendata?
