Link-based Classification Datasets

Datasets

Document Classification Datasets:

CiteSeer: The CiteSeer dataset consists of 3312 scientific publications, each classified into one of six classes. The citation network consists of 4732 links. Each publication is described by a 0/1-valued word vector indicating the absence/presence of the corresponding dictionary word. The dictionary consists of 3703 unique words. The README file in the dataset provides more details.
Cora: The Cora dataset consists of 2708 scientific publications, each classified into one of seven classes. The citation network consists of 5429 links. Each publication is described by a 0/1-valued word vector indicating the absence/presence of the corresponding dictionary word. The dictionary consists of 1433 unique words. The README file in the dataset provides more details.
WebKB: The WebKB dataset consists of 877 web pages, each classified into one of five classes. The hyperlink network consists of 1608 links. Each web page is described by a 0/1-valued word vector indicating the absence/presence of the corresponding dictionary word. The dictionary consists of 1703 unique words. The README file in the dataset provides more details.
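The three descriptions above share one shape: a binary word vector and a class label per document, plus a list of links between documents. The following is a minimal Python sketch of loading data of that shape. It assumes a hypothetical whitespace-separated layout (one line per document with an id, the 0/1 word flags, and the label; one line per link with two ids). The real layout is specified in each dataset's README and may differ; the function names and the tiny sample are made up for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def parse_content(lines: Iterable[str]) -> Tuple[Dict[str, List[int]], Dict[str, str]]:
    """Parse lines of the ASSUMED form: <doc_id> <0/1 flag> ... <0/1 flag> <label>."""
    features: Dict[str, List[int]] = {}
    labels: Dict[str, str] = {}
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        doc_id, flags, label = parts[0], parts[1:-1], parts[-1]
        features[doc_id] = [int(flag) for flag in flags]
        labels[doc_id] = label
    return features, labels


def parse_links(lines: Iterable[str], known_ids: Iterable[str]) -> List[Tuple[str, str]]:
    """Parse lines of the ASSUMED form: <id_a> <id_b>, dropping links to unknown ids."""
    known = set(known_ids)
    links: List[Tuple[str, str]] = []
    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue  # skip malformed or blank lines
        a, b = parts
        if a in known and b in known:
            links.append((a, b))
    return links


# Made-up sample: three documents over a three-word dictionary, three links.
sample_content = [
    "p1\t1\t0\t1\tAI",
    "p2\t0\t0\t1\tDB",
    "p3\t1\t1\t0\tAI",
]
sample_links = ["p1\tp2", "p3\tp1", "p3\tp9"]  # p9 is not a known document

features, labels = parse_content(sample_content)
links = parse_links(sample_links, features.keys())
print(len(features), "documents,", len(links), "links")
```

Dropping links whose endpoints are missing from the content file is a design choice of this sketch, not something the dataset descriptions require; it keeps the edge list consistent with the feature table.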

Social Network Datasets:

Terrorists: This dataset contains information about terrorists and their relationships. Unlike the previous datasets, it was designed for experiments that classify the relationships among terrorists. The dataset contains 851 relationships, each described by a 0/1-valued attribute vector in which each entry indicates the absence/presence of a feature. There are 1224 distinct features in total. Each relationship can be assigned one or more of up to four labels, which makes this dataset suitable for multi-label classification tasks. The README file provides more details.
Terrorist Attacks: This dataset consists of 1293 terrorist attacks, each assigned one of six labels indicating the type of attack. Each attack is described by a 0/1-valued attribute vector whose entries indicate the absence/presence of a feature. There are 106 distinct features in total. The files in the dataset can be used to create two distinct graphs. The README file in the dataset provides more details.
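In the Terrorists dataset, one relationship can carry several of the four labels, so a single class index per example is not enough. One common encoding, sketched below in Python, is a 0/1 indicator vector with one position per label. The label names and the helper function are hypothetical placeholders; the actual label names are given in the dataset's README.

```python
from typing import Iterable, List, Sequence


def to_indicator(assigned: Iterable[str], label_set: Sequence[str]) -> List[int]:
    """Return a 0/1 vector with a 1 at each position whose label is assigned."""
    assigned_set = set(assigned)
    unknown = assigned_set - set(label_set)
    if unknown:
        raise ValueError(f"unknown labels: {sorted(unknown)}")
    return [1 if name in assigned_set else 0 for name in label_set]


# Hypothetical label names standing in for the four labels of the dataset.
LABELS = ["label_a", "label_b", "label_c", "label_d"]

# A relationship that carries two of the four labels.
row = to_indicator(["label_c", "label_a"], LABELS)
print(row)
```

With this encoding, each of the four positions can be treated as its own binary classification target, which is one simple way to approach a multi-label task.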

More: http://www.cs.umd.edu/~sen/lbc-proj/LBC.html

Time: 2024-10-30 20:56:43
