Converting a Bunch to an HDF5 File: Efficiently Storing CIFAR and Similar Datasets

For how to package a dataset as a Bunch, see my earlier post 《关于『AI 专属数据库的定制』的改进》 (an improvement on building a custom database for AI).

PyTables combines Python with the HDF5 database/file standard. It is designed specifically to optimize I/O performance and make the most of the available hardware, and it also supports compression.
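
If you have not used PyTables before, the following minimal sketch shows its compressed-storage API; the file name demo.h5 and the random data are placeholders for illustration only, not part of the original workflow:

import numpy as np
import tables as tb

# Random uint8 "images", purely for demonstration
data = np.random.randint(0, 256, size=(1000, 32, 32, 3), dtype='uint8')

filters = tb.Filters(complevel=7, complib='zlib')  # zlib compression, level 7
with tb.open_file('demo.h5', 'w', filters=filters) as h5:
    # CArray is chunked, so the compression filters actually take effect
    h5.create_carray('/', 'images', obj=data, title='demo images')

with tb.open_file('demo.h5', 'r') as h5:
    batch = h5.root.images[:10]  # slice it like a NumPy array
    print(batch.shape)           # (10, 32, 32, 3)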

All of the code below was run in a Jupyter Notebook:

import sys
sys.path.append('E:/xinlib')
from base.filez import DataBunch
import tables as tb
import numpy as np

def bunch2hdf5(root):
    '''
    Only Cifar10, Cifar100, MNIST and Fashion MNIST are packaged here;
    you can add more datasets yourself.
    '''
    db = DataBunch(root)
    filters = tb.Filters(complevel=7, shuffle=False)
    # A compressed file is used here, hence the `.h5c` suffix, but `.h5` works just as well
    with tb.open_file(f'{root}X.h5c', 'w', filters=filters, title="Xinet's dataset") as h5:
        for name in db.keys():
            h5.create_group('/', name, title=f'{db[name].url}')
            if name != 'cifar100':
                h5.create_array(h5.root[name], 'trainX', db[name].trainX, title='训练数据')
                h5.create_array(h5.root[name], 'trainY', db[name].trainY, title='训练标签')
                h5.create_array(h5.root[name], 'testX', db[name].testX, title='测试数据')
                h5.create_array(h5.root[name], 'testY', db[name].testY, title='测试标签')
            else:
                h5.create_array(h5.root[name], 'trainX', db[name].trainX, title='训练数据')
                h5.create_array(h5.root[name], 'testX', db[name].testX, title='测试数据')
                h5.create_array(h5.root[name], 'train_coarse_labels', db[name].train_coarse_labels, title='超类训练标签')
                h5.create_array(h5.root[name], 'test_coarse_labels', db[name].test_coarse_labels, title='超类测试标签')
                h5.create_array(h5.root[name], 'train_fine_labels', db[name].train_fine_labels, title='子类训练标签')
                h5.create_array(h5.root[name], 'test_fine_labels', db[name].test_fine_labels, title='子类测试标签')

        for k in ['cifar10', 'cifar100']:
            for name in db[k].meta.keys():
                name = name.decode()
                if name.endswith('names'):
                    label_names = np.asanyarray([label_name.decode() for label_name in db[k].meta[name.encode()]])
                    h5.create_array(h5.root[k], name, label_names, title='标签名称')

Now run the Bunch-to-HDF5 conversion:

root = 'E:/Data/Zip/'
bunch2hdf5(root)
h5c = tb.open_file('E:/Data/Zip/X.h5c')
h5c
File(filename=E:/Data/Zip/X.h5c, title="Xinet's dataset", mode='r', root_uep='/', filters=Filters(complevel=7, complib='zlib', shuffle=False, bitshuffle=False, fletcher32=False, least_significant_digit=None))
/ (RootGroup) "Xinet's dataset"
/cifar10 (Group) 'https://www.cs.toronto.edu/~kriz/cifar.html'
/cifar10/label_names (Array(10,)) '标签名称'
  atom := StringAtom(itemsize=10, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar10/testX (Array(10000, 32, 32, 3)) '测试数据'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar10/testY (Array(10000,)) '测试标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar10/trainX (Array(50000, 32, 32, 3)) '训练数据'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar10/trainY (Array(50000,)) '训练标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar100 (Group) 'https://www.cs.toronto.edu/~kriz/cifar.html'
/cifar100/coarse_label_names (Array(20,)) '标签名称'
  atom := StringAtom(itemsize=30, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar100/fine_label_names (Array(100,)) '标签名称'
  atom := StringAtom(itemsize=13, shape=(), dflt=b'')
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar100/testX (Array(10000, 32, 32, 3)) '测试数据'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar100/test_coarse_labels (Array(10000,)) '超类测试标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar100/test_fine_labels (Array(10000,)) '子类测试标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar100/trainX (Array(50000, 32, 32, 3)) '训练数据'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/cifar100/train_coarse_labels (Array(50000,)) '超类训练标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/cifar100/train_fine_labels (Array(50000,)) '子类训练标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/fashion_mnist (Group) 'https://github.com/zalandoresearch/fashion-mnist'
/fashion_mnist/testX (Array(10000, 28, 28, 1)) '测试数据'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/fashion_mnist/testY (Array(10000,)) '测试标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/fashion_mnist/trainX (Array(60000, 28, 28, 1)) '训练数据'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/fashion_mnist/trainY (Array(60000,)) '训练标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/mnist (Group) 'http://yann.lecun.com/exdb/mnist'
/mnist/testX (Array(10000, 28, 28, 1)) '测试数据'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/mnist/testY (Array(10000,)) '测试标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None
/mnist/trainX (Array(60000, 28, 28, 1)) '训练数据'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None
/mnist/trainY (Array(60000,)) '训练标签'
  atom := Int32Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'little'
  chunkshape := None

As the structure above shows, CIFAR-10, CIFAR-100, MNIST, and Fashion MNIST have all been packaged, and each comes with its own metadata, such as the label names and the numeric features (stored as arrays).
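
To inspect this hierarchy programmatically rather than relying on the printed repr, PyTables can walk the tree. A short sketch reusing the h5c handle opened above:

# Print every array node with its path and shape
for node in h5c.walk_nodes('/', classname='Array'):
    print(node._v_pathname, node.shape)

# Or list just the groups (one per dataset)
for group in h5c.walk_groups('/'):
    print(group._v_pathname)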

%%time
arr = h5c.root.cifar100.trainX.read()  # reading the data is very fast
Wall time: 125 ms
arr.shape
(50000, 32, 32, 3)
h5c.root
/ (RootGroup) "Xinet's dataset"
  children := ['cifar10' (Group), 'cifar100' (Group), 'fashion_mnist' (Group), 'mnist' (Group)]

How to Use X.h5c

Below we use CIFAR-100 as an example to demonstrate our home-made dataset file X.h5c. I have uploaded it to Baidu Cloud (link: https://pan.baidu.com/s/1nzaicwHmFZH9Xgf2foSw6Q, password: bl2e), so you can download it and use it directly; you can also generate it yourself, which I recommend, since doing so deepens your understanding of the data.
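
If you downloaded X.h5c rather than generating it, just open it read-only first; a tiny sketch (the path below is only an example; point it at wherever you saved the file):

import tables as tb

# Example path only: adjust it to your local copy of X.h5c
h5c = tb.open_file('E:/Data/Zip/X.h5c', 'r')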

cifar100 = h5c.root.cifar100
cifar100
/cifar100 (Group) 'https://www.cs.toronto.edu/~kriz/cifar.html'
  children := ['coarse_label_names' (Array), 'fine_label_names' (Array), 'testX' (Array), 'test_coarse_labels' (Array), 'test_fine_labels' (Array), 'trainX' (Array), 'train_coarse_labels' (Array), 'train_fine_labels' (Array)]

'coarse_label_names' holds the coarse-grained (superclass) label names, while 'fine_label_names' holds the fine-grained label names.

The data can be fetched either with the read() method or by indexing.

coarse_label_names = cifar100.coarse_label_names[:]
# or
coarse_label_names = cifar100.coarse_label_names.read()
coarse_label_names.astype('str')
array(['aquatic_mammals', 'fish', 'flowers', 'food_containers',
       'fruit_and_vegetables', 'household_electrical_devices',
       'household_furniture', 'insects', 'large_carnivores',
       'large_man-made_outdoor_things', 'large_natural_outdoor_scenes',
       'large_omnivores_and_herbivores', 'medium_mammals',
       'non-insect_invertebrates', 'people', 'reptiles', 'small_mammals',
       'trees', 'vehicles_1', 'vehicles_2'], dtype='<U30')
fine_label_names = cifar100.fine_label_names[:].astype('str')
fine_label_names
array(['apple', 'aquarium_fish', 'baby', 'bear', 'beaver', 'bed', 'bee',
       'beetle', 'bicycle', 'bottle', 'bowl', 'boy', 'bridge', 'bus',
       'butterfly', 'camel', 'can', 'castle', 'caterpillar', 'cattle',
       'chair', 'chimpanzee', 'clock', 'cloud', 'cockroach', 'couch',
       'crab', 'crocodile', 'cup', 'dinosaur', 'dolphin', 'elephant',
       'flatfish', 'forest', 'fox', 'girl', 'hamster', 'house',
       'kangaroo', 'keyboard', 'lamp', 'lawn_mower', 'leopard', 'lion',
       'lizard', 'lobster', 'man', 'maple_tree', 'motorcycle', 'mountain',
       'mouse', 'mushroom', 'oak_tree', 'orange', 'orchid', 'otter',
       'palm_tree', 'pear', 'pickup_truck', 'pine_tree', 'plain', 'plate',
       'poppy', 'porcupine', 'possum', 'rabbit', 'raccoon', 'ray', 'road',
       'rocket', 'rose', 'sea', 'seal', 'shark', 'shrew', 'skunk',
       'skyscraper', 'snail', 'snake', 'spider', 'squirrel', 'streetcar',
       'sunflower', 'sweet_pepper', 'table', 'tank', 'telephone',
       'television', 'tiger', 'tractor', 'train', 'trout', 'tulip',
       'turtle', 'wardrobe', 'whale', 'willow_tree', 'wolf', 'woman',
       'worm'], dtype='<U13')

'testX' and 'trainX' hold the test data and the training data respectively; the other nodes follow the same naming pattern.

For example, let's take a look at the training data and its labels:

trainX = cifar100.trainX
train_coarse_labels = cifar100.train_coarse_labels
train_coarse_labels[:]
array([11, 15,  4, ...,  8,  7,  1])
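
These integers index into coarse_label_names, so the two can be combined directly. A small sketch reusing the arrays read above (given the labels 11, 15, 4 shown, the first three names should be those printed in the comment):

# Map the first few coarse labels to their human-readable names
names = cifar100.coarse_label_names[:].astype('str')
labels = cifar100.train_coarse_labels[:]
print(names[labels[:3]])
# ['large_omnivores_and_herbivores' 'reptiles' 'fruit_and_vegetables']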

trainX has shape (50000, 32, 32, 3); as before, the data can be fetched either by indexing or with read():

train_data = trainX[:]
print(train_data[0].shape)
print(train_data.dtype)
(32, 32, 3)
uint8

Of course, we can also operate on trainX directly.

for x in cifar100.trainX:
    y = x * 2
    break

print(y.shape)
(32, 32, 3)
h5c.get_node(h5c.root.cifar100, 'trainX')
/cifar100/trainX (Array(50000, 32, 32, 3)) '训练数据'
  atom := UInt8Atom(shape=(), dflt=0)
  maindim := 0
  flavor := 'numpy'
  byteorder := 'irrelevant'
  chunkshape := None

Better still, we can define an iterator to fetch the data in batches:

trainX = cifar100.trainX
train_coarse_labels = cifar100.train_coarse_labels

def data_iter(X, Y, batch_size):
    n = X.nrows
    idx = np.arange(n)
    if X.name.startswith('train'):  # shuffle only the training set
        np.random.shuffle(idx)
    for i in range(0, n, batch_size):
        k = idx[i: min(n, i + batch_size)].tolist()
        yield np.take(X, k, 0), np.take(Y, k, 0)

for x, y in data_iter(trainX, train_coarse_labels, 8):
    print(x.shape, y)
    break
(8, 32, 32, 3) [ 7  7  0 15  4  8  8  3]
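
In practice each batch is usually preprocessed before being fed to a model. A minimal sketch; the float32 cast and the /255 scaling are a common convention for image data, not something X.h5c itself requires:

for x, y in data_iter(trainX, train_coarse_labels, 128):
    x = x.astype('float32') / 255  # scale pixel values to [0, 1]
    # ... feed (x, y) into your training step here ...
    break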

Original post (in Chinese): https://www.cnblogs.com/q735613050/p/9244223.html

? 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 //UI界面的布局 文件<br><LinearLayout xmlns:android="http://schemas.android.com/apk/res/android"     xmlns:tools="http://schemas.android.com/tools"