New packages for reading data into R — fast

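Good news, everyone! On April 10, 2015, Hadley Wickham (author of the well-known ggplot2 and plyr packages, among others) and the RStudio team released two new packages: readr, for reading text data into R, and readxl, for reading Excel spreadsheets. R already has plenty of functions for reading data, such as the read.table family and its many variants, so why build two more packages? The reason is simple: they read data faster than R's built-in functions. Much faster! Don't believe it? Let's try it and see. If you mostly read small data sets you may not notice, but once the data gets large, speed becomes a real advantage. Enough talk, let's dig in!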

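1) readr examples

The readr package provides several functions for reading tabular/text data into R, adding extra functionality and, above all, speed. This job used to be done with the read.table family of functions; now it gets a lot easier.

First, let's look at readr's headline function, read_table, which takes over the job of read.table. The key point is that it is faster. Remember: speed is a major reason this package exists, probably pushed along by the big-data trend. Let's run an experiment and have both functions read a file containing 4 million rows (data source: http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt ) to see what we find.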

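Step 1

Look at the data format: there are four columns, representing the day, the month, the year, and a numeric value.

Step 2

Open R, run the following commands, and compare their run times.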

> library(readr)
> system.time(read_table(file = 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt', col_names = c('DAY','MONTH','YEAR','TEMP')))
   user  system elapsed
   3.30   11.06   14.43

> system.time(read.table(file = 'http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt', col.names = c('DAY','MONTH','YEAR','TEMP')))
   user  system elapsed
   1.92    1.62   96.10

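The two commands look alike, but read.table took about 96.1 seconds while read_table finished in under 15 seconds (that may be down to my clunky computer; the official figures are about 30 seconds for the former and under one second for the latter. No contest!). Why the difference? read_table treats the data as a fixed-format file and processes it quickly in C++ under the hood. By contrast, read.table accepts any number of spaces between columns, whereas read_table expects every column to line up neatly, with no entry sticking out of its column. That said, in actual use the restriction turned out not to be that strict.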
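To make the "neatly aligned columns" idea concrete, here is a minimal sketch that reads the same small file with both functions. The three data rows are made-up values written to a temporary file, not the real weather data, and the readr part only runs if readr is installed.

```r
# Write a small, neatly aligned whitespace-separated file (made-up values).
path <- tempfile(fileext = ".txt")
writeLines(c(
  "1  1  1995  44.0",
  "1  2  1995  41.8",
  "1  3  1995  28.1"
), path)

cols <- c("DAY", "MONTH", "YEAR", "TEMP")

# Base R: any amount of whitespace between columns is accepted.
base_df <- read.table(path, col.names = cols)
print(base_df)

# readr: same file, same column names (skipped if readr is not installed).
if (requireNamespace("readr", quietly = TRUE)) {
  readr_df <- readr::read_table(path, col_names = cols)
  print(readr_df)
  # Check that both readers parsed the same temperatures.
  print(isTRUE(all.equal(readr_df$TEMP, base_df$TEMP)))
}
```

Note the naming difference that is easy to trip over: base R uses col.names, readr uses col_names.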

Base R also has a function for reading fixed-width data, read.fwf. Compare it below with readr's read_fwf and watch readr work its magic once more!

> system.time(dat <- read_fwf('http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt',
+   fwf_widths(c(3,15,16,12),
+   col_names = c("DAY","MONTH","YEAR","TEMP"))))
   user  system elapsed
   0.67    1.70    2.40

> system.time(dat2 <- read.fwf('http://academic.udayton.edu/kissock/http/Weather/gsod95-current/NYNEWYOR.txt', c(3,15,16,12),
+   col.names = c("DAY","MONTH","YEAR","TEMP")))
   user  system elapsed
   0.73    0.49   89.03

See? One comparison and you know how good readr is!
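One caveat when repeating this experiment: the commands above read straight from a URL, so the elapsed time includes the download. In the read.fwf run, user plus system time adds up to only about 1.2 of the 89 seconds elapsed, which suggests most of the wait was not parsing. A fairer comparison times both functions on a local file. The sketch below assumes a synthetic fixed-width file with made-up values and column widths of 3, 3, 5, and 6 chosen for illustration (not the widths of the real file); the readr part runs only if readr is installed.

```r
# Build a synthetic fixed-width file: four fields of widths 3, 3, 5, 6.
n <- 20000
lines <- sprintf("%3d%3d%5d%6.1f",
                 rep(1:28, length.out = n),   # DAY
                 rep(1:12, length.out = n),   # MONTH
                 rep(1995L, n),               # YEAR
                 runif(n, 20, 90))            # TEMP (random, made up)
path <- tempfile(fileext = ".txt")
writeLines(lines, path)

widths <- c(3, 3, 5, 6)
cols <- c("DAY", "MONTH", "YEAR", "TEMP")

# Time the base R reader on the local file.
base_time <- system.time(
  dat_base <- read.fwf(path, widths, col.names = cols)
)
print(base_time)

# Time the readr reader on the same file (skipped if readr is not installed).
if (requireNamespace("readr", quietly = TRUE)) {
  readr_time <- system.time(
    dat_readr <- readr::read_fwf(path, readr::fwf_widths(widths, col_names = cols))
  )
  print(readr_time)
}
```

Increase n to see how the gap changes as the file grows; the exact numbers will depend on your machine.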

Of course, that was just one simple example. The other functions in readr include:

readr::read_csv      Read a delimited file into a data frame.

readr::read_file     Read a file into a string.

readr::read_fwf      Read a fixed width file.

readr::read_lines    Read lines from a file or string.

readr::read_log      Read common/combined log file.

readr::read_table    Read text file where columns are separated by whitespace.
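As a quick taste of three functions from this list, here is a small sketch that reads the same file as a data frame, as a vector of lines, and as a single string. The CSV content is made up for illustration, and the readr calls run only if readr is installed.

```r
# Create a tiny CSV file with made-up contents.
path <- tempfile(fileext = ".csv")
writeLines(c("city,temp", "New York,44.0", "Boston,39.5"), path)

if (requireNamespace("readr", quietly = TRUE)) {
  df <- readr::read_csv(path)           # parsed into a data frame
  raw_lines <- readr::read_lines(path)  # one character element per line
  whole <- readr::read_file(path)       # the entire file as one string

  print(df)
  print(raw_lines)
  print(nchar(whole))
}
```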

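2) readxl examples

For data in Excel format there is the readxl package, which reads Excel spreadsheets with the .xls and .xlsx extensions.

Note that readxl is hosted at https://github.com/hadley/readxl , so when installing it you need to point the installer at the readxl repository on GitHub.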

> install.packages("devtools")  # install devtools first; it makes installing readxl quick
> library(devtools)
> devtools::install_github("hadley/readxl")

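For now, the main function readxl provides is read_excel, which is called like this:

read_excel(spreadsheet, sheet = 1, na, ...)

The usage is clear at a glance, so I won't ramble on. If you're interested, go and explore it yourself!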
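For anyone who wants a starting point, here is a minimal sketch of a read_excel call. It assumes a recent readxl release, which bundles sample workbooks reachable through readxl_example(); the choice of datasets.xlsx and of an empty string as the missing-value marker are illustrative assumptions, and the code is skipped if readxl is not installed.

```r
have_readxl <- requireNamespace("readxl", quietly = TRUE)

if (have_readxl) {
  # Path to a sample workbook shipped with readxl (assumes a recent version).
  xlsx_path <- readxl::readxl_example("datasets.xlsx")

  # List the sheets available in the workbook.
  print(readxl::excel_sheets(xlsx_path))

  # Read the first sheet, treating empty cells as missing values.
  first_sheet <- readxl::read_excel(xlsx_path, sheet = 1, na = "")
  print(head(first_sheet))
}
```

The sheet argument also accepts a sheet name as a string instead of a position.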

Time: 2024-10-20 01:56:50
