Accessing data in Hadoop using dplyr and SQL

If your primary objective is to query your data in Hadoop to browse, manipulate, and extract it into R, then you probably want to use SQL. You can write SQL code explicitly to interact with Hadoop, or you can write SQL code implicitly with dplyr. The dplyr package has a generalized backend for data sources that translates your R code into SQL. You can use RStudio and dplyr to work with several of the most popular software packages in the Hadoop ecosystem, including Hive, Impala, HBase, and Spark.
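As a rough illustration of the implicit route, the sketch below shows the SQL that dplyr generates for a small pipeline. It assumes the dbplyr package is installed and uses its simulated Hive backend, so no cluster is needed. The table stub and its columns (carrier, dep_delay) are hypothetical and exist only for this example.

```r
library(dplyr)
library(dbplyr)

# Hypothetical table stub: no real connection, only column names and types.
# simulate_hive() asks dbplyr to generate Hive-flavoured SQL.
flights <- lazy_frame(
  carrier = character(),
  dep_delay = numeric(),
  con = simulate_hive()
)

# An ordinary dplyr pipeline; nothing is executed, it only builds a query.
query <- flights %>%
  filter(dep_delay > 60) %>%
  group_by(carrier) %>%
  summarise(n = n())

# Print the SQL that dplyr would send to the data source.
show_query(query)
```

The exact SQL text depends on the dbplyr version and the backend, so treat the printed query as indicative rather than fixed.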

There are two methods for accessing data in Hadoop using dplyr and SQL.

ODBC

You can connect R and RStudio to Hadoop with an ODBC connection. This effectively treats Hadoop like any other data source (i.e., as if Hadoop were a relational database). You will need a driver specific to your data source (e.g., Hive, Impala, HBase) installed on your desktop or your server. You will also need a few R packages. We recommend using these R packages: DBI, dplyr, and odbc. Note that the dplyr package may also reference the dbplyr package to help translate R into specific variants of SQL. You can use the odbc package to create a connection with Hadoop and run queries:

library(DBI)
library(dplyr)
library(odbc)

# Replace each "<...>" placeholder with the values for your own cluster
con <- dbConnect(odbc::odbc(),
                 driver = "<driver>",
                 host = "<host>",
                 dbname = "<dbname>",
                 user = "<user>",
                 password = "<password>",
                 port = 10000)

tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL

dbDisconnect(con)
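
The tbl() and dbGetQuery() pattern is the same for any DBI connection, so the step of extracting results into R can be sketched without a cluster. The example below is only a stand-in: it assumes the RSQLite and dbplyr packages are installed and uses an in-memory SQLite database in place of a Hadoop ODBC source. The table contents are made up for illustration.

```r
library(DBI)
library(dplyr)

# Stand-in connection (assumption): in-memory SQLite instead of Hadoop via ODBC.
con <- dbConnect(RSQLite::SQLite(), ":memory:")

# Made-up sample data so that "mytable" exists for the example.
dbWriteTable(con, "mytable",
             data.frame(id = 1:5, value = c(10, 20, 30, 40, 50)))

# dplyr route: the filter runs in the database; collect() pulls the rows into R.
big_values <- tbl(con, "mytable") %>%
  filter(value > 25) %>%
  collect()

# SQL route: the equivalent query written explicitly.
big_values_sql <- dbGetQuery(con, "SELECT * FROM mytable WHERE value > 25")

dbDisconnect(con)
```

With a real ODBC connection, only the dbConnect() call should need to change. Filtering before collect() keeps the transfer into R small, which matters more as the tables get larger.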

Spark

If you are running Spark on Hadoop, you may also elect to use the sparklyr package to access your data in HDFS. Spark is a general engine for large-scale data processing, and it supports SQL. The sparklyr package communicates with the Spark API to run SQL queries, and it also has a dplyr backend. You can use sparklyr to create a connection with Spark and run queries:

library(DBI)
library(dplyr)
library(sparklyr)

con <- spark_connect(master = "yarn-client")

tbl(con, "mytable") # dplyr
dbGetQuery(con, "SELECT * FROM mytable") # SQL

spark_disconnect(con)

Reposted from: https://support.rstudio.com/hc/en-us/articles/115008241668-Accessing-data-in-Hadoop-using-dplyr-and-SQL

Original address: https://www.cnblogs.com/payton/p/8758893.html

Time: 2024-08-28 23:30:14
