Simple Example Use Cases: a translation of the examples in the Hive GettingStarted guide

1. MovieLens User Ratings

First, create a table stored as a tab-delimited text file:


CREATE TABLE u_data (
  userid INT,
  movieid INT,
  rating INT,
  unixtime STRING)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t'
STORED AS TEXTFILE;
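To make the row format concrete, here is a minimal sketch of how one tab-delimited line lines up with the four columns declared above. Hive does this parsing itself when it reads the table; the `Rating` type, the `parse_row` helper, and the sample row are illustrative names and values, not part of Hive.

```python
from typing import NamedTuple


class Rating(NamedTuple):
    """One row of u_data, with the column types declared in CREATE TABLE."""
    userid: int
    movieid: int
    rating: int
    unixtime: str  # declared STRING in the table, so it is kept as text


def parse_row(line: str) -> Rating:
    """Split one tab-delimited line into the four u_data columns."""
    userid, movieid, rating, unixtime = line.rstrip("\n").split("\t")
    return Rating(int(userid), int(movieid), int(rating), unixtime)


# Illustrative row in the u.data layout (assumed here, not read from the file).
sample = "196\t242\t3\t881250949\n"
print(parse_row(sample))
```

A line that does not split into exactly four fields raises `ValueError` in this sketch, which is a quick way to spot a file that is not tab-delimited before loading it.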

Then download the data files from MovieLens 100k on the GroupLens datasets page (which also has a README.txt file and an index of the unzipped files), using either of the commands below:


wget http://files.grouplens.org/datasets/movielens/ml-100k.zip

or:

curl --remote-name http://files.grouplens.org/datasets/movielens/ml-100k.zip

Note: If the link to the GroupLens datasets does not work, please report it on HIVE-5341 or send a message to the Hive user mailing list.

Unzip the data files:


unzip ml-100k.zip

Then load u.data into the u_data table that was just created:


LOAD DATA LOCAL INPATH '<path>/u.data' OVERWRITE INTO TABLE u_data;

Count the number of rows in table u_data:


SELECT COUNT(*) FROM u_data;

Note that for older versions of Hive which don't include HIVE-287, you'll need to use COUNT(1) in place of COUNT(*).
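One way to sanity-check the result is to count the lines of the local file before loading it: with one rating per line, the two numbers should agree. The sketch below assumes exactly that layout; `count_rows` is a hypothetical helper, and the demo writes a small temporary file with made-up rows rather than touching the real u.data.

```python
import os
import tempfile


def count_rows(path: str) -> int:
    """Count the lines in a text file, one row per line."""
    with open(path, "r", encoding="utf-8") as f:
        return sum(1 for _ in f)


# Demo on a throwaway two-row file; the rows are illustrative only.
with tempfile.NamedTemporaryFile(
    "w", suffix=".data", delete=False, encoding="utf-8"
) as tmp:
    tmp.write("196\t242\t3\t881250949\n186\t302\t3\t891717742\n")
    demo_path = tmp.name

demo_count = count_rows(demo_path)
os.remove(demo_path)
print(demo_count)
```

For the real file, call `count_rows` with the same path used in the LOAD DATA statement and compare it with the output of the COUNT query.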

Now we can do some more complex data analysis on table u_data:


Create weekday_mapper.py:


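import sys
import datetime

# Read tab-delimited rows from stdin and replace the Unix timestamp with its weekday.
for line in sys.stdin:
    line = line.strip()
    userid, movieid, rating, unixtime = line.split('\t')
    # isoweekday() returns 1 (Monday) through 7 (Sunday), in the machine's local time zone.
    weekday = datetime.datetime.fromtimestamp(float(unixtime)).isoweekday()
    print('\t'.join([userid, movieid, rating, str(weekday)]))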
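Before wiring the script into Hive, it can help to check the per-row logic on its own. The sketch below repeats the same transformation as a function so it can be called directly; `map_line` is a hypothetical name, the sample row is illustrative, and the time zone is pinned to UTC (an assumption made here for reproducibility, whereas the script above uses the machine's local time, so results can differ for timestamps near midnight).

```python
from datetime import datetime, timezone


def map_line(line: str) -> str:
    """Turn 'userid<TAB>movieid<TAB>rating<TAB>unixtime' into the same row with a weekday."""
    userid, movieid, rating, unixtime = line.strip().split("\t")
    # UTC is pinned here so the result is the same on every machine (assumption).
    moment = datetime.fromtimestamp(float(unixtime), tz=timezone.utc)
    weekday = moment.isoweekday()  # 1 = Monday ... 7 = Sunday
    return "\t".join([userid, movieid, rating, str(weekday)])


# 881250949 falls on Thursday, 4 December 1997 in UTC, so the weekday is 4.
print(map_line("196\t242\t3\t881250949\n"))
```

The script itself can be exercised the same way from a shell by piping a few lines of u.data into it and reading the tab-delimited output.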

Use the mapper script:


CREATE TABLE u_data_new (
  userid INT,
  movieid INT,
  rating INT,
  weekday INT)
ROW FORMAT DELIMITED FIELDS TERMINATED BY '\t';

add FILE weekday_mapper.py;

INSERT OVERWRITE TABLE u_data_new
SELECT
  TRANSFORM (userid, movieid, rating, unixtime)
  USING 'python weekday_mapper.py'
  AS (userid, movieid, rating, weekday)
FROM u_data;

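Explanation: the Python script transforms the rows of table u_data, and Hive invokes it through the TRANSFORM clause:

TRANSFORM (userid, movieid, rating, unixtime)   -- input columns
USING 'python weekday_mapper.py'                -- script that processes each row
AS (userid, movieid, rating, weekday)           -- output columns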


SELECT weekday, COUNT(*)
FROM u_data_new
GROUP BY weekday;

Note that if you're using Hive 0.5.0 or earlier you will need to use COUNT(1) in place of COUNT(*).
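For readers who want to see what the GROUP BY query computes, here is a rough in-memory equivalent: it tallies rows by the fourth (weekday) column of u_data_new. `count_by_weekday` is a hypothetical helper and the sample rows are made up. The result is sorted only for readability; the Hive query has no ORDER BY, so its output order is not guaranteed.

```python
from collections import Counter
from typing import Dict, Iterable


def count_by_weekday(lines: Iterable[str]) -> Dict[int, int]:
    """Count rows per weekday, where weekday is the fourth tab-delimited field."""
    counts: Counter = Counter()
    for line in lines:
        weekday = line.rstrip("\n").split("\t")[3]
        counts[int(weekday)] += 1
    return dict(sorted(counts.items()))


# Made-up rows in the u_data_new layout: userid, movieid, rating, weekday.
sample_rows = ["196\t242\t3\t4", "186\t302\t3\t6", "22\t377\t1\t4"]
print(count_by_weekday(sample_rows))
```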

2. Apache Weblog Data

The Apache weblog format is customizable, but most webmasters use the default.
For the default format, the following command creates a matching table.

More about RegexSerDe can be found in HIVE-662 and HIVE-1719.


CREATE TABLE apachelog (
  host STRING,
  identity STRING,
  user STRING,
  time STRING,
  request STRING,
  status STRING,
  size STRING,
  referer STRING,
  agent STRING)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.RegexSerDe'
WITH SERDEPROPERTIES (
  "input.regex" = "([^ ]*) ([^ ]*) ([^ ]*) (-|\\[[^\\]]*\\]) ([^ \"]*|\"[^\"]*\") (-|[0-9]*) (-|[0-9]*)(?: ([^ \"]*|\".*\") ([^ \"]*|\".*\"))?"
)
STORED AS TEXTFILE;
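A regular expression like this is easier to trust after trying it on a real-looking line. The sketch below applies the pattern with Python's `re` module, which agrees with Java's regex syntax for the constructs used here, and maps the nine capture groups to the nine columns in order. Some assumptions: Hive's doubled backslashes are collapsed to raw-string form, the timestamp group is written as `\[[^\]]*\]` (a bracketed run of non-`]` characters), the whole line must match, and `parse_log_line` plus the sample log line are illustrative rather than taken from Hive.

```python
import re
from typing import Dict, Optional

# Column names in the same order as the capture groups.
COLUMNS = ["host", "identity", "user", "time", "request",
           "status", "size", "referer", "agent"]

LOG_PATTERN = re.compile(
    r'([^ ]*) ([^ ]*) ([^ ]*) (-|\[[^\]]*\]) ([^ "]*|"[^"]*") '
    r'(-|[0-9]*) (-|[0-9]*)(?: ([^ "]*|".*") ([^ "]*|".*"))?'
)


def parse_log_line(line: str) -> Optional[Dict[str, Optional[str]]]:
    """Return a column-to-value mapping, or None if the line does not match."""
    match = LOG_PATTERN.fullmatch(line.rstrip("\n"))
    if match is None:
        return None
    return dict(zip(COLUMNS, match.groups()))


# Illustrative access-log line with referer and user agent.
sample_line = (
    '127.0.0.1 - frank [10/Oct/2000:13:55:36 -0700] '
    '"GET /apache_pb.gif HTTP/1.0" 200 2326 '
    '"http://www.example.com/start.html" "Mozilla/4.08 [en] (Win98; I ;Nav)"'
)
row = parse_log_line(sample_line)
print(row)
```

Note that the captured values keep their surrounding brackets and quotes (for example, `time` includes the square brackets), and that a line without the trailing referer and agent still matches, with those two values left as `None`.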

Time: 2024-10-12 13:49:13
