Using Eclipse on Win7 to Connect to Hadoop 2.4 on Ubuntu in a Virtual Machine (3)

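After a few days of study I can finally try my hand at writing some small programs. Before doing that, there are a few things to sort out:

What do I actually want to use Hadoop for?

Get a rough understanding of MapReduce.

After Ubuntu restarts, starting Hadoop again reports a connection exception.

Answers:

Refining data, exploring data, mining data.

map = chop up, reduce = merge: the map step breaks the input into key/value pairs, and the reduce step merges all values that share a key.

A restart clears the /tmp directory, and by default `hadoop.tmp.dir` points under /tmp, so the NameNode data stored there is lost. Add the following to core-site.xml (don't forget to create the directory; if you lack permission, create it as root and change its permissions to 777):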
```
<property>
    <name>hadoop.tmp.dir</name>
    <value>/usr/local/hadoop/tmp</value>
</property>
```
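The directory named in `hadoop.tmp.dir` has to exist before Hadoop can use it. Running `mkdir` and `chmod` as root in a shell is the usual way; as a minimal Java sketch of the same two steps (the class `TmpDirSketch` and its `prepare` helper are made-up names, and the code assumes a POSIX file system such as the one in the Ubuntu VM), it could look like this:

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.PosixFilePermission;
import java.nio.file.attribute.PosixFilePermissions;
import java.util.Set;

// Hypothetical helper, not part of Hadoop.
class TmpDirSketch {

    // Creates the directory (and any missing parents), sets mode 777,
    // and returns the permissions that are actually in effect afterwards.
    static Set<PosixFilePermission> prepare(Path dir) {
        try {
            Files.createDirectories(dir);
            // Set explicitly after creation so the process umask does not narrow the mode.
            Files.setPosixFilePermissions(dir, PosixFilePermissions.fromString("rwxrwxrwx"));
            return Files.getPosixFilePermissions(dir);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Pass /usr/local/hadoop/tmp as the argument on the Hadoop machine (needs root there).
        // Without an argument, a harmless directory under java.io.tmpdir is used for a dry run.
        Path dir = args.length > 0
                ? Paths.get(args[0])
                : Paths.get(System.getProperty("java.io.tmpdir"), "hadoop-tmp-demo");
        System.out.println(dir + " -> " + PosixFilePermissions.toString(prepare(dir)));
    }
}
```

Mode 777 mirrors the setting used in this post; on a shared machine a directory owned by the Hadoop user with a narrower mode would probably be the safer choice.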

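When it comes to big data, my first thought was how the data sitting in existing relational databases can be used together with Hadoop. I searched online and found that `DBInputFormat` is what people use, so let's write a small example that reads data from a database, processes it, and writes the result to a file.

The table is kept simple: `id` is an integer and `test` is a string. The requirement is very easy: count how many times each value of the TEST field occurs.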
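Before involving Hadoop at all, the requirement can be sketched in plain Java to make "map = chop up, reduce = merge" concrete. This is only an in-memory illustration: `CountSketch`, `mapPhase`, `reducePhase`, and the sample values are made up, and no Hadoop classes are used.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// In-memory illustration only; the real job runs these two steps on the cluster.
class CountSketch {

    // "Map": turn the test value of every row into a (value, 1) pair.
    static List<Map.Entry<String, Integer>> mapPhase(List<String> testValues) {
        List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
        for (String value : testValues) {
            pairs.add(new SimpleEntry<>(value, 1));
        }
        return pairs;
    }

    // "Reduce": add up the 1s that belong to the same key.
    static Map<String, Integer> reducePhase(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> counts = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return counts;
    }

    public static void main(String[] args) {
        // Hypothetical contents of the test column.
        List<String> rows = Arrays.asList("a", "b", "a", "c", "a");
        System.out.println(reducePhase(mapPhase(rows))); // prints {a=3, b=1, c=1}
    }
}
```

The Hadoop version does the same two steps, except that the pairs travel between machines and the grouping by key is done by the framework.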

Data reading class:

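```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;

import org.apache.hadoop.io.Writable;
import org.apache.hadoop.mapreduce.lib.db.DBWritable;

// One row of the table: Writable moves it between Hadoop tasks, DBWritable maps it to JDBC.
public class DBRecoder implements Writable, DBWritable {

	String test;
	int id;

	// Serialize the fields for transfer between tasks.
	@Override
	public void write(DataOutput out) throws IOException {
		out.writeUTF(test);
		out.writeInt(id);
	}

	// Deserialize in exactly the order used by write(DataOutput).
	@Override
	public void readFields(DataInput in) throws IOException {
		test = in.readUTF();
		id = in.readInt();
	}

	// Fill the fields from the current row of the query result.
	@Override
	public void readFields(ResultSet arg0) throws SQLException {
		test = arg0.getString("test");
		id = arg0.getInt("id");
	}

	// Only needed when writing back to a database; not used by this job.
	// Parameter order follows the field list { "id", "test" } used in the job class.
	@Override
	public void write(PreparedStatement arg0) throws SQLException {
		arg0.setInt(1, id);
		arg0.setString(2, test);
	}
}
```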
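The two `Writable` methods of the class above must read the fields in exactly the order they were written. A small round trip through an in-memory byte buffer shows the idea without needing the Hadoop jars; `RoundTripSketch` and its `Record` class are stand-ins invented for this sketch and do not implement Hadoop's `Writable` interface.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class RoundTripSketch {

    // Stand-in with the same two fields and the same method bodies as the record class.
    static class Record {
        String test;
        int id;

        void write(DataOutput out) throws IOException {
            out.writeUTF(test);
            out.writeInt(id);
        }

        void readFields(DataInput in) throws IOException {
            test = in.readUTF();
            id = in.readInt();
        }
    }

    // Serializes a record into bytes and rebuilds a fresh copy from those bytes.
    static Record roundTrip(Record original) {
        try {
            ByteArrayOutputStream buffer = new ByteArrayOutputStream();
            try (DataOutputStream out = new DataOutputStream(buffer)) {
                original.write(out);
            }
            Record copy = new Record();
            try (DataInputStream in = new DataInputStream(
                    new ByteArrayInputStream(buffer.toByteArray()))) {
                copy.readFields(in);
            }
            return copy;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Record r = new Record();
        r.test = "hello";
        r.id = 7;
        Record copy = roundTrip(r);
        System.out.println(copy.test + " " + copy.id); // prints hello 7
    }
}
```

Swapping the two lines in `readFields` would make the copy fail or come back with wrong values, which is why the order matters.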

MapReduce job class:

```java
import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.db.DBConfiguration;
import org.apache.hadoop.mapreduce.lib.db.DBInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class DataCountTest {

	// Map: emit (test, 1) for every row read from the database.
	public static class TokenizerMapper extends Mapper<LongWritable, DBRecoder, Text, IntWritable> {

		@Override
		public void map(LongWritable key, DBRecoder value, Context context) throws IOException, InterruptedException {
			context.write(new Text(value.test), new IntWritable(1));
		}
	}

	// Reduce: sum the 1s emitted for each distinct test value.
	public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {

		private IntWritable result = new IntWritable();

		@Override
		public void reduce(Text key, Iterable<IntWritable> values,
				Context context) throws IOException, InterruptedException {
			int sum = 0;
			for (IntWritable val : values) {
				sum += val.get();
			}
			result.set(sum);
			context.write(key, result);
		}
	}

	public static void main(String[] args) throws Exception {
		// Hard-coded HDFS output path so the job can be started from Eclipse without arguments.
		args = new String[1];
		args[0] = "hdfs://192.168.203.137:9000/user/chenph/output1111221";

		Configuration conf = new Configuration();
		// JDBC driver class, connection URL, user name, password.
		DBConfiguration.configureDB(conf, "oracle.jdbc.driver.OracleDriver",
				"jdbc:oracle:thin:@192.168.101.179:1521:orcl", "chenph", "chenph");
		String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();

		// This constructor is marked deprecated in Hadoop 2.x;
		// Job.getInstance(conf, "DB count") is the non-deprecated factory.
		Job job = new Job(conf, "DB count");
		job.setJarByClass(DataCountTest.class);
		job.setMapperClass(TokenizerMapper.class);
		job.setReducerClass(IntSumReducer.class);
		job.setOutputKeyClass(Text.class);
		job.setOutputValueClass(IntWritable.class);
		job.setMapOutputKeyClass(Text.class);
		job.setMapOutputValueClass(IntWritable.class);

		// Read columns id and test from table t1, no WHERE condition, ordered by id.
		String[] fields1 = { "id", "test" };
		DBInputFormat.setInput(job, DBRecoder.class, "t1", null, "id", fields1);

		FileOutputFormat.setOutputPath(job, new Path(otherArgs[0]));
		System.exit(job.waitForCompletion(true) ? 0 : 1);
	}
}
```
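The `"id"` argument in the `DBInputFormat.setInput` call is the ORDER BY column. A stable ordering is what allows the table to be cut into row ranges, one per map task, with each task running its own query. The sketch below is a simplified plain-Java model of that range splitting. It is an assumption about the general approach, not code taken from Hadoop, and `SplitSketch` and `splitRows` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

// Simplified model of dividing a table into row ranges; not Hadoop code.
class SplitSketch {

    // Returns one {start, end} pair per map task; end is exclusive.
    // The last task also takes the rows left over by the integer division.
    static List<long[]> splitRows(long rowCount, int chunks) {
        if (chunks <= 0) {
            throw new IllegalArgumentException("chunks must be positive");
        }
        List<long[]> ranges = new ArrayList<>();
        long chunkSize = rowCount / chunks;
        for (int i = 0; i < chunks; i++) {
            long start = i * chunkSize;
            long end = (i + 1 == chunks) ? rowCount : start + chunkSize;
            ranges.add(new long[] { start, end });
        }
        return ranges;
    }

    public static void main(String[] args) {
        // Hypothetical table of 10 rows read by 3 map tasks.
        List<long[]> ranges = splitRows(10, 3);
        for (int i = 0; i < ranges.size(); i++) {
            long[] r = ranges.get(i);
            System.out.println("task " + i + " reads rows [" + r[0] + ", " + r[1] + ")");
        }
        // prints:
        // task 0 reads rows [0, 3)
        // task 1 reads rows [3, 6)
        // task 2 reads rows [6, 10)
    }
}
```

Seen this way, every extra map task means one more query against the database at the same time, which is worth keeping in mind when deciding how many tasks to run.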

--------------------------------------------------------------------------------------------------

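Problems encountered during development:

The `Job` constructor (`new Job(conf, name)`) is flagged as deprecated. I had not found out what to use instead while writing this; the Hadoop 2.x API provides the static factory `Job.getInstance(conf, name)` for the same purpose.

Garbled characters: Hadoop assumes UTF-8 by default, so data that arrives as GBK has to be converted.

Examples of this kind are scarce online, and the ones that exist target old versions; I found no material for the new version. I pieced this example together from fragments, there is still a lot I do not fully understand, and I need to study the official documentation further.

While searching, I came across advice against handling real big-data problems this way: the concurrency is too high and can knock the database over in an instant. The usual approach is to export the data to text files first.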
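For the garbled-character problem mentioned above, the core of the fix is to decode the raw bytes with the charset they were really written in, instead of letting them be treated as UTF-8. A minimal plain-Java sketch follows; `GbkSketch` and its helpers are made-up names, the sample text is written with Unicode escapes so the source file's own encoding does not matter, and the code assumes the JDK in use ships the GBK charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

class GbkSketch {

    static final Charset GBK = Charset.forName("GBK");

    // Decode bytes that are known to be GBK into a Java String.
    static String decodeGbk(byte[] raw) {
        return new String(raw, GBK);
    }

    // Convert GBK bytes to UTF-8 bytes, the encoding Hadoop expects.
    static byte[] gbkToUtf8(byte[] raw) {
        return decodeGbk(raw).getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String original = "\u4e2d\u6587"; // two Chinese characters
        byte[] gbkBytes = original.getBytes(GBK);

        // Treating GBK bytes as UTF-8 produces garbage.
        String wrong = new String(gbkBytes, StandardCharsets.UTF_8);
        System.out.println(wrong.equals(original)); // prints false

        // Decoding with the right charset restores the text.
        System.out.println(decodeGbk(gbkBytes).equals(original)); // prints true

        // Same text, different byte lengths in the two encodings.
        System.out.println(gbkBytes.length + " -> " + gbkToUtf8(gbkBytes).length); // prints 4 -> 6
    }
}
```

Inside a mapper the same idea would presumably be applied to the bytes held by a `Text` value (via `getBytes()` and `getLength()`) rather than calling `toString()`, which decodes as UTF-8.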

Time: 2024-12-26 13:27:12
