How to use MapReduce's divide-and-conquer strategy to speed up the KNN algorithm

Cluster environment:

Hadoop 2.4.1, 64-bit
6 servers:
hadoop11   NameNode, SecondaryNameNode
hadoop22   ResourceManager
hadoop33   DataNode, NodeManager
hadoop44   DataNode, NodeManager
hadoop55   DataNode, NodeManager
hadoop66   DataNode, NodeManager

Experiment 1: the training set train.txt contains 245057 samples (3.24 MB) and the test set test.txt contains 51444 samples (640 KB); the entire test set is stored in test.txt.

[root@hadoop11 local]# hadoop fs -lsr /dir6/
-rw-r--r--   3 root supergroup    3400816 2016-07-17 19:28 /dir6/test.txt
Note: at this point the whole test set is stored in a single text file (test.txt), which serves as the job's input path.
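
Why a single small file leads to a single mapper: Hadoop's FileInputFormat creates one split per non-empty input file and only cuts a file into more splits when it is larger than the split size. The sketch below mirrors that rule in Python. It assumes the Hadoop 2.x default block size of 128 MB and default minimum and maximum split sizes, none of which are stated in this post; count_splits is a hypothetical helper, not a Hadoop API, and the file sizes are taken from the hadoop fs -lsr listings in the three experiments.

```python
BLOCK_SIZE = 128 * 1024 * 1024  # assumed HDFS block size (Hadoop 2.x default)
SPLIT_SLOP = 1.1                # the last split may run up to 10% over the split size


def count_splits(file_sizes, block_size=BLOCK_SIZE, min_size=1, max_size=2**63 - 1):
    """Estimate the number of input splits (one map task each) for non-empty files."""
    split_size = max(min_size, min(max_size, block_size))
    total = 0
    for size in file_sizes:
        remaining = size
        # Cut full-size splits while clearly more than one split's worth remains.
        while remaining / split_size > SPLIT_SLOP:
            total += 1
            remaining -= split_size
        # Whatever is left becomes the final split of this file.
        if remaining > 0:
            total += 1
    return total


print(count_splits([3400816]))                 # experiment 1: one file
print(count_splits([368774, 312210]))          # experiment 2: two files
print(count_splits([128161, 366313, 201566]))  # experiment 3: three files
```

Under these assumptions the counts come out as 1, 2, and 3, matching the "number of splits" lines in the three job logs. Splitting the file by hand is therefore one way to get more mappers; lowering mapreduce.input.fileinputformat.split.maxsize would be another, although the post does not try it.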

KNN algorithm run log:

16/07/17 19:32:24 INFO client.RMProxy: Connecting to ResourceManager at hadoop22/10.187.84.51:8032
16/07/17 19:32:25 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/17 19:32:25 INFO input.FileInputFormat: Total input paths to process : 1
16/07/17 19:32:25 INFO mapreduce.JobSubmitter: number of splits:1
16/07/17 19:32:26 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468752229715_0016
16/07/17 19:32:26 INFO impl.YarnClientImpl: Submitted application application_1468752229715_0016
16/07/17 19:32:26 INFO mapreduce.Job: The url to track the job: http://hadoop22:8088/proxy/application_1468752229715_0016/
16/07/17 19:32:26 INFO mapreduce.Job: Running job: job_1468752229715_0016
16/07/17 19:32:32 INFO mapreduce.Job: Job job_1468752229715_0016 running in uber mode : false
16/07/17 19:32:32 INFO mapreduce.Job:  map 0% reduce 0%
16/07/17 19:32:49 INFO mapreduce.Job:  map 1% reduce 0%
16/07/17 19:33:05 INFO mapreduce.Job:  map 2% reduce 0%
16/07/17 19:33:20 INFO mapreduce.Job:  map 3% reduce 0%
16/07/17 19:33:35 INFO mapreduce.Job:  map 4% reduce 0%
16/07/17 19:33:50 INFO mapreduce.Job:  map 5% reduce 0%
16/07/17 19:34:02 INFO mapreduce.Job:  map 6% reduce 0%
16/07/17 19:34:17 INFO mapreduce.Job:  map 7% reduce 0%
16/07/17 19:34:32 INFO mapreduce.Job:  map 8% reduce 0%
16/07/17 19:34:47 INFO mapreduce.Job:  map 9% reduce 0%
16/07/17 19:35:02 INFO mapreduce.Job:  map 10% reduce 0%
16/07/17 19:35:14 INFO mapreduce.Job:  map 11% reduce 0%
16/07/17 19:35:29 INFO mapreduce.Job:  map 12% reduce 0%
16/07/17 19:35:44 INFO mapreduce.Job:  map 13% reduce 0%
16/07/17 19:35:59 INFO mapreduce.Job:  map 14% reduce 0%
16/07/17 19:36:12 INFO mapreduce.Job:  map 15% reduce 0%
16/07/17 19:36:27 INFO mapreduce.Job:  map 16% reduce 0%
16/07/17 19:36:42 INFO mapreduce.Job:  map 17% reduce 0%
16/07/17 19:36:57 INFO mapreduce.Job:  map 18% reduce 0%
16/07/17 19:37:12 INFO mapreduce.Job:  map 19% reduce 0%
16/07/17 19:37:27 INFO mapreduce.Job:  map 20% reduce 0%
16/07/17 19:37:39 INFO mapreduce.Job:  map 21% reduce 0%
16/07/17 19:37:54 INFO mapreduce.Job:  map 22% reduce 0%
16/07/17 19:38:09 INFO mapreduce.Job:  map 23% reduce 0%
16/07/17 19:38:24 INFO mapreduce.Job:  map 24% reduce 0%
16/07/17 19:38:39 INFO mapreduce.Job:  map 25% reduce 0%
16/07/17 19:38:51 INFO mapreduce.Job:  map 26% reduce 0%
16/07/17 19:39:06 INFO mapreduce.Job:  map 27% reduce 0%
16/07/17 19:39:22 INFO mapreduce.Job:  map 28% reduce 0%
16/07/17 19:39:37 INFO mapreduce.Job:  map 29% reduce 0%
16/07/17 19:39:52 INFO mapreduce.Job:  map 30% reduce 0%
16/07/17 19:40:07 INFO mapreduce.Job:  map 31% reduce 0%
16/07/17 19:40:22 INFO mapreduce.Job:  map 32% reduce 0%
16/07/17 19:40:37 INFO mapreduce.Job:  map 33% reduce 0%
16/07/17 19:40:52 INFO mapreduce.Job:  map 34% reduce 0%
16/07/17 19:41:04 INFO mapreduce.Job:  map 35% reduce 0%
16/07/17 19:41:22 INFO mapreduce.Job:  map 36% reduce 0%
16/07/17 19:41:37 INFO mapreduce.Job:  map 37% reduce 0%
16/07/17 19:41:52 INFO mapreduce.Job:  map 38% reduce 0%
16/07/17 19:42:07 INFO mapreduce.Job:  map 39% reduce 0%
16/07/17 19:42:22 INFO mapreduce.Job:  map 40% reduce 0%
16/07/17 19:42:37 INFO mapreduce.Job:  map 41% reduce 0%
16/07/17 19:42:53 INFO mapreduce.Job:  map 42% reduce 0%
16/07/17 19:43:08 INFO mapreduce.Job:  map 43% reduce 0%
16/07/17 19:43:23 INFO mapreduce.Job:  map 44% reduce 0%
16/07/17 19:43:41 INFO mapreduce.Job:  map 45% reduce 0%
16/07/17 19:43:56 INFO mapreduce.Job:  map 46% reduce 0%
16/07/17 19:44:12 INFO mapreduce.Job:  map 47% reduce 0%
16/07/17 19:44:30 INFO mapreduce.Job:  map 48% reduce 0%
16/07/17 19:44:45 INFO mapreduce.Job:  map 49% reduce 0%
16/07/17 19:45:00 INFO mapreduce.Job:  map 50% reduce 0%
16/07/17 19:45:15 INFO mapreduce.Job:  map 51% reduce 0%
16/07/17 19:45:30 INFO mapreduce.Job:  map 52% reduce 0%
16/07/17 19:45:48 INFO mapreduce.Job:  map 53% reduce 0%
16/07/17 19:46:03 INFO mapreduce.Job:  map 54% reduce 0%
16/07/17 19:46:18 INFO mapreduce.Job:  map 55% reduce 0%
16/07/17 19:46:33 INFO mapreduce.Job:  map 56% reduce 0%
16/07/17 19:46:49 INFO mapreduce.Job:  map 57% reduce 0%
16/07/17 19:47:07 INFO mapreduce.Job:  map 58% reduce 0%
16/07/17 19:47:22 INFO mapreduce.Job:  map 59% reduce 0%
16/07/17 19:47:37 INFO mapreduce.Job:  map 60% reduce 0%
16/07/17 19:47:55 INFO mapreduce.Job:  map 61% reduce 0%
16/07/17 19:48:10 INFO mapreduce.Job:  map 62% reduce 0%
16/07/17 19:48:25 INFO mapreduce.Job:  map 63% reduce 0%
16/07/17 19:48:43 INFO mapreduce.Job:  map 64% reduce 0%
16/07/17 19:48:58 INFO mapreduce.Job:  map 65% reduce 0%
16/07/17 19:49:13 INFO mapreduce.Job:  map 66% reduce 0%
16/07/17 19:49:28 INFO mapreduce.Job:  map 67% reduce 0%
16/07/17 19:49:30 INFO mapreduce.Job:  map 100% reduce 0%
16/07/17 19:49:37 INFO mapreduce.Job:  map 100% reduce 100%
16/07/17 19:49:38 INFO mapreduce.Job: Job job_1468752229715_0016 completed successfully
16/07/17 19:49:39 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=2892255
                FILE: Number of bytes written=5971253
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=4056338
                HDFS: Number of bytes written=861195
                HDFS: Number of read operations=7
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=1016177
                Total time spent by all reduces in occupied slots (ms)=4948
                Total time spent by all map tasks (ms)=1016177
                Total time spent by all reduce tasks (ms)=4948
                Total vcore-seconds taken by all map tasks=1016177
                Total vcore-seconds taken by all reduce tasks=4948
                Total megabyte-seconds taken by all map tasks=1040565248
                Total megabyte-seconds taken by all reduce tasks=5066752
        Map-Reduce Framework
                Map input records=51444
                Map output records=154332
                Map output bytes=2583585
                Map output materialized bytes=2892255
                Input split bytes=103
                Combine input records=0
                Combine output records=0
                Reduce input groups=51444
                Reduce shuffle bytes=2892255
                Reduce input records=154332
                Reduce output records=51444
                Spilled Records=308664
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=5836
                CPU time spent (ms)=1033510
                Physical memory (bytes) snapshot=517627904
                Virtual memory (bytes) snapshot=1786634240
                Total committed heap usage (bytes)=306774016
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=655419
        File Output Format Counters
                Bytes Written=861195

Statistics:

Accuracy: 51444   51367 (apparently total test samples and correctly classified samples, about 99.85%)
CPU time spent (ms)=1033510
map tasks=1
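
The job's source code is not shown in this post. The counters are consistent with a mapper that emits three records per test sample (154332 = 3 x 51444) and a reducer that collapses them into one prediction per sample, which would fit K = 3 with a majority vote. The following is a minimal in-memory sketch under that assumption; the function names, the Euclidean distance, and the toy data are all illustrative and not the original implementation.

```python
import heapq
import math
from collections import Counter

K = 3  # assumed from the 3:1 ratio of map output records to map input records


def knn_map(test_id, test_vec, training_set, k=K):
    """Map step: emit (test_id, label) for the k training samples nearest to test_vec."""
    nearest = heapq.nsmallest(
        k, training_set, key=lambda item: math.dist(test_vec, item[0])
    )
    return [(test_id, label) for _, label in nearest]


def knn_reduce(test_id, labels):
    """Reduce step: majority vote over the labels emitted for one test sample."""
    return test_id, Counter(labels).most_common(1)[0][0]


# Toy data as (feature vector, label); the real train.txt holds 245057 samples.
training = [
    ((0, 0), "a"), ((0, 1), "a"), ((1, 0), "a"),
    ((5, 5), "b"), ((5, 6), "b"), ((6, 5), "b"),
]
test = {"t1": (0.2, 0.1), "t2": (5.5, 5.2)}

predictions = {}
for test_id, vec in test.items():
    emitted = knn_map(test_id, vec, training)            # K records per test sample
    key, label = knn_reduce(test_id, [lab for _, lab in emitted])
    predictions[key] = label
print(predictions)
```

Each map call scans the whole training set, so the cost grows with test samples times training samples. The calls are independent of one another, which is why dividing the test set among several mappers can shorten the run.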

Experiment 2: the training set train.txt is unchanged at 245057 samples; the test set of 51444 samples is split across test1.txt (25568) and test2.txt (25857).

[root@hadoop11 local]# hadoop fs -lsr /dir6/
-rw-r--r--   3 root supergroup     368774 2016-07-17 20:15 /dir6/test1.txt
-rw-r--r--   3 root supergroup     312210 2016-07-17 20:15 /dir6/test2.txt

KNN algorithm run log:

First, the process listing:

[root@hadoop66 ~]# jps
24659 YarnChild  (mapper task)
22777 DataNode
25592 Jps
24660 YarnChild  (mapper task)
24557 MRAppMaster
22622 NodeManager

Counter log:

[root@hadoop11 local]# app1.sh
16/07/17 20:21:03 INFO client.RMProxy: Connecting to ResourceManager at hadoop22/10.187.84.51:8032
16/07/17 20:21:03 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/17 20:21:03 INFO input.FileInputFormat: Total input paths to process : 2
16/07/17 20:21:03 INFO mapreduce.JobSubmitter: number of splits:2
16/07/17 20:21:03 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468752229715_0019
16/07/17 20:21:04 INFO impl.YarnClientImpl: Submitted application application_1468752229715_0019
16/07/17 20:21:04 INFO mapreduce.Job: The url to track the job: http://hadoop22:8088/proxy/application_1468752229715_0019/
16/07/17 20:21:04 INFO mapreduce.Job: Running job: job_1468752229715_0019
16/07/17 20:21:10 INFO mapreduce.Job: Job job_1468752229715_0019 running in uber mode : false
16/07/17 20:21:10 INFO mapreduce.Job:  map 0% reduce 0%
16/07/17 20:21:21 INFO mapreduce.Job:  map 1% reduce 0%
16/07/17 20:21:30 INFO mapreduce.Job:  map 2% reduce 0%
16/07/17 20:21:40 INFO mapreduce.Job:  map 3% reduce 0%
16/07/17 20:21:46 INFO mapreduce.Job:  map 4% reduce 0%
16/07/17 20:21:55 INFO mapreduce.Job:  map 5% reduce 0%
16/07/17 20:22:01 INFO mapreduce.Job:  map 6% reduce 0%
16/07/17 20:22:10 INFO mapreduce.Job:  map 7% reduce 0%
16/07/17 20:22:17 INFO mapreduce.Job:  map 8% reduce 0%
16/07/17 20:22:26 INFO mapreduce.Job:  map 9% reduce 0%
16/07/17 20:22:35 INFO mapreduce.Job:  map 10% reduce 0%
16/07/17 20:22:41 INFO mapreduce.Job:  map 11% reduce 0%
16/07/17 20:22:47 INFO mapreduce.Job:  map 12% reduce 0%
16/07/17 20:22:56 INFO mapreduce.Job:  map 13% reduce 0%
16/07/17 20:23:05 INFO mapreduce.Job:  map 14% reduce 0%
16/07/17 20:23:11 INFO mapreduce.Job:  map 15% reduce 0%
16/07/17 20:23:17 INFO mapreduce.Job:  map 16% reduce 0%
16/07/17 20:23:26 INFO mapreduce.Job:  map 17% reduce 0%
16/07/17 20:23:35 INFO mapreduce.Job:  map 18% reduce 0%
16/07/17 20:23:41 INFO mapreduce.Job:  map 19% reduce 0%
16/07/17 20:23:50 INFO mapreduce.Job:  map 20% reduce 0%
16/07/17 20:23:56 INFO mapreduce.Job:  map 21% reduce 0%
16/07/17 20:24:05 INFO mapreduce.Job:  map 22% reduce 0%
16/07/17 20:24:11 INFO mapreduce.Job:  map 23% reduce 0%
16/07/17 20:24:20 INFO mapreduce.Job:  map 24% reduce 0%
16/07/17 20:24:26 INFO mapreduce.Job:  map 25% reduce 0%
16/07/17 20:24:35 INFO mapreduce.Job:  map 26% reduce 0%
16/07/17 20:24:42 INFO mapreduce.Job:  map 27% reduce 0%
16/07/17 20:24:51 INFO mapreduce.Job:  map 28% reduce 0%
16/07/17 20:24:57 INFO mapreduce.Job:  map 29% reduce 0%
16/07/17 20:25:06 INFO mapreduce.Job:  map 30% reduce 0%
16/07/17 20:25:12 INFO mapreduce.Job:  map 31% reduce 0%
16/07/17 20:25:21 INFO mapreduce.Job:  map 32% reduce 0%
16/07/17 20:25:27 INFO mapreduce.Job:  map 33% reduce 0%
16/07/17 20:25:36 INFO mapreduce.Job:  map 34% reduce 0%
16/07/17 20:25:42 INFO mapreduce.Job:  map 35% reduce 0%
16/07/17 20:25:51 INFO mapreduce.Job:  map 36% reduce 0%
16/07/17 20:25:57 INFO mapreduce.Job:  map 37% reduce 0%
16/07/17 20:26:06 INFO mapreduce.Job:  map 38% reduce 0%
16/07/17 20:26:12 INFO mapreduce.Job:  map 39% reduce 0%
16/07/17 20:26:21 INFO mapreduce.Job:  map 40% reduce 0%
16/07/17 20:26:30 INFO mapreduce.Job:  map 41% reduce 0%
16/07/17 20:26:36 INFO mapreduce.Job:  map 42% reduce 0%
16/07/17 20:26:45 INFO mapreduce.Job:  map 43% reduce 0%
16/07/17 20:26:51 INFO mapreduce.Job:  map 44% reduce 0%
16/07/17 20:27:00 INFO mapreduce.Job:  map 45% reduce 0%
16/07/17 20:27:06 INFO mapreduce.Job:  map 46% reduce 0%
16/07/17 20:27:15 INFO mapreduce.Job:  map 47% reduce 0%
16/07/17 20:27:21 INFO mapreduce.Job:  map 48% reduce 0%
16/07/17 20:27:30 INFO mapreduce.Job:  map 49% reduce 0%
16/07/17 20:27:36 INFO mapreduce.Job:  map 50% reduce 0%
16/07/17 20:27:45 INFO mapreduce.Job:  map 51% reduce 0%
16/07/17 20:27:51 INFO mapreduce.Job:  map 52% reduce 0%
16/07/17 20:28:01 INFO mapreduce.Job:  map 53% reduce 0%
16/07/17 20:28:07 INFO mapreduce.Job:  map 54% reduce 0%
16/07/17 20:28:16 INFO mapreduce.Job:  map 55% reduce 0%
16/07/17 20:28:23 INFO mapreduce.Job:  map 56% reduce 0%
16/07/17 20:28:31 INFO mapreduce.Job:  map 57% reduce 0%
16/07/17 20:28:38 INFO mapreduce.Job:  map 58% reduce 0%
16/07/17 20:28:46 INFO mapreduce.Job:  map 59% reduce 0%
16/07/17 20:28:53 INFO mapreduce.Job:  map 60% reduce 0%
16/07/17 20:29:02 INFO mapreduce.Job:  map 61% reduce 0%
16/07/17 20:29:10 INFO mapreduce.Job:  map 62% reduce 0%
16/07/17 20:29:17 INFO mapreduce.Job:  map 63% reduce 0%
16/07/17 20:29:26 INFO mapreduce.Job:  map 64% reduce 0%
16/07/17 20:29:32 INFO mapreduce.Job:  map 65% reduce 0%
16/07/17 20:29:41 INFO mapreduce.Job:  map 66% reduce 0%
16/07/17 20:29:42 INFO mapreduce.Job:  map 83% reduce 0%
16/07/17 20:29:52 INFO mapreduce.Job:  map 83% reduce 17%
16/07/17 20:29:54 INFO mapreduce.Job:  map 100% reduce 17%
16/07/17 20:29:55 INFO mapreduce.Job:  map 100% reduce 70%
16/07/17 20:29:56 INFO mapreduce.Job:  map 100% reduce 100%
16/07/17 20:29:56 INFO mapreduce.Job: Job job_1468752229715_0019 completed successfully
16/07/17 20:29:56 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=2892255
                FILE: Number of bytes written=6064619
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=7482816
                HDFS: Number of bytes written=861195
                HDFS: Number of read operations=11
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=2
                Launched reduce tasks=1
                Data-local map tasks=2
                Total time spent by all maps in occupied slots (ms)=1032086
                Total time spent by all reduces in occupied slots (ms)=11757
                Total time spent by all map tasks (ms)=1032086
                Total time spent by all reduce tasks (ms)=11757
                Total vcore-seconds taken by all map tasks=1032086
                Total vcore-seconds taken by all reduce tasks=11757
                Total megabyte-seconds taken by all map tasks=1056856064
                Total megabyte-seconds taken by all reduce tasks=12039168
        Map-Reduce Framework
                Map input records=51444
                Map output records=154332
                Map output bytes=2583585
                Map output materialized bytes=2892261
                Input split bytes=200
                Combine input records=0
                Combine output records=0
                Reduce input groups=51444
                Reduce shuffle bytes=2892261
                Reduce input records=154332
                Reduce output records=51444
                Spilled Records=308664
                Shuffled Maps =2
                Failed Shuffles=0
                Merged Map outputs=2
                GC time elapsed (ms)=8264
                CPU time spent (ms)=1045670
                Physical memory (bytes) snapshot=762257408
                Virtual memory (bytes) snapshot=2654359552
                Total committed heap usage (bytes)=496762880
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=680984
        File Output Format Counters
                Bytes Written=861195
16/07/17 20:29:58 INFO client.RMProxy: Connecting to ResourceManager at hadoop22/10.187.84.51:8032
16/07/17 20:29:59 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/17 20:29:59 INFO input.FileInputFormat: Total input paths to process : 1
16/07/17 20:29:59 INFO mapreduce.JobSubmitter: number of splits:1
16/07/17 20:29:59 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468752229715_0020
16/07/17 20:29:59 INFO impl.YarnClientImpl: Submitted application application_1468752229715_0020
16/07/17 20:30:00 INFO mapreduce.Job: The url to track the job: http://hadoop22:8088/proxy/application_1468752229715_0020/
16/07/17 20:30:00 INFO mapreduce.Job: Running job: job_1468752229715_0020
16/07/17 20:30:05 INFO mapreduce.Job: Job job_1468752229715_0020 running in uber mode : false
16/07/17 20:30:05 INFO mapreduce.Job:  map 0% reduce 0%
16/07/17 20:30:12 INFO mapreduce.Job:  map 100% reduce 0%
16/07/17 20:30:18 INFO mapreduce.Job:  map 100% reduce 100%
16/07/17 20:30:18 INFO mapreduce.Job: Job job_1468752229715_0020 completed successfully
16/07/17 20:30:18 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=24
                FILE: Number of bytes written=186173
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=861298
                HDFS: Number of bytes written=12
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3973
                Total time spent by all reduces in occupied slots (ms)=3243
                Total time spent by all map tasks (ms)=3973
                Total time spent by all reduce tasks (ms)=3243
                Total vcore-seconds taken by all map tasks=3973
                Total vcore-seconds taken by all reduce tasks=3243
                Total megabyte-seconds taken by all map tasks=4068352
                Total megabyte-seconds taken by all reduce tasks=3320832
        Map-Reduce Framework
                Map input records=51444
                Map output records=1
                Map output bytes=16
                Map output materialized bytes=24
                Input split bytes=103
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=24
                Reduce input records=1
                Reduce output records=1
                Spilled Records=2
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=70
                CPU time spent (ms)=2340
                Physical memory (bytes) snapshot=451612672
                Virtual memory (bytes) snapshot=1790021632
                Total committed heap usage (bytes)=309002240
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=861195
        File Output Format Counters
                Bytes Written=12

Statistics:

Accuracy: 51444   51367
CPU time spent (ms)=1045670  (the time is this long because creating the mapper tasks takes time, and both mapper tasks ran on the same server, hadoop66)
map tasks=2

Experiment 3: the training set train.txt is unchanged at 245057 samples; the test set of 51444 samples is split across test1.txt (25402), test2.txt (15224), and test3.txt (10818).

[root@hadoop11 local]# hadoop fs -lsr /dir6/
lsr: DEPRECATED: Please use 'ls -R' instead.
-rw-r--r--   3 root supergroup     128161 2016-07-17 20:54 /dir6/test1.txt
-rw-r--r--   3 root supergroup     366313 2016-07-17 20:54 /dir6/test2.txt
-rw-r--r--   3 root supergroup     201566 2016-07-17 20:54 /dir6/test3.txt

First, the process listing:

[root@hadoop33 ~]# jps
26501 Jps
26279 YarnChild  (mapper task)
2399 QuorumPeerMain
26280 YarnChild  (mapper task)
23800 DataNode
23648 NodeManager
26133 MRAppMaster
[root@hadoop66 ~]# jps
22777 DataNode
26652 Jps
26302 YarnChild  (mapper task)
22622 NodeManager
This shows that the mapper tasks are now running on two servers (hadoop33 and hadoop66): divide and conquer!
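
How much a split helps depends on its largest piece, because the map phase finishes only when the slowest task does. A rough estimate can be sketched as follows. It assumes a uniform per-sample cost derived from experiment 1 (1033510 ms of CPU time for 51444 test samples), uses the per-file sample counts as reported in this post, and ignores task startup and scheduling; map_phase_estimate is a hypothetical helper.

```python
CPU_MS_EXPERIMENT_1 = 1033510  # "CPU time spent (ms)" counter from experiment 1
TEST_SAMPLES = 51444

# Roughly 20 ms of CPU per test sample, assuming every sample costs the same.
MS_PER_SAMPLE = CPU_MS_EXPERIMENT_1 / TEST_SAMPLES


def map_phase_estimate(split_samples):
    """Estimated map-phase duration in seconds if all splits run in parallel."""
    return max(split_samples) * MS_PER_SAMPLE / 1000


print(round(map_phase_estimate([51444])))                # experiment 1
print(round(map_phase_estimate([25568, 25857])))         # experiment 2
print(round(map_phase_estimate([25402, 15224, 10818])))  # experiment 3
print(round(map_phase_estimate([17148, 17148, 17148])))  # hypothetical even 3-way split
```

Under these assumptions the largest file in experiment 3 (25402 samples) is almost as large as the largest file in experiment 2 (25857 samples), so a third mapper would be expected to change little: roughly 510 s against roughly 520 s. An even three-way split would come out nearer 345 s.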

Detailed run log:

[root@hadoop11 local]# app1.sh
16/07/17 20:55:17 INFO client.RMProxy: Connecting to ResourceManager at hadoop22/10.187.84.51:8032
16/07/17 20:55:18 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/17 20:55:18 INFO input.FileInputFormat: Total input paths to process : 3
16/07/17 20:55:18 INFO mapreduce.JobSubmitter: number of splits:3
16/07/17 20:55:18 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468752229715_0021
16/07/17 20:55:19 INFO impl.YarnClientImpl: Submitted application application_1468752229715_0021
16/07/17 20:55:19 INFO mapreduce.Job: The url to track the job: http://hadoop22:8088/proxy/application_1468752229715_0021/
16/07/17 20:55:19 INFO mapreduce.Job: Running job: job_1468752229715_0021
16/07/17 20:55:25 INFO mapreduce.Job: Job job_1468752229715_0021 running in uber mode : false
16/07/17 20:55:25 INFO mapreduce.Job:  map 0% reduce 0%
16/07/17 20:55:37 INFO mapreduce.Job:  map 1% reduce 0%
16/07/17 20:55:40 INFO mapreduce.Job:  map 2% reduce 0%
16/07/17 20:55:45 INFO mapreduce.Job:  map 3% reduce 0%
16/07/17 20:55:49 INFO mapreduce.Job:  map 4% reduce 0%
16/07/17 20:55:54 INFO mapreduce.Job:  map 5% reduce 0%
16/07/17 20:55:58 INFO mapreduce.Job:  map 6% reduce 0%
16/07/17 20:56:03 INFO mapreduce.Job:  map 7% reduce 0%
16/07/17 20:56:07 INFO mapreduce.Job:  map 8% reduce 0%
16/07/17 20:56:12 INFO mapreduce.Job:  map 9% reduce 0%
16/07/17 20:56:16 INFO mapreduce.Job:  map 10% reduce 0%
16/07/17 20:56:20 INFO mapreduce.Job:  map 11% reduce 0%
16/07/17 20:56:24 INFO mapreduce.Job:  map 12% reduce 0%
16/07/17 20:56:29 INFO mapreduce.Job:  map 13% reduce 0%
16/07/17 20:56:33 INFO mapreduce.Job:  map 14% reduce 0%
16/07/17 20:56:37 INFO mapreduce.Job:  map 15% reduce 0%
16/07/17 20:56:42 INFO mapreduce.Job:  map 16% reduce 0%
16/07/17 20:56:47 INFO mapreduce.Job:  map 17% reduce 0%
16/07/17 20:56:51 INFO mapreduce.Job:  map 18% reduce 0%
16/07/17 20:56:56 INFO mapreduce.Job:  map 19% reduce 0%
16/07/17 20:57:00 INFO mapreduce.Job:  map 20% reduce 0%
16/07/17 20:57:05 INFO mapreduce.Job:  map 21% reduce 0%
16/07/17 20:57:08 INFO mapreduce.Job:  map 22% reduce 0%
16/07/17 20:57:13 INFO mapreduce.Job:  map 23% reduce 0%
16/07/17 20:57:18 INFO mapreduce.Job:  map 24% reduce 0%
16/07/17 20:57:23 INFO mapreduce.Job:  map 25% reduce 0%
16/07/17 20:57:27 INFO mapreduce.Job:  map 26% reduce 0%
16/07/17 20:57:32 INFO mapreduce.Job:  map 27% reduce 0%
16/07/17 20:57:36 INFO mapreduce.Job:  map 28% reduce 0%
16/07/17 20:57:41 INFO mapreduce.Job:  map 29% reduce 0%
16/07/17 20:57:45 INFO mapreduce.Job:  map 30% reduce 0%
16/07/17 20:57:50 INFO mapreduce.Job:  map 31% reduce 0%
16/07/17 20:57:54 INFO mapreduce.Job:  map 32% reduce 0%
16/07/17 20:57:59 INFO mapreduce.Job:  map 33% reduce 0%
16/07/17 20:58:03 INFO mapreduce.Job:  map 34% reduce 0%
16/07/17 20:58:08 INFO mapreduce.Job:  map 35% reduce 0%
16/07/17 20:58:12 INFO mapreduce.Job:  map 36% reduce 0%
16/07/17 20:58:15 INFO mapreduce.Job:  map 37% reduce 0%
16/07/17 20:58:20 INFO mapreduce.Job:  map 38% reduce 0%
16/07/17 20:58:24 INFO mapreduce.Job:  map 39% reduce 0%
16/07/17 20:58:29 INFO mapreduce.Job:  map 40% reduce 0%
16/07/17 20:58:33 INFO mapreduce.Job:  map 41% reduce 0%
16/07/17 20:58:38 INFO mapreduce.Job:  map 42% reduce 0%
16/07/17 20:58:42 INFO mapreduce.Job:  map 43% reduce 0%
16/07/17 20:58:47 INFO mapreduce.Job:  map 44% reduce 0%
16/07/17 20:58:51 INFO mapreduce.Job:  map 45% reduce 0%
16/07/17 20:58:56 INFO mapreduce.Job:  map 46% reduce 0%
16/07/17 20:59:00 INFO mapreduce.Job:  map 58% reduce 0%
16/07/17 20:59:06 INFO mapreduce.Job:  map 59% reduce 0%
16/07/17 20:59:11 INFO mapreduce.Job:  map 59% reduce 11%
16/07/17 20:59:15 INFO mapreduce.Job:  map 60% reduce 11%
16/07/17 20:59:21 INFO mapreduce.Job:  map 61% reduce 11%
16/07/17 20:59:30 INFO mapreduce.Job:  map 62% reduce 11%
16/07/17 20:59:39 INFO mapreduce.Job:  map 63% reduce 11%
16/07/17 20:59:48 INFO mapreduce.Job:  map 64% reduce 11%
16/07/17 20:59:58 INFO mapreduce.Job:  map 65% reduce 11%
16/07/17 21:00:04 INFO mapreduce.Job:  map 66% reduce 11%
16/07/17 21:00:13 INFO mapreduce.Job:  map 67% reduce 11%
16/07/17 21:00:23 INFO mapreduce.Job:  map 68% reduce 11%
16/07/17 21:00:26 INFO mapreduce.Job:  map 79% reduce 11%
16/07/17 21:00:27 INFO mapreduce.Job:  map 79% reduce 22%
16/07/17 21:00:35 INFO mapreduce.Job:  map 80% reduce 22%
16/07/17 21:00:59 INFO mapreduce.Job:  map 81% reduce 22%
16/07/17 21:01:20 INFO mapreduce.Job:  map 82% reduce 22%
16/07/17 21:01:44 INFO mapreduce.Job:  map 83% reduce 22%
16/07/17 21:02:08 INFO mapreduce.Job:  map 84% reduce 22%
16/07/17 21:02:32 INFO mapreduce.Job:  map 85% reduce 22%
16/07/17 21:02:56 INFO mapreduce.Job:  map 86% reduce 22%
16/07/17 21:03:17 INFO mapreduce.Job:  map 87% reduce 22%
16/07/17 21:03:41 INFO mapreduce.Job:  map 88% reduce 22%
16/07/17 21:04:06 INFO mapreduce.Job:  map 89% reduce 22%
16/07/17 21:04:15 INFO mapreduce.Job:  map 100% reduce 22%
16/07/17 21:04:16 INFO mapreduce.Job:  map 100% reduce 90%
16/07/17 21:04:17 INFO mapreduce.Job:  map 100% reduce 100%
16/07/17 21:04:17 INFO mapreduce.Job: Job job_1468752229715_0021 completed successfully
16/07/17 21:04:17 INFO mapreduce.Job: Counters: 50
        File System Counters
                FILE: Number of bytes read=2892255
                FILE: Number of bytes written=6158011
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=10898788
                HDFS: Number of bytes written=861195
                HDFS: Number of read operations=15
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Killed map tasks=2
                Launched map tasks=5
                Launched reduce tasks=1
                Data-local map tasks=5
                Total time spent by all maps in occupied slots (ms)=1417294
                Total time spent by all reduces in occupied slots (ms)=313657
                Total time spent by all map tasks (ms)=1417294
                Total time spent by all reduce tasks (ms)=313657
                Total vcore-seconds taken by all map tasks=1417294
                Total vcore-seconds taken by all reduce tasks=313657
                Total megabyte-seconds taken by all map tasks=1451309056
                Total megabyte-seconds taken by all reduce tasks=321184768
        Map-Reduce Framework
                Map input records=51444
                Map output records=154332
                Map output bytes=2583585
                Map output materialized bytes=2892267
                Input split bytes=300
                Combine input records=0
                Combine output records=0
                Reduce input groups=51444
                Reduce shuffle bytes=2892267
                Reduce input records=154332
                Reduce output records=51444
                Spilled Records=308664
                Shuffled Maps =3
                Failed Shuffles=0
                Merged Map outputs=3
                GC time elapsed (ms)=9078
                CPU time spent (ms)=1054730
                Physical memory (bytes) snapshot=1011130368
                Virtual memory (bytes) snapshot=3553914880
                Total committed heap usage (bytes)=575209472
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=696040
        File Output Format Counters
                Bytes Written=861195
16/07/17 21:04:19 INFO client.RMProxy: Connecting to ResourceManager at hadoop22/10.187.84.51:8032
16/07/17 21:04:19 WARN mapreduce.JobSubmitter: Hadoop command-line option parsing not performed. Implement the Tool interface and execute your application with ToolRunner to remedy this.
16/07/17 21:04:20 INFO input.FileInputFormat: Total input paths to process : 1
16/07/17 21:04:20 INFO mapreduce.JobSubmitter: number of splits:1
16/07/17 21:04:20 INFO mapreduce.JobSubmitter: Submitting tokens for job: job_1468752229715_0022
16/07/17 21:04:20 INFO impl.YarnClientImpl: Submitted application application_1468752229715_0022
16/07/17 21:04:20 INFO mapreduce.Job: The url to track the job: http://hadoop22:8088/proxy/application_1468752229715_0022/
16/07/17 21:04:20 INFO mapreduce.Job: Running job: job_1468752229715_0022
16/07/17 21:04:27 INFO mapreduce.Job: Job job_1468752229715_0022 running in uber mode : false
16/07/17 21:04:27 INFO mapreduce.Job:  map 0% reduce 0%
16/07/17 21:04:33 INFO mapreduce.Job:  map 100% reduce 0%
16/07/17 21:04:38 INFO mapreduce.Job:  map 100% reduce 100%
16/07/17 21:04:38 INFO mapreduce.Job: Job job_1468752229715_0022 completed successfully
16/07/17 21:04:38 INFO mapreduce.Job: Counters: 49
        File System Counters
                FILE: Number of bytes read=24
                FILE: Number of bytes written=186173
                FILE: Number of read operations=0
                FILE: Number of large read operations=0
                FILE: Number of write operations=0
                HDFS: Number of bytes read=861298
                HDFS: Number of bytes written=12
                HDFS: Number of read operations=6
                HDFS: Number of large read operations=0
                HDFS: Number of write operations=2
        Job Counters
                Launched map tasks=1
                Launched reduce tasks=1
                Data-local map tasks=1
                Total time spent by all maps in occupied slots (ms)=3580
                Total time spent by all reduces in occupied slots (ms)=3393
                Total time spent by all map tasks (ms)=3580
                Total time spent by all reduce tasks (ms)=3393
                Total vcore-seconds taken by all map tasks=3580
                Total vcore-seconds taken by all reduce tasks=3393
                Total megabyte-seconds taken by all map tasks=3665920
                Total megabyte-seconds taken by all reduce tasks=3474432
        Map-Reduce Framework
                Map input records=51444
                Map output records=1
                Map output bytes=16
                Map output materialized bytes=24
                Input split bytes=103
                Combine input records=0
                Combine output records=0
                Reduce input groups=1
                Reduce shuffle bytes=24
                Reduce input records=1
                Reduce output records=1
                Spilled Records=2
                Shuffled Maps =1
                Failed Shuffles=0
                Merged Map outputs=1
                GC time elapsed (ms)=89
                CPU time spent (ms)=2360
                Physical memory (bytes) snapshot=435548160
                Virtual memory (bytes) snapshot=1775456256
                Total committed heap usage (bytes)=310444032
        Shuffle Errors
                BAD_ID=0
                CONNECTION=0
                IO_ERROR=0
                WRONG_LENGTH=0
                WRONG_MAP=0
                WRONG_REDUCE=0
        File Input Format Counters
                Bytes Read=861195
        File Output Format Counters
                Bytes Written=12

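Statistics:

Accuracy: 51444   51367
CPU time spent (ms)=1054730  (this suggests that when the data set is very small, divide and conquer is not a good fit, which indirectly shows that Hadoop is suited to big data)
map tasks=3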
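
The "CPU time spent" counter adds up CPU time over all tasks in a job, so it stays roughly constant however the work is divided. Elapsed time is the figure that parallel mappers can reduce. The sketch below derives it for the KNN job in each experiment from the first "map 0% reduce 0%" line and the "completed successfully" line. The timestamps and CPU figures are copied from the logs above; elapsed_seconds is an illustrative helper.

```python
from datetime import datetime

LOG_TIME_FORMAT = "%y/%m/%d %H:%M:%S"  # timestamp layout in the Hadoop client log


def elapsed_seconds(start, end):
    """Whole seconds between two log timestamps."""
    delta = datetime.strptime(end, LOG_TIME_FORMAT) - datetime.strptime(start, LOG_TIME_FORMAT)
    return int(delta.total_seconds())


# map tasks -> (first "map 0%" line, "completed successfully" line, CPU time in ms)
runs = {
    1: ("16/07/17 19:32:32", "16/07/17 19:49:38", 1033510),
    2: ("16/07/17 20:21:10", "16/07/17 20:29:56", 1045670),
    3: ("16/07/17 20:55:25", "16/07/17 21:04:17", 1054730),
}

for map_tasks, (start, end, cpu_ms) in runs.items():
    print(f"map tasks={map_tasks}  elapsed={elapsed_seconds(start, end)} s  cpu={cpu_ms / 1000:.0f} s")
```

By these timestamps the elapsed time falls from 1026 s with one mapper to 526 s with two and stays close to that (532 s) with three, while the summed CPU time barely moves.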

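Summary: when MapReduce processes big data, it gradually brings the cluster's advantages into play; running mapper tasks in parallel raises the processing speed for large data sets.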

Time: 2024-10-16 00:17:58
