Word Count with Python MapReduce

骑猪看日落 2022-06-06 03:24

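Writing MapReduce jobs in Python requires some familiarity with HadoopStreaming.  
HadoopStreaming is a utility that runs MapReduce jobs with any executable or script as the mapper and/or the reducer.  
The usage format is:  
`$HADOOP_HOME/bin/hadoop jar $HADOOP_HOME/hadoop-streaming.jar -input myInputDirs -output myOutputDir -mapper /bin/cat -reducer /bin/wc`  
Official documentation:  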
[http://hadoop.apache.org/docs/r1.0.4/cn/streaming.html][http_hadoop.apache.org_docs_r1.0.4_cn_streaming.html]

### Steps ###

### 1. Upload the word file to HDFS ###

Upload a file with the following contents to HDFS:

    hello java
    hello python
    word word
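
The post does not show the upload command itself, so here is a minimal sketch of one way to do it from Python. It assumes the local file is called `words.txt` (a hypothetical name) and that the target is `/input/data`, the path used as `-input` later on. The helper names are made up for illustration; `hdfs dfs -mkdir -p` and `hdfs dfs -put` are the standard HDFS shell commands.

```python
import posixpath
import subprocess


def hdfs_upload_commands(local_path, hdfs_path):
    """Build the argv lists that create the parent directory and upload the file."""
    parent = posixpath.dirname(hdfs_path)  # HDFS paths always use forward slashes
    return [
        ["hdfs", "dfs", "-mkdir", "-p", parent],
        ["hdfs", "dfs", "-put", local_path, hdfs_path],
    ]


def upload(local_path, hdfs_path):
    """Run the commands; needs a working Hadoop client on the PATH."""
    for argv in hdfs_upload_commands(local_path, hdfs_path):
        subprocess.check_call(argv)
```

Calling `upload("words.txt", "/input/data")` on a machine with Hadoop installed should leave the file at `/input/data`.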

### 2. Write the map and reduce functions ###

The map function (`mapper.py`):

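    #!/usr/bin/env python
    # coding: utf-8
    import sys

    for line in sys.stdin:  # read the input line by line
        line = line.strip()  # remove leading and trailing whitespace
        words = line.split()  # split the line into words
        for word in words:
            # Emit each word as the key with a count of 1 as the value.
            # By default, the part of a line before the first tab is the key and
            # the rest (excluding the tab) is the value. If there is no tab, the
            # whole line is the key and the value is null.
            print('%s\t%s' % (word, 1))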
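
To make the key/value rule concrete, here is a small sketch of how a single output line would be split under that default. `split_key_value` is a hypothetical helper, not part of Hadoop, and `None` stands in for the null value.

```python
def split_key_value(line, sep="\t"):
    """Split one streaming line into (key, value) at the first separator."""
    line = line.rstrip("\n")
    key, found, value = line.partition(sep)
    # No separator at all: the whole line is the key and there is no value.
    return (key, value if found else None)


print(split_key_value("hello\t1\n"))  # ('hello', '1')
print(split_key_value("a\tb\tc"))     # ('a', 'b\tc'), only the first tab counts
print(split_key_value("hello"))       # ('hello', None)
```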

The reduce function (`reduce.py`):

    #!/usr/bin/env python
    # coding: utf-8
    import sys

    current_word = None  # the previous word, used for comparison
    current_count = 0  # running total for the current word

    for line in sys.stdin:  # read the sorted map output line by line
        line = line.strip()
        try:
            # The key is everything before the first \t; split only once.
            word, count = line.split('\t', 1)
            count = int(count)
        except ValueError:  # skip lines with no tab or a non-numeric count
            continue
        if current_word == word:  # same word as the previous line
            current_count += count
        else:
            if current_word is not None:  # a new word, and not the first one: emit
                print('%s\t%s' % (current_word, current_count))
            current_count = count
            current_word = word

    if current_word is not None:  # don't forget to emit the last word
        print('%s\t%s' % (current_word, current_count))
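
Since the reducer only ever sums runs of identical adjacent keys, the same logic can also be written with `itertools.groupby`. This is an alternative sketch, not the script used in this post; `parse` and `reduce_counts` are hypothetical names.

```python
from itertools import groupby


def parse(lines):
    """Yield (word, count) pairs, skipping malformed lines."""
    for line in lines:
        word, sep, count = line.rstrip("\n").partition("\t")
        if not sep:
            continue  # no tab: not a key/value line
        try:
            yield word, int(count)
        except ValueError:
            continue  # non-numeric count


def reduce_counts(lines):
    """Sum the counts of consecutive lines that share the same word."""
    for word, group in groupby(parse(lines), key=lambda pair: pair[0]):
        yield word, sum(count for _, count in group)
```

To use it as a streaming reducer you would loop over `reduce_counts(sys.stdin)` and print each pair separated by a tab.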

### 3. Run the program with HadoopStreaming ###

`hadoop jar /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar -file mapper.py -mapper "python mapper.py" -file reduce.py -reducer "python reduce.py" -input /input/data -output /output`  
The command breaks down as follows:

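    /usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar is the location of the streaming jar
    
    -file mapper.py is the location of my mapper file, which is shipped with the job (it is in the current directory, so the bare file name is enough)
    
    -mapper "python mapper.py" is the mapper command
    
    -file reduce.py is the location of the reducer file
    
    -reducer "python reduce.py" is the reducer command
    
    -input /input/data  -output /output are the input and output paths on HDFS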
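
If you prefer to launch the job from a Python script, passing the arguments as a list avoids shell quoting problems around `python mapper.py`. The sketch below only assembles the same options described above; `build_streaming_command` is a hypothetical helper, and the jar path is the one from this post, so adjust it to your installation.

```python
import subprocess

STREAMING_JAR = "/usr/local/hadoop/share/hadoop/tools/lib/hadoop-streaming-2.6.5.jar"


def build_streaming_command(jar, mapper, reducer, input_path, output_path):
    """Assemble the argv list for the streaming job described above."""
    return [
        "hadoop", "jar", jar,
        "-file", mapper, "-mapper", "python " + mapper,
        "-file", reducer, "-reducer", "python " + reducer,
        "-input", input_path, "-output", output_path,
    ]


command = build_streaming_command(
    STREAMING_JAR, "mapper.py", "reduce.py", "/input/data", "/output"
)
print(" ".join(command))
# To actually submit the job (requires Hadoop on the PATH):
# subprocess.check_call(command)
```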

### Result ###

After the map phase finishes, its output is sorted by key and the following is fed to reduce:

    hello   1
    hello   1
    java    1
    python  1
    word    1
    word    1

reduce tallies the input and outputs each word with its total, producing the final part-00000 file:

    hello   2
    java    1
    python  1
    word    2
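
The two listings above can be reproduced locally without Hadoop, which also shows why the sort between map and reduce matters: the reducer only compares each word with the previous one. This is a rough simulation under the assumption that a plain `sorted()` stands in for Hadoop's shuffle; all names are hypothetical.

```python
SAMPLE_LINES = ["hello java", "hello python", "word word"]


def map_phase(lines):
    """Emit a (word, 1) pair for every word, like mapper.py."""
    return [(word, 1) for line in lines for word in line.split()]


def streaming_reduce(pairs):
    """Sum counts of consecutive equal keys, like reduce.py."""
    out = []
    for word, count in pairs:
        if out and out[-1][0] == word:
            out[-1] = (word, out[-1][1] + count)
        else:
            out.append((word, count))
    return out


mapped = map_phase(SAMPLE_LINES)
print(streaming_reduce(sorted(mapped)))
# [('hello', 2), ('java', 1), ('python', 1), ('word', 2)]
print(streaming_reduce(mapped))  # unsorted input splits the 'hello' count
# [('hello', 1), ('java', 1), ('hello', 1), ('python', 1), ('word', 2)]
```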

[http_hadoop.apache.org_docs_r1.0.4_cn_streaming.html]: http://hadoop.apache.org/docs/r1.0.4/cn/streaming.html