MapReduce Example: Data Deduplication

超、凢脫俗 2022-06-04 05:46

#### 1. Example Description ####

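Data deduplication uses parallel processing to filter a data set down to its distinct, meaningful records. Seemingly heavy tasks, such as counting the number of distinct values in a large data set or computing visits from website logs, all involve deduplication.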

For example, the input file `file1.txt` contains:  
2017-12-9 a  
2017-12-10 b  
2017-12-11 c  
2017-12-12 d  
2017-12-13 a  
2017-12-14 b  
2017-12-15 c  
2017-12-11 c

The input file `file2.txt` contains:  
2017-12-9 b  
2017-12-10 a  
2017-12-11 b  
2017-12-12 d  
2017-12-13 a  
2017-12-14 c  
2017-12-15 d  
2017-12-11 c

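For the sample input above, the output holds each distinct line exactly once (listed here in date order; the job itself sorts keys byte-wise, so the `2017-12-9` lines come last in the real output file):  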
2017-12-9 a  
2017-12-9 b  
2017-12-10 a  
2017-12-10 b  
2017-12-11 b  
2017-12-11 c  
2017-12-12 d  
2017-12-13 a  
2017-12-14 b  
2017-12-14 c  
2017-12-15 c  
2017-12-15 d
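
The order in which a job emits these lines follows from how keys are compared. Hadoop's `Text` keys are compared byte by byte, not as dates. The sketch below is plain Java rather than Hadoop code: the class `KeyOrderSketch` and its method are hypothetical names, and `String` ordering stands in for the byte-wise comparison, which matches it for ASCII-only lines like these.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.TreeSet;

class KeyOrderSketch {

    // Returns the distinct lines in character order, which for ASCII-only
    // lines is the same order a byte-wise key comparison would give.
    static List<String> sortedDistinct(List<String> lines) {
        return new ArrayList<>(new TreeSet<>(lines));
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "2017-12-9 a", "2017-12-10 b", "2017-12-9 a", "2017-12-11 c");
        // "2017-12-10 b" sorts before "2017-12-9 a" because '1' < '9'
        for (String line : sortedDistinct(lines)) {
            System.out.println(line);
        }
    }
}
```

If date order matters, one option is to zero-pad the day (`2017-12-09`) before the data reaches the job, so that character order and date order agree.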

#### 2. Design Approach ####

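Since the goal is to remove duplicate records, we can use each whole input line as the key emitted by both the Map and the Reduce function. The shuffle brings identical keys together, so each distinct line reaches the reducer exactly once.  
![Processing flow of the job][SouthEast]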
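
The idea can be tried without a cluster. The following is a minimal plain-Java sketch, not Hadoop code: the class `DedupFlowSketch` and its methods are hypothetical names, and a `TreeMap` stands in for the shuffle that groups values by key.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class DedupFlowSketch {

    // Map stand-in: every input line becomes a (line, "") pair.
    // Shuffle stand-in: the TreeMap collects the empty values under their key.
    static Map<String, List<String>> mapAndShuffle(List<String> lines) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String line : lines) {
            grouped.computeIfAbsent(line, k -> new ArrayList<>()).add("");
        }
        return grouped;
    }

    // Reduce stand-in: ignore the grouped values and keep each key once.
    static List<String> reduce(Map<String, List<String>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    public static void main(String[] args) {
        List<String> input = Arrays.asList(
                "2017-12-11 c", "2017-12-9 a", "2017-12-11 c", "2017-12-9 b");
        Map<String, List<String>> grouped = mapAndShuffle(input);
        System.out.println(grouped);         // each key with the empty values it collected
        System.out.println(reduce(grouped)); // each distinct line once
    }
}
```

A duplicated line only makes the value list for its key longer; the number of keys, and therefore the number of output lines, does not change.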

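**1. The job's processing flow is shown in the figure above**  
**(1) Map function design**  
Purpose of the Map function:  
`<offset, 2017-12-9 a> -> <2017-12-9 a, "">`

Each input line becomes the output key and the value is left empty. (The input key is the line's byte offset in the file and is not used.) The Map function is therefore designed as follows: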

    public static class DedupCleanMapper extends Mapper<LongWritable, Text, Text, Text> {

        // Empty value reused for every record; only the key matters for deduplication
        private static final Text EMPTY = new Text("");

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            // Emit the whole input line as the key so identical lines are grouped together
            context.write(value, EMPTY);
        }
    }

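**(2) Reduce function design**  
Purpose of the Reduce function:  
`<2017-12-11 c, ["", "", ""]> -> <2017-12-11 c, "">`

Since duplicates must be dropped, the values grouped under a key need no aggregation; writing the key once is enough. The Reduce function is therefore designed as follows: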

    public static class DedupCleanReducer extends Reducer<Text, Text, Text, Text> {

        // Empty value written next to each key
        private static final Text EMPTY = new Text("");

        @Override
        protected void reduce(Text key, Iterable<Text> values, Context context)
                throws IOException, InterruptedException {
            // Called once per distinct key: ignore the values and write the key once
            context.write(key, EMPTY);
        }
    }
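
Because the reduce step ignores its values and only keeps the key, applying it more than once does not change the result. The same logic could therefore also run on each map task's output before the shuffle. The sketch below illustrates this in plain Java; it is not part of the original job, and `LocalDedupSketch` and its methods are hypothetical names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.TreeSet;

class LocalDedupSketch {

    // Combiner stand-in: drop duplicates inside one map task's output.
    static List<String> combine(List<String> mapOutput) {
        return new ArrayList<>(new LinkedHashSet<>(mapOutput));
    }

    // Reduce stand-in: merge all shuffled records and keep each key once.
    static List<String> reduce(List<List<String>> shuffled) {
        TreeSet<String> distinct = new TreeSet<>();
        for (List<String> part : shuffled) {
            distinct.addAll(part);
        }
        return new ArrayList<>(distinct);
    }

    public static void main(String[] args) {
        List<String> split1 = Arrays.asList("2017-12-11 c", "2017-12-13 a", "2017-12-11 c");
        List<String> split2 = Arrays.asList("2017-12-13 a", "2017-12-11 c");

        List<String> plain = reduce(Arrays.asList(split1, split2));
        List<String> combined = reduce(Arrays.asList(combine(split1), combine(split2)));

        // The final result is the same with or without the local pass
        System.out.println(plain.equals(combined));
        // Fewer records cross the shuffle when duplicates are dropped locally
        System.out.println(combine(split1).size() + combine(split2).size()
                + " records shuffled instead of " + (split1.size() + split2.size()));
    }
}
```

In a Hadoop driver this would correspond to registering the reducer as a combiner with `job.setCombinerClass(...)`, which the driver in this article does not do.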

#### 3. Complete Code ####

    package com.walker.mrdemo;

    import java.io.IOException;
    import java.net.URI;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class DedupClean {

        /*
         * Map function: emits each input line as the key with an empty value
         */
        public static class DedupCleanMapper extends Mapper<LongWritable, Text, Text, Text> {

            // Empty value reused for every record; only the key matters for deduplication
            private static final Text EMPTY = new Text("");

            @Override
            protected void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit the whole input line as the key so identical lines are grouped together
                context.write(value, EMPTY);
            }
        }

        /*
         * Reduce function: writes each distinct key exactly once
         */
        public static class DedupCleanReducer extends Reducer<Text, Text, Text, Text> {

            // Empty value written next to each key
            private static final Text EMPTY = new Text("");

            @Override
            protected void reduce(Text key, Iterable<Text> values, Context context)
                    throws IOException, InterruptedException {
                // Called once per distinct key: ignore the values and write the key once
                context.write(key, EMPTY);
            }
        }

        // Input and output paths
        private static final String FILE_IN_PATH = "hdfs://192.168.50.130:9000/mrdemo/DedupClean/input";
        private static final String FILE_OUT_PATH = "hdfs://192.168.50.130:9000/mrdemo/DedupClean/output";

        public static void main(String[] args) throws Exception {

            Configuration conf = new Configuration();

            // Delete the output directory if it already exists; otherwise the job fails at startup
            Path outPath = new Path(FILE_OUT_PATH);
            FileSystem fileSystem = FileSystem.get(new URI(FILE_OUT_PATH), conf);
            if (fileSystem.exists(outPath)) {
                fileSystem.delete(outPath, true);
            }

            Job job = Job.getInstance(conf, "DedupClean");

            job.setJarByClass(DedupClean.class);
            job.setMapperClass(DedupCleanMapper.class);
            job.setReducerClass(DedupCleanReducer.class);

            // Key and value types shared by the map and reduce outputs
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);

            FileInputFormat.addInputPath(job, new Path(FILE_IN_PATH));
            FileOutputFormat.setOutputPath(job, outPath);

            // Report success or failure through the process exit code
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }

[SouthEast]: /images/20220604/f2625ebb719f4318ac599db725f7b5f5.png