MapReduce: Data Deduplication

ゝ一世哀愁。 2022-05-29 06:50

#### Two files, file1 and file2, contain the following data ####

#### file1: ####

2016-6-1 b  
2016-6-2 a  
2016-6-3 b  
2016-6-4 d  
2016-6-5 a  
2016-6-6 c  
2016-6-7 d  
2016-6-3 c

#### file2: ####

2016-6-1 a  
2016-6-2 b  
2016-6-3 c  
2016-6-4 d  
2016-6-5 a  
2016-6-6 b  
2016-6-7 c  
2016-6-3 c

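Data deduplication  
   In the MapReduce flow, the <key, value> pairs emitted by map are grouped by the shuffle phase into <key, value-list> and then handed to reduce. To deduplicate, map copies each input value (one line of text) into the output key and emits it directly, with an empty value;  
   after the shuffle, identical keys are merged into a single <key, value-list>, which becomes the input to reduce. reduce copies the input key to the output key and emits it once, ignoring the value list. The key-grouping mechanism of MapReduce is what removes the duplicate records.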
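
The flow described above can be sketched in plain Java, without Hadoop, to see why grouping by key removes duplicates. This is only an illustration under stated assumptions: the class name DedupSketch and its helper methods are hypothetical, and a TreeMap stands in for the shuffle's sort-and-group step. The sample lines in main are taken from file1 and file2.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical, Hadoop-free model of the map -> shuffle -> reduce flow.
public class DedupSketch {

    // "map": turn every input line into a (line, "") pair.
    static List<Map.Entry<String, String>> mapPhase(List<String> lines) {
        List<Map.Entry<String, String>> pairs = new ArrayList<>();
        for (String line : lines) {
            pairs.add(new SimpleEntry<>(line, ""));
        }
        return pairs;
    }

    // "shuffle": group the values of identical keys; TreeMap keeps keys sorted.
    static TreeMap<String, List<String>> shuffle(List<Map.Entry<String, String>> pairs) {
        TreeMap<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, String> pair : pairs) {
            grouped.computeIfAbsent(pair.getKey(), k -> new ArrayList<>())
                   .add(pair.getValue());
        }
        return grouped;
    }

    // "reduce": emit each key exactly once and ignore its value list.
    static List<String> reducePhase(TreeMap<String, List<String>> grouped) {
        return new ArrayList<>(grouped.keySet());
    }

    static List<String> dedup(List<String> lines) {
        return reducePhase(shuffle(mapPhase(lines)));
    }

    public static void main(String[] args) {
        // "2016-6-3 c" occurs once in file1 and twice in file2.
        List<String> input = Arrays.asList(
                "2016-6-1 b", "2016-6-3 c",                 // from file1
                "2016-6-1 a", "2016-6-3 c", "2016-6-3 c");  // from file2
        for (String line : dedup(input)) {
            System.out.println(line);
        }
    }
}
```

Note that the whole line is the key, so two lines count as duplicates only when both the date and the letter match.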

Steps:

1. First, upload the two data files to HDFS

You can first create a directory named /file on HDFS and then upload both files into it:

hdfs dfs -put file1.txt file2.txt /file

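2. Package the Java code as a jar (xxx.jar) and copy it to the machine where you run the hadoop command. The jar is read from the local filesystem, not from HDFS.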

3. Run hadoop jar /xxx.jar /file /fileOut (the output directory /fileOut must not already exist)

Java code

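    import java.io.IOException;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    /*
     * Remove the duplicate lines from the two input files.
     */
    public class DeWeight {

        public static class Maps extends Mapper<Object, Text, Text, Text> {
            // Reused empty value; only the key matters for deduplication.
            private final Text empty = new Text("");

            @Override
            protected void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                // Emit the whole input line as the output key.
                context.write(value, empty);
            }
        }

        public static class Reduces extends Reducer<Text, Text, Text, Text> {
            private final Text empty = new Text("");

            @Override
            protected void reduce(Text k2, Iterable<Text> v2, Context context)
                    throws IOException, InterruptedException {
                // Each distinct key arrives here once, so writing it once drops the duplicates.
                context.write(k2, empty);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration());

            job.setJarByClass(DeWeight.class);

            job.setMapperClass(Maps.class);
            job.setMapOutputKeyClass(Text.class);
            job.setMapOutputValueClass(Text.class);
            // args[0]: input directory on HDFS
            FileInputFormat.setInputPaths(job, new Path(args[0]));

            job.setReducerClass(Reduces.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(Text.class);
            // args[1]: output directory on HDFS
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Exit with a non-zero status if the job fails.
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }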