0090 - Custom partitioning in MapReduce

雨点打透心脏的1/2处 2023-07-14 12:54

### Contents ###

*  1. Requirement
*  2. Implementation steps
    *  2.1 Entity class
    *  2.2 Mapper
    *  2.3 Custom Partitioner
    *  2.4 Reducer
    *  2.5 Running the job
    *  2.6 Summary

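### 1. Requirement ###

When the mapper sends its results to the Reducer, the records are split into partitions. By default they all end up in the same partition, but different business requirements sometimes call for a different split.  
Note: in the default logic shown below, (key.hashCode() & Integer.MAX\_VALUE) clears the sign bit of the key's hash, which is the same as taking the hash modulo the largest integer plus 1, so the value is never negative. The result is then taken modulo the number of reduce tasks. That number defaults to 1, so the remainder is always 0 and every record lands in a single partition.

    (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks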
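To make the default rule concrete, here is a minimal sketch in plain Java with no Hadoop dependency. It only reproduces the arithmetic of the expression above; the names `DefaultPartitionDemo` and `partitionFor` are made up for illustration and are not part of Hadoop.

```java
// Illustrative sketch only: mirrors the arithmetic of the default rule quoted above.
class DefaultPartitionDemo {

    // Clear the sign bit so the value is non-negative, then take the remainder.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        // With a single reduce task, every key lands in partition 0.
        System.out.println(partitionFor("13512345678", 1));
        // Integer.valueOf(-7).hashCode() is -7; the mask keeps the result inside [0, 5).
        System.out.println(partitionFor(Integer.valueOf(-7), 5));
    }
}
```

Without the mask, a negative hash code would give a negative remainder in Java, which is not a valid partition number.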

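### 2. Implementation steps ###

Two steps: 1. write a custom Partitioner implementation; 2. register that implementation in the job class and set the number of reduce tasks.

#### 2.1 Entity class ####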

    import java.io.DataInput;
    import java.io.DataOutput;
    import java.io.IOException;

    import lombok.Data;
    import org.apache.hadoop.io.WritableComparable;

    // @Data (Lombok) generates the getters and setters used elsewhere, e.g. getNumber().
    @Data
    public class FlowBean implements WritableComparable<FlowBean> {
        private String number;
        private long upFlow;
        private long downFlow;
        private long sumFlow;

        public FlowBean(String number, long upFlow, long downFlow) {
            this.number = number;
            this.upFlow = upFlow;
            this.downFlow = downFlow;
            this.sumFlow = upFlow + downFlow;
        }

        // Hadoop needs a no-argument constructor to create the bean during deserialization.
        public FlowBean() {
        }

        // Serialization: the field order must match readFields().
        @Override
        public void write(DataOutput out) throws IOException {
            out.writeUTF(number);
            out.writeLong(upFlow);
            out.writeLong(downFlow);
            out.writeLong(sumFlow);
        }

        // Deserialization: read the fields in the same order they were written.
        @Override
        public void readFields(DataInput in) throws IOException {
            this.number = in.readUTF();
            this.upFlow = in.readLong();
            this.downFlow = in.readLong();
            this.sumFlow = in.readLong();
        }

        @Override
        public String toString() {
            return number + " " + upFlow + " " + downFlow + " " + sumFlow;
        }

        // Descending order by total flow: the larger sumFlow sorts first.
        @Override
        public int compareTo(FlowBean o) {
            return Long.compare(o.sumFlow, this.sumFlow);
        }
    }

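Note: in this example the mapper emits the FlowBean entity as its output key, and keys are automatically sorted on the way from the Mapper to the Reducer. A custom key class must therefore implement the WritableComparable interface.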
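The ordering that this sort produces can be checked outside Hadoop. The sketch below uses a stripped-down stand-in for FlowBean (the class `Flow`, the helper `sortedNumbers` and the sample values are all made up for illustration) with the same comparison, so records with a larger total flow come first.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Illustrative sketch only: a stand-in for FlowBean without Hadoop types.
class SortOrderDemo {

    static class Flow implements Comparable<Flow> {
        final String number;
        final long sumFlow;

        Flow(String number, long upFlow, long downFlow) {
            this.number = number;
            this.sumFlow = upFlow + downFlow;
        }

        // Same ordering as FlowBean.compareTo: larger sumFlow sorts first.
        @Override
        public int compareTo(Flow o) {
            return Long.compare(o.sumFlow, this.sumFlow);
        }
    }

    // Returns the phone numbers in the order the comparison produces.
    static List<String> sortedNumbers(List<Flow> flows) {
        List<Flow> copy = new ArrayList<>(flows);
        Collections.sort(copy);
        List<String> numbers = new ArrayList<>();
        for (Flow f : copy) {
            numbers.add(f.number);
        }
        return numbers;
    }

    public static void main(String[] args) {
        List<Flow> flows = new ArrayList<>();
        flows.add(new Flow("13500000001", 10, 20));   // made-up sample values
        flows.add(new Flow("13600000002", 300, 400));
        flows.add(new Flow("13700000003", 50, 60));
        System.out.println(sortedNumbers(flows));
    }
}
```

In the real job this sort happens per partition, so each reducer sees its own records in descending order of total flow.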

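#### 2.2 Mapper ####

    import java.io.IOException;

    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    public class GroupFlowMapper extends Mapper<LongWritable, Text, FlowBean, NullWritable> {
        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // Each line is expected to hold: number, upFlow, downFlow, separated by whitespace.
            String line = value.toString();
            // "\\s+" tolerates runs of spaces or tabs between fields.
            String[] fields = line.trim().split("\\s+");
            String number = fields[0];
            long upFlow = Long.parseLong(fields[1]);
            long downFlow = Long.parseLong(fields[2]);
            // The bean is the key, so the framework sorts and partitions on it.
            FlowBean flowBean = new FlowBean(number, upFlow, downFlow);
            context.write(flowBean, NullWritable.get());
        }
    }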
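The parsing step of the mapper can be tried on its own. This is a minimal sketch without Hadoop types, assuming each input line holds the number, the upstream flow and the downstream flow separated by whitespace; `LineParseDemo`, its methods and the sample record are hypothetical.

```java
// Illustrative sketch only: the field-splitting step of the mapper, without Hadoop types.
class LineParseDemo {

    // Splits one input line into {number, upFlow, downFlow}; assumes whitespace-separated fields.
    static String[] splitFields(String line) {
        return line.trim().split("\\s+");
    }

    // Total flow for one line: upFlow + downFlow.
    static long sumFlow(String line) {
        String[] fields = splitFields(line);
        return Long.parseLong(fields[1]) + Long.parseLong(fields[2]);
    }

    public static void main(String[] args) {
        String line = "13512345678 100 200"; // made-up sample record
        System.out.println(splitFields(line)[0] + " -> " + sumFlow(line));
    }
}
```

Splitting on a single whitespace character instead would yield empty fields whenever two separators appear in a row, and `Long.parseLong` would then fail on the empty string.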

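#### 2.3 Custom Partitioner ####

    import java.util.HashMap;
    import java.util.Map;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Partitioner;

    public class NumPartitioner extends Partitioner<FlowBean, NullWritable> {
        // Maps the first three digits of the number to a partition index.
        private static final Map<String, Integer> map = new HashMap<>();

        static {
            // Partition numbers must start at 0
            map.put("135", 0);
            map.put("136", 1);
            map.put("137", 2);
            map.put("138", 3);
        }

        @Override
        public int getPartition(FlowBean flowBean, NullWritable nullWritable, int numPartitions) {
            Integer partitionNum = map.get(flowBean.getNumber().substring(0, 3));
            // Any other prefix goes to partition 4, the catch-all.
            return partitionNum == null ? 4 : partitionNum;
        }
    }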

Note: partition numbers must start at 0.
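The lookup in NumPartitioner can be exercised without a cluster. The following is a rough sketch of the same prefix rule in plain Java; `PrefixPartitionDemo` is a made-up name, and the guard for numbers shorter than three characters is an assumption added here, not something the original class does.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch only: the prefix lookup from NumPartitioner, without Hadoop types.
class PrefixPartitionDemo {

    private static final Map<String, Integer> PREFIX_TO_PARTITION = new HashMap<>();
    private static final int DEFAULT_PARTITION = 4;

    static {
        PREFIX_TO_PARTITION.put("135", 0);
        PREFIX_TO_PARTITION.put("136", 1);
        PREFIX_TO_PARTITION.put("137", 2);
        PREFIX_TO_PARTITION.put("138", 3);
    }

    static int partitionFor(String number) {
        // Assumption: missing or too-short numbers go to the catch-all partition
        // instead of failing in substring().
        if (number == null || number.length() < 3) {
            return DEFAULT_PARTITION;
        }
        Integer partition = PREFIX_TO_PARTITION.get(number.substring(0, 3));
        return partition == null ? DEFAULT_PARTITION : partition;
    }

    public static void main(String[] args) {
        System.out.println(partitionFor("13512345678")); // prefix 135
        System.out.println(partitionFor("13912345678")); // unknown prefix, catch-all
    }
}
```

With four known prefixes plus the catch-all, the rule produces five distinct partition numbers, 0 through 4, which is why the job later asks for five reduce tasks.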

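#### 2.4 Reducer ####

    import java.io.IOException;

    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Reducer;

    public class GroupFlowReducer extends Reducer<FlowBean, NullWritable, FlowBean, NullWritable> {
        @Override
        protected void reduce(FlowBean key, Iterable<NullWritable> values, Context context) throws IOException, InterruptedException {
            // Beans that compare as equal arrive in one call; write one output line per record.
            for (NullWritable value : values) {
                context.write(key, value);
            }
        }
    }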

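#### 2.5 Running the job ####

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.conf.Configured;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
    import org.apache.hadoop.util.Tool;
    import org.apache.hadoop.util.ToolRunner;

    public class GroupFlowRunner extends Configured implements Tool {
        @Override
        public int run(String[] args) throws Exception {
            // Use the configuration that ToolRunner passed in, so command-line options take effect.
            Job job = Job.getInstance(getConf());
            job.setJarByClass(GroupFlowRunner.class);

            job.setMapperClass(GroupFlowMapper.class);
            job.setReducerClass(GroupFlowReducer.class);

            job.setMapOutputKeyClass(FlowBean.class);
            job.setMapOutputValueClass(NullWritable.class);

            job.setOutputKeyClass(FlowBean.class);
            job.setOutputValueClass(NullWritable.class);

            FileInputFormat.setInputPaths(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));

            // Register the custom Partitioner
            job.setPartitionerClass(NumPartitioner.class);
            // The number of reduce tasks may be 1, or equal to or greater than the number of partitions
            job.setNumReduceTasks(5);

            // Exit-code convention: 0 means success, non-zero means failure.
            return job.waitForCompletion(true) ? 0 : 1;
        }

        public static void main(String[] args) throws Exception {
            // Propagate the job result to the process exit code.
            System.exit(ToolRunner.run(new Configuration(), new GroupFlowRunner(), args));
        }
    }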

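#### 2.6 Summary ####

*  In the custom Partitioner, partition numbers must start at 0.
*  Register the custom Partitioner in the job.
*  Set the number of reduce tasks in the job. It may be equal to or greater than the number of partitions, and 1 also works.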
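The last rule can be pictured with a small model. The sketch below is not Hadoop source code. It is a simplified model, written under the assumption that with a single reduce task the custom Partitioner is bypassed and everything goes to partition 0, and that otherwise a partition number outside the range 0 to numReduceTasks - 1 makes the map task fail. `ReduceTaskCountDemo` and `effectivePartition` are made-up names, and the exception type is chosen only for the demo.

```java
// Illustrative model only, not Hadoop source code.
class ReduceTaskCountDemo {

    // Returns the partition a record would effectively land in,
    // or throws when the requested partition does not exist.
    static int effectivePartition(int requestedPartition, int numReduceTasks) {
        if (numReduceTasks == 1) {
            return 0; // single reduce task: every record goes to partition 0
        }
        if (requestedPartition < 0 || requestedPartition >= numReduceTasks) {
            throw new IllegalArgumentException(
                    "Illegal partition " + requestedPartition + " for " + numReduceTasks + " reduce tasks");
        }
        return requestedPartition;
    }

    public static void main(String[] args) {
        System.out.println(effectivePartition(4, 1)); // one task: collapses to partition 0
        System.out.println(effectivePartition(4, 5)); // five tasks: partition 4 exists
        try {
            effectivePartition(4, 3);                 // three tasks: partition 4 does not exist
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Under this model, NumPartitioner's five partitions (0 to 4) need either exactly one reduce task or at least five. With more than five, the extra reduce tasks simply receive no records.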