Hadoop case: WordCount execution process

MapReduce operation

Count the number of occurrences of each word in a file.

Package the MapReduce program as a jar, copy it to the Hadoop machine, and run:
hadoop jar xsf.jar mapreduce.Dirverx /123 /usr/local/server/hadoop-2.10.0/out

mapreduce.Dirverx: the fully qualified class name of the driver class
/123: the input file path, read from HDFS by default
/usr/local/server/hadoop-2.10.0/out: the output path for the results

View the output

MapReduce process

A MapReduce program consists of a map method (Mapper), a reduce method (Reducer), and a driver class.

(1) The map method processes each line of a file split. Its input is a key-value pair of (byte offset of the start of the line, content of the line). For each line it splits out the words and writes a (word, 1) key-value pair to the context.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class Mapperx extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
        // One input line, e.g. "AAA BBB CCCC"
        String line = value.toString();
        // Split the line into words on spaces
        String[] words = line.split(" ");

        for (String word : words) {
            // Emit (word, 1) for each word: (AAA, 1) (BBB, 1) (CCCC, 1)
            context.write(new Text(word), new IntWritable(1));
        }
    }
}

(2) The key-value pairs emitted in the map stage are grouped so that all values with the same key form one group. The reduce method receives (key, the collection of values for that key), sums the values, and writes out the total.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class Reducex extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text key, Iterable<IntWritable> values, Context context) throws IOException, InterruptedException {
        int sum = 0;
        // Sum all the 1s emitted for this word
        for (IntWritable value : values) {
            sum += value.get();
        }
        // Emit (word, total count), e.g. (AAA, 2)
        context.write(key, new IntWritable(sum));
    }
}

(3) The driver defines and configures the MapReduce job and provides the program entry point.

import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class Dirverx {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Job job = Job.getInstance(new Configuration());

        // Jar that contains the job classes
        job.setJarByClass(Dirverx.class);

        // Mapper and Reducer implementations
        job.setMapperClass(Mapperx.class);
        job.setReducerClass(Reducex.class);

        // Map output types
        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(IntWritable.class);

        // Final (reduce) output types
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Input and output paths, taken from the command line
        FileInputFormat.setInputPaths(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        // Submit the job and wait for it to finish
        boolean b = job.waitForCompletion(true);
        System.exit(b ? 0 : 1);
    }
}
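
As written, the driver relies on the framework default of a single ReduceTask. If you want the map output spread across several reducers (which is when the partitioning described in the steps below matters), the reducer count can be set explicitly in the driver. A minimal sketch, with the value 2 chosen purely for illustration:

// Optional: run more than one ReduceTask so the map output is partitioned.
// The value 2 is only an illustrative choice, not part of the original example.
job.setNumReduceTasks(2);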

(1) A large input file is divided into multiple splits, and each split is processed by one MapTask process. By default, the split is read one line at a time, and the map method is called once per line.
(2) Before a key-value pair emitted by the map method is written into the ring buffer, the partitioner is called to compute the partition number for the key. By default this is the key's hash value modulo the number of ReduceTasks (see the sketch after this list).
(3) When the buffer is nearly full, the buffered data is sorted by key (quick sort) and spilled to the local disk.

(4) Each ReduceTask downloads its partition of the map output from the MapTasks on the various machines and merges those pieces into a single sorted file.

(5) Adjacent keys in the merged file are compared to group records with the same key; the reduce method is then called for each key with its set of values, and the result is written to HDFS.
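
To make step (2) concrete, here is a minimal sketch of the default partitioning logic; Hadoop's built-in HashPartitioner behaves essentially like this (the class below is written out only for illustration and is not part of the WordCount example):

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner extends Partitioner<Text, IntWritable> {
    @Override
    public int getPartition(Text key, IntWritable value, int numReduceTasks) {
        // Mask off the sign bit so the result is non-negative,
        // then take the hash modulo the number of ReduceTasks.
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}

Each partition number produced this way corresponds to one ReduceTask and, in the end, to one output file (part-r-00000, part-r-00001, ...).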
