Hadoop day01 (Java code simulates how Hadoop stores and counts data)

Hadoop's file-splitting idea

Requirement: count the number of people in each class in a text file (the file contains a very large number of records)

1500100129,Rong Jinan,23,Female,Liberal Arts Class 3
1500100130,Ning Huailian,21,Female,Science Class 4
1500100131,Hu Haoming,22,Male,Liberal Arts Class 6
1500100132,Zeng Anhan,22,Female,Liberal Arts Class 5
1500100133,Qian Xiangshan,24,Female,Science Class 2
1500100134,Ji Xuanlang,22,Male,Science Class 4
1500100135,Yu Zhenhai,21,Male,Science Class 4
1500100136,Li Kunpeng,22,Male,Liberal Arts Class 6
1500100137,Xuan Xiangshan,22,Female,Science Class 4
1500100138,Luan Hongxin,22,Male,Liberal Arts Class 2
1500100139,Zuo Daixuan,24,Female,Liberal Arts Class 3
1500100140,Yu Yunfa,24,Male,Liberal Arts Class 6
1500100141,Xie Changxun,23,Male,Science Class 6
...
...

A simple Java simulation of how Hadoop does the counting

Traditional method: use a collection (HashMap) to count the number of people in each class

package day01;
import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
/**
 * @author WangTao
 * @date 2022/5/20 16:40
 */
public class Student_Count_Demo {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        //Create a map collection that accumulates the per-class counts
        HashMap<String, Integer> map = new HashMap<>();
        BufferedReader br = null;
        BufferedWriter bw = null;
        try {
            //Read the file to be counted:
            //create a character buffered input stream over it
            br = new BufferedReader(new FileReader("src/day01/student"));
            String line = null;
            while ((line = br.readLine()) != null) {
                //Split on commas; the class name is the fifth field
                String[] split = line.split(",");
                String clazz = split[4];
                //If the class is not yet in the map collection, add it with count 1
                if (!map.containsKey(clazz)) {
                    map.put(clazz, 1);
                } else {
                    //If it exists, add 1 to the original value
                    map.put(clazz, map.get(clazz) + 1);
                }
            }
            //Write the result set to the final file
            bw = new BufferedWriter(new FileWriter("src/day01/student_demo"));
            //Iterate over the collection
            Set<Map.Entry<String, Integer>> entries = map.entrySet();
            for (Map.Entry<String, Integer> entry : entries) {
                bw.write(entry.getKey() + ":" + entry.getValue());
                bw.newLine();
            }
            bw.flush();
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            //Release resources
            if (bw != null) {
                try {
                    bw.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
        //Report how long the single-threaded count took
        System.out.println("Elapsed: " + (System.currentTimeMillis() - start) + " ms");
    }
}
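
As a side note, on Java 8+ the containsKey/put branch can be collapsed into a single call to Map.merge; this is just a more compact equivalent of the same counting logic:

//Equivalent to the if/else above: store 1 on first sight, otherwise add 1 to the old value
map.merge(clazz, 1, Integer::sum);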

Problems:

  • Reads are inefficient: a single machine scans the whole file serially, and scaling up a single server is expensive
  • Data safety is at risk: there is only one copy of the data
  • and so on

Hadoop method

Distributed thinking:

Divide the data into multiple blocks (in this simulation, each row of data stands in for 1 MB of data).

Store the blocks in a distributed way (HDFS) and assign a task to each block (MAP: each block gets its own thread).

Each block is counted separately, and then the per-block results are aggregated (Reduce).

1. Step 1: Block. The default block size is 128 MB, and a 1.1 factor is applied when splitting: as long as the remaining data exceeds 128 MB * 1.1 (about 140.8 MB), a full 128 MB block is carved off; once the remainder is at or below that threshold, it is kept as a single last block, and only one map task is allocated for it. For example, a 260 MB file yields two splits (128 MB + 132 MB) rather than three, because 132 MB is under the 140.8 MB threshold. The sketch below makes the rule concrete.
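
The split computation can be sketched in a few lines. numSplits is a hypothetical helper written for illustration, not a Hadoop API:

public class SplitRuleDemo {
    static final long BLOCK_SIZE = 128L * 1024 * 1024;   //128 MB default block size

    //Hypothetical helper: how many splits the 1.1 rule produces for a file of this size
    static int numSplits(long fileSize) {
        int splits = 0;
        long remaining = fileSize;
        //Carve off full blocks while the remainder still exceeds 1.1 * block size
        while (remaining > BLOCK_SIZE * 1.1) {
            remaining -= BLOCK_SIZE;
            splits++;
        }
        //Whatever is left (at most ~140.8 MB) becomes the final split
        if (remaining > 0) {
            splits++;
        }
        return splits;
    }

    public static void main(String[] args) {
        long mb = 1024 * 1024;
        System.out.println(numSplits(260 * mb));   //2 -> 128 MB + 132 MB
        System.out.println(numSplits(300 * mb));   //3 -> 128 MB + 128 MB + 44 MB
    }
}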

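Applied to this simulation, where one row stands in for 1 MB, a block holds 128 rows and the last block may grow to about 140 rows. Below is a minimal sketch of the splitting step, using the paths the later steps rely on (src/day01/student in, src/day02/blocks out); the class name Block and the in-memory line list are assumptions of this sketch, not taken from Hadoop. For a 1,000-student file it produces the 8 block files that the Map step below consumes.

package day02;
import java.io.*;
import java.util.ArrayList;
import java.util.List;
/*
    Block (a sketch of step 1 for this simulation: one row stands in for 1 MB,
    so one block holds 128 rows, and the 1.1 rule keeps a remainder of up to
    ~140 rows in the last block. For 1,000 students this produces 8 block files.)
 */
public class Block {
    public static void main(String[] args) throws IOException {
        //Read every line of the source file into memory
        List<String> lines = new ArrayList<>();
        BufferedReader br = new BufferedReader(new FileReader("src/day01/student"));
        String line;
        while ((line = br.readLine()) != null) {
            lines.add(line);
        }
        br.close();
        //Make sure the output directory for the block files exists
        new File("src/day02/blocks").mkdirs();
        int blockSize = 128;   //one row simulates 1 MB, so 128 rows = one 128 MB block
        int index = 0;         //block file number
        int pos = 0;
        while (pos < lines.size()) {
            int remaining = lines.size() - pos;
            //1.1 rule: if what is left fits within 128 * 1.1, keep it as one final block
            int take = (remaining <= blockSize * 1.1) ? remaining : blockSize;
            BufferedWriter bw = new BufferedWriter(new FileWriter("src/day02/blocks/block-" + index));
            for (int i = 0; i < take; i++) {
                bw.write(lines.get(pos + i));
                bw.newLine();
            }
            bw.close();
            pos += take;
            index++;
        }
        System.out.println("Split into " + index + " block files");
    }
}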

2. The data ends up divided into 8 blocks, so 8 threads are created here, one map task per block

package day02;
import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
/**
 * @author WangTao
 * @date 2022/5/20 20:58
 */
/*
    Map (simulated with a thread pool: each block in Hadoop generates one map task,
    and one map task is represented here by one thread.
    Within its own block, each task counts the number of people in each class.)
 */
public class Map {
    public static void main(String[] args) throws InterruptedException {
        //Create a thread pool with one thread per block
        ExecutorService executorService = Executors.newFixedThreadPool(8);
        //Define the file number for the per-block output, starting from 0
        int offset = 0;
        File file = new File("src/day02/blocks");
        File[] files = file.listFiles();
        for (File file1 : files) {
            MapTask mapTask = new MapTask(file1, offset);
            executorService.submit(mapTask);
            offset++;
        }
        //Stop accepting new tasks, then wait for the submitted ones to finish
        executorService.shutdown();
        executorService.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("Distributed calculation of class size is complete!!!");
    }
}
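
Note that shutdown() only stops the pool from accepting new tasks; it is awaitTermination that blocks until all 8 map tasks have finished, which is why the completion message is printed after it. In real Hadoop, the scheduler would also try to place each map task on a node that already stores the block (data locality), rather than on 8 local threads.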

3. The following map task runs in the thread pool: it counts the number of people per class within its own block

package day02;
import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
/**
 * @author WangTao
 * @date 2022/5/20 21:07
 */
public class MapTask implements Runnable {
    private File file;
    private int offset;

    public MapTask(File file, int offset) {
        this.file = file;
        this.offset = offset;
    }

    @Override
    public void run() {
        BufferedReader br = null;
        BufferedWriter bw = null;
        try {
            //Character buffered input stream over this task's block file
            br = new BufferedReader(new FileReader(file));
            //Create a HashMap collection to hold this block's per-class counts
            HashMap<String, Integer> map = new HashMap<>();
            String line = null;
            while ((line = br.readLine()) != null) {
                //Separate with commas; the class name is the fifth field
                String clazz = line.split(",")[4];
                //If the class is not yet a key in the map, store it with value 1
                if (!map.containsKey(clazz)) {
                    map.put(clazz, 1);
                } else {
                    //Otherwise add 1 to the existing count
                    map.put(clazz, map.get(clazz) + 1);
                }
            }
            //Write this block's counts from the collection to its own intermediate file
            //Make sure the output directory exists, then create the output stream
            new File("src/day02/block_counts").mkdirs();
            bw = new BufferedWriter(new FileWriter("src/day02/block_counts/block---" + offset));
            //Iterate over the collection
            Set<Map.Entry<String, Integer>> entries = map.entrySet();
            for (Map.Entry<String, Integer> entry : entries) {
                bw.write(entry.getKey() + ":" + entry.getValue());
                bw.newLine();
            }
            bw.flush();
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (bw != null) {
                try {
                    bw.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
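
Each map task therefore leaves one small intermediate file of class:count lines in src/day02/block_counts. The counts below are invented purely to illustrate the format that Reduce parses next:

Science Class 4:17
Liberal Arts Class 3:12
Science Class 6:15
...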

4. Finally, aggregate the results produced by the 8 map tasks and count the final number of people (the Reduce step)

package day02;
import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;
/**
 * @author WangTao
 * @date 2022/5/20 21:51
 */
/*
        The results of the map tasks are aggregated once, and the final number of people is counted
 */
public class Reduce {
    public static void main(String[] args) throws Exception {
        //Wrap the intermediate-results directory in a File object
        File file = new File("src/day02/block_counts");
        //Get the array of per-block result files inside it
        File[] files = file.listFiles();
        //Create a map collection that accumulates the grand totals
        HashMap<String, Integer> map = new HashMap<>();
        //Iterate over each block_counts file
        for (File file1 : files) {
            //Read and split the file:
            //create a character buffered input stream object
            BufferedReader br = new BufferedReader(new FileReader(file1));
            String line = null;
            while ((line = br.readLine()) != null) {
                //Each line is "class:count"; split on the colon to get key and value
                String[] split = line.split(":");
                String clazz = split[0];
                //Convert the count from String to Integer so it can be added up
                Integer sum = Integer.valueOf(split[1]);
                //If the class is not yet in the map, take the count as-is
                if (!map.containsKey(clazz)) {
                    map.put(clazz, sum);
                } else {
                    //If it exists, add the counts together
                    map.put(clazz, map.get(clazz) + sum);
                }
            }
            //Close this intermediate file
            br.close();
        }
        //All intermediate files have been read; now write the final totals
        new File("src/day02/finally_count").mkdirs();
        BufferedWriter bw = new BufferedWriter(new FileWriter("src/day02/finally_count/finally_sum"));
        //Traverse the map collection and write each class with its total
        Set<Map.Entry<String, Integer>> entries = map.entrySet();
        for (Map.Entry<String, Integer> entry : entries) {
            bw.write(entry.getKey() + ":" + entry.getValue());
            bw.newLine();
        }
        bw.flush();
        //Close the resource
        bw.close();
    }
}
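
The phases are separate main methods and must run in order: first the splitting step, then Map, then Reduce. If a one-shot run is convenient, a hypothetical driver could chain them (this assumes the Map.main above, which waits for its pool to drain before returning):

package day02;
/*
    Hypothetical convenience driver for this simulation: runs the phases back to back
 */
public class JobRunner {
    public static void main(String[] args) throws Exception {
        Block.main(args);    //step 1: split the student file into block files
        Map.main(args);      //step 2: count each block on its own thread
        Reduce.main(args);   //step 3: merge the per-block counts into the final totals
    }
}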

The benefits of distributed computing:

1) High reliability: Hadoop assumes that compute elements and storage will fail, so it maintains multiple copies of working data and can redistribute processing away from failed nodes.

2) High scalability: Tasks and data are distributed across the cluster, which can easily be expanded to thousands of nodes.

3) Efficiency: Under the MapReduce model, Hadoop works in parallel, which speeds up task processing.

4) High fault tolerance: Multiple copies of the data are saved automatically, and failed tasks are automatically reassigned.

5) Low cost (economical): Hadoop distributes and processes data on clusters built from ordinary, cheap machines, so the cost is very low.
