Hadoop file segmentation idea
Requirement: count the number of students in each class in a text file (the file contains a very large number of records)
1500100129,Rong Jinan,23,Female,Liberal Arts Class Three
1500100130,Ning Huailian,21,Female,Science fourth class
1500100131,Hu Haoming,22,male,Sixth class of liberal arts
1500100132,Zeng Anhan,22,Female,Fifth class of liberal arts
1500100133,Qian Xiangshan,24,Female,Second class of science
1500100134,Ji Xuanlang,22,male,Science fourth class
1500100135,Yu Zhenhai,21,male,Science fourth class
1500100136,Li Kunpeng,22,male,Sixth class of liberal arts
1500100137,Xuan Xiangshan,22,Female,Science fourth class
1500100138,Luan Hongxin,22,male,Second class of liberal arts
1500100139,Zuo Daixuan,24,Female,Liberal Arts Class Three
1500100140,Yu Yunfa,24,male,Sixth class of liberal arts
1500100141,Xie Changxun,23,male,Sixth class of science
... ...
A simple simulation of Hadoop in Java for this counting task
Traditional method: use collections to count the number of students in each class
package day01;

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * @author WangTao
 * @date 2022/5/20 16:40
 */
public class Student_Count_Demo {
    public static void main(String[] args) {
        long start = System.currentTimeMillis();
        // Create a map collection that accumulates the result data
        HashMap<String, Integer> map = new HashMap<>();
        BufferedReader br = null;
        BufferedWriter bw = null;
        try {
            // Read the file to split; create a buffered character input stream
            br = new BufferedReader(new FileReader("src/day01/student"));
            String len = null;
            while ((len = br.readLine()) != null) {
                // Split by comma; the class name is the fifth field
                String[] split = len.split("[,]");
                String clazz = split[4];
                // If the class is not yet in the map, add it with count 1
                if (!map.containsKey(clazz)) {
                    map.put(clazz, 1);
                } else {
                    // Otherwise add 1 to the existing count
                    map.put(clazz, map.get(clazz) + 1);
                }
            }
            // Write the result set to the final file
            bw = new BufferedWriter(new FileWriter("src/day01/student_demo"));
            // Iterate over the collection
            Set<Map.Entry<String, Integer>> entries = map.entrySet();
            for (Map.Entry<String, Integer> entry : entries) {
                Integer value = entry.getValue();
                String key = entry.getKey();
                bw.write(key + ":" + value);
                bw.newLine();
                bw.flush();
            }
            // Print the elapsed time so this version can be compared against the distributed one
            System.out.println("Elapsed: " + (System.currentTimeMillis() - start) + " ms");
        } catch (IOException e) {
            e.printStackTrace();
        } finally {
            // Release resources
            if (bw != null) {
                try {
                    bw.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
Problems:
- Reading is inefficient, and scaling up a single server is expensive
- There are data security issues
- and so on
Hadoop method
Distributed thinking:
Split the data into multiple blocks (here, each row of data is treated as roughly 1 MB of data).
Store the blocks in a distributed way (HDFS) and assign a task to each block (map: each block gets its own thread).
Each block is computed independently, and then the per-block results are aggregated (reduce).
1. Step 1: split the file into blocks. The default size of each block is 128 MB, but the last block may hold up to 128 * 1.1 MB (roughly 140 MB): if what remains after the second-to-last block does not exceed 128 * 1.1 MB, it is not split again, and only one map task is allocated to compute it.
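For example, a 260 MB file first yields a 128 MB block, leaving 132 MB; since 132 MB does not exceed 128 * 1.1 = 140.8 MB, the remainder stays together as a single 132 MB block, so the file produces 2 map tasks rather than 3.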
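The splitting itself can be simulated with a small utility. What follows is a minimal sketch under this section's own assumptions: one row stands in for 1 MB, so a block is 128 rows, with the 1.1 slop rule for the last block. The class name Block is hypothetical, the input path src/day01/student is the file used above, and the output directory src/day02/blocks is the one read by the Map driver below (it must exist before running).

package day02;

import java.io.*;
import java.util.ArrayList;
import java.util.List;

/*
    Hypothetical splitter (a sketch, not the original listing): divides the student
    file into blocks of 128 lines each, treating one line as roughly 1 MB, so one
    block stands in for a 128 MB HDFS block. If the final remainder is no more than
    128 * 1.1 lines, it stays together as the last block instead of being split
    again, mirroring the slop rule described above.
 */
public class Block {
    private static final int BLOCK_SIZE = 128;    // "128 MB", i.e. 128 lines
    private static final double SPLIT_SLOP = 1.1; // the last block may be up to 1.1x

    public static void main(String[] args) throws IOException {
        // Read all lines of the source file into memory
        List<String> lines = new ArrayList<>();
        BufferedReader br = new BufferedReader(new FileReader("src/day01/student"));
        String line;
        while ((line = br.readLine()) != null) {
            lines.add(line);
        }
        br.close();

        int blockNo = 0;
        int start = 0;
        // Emit full 128-line blocks while more than 128 * 1.1 lines remain
        while (lines.size() - start > BLOCK_SIZE * SPLIT_SLOP) {
            writeBlock(lines, start, start + BLOCK_SIZE, blockNo++);
            start += BLOCK_SIZE;
        }
        // The remainder (at most 128 * 1.1 lines) becomes the last block
        writeBlock(lines, start, lines.size(), blockNo);
    }

    private static void writeBlock(List<String> lines, int from, int to, int blockNo) throws IOException {
        // Write the lines [from, to) of the source file into one block file
        BufferedWriter bw = new BufferedWriter(new FileWriter("src/day02/blocks/block-" + blockNo));
        for (int i = from; i < to; i++) {
            bw.write(lines.get(i));
            bw.newLine();
        }
        bw.close();
    }
}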
2. The file is divided into 8 blocks, so 8 map threads are created here:
package day02;

import java.io.File;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/**
 * @author WangTao
 * @date 2022/5/20 20:58
 */
/*
    Map (simulates Hadoop with a thread pool: each block generates one map task,
    and one map task corresponds to one thread. Within each divided block, count
    the number of students in each class.)
 */
public class Map {
    public static void main(String[] args) throws InterruptedException {
        // Create a thread pool with one thread per block
        ExecutorService executorService = Executors.newFixedThreadPool(8);
        // Define the file number, starting from 0
        int offset = 0;
        File file = new File("src/day02/blocks");
        File[] files = file.listFiles();
        for (File file1 : files) {
            MapTask mapTask = new MapTask(file1, offset);
            executorService.submit(mapTask);
            offset++;
        }
        // Stop accepting new tasks, then wait for all submitted map tasks to finish
        executorService.shutdown();
        executorService.awaitTermination(1, TimeUnit.HOURS);
        System.out.println("Distributed calculation of class sizes is complete!!!");
    }
}
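A note on the design: shutdown() only stops the pool from accepting new tasks; it does not wait for the submitted ones to finish. The awaitTermination(...) call blocks until every map task has completed, so the completion message is not printed early and the Reduce step does not see half-written intermediate files.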
3. Each map task run in the thread pool counts the number of students per class within its own block:
package day02;

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * @author WangTao
 * @date 2022/5/20 21:07
 */
public class MapTask implements Runnable {
    private File file;
    public int offset;

    public MapTask(File file, int offset) {
        this.file = file;
        this.offset = offset;
    }

    @Override
    public void run() {
        BufferedReader br = null;
        BufferedWriter bw = null;
        try {
            // Buffered character input stream for this task's block file
            br = new BufferedReader(new FileReader(file));
            // Create a HashMap to hold the per-class counts for this block
            HashMap<String, Integer> map = new HashMap<>();
            String lin = null;
            while ((lin = br.readLine()) != null) {
                // Split on commas; the class name is the fifth field
                String clazz = lin.split("[,]")[4];
                // If the class is not yet a key in the map, add it with value 1
                if (!map.containsKey(clazz)) {
                    map.put(clazz, 1);
                } else {
                    map.put(clazz, map.get(clazz) + 1);
                }
            }
            // Write this block's counts to a local intermediate file
            // Create a buffered character output stream
            bw = new BufferedWriter(new FileWriter("src/day02/block_counts/block---" + offset));
            // Iterate over the collection
            Set<Map.Entry<String, Integer>> entries = map.entrySet();
            for (Map.Entry<String, Integer> entry : entries) {
                Integer value = entry.getValue();
                String key = entry.getKey();
                bw.write(key + ":" + value);
                bw.newLine();
                bw.flush();
            }
        } catch (Exception e) {
            e.printStackTrace();
        } finally {
            if (bw != null) {
                try {
                    bw.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
            if (br != null) {
                try {
                    br.close();
                } catch (IOException e) {
                    e.printStackTrace();
                }
            }
        }
    }
}
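Each MapTask reads its own block file and keeps its counts in its own local HashMap, so the eight threads share no mutable state and need no synchronization; the only coordination happens later, when Reduce aggregates the per-block output files.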
4. Finally, aggregate (reduce) the results produced by the 8 map tasks to compute the final counts:
package day02;

import java.io.*;
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

/**
 * @author WangTao
 * @date 2022/5/20 21:51
 */
/*
    Aggregate the results of all map tasks once and compute the final counts
 */
public class Reduce {
    public static void main(String[] args) throws Exception {
        // Wrap the intermediate-results directory in a File object
        File file = new File("src/day02/block_counts");
        // Get the array of files underneath it
        File[] files = file.listFiles();
        // Create a map collection that accumulates the total counts
        HashMap<String, Integer> map = new HashMap<>();
        // Iterate over each block_counts file
        for (File file1 : files) {
            // Read the file and split each line
            // Create a buffered character input stream object
            BufferedReader br = new BufferedReader(new FileReader(file1));
            String len = null;
            while ((len = br.readLine()) != null) {
                // Split the line on ":" to get the key and value
                String[] split = len.split("[:]");
                String clazz = split[0];
                // Convert the string to an int for easy arithmetic
                Integer sum = Integer.valueOf(split[1]);
                // If the class is not yet a key in the map, add it
                if (!map.containsKey(clazz)) {
                    map.put(clazz, sum);
                } else {
                    // Otherwise add this block's count to the running total
                    map.put(clazz, map.get(clazz) + sum);
                }
            }
            // Close the input stream
            br.close();
        }
        // All files have been read; now write the final result file
        BufferedWriter bw = new BufferedWriter(new FileWriter("src/day02/finally_count/finally_sum"));
        // Traverse the map collection and write the data to the file
        Set<Map.Entry<String, Integer>> entries = map.entrySet();
        for (Map.Entry<String, Integer> entry : entries) {
            String key = entry.getKey();
            Integer value = entry.getValue();
            bw.write(key + ":" + value);
            bw.newLine();
            bw.flush();
        }
        // Close the resource
        bw.close();
    }
}
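Each intermediate line has the form class:count, so the reduce step only needs to split on ":" and sum the values per key, which is exactly the role the reduce phase plays in MapReduce. In real Hadoop the framework guarantees that reducers consume map output only after the map phase finishes; in this simulation that ordering comes from running Reduce as a separate program after the Map driver has exited.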
The benefits of distributed computing:
1) High reliability: Hadoop assumes that compute elements and storage will fail, so it maintains multiple copies of the working data, and processing can be redistributed to healthy nodes when a node fails.
2) High scalability: tasks and data are distributed across the cluster, which can easily scale out to thousands of nodes.
3) Efficiency: under the MapReduce model, Hadoop processes data in parallel, which speeds up task processing.
4) High fault tolerance: multiple copies of the data are saved automatically, and failed tasks are automatically reassigned.
5) Low cost (economical): Hadoop distributes and processes data on clusters built from ordinary, inexpensive machines, so the cost is very low.