Installing Hadoop on CentOS 7 and connecting Eclipse to the HDFS file system - running a test case

I have written three posts before this one. This is the last and most important step: connecting Eclipse to Hadoop.

Please read the first three posts before reading this one.
Hadoop, Eclipse and the JDK are assumed to be installed already.

  1. Install the Hadoop Eclipse plugin

First, we need to download the Hadoop Eclipse plugin, which is available from GitHub at https://github.com/winghc/hadoop2x-eclipse-plugin (alternate download address: http://pan.baidu.com/s/1i4ikIoP ). After downloading it locally, I copy it to the download directory of the virtual machine.

The Baidu Netdisk download should be faster.
After downloading, we unzip the file.

The archive here was downloaded to /opt.

cd /opt
unzip hadoop2x-eclipse-plugin-master.zip
# after this the archive is extracted

Then we need to copy hadoop-eclipse-kepler-plugin-2.6.0.jar, found in the release folder of the extracted directory, into the Eclipse plugins installation directory (the release folder also ships jars built against other Hadoop versions; only the 2.6.0 one is needed here).

My Eclipse is also installed in /opt.

cp   /opt/hadoop-eclipse-kepler-plugin-2.6.0/release/hadoop-eclipse-kepler-plugin-2.6.0.jar  /opt/eclipse/plugins

At this point, the plugin installation is complete.
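
To double-check that the jar actually landed in Eclipse's plugins directory (paths here assume the same /opt layout used above), a quick listing helps:

ls /opt/eclipse/plugins | grep hadoop-eclipse
# should print hadoop-eclipse-kepler-plugin-2.6.0.jar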

  2. Configure hadoop-eclipse-kepler-plugin-2.6.0

Now start Eclipse via /opt/eclipse/eclipse, so check the path carefully.

Note that the environment can only be configured when Eclipse is started from this path,
and Hadoop must already be running.

cd /hadoop/hadoop-2.6.0
sbin/start-all.sh
#Start hadoop
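
Before starting Eclipse, it is worth confirming that the daemons actually came up; running jps should list the HDFS and YARN processes (the exact set depends on your configuration):

jps
# expected output includes NameNode, DataNode, SecondaryNameNode, ResourceManager and NodeManager (plus Jps itself)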

cd /opt/eclipse
./eclipse -clean

# then Eclipse starts

Then we begin configuring the environment.
As shown in the picture below, select Window -> Preferences

A window will pop up with a Hadoop Map/Reduce option on its left side. Click this option and set the Hadoop installation directory.


Note: fill in your own Hadoop installation directory; it cannot be an arbitrary path. Mine is /hadoop/hadoop-2.6.0, so that is what I entered.

Step 2: switch to the Map/Reduce development view. Select Open Perspective -> Other under the Window menu (on CentOS it is Window -> Perspective -> Open Perspective -> Other); in the window that pops up, select the Map/Reduce option to switch.



After that, you should see the interface change and the following icons appear

Step 3: establish a connection to the Hadoop cluster. Click the Map/Reduce Locations panel in the lower right corner of Eclipse, right-click inside the panel, and select New Hadoop Location (the blue elephant icon).

In the pop-up General panel, the settings should match the Hadoop configuration. In particular, the DFS Master must be consistent with the configuration file: since my fs.defaultFS is hdfs://lsn-linux:9000 , the Port of the DFS Master should be changed to 9000. The Port of the Map/Reduce(V2) Master can stay at its default, and the Location Name can be anything.


Of course, you can also use the local IP address directly, which works just as well; lsn-linux is simply a hostname mapped to the local IP address.
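
If you are not sure what fs.defaultFS is set to, Hadoop can print it for you (run from the Hadoop installation directory; the value shown is from my setup):

bin/hdfs getconf -confKey fs.defaultFS
# prints hdfs://lsn-linux:9000
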
After configuration, click the MapReduce Location entry in the Project Explorer on the left to view the file list in HDFS directly (there should already be files in HDFS; the figure below shows the output of WordCount). Double-click a file to view its content, and right-click to upload, download, or delete files in HDFS without cumbersome commands such as hdfs dfs -ls.
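
For comparison, the same operations from the command line would look something like this (the paths are illustrative):

bin/hdfs dfs -ls /                        # list the HDFS root
bin/hdfs dfs -put localfile.txt /somedir  # upload a file
bin/hdfs dfs -get /somedir/file.txt .     # download a file
bin/hdfs dfs -rm -r /somedir              # delete a directory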

Now let's create a new project


Fill in the Project name as WordCount, and click Finish to create the project.

At this point, you can see the project just created in the Project Explorer on the left.

Then we create a new package and a new class inside the project

As shown in the figure above
  3. Put some Hadoop configuration files under src and enable Hadoop debug output

Enable Hadoop debug logging so that problems with Hadoop are easier to diagnose.

vim  /etc/profile

Add the following code:

export HADOOP_ROOT_LOGGER=DEBUG,console
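
For the change to take effect in the current shell, re-read the profile (logging out and back in also works):

source /etc/profile
echo $HADOOP_ROOT_LOGGER   # should print DEBUG,console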

Transfer the configuration files.
/root/eclipse-workspace is my workspace path.
Execute the following commands; in order to override some of Eclipse's default parameters, the following files need to be copied into the project.

cp /hadoop/hadoop-2.6.0/etc/hadoop/core-site.xml  /root/eclipse-workspace/WordCount/src
cp /hadoop/hadoop-2.6.0/etc/hadoop/hdfs-site.xml   /root/eclipse-workspace/WordCount/src
cp /hadoop/hadoop-2.6.0/etc/hadoop/log4j.properties   /root/eclipse-workspace/WordCount/src
  4. Test

Copy the following code into the WordCount class file

package org.apache.hadoop.examples;

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // Mapper is a generic class with four type parameters: the types of the map
    // function's input key, input value, output key and output value. Instead of
    // Java's built-in types, Hadoop provides its own basic types, optimized for
    // network serialization, in the org.apache.hadoop.io package. Object is used
    // here because the input key may be of several types; Text corresponds to
    // Java's String and IntWritable to Java's Integer.
    public static class TokenizerMapper extends
            Mapper<Object, Text, Text, IntWritable> {

        // Define two variables
        // private final static LongWritable one = new LongWritable(1);
        private final static IntWritable one = new IntWritable(1); // every word occurrence counts as 1; this is the map output value
        private Text word = new Text(); // the current token of the line

        // Implement the map function.
        // Context is an inner class of Mapper whose top-level interface tracks the
        // status of a map or reduce task; MapContext records the context in which
        // map runs. It carries job configuration such as runtime parameters, which
        // can be read inside the map function - a classic way of passing parameters
        // in Hadoop. Context also acts as the bridge between the functions executed
        // during map and reduce, much like the session and application objects in
        // Java web development. In short, it stores the job configuration, the
        // InputSplit, the task ID and so on; here we mainly use its write method.
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            // Convert the input line of the plain text file into a String
            String line = value.toString();
            // StringTokenizer splits a string on a delimiter set. By default the set
            // is " \t\n\r": space, tab, newline and carriage return; a delimiter such
            // as ',' could also be specified.
            StringTokenizer itr = new StringTokenizer(line);
            while (itr.hasMoreTokens()) {
                // hasMoreTokens() tests whether more tokens are available from this tokenizer
                word.set(itr.nextToken()); // nextToken() returns the next token
                context.write(word, one);
            }
        }
    }

    public static class IntSumReducer extends
            Reducer<Text, IntWritable, Text, IntWritable> {

        private IntWritable result = new IntWritable();

        // Implement the reduce function
        public void reduce(Text key, Iterable<IntWritable> values,
                Context context) throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        // The Configuration class represents the job configuration; it loads
        // mapred-site.xml, hdfs-site.xml, core-site.xml and the other configuration files.
        Configuration conf = new Configuration();

        // Delete the output directory if it already exists, because the output
        // directory must not exist before the job starts.
        Path mypath = new Path("hdfs://lsn-linux:9000/usr/root"); // output path
        FileSystem hdfs = mypath.getFileSystem(conf); // get the file system
        if (hdfs.isDirectory(mypath)) {
            hdfs.delete(mypath, true);
        }

        // The Job object specifies the job execution specification and is used to
        // control how the whole job runs.
        Job job = Job.getInstance(conf, "word count");

        // When running a job on a Hadoop cluster, the code is packaged into a jar
        // file and shipped to the cluster. The jar does not have to be named
        // explicitly: setJarByClass() takes a class and Hadoop finds the jar that
        // contains it.
        job.setJarByClass(WordCount.class);

        job.setMapperClass(TokenizerMapper.class);
        // job.setReducerClass(IntSumReducer.class);
        job.setCombinerClass(IntSumReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // Usually the mapper and reducer have the same output types, so the two calls
        // above are enough. If they differ, the mapper's output key and value types
        // can be set separately:
        // job.setMapOutputKeyClass(Text.class);
        // job.setMapOutputValueClass(IntWritable.class);

        // Hadoop uses TextInputFormat and TextOutputFormat by default, so these calls
        // are not needed here:
        // job.setInputFormatClass(TextInputFormat.class);
        // job.setOutputFormatClass(TextOutputFormat.class);

        // The path given to FileInputFormat.addInputPath() may be a single file, a
        // directory, or a file pattern; the method can be called several times to
        // add multiple input paths.
        FileInputFormat.addInputPath(job, new Path(
                "hdfs://lsn-linux:9000/wordcount/zz.txt"));

        // There can be only one output path: the directory the reduce function writes
        // its output files into. The job fails if that directory already exists, which
        // prevents the output of a long-running job from being overwritten by accident.
        FileOutputFormat.setOutputPath(job, new Path(
                "hdfs://lsn-linux:9000/usz"));

        // job.waitForCompletion() submits the job and waits for it to finish. It
        // returns a boolean indicating success or failure, converted here into exit
        // code 0 or 1. The boolean argument is the verbose flag: when true, the job
        // polls its progress every second after submission, reports any change to the
        // console, and on completion prints the job counters on success or the error
        // that caused the failure otherwise.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}

After that, we need to create a folder named wordcount in the HDFS root directory and upload a file zz.txt into it (the hdfs commands for this step are sketched after the file contents below).
The contents of the file are as follows:

hadoop
hdfs
hadoop
world
hello
hadoop
hdfs
world
hello
hello
hadoop
hadoop
world
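
A minimal way to create the directory and upload the file from the shell, assuming zz.txt is in the current working directory, is:

cd /hadoop/hadoop-2.6.0
bin/hdfs dfs -mkdir -p /wordcount    # create the folder in the HDFS root
bin/hdfs dfs -put zz.txt /wordcount  # upload the local zz.txt
bin/hdfs dfs -ls /wordcount          # verify the upload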

Then we can run the program.
If you see results like the following, the configuration is successful.

Eclipse can now run normally and connect directly to the HDFS file system.
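
To double-check the result from the command line, the reducer writes its output into the directory configured in the code (hdfs://lsn-linux:9000/usz in my case), in a file conventionally named part-r-00000:

bin/hdfs dfs -ls /usz
bin/hdfs dfs -cat /usz/part-r-00000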
