Hadoop foundation 11: User behavior log analysis

For the source code, see: https://github.com/hiszm/hadoop-train

User behavior log overview

  • Records of every search and click made by the user
  • Historical behavior data, e.g. from past orders

==> Use this data to make recommendations and ultimately improve user conversion (the end goal)

Log content

20979872853^Ahttp://www.yihaodian.com/1/?type=3&tracker_u=10974049258^A^A^A3^ABAWG49VCYYTMZ6VU9XX74KPV5CCHPAQ2A4A5^A^A^A^A^APPG68XWJNUSSX649S4YQTCT6HBMS9KBA^A10974049258^A\N^A27.45.216.128^A^A,unionKey:10974049258^A^A2013-07-21 18:58:21^A\N^A^A1^A^A\N^Anull^A247^A^A^A^A^AMozilla/5.0 (compatible; MSIE 10.0; Windows NT 6.1; WOW64; Trident/6.0; SLCC2; .NET CLR 2.0.50727; .NET CLR 3.5.30729; .NET CLR 3.0.30729; Media Center PC 6.0; .NET4.0C)^AWin32^A^A^A^A^A^A^A^A^AGuangdong Province^A20^A^AHuizhou City^A^A^A^A^A^A^A\N^A\N^A\N^A\N^A2013-07-21
20977690754^Ahttp://www.yihaodian.com/1/?type=3&tracker_u=10781349260^A^A^A3^A49FDCP696X2RDDRC2ED6Y4JVPTEVFNDADF1D^A^A^A^A^APPGCTKD92UT3DR7KY1VFZ92ZU4HEP479^A10781349260^A\N^A101.…(IP truncated in the source)…^A…(long run of empty ^A-delimited fields)…^A2013-07-21 18:11:46^AShanghai^A^A^A^A^A^A\N^A\N^A\N^A\N^A2013-07-21
  • Field 2: the URL => page ID
  • Field 14: the IP address => region (via IP lookup)
  • Field 18: the access time
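
For orientation, these fields can be pulled out by splitting a record on the ^A (\u0001) delimiter. A minimal sketch (FieldDemo is a hypothetical helper written for illustration; the field positions match the LogParser shown later):

public class FieldDemo {

    public static void main(String[] args) {
        String line = args[0]; // one raw ^A-delimited record
        String[] fields = line.split("\001"); // "\001" is the ^A (\u0001) separator

        System.out.println("url  (field 2):  " + fields[1]);
        System.out.println("ip   (field 14): " + fields[13]);
        System.out.println("time (field 18): " + fields[17]);
    }
}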

Common terms of e-commerce

1. Ad Views (ad impressions): the number of times online advertisements are viewed by users.
2. PV (Page View): page views, counted every time a page is loaded or refreshed; the total number of times the site's pages have been viewed. A single visitor can generate a dozen or more page views. Put another way: how many pages users open or refresh on your site.
3. Impression: each display of a page (or ad) requested by a user counts as one impression. An advertiser who wants 100,000 people to see an ad wants 100,000 impressions. It is one of the metrics for evaluating advertising effectiveness.
4. UV (Unique Visitor): one client machine visiting the site (or seeing an ad) counts as one visitor; the same client is counted only once within 24 hours.
5. IP (independent IP): the count of distinct Internet Protocol addresses; the same IP address is counted only once within 24 hours.
6. URL (Uniform Resource Locator): gives the location of any server, file, or image on the Internet; via hypertext links, users follow URLs to the information they need, e.g. a landing-page URL.
7. Key Word (keyword).
8. HTML (Hypertext Markup Language): a text-based page description language, the common authoring language of web pages.
9. Bandwidth: the amount of information (text, images, audio, video) a transmission line can carry in a given time. The higher the bandwidth, the faster pages load; limited bandwidth is why image files on web pages are kept as small as possible.
10. Browser Cache: to speed up browsing, the browser stores recently visited pages on the local disk; on a repeat visit it can render the page from disk instead of fetching it from the server.
11. Cookie: a file on the client machine that records the user's behavior on the network; a site can use cookies to recognize whether a user has visited before.
12. Database: generally, information classified and organized with computer technology for easy search and management. In online marketing it refers to collecting, filing, and managing users' personal information over the Internet, e.g. name, gender, age, address, telephone number, hobbies, and purchasing behavior.
13. Targeting: delivering the most appropriate advertisement to users through content matching, audience composition, or filtering; in other words, aiming ads at precisely the right customers.
14. Traffic: the number and type of visits a site receives.

Project requirements description

  • Count total page views (PV)
  • Count visits per province (via IP)
  • Count visits per page (via URL)

Data processing flow and technical architecture

Implementing the browsing statistics

Counting total page views
Map every record to a single fixed key with the value 1; the reducer then sums the 1s.

package com.bigdata.hadoop.mr.project.mrv1;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;

/**
 * Version 1: total page view (PV) statistics
 */
public class PVStatApp {

    public static void main(String[] args) throws Exception {

        Configuration configuration = new Configuration();

        // Clean up the output directory first, otherwise re-runs fail
        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path("output/pv1");
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        Job job = Job.getInstance(configuration);
        job.setJarByClass(PVStatApp.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setOutputKeyClass(NullWritable.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(job, new Path("input/raw/trackinfo_20130721.data"));
        FileOutputFormat.setOutputPath(job, outputPath);

        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        // Every record maps to the same fixed key with value 1
        private Text KEY = new Text("key");
        private LongWritable ONE = new LongWritable(1);

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            context.write(KEY, ONE);
        }
    }

    static class MyReducer extends Reducer<Text, LongWritable, NullWritable, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {

            // Sum the partial counts (each map output value is 1)
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }

            context.write(NullWritable.get(), new LongWritable(count));
        }
    }
}

Output: 300000
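
One optional tweak, not in the original post: since the reducer sums LongWritable values, Hadoop's built-in LongSumReducer can be registered as a combiner to pre-aggregate the 1s on the map side and cut shuffle traffic:

job.setCombinerClass(org.apache.hadoop.mapreduce.lib.reduce.LongSumReducer.class);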

Counting visits per province

Count the traffic of each province; in SQL terms:
select province, count(1) from xxx group by province
Region information <= IP parsing: we need a way to turn an IP address into its region.

package com.bigdata.hadoop.hdfs;

import com.bigdata.hadoop.mr.project.utils.IPParser;
import org.junit.Test;

public class Iptest {
    @Test
    public void testIP() {
        // Look up a sample IP in the IP library
        IPParser.RegionInfo regionInfo = IPParser.getInstance().analyseIp("58.32.19.255");
        System.out.println(regionInfo.getCountry());
        System.out.println(regionInfo.getProvince());
        System.out.println(regionInfo.getCity());
    }
}

 China
 Shanghai
null

Process finished with exit code 0

IP library resolution
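
The IPParser used above ships with the course repo and its source is not reproduced in this post. Judging from how it is called, its public surface looks roughly like the following sketch (the lookup body is a placeholder; the real class loads a local IP region library):

public class IPParser {

    private static final IPParser instance = new IPParser();

    private IPParser() {
        // the real version loads the IP region library here
    }

    public static IPParser getInstance() {
        return instance;
    }

    // Look up an IP in the region library; returns null when the lookup fails
    public RegionInfo analyseIp(String ip) {
        return null; // placeholder: the real implementation queries the IP library
    }

    public static class RegionInfo {
        private String country;
        private String province;
        private String city;

        public String getCountry() { return country; }
        public String getProvince() { return province; }
        public String getCity() { return city; }
    }
}

LogParser below wraps this parser and extracts the fields we need from each raw record.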

package com.bigdata.hadoop.mr.project.utils;

import org.apache.commons.lang.StringUtils;

import java.util.HashMap;
import java.util.Map;

public class LogParser {

    public Map<String, String> parse(String log) {
        IPParser ipParser = IPParser.getInstance();
        Map<String, String> info = new HashMap<>();

        if (StringUtils.isNotBlank(log)) {
            // Raw records are delimited by ^A (\u0001)
            String[] splits = log.split("\001");

            // Field 14 (index 13) is the client IP
            String ip = splits[13];
            String country = "-";
            String province = "-";
            String city = "-";
            IPParser.RegionInfo regionInfo = ipParser.analyseIp(ip);

            if (regionInfo != null) {
                country = regionInfo.getCountry();
                province = regionInfo.getProvince();
                city = regionInfo.getCity();
            }

            info.put("ip", ip);
            info.put("country", country);
            info.put("province", province);
            info.put("city", city);

           
        }

        return info;
    }
}
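
A quick sanity check of the parser (LogParserTest is a hypothetical test added for illustration, in the same style as Iptest above):

package com.bigdata.hadoop.hdfs;

import com.bigdata.hadoop.mr.project.utils.LogParser;
import org.junit.Test;

import java.util.Collections;
import java.util.Map;

public class LogParserTest {

    @Test
    public void testParse() {
        // Build a synthetic 18-field record: field 14 (index 13) carries the IP
        String[] fields = Collections.nCopies(18, "").toArray(new String[0]);
        fields[13] = "58.32.19.255";
        fields[17] = "2013-07-21 18:58:21";
        String line = String.join("\001", fields);

        Map<String, String> info = new LogParser().parse(line);
        System.out.println(info.get("province") + " / " + info.get("city"));
    }
}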

Implementation

package com.bigdata.hadoop.mr.project.mrv1;

import com.bigdata.hadoop.mr.project.utils.IPParser;
import com.bigdata.hadoop.mr.project.utils.LogParser;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.Map;

public class ProvinceStatApp {

    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();

        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path("output/v1/provincestat");
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        Job job = Job.getInstance(configuration);
        job.setJarByClass(ProvinceStatApp.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(job, new Path("input/raw/trackinfo_20130721.data"));
        FileOutputFormat.setOutputPath(job, outputPath);

        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private LongWritable ONE = new LongWritable(1);

        private LogParser logParser;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            logParser = new LogParser();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String log = value.toString();

            Map<String, String> info = logParser.parse(log);
            String ip = info.get("ip");

            // Records whose region cannot be resolved fall into the "-" bucket.
            // Note: LogParser.parse() already resolved the region, so info.get("province")
            // could be used here directly instead of a second IP lookup.
            String province = "-";
            if (StringUtils.isNotBlank(ip)) {
                IPParser.RegionInfo regionInfo = IPParser.getInstance().analyseIp(ip);
                if (regionInfo != null && StringUtils.isNotBlank(regionInfo.getProvince())) {
                    province = regionInfo.getProvince();
                }
            }
            context.write(new Text(province), ONE);
        }
    }

    static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }

            context.write(key, new LongWritable(count));
        }
    }
}


-	923
 Shanghai	72898
 Yunnan Province	1480
 Inner Mongolia Autonomous Region	1298
 Beijing	42501
 Taiwan Province	254
 Jilin Province	1435
 Sichuan Province	4442
 Tianjin	11042
 Ningxia	352
 Anhui Province	5429
 Shandong Province	10145
 Shanxi Province	2301
 Guangdong Province	51508
 Guangxi	1681
 Xinjiang	840
 Jiangsu Province	25042
 Jiangxi Province	2238
 Hebei Province	7294
 Henan Province	5279
 Zhejiang Province	20627
 Hainan 	814
 Hubei province	7187
 Hunan Province	2858
 Macao Special Administrative Region	6
 Gansu Province	1039
 Fujian Province	8918
 Tibet	110
 Guizhou Province	1084
 Liaoning Province	2341
 Chongqing City	1798
 Shaanxi Province	2487
 Qinghai Province	336
 Hong Kong Special Administrative Region	45
 Heilongjiang Province	1968


Per-page visit statistics

Counting visits per page
Extract the page ID that matches the expected rule from each URL, then aggregate the counts.

Extracting the page ID

package com.bigdata.hadoop.mr.project.utils;

import org.apache.commons.lang.StringUtils;

import java.util.HashMap;
import java.util.Map;

public class LogParser {

    public Map<String, String> parse(String log) {
        IPParser ipParser = IPParser.getInstance();
        Map<String, String> info = new HashMap<>();

        if (StringUtils.isNotBlank(log)) {
            String[] splits = log.split("\001");

            String ip = splits[13];
            String country = "-";
            String province = "-";
            String city = "-";
            IPParser.RegionInfo regionInfo = ipParser.analyseIp(ip);

            if (regionInfo != null) {
                country = regionInfo.getCountry();
                province = regionInfo.getProvince();
                city = regionInfo.getCity();
            }

            info.put("ip", ip);
            info.put("country", country);
            info.put("province", province);
            info.put("city", city);

            // Field 2 (index 1) is the URL
            String url = splits[1];
            info.put("url", url);

            // Field 18 (index 17) is the access time
            String time = splits[17];
            info.put("time", time);
        }

        return info;
    }

    public Map<String, String> parseV2(String log) {
        Map<String, String> info = new HashMap<>();

        if (StringUtils.isNotBlank(log)) {
            // ETL output is tab-delimited: ip, country, province, city, url, time, pageId
            String[] splits = log.split("\t");

            String ip = splits[0];
            String country = splits[1];
            String province = splits[2];
            String city = splits[3];
            // No second IP lookup needed: the ETL step already resolved the region

            info.put("ip", ip);
            info.put("country", country);
            info.put("province", province);
            info.put("city", city);

            String url = splits[4];
            info.put("url", url);

        
        }

        return info;
    }
}
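
The PageStatApp below depends on ContentUtils.getPageId, which the post never shows. A minimal sketch, under the assumption that page IDs travel in a topicId query parameter (URLs without one fall back to "-", which matches the large "-" bucket in the output below):

package com.bigdata.hadoop.mr.project.utils;

import org.apache.commons.lang.StringUtils;

import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ContentUtils {

    private static final Pattern TOPIC_ID = Pattern.compile("topicId=([0-9]+)");

    // Extract the page ID from a URL; returns "-" when no ID can be found
    public static String getPageId(String url) {
        if (StringUtils.isBlank(url)) {
            return "-";
        }
        Matcher matcher = TOPIC_ID.matcher(url);
        return matcher.find() ? matcher.group(1) : "-";
    }
}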


Implementation

package com.bigdata.hadoop.mr.project.mrv1;

import com.bigdata.hadoop.mr.project.utils.ContentUtils;
import com.bigdata.hadoop.mr.project.utils.LogParser;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.Map;

public class PageStatApp {
    public static void main(String[] args) throws IOException, ClassNotFoundException, InterruptedException {
        Configuration configuration = new Configuration();

        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path("output/v1/pagestat");
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        Job job = Job.getInstance(configuration);
        job.setJarByClass(PageStatApp.class);

        job.setMapperClass(MyMapper.class);
        job.setReducerClass(MyReducer.class);

        job.setMapOutputKeyClass(Text.class);
        job.setMapOutputValueClass(LongWritable.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(LongWritable.class);

        FileInputFormat.setInputPaths(job, new Path("input/raw/trackinfo_20130721.data"));
        FileOutputFormat.setOutputPath(job, outputPath);

        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {
        private LongWritable ONE = new LongWritable(1);

        private LogParser logParser;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            logParser = new LogParser();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String log = value.toString();

            Map<String, String> info = logParser.parse(log);
            String url = info.get("url");

            if (StringUtils.isNotBlank(url)) {
                String pageId = ContentUtils.getPageId(url);

                context.write(new Text(pageId), ONE);
            }
        }
    }

    static class MyReducer extends Reducer<Text, LongWritable, Text, LongWritable> {
        @Override
        protected void reduce(Text key, Iterable<LongWritable> values, Context context) throws IOException, InterruptedException {
            long count = 0;
            for (LongWritable value : values) {
                count += value.get();
            }

            context.write(key, new LongWritable(count));
        }
    }
}
-	298827
13483	19
13506	15
13729	9
13735	2
13736	2
14120	28
14251	1
14572	14
14997	2
15065	1
17174	1
17402	1
17449	2
17486	2
17643	7
18952	14
18965	1
18969	32
18970	27
18971	1
18972	3
18973	8
18977	10
18978	5
18979	11
18980	8
18982	50
18985	5
18988	2
18991	27
18992	4
18994	3
18996	3
18997	3
18998	2
18999	4
19000	5
19004	23
19006	4
19009	1
19010	1
19013	1
20154	2
20933	1
20953	5
21208	11
21340	1
21407	1
21484	1
21826	8
22068	1
22107	4
22114	4
22116	5
22120	6
22123	13
22125	1
22127	16
22129	3
22130	3
22140	1
22141	5
22142	8
22143	5
22144	1
22146	5
22169	1
22170	20
22171	51
22180	4
22196	75
22249	4
22331	6
22372	1
22373	1
22805	3
22809	3
22811	5
22813	11
23203	1
23481	194
23541	1
23542	1
23704	1
23705	1
3541	2
8101	36
8121	32
8122	38
9848	2
9864	1


ETL for data processing

Running the v1 jobs against the full raw file produces output like this:

[INFO ] method:org.apache.hadoop.mapred.MapTask$MapOutputBuffer.init(MapTask.java:1008)
kvstart = 26214396; length = 6553600

Counters: 30
	File System Counters
		FILE: Number of bytes read=857754791
		FILE: Number of bytes written=23557997
		FILE: Number of read operations=0
		FILE: Number of large read operations=0
		FILE: Number of write operations=0
	Map-Reduce Framework
		Map input records=300000
		Map output records=299797
		Map output bytes=3001739
		Map output materialized bytes=3601369
		Input split bytes=846
		Combine input records=0
		Combine output records=0
		Reduce input groups=92
		Reduce shuffle bytes=3601369
		Reduce input records=299797
		Reduce output records=92
		Spilled Records=599594
		Shuffled Maps =6
		Failed Shuffles=0
		Merged Map outputs=6
		GC time elapsed (ms)=513
		Total committed heap usage (bytes)=3870818304
	Shuffle Errors
		BAD_ID=0
		CONNECTION=0
		IO_ERROR=0
		WRONG_LENGTH=0
		WRONG_MAP=0
		WRONG_REDUCE=0
	File Input Format Counters 
		Bytes Read=173576072
	File Output Format Counters 
		Bytes Written=771

We found that every job run against the raw data repeats the same expensive parsing, so the data should be pre-processed first.

This is where ETL comes in.

ETL (Extract, Transform, Load) describes the process of extracting data from a source, transforming it, and loading it into a destination.

  • The full raw dataset is inconvenient to compute over directly; pre-processing it once, step by step, supports the later per-dimension statistics
  • Parse out only the data we need, e.g. IP => region information
  • Remove the unnecessary fields
  • Keep: ip / time / url / page_id / country / province / city

package com.bigdata.hadoop.mr.project.mrv2;

import com.bigdata.hadoop.mr.project.utils.ContentUtils;
import com.bigdata.hadoop.mr.project.utils.LogParser;
import org.apache.commons.lang.StringUtils;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

import java.io.IOException;
import java.util.Map;

public class ETLApp {
    public static void main(String[] args) throws Exception {
        Configuration configuration = new Configuration();

        FileSystem fileSystem = FileSystem.get(configuration);
        Path outputPath = new Path("input/etl");
        if (fileSystem.exists(outputPath)) {
            fileSystem.delete(outputPath, true);
        }

        Job job = Job.getInstance(configuration);
        job.setJarByClass(ETLApp.class);

        job.setMapperClass(MyMapper.class);

        job.setMapOutputKeyClass(NullWritable.class);
        job.setMapOutputValueClass(Text.class);

        FileInputFormat.setInputPaths(job, new Path("input/raw/trackinfo_20130721.data"));
        FileOutputFormat.setOutputPath(job, new Path("input/etl"));

        job.waitForCompletion(true);
    }

    static class MyMapper extends Mapper<LongWritable, Text, NullWritable, Text> {

        private LogParser logParser;

        @Override
        protected void setup(Context context) throws IOException, InterruptedException {
            logParser = new LogParser();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            String log = value.toString();

            Map<String, String> info = logParser.parse(log);

            String ip = info.get("ip");
            String country = info.get("country");
            String province = info.get("province");
            String city = info.get("city");
            String url = info.get("url");
            String time = info.get("time");
            String pageId = ContentUtils.getPageId(url);


            StringBuilder builder = new StringBuilder();
            builder.append(ip).append("\t");
            builder.append(country).append("\t");
            builder.append(province).append("\t");
            builder.append(city).append("\t");
            builder.append(url).append("\t");
            builder.append(time).append("\t");
            builder.append(pageId);

            context.write(NullWritable.get(), new Text(builder.toString()));
        }
    }
}
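
A small optional improvement, not in the original code: ETLApp sets no reducer, so Hadoop still runs a default identity reduce and a needless shuffle. Declaring the job map-only writes the mapper output straight to input/etl:

job.setNumReduceTasks(0); // map-only: skip the shuffle and the identity reduce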



106.3.114.42	China	Beijing	null	http://www.yihaodian.com/2/?tracker_u=10325451727&tg=boomuserlist%3A%3A2463680&pl=www.61baobao.com&creative=30392663360&kw=&gclid=CPC2idPRv7gCFQVZpQodFhcABg&type=2	2013-07-21 11:24:56	-
58.219.82.109	China	Jiangsu Province	Wuxi City	http://www.yihaodian.com/5/?tracker_u=2225501&type=4	2013-07-21 13:57:11	-
58.219.82.109	China	Jiangsu Province	Wuxi City	http://search.yihaodian.com/s2/c0-0/k%25E7%25A6%258F%25E4%25B8%25B4%25E9%2597%25A8%25E9%2587%2591%25E5%2585%25B8%25E7%2589%25B9%25E9%2580%2589%25E4%25B8%259C%25E5%258C%2597%25E7%25B1%25B35kg%2520%25E5%259B%25BD%25E4%25BA%25A7%25E5%25A4%25A7%25E7%25B1%25B3%2520%25E6%2599%25B6%25E8%258E%25B9%25E5%2589%2594%25E9%2580%258F%2520%25E8%2587%25AA%25E7%2584%25B6%2520%2520/5/	2013-07-21 13:50:48	-
58.219.82.109	China	Jiangsu Province	Wuxi City	http://search.yihaodian.com/s2/c0-0/k%25E8%258C%25B6%25E8%258A%25B1%25E8%2582%25A5%25E7%259A%2582%25E7%259B%2592%25202213%2520%25E5%258D%25AB%25E7%2594%259F%25E7%259A%2582%25E7%259B%2592%2520%25E9%25A6%2599%25E7%259A%2582%25E7%259B%2592%2520%25E9%25A2%259C%25E8%2589%25B2%25E9%259A%258F%25E6%259C%25BA%2520%2520/5/	2013-07-21 13:57:16	-
58.219.82.109	China	Jiangsu Province	Wuxi City	http://www.yihaodian.com/5/?tracker_u=2225501&type=4	2013-07-21 13:50:13	-
218.11.179.22	China	Hebei Province	Xingtai City	http://www.yihaodian.com/2/?tracker_u=10861423206&type=1	2013-07-21 08:00:13	-
218.11.179.22	China	Hebei Province	Xingtai City	http://www.yihaodian.com/2/?tracker_u=10861423206&type=1	2013-07-21 08:00:20	-
123.123.202.45	China	Beijing	null	http://search.1mall.com/s2/c0-0/k798%25203d%25E7%2594%25BB%25E5%25B1%2595%2520%25E5%259B%25A2%25E8%25B4%25AD/2/	2013-07-21 11:55:28	-
123.123.202.45	China	Beijing	null	http://t.1mall.com/100?grouponAreaId=3&uid=1ahrua02b8mvk0952dle&tracker_u=10691821467	2013-07-21 11:55:21	-

...........


Upgrading the jobs

Based on the ETL output, the v2 jobs only need to switch their input and parse each line with parseV2:

    public Map<String, String> parseV2(String log) {
        Map<String, String> info = new HashMap<>();

        if (StringUtils.isNotBlank(log)) {
            // ETL output is tab-delimited: ip, country, province, city, url, time, pageId
            String[] splits = log.split("\t");

            String ip = splits[0];
            String country = splits[1];
            String province = splits[2];
            if (province.equals("null")) {
                province = "other";
            }
            String city = splits[3];

            info.put("ip", ip);
            info.put("country", country);
            info.put("province", province);
            info.put("city", city);

            String url = splits[4];
            info.put("url", url);

            String time = splits[5];
            info.put("time", time);

            // Field 7 is the page ID written by the ETL step
            String pageId = splits[6];
            info.put("pageId", pageId);
        }

        return info;
    }
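
With parseV2 in place, a v2 job reads input/etl and takes its fields straight from the parsed map. A sketch of what the province mapper shrinks to (hypothetical ProvinceStatAppV2; driver wiring identical to v1 apart from the input path):

    static class MyMapper extends Mapper<LongWritable, Text, Text, LongWritable> {

        private final LongWritable ONE = new LongWritable(1);
        private LogParser logParser;

        @Override
        protected void setup(Context context) {
            logParser = new LogParser();
        }

        @Override
        protected void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
            // The ETL step already resolved the region, so no IP lookup is needed here
            Map<String, String> info = logParser.parseV2(value.toString());
            context.write(new Text(info.get("province")), ONE);
        }
    }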

Tags: Java Big Data Hadoop hdfs mapreduce
