Data prediction -- the thinking and practice of predicting lost user revenue

Background
This was the first time I received a requirement related to data prediction. It can be said to touch on "data mining", rather than routine data retrieval, querying, and exporting, which made it all the more motivating and interesting to attempt. I have never systematically studied data mining or data prediction; what I have is the hands-on experience of a front-line marketer and daily thinking about relationships in the data. This prediction was therefore built entirely on personal, intuitive analysis, and I am recording the analysis ideas and implementation process here.

Tools: Hadoop, Hive, shell, ETL tool
Workflow: process the data with HiveSQL, wrap the statements in a shell script, and schedule the script through the ETL tool so that it runs periodically and produces a prediction every day.

Objective: use historical data to predict the current month's lost user revenue, income_loss

General idea: split the indicator. Whatever can be calculated directly is calculated directly; whatever cannot is inferred from historical data. The current month's total lost revenue is deduced from the time progress within the month, the historical proportion of lost revenue accrued by that point, and the lost revenue accrued so far this month.

Step 1: indicator analysis
Analysis of earlier data shows that the definition of lost users covers three different types:

  1. Users whose accounts were cancelled in the current month: cancel_cust
  2. Users whose arrears time has reached a specified threshold (overdue_time): overdue_cust
  3. Users who applied for account cancellation in the current month: apply_cancel_cust

The current month's lost revenue equals the sum of the lost revenue of the three user types:

income_loss = loss_cancel_cust + loss_overdue_cust + loss_apply_cancel_cust

Of these, overdue users (overdue_cust) can be screened directly from the database and their loss can be calculated directly. Cancelled users (cancel_cust) and users applying for cancellation (apply_cancel_cust) appear at any point during the month, so they cannot be obtained directly.

Step 2: classification processing

1. Handling overdue users, who can be screened directly (overdue_cust)

  1. Screen from the database the users whose arrears time this month exceeds overdue_time;
  2. For each overdue user, take the latest consumption record from before the arrears began as that user's predicted lost revenue, loss_predict;
  3. Sum the predicted lost revenue of all overdue users to get loss_overdue_cust.

The code demo is as follows:

--Screen users whose arrears time exceeds overdue_time (overdue_cust)
CREATE VIEW view_overdue_cust AS
SELECT user_id
FROM database_name
WHERE condition1 = 'Arrears' AND condition2 > overdue_time;


--Obtain each user's latest consumption record before the arrears began (loss_predict)
CREATE VIEW view_consume_latest AS
SELECT
   user_id,
   consumption  --Consumption amount
FROM database_name
WHERE time1 = current_time1 - overdue_time;  --Latest time before the arrears began


--Sum the pre-arrears consumption of all overdue users as the predicted lost revenue
SELECT sum(b.consumption) AS loss_overdue_cust  --Predicted value
FROM
   view_overdue_cust a
   INNER JOIN
   view_consume_latest b
   ON a.user_id = b.user_id;

2. Handling users who cannot be screened directly: cancelled users (cancel_cust) and users applying for cancellation (apply_cancel_cust)
Precondition (a personal hypothesis, verified against historical data): at the same point in the month, the proportion of the full month's lost revenue that has accrued is roughly the same every month. For example, 80% of a month's lost revenue may be generated between the 1st and the 20th. Since individual months fluctuate, the progress proportions of several historical months are averaged to smooth out the fluctuations. (Figure omitted: the red curve is the regression curve of the average progress values.)
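
To make the ratio method concrete, here is a worked example with made-up numbers: suppose the historical average shows that 80% of a month's cancelled-user loss has accrued by the 15th (avg_rate = 0.8), and this month's cancelled-user loss through the 15th is 800. The full-month forecast is then 800 / 0.8 = 1000.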

1. Forecasting the lost revenue of cancelled users, loss_cancel_cust
(1) Obtain from the database the daily lost revenue of cancelled users in each historical month, hist_loss_cancel_cust;
(2) For each historical month, calculate the proportion of the full month's cancelled-user lost revenue that had accrued by the same point in the month. For example, if today is the 15th, calculate the share of each historical month's cancelled-user loss that was generated from the 1st to the 15th, then average these proportions across the months to get avg_rate_cancel_cust;
(3) Obtain from the database the cancelled-user lost revenue generated so far this month, now_loss_cancel_cust;
(4) Calculate the predicted value: loss_cancel_cust = now_loss_cancel_cust / avg_rate_cancel_cust.

The code demo is as follows:

--Get the daily lost revenue of cancelled users in each historical month (hist_loss_cancel_cust)
CREATE VIEW view_hist_cancel_cust AS
SELECT
   'cancel_cust' AS cust_type,
   month1,
   sum(case when a.day_time < 'Specify Date' then a.loss
       else 0
       end) AS hist_same_day_loss,  --Cancelled-user loss accrued by the same point in the historical month
   sum(a.loss) AS hist_all_month_loss
FROM database_name a
WHERE
   month1 in (hist_month1, hist_month2, hist_month3, ...)  --Choose the historical months yourself
   AND condition1 = 'Cancelled user'                       --Restrict to cancelled users
GROUP BY month1;

--Obtain the cancelled-user lost revenue generated so far this month (now_loss_cancel_cust)
CREATE VIEW view_now_cancel_cust AS
SELECT
   'cancel_cust' AS cust_type,
   sum(a.loss) AS now_loss  --Cancelled-user lost revenue accrued in the current month
FROM database_name a
WHERE a.day_time < 'same day'
   AND condition1 = 'Cancelled user';  --Restrict to cancelled users, as in the historical view

--Calculate the predicted cancelled-user lost revenue, loss_cancel_cust
SELECT
   b.now_loss / a.avg_rate AS loss_cancel_cust
FROM
    (SELECT
        cust_type,
        avg(hist_same_day_loss / hist_all_month_loss) AS avg_rate  --Average proportion at the same point across the historical months
     FROM view_hist_cancel_cust
     GROUP BY cust_type) a
     INNER JOIN
     (SELECT
          cust_type,
          now_loss
      FROM view_now_cancel_cust) b
      ON a.cust_type = b.cust_type;

2. Forecasting the lost revenue of users applying for cancellation, loss_apply_cancel_cust
(1) Obtain from the database the daily lost revenue of users applying for cancellation in each historical month, hist_loss_apply_cancel_cust;
(2) For each historical month, calculate the proportion of the full month's apply-for-cancellation lost revenue that had accrued by the same point in the month. For example, if today is the 15th, calculate the share of each historical month's apply-for-cancellation loss that was generated from the 1st to the 15th, then average these proportions across the months to get avg_rate_apply_cancel_cust;
(3) Obtain from the database the apply-for-cancellation lost revenue generated so far this month, now_loss_apply_cancel_cust;
(4) Calculate the predicted value:
loss_apply_cancel_cust = now_loss_apply_cancel_cust / avg_rate_apply_cancel_cust.

The code demo is as follows:

--Get the daily lost revenue of users applying for cancellation in each historical month (hist_loss_apply_cancel_cust)
CREATE VIEW view_hist_apply_cancel_cust AS
SELECT
   'apply_cancel_cust' AS cust_type,
   month1,
   sum(case when a.day_time < 'Specify Date' then a.loss
       else 0
       end) AS hist_same_day_loss,  --Apply-for-cancellation loss accrued by the same point in the historical month
   sum(a.loss) AS hist_all_month_loss
FROM database_name a
WHERE
   month1 in (hist_month1, hist_month2, hist_month3, ...)  --Choose the historical months yourself
   AND condition1 = 'Apply for cancellation'               --Restrict to users applying for cancellation
GROUP BY month1;

--Obtain the apply-for-cancellation lost revenue generated so far this month (now_loss_apply_cancel_cust)
CREATE VIEW view_now_apply_cancel_cust AS
SELECT
   'apply_cancel_cust' AS cust_type,
   sum(a.loss) AS now_loss  --Apply-for-cancellation lost revenue accrued in the current month
FROM database_name a
WHERE a.day_time < 'same day'
   AND condition1 = 'Apply for cancellation';  --Restrict to users applying for cancellation, as in the historical view

--Calculate the predicted apply-for-cancellation lost revenue, loss_apply_cancel_cust
SELECT
   b.now_loss / a.avg_rate AS loss_apply_cancel_cust
FROM
    (SELECT
        cust_type,
        avg(hist_same_day_loss / hist_all_month_loss) AS avg_rate  --Average proportion at the same point across the historical months
     FROM view_hist_apply_cancel_cust
     GROUP BY cust_type) a
     INNER JOIN
     (SELECT
          cust_type,
          now_loss
      FROM view_now_apply_cancel_cust) b
      ON a.cust_type = b.cust_type;

Step 3: calculate the sum
Add the three predicted values:
income_loss = loss_cancel_cust + loss_overdue_cust + loss_apply_cancel_cust
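
A minimal sketch of this final step, assuming the three prediction queries above have each been saved as a single-row view (view_loss_overdue, view_loss_cancel, and view_loss_apply_cancel are hypothetical names):

--Add the three predicted values to get the current month's total lost revenue
SELECT
   a.loss_overdue_cust + b.loss_cancel_cust + c.loss_apply_cancel_cust AS income_loss
FROM
   view_loss_overdue a                    --Hypothetical view: one row, column loss_overdue_cust
   CROSS JOIN view_loss_cancel b          --Hypothetical view: one row, column loss_cancel_cust
   CROSS JOIN view_loss_apply_cancel c;   --Hypothetical view: one row, column loss_apply_cancel_cust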

Step 4: wrap up
Tidy up the code, write it into a shell script, build the ETL scheduling job, and run it periodically; a sketch of such a wrapper script follows.
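
A minimal wrapper, assuming the finished HiveSQL has been saved as predict_income_loss.sql and reads the run date from a hivevar (the file name and variable name are hypothetical); the ETL tool then calls this script once a day:

#!/bin/bash
#Run the daily income-loss prediction; the ETL scheduler may pass the run date as $1
run_date=${1:-$(date +%Y%m%d)}

#Substitute the run date into the HiveSQL and execute it, keeping a log for debugging
hive -hivevar run_date="${run_date}" -f predict_income_loss.sql \
    >> predict_income_loss_"${run_date}".log 2>&1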

Actual results: (screenshot not included)


Notes

  1. Choose the historical months carefully: pick months at a similar stage.
    (1) If nothing special happens in the current month, avoid special periods such as Double 11 or Spring Festival when selecting historical months.
    (2) If the current month contains a special period, select the corresponding historical months; for a Double 11 month, use historical Double 11 data.

Possible improvements:
A small amount of data is missing from the historical months, which was not handled this time. In practice it is best to fill in the missing values to improve prediction accuracy; one possible approach is sketched below.
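
A simple imputation, sketched under the assumptions that the daily history sits in a hypothetical table hist_daily_loss(month1, day_time, loss) and that missing days appear as NULL loss: replace a missing day's loss with that month's average daily loss.

--Fill missing daily loss with the month's average daily loss (table and columns are hypothetical)
SELECT
   month1,
   day_time,
   coalesce(loss, avg(loss) OVER (PARTITION BY month1)) AS loss_filled
FROM hist_daily_loss;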

Difficulties:

  1. Having only recently transferred to this job, I did not know the existing database tables well. I had to locate the required tables among a large number of candidates and study their structures and fields; even with colleagues to consult, this was extremely time-consuming;
  2. HiveSQL was used to process the data this time. Its syntax and related functions were unfamiliar to me, and debugging and testing consumed a lot of time;
  3. I had not touched shell programming before, so after the HiveSQL statements were written into the script, debugging it also cost a lot of time;

Gains and reflections:
Implementing this requirement took about three weeks of intermittent work, mostly in spare time. It greatly sharpened my ability to write SQL and sparked many ideas that are hard to put into words; I am quite happy with it. After the requirement was completed, I read Introduction to Data Mining. In hindsight, my approach treats lost revenue as a time series and predicts the current value from the progress of historical data. Reading more books is good for physical and mental health.
