HiveSql interview question: continuous check-in and receive gold coins [Baidu - difficult questions - general solution]

catalogue

0 problem description

1 data preparation

2 problem analysis

3 Summary

0 problem description

User behavior log table tb_user_log

iduidartical_idin_timeout_timesign_in
110102021-07-07 10:00:002021-07-07 10:00:091
210102021-07-08 10:00:002021-07-08 10:00:091
310102021-07-09 10:00:002021-07-09 10:00:421
41010

2021-07-10 10:00:00

2021-07-10 10:00:091
510102021-07-11 23:59:552021-07-11 23:59:591
610102021-07-12 10:00:282021-07-12 10:00:501
710102021-07-13 10:00:28

2021-07-13 10:00:50

1
810202021-10-01 10:00:282021-10-01 10:00:501
910202021-10-02 10:00:012021-10-02 10:01:501
1010202021-10-03 10:00:552021-10-03 11:00:591
1110202021-10-04 10:00:452021-10-04 11:00:550
1210202021-10-05 10:00:532021-10-05 11:00:591
1310202021-10-06 10:00:452021-10-06 11:00:551

(uid - user ID, artistic_id - article ID, in_time - entry time, out_time - departure time, sign_in - check in or not)

Scenario Logic Description:

  • artical_id - article ID represents the ID of the article the user browses. In special cases, it is artistic_ ID - if the article ID is 0, it means that the user is on the non article content page (such as the list page, activity page, etc. in the App). Note: only artistic_ Sign when ID is 0_ In value is valid.
  • From 0:00 on July 7, 2021, users can receive 1 gold coin every day, and can start accumulating the number of check-in days. On the third and seventh days of continuous check-in, they can receive 2 and 6 additional gold coins respectively.
  • Every 7 consecutive days of check-in, the number of check-in days will be accumulated again (i.e. reset the number of check-in days: the eighth consecutive day of check-in will be recorded as the first day of a new round of check-in, and 1 gold coin will be received)

Question: calculate the number of gold coins each user has received each month since July 2021 (the activity ends at the end of October, and the check-in starting on November 1 will no longer receive gold coins). The results are sorted in ascending order by month and ID.

Note: if the check-in record is in_time - enter time and out_time - it's time to leave. It's only recorded as in_ The date corresponding to time has been signed in.

Output example:

The output results of the sample data are as follows:

uidmonthcoin
10120210715
1022021107

Explanation:

101 signed in for 7 consecutive days during the activity period, so he got 1 * 7 + 2 + 6 = 15 gold coins;

102 get 5 gold coins by signing in for 3 consecutive days from 10.01 to 10.03

The signature was broken on 10.04. Two gold coins were obtained from 10.05 to 10.06 for two consecutive days, and a total of seven gold coins were obtained.

1 data preparation

DROP TABLE IF EXISTS tb_user_log;
CREATE TABLE tb_user_log (
    uid INT COMMENT 'user ID',
    artical_id INT COMMENT 'video ID',
    in_time string COMMENT 'Entry time',
    out_time string COMMENT 'Departure time',
    sign_in int COMMENT 'Check in'
) ;

INSERT INTO tb_user_log(uid, artical_id, in_time, out_time, sign_in) VALUES
  (101, 0, '2021-07-07 10:00:00', '2021-07-07 10:00:09', 1),
  (101, 0, '2021-07-08 10:00:00', '2021-07-08 10:00:09', 1),
  (101, 0, '2021-07-09 10:00:00', '2021-07-09 10:00:42', 1),
  (101, 0, '2021-07-10 10:00:00', '2021-07-10 10:00:09', 1),
  (101, 0, '2021-07-11 23:59:55', '2021-07-11 23:59:59', 1),
  (101, 0, '2021-07-12 10:00:28', '2021-07-12 10:00:50', 1),
  (101, 0, '2021-07-13 10:00:28', '2021-07-13 10:00:50', 1),
  (101, 0, '2021-07-14 11:00:28', '2021-07-14 11:00:50', 1),
  (101, 0, '2021-07-15 11:59:28', '2021-07-16 00:01:20', 1),
  (102, 0, '2021-10-01 10:00:28', '2021-10-01 10:00:50', 1),
  (102, 0, '2021-10-02 10:00:01', '2021-10-02 10:01:50', 1),
  (102, 0, '2021-10-03 11:00:55', '2021-10-03 11:00:59', 1),
  (102, 0, '2021-10-04 11:00:45', '2021-10-04 11:00:55', 0),
  (102, 0, '2021-10-05 11:00:53', '2021-10-05 11:00:59', 1),
  (102, 0, '2021-10-06 11:00:45', '2021-10-06 11:00:55', 1);

2 problem analysis

This question is a difficult one on niuke.com. What is required locally is the number of gold coins each user receives each month:

Rules for receiving gold coins:

  • (1) As long as the user signs in, he can get a gold coin
  • (2) Get 2 and 6 additional gold coins respectively on the third and seventh day of continuous check-in
  • (3) Reset after 7 days of continuous check-in, and conduct a new round of currency collection according to rules (1) and (2)

Analysis: the dimensions of this question are user and month, that is, the final result is grouped according to these two

Step 1: according to the rules of receiving gold coins, we can easily think of grouping according to the continuity of check-in, and dividing the continuous marks into one group

Step 2: row in each continuous group_ The number () tag (rn) is used to get the special location for the third and seventh consecutive days.

Step 3: divide and group according to whether there are more than 7 consecutive days in step 2

The overall calculation rules are as follows:

Figure 1

There are three steps. First, mark the place for 7 days (the purpose of marking is to assist in calculation)

Method of marking: RN% 7 = = 0 If it is 0, it is recorded as 1; if it is not 0, it is recorded as 0

Method of grouping windows by sliding. There are two kinds of such methods: one is to make the window larger and larger, and the other is to make the window smaller and smaller. See the following figure for details

The corresponding goals of the two algorithms are different: the first method of increasing the window is to hope that 1 can be divided into the following groups when it is grouped (the marker bit is grouped downward), and the second form of decreasing the window is to hope that 1 can be divided into the previous group of data (or understood as entering the previous data bucket) (the marker bit is upward). The specific overall grouping is shown in the figure below

Figure 1

Step 1: first judge the continuity and make the first grouping label

Note that the sign in value sign should not be used here_ The real meaning of this question is: continuous time + continuous check-in, if sign_ If in is filtered in the where condition, it cannot constitute a grouping condition

select  uid
      ,in_time
      ,sign_in
      ,month
      -- Continuity of time and check-in
     , sum(if(diff < 1,1,0)) over(partition by uid,month,series_sign_in_flg order by in_time) as    series_flg

from(
select uid
      ,in_time
      ,sign_in
      ,month
    --Calculate the difference between the current value and the previous value. If it is 1, it has the condition of continuity(Continuity of time)
      ,datediff(to_date(in_time),to_date(lag_in_time))  as diff
      --Construct the continuity condition of check-in
      ,sum(if(sign_in!=lag_sign_in,1,0)) over(partition by uid,month order by in_time) as series_sign_in_flg
from
--Find the value of the previous line
  (select uid
        ,in_time
        ,sign_in
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
        ,lag(sign_in,1,sign_in) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_sign_in
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10'
  ) t
  ) t
where sign_in=1
+------+----------------------+----------+----------+-------------+
| uid  |       in_time        | sign_in  |  month   | series_flg  |
+------+----------------------+----------+----------+-------------+
| 101  | 2021-07-07 10:00:00  | 1        | 2021-07  | 1           |
| 101  | 2021-07-08 10:00:00  | 1        | 2021-07  | 1           |
| 101  | 2021-07-09 10:00:00  | 1        | 2021-07  | 1           |
| 101  | 2021-07-10 10:00:00  | 1        | 2021-07  | 1           |
| 101  | 2021-07-11 23:59:55  | 1        | 2021-07  | 1           |
| 101  | 2021-07-12 10:00:28  | 1        | 2021-07  | 1           |
| 101  | 2021-07-13 10:00:28  | 1        | 2021-07  | 1           |
| 101  | 2021-07-14 11:00:28  | 1        | 2021-07  | 1           |
| 101  | 2021-07-15 11:59:28  | 1        | 2021-07  | 1           |
| 102  | 2021-10-01 10:00:28  | 1        | 2021-10  | 1           |
| 102  | 2021-10-02 10:00:01  | 1        | 2021-10  | 1           |
| 102  | 2021-10-03 11:00:55  | 1        | 2021-10  | 1           |
| 102  | 2021-10-05 11:00:53  | 1        | 2021-10  | 0           |
| 102  | 2021-10-06 11:00:45  | 1        | 2021-10  | 0           |
+------+----------------------+----------+----------+-------------+

Wrong group label generation: sign_in filter in where condition

select *
      -- Check in continuity label
     , sum(if(datediff(to_date(in_time),to_date(lag_in_time)) < 1,1,0)) over(partition by uid,month order by in_time) as    series_flg
from
  (select uid
        ,in_time
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10' and sign_in=1
  ) t

The results are as follows:

Step 2: generate the second label, grouping mark for 7 consecutive days

select *
      --Generate group tags for 7 consecutive days
      ,sum(case when rn % 7=0 then 1 else 0 end) over(partition by uid,month,series_flg order by in_time rows between current row and unbounded following) as series_7_flg
from 
(select *
--Generate sequence markers in each group
     , row_number() over(partition by uid,month,series_flg) as rn
from  
(select  uid
      ,in_time
      ,sign_in
      ,month
      -- Continuity of time and check-in
     , sum(if(diff < 1,1,0)) over(partition by uid,month,series_sign_in_flg order by in_time) as    series_flg

from(
select uid
      ,in_time
      ,sign_in
      ,month
    --Calculate the difference between the current value and the previous value. If it is 1, it has the condition of continuity(Continuity of time)
      ,datediff(to_date(in_time),to_date(lag_in_time))  as diff
      --Construct the continuity condition of check-in
      ,sum(if(sign_in!=lag_sign_in,1,0)) over(partition by uid,month order by in_time) as series_sign_in_flg
from
--Find the value of the previous line
  (select uid
        ,in_time
        ,sign_in
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
        ,lag(sign_in,1,sign_in) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_sign_in
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10'
  ) t
  ) t
where sign_in=1
) t
) t

The corresponding results are as follows:

Step 3: calculate the number of gold coins in each condition group according to the rules of receiving gold coins

select uid
      ,month
      ,series_flg
      ,series_7_flg
      --Calculate the number of gold coins according to the rules of receiving gold coins
      ,case when cnt>=1 and cnt<3 then cnt
            when cnt>=3 and cnt<7 then cnt + 2
       else cnt + 2 + 6 end coin        
from
(select uid
      ,month
      ,series_flg
      ,series_7_flg
      ,count(1) as cnt
from
(select *
      --Generate group tags for 7 consecutive days
      ,sum(case when rn % 7=0 then 1 else 0 end) over(partition by uid,month,series_flg order by in_time rows between current row and unbounded following) as series_7_flg
from 
(select *
--Generate sequence markers in each group
     , row_number() over(partition by uid,month,series_flg) as rn
from  
(select  uid
      ,in_time
      ,sign_in
      ,month
      -- Continuity of time and check-in
     , sum(if(diff < 1,1,0)) over(partition by uid,month,series_sign_in_flg order by in_time) as    series_flg

from(
select uid
      ,in_time
      ,sign_in
      ,month
    --Calculate the difference between the current value and the previous value. If it is 1, it has the condition of continuity(Continuity of time)
      ,datediff(to_date(in_time),to_date(lag_in_time))  as diff
      --Construct the continuity condition of check-in
      ,sum(if(sign_in!=lag_sign_in,1,0)) over(partition by uid,month order by in_time) as series_sign_in_flg
from
--Find the value of the previous line
  (select uid
        ,in_time
        ,sign_in
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
        ,lag(sign_in,1,sign_in) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_sign_in
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10'
  ) t
  ) t
where sign_in=1
) t
) t
) t
group by uid,month,series_flg,series_7_flg
) t
+------+----------+-------------+---------------+-------+
| uid  |  month   | series_flg  | series_7_flg  | coin  |
+------+----------+-------------+---------------+-------+
| 102  | 2021-10  | 1           | 0             | 5     |
| 102  | 2021-10  | 0           | 0             | 2     |
| 101  | 2021-07  | 1           | 1             | 15    |
| 101  | 2021-07  | 1           | 0             | 2     |
+------+----------+-------------+---------------+-------+

Group 1: series_flg indicates continuous check-in group. Group 2: series_7_flg means that those who have received gold coins for more than 7 consecutive days will be regrouped with 7 days as the dividing point.

The algorithm for receiving gold coins here is: 1-2 days is equal to consecutive days, 3-6 days is consecutive days plus 2, and when it is 7 days, it is consecutive days plus 8

Step 4: get the number of gold coins per month and sort them in ascending order by user and month

The final SQL is as follows

select uid,month,sum(coin) as coin
from(
select uid
      ,month
      ,series_flg
      ,series_7_flg
      --Calculate the number of gold coins according to the rules of receiving gold coins
      ,case when cnt>=1 and cnt<3 then cnt
            when cnt>=3 and cnt<7 then cnt + 2
       else cnt + 2 + 6 end coin        
from
(select uid
      ,month
      ,series_flg
      ,series_7_flg
      ,count(1) as cnt
from
(select *
      --Generate group tags for 7 consecutive days
      ,sum(case when rn % 7=0 then 1 else 0 end) over(partition by uid,month,series_flg order by in_time rows between current row and unbounded following) as series_7_flg
from 
(select *
--Generate sequence markers in each group
     , row_number() over(partition by uid,month,series_flg) as rn
from  
(select  uid
      ,in_time
      ,sign_in
      ,month
      -- Continuity of time and check-in
     , sum(if(diff < 1,1,0)) over(partition by uid,month,series_sign_in_flg order by in_time) as    series_flg

from(
select uid
      ,in_time
      ,sign_in
      ,month
    --Calculate the difference between the current value and the previous value. If it is 1, it has the condition of continuity(Continuity of time)
      ,datediff(to_date(in_time),to_date(lag_in_time))  as diff
      --Construct the continuity condition of check-in
      ,sum(if(sign_in!=lag_sign_in,1,0)) over(partition by uid,month order by in_time) as series_sign_in_flg
from
--Find the value of the previous line
  (select uid
        ,in_time
        ,sign_in
        ,substr(in_time,1,7) as month
        ,lag(in_time,1,in_time) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_in_time
        ,lag(sign_in,1,sign_in) over(partition by uid,substr(in_time,1,7) order by in_time) as lag_sign_in
  from tb_user_log
  where artical_id=0 and substr(in_time,1,7)>='2021-07' 
        and substr(in_time,1,7) <='2021-10'
  ) t
  ) t
where sign_in=1
) t
) t
) t
group by uid,month,series_flg,series_7_flg
) t
)t
group by uid,month 
order by uid,month

The final results are as follows:

+------+----------+-------+
| uid  |  month   | coin  |
+------+----------+-------+
| 101  | 2021-07  | 17    |
| 102  | 2021-10  | 7     |
+------+----------+-------+

Note that the discrepancy between the answer here and the one given is due to the addition of two pieces of data to the data for testing.

3 Summary

This paper presents a general solution for continuous check-in to receive gold coins, which generates the tag value through the window function for auxiliary calculation. Through this question, you can gain the following knowledge points:

  • (1) Judgment method of continuous check-in (pay attention to continuous time + continuous check-in)
  • (2) The method of regrouping breakpoints (according to certain conditions) (for example, the breakpoint is a breakpoint after 7 consecutive days)
         1) The breakpoint is the same as above: order by in_time rows between current row and unbounded   

                     following(Upper boundary sliding, lower boundary fixed)

         2)Breakpoints are the same as below: order by in_time(Lower boundary sliding, upper boundary fixed)
  • (3) The calculation method of the period, such as the problem of 7 days a cycle (using the method of remainder: rn%7==0)

Welcome to follow sister pomegranate's official account "my SQL", and follow me not to get lost

 

 

 

Tags: Java Big Data SQL

Posted by erika_web on Tue, 03 May 2022 21:59:16 +0300