BI data analysis methods that you need to know about data warehouse development

Data warehouse development often needs to deal with data tables, so after the data warehouse table development is completed, everything will be fine? Obviously not, you also need to think about how to analyze the data and how to present the data, because this is an important aspect of the data value. Through data analysis and visual presentation, the secrets behind the data can be more intuitively provided, thereby assisting business decision-making and realizing real data-enabled business. From this article you can learn:

  • Pareto Analysis Method and Data Visualization
  • RFM Analysis and Data Visualization
  • Boston Matrix and Data Visualization

Pareto Analysis and Data Visualization

basic concept

Pareto analysis, also known as ABC analysis, is the 80/20 rule we usually refer to. Regarding Pareto analysis, there are different applications in different industries.

  • take a chestnut

In the inventory management of enterprises, it can be found that a few varieties account for a large proportion in the total demand (or total supply, total inventory, and total reserves), but the proportion of the corresponding value rare. Therefore, the Pareto analysis method can be used to classify the various items needed by the enterprise according to the size of the demand, the importance of the items, the shortage of resources and the difficulty of purchasing, the level of the unit price, the amount of reserve funds occupied and other factors. Divided into several categories, the implementation of classification management.

In the analysis of commodity sales, the sales of some commodities account for a large part of the total sales, and the sales of some commodities only account for a small proportion, so that they can be divided into A, B, and C categories. , invest in categories with a larger proportion of sales to obtain more sales.

In the quality analysis, the number of products that cause unqualified product quality for a certain reason is analyzed. Using the Pareto analysis method, it is very intuitive to see which reasons cause the unqualified product quality and which reasons are more serious. This allows you to focus on solving important problems, clear goals, and make it easier to operate.

  • another way of expressing

According to the main characteristics of things in terms of technology or economy, they are classified to distinguish between the key and the non-key. Each category is treated differently and managed, and the analyzed objects are divided into three categories: A, B, and C. There is no clear division of numerical boundaries for the three categories of items.

Classification and importance describe
Category A (very important) Small proportion of quantity, large proportion of value
Class B (more important) Not as important as class A, between A and C
Class C (generally important) A large proportion of quantity but a small proportion of value

The core idea of ​​classification: a few contribute most of the value. Take product categories and sales as an example: Category A accounts for 10% of the total, but contributes 80% of sales.

Data analysis case

  • renderings

  • Implementation steps

Suppose you have the following dataset format:

brand sales
NEW BALANCE (New Balance) 8750
ZIPPO (treasure) 9760
OCTMAMI (October Mommy) 5800

The data needs to be processed into the following format:

brand sales Total sales Cumulative sales Cumulative sales percentage
=∑ All brand sales = current brand sales + previous brand sales Cumulative Sales/Total Sales

The specific SQL implementation is as follows:

SELECT
     brand, -- brand
     total_money, -- sales
     sum(total_money) over() AS sum_total_money,-- Total sales
     sum(total_money) over(ORDER BY total_money DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS acc_sum_total_money -- Cumulative sales
FROM sales_money

The specific SQL implementation is given above. In fact, BI tools have built-in many processing functions and drag-and-drop data processing. You can process a detailed data into the above form without writing SQL.

  • Conclusion analysis

As can be seen from the above Pareto chart: Class A (green part) accounts for about 80% of total sales, Class B (yellow part) accounts for 10% of total sales, and Class C (red part) accounts for about 80% of total sales. 10% of total sales. Next, you can conduct long-tail analysis, formulate marketing strategies, and more.

RFM Analysis and Data Visualization

basic concept

RFM model is a commonly used model in customer relationship management (CRM). RFM model is an important tool and means to measure customer value and customer profitability. The model describes the value of a customer through three indicators: the customer's recent purchase behavior, the overall frequency of purchases, and the amount of money spent.

The RFM model shows the entire outline of a customer more dynamically, which provides a basis for personalized communication and service. At the same time, if the customer has been dealing with the customer for a long time, it can also accurately judge the long-term value of the customer (even is the lifetime value), which supports more marketing decisions by improving the condition of the three metrics.

In the RFM model, three key factors are included, namely:

  • R(Recency): Indicates how far the customer last purchased, that is, the most recent consumption, the more recent the consumption time, the greater the value of the customer
  • F(Frequency): Indicates the number of purchases made by customers in a recent period of time, that is, the consumption frequency. Users who buy frequently are regular customers, and their value is definitely greater than that of customers who come here once in a while.
  • M (Monetary): Indicates the amount purchased by the customer in the recent period of time, that is, the customer's spending power. Usually, the average single consumption amount of the customer is used as a measure. The more consumption, the greater the value of the user.

The last consumption, consumption frequency, and consumption amount are the most important and easiest ways to measure consumer value, which fully demonstrates the guiding significance of these three indicators for marketing activities. Among them, the most recent consumption is the strongest predictor.

Through the above analysis, the customer groups can be classified:

Customer Type and Level R F M customer characteristics
Important value customer (A-level/111) high (1) high (1) high (1) Recent consumption time is near, consumption frequency and consumption amount are high
Important Development Clients (Class A/101) high (1) low (0) high (1) The recent consumption time is relatively short, the consumption amount is high, but the frequency is not high, the loyalty is not high, and the potential users must focus on development
Important Customer Retention (Class B/011) low (0) high (1) high (1) Recently, the consumption time is far away, and the consumption amount and frequency are very high.
Important Customer Retention (Class B/001) low (0) low (0) high (1) Recently, users who spend a long time and have infrequent consumption but have a high consumption amount may be users who are about to lose or have been lost, and should be based on retention measures.
General Value Customers (Class B/110) high (1) high (1) low (0) The recent consumption time is short and the frequency is high, but the consumption amount is low, and the unit price of its customers needs to be increased.
General Development Clients (Grade B/100) high (1) low (0) low (0) Recently, the consumption time is relatively recent, and the consumption amount and frequency are not high.
General Keeping Customers (Class C/010) low (0) high (1) low (0) Recently, the consumption time is long and the frequency of consumption is high, but the amount is not high.
General Customer Retention (Class C/000) low (0) low (0) low (0) very low

Data analysis case

  • renderings

  • Implementation steps

Suppose you have the following sample data:

client's name date Amount of consumption consumption quantity
Shanghai Co., Ltd 2020-05-20 76802 2630

The dataset needs to be processed into the following format:

Specific SQL implementation

SELECT customer_name,-- client's name
customer_avg_money,-- Average spending by current customers
customer_frequency, -- The current customer's consumption frequency
total_frequency,-- Total consumption frequency of all customers
total_avg_frequency, -- Average consumption frequency of all customers
customer_recency_diff, -- The difference between the current customer's last purchase date and the current date
total_recency, -- The average value of the difference between the last purchase date of all customers and the current date
monetary,-- Consumption Amount Vectorization
frequency, -- Consumption frequency vectorization
recency, -- Vectorization of recent consumption
rfm, -- rfm
CASE
    WHEN rfm = "111" THEN "Important value customers"
    WHEN rfm = "101" THEN "Important development customers"
    WHEN rfm = "011" THEN "Important to keep customers"
    WHEN rfm = "001" THEN "Important customer retention"
    WHEN rfm = "110" THEN "General Value Customers"
    WHEN rfm = "100" THEN "General Development Clients"
    WHEN rfm = "010" THEN "generally keep customers"
    WHEN rfm = "000" THEN "General customer retention"
           END AS rfm_text
FROM
  (SELECT customer_name,-- client's name
customer_avg_money,-- Average spending by current customers
customer_frequency, -- The current customer's consumption frequency
total_avg_money ,-- Average total spend by all customers
total_frequency,-- Total consumption frequency of all customers
total_frequency / count(*) over() AS total_avg_frequency, -- Average consumption frequency of all customers
customer_recency_diff, -- The difference between the current customer's last purchase date and the current date
avg(customer_recency_diff) over() AS total_recency, -- The average value of the difference between the last purchase date of all customers and the current date
if(customer_avg_money > total_avg_money,1,0) AS monetary, -- Consumption Amount Vectorization
if(customer_frequency > total_frequency / count(*) over(),1,0) AS frequency, -- Consumption frequency vectorization
if(customer_recency_diff > avg(customer_recency_diff) over(),0,1) AS recency, -- Vectorization of recent consumption
concat(if(customer_recency_diff > avg(customer_recency_diff) over(),0,1),if(customer_frequency > total_frequency / count(*) over(),1,0),if(customer_avg_money > total_avg_money,1,0)) AS rfm
   FROM
     (SELECT customer_name, -- client's name
max(customer_avg_money) AS customer_avg_money , -- Average spending by current customers
max(customer_frequency) AS customer_frequency, -- The current customer's consumption frequency
max(total_avg_money) AS total_avg_money ,-- Average total spend by all customers
max(total_frequency) AS total_frequency,-- Total consumption frequency of all customers
datediff(CURRENT_DATE,max(customer_recency)) AS customer_recency_diff -- The difference between the current customer's last purchase date and the current date
FROM
        (SELECT customer_name, -- client's name
avg(money) over(partition BY customer_name) AS customer_avg_money, -- Average spending by current customers
count(amount) over(partition BY customer_name) AS customer_frequency, -- The current customer's consumption frequency
avg(money) over() AS total_avg_money,-- Average total spend by all customers
count(amount) over() AS total_frequency, --Total consumption frequency of all customers
max(sale_date) over(partition BY customer_name) AS customer_recency -- The current customer's last purchase date

         FROM customer_sales) t1
      GROUP BY customer_name)t2) t3

Through the above analysis, the corresponding customers can be labeled with customer characteristics, so that different marketing strategies can be specified for certain types of customers.

Boston Matrix and Data Visualization

basic concept

Boston Matrix (BCG Matrix) is also known as market growth rate-relative market share matrix, Boston Consulting Group method, four-quadrant analysis method, product series structure management method, etc.

The BCG matrix distinguishes 4 business combinations:

  • 1. Star business (Stars, refers to high growth, high market share)
  • 2. Question Marks (High Growth, Low Market Share)
  • 3. Cash cows business (Cash cows, refers to low growth, high market share)
  • 4. Thin dog business (Dogs, refers to low growth, low market share)

The Boston Matrix analyzes and determines the product structure of an enterprise through sales growth rate (an indicator reflecting market gravity) and market share (an indicator reflecting enterprise strength).

case

  • renderings

  • Implementation steps

This case is based on the analysis of customers, classifying customers and finding star customers, cash cow customers, problem customers and thin dog customers.

Suppose the dataset has the following style:

customer type client's name Amount of consumption
Class A Shanghai** Company 20000 2020-05-30

First, you need to calculate the customer unit price: the average consumption amount of each customer, that is, the customer unit price = the total consumption amount of a customer) / the number of times a customer consumes

Next, you need to calculate the number of records: the number of consumption by each customer, that is, the total number of consumption by a customer

Next, you need to calculate the average consumption amount: the average consumption amount of all customers, that is, the total consumption amount of all customers / the number of consumption by all customers

Finally, calculate the average consumption times: the average consumption times of all customers, that is, the total consumption times of all customers/total number of customers

Specific SQL implementation:

SELECT 
    customer_name, -- client's name
    customer_avg_money, -- Customer price
    customer_frequency , -- The number of purchases by the current customer
    total_avg_money,-- Average spending by all customers
    total_frequency / count(*) over() AS total_avg_frequency -- average consumption

FROM
  (SELECT 
        customer_name, -- client's name
        max(customer_avg_money) AS customer_avg_money, -- Customer price
        max(customer_frequency) AS customer_frequency , -- The number of purchases by the current customer
        max(total_avg_money) AS total_avg_money,-- Average spending by all customers
        max(total_frequency) AS total_frequency --Total consumption frequency of all customers

   FROM
     (
        SELECT
             customer_name, -- client's name
             avg(money) over(partition BY customer_name) AS customer_avg_money, -- Customer price
             count(*) over(partition BY customer_name) AS customer_frequency, -- The number of purchases by the current customer
            avg(money) over() AS total_avg_money,-- Average spending by all customers
            count(*) over() AS total_frequency --Total consumption frequency of all customers

        FROM customer_sales ) t1
GROUP BY customer_name) t2

After the above analysis, we can roughly see the customer portrait:

  • A customer whose consumption frequency exceeds the average level and whose consumption intensity (unit price per customer) also exceeds the average level: it is judged as a star customer, and this type of customer needs to be focused on;
  • A customer's consumption frequency exceeds the average, but the consumption intensity does not reach the average level each time: It is judged as a cash cow customer. This type of customer usually consumes frequently and can bring a relatively stable cash flow to the company. Customers are the cornerstone of corporate profits;
  • A customer whose consumption frequency does not reach the average level, but each time the consumption intensity exceeds the average level: it is a problem customer. This type of customer is most likely to be converted into a star customer, but because of the customer’s potential problems, the frequency of consumption is not high. , such customers need to focus on follow-up and long-term communication;
  • Customers whose consumption frequency has not reached the average level and consumption intensity has not reached the average level: they are thin dog customers. These customers usually account for the majority of corporate customers, and only need general maintenance. energy.

Summarize

This article mainly introduces the common data analysis methods that should be understood in data warehouse development. There are three main types: Pareto analysis, RFM analysis and Boston matrix analysis. This paper introduces the basic concepts, operation steps and SQL implementation of the three analysis methods, and gives the corresponding visual analysis charts. Each case is a real application scenario of the enterprise. It is hoped that students who develop data warehouses can provide some perspectives on the analysis of observation data, so that in the actual development process, they can think more about the application value of data and how data can empower businesses, so as to further improve their comprehensive capabilities.

Official account "Big Data Technology and Data Warehouse", reply "Data" to receive the big data package

Tags: hive Data Warehouse

Posted by zero-one on Tue, 24 May 2022 23:31:42 +0300