Data warehouse development often needs to deal with data tables, so after the data warehouse table development is completed, everything will be fine? Obviously not, you also need to think about how to analyze the data and how to present the data, because this is an important aspect of the data value. Through data analysis and visual presentation, the secrets behind the data can be more intuitively provided, thereby assisting business decision-making and realizing real data-enabled business. From this article you can learn:
- Pareto Analysis Method and Data Visualization
- RFM Analysis and Data Visualization
- Boston Matrix and Data Visualization
Pareto Analysis and Data Visualization
basic concept
Pareto analysis, also known as ABC analysis, is the 80/20 rule we usually refer to. Regarding Pareto analysis, there are different applications in different industries.
- take a chestnut
In the inventory management of enterprises, it can be found that a few varieties account for a large proportion in the total demand (or total supply, total inventory, and total reserves), but the proportion of the corresponding value rare. Therefore, the Pareto analysis method can be used to classify the various items needed by the enterprise according to the size of the demand, the importance of the items, the shortage of resources and the difficulty of purchasing, the level of the unit price, the amount of reserve funds occupied and other factors. Divided into several categories, the implementation of classification management.
In the analysis of commodity sales, the sales of some commodities account for a large part of the total sales, and the sales of some commodities only account for a small proportion, so that they can be divided into A, B, and C categories. , invest in categories with a larger proportion of sales to obtain more sales.
In the quality analysis, the number of products that cause unqualified product quality for a certain reason is analyzed. Using the Pareto analysis method, it is very intuitive to see which reasons cause the unqualified product quality and which reasons are more serious. This allows you to focus on solving important problems, clear goals, and make it easier to operate.
- another way of expressing
According to the main characteristics of things in terms of technology or economy, they are classified to distinguish between the key and the non-key. Each category is treated differently and managed, and the analyzed objects are divided into three categories: A, B, and C. There is no clear division of numerical boundaries for the three categories of items.
Classification and importance | describe |
---|---|
Category A (very important) | Small proportion of quantity, large proportion of value |
Class B (more important) | Not as important as class A, between A and C |
Class C (generally important) | A large proportion of quantity but a small proportion of value |
The core idea of ​​classification: a few contribute most of the value. Take product categories and sales as an example: Category A accounts for 10% of the total, but contributes 80% of sales.
Data analysis case
- renderings
- Implementation steps
Suppose you have the following dataset format:
brand | sales |
---|---|
NEW BALANCE (New Balance) | 8750 |
ZIPPO (treasure) | 9760 |
OCTMAMI (October Mommy) | 5800 |
The data needs to be processed into the following format:
brand | sales | Total sales | Cumulative sales | Cumulative sales percentage |
---|---|---|---|---|
=∑ All brand sales | = current brand sales + previous brand sales | Cumulative Sales/Total Sales |
The specific SQL implementation is as follows:
SELECT brand, -- brand total_money, -- sales sum(total_money) over() AS sum_total_money,-- Total sales sum(total_money) over(ORDER BY total_money DESC ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS acc_sum_total_money -- Cumulative sales FROM sales_money
The specific SQL implementation is given above. In fact, BI tools have built-in many processing functions and drag-and-drop data processing. You can process a detailed data into the above form without writing SQL.
- Conclusion analysis
As can be seen from the above Pareto chart: Class A (green part) accounts for about 80% of total sales, Class B (yellow part) accounts for 10% of total sales, and Class C (red part) accounts for about 80% of total sales. 10% of total sales. Next, you can conduct long-tail analysis, formulate marketing strategies, and more.
RFM Analysis and Data Visualization
basic concept
RFM model is a commonly used model in customer relationship management (CRM). RFM model is an important tool and means to measure customer value and customer profitability. The model describes the value of a customer through three indicators: the customer's recent purchase behavior, the overall frequency of purchases, and the amount of money spent.
The RFM model shows the entire outline of a customer more dynamically, which provides a basis for personalized communication and service. At the same time, if the customer has been dealing with the customer for a long time, it can also accurately judge the long-term value of the customer (even is the lifetime value), which supports more marketing decisions by improving the condition of the three metrics.
In the RFM model, three key factors are included, namely:
- R(Recency): Indicates how far the customer last purchased, that is, the most recent consumption, the more recent the consumption time, the greater the value of the customer
- F(Frequency): Indicates the number of purchases made by customers in a recent period of time, that is, the consumption frequency. Users who buy frequently are regular customers, and their value is definitely greater than that of customers who come here once in a while.
- M (Monetary): Indicates the amount purchased by the customer in the recent period of time, that is, the customer's spending power. Usually, the average single consumption amount of the customer is used as a measure. The more consumption, the greater the value of the user.
The last consumption, consumption frequency, and consumption amount are the most important and easiest ways to measure consumer value, which fully demonstrates the guiding significance of these three indicators for marketing activities. Among them, the most recent consumption is the strongest predictor.
Through the above analysis, the customer groups can be classified:
Customer Type and Level | R | F | M | customer characteristics |
---|---|---|---|---|
Important value customer (A-level/111) | high (1) | high (1) | high (1) | Recent consumption time is near, consumption frequency and consumption amount are high |
Important Development Clients (Class A/101) | high (1) | low (0) | high (1) | The recent consumption time is relatively short, the consumption amount is high, but the frequency is not high, the loyalty is not high, and the potential users must focus on development |
Important Customer Retention (Class B/011) | low (0) | high (1) | high (1) | Recently, the consumption time is far away, and the consumption amount and frequency are very high. |
Important Customer Retention (Class B/001) | low (0) | low (0) | high (1) | Recently, users who spend a long time and have infrequent consumption but have a high consumption amount may be users who are about to lose or have been lost, and should be based on retention measures. |
General Value Customers (Class B/110) | high (1) | high (1) | low (0) | The recent consumption time is short and the frequency is high, but the consumption amount is low, and the unit price of its customers needs to be increased. |
General Development Clients (Grade B/100) | high (1) | low (0) | low (0) | Recently, the consumption time is relatively recent, and the consumption amount and frequency are not high. |
General Keeping Customers (Class C/010) | low (0) | high (1) | low (0) | Recently, the consumption time is long and the frequency of consumption is high, but the amount is not high. |
General Customer Retention (Class C/000) | low (0) | low (0) | low (0) | very low |
Data analysis case
- renderings
- Implementation steps
Suppose you have the following sample data:
client's name | date | Amount of consumption | consumption quantity |
---|---|---|---|
Shanghai Co., Ltd | 2020-05-20 | 76802 | 2630 |
The dataset needs to be processed into the following format:
Specific SQL implementation
SELECT customer_name,-- client's name customer_avg_money,-- Average spending by current customers customer_frequency, -- The current customer's consumption frequency total_frequency,-- Total consumption frequency of all customers total_avg_frequency, -- Average consumption frequency of all customers customer_recency_diff, -- The difference between the current customer's last purchase date and the current date total_recency, -- The average value of the difference between the last purchase date of all customers and the current date monetary,-- Consumption Amount Vectorization frequency, -- Consumption frequency vectorization recency, -- Vectorization of recent consumption rfm, -- rfm CASE WHEN rfm = "111" THEN "Important value customers" WHEN rfm = "101" THEN "Important development customers" WHEN rfm = "011" THEN "Important to keep customers" WHEN rfm = "001" THEN "Important customer retention" WHEN rfm = "110" THEN "General Value Customers" WHEN rfm = "100" THEN "General Development Clients" WHEN rfm = "010" THEN "generally keep customers" WHEN rfm = "000" THEN "General customer retention" END AS rfm_text FROM (SELECT customer_name,-- client's name customer_avg_money,-- Average spending by current customers customer_frequency, -- The current customer's consumption frequency total_avg_money ,-- Average total spend by all customers total_frequency,-- Total consumption frequency of all customers total_frequency / count(*) over() AS total_avg_frequency, -- Average consumption frequency of all customers customer_recency_diff, -- The difference between the current customer's last purchase date and the current date avg(customer_recency_diff) over() AS total_recency, -- The average value of the difference between the last purchase date of all customers and the current date if(customer_avg_money > total_avg_money,1,0) AS monetary, -- Consumption Amount Vectorization if(customer_frequency > total_frequency / count(*) over(),1,0) AS frequency, -- Consumption frequency vectorization if(customer_recency_diff > avg(customer_recency_diff) over(),0,1) AS recency, -- Vectorization of recent consumption concat(if(customer_recency_diff > avg(customer_recency_diff) over(),0,1),if(customer_frequency > total_frequency / count(*) over(),1,0),if(customer_avg_money > total_avg_money,1,0)) AS rfm FROM (SELECT customer_name, -- client's name max(customer_avg_money) AS customer_avg_money , -- Average spending by current customers max(customer_frequency) AS customer_frequency, -- The current customer's consumption frequency max(total_avg_money) AS total_avg_money ,-- Average total spend by all customers max(total_frequency) AS total_frequency,-- Total consumption frequency of all customers datediff(CURRENT_DATE,max(customer_recency)) AS customer_recency_diff -- The difference between the current customer's last purchase date and the current date FROM (SELECT customer_name, -- client's name avg(money) over(partition BY customer_name) AS customer_avg_money, -- Average spending by current customers count(amount) over(partition BY customer_name) AS customer_frequency, -- The current customer's consumption frequency avg(money) over() AS total_avg_money,-- Average total spend by all customers count(amount) over() AS total_frequency, --Total consumption frequency of all customers max(sale_date) over(partition BY customer_name) AS customer_recency -- The current customer's last purchase date FROM customer_sales) t1 GROUP BY customer_name)t2) t3
Through the above analysis, the corresponding customers can be labeled with customer characteristics, so that different marketing strategies can be specified for certain types of customers.
Boston Matrix and Data Visualization
basic concept
Boston Matrix (BCG Matrix) is also known as market growth rate-relative market share matrix, Boston Consulting Group method, four-quadrant analysis method, product series structure management method, etc.
The BCG matrix distinguishes 4 business combinations:
- 1. Star business (Stars, refers to high growth, high market share)
- 2. Question Marks (High Growth, Low Market Share)
- 3. Cash cows business (Cash cows, refers to low growth, high market share)
- 4. Thin dog business (Dogs, refers to low growth, low market share)
The Boston Matrix analyzes and determines the product structure of an enterprise through sales growth rate (an indicator reflecting market gravity) and market share (an indicator reflecting enterprise strength).
case
- renderings
- Implementation steps
This case is based on the analysis of customers, classifying customers and finding star customers, cash cow customers, problem customers and thin dog customers.
Suppose the dataset has the following style:
customer type | client's name | Amount of consumption | |
---|---|---|---|
Class A | Shanghai** Company | 20000 | 2020-05-30 |
First, you need to calculate the customer unit price: the average consumption amount of each customer, that is, the customer unit price = the total consumption amount of a customer) / the number of times a customer consumes
Next, you need to calculate the number of records: the number of consumption by each customer, that is, the total number of consumption by a customer
Next, you need to calculate the average consumption amount: the average consumption amount of all customers, that is, the total consumption amount of all customers / the number of consumption by all customers
Finally, calculate the average consumption times: the average consumption times of all customers, that is, the total consumption times of all customers/total number of customers
Specific SQL implementation:
SELECT customer_name, -- client's name customer_avg_money, -- Customer price customer_frequency , -- The number of purchases by the current customer total_avg_money,-- Average spending by all customers total_frequency / count(*) over() AS total_avg_frequency -- average consumption FROM (SELECT customer_name, -- client's name max(customer_avg_money) AS customer_avg_money, -- Customer price max(customer_frequency) AS customer_frequency , -- The number of purchases by the current customer max(total_avg_money) AS total_avg_money,-- Average spending by all customers max(total_frequency) AS total_frequency --Total consumption frequency of all customers FROM ( SELECT customer_name, -- client's name avg(money) over(partition BY customer_name) AS customer_avg_money, -- Customer price count(*) over(partition BY customer_name) AS customer_frequency, -- The number of purchases by the current customer avg(money) over() AS total_avg_money,-- Average spending by all customers count(*) over() AS total_frequency --Total consumption frequency of all customers FROM customer_sales ) t1 GROUP BY customer_name) t2
After the above analysis, we can roughly see the customer portrait:
- A customer whose consumption frequency exceeds the average level and whose consumption intensity (unit price per customer) also exceeds the average level: it is judged as a star customer, and this type of customer needs to be focused on;
- A customer's consumption frequency exceeds the average, but the consumption intensity does not reach the average level each time: It is judged as a cash cow customer. This type of customer usually consumes frequently and can bring a relatively stable cash flow to the company. Customers are the cornerstone of corporate profits;
- A customer whose consumption frequency does not reach the average level, but each time the consumption intensity exceeds the average level: it is a problem customer. This type of customer is most likely to be converted into a star customer, but because of the customer’s potential problems, the frequency of consumption is not high. , such customers need to focus on follow-up and long-term communication;
- Customers whose consumption frequency has not reached the average level and consumption intensity has not reached the average level: they are thin dog customers. These customers usually account for the majority of corporate customers, and only need general maintenance. energy.
Summarize
This article mainly introduces the common data analysis methods that should be understood in data warehouse development. There are three main types: Pareto analysis, RFM analysis and Boston matrix analysis. This paper introduces the basic concepts, operation steps and SQL implementation of the three analysis methods, and gives the corresponding visual analysis charts. Each case is a real application scenario of the enterprise. It is hoped that students who develop data warehouses can provide some perspectives on the analysis of observation data, so that in the actual development process, they can think more about the application value of data and how data can empower businesses, so as to further improve their comprehensive capabilities.
Official account "Big Data Technology and Data Warehouse", reply "Data" to receive the big data package