Post calculation script of SQL

In most cases, we can complete the database calculation with SQL (stored procedure), but if we encounter some complex operations that SQL is not good at, we can only read the data out of the database with other program languages, and then complete the calculation outside the database. Such program languages often appear in the form of simple scripts, which we call SQL post calculation scripts here.
The operations that SQL is not good at mainly include complex set calculation, ordered calculation, association calculation, multi-step calculation, etc. SQL aggregation is not thorough enough and there is no explicit collection data type, which makes it difficult to reuse the collection generated in the calculation process. For example, the collection must be forcibly summarized after grouping, and the subset based on grouping cannot be calculated again; SQL is designed based on unordered set theory. It is very troublesome to deal with ordered operations such as cross row groups and ranking. It often uses JOIN or sub query to temporarily generate sequence numbers, which is not only difficult to write, but also very inefficient. SQL does not support the reference of records, and can only use sub query or JOIN statement to describe the association relationship. Once there are many levels or self association, the code will be extremely complex; SQL itself does not advocate multi-step code and often forces programmers to write long statements nested in many layers. Although using stored procedures can solve this problem to a certain extent, sometimes the actual environment does not allow us to use stored procedures. For example, DBA strictly controls the permissions of stored procedures, old databases and small databases do not support stored procedures, and the debugging of stored procedures is also very inconvenient, It is not very suitable for writing process calculations.
In addition to the above complex operations, SQL post calculation scripts are also used in some cases. For example, computing logic needs to be migrated between different kinds of databases, involving non relational databases; The input source or output target is not only the database, but Excel, text and other files; It is also possible to perform hybrid calculations between multiple databases. These calculations will be used outside the SQL library.
For the post calculation script of SQL, of course, the most important function is to realize those complex operations that SQL is not good at. In addition, it is better to have some more advanced features, such as computing files, non relational databases and other diverse data, can handle a large amount of data, and the operation performance can not be too slow. Of course, the most basic thing is to easily support reading and writing databases, so as to realize post calculation of SQL.
Common scripts for post SQL calculation include JAVA, python, pandas and esProc. Let's have an in-depth understanding of these scripts and see their ability differences in post SQL calculation.

C + +, JAVA and other high-level languages are omnipotent in theory, and naturally they can realize the operations that SQL is not good at. JAVA supports generics, and the collection is relatively complete, which can realize complex collection operations. JAVA array has serial number, which can realize orderly operation. JAVA supports object reference, which can be used to represent relationships, and there is no problem with association operation. JAVA supports branch, loop and other procedural syntax, which can easily realize multi-step complex operation.
However, JAVA lacks structured class library. Even the simplest structured calculation must be hard coded, and the most basic structured data types must be established manually, which will lead to lengthy and cumbersome code.
Take an example of orderly calculation: find the longest consecutive rise days of a stock. The library table AAPL stores the stock price information of a stock. The main fields include trading date and closing price. Please calculate the longest consecutive rising days of the stock.
Realize this task according to the natural idea: cycle the stock records with orderly dates. If this record is higher than the previous record, add 1 to the number of consecutive rising days (initially 0). If it is lower, compare the number of consecutive rising days with the current maximum number of consecutive rising days (initially 0), select a new current maximum number of consecutive rising days, and then clear 0 to the number of consecutive rising days. Until the end of this cycle, the current maximum continuous rise days is the final maximum continuous rise days.
SQL is not good at orderly calculation and cannot be realized with the above natural ideas. It can only use some strange and difficult skills: divide the stock records orderly by date into several groups, and the records that rise continuously into the same group, that is, if the stock price of a day rises compared with the previous day, it will be divided into the same group as the records of the previous day. If it falls, it will start a new group. Finally, let's look at the largest number of members in all groups, that is, the number of days that have increased continuously.
The details are as follows:

select max(continue_inc_days)
from (select count(*) continue_inc_days
      from   (select sum(inc_de_flag) over(order by transDate) continue_de_days
                    from (select transDate,
                                          case when
                                          price>LAG(price)   over(order by transDate)
                                          then 0 else 1 end inc_de_flag
                                           from AAPL) )
                  group by continue_de_days)

This SQL is not very long, but it is nested in four layers. The skills used are strange and difficult to understand. It is difficult for ordinary people to think of such code.
When implemented in JAVA, you can return to the natural idea:

package stock; 
import java.sql.*; 
public class APP {
       public static void main(String[] args) throws SQLException,
                            ClassNotFoundException   {
                                          Connection con = null;
                                          con = DriverManager .getConnection( "jdbc:mysql://","root", ""); 
                                          String dql = "select   * from AAPL order by transDate";
                                          PreparedStatement   stmt = con.prepareStatement(dql, ResultSet.TYPE_SCROLL_INSENSITIVE, ResultSet.CONCUR_READ_ONLY);
                                          ResultSet aapl = stmt.executeQuery();
                                          int continue_inc_days = 0;
                                          int max_continue_inc_days = 0;              
                                          float last_Price = 0;              
                                          while ( {
                                                               float price = aapl.getFloat("price");
                                                               if (price >= last_Price) {
                                                               } else {
                                                                         if (continue_inc_days >= max_continue_inc_days) { 
                                                                                 max_continue_inc_days = continue_inc_days;
                                                               continue_inc_days = 0; 
                                          last_Price = price; 
                       System.out.println("max_continue_inc_days=" + max_continue_inc_days); 
                       if (con != null) 

The latter part of the code is the idea described above, which can be completed as long as one layer of loop.
However, we also found that although the code written by Java is simple and not difficult, the code is obviously very lengthy.
The complexity of this problem is not high. It does not involve the common structured data calculation such as grouping and connection. Otherwise, the amount of code will be even more amazing. Limited to space, JAVA will not be used as an example.
In terms of advanced functions such as diverse data, optimizing performance and processing big data, JAVA is also characterized by "can be realized, but it is too cumbersome", which will not be repeated here.
JAVA is an excellent enterprise level general language, but the other meaning of general is often unprofessional. In other words, JAVA lacks professional structured computing class library, and the code is lengthy and cumbersome, which is not an ideal post SQL computing script.

Python pandas
Python has a simple syntax and many third-party function libraries, including Pandas, which serves structured computing. Because of this, Pandas is often used as a post calculation script for SQL.
As a structured computing function library, Pandas has much stronger ability to simplify complex SQL operations than JAVA.
For example, the same ordered operation "find the longest consecutive rising days", and the Pandas code is as follows:

import pymysql
import pandas as pd
conn =   pymysql.connect(
    host = "",
    port = 3306,
    user = "root", 
    password = "",
    database = "mysql",
aapl = pd.read_sql_query("select price from AAPL order by   transDate", conn)
continue_inc_days=0 ; max_continue_inc_days=0
for i in aapl['price'].shift(0)>aapl['price'].shift(1):
    continue_inc_days   =0 if i==False else continue_inc_days +1 
    max_continue_inc_days   = continue_inc_days if max_continue_inc_days   < continue_inc_days else max_continue_inc_days

In the above code, Pandas provides a data structure dataFrame for structured computing. This data structure naturally has serial numbers, which can simplify the code in ordered operation and make cross line data retrieval easier than JAVA. In addition, the encapsulation of SQL fetching by Pandas is also very compact, which is shorter than JAVA code.
Another example is set calculation: a row is split into multiple rows. Split the string of the omies table into two spaces, where the omies field is separated by the omies member, and the omies field needs to be separated by a space, and the omies field needs to be separated by a space.
Data before processing

Processed data:

The core code of Pandas is as follows (omitting database input and output, the same below):

split_dict = pd.read_sql_query("select * from tb", conn)
split_list = []
for key,value in split_dict.items():
    anomalies = value[0].split(' ')    
    key_array = np.tile(key,len(anomalies))    
    split_df =   pd.DataFrame(np.array([key_array,anomalies]).T,columns=['ID','ANOMALIES'])    
df = pd.concat(split_list,ignore_index=True)

In the above code, Pandas uses the set function to directly split the string into dataframes, and then uses the set function to directly merge multiple dataframes. The code is very concise. Although JAVA can achieve similar functions, they all need to be implemented manually, and the code is much more cumbersome.
As a structured computing function library, Pandas code is indeed simpler than JAVA, but this is limited to the case of limited complexity. If the complexity is further improved, Pandas code will become lengthy and difficult to understand.
For example, this example involving set calculation + ordered calculation: continuous duty. The duty table in the warehouse records the daily duty situation. A person will usually be on duty for several working days, and then change people. Now please calculate the continuous duty situation of each person in turn according to the duty. The data structure is as follows:
Before treatment (duty)

After treatment

The core pandas code is as follows:

duty = pd.read_sql_query("select date,name from duty order by   date", conn)
name_rec = ''
start = 0
duty_list = []
for i in range(len(duty)):
    if name_rec == '':        
        name_rec = duty['name'][i]    
    if name_rec != duty['name'][i]:
           begin =   duty['date'].loc[start:i-1].values[0]        
           end =   duty['date'].loc[start:i-1].values[-1] 
           start = i        
           name_rec = duty['name'][i]
 begin = duty['date'].loc[start:i].values[0]
 end = duty['date'].loc[start:i].values[-1]
 duty_b_e = pd.DataFrame(duty_list,columns=['name','begin','end']) 

The process of database output has been omitted above. You can see that the code is still a little cumbersome.

Another example is set calculation + multi-step operation: calculate installment loan details. The library table loan records loan information, including loan ID, total loan amount, number of periods and annual interest rate, as shown below:

It is necessary to calculate the details of each period, including current repayment amount, current interest, current principal and remaining principal. The calculation results are as follows:

The core code of Pandas to realize the above operation is as follows:

loan_data = pd.read_sql_query("select loanID,LoanAmt,Term,Rate from   loan", conn)
loan_data['mrate'] = loan_data['Rate']/(100*12)
loan_data['mpayment'] =   loan_data['LoanAmt']*loan_data['mrate']*np.power(1+loan_data['mrate'],loan_data['Term'])   \                          /(np.power(1+loan_data['mrate'],loan_data['Term'])-1)
loan_term_list = []
for i in range(len(loan_data)):
    loanid =   np.tile(loan_data.loc[i]['LoanID'],loan_data.loc[i]['Term'])    
    loanamt = np.tile(loan_data.loc[i]['LoanAmt'],loan_data.loc[i]['Term'])    
    term =   np.tile(loan_data.loc[i]['Term'],loan_data.loc[i]['Term'])   
     rate =   np.tile(loan_data.loc[i]['Rate'],loan_data.loc[i]['Term'])    
     payment =   np.tile(np.array(loan_data.loc[i]['mpayment']),loan_data.loc[i]['Term'])    
     interest = np.zeros(len(loanamt))    
     principal = np.zeros(len(loanamt))
     principalbalance  = np.zeros(len(loanamt))    
     loan_amt = loanamt[0]    
     for j in range(len(loanamt)):       
      interest[j] =   loan_amt*loan_data.loc[i]['mrate']        
      principal[j] = payment[j]   - interest[j]        
      principalbalance[j] =   loan_amt - principal[j]        
      loan_amt =   principalbalance[j]    
    loan_data_df =   pd.DataFrame(np.transpose(np.array([loanid,loanamt,term,rate,payment,interest,principal,principalbalance])), 
                     columns =   ['loanid','loanamt','term','rate','payment','interest','principal','principalbalance'])
 loan_term_pay = pd.concat(loan_term_list,ignore_index=True)

It can be seen that although Python is much better than JAVA in simplifying complex SQL operations, it is limited to simple cases. If the requirements are more complex, the code will become lengthy and difficult to understand. The reason for this phenomenon may be that Pandas is only a third-party function library, can not get the underlying support of Python from the syntax level, and the design is not professional enough.

The lack of professionalism of pandas is also reflected in the diversity of data. Pandas does not develop a unified interface for various data sources. It only supports common local files, but does not support complex data sources, such as Hadoop and MongoDB. Users have to find a third-party (actually the fourth party) function library and write complex access code. Pandas does not even have a unified database interface. For example, MySQL has several third-party function libraries, including PyMySQL, sqlalchemy and MySQL dB. However, this problem is not serious for most desktop application scenarios, and common database Python can basically simply support it.
For the multi-source mixed correlation problem, as long as the data of various data sources can be read out, it can be basically realized. Pandas's performance in this regard is basically satisfactory. However, according to the above statement, pandas is easy to implement simple mixed association relations, and once there is a more complex association operation, the implementation process will become difficult.
In terms of large amount of data, pandas's performance is not satisfactory. Pandas does not have cursor data type, which leads to that when solving the calculation of large amount of data, it must be hard coded to realize circular data retrieval, and the internal and external memory exchange cannot be carried out automatically. Therefore, the code is extremely cumbersome. For details, please refer to How Python Handles Big Files
The operational performance of Pandas is also average, but it is basically sufficient. Multithreading parallelism is often criticized, and it is difficult to implement such operations in Python. For example, database IO is generally slow, but parallel fetching can be used to improve fetching performance when the database is not busy. Python needs the help of other third-party function libraries to realize parallelism. The code is extremely cumbersome, and there is no guarantee for the convenience of expression efficiency, execution efficiency and stability.
Although Pandas is a structured computing function library, it is still not easy to use.

Like Pandas, esProc also has rich structured computing functions. Unlike Pandas, esProc is a product supported by commercial companies. It is a professional structured computing language, not a third-party library function of the open source community, and there is no loose parent organization. esProc can design a consistent structured calculation syntax from a global perspective, and can design a unified structured data type from bottom to top, so as to match and combine functions with maximum flexibility, so as to quickly and conveniently solve the problems encountered in post SQL calculation.
As a professional structured computing language, esProc is good at simplifying complex SQL operations. For example, to find the longest consecutive rising days and realize the natural idea mentioned above, esProc only needs two lines:

The above code uses the sequence table and circular function. The sequence table is a data structure dedicated to structured calculation. It is easier to get data across rows than Pandas, and it is more convenient to realize orderly calculation. The circular function can avoid most of the for statements (for should still be used in complex cases), and can greatly simplify the code. In addition, esProc encapsulates SQL fetches more compactly and is shorter than Pandas code.
For another example, if a line is split into multiple lines, the esProc code is still short:

esProc can still be easily implemented even if the requirement complexity is further improved.
For example, in the case of continuous duty, esProc code is much shorter than Pandas:

Another example is to calculate the details of installment loans. esProc is also shorter than Pandas:

esProc is usually easy to implement for complex operations that are difficult to implement by Pandas, and the code is not difficult. For example, tasks involving multi-step algorithm + set operation + dynamic table structure: insert sub tables horizontally into sub tables.
Source table relationship

Target results

esProc can greatly simplify this Code:

As a professional structured computing language, esProc can not only greatly simplify the complex operations that SQL is not good at, but also have higher-level ability to solve some special situations.
In terms of diverse data, esProc supports a variety of file formats and complex data sources, such as Hadoop, MongoDB, etc. Furthermore, using the same code, data analysts can calculate data from different sources, both database and non database.
In terms of large data volume, esProc provides a cursor mechanism from the bottom layer, hides the details of internal and external memory exchange from the top layer, and allows data analysts to intuitively deal with large data volume with a syntax similar to dealing with small data volume.
For example, the warehouse table orders records a large number of orders of e-commerce, all of which will exceed the memory. Now it is necessary to calculate the three orders with the largest sales of each salesperson outside the warehouse. esProc code is as follows:

esProc also provides many easy-to-use methods for performance optimization. For example, the data in the orders table is roughly the same every month. Please conduct multi-threaded parallel query by month, so as to greatly improve the query performance. esProc only needs the following code:

After the previous comparison, we can find that esProc has rich structured functions and is a professional structured computing language. It can greatly simplify the complex computing logic that SQL is not good at. It is a more ideal post SQL computing script.

Posted by wmguk on Mon, 02 May 2022 09:21:17 +0300