[Frontiers of Trading Technology] Experience Sharing of Real-time Computing System Construction

Abstract: Real-time computing technology has been applied to various fields such as advertising, e-commerce, games, and entertainment. For example, e-commerce websites analyze user attributes in real time, and push relevant products to customers based on the analysis results; online games analyze player data in real time, and then analyze game parameters and parameters in real time. Balance adjustment. This article focuses on the process and experience of CITIC Construction Investment Securities using real-time computing technology to build a real-time computing system.

Keywords: real-time computing, real-time data processing, data synchronization

1 Overview

1.1 What is real-time computing

   Real-time computing refers to unbounded data processing of unlimited data, which is generally performed for massive data, and the delay requirement is second. Real-time computing is mainly divided into two parts:Real-time storage of data and real-time calculation of data.

1.2 Application areas

   At present, real-time computing technology has been applied to various fields such as advertising, e-commerce, games, entertainment, etc. For example, e-commerce websites analyze user attributes in real time, and push relevant products to customers based on the analysis results; online games analyze player data in real time, and then analyze game parameters and parameters in real time. Balance adjustment.
   China Securities Co., Ltd. applied it to the financial field and built a real-time computing system to aggregate customers' assets in stocks and debt bases, wealth management, options, two financing, precious metals and other businesses in real time, and provide customers with a full range of asset query and analysis. Serve.

1.3 Introduction to real-time computing system

   Benefiting from the gradually mature real-time computing technology, CSC built a real-time computing system at the end of 2018 to provide real-time asset query services for employees and customers. days, interest-bearing days and other indicators.
   The real-time computing system puts customers in centralized trading, two financing, OTC,The data assets in the five major systems of options and gold (hereinafter referred to as the "five major systems") are aggregated and changed in synchronization with the data of the five major systems.

2. Construction experience

2.1 System Architecture

The number of customers included in the five major systems is tens of millions. With so many customers, various operations during the opening of the market will generate massive data. How to meet the needs of real-time data processing under the conditions of such large-scale data is a real-time computing system. problem that must be solved.
Before the market opens, the real-time computing system collects the T-1 day data after liquidation from the five major systems as the basis. After the market opens, the real-time computing system collects the changed data of the five major systems, and updates it to the basic data after processing. In order to facilitate understanding, the author has simplified the original architecture of the system, as shown in Figure 1:


Figure 1 Structure diagram of real-time computing system

  The real-time data flow in Figure 1 is as follows:
  1. Attunity The real-time data generated from the collection disk of the five major system data sources is pushed to the Kafka;At the same time, the market transfer tool obtains market data from the market service and pushes it to Redis;
  2. real-time computing system cluster Spark Streaming Consume in first-in, first-out order Kafka data within;
  3. Spark Streaming The real-time data is processed and updated to Redis Basic data within the cluster;
  4. Other systems access the real-time data interface on the winning platform;
  5. real-time data interface from Redis Get the latest customer data within the cluster and transform it into json The formatted data is returned to other systems.
  During the system construction process, the project team also demonstrated another solution, that is, to collect data directly from the five major system databases, and provide it to visitors after summarizing. However, after discussion and analysis, it is found that this scheme has many shortcomings:
  1. Due to security reasons, the five major systems cannot directly expose the database to the real-time computing system, which requires the five major systems to develop special interfaces for the real-time computing system to use, and the development and maintenance of the interface is heavy;
  2. Even if the five major systems provide a query interface, when the interface is called, the five major systems have to retrieve the data of the queried customer from the data of all customers, and not only query one table, the query efficiency is low;
  3. This solution requires a dedicated server to run, and a certain amount of server and network resources must be invested, which is not less expensive than other solutions.

2.2 Key Technologies

2.2.1 Real-time data collection

   In the early stage of system construction, the project team chose OGG as a real-time data acquisition tool.(Although OGG There are many problems and the efficiency is not high)
   OGG The software is a log-based structured data replication software, which can realize real-time capture, transformation and delivery of a large amount of transaction data, and realize data synchronization between the source database and the target database. Oracle GoldenGate Data additions, deletions, and changes are obtained by parsing the online logs or archived logs of the source database, and then applying these changes to the target database to achieve synchronization and active-active between the source database and the target database.
   After the completion of the first-phase system construction, the project team found that OGG It has caused a lot of workload to the operation and maintenance team. Every time the five major system databases are maintained or upgraded, the operation and maintenance team will spend a lot of time to restore OGG Serve. Therefore, the project team decided to use a more suitable data synchronization tool–Attunity Replicate(short name Attunity)to replace OGG. 
   Attunity It is a heterogeneous database change synchronization tool based on database logs. it adopts B/S Architecture, configure and maintain data source and target table through browser graphical interface, mouse click and drag, support including Oracle,SQLserver,DB2 Cross-platform data synchronization between more than a dozen homogeneous and heterogeneous databases. In addition, it supports parallel work, and the degree of concurrency can be customized, which makes it easier than OGG Lower latency.
   than OGG,Attunity Easier deployment and maintenance. Attunity No need to install software on the source or target operating system, just a separate server installation service,based on WEB The maintenance page is simple and convenient, and only a few operations are required to complete the data synchronization configuration, which greatly reduces the difficulty and workload of operation and maintenance.

2.2.2 Intermediate data storage

   The five major systems generate massive amounts of real-time data during market opening, even those running on clusters Spark Streaming It is also impossible to completely consume the data at the moment it is generated. Therefore, a buffer pool is needed to temporarily store real-time data. Considering various factors, the real-time computing system is selected Kafka as a buffer pool.
   Kafka It is a high-throughput distributed publish-subscribe messaging system. The real-time computing system sets up a special cluster to run Kafka,The cluster has enough memory space as a buffer pool to store real-time streaming data. Attunity As a producer, to Kafka real-time streaming data; Spark Streaming as a consumer. In order to ensure that every step of the customer's operation can be accurately synchronized to the real-time computing system, the system consumes strictly in accordance with the first-in, first-out method. Kafka data within.

2.2.3 Real-time stream computing

   Because real-time computing systems require high timeliness, there must be a tool that can process data quickly. Project team selection Spark Streaming to undertake data processing tasks. Spark Streaming Yes Spark core API It is an extension of the real-time stream data processing function with high throughput and fault-tolerant mechanism, which can meet the low-latency requirements of real-time computing systems.

Before the market opens, the real-time computing system pushes T-1 assets to Redis. After the market opens, Spark Streaming continues to consume data from Kafka and then process it. The data processing logic inside Spark Streaming is shown in Figure 2:

Figure 2 Data processing logic

   The data processing layer in Figure 2 Spark Streaming In cooperation with upstream and downstream tools, it undertakes the work of real-time stream processing. Take a customer's stock transaction during market opening as an example, Spark Streaming The simplified process of handling the client's assets is shown in Figure 3:

Figure 3 Stock transaction synchronization logic

   The interface service layer is responsible for classifying customer assets according to relevant business rules. In order to provide the caller with a large and comprehensive asset summary data, the system aggregates the customer's asset data in the five major systems at the interface service layer, and generates a unique identification with the customer number for each customer. json data, returned to the caller.
   returned by the interface json The asset types included in the data are divided by currency: RMB, USD, HKD. Among them, there are fewer customers with assets in US dollars and Hong Kong dollars, and their business is relatively simple; When the above types of assets exist independently in their respective systems, the assets and liabilities are relatively simple, and data such as total assets, total liabilities, and net assets can be obtained without tedious calculations. However, if the assets and liabilities of the five major systems are to be aggregated When we come together, we must consider various situations from a global perspective, such as: how to deal with in-transit assets in ordinary accounts when customers subscribe for OTC open-end funds; how to correct their available funds when customers use cash wealth management products to buy stocks; How to update the available funds after redemption of wealth management products. In order to solve these complex problems, the project team has done a lot of research and sorting, and finally formed a customer panoramic asset model that includes five major system assets. Figure 4 shows the content of the customer's RMB assets in this model:

Figure 4 Overview of Client RMB Assets

2.2.4 External service interface

   go through Spark Streaming To process and aggregate customer data, a container with large capacity and fast response is required to accommodate it; when data needs to be provided externally, a high-concurrency service platform is also required. The winning platform [1] independently developed by CSC Securities can meet the above requirements. The project team developed a set of interface services based on the winning platform.
   The winning platform assigns a set of real-time computing systems Redis A cluster dedicated to storing aggregated customer data. To maximize Redis The throughput of the cluster, the data is in Redis It is stored in serialized form within the cluster. When other systems call the system interface, the interface changes from Redis Extract the corresponding data and process it into json Format, returned to the caller. Redis The data in it is kept updated in real time, the interface service is waiting to be called at any time, and the two do not interfere with each other. The schematic diagram of the interface call is shown in Figure 5:

Figure 5 Principle of interface calling

3. Future Outlook

With the increasing development of technology and the continuous exploration of the project team, the real-time computing system can also be applied to a wider range of fields in the future, such as real-time strategy recommendation for stocks, real-time warning of two financial services, etc. CSC will rely on the real-time computing system to provide customers with a full range of financial services, and its financial technology level will also reach a new level.

[1]: "Winning Platform" is a set of micro-service architecture that follows a unified technical architecture and specifications, including three parts: technology middle platform, business middle platform, and data middle platform. See the article for details.

This article is selected from the 40th issue of "Frontiers of Trading Technology" (September 2020)
Xiao Gang, Li Ning, Zhang Haiyuan / Information Technology Department of CITIC Construction Investment Securities

Tags: Big Data Spark

Posted by alvinho on Fri, 06 May 2022 03:10:34 +0300