A distributed system is one in which a service is split into multiple sub-services deployed on different server nodes; together these nodes form the system. The nodes of a distributed system can be deployed anywhere: in different racks, in different machine rooms, or even in different cities.
Difference between a distributed system and a cluster:
Cluster: multiple people do the same thing together.
Distributed: many people do different things together
Characteristics of distributed systems:
(4) Lack of a global clock
(5) Failures always happen
2. Problems faced by distributed systems
1) Abnormal communication
Because the network is inherently unreliable, every network communication carries the risk of unavailability (of optical fiber, routers, DNS, or other hardware and systems), which can prevent the distributed system from communicating at all. Moreover, even when communication between nodes works normally, its latency is far greater than that of a single-machine call and varies widely, which affects how messages are sent and received; message loss and message delay are therefore very common.
2) Network partition
When sub-networks can no longer reach each other while each one still works normally internally, the network environment of the whole system is split into several isolated regions, and local small clusters appear in the distributed system. In extreme cases these small clusters independently carry out functions that were meant to be performed by the whole distributed system, including transaction processing, which poses a serious challenge to distributed consistency.
3) Node failure
Node failure is another common problem in distributed systems: the crash (going down or becoming unresponsive) of a server node that makes up the distributed system. Experience shows that every node can fail, and failures happen often.
4) Three states
Every request and response in a distributed system has three possible states ("three states"): success, failure, and timeout. Because the network is unreliable, most communications receive a success or failure response, but when the network misbehaves a timeout occurs, usually in one of two situations:
- Because of a network problem, the request was lost in transit and never reached the receiver.
- The receiver received and processed the request, but the response was lost on its way back to the sender.
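These two timeout situations can be sketched as a tiny classifier. `call_remote` and its flag arguments are hypothetical, standing in for a real RPC layer; the point is that the caller cannot tell a lost request from a lost response, since both surface as a timeout:

```python
def call_remote(send_ok: bool, process_ok: bool, reply_ok: bool) -> str:
    """Classify the outcome of one request/response round trip."""
    if not send_ok:
        return "timeout"   # request lost before reaching the receiver
    if not process_ok:
        return "failure"   # receiver processed it and reported an error
    if not reply_ok:
        return "timeout"   # processed successfully, but the reply was lost
    return "success"
```

Note that `call_remote(True, True, False)` and `call_remote(False, True, True)` both return `"timeout"` even though the receiver's state differs, which is exactly why timeouts are so hard to handle.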
3. Distributed theory: consistency
1) Distributed consistency: when data is stored in multiple copies, the data in each copy is consistent.
2) Replica consistency: in a distributed system, data usually has multiple replicas. If a single database handled all data requests, the four ACID properties would largely guarantee consistency. With multiple replicas comes a synchronization problem: there is essentially no way to update all machines, including backups, at the same instant. Because of network delay, even if update requests are sent to every machine simultaneously, their response times will differ, and any time gap means some machines will temporarily hold inconsistent data.
3) Consistency classification
1. Strong consistency: once a write succeeds, any subsequent read returns the newly written value.
2. Weak consistency
This consistency level does not promise that a written value can be read immediately after a successful write, nor does it promise how long it will take for the data to become consistent; it only tries its best to ensure that the data reaches a consistent state within some time bound (for example, seconds).
3. Read-your-writes consistency
This guarantees that users' reads are consistent with their own writes, i.e., users always see their own updates immediately. For example, when we post to our social feed, it does not matter whether friends see the post right away, but it must appear in our own list at once. Solutions:
- Scheme 1: always read certain specific content from the master database. (Problem: heavy load on the master.)
- Scheme 2: set an update-time window; right after an update, read from the master by default, and after the window has passed, read from a slave that has recently caught up.
- Scheme 3: record the timestamp of the user's update and send it with each request; any slave whose last update time is earlier than this timestamp does not serve the request.
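Scheme 3 above can be sketched as follows. `pick_replica` and the replica map are hypothetical, with replication positions modeled as plain integer timestamps:

```python
def pick_replica(replicas: dict, last_write_ts: int) -> str:
    """Return any replica that has applied updates at least up to the
    client's last write timestamp; fall back to the primary if none has."""
    for name, applied_ts in replicas.items():
        if applied_ts >= last_write_ts:
            return name
    return "primary"   # no replica has caught up yet
```

A client that wrote at timestamp 100 would skip a replica stuck at 90 and read from one at 120, or from the primary if every replica lags.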
4. Monotonic read consistency
The data read this time must not be older than the data read last time. Because master and slave nodes apply updates at different times, a user who keeps refreshing may see a piece of data appear, disappear on the next refresh, and then reappear on the one after, like a supernatural event. The solution is similar: compute a hash value from the user ID and map that hash to a machine, so that the same user is always routed to the same machine no matter how often they refresh. They will then never read from a less up-to-date slave, which would hurt the user experience.
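The hash-based fix above might look like this minimal sketch; the names are illustrative, and any stable hash function would do:

```python
import hashlib

def replica_for_user(user_id: str, num_replicas: int) -> int:
    """Map a user ID to a fixed replica index so the same user always
    reads from the same machine (monotonic reads)."""
    digest = hashlib.md5(user_id.encode()).hexdigest()
    return int(digest, 16) % num_replicas
```

Because the mapping is deterministic, repeated refreshes by one user hit one replica, so the data they see can only move forward in time.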
5. Causal consistency
If node A notifies node B after updating a piece of data, then node B's subsequent reads and writes of that data are based on A's updated value. Meanwhile, a node C that has no causal relationship with A is under no such restriction when accessing the data.
6. Eventual consistency
Eventual consistency is the weakest of all distributed consistency models. It disregards all intermediate states and only guarantees that, after a period of time with no new updates, the data of all replicas in the system is correct and identical. Because it maximizes the system's concurrency, it is the most widely used consistency model in high-concurrency scenarios.
4. Distributed theory: the CAP theorem
The CAP theorem states that a distributed system cannot simultaneously satisfy the three basic requirements of consistency, availability, and partition tolerance; it can satisfy at most two of them.
C (Consistency): in a distributed system, all nodes hold the same data; that is, all replicas of the data are consistent.
A (Availability): reads and writes always succeed; the system is available at all times and the service stays normal.
P (Partition tolerance): the system can still provide service that satisfies consistency and availability when some nodes fail or the network partitions.
How to achieve consistency?
1. After writing into the master database, synchronize the data to the slave database
2. When writing to the master database, lock the slave during synchronization and release the lock once synchronization completes, so that stale data cannot be read from the slave after new data has been written
Characteristics of distributed consistency:
1. Because of the database synchronization step, the response to a write operation is delayed
2. To guarantee data consistency, resources are temporarily locked and released only after synchronization completes
3. If a node fails to synchronize the data, it returns an error message rather than stale data
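A minimal sketch of these characteristics, assuming a single master/slave pair and an in-process lock standing in for the database locks described above:

```python
import threading

class CPReplicaPair:
    """Consistency-first (CP) master/slave pair: the slave is locked
    while a write is synchronized, so stale reads are impossible."""
    def __init__(self):
        self._lock = threading.Lock()
        self.master = None
        self.slave = None

    def write(self, value):
        with self._lock:          # slave locked for the whole sync
            self.master = value
            self.slave = value    # synchronous replication
        # the write response is delayed until sync finishes

    def read_slave(self):
        with self._lock:          # blocks while a sync is in progress
            return self.slave
```

The cost is visible in the shape of the code: every read and write contends on the same lock, which is exactly the synchronization delay listed above.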
Availability (A)
Availability means that every operation receives a response, with no response timeouts and no response errors.
How to achieve availability?
1. After writing to the master database, synchronize the data to the slave database
2. To keep the database available, resources in the database must not be locked
3. Even if the data has not yet been synchronized, the slave must return query results, even stale ones; it must never return an error or a timeout
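The contrast with the consistency recipe above can be sketched the same way; here replication is deferred, and reads never block or fail (names are illustrative):

```python
class APReplicaPair:
    """Availability-first (AP) master/slave pair: no locks, replication
    is asynchronous, and the slave always answers, possibly with stale data."""
    def __init__(self):
        self.master = None
        self.slave = None

    def write(self, value):
        self.master = value       # replication to the slave lags behind

    def replicate(self):
        self.slave = self.master  # the async sync happens "eventually"

    def read_slave(self):
        return self.slave         # may be stale, but never errors
```

Between `write` and `replicate` the slave serves the old value, which is precisely the availability-over-consistency trade described above.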
Partition tolerance (P)
When the nodes of a distributed system are deployed in different subnets, communication between nodes will inevitably fail because of network problems. If the system can still provide service in that situation, it has partition tolerance.
How to achieve partition fault tolerance?
1. Prefer asynchronous over synchronous operations, for example replicating data from the master to the slave asynchronously, so that nodes communicate loosely and effectively
2. Add database nodes: if one slave node goes down, the other slave nodes continue to provide service
An example:
A user sends a request to N1 that changes data, updating N1's database from V0 to V1. Because the network is down, N2's database is still at V0. If a request now arrives at N2, N2 has no way to return the latest value V1. What can it do? There are only two options: either return the wrong (stale) value V0 to the user, or block and wait until network communication recovers and N2's data is updated before responding. Clearly the former sacrifices consistency and the latter sacrifices availability. Simple as this example is, it captures the point: a distributed system cannot satisfy all three CAP properties at once, so one must be abandoned, and there are obviously three possible combinations.
- Discard A (availability) and keep CP (consistency and partition tolerance)
Such a system guarantees consistency and partition tolerance at the cost of availability; in extreme cases it is allowed to become inaccessible. User experience is often sacrificed: users are made to wait until the system's data is consistent again before service resumes.
- Discard C (consistency) and retain AP (availability and partition fault tolerance)
This is the design of most distributed systems, which ensures high availability and partition fault tolerance, but will sacrifice consistency.
- Discard P (partition tolerance) and keep CA (consistency and availability)
Giving up P means giving up being a distributed system at all, and CAP then no longer applies. It can be said that P is the precondition of any distributed system, so this case does not arise in practice.
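The N1/N2 example above can be condensed into one hypothetical function showing the choice a partitioned node must make:

```python
def n2_read(partitioned: bool, local_value: str, mode: str) -> str:
    """What node N2 returns for a read, given whether the network is
    partitioned and whether the system favors AP or CP."""
    if not partitioned:
        return local_value        # network fine: local data is current
    if mode == "AP":
        return local_value        # answer immediately with stale V0
    if mode == "CP":
        return "BLOCKED"          # refuse service until the sync completes
    raise ValueError("mode must be 'AP' or 'CP'")
```

`"BLOCKED"` here is just a stand-in for blocking or returning an error: the CP branch gives up availability, the AP branch gives up consistency.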
5. Distributed theory: BASE theory
BASE is the abbreviation of three phrases: Basically Available, Soft state, and Eventually consistent, proposed by an architect at eBay.
BASE is the result of the trade-off between consistency and availability in CAP. The core idea of BASE theory is that even if strong consistency cannot be achieved, each application can adopt appropriate ways to achieve the final consistency of the system according to its own business characteristics.
① Basically available
Basic availability means that a distributed system is allowed to lose part of its availability when unpredictable failures occur; note that this is by no means equivalent to the system being unusable. Here are two examples of "basic availability":
Loss of response time: under normal circumstances an online search engine returns query results to the user within 0.5 seconds, but when failures occur (for example, power or network outages in some of the system's machine rooms), the response time may grow to 1 to 2 seconds.
Loss of function: under normal circumstances, consumers shopping on an e-commerce site (such as Taobao) can complete almost every order smoothly. But during peak promotions (such as Double 11 and Double 12), the surge in shopping traffic may force the site, in order to protect the stability (or consistency) of the system, to redirect some consumers to a degraded page.
② Soft state
What is a soft state? By contrast, consistency requires the data replicas on multiple nodes to be identical, which is a "hard state".
Soft state means the data in the system is allowed to have intermediate states, and these states are considered not to affect the overall availability of the system; that is, the synchronization between replicas on different nodes is allowed to lag.
③ Eventually consistent
Eventual consistency emphasizes that all data replicas in the system can reach a consistent state after a period of synchronization. Its essence is that the system must guarantee the data eventually becomes consistent, rather than guaranteeing strong consistency in real time.
6. Distributed theory: consistency protocol 2PC
1) What is 2PC
2PC (Two-Phase Commit) is a two-phase commit protocol that divides the whole transaction process into two phases: the prepare phase and the commit phase. "2" refers to the two phases, "P" to prepare, and "C" to commit.
2) 2PC execution process
The two-phase process:
1. Prepare phase: the transaction manager sends a Prepare message to each participant; each database participant executes the transaction locally and writes its local Undo/Redo log, but does not yet commit. (The Undo log records the data before modification and is used to roll the database back; the Redo log records the data after modification and is used to write the data files after the transaction commits.)
2. Commit phase: if the transaction manager receives a failure or timeout message from any participant, it sends a Rollback message to every participant; otherwise it sends a Commit message. Participants commit or roll back as instructed by the transaction manager and release the lock resources used during the transaction. Note: the lock resources must be released in this final phase.
Phase I: 1. The transaction coordinator sends the transaction content to all participants, asks whether the commit operation can be performed, and waits for their responses. 2. Each participant executes the transaction (writing its local Undo/Redo log). 3. Each participant reports its answer to the coordinator's inquiry: in effect, each participant votes on whether the transaction may proceed.
ACK (acknowledgement): a transmission-control character sent by the receiving station to the sending station in data communication, indicating that the transmitted data was received correctly.
Phase II: 1. Send the commit request: the coordinator sends a Commit request to all participants. 2. Transaction commit: on receiving the Commit request, a participant formally commits the transaction and then releases the transaction resources held during execution. 3. Report the commit result: after committing, each participant sends an Ack to the coordinator. 4. Complete the transaction: once the coordinator has received an Ack from every participant, the transaction is complete.
The steps to interrupt a transaction are as follows: if any participant returns a No response to the coordinator, or the coordinator cannot collect the responses of all participants before the wait times out, the transaction is interrupted.
1. Send the rollback request: the coordinator sends a Rollback request to all participants. 2. Transaction rollback: on receiving the Rollback request, a participant uses the Undo information recorded in phase I to roll the transaction back, then releases the resources held during the whole execution. 3. Report the rollback result: after rolling back, each participant sends an Ack to the coordinator. 4. Interrupt the transaction: once the coordinator has received an Ack from every participant, the transaction is interrupted. From this logic we can see that a two-phase commit does two things: voting, then execution.
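The voting-then-execution flow can be sketched in a few lines. `Participant` and `two_phase_commit` are illustrative stand-ins for a real transaction manager, with logging and network failures elided:

```python
class Participant:
    """Hypothetical resource manager taking part in the transaction."""
    def __init__(self, vote_yes: bool):
        self.vote_yes = vote_yes
        self.state = "init"

    def prepare(self) -> bool:
        # Phase 1: execute locally, write the Undo/Redo log, then vote.
        self.state = "prepared" if self.vote_yes else "aborted"
        return self.vote_yes

    def commit(self):
        self.state = "committed"      # phase 2: formal commit

    def rollback(self):
        self.state = "rolled_back"    # phase 2: replay the Undo log


def two_phase_commit(participants) -> str:
    votes = [p.prepare() for p in participants]   # phase 1: voting
    if all(votes):                                # phase 2: execution
        for p in participants:
            p.commit()
        return "committed"
    for p in participants:
        p.rollback()
    return "rolled_back"
```

A single No vote (or, in a real system, a timeout) is enough to roll every participant back.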
3) 2PC advantages and disadvantages
Advantages: simple principle, easy to implement.
Disadvantages: synchronous blocking, a single point of failure, possible data inconsistency, excessive conservatism.
Synchronous blocking: the most obvious and most serious problem of the two-phase commit protocol. During its execution, all logic taking part in the transaction is blocked; while waiting for the responses of other participants, no participant can perform any other operation. This synchronous blocking greatly limits the performance of a distributed system.
Single point of failure:
The coordinator is critical to the whole two-phase commit process. If the coordinator fails during the commit phase, the entire flow stops working; worse, the other participants remain stuck holding transaction locks and cannot complete the transaction.
Data inconsistency: suppose a local network fault occurs after the coordinator has sent Commit requests to some participants, or the coordinator crashes before it has sent Commit requests to all of them. Then only some participants receive the Commit request in the end, leading to serious data inconsistency.
Too conservative: if participants fail during the voting phase and the coordinator cannot collect the responses of all of them, it can only rely on its own timeout mechanism to decide whether to interrupt the transaction. This strategy is clearly too conservative; in other words, the two-phase commit protocol has no richer fault-tolerance mechanism, and the failure of any single node causes the whole transaction to fail.
7. Distributed theory: consistency protocol 3PC
3PC (Three-Phase Commit) is an improved version of 2PC: it splits 2PC's "commit transaction request" step in two, yielding a transaction-processing protocol composed of three phases: CanCommit, PreCommit, and DoCommit.
Phase 1: CanCommit
① Transaction inquiry: the coordinator sends a CanCommit request containing the transaction content to all participants, asks whether the transaction commit can be performed, and waits for their responses. ② Each participant replies to the inquiry: on receiving the CanCommit request, a participant that believes it can execute the transaction successfully replies Yes and enters the ready state; otherwise it replies No.
Phase 2: PreCommit
After receiving the responses of all participants, the coordinator performs one of two operations depending on the result: pre-commit the transaction, or interrupt it.
If every participant replied Yes, the transaction is pre-committed.
- Pre-committing the transaction takes three steps:
① Send the pre-commit request: the coordinator sends a PreCommit request to all participant nodes and enters the prepared phase. ② Transaction pre-commit: on receiving the PreCommit request, each participant executes the transaction and records the Undo and Redo information in its transaction log. ③ Each participant reports its execution result to the coordinator: if the transaction succeeded, it replies Ack. If any participant replies No, or the coordinator cannot collect the responses of all participants before the wait times out, the transaction is interrupted.
- Interrupting the transaction takes two steps:
① Send the interrupt request: the coordinator sends an Abort request to all participants. ② Interrupt the transaction: a participant interrupts the transaction either when it receives the Abort request from the coordinator or when its wait for the coordinator's request times out.
Phase 3: DoCommit
In this phase the transaction is actually committed or rolled back, so there are two cases:
- Execute transaction commit
① Send the commit request: entering this phase, assuming the coordinator is working normally and has received Ack responses from all participants, it moves from the pre-commit state to the commit state and sends a DoCommit request to all participants. ② Transaction commit: on receiving the DoCommit request, each participant formally commits the transaction and then releases the transaction resources held during the whole execution. ③ Report the commit result: after committing, each participant sends an Ack response to the coordinator. ④ Complete the transaction: once the coordinator has received an Ack from every participant, the transaction is complete.
- Interrupt transaction
① Send the interrupt request: the coordinator sends an Abort request to all participant nodes. ② Transaction rollback: on receiving the Abort request, each participant uses the recorded Undo information to roll the transaction back and then releases the resources held during the whole execution. ③ Report the rollback result: after rolling back, each participant sends an Ack to the coordinator. ④ Interrupt the transaction: once the coordinator has received an Ack from every participant, the transaction is interrupted.
Note: once phase III is entered, two kinds of faults may occur:
- Problem with Coordinator
- Network failure between coordinator and participant
In either case, a participant ends up unable to receive the DoCommit or Abort request from the coordinator. In that case, after its wait times out, the participant proceeds to commit the transaction.
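The phase-III timeout rule can be sketched as a single decision function (names hypothetical): the point is that a pre-committed participant commits on timeout instead of staying blocked, unlike in 2PC.

```python
def participant_decision(state: str, coordinator_msg):
    """coordinator_msg is 'doCommit', 'abort', or None (timed out)."""
    if coordinator_msg == "doCommit":
        return "commit"
    if coordinator_msg == "abort":
        return "rollback"
    # Timeout: a participant that already reached the preCommit state
    # assumes the cluster agreed and commits on its own; any earlier
    # state is rolled back.
    return "commit" if state == "precommitted" else "rollback"
```

This is also the source of 3PC's residual inconsistency noted below: a timed-out participant may commit while others, having received an Abort, roll back.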
2PC vs 3PC
1. 3PC sets a timeout mechanism for both the coordinator and the participants (in 2PC only the coordinator has one, i.e., if it receives no message from the participants within a certain time it assumes failure). The main purpose is to prevent a participant from holding transaction resources indefinitely when it cannot communicate with the coordinator for a long time (the coordinator has crashed): thanks to its own timeout, the participant automatically performs a local commit after the timeout and releases its resources. This mechanism also reduces the blocking time and scope of the whole transaction. 2. Compared with 2PC, the three-phase CanCommit/PreCommit/DoCommit design adds a buffer phase: PreCommit ensures that the states of the participating nodes are consistent before the final commit phase.
Problem: the 3PC protocol still does not completely solve the problem of data inconsistency.
8. Distributed theory: consistency algorithm Paxos
What problem does Paxos solve?
Answer: it solves the consensus (consistency) problem of distributed systems.
Distributed systems store data in multiple replicas. Without ordering control over those replicas (which replica should be updated, and when), network delay, timeouts, and other faults cause the replicas' data to diverge. We want every replica to execute the same, unchanging sequence of operations [op1, op2, op3, ..., opn]. Paxos determines one immutable value opi at a time; once opi is agreed, every replica executes the operation opi, and so on. Conclusion: the problem the Paxos algorithm solves is how a cluster, in a distributed system where the failures above may occur, can quickly and correctly agree on the value of a piece of data. Note that "the value of a piece of data" here is not merely a number in the narrow sense; it can be a log entry or a command, and its exact meaning depends on the application scenario.
The Paxos algorithm involves the following roles:
Client: sends requests to the distributed system and waits for a response, for example a write request to a file on a distributed file server.
Proposer: the initiator of proposals
A proposer champions a client request, tries to persuade the acceptors to agree on it, and acts as a coordinator to move the protocol forward when conflicts occur.
Acceptor: the decision maker who can approve the proposal
An acceptor may accept proposals; if a proposal is chosen, the value in that proposal is chosen.
Learners: learners of final decision
Learners act as the replication factor of the protocol.
Rule: for a proposal to be chosen, it must be accepted by more than half of the acceptors.
P2: if a proposal with value V is chosen, then every chosen proposal with a higher number also has value V.
P2a: if a proposal with value V is chosen, then every proposal with a higher number accepted by any acceptor also has value V.
However, consider the following situation. Suppose there are five acceptors in total, and a proposal [M1, V1] has been accepted by more than half of them (Acceptor2~5), so Acceptor2~5 and Proposer2 all consider V1 chosen. Acceptor1 has just recovered from a crash and has not received any proposal before. Now Proposer1 sends the proposal [M2, V2] to Acceptor1 (with V2 ≠ V1 and M2 > M1). For Acceptor1 this is the first proposal it has ever received, and by P1 (an acceptor must accept the first proposal it receives) it must accept it; Acceptor1 therefore considers V2 chosen. Two problems arise:
- Acceptor1 considers V2 chosen while Acceptor2~5 and Proposer2 consider V1 chosen: an inconsistency.
- V1 is chosen, yet the higher-numbered proposal [M2, V2] accepted by Acceptor1 has value V2 with V2 ≠ V1, which contradicts P2a (if a proposal with value V is chosen, every higher-numbered proposal accepted by an acceptor also has value V).
So the constraint P2a needs to be strengthened. P2a constrains the proposals accepted by acceptors, but proposals are in fact put forward by proposers, so we can instead constrain the proposals that proposers put forward, obtaining P2b:
P2b: if a proposal with value V is chosen, then every proposal with a higher number put forward by any proposer also has value V.
P2b implies P2a, which in turn implies P2.
So how do we ensure that, once a proposal with value V is chosen, every higher-numbered proposal put forward by a proposer has value V? It suffices to satisfy P2c:
P2c: for any Mn and Vn, if the proposal [Mn, Vn] is put forward, then there exists a set S consisting of more than half of the acceptors such that one of the following holds: either no acceptor in S has accepted any proposal numbered less than Mn, or Vn equals the value of the highest-numbered proposal among all proposals numbered less than Mn accepted by the acceptors in S.
From the above, the progression from P1 to P2c is a step-by-step strengthening of conditions. To prove that these conditions guarantee consistency, one can derive in reverse: P2c => P2b => P2a => P2, and then P2 together with P1 guarantees consistency.
Proposer generates proposal
Next, let's look at how a proposer generates proposals so as to satisfy P2c.
The key idea: before generating a proposal, a proposer first "learns" the value that has already been chosen or may be chosen, and then uses that value as the value of its own proposal. If no value has been chosen, the proposer may choose the value itself. This is how agreement is reached, and this learning stage is carried out through a "Prepare request".
Proposal generation algorithm:
1. The proposer selects a new proposal number N and sends a request to some set of acceptors (more than half), asking each acceptor in the set to respond as follows: (a) the acceptor promises the proposer never to accept any proposal numbered less than N; (b) if the acceptor has already accepted proposals, it reports back the value of the highest-numbered proposal it has accepted with a number less than N. We call this the Prepare request numbered N.
2. If the proposer receives responses from more than half of the acceptors, it can generate a proposal [N, V] with number N and value V, where V is the value of the highest-numbered proposal among all the responses. If the responses contain no proposal, V may be chosen by the proposer itself. After generating the proposal, the proposer sends it to more than half of the acceptors, hoping they will accept it. We call this the Accept request.

Acceptor accepts a proposal

We have just explained the proposer's processing logic in the Paxos algorithm and how proposals are generated; now let's look at how an acceptor accepts a proposal. As just described, an acceptor may receive two kinds of requests from proposers, Prepare requests and Accept requests, and the conditions for responding to them are:
- Prepare request: an acceptor may respond to a Prepare request at any time.
- Accept request: an acceptor may respond to any Accept request as long as doing so does not violate a promise it has already made.

This yields the following constraint on accepting proposals:
P1a: an acceptor can accept a proposal numbered N as long as it has not yet responded to any Prepare request numbered greater than N.
If an acceptor receives a Prepare request numbered N but has already responded to a Prepare request numbered greater than N, then by P1a it cannot accept the proposal numbered N, so the acceptor can simply ignore this Prepare request.
Paxos algorithm description
Algorithm flow demonstration
Phase I: (a) A proposer selects a proposal number N and sends a Prepare request numbered N to more than half of the acceptors. (b) If an acceptor receives a Prepare request numbered N, and N is greater than the number of every Prepare request it has already responded to, then it reports back to the proposer the highest-numbered proposal it has accepted (if any) and promises never to accept any proposal numbered less than N.
Phase II: (a) If the proposer receives responses to its Prepare request numbered N from more than half of the acceptors, it sends an Accept request for the proposal [N, V] to more than half of the acceptors. Note: V is the value of the highest-numbered proposal among the responses; if the responses contained no proposal, V is chosen by the proposer itself. (b) If an acceptor receives an Accept request for a proposal numbered N, it accepts the proposal as long as it has not responded to any Prepare request numbered greater than N.
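The two phases above can be sketched as a toy single-decree Paxos round. Messaging, learners, and failures are elided, and all names are illustrative; each `Acceptor` is a plain in-process object:

```python
class Acceptor:
    def __init__(self):
        self.promised_n = -1
        self.accepted = None            # (n, value) of the last accepted proposal

    def prepare(self, n):
        # Phase 1b: promise not to accept proposals numbered below n,
        # and report the highest-numbered proposal accepted so far.
        if n > self.promised_n:
            self.promised_n = n
            return ("promise", self.accepted)
        return ("reject", None)

    def accept(self, n, value):
        # Phase 2b: accept unless a higher-numbered Prepare was answered.
        if n >= self.promised_n:
            self.promised_n = n
            self.accepted = (n, value)
            return True
        return False


def propose(acceptors, n, my_value):
    # Phase 1a: send Prepare(n) to the acceptors and collect promises.
    replies = [a.prepare(n) for a in acceptors]
    promises = [acc for verdict, acc in replies if verdict == "promise"]
    if len(promises) <= len(acceptors) // 2:
        return None                     # no majority of promises
    # V must be the value of the highest-numbered accepted proposal
    # among the responses, if any; otherwise the proposer picks its own.
    prior = [acc for acc in promises if acc is not None]
    value = max(prior)[1] if prior else my_value
    # Phase 2a: send Accept(n, value); chosen if a majority accepts.
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks > len(acceptors) // 2 else None
```

Note how a later proposer with a higher number is forced to re-propose the already-chosen value rather than its own, which is P2c at work.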
How to ensure the activity of Paxos algorithm
Solution: elect a main Proposer and stipulate that only the main Proposer may issue proposals. As long as the main Proposer can communicate normally with more than half of the Acceptors, any proposal it issues with a higher number will eventually be approved. By electing a single main Proposer, the whole Paxos algorithm remains live.
9. Distributed theory: consistency algorithm Raft
Raft is a consistency (consensus) algorithm for managing a replicated log.
The Raft algorithm proceeds in two stages: first an election, and then normal operation under the elected leader.
Raft decomposes the consistency algorithm into three modules
- Leader election
- Log replication
- Safety
- Leader: handles client interaction, log replication, and so on; generally there is only one leader at a time.
- Candidate: a temporary role that nominates itself during an election and becomes the leader if the election succeeds.
- Follower: a completely passive role, like a voter; such servers wait to be notified to vote.

Raft uses a heartbeat mechanism to trigger elections. When a server starts, its initial state is follower. Every server has a timer whose timeout is the election timeout (generally 150-300 ms). If a server receives any message from a leader or candidate before the timeout fires, the timer restarts; if the timer fires, the server starts an election.
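The randomized election timer just described can be sketched as follows. This is an illustrative fragment (the class `RaftTimerSketch` and its methods are invented for this example), not a real Raft node:

```java
import java.util.concurrent.ThreadLocalRandom;

// Sketch of Raft's randomized election timer. Role names come from the text;
// the vote-requesting logic itself is not shown.
class RaftTimerSketch {
    enum Role { FOLLOWER, CANDIDATE, LEADER }

    Role role = Role.FOLLOWER;                 // initial state is follower
    long lastHeartbeat = System.currentTimeMillis();
    final long timeoutMs = ThreadLocalRandom.current().nextLong(150, 301); // 150-300 ms

    // Any message from a leader or candidate restarts the timer.
    void onHeartbeat() { lastHeartbeat = System.currentTimeMillis(); }

    // Called periodically; if the timer has expired, start an election.
    void tick(long now) {
        if (role == Role.FOLLOWER && now - lastHeartbeat >= timeoutMs) {
            role = Role.CANDIDATE;             // would now vote for itself and request votes
        }
    }
}
```

The randomization matters: because each server picks a different timeout in the 150-300 ms range, one server usually times out first and wins the election before the others even start one.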
1. Node exception
The state of each node in the cluster may change at any time. In practice, node exceptions can be roughly divided into four types:
The leader is unavailable;
A follower is unavailable;
Multiple candidates or multiple leaders;
New nodes join the cluster.
2. The leader is unavailable
Due to some fault, the leader can no longer send heartbeats, or its heartbeats are no longer received, and an election is triggered.
When more than half of the followers vote for it, that node becomes the new leader; its term number is increased by 1, and it synchronizes its log to the followers.
After some time, if the previous leader rejoins the cluster, the two leaders compare their terms, and the leader with the lower term steps down to follower.
Any log entries on the earlier leader that are inconsistent will be cleared and made consistent with the log of the current leader.
3. The follower node is not available
The unavailability of a follower node is relatively easy to handle: because the log content in the cluster is always synchronized from the leader, it is enough for the node to copy the log from the leader again when it rejoins the cluster.
4. Multiple candidate or leader exceptions
Multiple candidates or leaders in a cluster are usually caused by poor data transmission. Multiple leaders are relatively rare, but multiple candidates commonly appear during the "chaotic" period when cluster nodes have just started and no leader has been elected yet.
The candidates keep voting; if a round ends in a tie, the election timers reset with new random values and a new round of voting begins.
Log replication (ensure data consistency)
Log replication process:
After a leader is elected, it begins receiving client requests. The leader appends each request to its log as a new log entry, then issues AppendEntries RPCs in parallel to the other servers to replicate the entry. When the entry has been replicated on a majority of servers, the leader applies it to its state machine and returns the execution result to the client.
Each client request contains an instruction to be executed by the replicated state machine. The process is:
1. The leader appends the instruction to its log as a new log entry.
2. It then issues AppendEntries RPCs in parallel to the other servers, asking them to replicate the entry.
3. Followers respond with an ACK; if a follower is down, runs slowly, or packets are lost, the leader keeps retrying until all followers have eventually replicated all log entries.
4. The leader notifies all followers to commit the log entry, commits it to its own state machine, and returns the result to the client.
As you can see, agreement is not reached until the fourth step, and failure at any intermediate step does not affect log consistency.
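The leader-side commit rule (apply an entry only once a majority of the cluster, leader included, has stored it) can be sketched like this. The class and method names are invented for illustration; a real implementation also tracks per-follower `nextIndex` and retries:

```java
import java.util.*;

// Sketch of the majority-commit rule on the leader (illustrative names).
class LogReplicationSketch {
    final int clusterSize;
    final List<String> log = new ArrayList<>();
    final Map<Integer, Integer> acks = new HashMap<>(); // log index -> servers that stored it
    int commitIndex = -1;                               // highest committed log index

    LogReplicationSketch(int clusterSize) { this.clusterSize = clusterSize; }

    // Leader appends the client request; storing it locally counts as its own ack.
    int append(String entry) {
        log.add(entry);
        int idx = log.size() - 1;
        acks.put(idx, 1);
        return idx;
    }

    // Called when a follower acknowledges the AppendEntries RPC for idx.
    void onAck(int idx) {
        int stored = acks.merge(idx, 1, Integer::sum);
        if (stored > clusterSize / 2 && idx > commitIndex) {
            commitIndex = idx;   // majority reached: safe to apply to the state machine
        }
    }
}
```

In a cluster of 5, one follower ack (2 copies total) is not enough; the second ack (3 of 5) crosses the majority and commits the entry.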
II Distributed system design strategy
1. Heartbeat detection
Cycle detection heartbeat mechanism
The server sends a monitoring request to each node in the cluster every t seconds and sets a timeout; if no reply arrives within the timeout, the node is judged "dead".
Cumulative failure detection mechanism
Building on the periodic heartbeat detection mechanism, count a node's returns over a certain period (both timeouts and correct returns) and from this calculate the node's "death" probability. In addition, for a node declared "dying", a limited number of retries can be initiated for further judgment.
The periodic heartbeat detection mechanism and the cumulative failure detection mechanism together help judge whether a node is "dead"; a node judged dead can then be removed from the cluster.
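The cumulative mechanism can be sketched as a sliding window of heartbeat outcomes. The window size and the threshold are assumptions for this illustration, and the class name is invented:

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch of cumulative failure detection: accumulate heartbeat outcomes over
// a sliding window and derive a crude "death" probability (illustrative names).
class FailureDetectorSketch {
    final int window;
    final Deque<Boolean> results = new ArrayDeque<>(); // true = replied within the timeout

    FailureDetectorSketch(int window) { this.window = window; }

    // Record one periodic probe's outcome, evicting the oldest beyond the window.
    void record(boolean repliedInTime) {
        if (results.size() == window) results.removeFirst();
        results.addLast(repliedInTime);
    }

    // Fraction of recent probes that timed out.
    double failureProbability() {
        if (results.isEmpty()) return 0.0;
        long timeouts = results.stream().filter(r -> !r).count();
        return (double) timeouts / results.size();
    }

    // Above the threshold the node is suspected "dying" and retries can be issued.
    boolean suspected(double threshold) { return failureProbability() >= threshold; }
}
```

Unlike a single missed heartbeat, a windowed ratio tolerates occasional packet loss: 2 timeouts out of 10 probes gives only a 0.2 probability, well under a 0.5 suspicion threshold.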
2. High availability design
There are three common design modes for system high availability: master-slave, active-active, and cluster.
Cluster mode means multiple nodes run together, with service requests distributed through a master node, as in Zookeeper. Cluster mode must itself solve the high-availability problem of the master node, for which the master-standby mode is generally adopted.
3. Fault tolerance
4. Load balancing
III Distributed architecture network communication
RPC is short for Remote Procedure Call: invoking a procedure that lives on a remote machine as if it were a local call.
A complete RPC architecture contains four core components: Client, Client Stub, Server and Server Stub
A stub can be understood as a placeholder: a local agent that stands in for the remote side.
Client, the caller of the service.
The client stub stores the server's address, packages the client's request parameters into a network message, and sends it over the network to the service provider.
Server, the real service provider.
The server stub receives the message sent by the client, unpacks the message, and calls the local method.
(1) The client invokes the service locally (i.e., through an interface);
(2) The client stub receives the call and assembles the method and parameters into a message body that can be transmitted over the network (serializing the message body into binary);
(3) The client sends the message to the server through sockets;
(4) The server stub receives the message and decodes it (deserializes the message into objects);
(5) The server stub calls the local service according to the decoded result;
(6) The local service executes and returns the result to the server stub;
(7) The server stub packages the returned result into a message (serializes the result object);
(8) The server sends the message to the client through sockets;
(9) The client stub receives the result message and decodes it (deserializes the result message);
(10) The client gets the final result.

The goal of RPC is to encapsulate steps 2, 3, 4, 7, 8, and 9. Note: no matter what type the data is, it must be converted into a binary stream to be transmitted over the network; the sender converts objects into a binary stream, and the receiver restores the binary stream back into objects.

In Java there are many RPC frameworks. Common ones include Hessian, gRPC, Thrift, HSF (High-Speed Service Framework), Dubbo, and so on. For an RPC framework, the core modules are communication and serialization.
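Steps 2 and 9 (the client stub serializing the call and deserializing the reply) can be sketched with a JDK dynamic proxy. Everything here is invented for illustration: the `HelloService` interface, and an in-memory `serverHandle` that stands in for the socket transport of steps 3-8.

```java
import java.io.*;
import java.lang.reflect.Proxy;

// Sketch of a client stub built with a JDK dynamic proxy (illustrative names).
class RpcStubSketch {
    interface HelloService { String hello(String name); }

    // Pretend server: unpack [method, args], invoke the "real" service, pack the result.
    static byte[] serverHandle(byte[] request) throws Exception {
        try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(request))) {
            String method = (String) in.readObject();
            Object[] args = (Object[]) in.readObject();
            Object result = method.equals("hello") ? "Hello, " + args[0] : null;
            ByteArrayOutputStream bos = new ByteArrayOutputStream();
            try (ObjectOutputStream out = new ObjectOutputStream(bos)) { out.writeObject(result); }
            return bos.toByteArray();
        }
    }

    // Client stub: marshal the call into binary, "send" it, unmarshal the reply.
    @SuppressWarnings("unchecked")
    static <T> T stub(Class<T> iface) {
        return (T) Proxy.newProxyInstance(iface.getClassLoader(), new Class<?>[]{iface},
            (proxy, method, args) -> {
                ByteArrayOutputStream bos = new ByteArrayOutputStream();
                try (ObjectOutputStream out = new ObjectOutputStream(bos)) {
                    out.writeObject(method.getName());   // step 2: serialize the call
                    out.writeObject(args);
                }
                byte[] reply = serverHandle(bos.toByteArray()); // steps 3-8 elided
                try (ObjectInputStream in = new ObjectInputStream(new ByteArrayInputStream(reply))) {
                    return in.readObject();              // step 9: deserialize the result
                }
            });
    }

    public static void main(String[] args) {
        HelloService svc = stub(HelloService.class);  // step 1: looks like a local interface call
        System.out.println(svc.hello("RPC"));         // prints: Hello, RPC
    }
}
```

The point of the proxy is exactly the encapsulation the text describes: the caller only sees the interface, while marshalling and transport happen behind it.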
1. Client:
1) Stub: the client-side proxy of the remote object;
2) Remote Reference Layer: resolves and executes the remote reference protocol;
3) Transport layer: sends calls, passes remote method parameters, and receives remote method execution results.
2. Server:
1) Skeleton: reads the method parameters passed by the client, calls the actual object method on the server, and receives the return value after the method executes;
2) Remote Reference Layer: after processing the remote reference, sends the remote method call to the skeleton;
3) Transport layer: listens for inbound client connections, and receives and forwards calls to the remote reference layer.
3. Registry: registers the remote object under a URL and replies to the client with a reference to the remote object.
Remote call procedure
1) The client queries the registry on the remote server and obtains a reference to the remote object.
2) The stub object has the same interface and method list as the remote object; when the client calls the remote object, the call is actually carried out by the corresponding stub object as a proxy.
3) The remote reference layer converts the local reference to the stub into a remote reference to the object on the server, then passes the call to the transport layer, which sends it via the TCP protocol.
4) On the server side, the transport layer listens for inbound connections; once it receives a remote call from the client, it forwards the reference to the remote reference layer above it.
5) The server-side remote reference layer converts the remote reference sent by the client into a reference for the local virtual machine, then passes the request to the skeleton.
6) The skeleton reads the parameters, passes the request to the server, and finally the server makes the actual method call.
1) If the remote method call has return values, the server passes them downward along "skeleton -> remote reference layer -> transport layer".
2) After the client's transport layer receives the return value, it passes it upward along "transport layer -> remote reference layer -> stub"; the stub then deserializes the return values and passes the final result to the client program.
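The stub/skeleton path above is exactly what JDK RMI automates. A minimal in-process example follows; the port 2099 and the names `Greeter` / `"greeter"` are arbitrary choices for illustration:

```java
import java.rmi.Remote;
import java.rmi.RemoteException;
import java.rmi.registry.LocateRegistry;
import java.rmi.registry.Registry;
import java.rmi.server.UnicastRemoteObject;

// Minimal in-process RMI example (illustrative names and port).
class RmiSketch {
    interface Greeter extends Remote {
        String greet(String name) throws RemoteException;
    }

    static class GreeterImpl implements Greeter {
        public String greet(String name) { return "Hi, " + name; }
    }

    static String demo() {
        try {
            Registry registry = LocateRegistry.createRegistry(2099);   // start a registry
            GreeterImpl impl = new GreeterImpl();
            Greeter stub = (Greeter) UnicastRemoteObject.exportObject(impl, 0); // generate the stub
            registry.rebind("greeter", stub);                          // register the remote object
            // "Client" side: look up the reference and call through the stub.
            Greeter client = (Greeter) LocateRegistry.getRegistry(2099).lookup("greeter");
            String result = client.greet("RMI");
            UnicastRemoteObject.unexportObject(impl, true);            // clean shutdown
            UnicastRemoteObject.unexportObject(registry, true);
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints: Hi, RMI
    }
}
```

The client never touches `GreeterImpl` directly: the object it receives from the registry lookup is the stub, and the marshalling, transport, and skeleton dispatch described above all happen inside the RMI runtime.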
Synchronization: the user process triggers an IO operation and then waits, or polls, to check whether the IO operation is ready.
Asynchronous: when an asynchronous call is issued, the caller does not get the result immediately; instead, after the call is issued, the callee notifies the caller through status and notifications, or handles the call through a callback function.
When using asynchronous IO, Java delegates IO reading and writing to the OS for processing. The address and size of the data buffer need to be passed to the OS. The OS needs to support asynchronous IO operations
Blocking and non-blocking
Blocking and non-blocking are different ways for a process to access data depending on the ready state of the IO operation.
Simply put, they describe how the read and write methods behave: in blocking mode, read and write wait until the operation can complete, while in non-blocking mode the read and write methods return immediately with a status value.
BIO: synchronization blocking
**NIO:** non-blocking IO (New IO), available from JDK 1.4 onward.
Channel: the conduit for the data connection. Data can be read from a channel into a Buffer, or written from a Buffer into a channel; the channel always transfers data to and from a buffer.
With a selector, a large number of active I/O channels can be monitored and maintained in real time with the help of a single thread.
When a connection is created, it does not need a dedicated thread; instead, the connection is registered with the multiplexer, so all connections together need only a single thread to operate them. That thread's multiplexer polls the connections, and when it finds a request on one of them, a thread is dispatched to handle it.
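The single-thread multiplexing idea can be demonstrated with `java.nio.channels.Selector`. To keep the example self-contained, an in-process `Pipe` stands in for a network socket (the class name `SelectorSketch` is invented):

```java
import java.nio.ByteBuffer;
import java.nio.channels.Pipe;
import java.nio.channels.SelectionKey;
import java.nio.channels.Selector;

// One thread multiplexing channels with a Selector (Pipe stands in for a socket).
class SelectorSketch {
    static String demo() {
        try {
            Pipe pipe = Pipe.open();
            pipe.source().configureBlocking(false);        // selectable channels must be non-blocking
            Selector selector = Selector.open();
            pipe.source().register(selector, SelectionKey.OP_READ); // register: no thread per channel

            pipe.sink().write(ByteBuffer.wrap("ping".getBytes())); // simulate arriving data

            selector.select();                             // the single thread waits here
            String result = "";
            for (SelectionKey key : selector.selectedKeys()) {
                if (key.isReadable()) {                    // handle only channels that are ready
                    ByteBuffer buf = ByteBuffer.allocate(16);
                    ((Pipe.SourceChannel) key.channel()).read(buf);
                    result = new String(buf.array(), 0, buf.position());
                }
            }
            selector.close();
            pipe.sink().close();
            pipe.source().close();
            return result;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(demo());  // prints: ping
    }
}
```

With real sockets, many channels would be registered with the same selector, and the loop over `selectedKeys()` is where the one polling thread dispatches work for whichever connections are ready.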
AIO: asynchronous non-blocking IO; the A stands for asynchronous.
Usage scenario: architectures with a large number of long-lived connections and heavy operations, such as an album server. It relies on the OS to participate in the concurrent operations, and programming is comparatively complex. Supported since Java 7.
Netty is an asynchronous, event driven network programming framework provided by JBOSS.
The NIO class libraries and APIs are complicated and troublesome to use: you need to master Selector, ServerSocketChannel, SocketChannel, ByteBuffer, and so on. Reliability is not strong, and the development workload and difficulty are very large. NIO also has bugs, for example the epoll bug, which can cause the Selector to poll emptily and drive the CPU to 100%.
- Provides a unified API for various transport protocols
- Highly customizable threading model: single thread, or one or more thread pools
- Better throughput and lower latency
- Less resource consumption
- Minimized unnecessary memory copying
Netty abstracts two groups of thread pools: the BossGroup is responsible for accepting client connections, and the WorkerGroup is responsible for network read/write operations. A NioEventLoop represents a thread that continuously loops over processing tasks; each NioEventLoop has a Selector that listens on the socket channels bound to it. Internally, a NioEventLoop uses a serial design: the whole pipeline from reading the message -> decoding -> processing -> encoding -> sending is ultimately handled by the same IO thread, the NioEventLoop.
Netty core components
ChannelHandler and its implementation class
The ChannelHandler interface defines many event-handling methods, which we can override to implement specific business logic.
We often customize a Handler class that extends ChannelInboundHandlerAdapter and implement the business logic by overriding the relevant methods. The methods we generally need to override are:
- public void channelActive(ChannelHandlerContext ctx): channel ready event
- public void channelRead(ChannelHandlerContext ctx, Object msg): channel read-data event
- public void channelReadComplete(ChannelHandlerContext ctx): event fired after the data has been read
- public void exceptionCaught(ChannelHandlerContext ctx, Throwable cause): event fired when an exception occurs in the channel
ChannelPipeline is a collection of handlers, which is responsible for handling and intercepting inbound or outbound events and operations. It is equivalent to a chain running through Netty.
- ChannelPipeline addFirst(ChannelHandler... handlers): adds a business handler at the first position in the chain
- ChannelPipeline addLast(ChannelHandler... handlers): adds a business handler at the last position in the chain
ChannelHandlerContext is the event handler context object, the actual processing node in the Pipeline chain. Each processing node, a ChannelHandlerContext, contains a concrete event handler, a ChannelHandler, and is also bound to the information of the corresponding pipeline and Channel, which makes it convenient to invoke the ChannelHandler.
Common methods are as follows
- ChannelFuture close(): closes the channel
- ChannelOutboundInvoker flush(): flushes pending data
- ChannelFuture writeAndFlush(Object msg): writes data into the ChannelPipeline, starting from the current ChannelHandler and proceeding to the next ChannelHandler (outbound)
ChannelFuture represents the result of an asynchronous I/O operation on a Channel. In Netty, all I/O operations are asynchronous: an I/O call returns immediately, so the caller cannot obtain the result at once, but the processing status of the I/O operation can be obtained through the ChannelFuture. Common methods are as follows:
- Channel channel(): returns the channel on which the I/O operation is in progress
- ChannelFuture sync(): waits for the asynchronous operation to complete
EventLoopGroup and its implementation class NioEventLoopGroup
EventLoopGroup is an abstraction over a group of EventLoops. To make better use of multi-core CPU resources, Netty generally runs multiple EventLoops at the same time, each maintaining its own Selector instance. EventLoopGroup provides a next interface, through which one of the EventLoops can be obtained from the group, according to certain rules, to process a task. In Netty server-side programming, we generally provide two EventLoopGroups, for example a BossEventLoopGroup and a WorkerEventLoopGroup.
- public NioEventLoopGroup(): constructor
- public Future<?> shutdownGracefully(): disconnects and shuts down the threads
ServerBootstrap and Bootstrap
ServerBootstrap is the server-side startup assistant in Netty, used to complete the various server-side configurations; Bootstrap is the client-side startup assistant in Netty, used to complete the various client configurations. Common methods are as follows:
- public ServerBootstrap group(EventLoopGroup parentGroup, EventLoopGroup childGroup): used on the server side to set the two EventLoopGroups
- public B group(EventLoopGroup group): used on the client side to set a single EventLoopGroup
- public B channel(Class<? extends C> channelClass): sets the server-side channel implementation
- public <T> B option(ChannelOption<T> option, T value): adds configuration to the ServerChannel
- public <T> ServerBootstrap childOption(ChannelOption<T> childOption, T value): adds configuration to the accepted channels
- public ServerBootstrap childHandler(ChannelHandler childHandler): sets the (customized) business handler
- public ChannelFuture bind(int inetPort): used on the server side to set the port to listen on
- public ChannelFuture connect(String inetHost, int inetPort): used on the client side to connect to the server