vivo's practice of upgrading a nearly 10,000-node HDFS cluster to HDFS 3.x

vivo Internet big data team Lv Jia

The first stable release of Hadoop 3.x came out at the end of 2017 and brought many significant improvements.

On the HDFS side, it supports new features such as Erasure Coding, more than two NameNodes, Router-Based Federation, Standby NameNode Read, FairCallQueue and the intra-DataNode balancer. These new features bring many benefits in terms of stability, performance and cost, so we planned to upgrade our HDFS clusters to HDFS 3.x.

This article introduces how we rolling-upgraded CDH 5.14.4 (HDFS 2.6.0) to HDP-3.1.4.0-315 (HDFS 3.1.1), one of the few cases in the industry of rolling-upgrading from a CDH cluster to an HDP cluster. What problems did we run into during the upgrade, and how did we solve them? We hope this article serves as a useful reference.

1. Background

vivo's offline data warehouse Hadoop clusters are built on CDH 5.14.4, whose Hadoop version is 2.6.0+cdh5.14.4+2785, i.e. a Cloudera distribution based on Apache Hadoop 2.6.0 with a number of optimization patches applied.

In recent years, with the growth of vivo's business and the explosive growth of data, the offline data warehouse HDFS clusters have expanded from one to ten, with nearly 10,000 nodes in total. As the clusters have grown, several pain points of the current HDFS version have been exposed:

  • On the current old HDFS version, the NameNode in the production environment frequently runs into RPC performance problems, and users' Hive/Spark offline tasks are delayed when NameNode RPC slows down.

  • Some of these RPC performance problems have already been fixed in HDFS 3.x. At present, the online NameNode RPC performance problems can only be mitigated by backporting patches from higher HDFS versions.

  • Frequent patch merging increases the complexity of maintaining the HDFS code, and every patch rollout requires restarting NameNodes or DataNodes, which increases the operation and maintenance cost of the HDFS clusters.

  • The online HDFS clusters expose their service through viewfs. The company has many business lines, and many business departments use their own independent HDFS clients to access the offline data warehouse clusters. Whenever the online HDFS configuration is modified, updating all of these HDFS client configurations is very time-consuming and troublesome.

  • HDFS 2.x does not support EC, so Erasure Coding cannot be applied to cold data to reduce storage costs.

The first stable release of Hadoop 3.x came out at the end of 2017 and brought many significant improvements. On the HDFS side, it supports new features such as Erasure Coding, more than two NameNodes, Router-Based Federation, Standby NameNode Read, FairCallQueue and the intra-DataNode balancer. These HDFS 3.x features bring benefits in many aspects, including stability, performance and cost:

  • The new Standby NameNode Read and FairCallQueue features, together with the HDFS 3.x NameNode RPC optimization patches, can greatly improve the stability and RPC performance of our current HDFS clusters.

  • HDFS RBF replaces viewfs, simplifies the HDFS client configuration update process, and removes the pain of updating large numbers of HDFS client configurations online.

  • HDFS EC can be applied to cold data storage to reduce storage costs.

Based on the above pain points and benefits, we decided to upgrade the offline data warehouse HDFS clusters to HDFS 3.x.

2. HDFS upgrade version selection

Since our Hadoop clusters are built on CDH 5.14.4, we first considered upgrading to a higher CDH version. CDH 7 provides an HDFS 3.x distribution, but unfortunately there is no free version of CDH 7, so we could only choose between the Apache release and the HDP distribution provided by Hortonworks.

Because Apache Hadoop does not come with a management tool, managing and distributing configuration across a nearly 10,000-node HDFS cluster would be extremely inconvenient. Therefore, we chose the Hortonworks HDP distribution, with Ambari as the HDFS management tool.

The latest stable free Hadoop distribution provided by Hortonworks is HDP-3.1.4.0-315, whose Hadoop version is Apache Hadoop 3.1.1.

3. HDFS upgrade scheme formulation

3.1 Upgrade scheme

HDFS officially offers two upgrade schemes: Express and RollingUpgrade.

  • **Express**: stop the existing HDFS service and then start the service with the new HDFS version, which interrupts the normal operation of online business.

  • **RollingUpgrade**: upgrade the nodes in a rolling fashion, without stopping the service and without users noticing.

Since suspending the HDFS service would have a great impact on the business, we chose the RollingUpgrade scheme.

3.2 Downgrade scheme

In the RollingUpgrade scheme, there are two fallback methods: **Rollback** and **RollingDowngrade**.

  • **Rollback** reverts both the HDFS version and the data state to the moment before the upgrade, so data written during the upgrade is lost.

  • **RollingDowngrade** only reverts the HDFS version; the data is not affected.

Our online HDFS clusters cannot tolerate data loss, so we chose RollingDowngrade as the fallback scheme.

3.3 HDFS client upgrade scheme

Online computing components such as Spark and Flink depend on the HDFS Client, and upgrading the HDFS Client to 3.x is a high-risk change.

After several rounds of testing in the test environment, we verified that HDFS 3.x is compatible with reads and writes from HDFS 2.x clients.

Therefore, in this HDFS upgrade we only upgrade the NameNode, JournalNode and DataNode components; the HDFS 2.x clients will be upgraded after YARN is upgraded.

3.4 HDFS rolling upgrade steps

The RollingUpgrade procedure is described in the official Hadoop rolling upgrade documentation. The general steps are as follows (a scripted sketch of the corresponding admin commands is shown after the list):

  1. Upgrade the JournalNodes: restart each JournalNode in turn with the new version.

  2. Prepare the NameNode upgrade and generate the rollback fsimage file.

  3. Restart the Standby NameNode and its ZKFC with the new version of Hadoop.

  4. Perform a NameNode HA failover so that the upgraded NameNode becomes the Active node.

  5. Restart the other NameNode and its ZKFC with the new version of Hadoop.

  6. Upgrade the DataNodes: rolling-restart all DataNodes with the new version of Hadoop.

  7. Execute Finalize to confirm that the HDFS cluster has been upgraded to the new version.
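The admin side of these steps maps to the hdfs dfsadmin commands described in the official rolling upgrade guide. Below is a minimal Python sketch of how they might be driven from a script; the DataNode host:ipc_port list and the polling logic are illustrative placeholders, and the actual process restarts are performed through CM/Ambari as described in section 4:

#!/usr/bin/env python
# Minimal sketch of driving the official RollingUpgrade admin commands.
import subprocess
import time

def run(cmd):
    print("+ " + cmd)
    return subprocess.call(cmd, shell=True)   # returns the command's exit code

# Step 2: prepare the upgrade and generate the rollback fsimage on the NameNode.
run("hdfs dfsadmin -rollingUpgrade prepare")
run("hdfs dfsadmin -rollingUpgrade query")     # repeat until the rollback image is reported ready

# Steps 3-5: NameNode/ZKFC restarts with the new version are performed through CM/Ambari (section 4).

# Step 6: for each DataNode, ask it to shut down for upgrade, wait until it is
# really down, then restart it with the new version through the management tool.
for dn in ["dn-host-1:50020", "dn-host-2:50020"]:          # illustrative host:ipc_port list
    run("hdfs dfsadmin -shutdownDatanode %s upgrade" % dn)
    while run("hdfs dfsadmin -getDatanodeInfo %s" % dn) == 0:
        time.sleep(1)                                       # non-zero exit means the DataNode stopped

# Step 7: once every node runs the new version and the cluster is stable.
run("hdfs dfsadmin -rollingUpgrade finalize")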

4. How the management tools coexist

In the HDFS 2.x cluster, HDFS, YARN, Hive, HBase and other components are managed by CM. Since only HDFS is upgraded, HDFS 3.x is managed by Ambari, while the other components such as YARN and Hive remain managed by CM. The HDFS 2.x clients are not upgraded and continue to be managed by CM. ZooKeeper reuses the ZK instances originally deployed by CM.

Specific implementation: the Ambari Server is deployed on the CM Server node, and an Ambari Agent is deployed on each CM Agent node.

As shown in the figure above, Ambari is used to deploy the HDFS 3.x NameNode/DataNode components on the master/slave nodes. Because of port conflicts, the HDFS 3.x components deployed by Ambari cannot start while the CM-managed processes are running, so they have no impact on the online HDFS 2.x cluster managed by CM.

When the HDFS upgrade starts, on each master node we stop the CM-managed JN/ZKFC/NN and start the Ambari-managed JN/ZKFC/NN; on each slave node we stop the CM-managed DN and start the Ambari-managed DN. In this way, the management tool is switched from CM to Ambari while HDFS is being upgraded.

5. Problems encountered during HDFS rolling upgrade and downgrade

5.1 Incompatibilities fixed by the HDFS community

The HDFS community has fixed several key incompatibilities affecting rolling upgrade and downgrade. The relevant issues are HDFS-13596, HDFS-14396 and HDFS-14831.

[HDFS-13596]: fixes the problem that, after the Active NameNode is upgraded, EC-related data structures are written into the EditLog, causing the Standby NameNode to fail when reading the EditLog and to shut down directly.

[HDFS-14396]: fixes the problem that, after the NameNode is upgraded to HDFS 3.x, EC-related data structures are written into the fsimage file, so a NameNode downgraded to HDFS 2.x cannot recognize the fsimage file.

[HDFS-14831]: fixes the fsimage incompatibility after downgrade caused by the change to the StringTable after the NameNode upgrade.

Our target HDP HDFS version already includes the code for the above three issues. In addition, we encountered the following other incompatibilities during the upgrade:

5.2 JournalNode upgrade: Unknown protocol

The following error appears while upgrading the JournalNodes:

Unknown protocol: org.apache.hadoop.hdfs.qjournal.protocol.InterQJournalProtocol

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.ipc.RpcNoSuchProtocolException): Unknown protocol: org.apache.hadoop.hdfs.qjournal.protocol.InterQJournalProtocol
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.getProtocolImpl(ProtobufRpcEngine.java:557)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Server$ProtoBufRpcInvoker.call(ProtobufRpcEngine.java:596)
        at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:1073)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2281)
        at org.apache.hadoop.ipc.Server$Handler$1.run(Server.java:2277)
        at java.security.AccessController.doPrivileged(Native Method)
        at javax.security.auth.Subject.doAs(Subject.java:415)
        at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1924)
        at org.apache.hadoop.ipc.Server$Handler.run(Server.java:2275)
        at org.apache.hadoop.ipc.Client.getRpcResponse(Client.java:1498)
        at org.apache.hadoop.ipc.Client.call(Client.java:1444)
        at org.apache.hadoop.ipc.Client.call(Client.java:1354)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:228)
        at org.apache.hadoop.ipc.ProtobufRpcEngine$Invoker.invoke(ProtobufRpcEngine.java:116)
        at com.sun.proxy.$Proxy14.getEditLogManifestFromJournal(Unknown Source)
        at org.apache.hadoop.hdfs.qjournal.protocolPB.InterQJournalProtocolTranslatorPB.getEditLogManifestFromJournal(InterQJournalProtocolTranslatorPB.java:75)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer.syncWithJournalAtIndex(JournalNodeSyncer.java:250)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer.syncJournals(JournalNodeSyncer.java:226)
        at org.apache.hadoop.hdfs.qjournal.server.JournalNodeSyncer.lambda$startSyncJournalsDaemon$0(JournalNodeSyncer.java:186)
        at java.lang.Thread.run(Thread.java:748)

Cause: HDFS 3.x adds the new InterQJournalProtocol, which is used to synchronize historical edit log segments between JournalNodes.

HDFS-14942 optimizes this log, changing its level from ERROR to DEBUG. The problem does not affect the upgrade: once all three HDFS 2.x JournalNodes have been upgraded to HDFS 3.x JournalNodes, data can be synchronized between the JournalNodes normally.

5.3 NameNode upgrade: DatanodeProtocol.proto incompatibility

After the NameNode is upgraded, a DatanodeProtocol.proto incompatibility causes DataNode block reports to fail.

(1) HDFS version 2.6.0

DatanodeProtocol.proto

message HeartbeatResponseProto {
  repeated DatanodeCommandProto cmds = 1; // Returned commands can be null
  required NNHAStatusHeartbeatProto haStatus = 2;
  optional RollingUpgradeStatusProto rollingUpgradeStatus = 3;
  optional uint64 fullBlockReportLeaseId = 4 [ default = 0 ];
  optional RollingUpgradeStatusProto rollingUpgradeStatusV2 = 5;
}

(2) HDFS version 3.1.1

DatanodeProtocol.proto

message HeartbeatResponseProto {
  repeated DatanodeCommandProto cmds = 1; // Returned commands can be null
  required NNHAStatusHeartbeatProto haStatus = 2;
  optional RollingUpgradeStatusProto rollingUpgradeStatus = 3;
  optional RollingUpgradeStatusProto rollingUpgradeStatusV2 = 4;
  optional uint64 fullBlockReportLeaseId = 5 [ default = 0 ];
}

We can see that fields 4 and 5 of HeartbeatResponseProto are swapped between the two versions.

The reason is that Hadoop 3.1.1 includes HDFS-9788, which is meant to keep compatibility with lower versions during HDFS upgrades, while HDFS 2.6.0 does not include this commit. As a result, the two DatanodeProtocol.proto definitions are incompatible.

In our HDFS upgrade process, we do not need to stay compatible with lower-version HDFS servers, only with lower-version HDFS clients.

Therefore, HDFS 3.x does not need the lower-version compatibility introduced by HDFS-9788, and we reverted the HDFS-9788 change in Hadoop 3.1.1 to keep DatanodeProtocol.proto compatible with HDFS 2.6.0.

5.4 NameNode upgrade: layoutVersion incompatibility

After the NameNode is upgraded, the NameNode layoutVersion changes and the EditLog becomes incompatible: when HDFS 3.x is downgraded to HDFS 2.x, the NameNode fails to start.

2021-04-12 20:15:39,571 ERROR org.apache.hadoop.hdfs.server.namenode.EditLogInputStream: caught exception initializing XXX:8480/getJournal
id=test-53-39&segmentTxId=371054&storageInfo=-60%3A1589021536%3A0%3Acluster7
org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream$LogHeaderCorruptException: Unexpected version of the file system log file: -64. Current version = -60.
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.readLogVersion(EditLogFileInputStream.java:397)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.init(EditLogFileInputStream.java:146)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOpImpl(EditLogFileInputStream.java:192)
        at org.apache.hadoop.hdfs.server.namenode.EditLogFileInputStream.nextOp(EditLogFileInputStream.java:250)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
        at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.skipUntil(EditLogInputStream.java:151)
        at org.apache.hadoop.hdfs.server.namenode.RedundantEditLogInputStream.nextOp(RedundantEditLogInputStream.java:178)
        at org.apache.hadoop.hdfs.server.namenode.EditLogInputStream.readOp(EditLogInputStream.java:85)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadEditRecords(FSEditLogLoader.java:188)
        at org.apache.hadoop.hdfs.server.namenode.FSEditLogLoader.loadFSEdits(FSEditLogLoader.java:141)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadEdits(FSImage.java:903)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.loadFSImage(FSImage.java:756)
        at org.apache.hadoop.hdfs.server.namenode.FSImage.recoverTransitionRead(FSImage.java:324)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFSImage(FSNamesystem.java:1150)
        at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.loadFromDisk(FSNamesystem.java:797)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.loadNamesystem(NameNode.java:614)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.initialize(NameNode.java:676)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:844)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.<init>(NameNode.java:823)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.createNameNode(NameNode.java:1547)
        at org.apache.hadoop.hdfs.server.namenode.NameNode.main(NameNode.java:1615)

Upgrading from HDFS 2.6.0 to HDFS 3.1.1 changes the NameNode layoutVersion from -60 to -64. To solve this problem, we first need to understand when the NameNode layoutVersion changes.

HDFS version upgrades introduce new features, and the NameNode layoutVersion changes along with those features. The official Hadoop rolling upgrade documentation points out that new features should be disabled during a rolling upgrade, so that the layoutVersion stays unchanged and the HDFS 3.x version can still be rolled back to the HDFS 2.x version after the upgrade.

Next, which new feature introduced between HDFS 2.6.0 and HDFS 3.1.1 changes the NameNode layoutVersion? Looking at the related issues HDFS-5223, HDFS-8432 and HDFS-3107, HDFS 2.7.0 introduces the truncate feature and the NameNode layoutVersion becomes -61. The HDFS 3.x NameNodeLayoutVersion code is as follows:

NameNodeLayoutVersion

public enum Feature implements LayoutFeature {
  ROLLING_UPGRADE(-55, -53, -55, "Support rolling upgrade", false),
  EDITLOG_LENGTH(-56, -56, "Add length field to every edit log op"),
  XATTRS(-57, -57, "Extended attributes"),
  CREATE_OVERWRITE(-58, -58, "Use single editlog record for " +
    "creating file with overwrite"),
  XATTRS_NAMESPACE_EXT(-59, -59, "Increase number of xattr namespaces"),
  BLOCK_STORAGE_POLICY(-60, -60, "Block Storage policy"),
  TRUNCATE(-61, -61, "Truncate"),
  APPEND_NEW_BLOCK(-62, -61, "Support appending to new block"),
  QUOTA_BY_STORAGE_TYPE(-63, -61, "Support quota for specific storage types"),
  ERASURE_CODING(-64, -61, "Support erasure coding");

The four features TRUNCATE, APPEND_NEW_BLOCK, QUOTA_BY_STORAGE_TYPE and ERASURE_CODING all set minCompatLV to -61.

The logic that determines the final effective NameNode layoutVersion is:

FSNamesystem

static int getEffectiveLayoutVersion(boolean isRollingUpgrade, int storageLV,
    int minCompatLV, int currentLV) {
  if (isRollingUpgrade) {
    if (storageLV <= minCompatLV) {
      // The prior layout version satisfies the minimum compatible layout
      // version of the current software.  Keep reporting the prior layout
      // as the effective one.  Downgrade is possible.
      return storageLV;
    }
  }
  // The current software cannot satisfy the layout version of the prior
  // software.  Proceed with using the current layout version.
  return currentLV;
}

getEffectiveLayoutVersion returns the final effective layoutVersion. Here storageLV is the layoutVersion of the current HDFS 2.6.0 (-60), minCompatLV is -61, and currentLV is the layoutVersion of the upgraded HDFS 3.1.1 (-64).

From the code we can see that the condition storageLV <= minCompatLV (-60 <= -61) does not hold, so after upgrading to HDFS 3.1.1 the NameNode layoutVersion takes the value of currentLV, i.e. -64.

From this analysis we can see that, once the truncate feature was introduced in HDFS 2.7.0, the HDFS community only keeps the HDFS 3.x NameNode layoutVersion downgrade-compatible back to HDFS 2.7.
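To make the effect concrete, here is a small Python transcription of that check using the values above; the function is just an illustration of the logic, not the HDFS source:

def effective_layout_version(is_rolling_upgrade, storage_lv, min_compat_lv, current_lv):
    # Mirrors the FSNamesystem.getEffectiveLayoutVersion logic shown above.
    if is_rolling_upgrade and storage_lv <= min_compat_lv:
        return storage_lv   # keep reporting the prior layout; downgrade stays possible
    return current_lv

# Community default: minCompatLV = -61, so the upgraded NameNode reports -64.
print(effective_layout_version(True, -60, -61, -64))  # -> -64, downgrade to 2.6.0 breaks

# With minCompatLV changed to -60 (our modification below), the prior layout is kept.
print(effective_layout_version(True, -60, -60, -64))  # -> -60, downgrade to 2.6.0 works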

We evaluated the HDFS truncate feature and, based on an analysis of our business scenarios, concluded that vivo does not need truncate for the time being. On this basis, we modified minCompatLV in HDFS 3.1.1 to -60 to support HDFS 2.6.0, so that after upgrading to HDFS 3.1.1 we can still downgrade to HDFS 2.6.0.

minCompatLV modified to -60:

NameNodeLayoutVersion

public enum Feature implements LayoutFeature {
  ROLLING_UPGRADE(-55, -53, -55, "Support rolling upgrade", false),
  EDITLOG_LENGTH(-56, -56, "Add length field to every edit log op"),
  XATTRS(-57, -57, "Extended attributes"),
  CREATE_OVERWRITE(-58, -58, "Use single editlog record for " +
    "creating file with overwrite"),
  XATTRS_NAMESPACE_EXT(-59, -59, "Increase number of xattr namespaces"),
  BLOCK_STORAGE_POLICY(-60, -60, "Block Storage policy"),
  TRUNCATE(-61, -60, "Truncate"),
  APPEND_NEW_BLOCK(-62, -60, "Support appending to new block"),
  QUOTA_BY_STORAGE_TYPE(-63, -60, "Support quota for specific storage types"),
  ERASURE_CODING(-64, -60, "Support erasure coding");

5.5 DataNode upgrade: layoutVersion incompatibility

After the DataNode is upgraded, the DataNode layoutVersion is incompatible: when an HDFS 3.x DataNode is downgraded to HDFS 2.x, the DataNode fails to start.

2021-04-19 10:41:01,144 WARN org.apache.hadoop.hdfs.server.common.Storage: Failed to add storage directory [DISK]file:/data/dfs/dn/
org.apache.hadoop.hdfs.server.common.IncorrectVersionException: Unexpected version of storage directory /data/dfs/dn. Reported: -57. Expecting = -56.
        at org.apache.hadoop.hdfs.server.common.StorageInfo.setLayoutVersion(StorageInfo.java:178)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.setFieldsFromProperties(DataStorage.java:665)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.setFieldsFromProperties(DataStorage.java:657)
        at org.apache.hadoop.hdfs.server.common.StorageInfo.readProperties(StorageInfo.java:232)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.doTransition(DataStorage.java:759)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadStorageDirectory(DataStorage.java:302)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.loadDataStorage(DataStorage.java:418)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.addStorageLocations(DataStorage.java:397)
        at org.apache.hadoop.hdfs.server.datanode.DataStorage.recoverTransitionRead(DataStorage.java:575)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initStorage(DataNode.java:1560)
        at org.apache.hadoop.hdfs.server.datanode.DataNode.initBlockPool(DataNode.java:1520)
        at org.apache.hadoop.hdfs.server.datanode.BPOfferService.verifyAndSetNamespaceInfo(BPOfferService.java:341)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.connectToNNAndHandshake(BPServiceActor.java:219)
        at org.apache.hadoop.hdfs.server.datanode.BPServiceActor.run(BPServiceActor.java:673)
        at java.lang.Thread.run(Thread.java:748)

The HDFS 2.6.0 DataNode layoutVersion is -56, while the HDFS 3.1.1 DataNode layoutVersion is -57.

Reason for the DataNode layoutVersion change: starting from HDFS 2.8.0, the Hadoop community's HDFS-8791 upgrades the DataNode layout, changing the block directory structure of each DataNode Block Pool from 256 x 256 directories to 32 x 32 directories. The purpose is to reduce the depth of the DataNode directory tree and thereby mitigate the performance problems caused by du operations.

DataNode Layout upgrade process:

  1. Rename the current directory to previous.tmp.

  2. Create a new current directory and create hardlinks from previous.tmp into the new current directory.

  3. Rename the previous.tmp directory to previous.

Layout upgrade flowchart:

DN Layout storage directory structure during upgrade:

Hardlink association diagram:
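To make the three steps above concrete, here is an illustrative Python sketch of the mechanism; the block pool path and the function that maps a block file to its new 32x32 subdirectory are simplified placeholders, not the actual DataStorage code:

import os

def layout_upgrade(bp_dir, new_block_dir):
    """Illustrative sketch of the DataNode layout upgrade steps described above.

    bp_dir        -- a block pool storage directory (simplified placeholder path)
    new_block_dir -- function mapping a block file name to its new 32x32 subdir
    """
    current = os.path.join(bp_dir, "current")
    prev_tmp = os.path.join(bp_dir, "previous.tmp")
    previous = os.path.join(bp_dir, "previous")

    # 1. rename current -> previous.tmp
    os.rename(current, prev_tmp)

    # 2. recreate current and hardlink every block file from previous.tmp into it;
    #    hardlinks mean block data is not copied, only new directory entries are made.
    for root, _, files in os.walk(prev_tmp):
        for name in files:
            dst_dir = os.path.join(current, new_block_dir(name))
            os.makedirs(dst_dir, exist_ok=True)
            os.link(os.path.join(root, name), os.path.join(dst_dir, name))

    # 3. rename previous.tmp -> previous (kept for rollback)
    os.rename(prev_tmp, previous)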

Looking at the DataNodeLayoutVersion code, the 32 x 32 directory structure is defined with layoutVersion -57, which means the DataNode layout upgrade requires a layoutVersion change.

DataNodeLayoutVersion

public enum Feature implements LayoutFeature {
  FIRST_LAYOUT(-55, -53, "First datanode layout", false),
  BLOCKID_BASED_LAYOUT(-56,
      "The block ID of a finalized block uniquely determines its position " +
      "in the directory structure"),
  BLOCKID_BASED_LAYOUT_32_by_32(-57,
      "Identical to the block id based layout (-56) except it uses a smaller"
      + " directory structure (32x32)");

During DataNode layout upgrades in the test environment, we found the following problem: the process of creating the new current directory and establishing the hardlinks is very time-consuming. A DataNode with 1 million blocks takes about 5 minutes from the start of the layout upgrade until it can serve reads and writes again. This is unacceptable for our HDFS clusters with nearly 10,000 DataNodes, and would make it very hard to finish the DataNode upgrade within the planned upgrade time window.

Therefore, we reverted HDFS-8791 in our HDFS 3.1.1, so that the DataNode does not perform the layout upgrade. Tests showed that upgrading a DataNode with 1-2 million blocks then takes only 90-180 seconds, which is significantly shorter than with the layout upgrade.

With HDFS-8791 reverted, how do we solve the performance problems caused by DataNode du?

Going through the HDFS 3.3.0 patches, we found HDFS-14313, which computes the DataNode used space from in-memory information instead of running du, nicely solving the DataNode du performance problem. We backported HDFS-14313 into our upgraded HDFS 3.1.1, which removes the IO performance problem caused by du after the DataNode upgrade.
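The idea behind that patch is, roughly, to keep a running sum of on-disk usage that is updated as replicas are added and removed, instead of periodically scanning millions of block files with du. The following is only a conceptual sketch of that idea, not the actual FsDatasetImpl code:

import threading

class InMemoryVolumeUsage:
    """Conceptual sketch: track per-volume used bytes from in-memory replica info
    so that no periodic 'du' scan over millions of block files is needed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._used_bytes = 0

    def on_replica_added(self, block_bytes, meta_bytes):
        with self._lock:
            self._used_bytes += block_bytes + meta_bytes

    def on_replica_removed(self, block_bytes, meta_bytes):
        with self._lock:
            self._used_bytes -= block_bytes + meta_bytes

    def get_used(self):
        # Reported to the NameNode in heartbeats instead of a cached 'du' result.
        with self._lock:
            return self._used_bytes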

5.6 DataNode Trash directory processing

As shown in the figure above, while a DataNode is being upgraded, it does not actually delete a block when asked to delete it. Instead, it moves the block file into a trash directory under the BlockPool directory on each disk, so that the data deleted during the upgrade can be restored together with the rollback fsimage if a Rollback is performed. The average disk utilization of our clusters stays around 80%, which is already very tight, and a large number of block files piling up in trash during the upgrade would be a serious threat to cluster stability.

Since the fallback method in our scheme is RollingDowngrade rather than Rollback, the blocks in trash will never be used. We therefore use a script to periodically delete the block files in trash, which greatly relieves the disk storage pressure on the DataNodes.
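A minimal sketch of such a cleanup script is shown below. The trash directories live under each disk's BlockPool directory as described above; the data directory list and the exact path pattern are illustrative placeholders to adapt to the actual deployment:

#!/usr/bin/env python
# Periodically delete block files that the DataNode has moved into the per-disk
# BlockPool trash directories during the rolling upgrade. Only safe because our
# fallback is RollingDowngrade, which never needs these blocks.
import glob
import os

DATA_DIRS = ["/data%02d/dfs/dn" % i for i in range(1, 13)]   # placeholder: /data01/dfs/dn ...

def clean_trash():
    freed = 0
    for data_dir in DATA_DIRS:
        # One BlockPool directory per namespace, e.g. .../current/BP-xxxx/trash (assumed layout)
        for trash_dir in glob.glob(os.path.join(data_dir, "current", "BP-*", "trash")):
            for root, _, files in os.walk(trash_dir, topdown=False):
                for name in files:
                    path = os.path.join(root, name)
                    freed += os.path.getsize(path)
                    os.remove(path)
    print("freed %.1f GB from trash" % (freed / 1024.0 ** 3))

if __name__ == "__main__":
    clean_trash()   # run from cron during the upgrade window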

5.7 Other problems

The above are all the incompatibilities we encountered during the HDFS upgrade and downgrade. Besides the incompatibility fixes, we also backported several NameNode RPC optimization patches into the upgraded HDP HDFS 3.1.1.

The FoldedTreeSet red-black tree data structure causes RPC performance to degrade after the NameNode has been running for a while, with a large number of stale DataNodes appearing in the cluster and tasks failing to read blocks. HDFS-13671, fixed in Hadoop 3.4.0, resolves this problem by reverting FoldedTreeSet to the original LightWeightResizableGSet data structure. We also backported the HDFS-13671 patch into our upgraded HDP HDFS 3.1.1.

Effect of HDFS-13671 after the upgrade: the number of stale DataNodes in the cluster is greatly reduced.

6. Testing and launch

In March 2021, we kicked off the HDFS upgrade project for the offline data warehouse clusters, built multiple HDFS clusters in the test environment, and conducted multiple rounds of HDFS upgrade and downgrade drills under viewfs. We continuously summarized and improved the upgrade scheme and solved the problems encountered during the upgrade process.

6.1 Full-component HDFS client compatibility test

In this HDFS upgrade only the server side is upgraded, while the HDFS Client stays at HDFS 2.6.0. We therefore have to ensure that business workloads can read and write the HDFS 3.1.1 cluster normally through the HDFS 2.6.0 Client.

In the test environment, we set up an HDFS test cluster similar to the online environment. Together with colleagues from the computing team and the business departments, we ran full business compatibility tests for Hive, Spark, OLAP (Kylin, Presto, Druid) and the algorithm platform, using the HDFS 2.6.0 Client to read and write HDFS 3.1.1 and simulating the online environment. The tests confirmed that the HDFS 2.6.0 Client can read and write the HDFS 3.1.1 cluster normally and that compatibility is fine.
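Besides the full business tests, a simple smoke test like the one below, driven through the HDFS 2.6.0 client's command line, is handy for quickly rechecking basic read/write compatibility after each drill; the test path is an illustrative placeholder:

#!/usr/bin/env python
# Quick read/write smoke test against the HDFS 3.1.1 cluster using the
# HDFS 2.6.0 client command line.
import subprocess
import tempfile

TEST_DIR = "/tmp/hdfs_upgrade_smoke_test"   # illustrative placeholder path

def hdfs(*args):
    cmd = ["hdfs", "dfs"] + list(args)
    print("+ " + " ".join(cmd))
    return subprocess.check_output(cmd)

with tempfile.NamedTemporaryFile(mode="w", delete=False) as f:
    f.write("hello hdfs 3.1.1\n")
    local_file = f.name

hdfs("-mkdir", "-p", TEST_DIR)                       # write path metadata
hdfs("-put", "-f", local_file, TEST_DIR + "/probe")  # write a block
content = hdfs("-cat", TEST_DIR + "/probe")          # read it back
assert b"hello hdfs 3.1.1" in content
hdfs("-rm", "-r", "-skipTrash", TEST_DIR)            # clean up
print("2.6.0 client read/write against 3.1.1 cluster OK")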

6.2 Scripted upgrade operations

We carefully organized the HDFS upgrade and downgrade commands, and documented the risks and precautions of each operation step. HDFS services are started and stopped through the CM and Ambari APIs, and these operations are wrapped in Python scripts to reduce the risk of manual mistakes.
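For example, starting or stopping an HDFS component through the Ambari REST API can be wrapped roughly as follows. The endpoint layout and payload reflect our understanding of the Ambari v1 API and should be checked against the documentation of the deployed Ambari version; the server address, cluster name and credentials are placeholders, and the CM side is wrapped in a similar way with its own API:

import requests

AMBARI = "http://ambari-server.example.com:8080/api/v1"   # placeholder address
AUTH = ("admin", "admin")                                  # placeholder credentials
HEADERS = {"X-Requested-By": "ambari"}
CLUSTER = "hdfs31"                                         # placeholder cluster name

def set_component_state(host, component, state):
    """Ask Ambari to move a host component (e.g. NAMENODE, DATANODE, JOURNALNODE,
    ZKFC) to STARTED or INSTALLED (i.e. stopped)."""
    url = "%s/clusters/%s/hosts/%s/host_components/%s" % (AMBARI, CLUSTER, host, component)
    body = {
        "RequestInfo": {"context": "%s %s on %s" % (state, component, host)},
        "Body": {"HostRoles": {"state": state}},
    }
    resp = requests.put(url, json=body, auth=AUTH, headers=HEADERS)
    resp.raise_for_status()
    return resp

# Example: bring up the Ambari-managed JournalNode on a master node
# after the CM-managed one has been stopped.
set_component_state("master-01.example.com", "JOURNALNODE", "STARTED")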

6.3 Upgrade checkpoints

We compiled the key check items for each stage of the HDFS upgrade, so that any problem during the upgrade can be detected immediately, we can fall back or apply contingency measures in time, and the impact on the business is minimized.
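A typical check after each batch of restarts is to read a few NameNode metrics over JMX, for example as sketched below; the NameNode address, port and metric names follow the common FSNamesystem JMX bean and should be treated as assumptions to verify against the deployed version:

import requests

NN_JMX = "http://active-nn.example.com:9870/jmx"   # placeholder; HDFS 2.x uses port 50070

def fsnamesystem_metrics():
    params = {"qry": "Hadoop:service=NameNode,name=FSNamesystem"}
    beans = requests.get(NN_JMX, params=params).json()["beans"]
    return beans[0] if beans else {}

def check_cluster_health():
    m = fsnamesystem_metrics()
    problems = []
    if m.get("MissingBlocks", 0) > 0:
        problems.append("missing blocks: %s" % m["MissingBlocks"])
    if m.get("UnderReplicatedBlocks", 0) > 100000:       # illustrative threshold
        problems.append("under-replicated blocks: %s" % m["UnderReplicatedBlocks"])
    if m.get("tag.HAState") != "active":
        problems.append("unexpected HA state: %s" % m.get("tag.HAState"))
    return problems

print(check_cluster_health() or "cluster looks healthy")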

6.4 Production upgrade

We conducted several HDFS upgrade and downgrade drills in the test environment, completed the HDFS compatibility testing work, and wrote many internal wiki documents to record the process.

After confirming that HDFS upgrade and downgrade worked correctly in the test environment, we started the production upgrade.

The specific rollout milestones were as follows:

  • March to April 2021: surveyed the new HDFS 3.x features and related patches, read the source code of HDFS rolling upgrade and downgrade, and determined the final target HDFS 3.x version. Ported our existing HDFS 2.x optimization patches and the needed higher-version HDFS 3.x patches onto the target HDFS 3.x version.

  • May to August 2021: conducted HDFS upgrade and downgrade drills and compatibility tests for Hive, Spark and OLAP (Kylin, Presto, Druid), and confirmed that the HDFS upgrade and downgrade scheme was sound.

  • September to October 2021: upgraded the HDFS cluster that stores YARN service logs, which carries a large volume of log data.

  • November 2021: upgraded seven offline data warehouse HDFS clusters (about 5,000 nodes) to HDP HDFS 3.1.1; users noticed nothing and the business was not affected.

  • January 2022: completed the upgrade of the offline data warehouse HDFS clusters (10 clusters, nearly 10,000 nodes) to HDP HDFS 3.1.1; users noticed nothing and the business was not affected.

After the upgrade, we kept observing the offline data warehouse clusters; the HDFS service is currently running normally.

7. Summary

It took us a year to upgrade the nearly 10,000-node offline data warehouse HDFS clusters from CDH HDFS 2.6.0 to HDP HDFS 3.1.1, and the management tool was successfully switched from CM to Ambari.

The HDFS upgrade process was long, but the benefits are substantial; it also lays the foundation for subsequently upgrading YARN, Hive/Spark and HBase.

On this basis, we can continue to do meaningful work, keep exploring in depth in areas such as stability, performance and cost, and use technology to create visible value for the company.

References

  1. https://issues.apache.org/jira/browse/HDFS-13596

  2. https://issues.apache.org/jira/browse/HDFS-14396

  3. https://issues.apache.org/jira/browse/HDFS-14831

  4. https://issues.apache.org/jira/browse/HDFS-14942

  5. https://issues.apache.org/jira/browse/HDFS-9788

  6. https://issues.apache.org/jira/browse/HDFS-3107

  7. https://issues.apache.org/jira/browse/HDFS-8791

  8. https://issues.apache.org/jira/browse/HDFS-14313

  9. https://issues.apache.org/jira/browse/HDFS-13671
