Getting started with ClickHouse: table engine - HDFS

Preface

Environment and versions:
Server: Ubuntu 16.04
Hadoop: 2.6
ClickHouse: 20.9.3.45

Brief introduction

ClickHouse's HDFS table engine can read from and write to HDFS. Here it is assumed that the HDFS environment has already been configured; the HDFS version used in this test is 2.6.
The HDFS engine is defined as follows:
ENGINE = HDFS(hdfs_uri, format)
Parameter definitions:

  • hdfs_uri is the file storage path on HDFS
  • format is the file format (a format supported by ClickHouse, commonly CSV, TSV, JSON, etc.)

The HDFS table engine can be used in two ways:

  • The table is responsible for both reading and writing files
  • The table is only responsible for reading files; writing the files is done by an external system

Engine configuration

  • Because HDFS is configured with HA (high availability), if ClickHouse is not configured accordingly, then creating a table that only reads files and querying its data will report the following error:

    Code: 210. DB::Exception: Received from localhost:9000. 
    DB::Exception: Unable to connect to HDFS: InvalidParameter: Cannot parse URI: 
    hdfs://mycluster, missing port or invalid HA configuration	
    Caused by: HdfsConfigNotFound: Config key: dfs.ha.namenodes.mycluster not found. 
    

  • The following configuration is required to solve the problem
    1. Copy hdfs-site.xml to /etc/clickhouse-server/ and rename it to hdfs-client.xml, as sketched below
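
    A minimal sketch of this step, assuming the Hadoop configuration lives in /etc/hadoop/conf (adjust the source path to your own installation):

    cp /etc/hadoop/conf/hdfs-site.xml /etc/clickhouse-server/hdfs-client.xml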

    2. Modify the ClickHouse server startup file and add the environment variable Environment="LIBHDFS3_CONF=/etc/clickhouse-server/hdfs-client.xml"
    Note:
    The test environment here is Ubuntu, and the service is managed with systemctl, so the startup file to modify is:

    vi /etc/systemd/system/clickhouse-server.service
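
    After the change, the relevant part of the unit file should contain the new line under [Service] (a sketch; the other entries stay as installed):

    [Service]
    Environment="LIBHDFS3_CONF=/etc/clickhouse-server/hdfs-client.xml"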
    


    3. Reload systemd and restart clickhouse-server

    systemctl daemon-reload
    systemctl restart clickhouse-server.service 
    

    4. Testing
    Create a new read-only table; a test file 1.txt has been placed at the corresponding HDFS path in advance

    CREATE TABLE test_hdfs
    (
        id Int32
    )
    ENGINE = HDFS('hdfs://mycluster/1.txt', 'CSV');
    


    Query the data in the table:
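
    For example, from clickhouse-client (the expected output is the contents of 1.txt):

    SELECT * FROM test_hdfs;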

Two forms of HDFS table engine

  • The table is responsible for both reading and writing files

    1. Create a new table

      CREATE TABLE test_hdfs_read
      (
          id Int32,
          name String
      )
      ENGINE = HDFS('hdfs://mycluster/test', 'CSV');
      

    2. Insert data

      INSERT INTO test_hdfs_read VALUES (1, 'tracy');
      

    3. Query the table data and view the HDFS directory
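
      For example, query the table from clickhouse-client and then list the HDFS path from a shell (paths as above):

      SELECT * FROM test_hdfs_read;

      hdfs dfs -ls hdfs://mycluster/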


      Here you can see that an additional file, test, has appeared in the HDFS directory

  • The table is only responsible for reading files; writing the files is done by an external system
    This form is similar to Hive's external tables: other systems write files directly to HDFS, and the parameters hdfs_uri and format map the table to the file path and file format on HDFS. hdfs_uri supports the following common configurations:
    1. Absolute path: specifies a single file at the path, for example hdfs://mycluster/1.txt
    2. * wildcard: matches any characters; for example, hdfs://mycluster/* reads all files under the hdfs://mycluster/ path
    3. ? wildcard: matches a single character; for example, hdfs://mycluster/test_?.txt matches files such as test_1.txt, where ? stands for any single character
    4. {M..N} number range: matches files numbered in the given range; for example, hdfs://mycluster/test_{1..3}.txt reads the files test_1.txt, test_2.txt, and test_3.txt under that path

    1. Create a new directory in HDFS and put three files in it, as sketched below
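
      A sketch with hdfs shell commands, assuming the three files are named test_1.csv, test_2.csv, and test_3.csv to match the tests below:

      hdfs dfs -mkdir hdfs://mycluster/test_hdfs_read
      hdfs dfs -put test_1.csv test_2.csv test_3.csv hdfs://mycluster/test_hdfs_read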

    2. Test * wildcard

      CREATE TABLE test_hdfs_read_all
      (
          `id` Int32,
          `name` String
      )
      ENGINE = HDFS('hdfs://mycluster/test_hdfs_read/*', 'CSV')
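
      Querying this table should return the rows from all three files, for example:

      SELECT * FROM test_hdfs_read_all;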
      

    3. Test? wildcard

      CREATE TABLE test_hdfs_read_1
      (
          `id` Int32,
          `name` String
      )
      ENGINE = HDFS('hdfs://mycluster/test_hdfs_read/test_?.csv', 'CSV')
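
      For example, count the rows read from the three matched files:

      SELECT count() FROM test_hdfs_read_1;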
      

    4. Test number range wildcard

      CREATE TABLE test_hdfs_read_2
      (
          `id` Int32,
          `name` String
      )
      ENGINE = HDFS('hdfs://mycluster/test_hdfs_read/test_{2..3}.csv', 'CSV')
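
      Query the table to check which files were matched, for example:

      SELECT * FROM test_hdfs_read_2;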
      


      Only test_2 and test_3 are matched here, so there are only two records

Reference

https://github.com/ClickHouse/ClickHouse/issues/8159
