ClickHouse SQL reference - SELECT

Reference

https://clickhouse.tech/docs/en/sql-reference/statements/select/

Version: v20.11

Contents

Reference

Overview

Syntax

SELECT clause

COLUMNS expression

Asterisk

Extremes

AS

Implementation details

SELECT modifiers

APPLY

EXCEPT

REPLACE

Modifier Combinations

ARRAY JOIN clause

Syntax

Examples

Using aliases

ARRAY JOIN with nested data structure

Implementation details

DISTINCT clause

NULL value processing

Differences from GROUP BY

Limitations

Examples

FORMAT clause

Default format

Implementation details

Examples

FROM clause

FINAL modifier

Drawbacks

Implementation details

GROUP BY clause

NULL value processing

WITH ROLLUP modifier

Examples

WITH CUBE modifier

Examples

WITH TOTALS modifier

Totals processing configuration

Implementation details

GROUP BY in external memory

HAVING clause

Limitations

INTO OUTFILE clause

Implementation details

JOIN clause

Syntax

Supported join types

Settings

ASOF JOIN usage

Examples

Distributed JOIN

Usage recommendations

NULL value processing

Syntax

Syntax limitations

Performance

Memory limit

Examples

LIMIT clause

LIMIT... WITH TIES modifier

LIMIT BY clause

Examples

ORDER BY clause

Special value sorting

Examples

Collation support

Implementation details

Optimize data reading

ORDER BY Expr WITH FILL modifier

Examples

OFFSET FETCH modifier

Examples

PREWHERE clause

Manually controlling PREWHERE

Limitations

SAMPLE clause

SAMPLE K

SAMPLE N

SAMPLE K OFFSET M

UNION clause

UNION ALL

UNION DISTINCT

Implementation details

WHERE clause

WITH Clause

Syntax

Examples

Overview

Syntax

[WITH expr_list|(subquery)]
SELECT [DISTINCT] expr_list
[FROM [db.]table | (subquery) | table_function] [FINAL]
[SAMPLE sample_coeff]
[ARRAY JOIN ...]
[GLOBAL] [ANY|ALL|ASOF] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER|SEMI|ANTI] JOIN (subquery)|table (ON <expr_list>)|(USING <column_list>)
[PREWHERE expr]
[WHERE expr]
[GROUP BY expr_list] [WITH ROLLUP|WITH CUBE] [WITH TOTALS]
[HAVING expr]
[ORDER BY expr_list] [WITH FILL] [FROM expr] [TO expr] [STEP expr]
[LIMIT [offset_value, ]n BY columns]
[LIMIT [n, ]m] [WITH TIES]
[UNION  ...]
[INTO OUTFILE filename]
[FORMAT format]

All clauses are optional except for the list of required expressions immediately following SELECT, which will be described in more detail below.

SELECT clause

COLUMNS expression

To match some columns in the result against a re2 regular expression, you can use the COLUMNS expression.

COLUMNS('regexp')

--For example, consider the following table:
CREATE TABLE default.col_names (aa Int8, ab Int8, bc Int8) ENGINE = TinyLog

--The following query selects data from all columns whose names contain the symbol a.
SELECT COLUMNS('a') FROM col_names
┌─aa─┬─ab─┐
│  1 │  1 │
└────┴────┘

--The selected columns are not returned in alphabetical order.
--You can use multiple COLUMNS expressions in a query and apply functions to them.
--For example:
SELECT COLUMNS('a'), COLUMNS('c'), toTypeName(COLUMNS('c')) FROM col_names
┌─aa─┬─ab─┬─bc─┬─toTypeName(bc)─┐
│  1 │  1 │  1 │ Int8           │
└────┴────┴────┴────────────────┘

--Each column returned by the COLUMNS expression is passed to the function as a separate argument. You can also pass other arguments to the function if it supports them. Be careful when using functions: if a function does not support the number of arguments passed to it, ClickHouse throws an exception.
--For example:
SELECT COLUMNS('a') + COLUMNS('c') FROM col_names

Received exception from server (version 19.14.1):
Code: 42. DB::Exception: Received from localhost:9000. DB::Exception: Number of arguments for function plus doesn't match: passed 3, should be 2.
--Because COLUMNS('a') returns two columns and COLUMNS('c') returns one column, the + operator cannot be applied to three arguments.

Asterisk

You can put an asterisk in any part of a query instead of an expression. When the query is analyzed, the asterisk is expanded to the list of all table columns (excluding virtual columns).

It is reasonable to use asterisks only in a few cases:

  • When creating a table dump.
  • For tables that contain only a few columns, such as system tables.
  • For getting information about which columns are in the table. In this case, set LIMIT 1. However, it is better to use the DESC TABLE query.
  • When there is strong filtering on a small number of columns using PREWHERE.
  • In a subquery (because columns that are not required by an external query are excluded from the subquery).

In all other cases, we do not recommend using the asterisk, since it only gives you the drawbacks of a columnar DBMS instead of the advantages.

Extremes

In addition to the results, you can also get the minimum and maximum values of the result column.

To do this, set the extremes setting to 1. Minimums and maximums are calculated for numeric types, dates, and dates with times. For other columns, the default values are output.

An extra two rows are calculated: the minimums and the maximums, respectively. These extra two rows are output in JSON*, TabSeparated* and Pretty* formats, separately from the other rows. They are not output in other formats.

  • In JSON* formats, the extreme values are output in a separate "extremes" field.
  • In TabSeparated* formats, the rows come after the main result and after "totals" if present, preceded by an empty row (after the other data).
  • In Pretty* formats, the rows are output as a separate table after the main result and after totals if present.

Extreme values are calculated for rows before LIMIT, but after LIMIT BY.
However, when using LIMIT offset, size, the rows before the offset are included in the extremes.
In stream requests, the result may also include a small number of rows that passed through LIMIT.
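
For illustration, a minimal sketch of enabling extremes for a session in clickhouse-client; system.numbers is a built-in table, and in Pretty* formats the minimums and maximums appear as a separate table after the main result:

SET extremes = 1;
SELECT number FROM system.numbers LIMIT 5;
--the extra rows contain the minimum (0) and maximum (4) of the number column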

AS

You can use synonyms (AS aliases) in any part of the query.
The GROUP BY and ORDER BY clauses do not support positional arguments. This contradicts MySQL, but conforms to standard SQL. For example, GROUP BY 1, 2 will be interpreted as grouping by constants (i.e. aggregation of all rows into one).
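
As a sketch, an alias defined in the SELECT list can be reused in GROUP BY and ORDER BY instead of a positional number (the hits table and its EventTime column are borrowed from later examples in this article and are assumptions here):

SELECT toStartOfDay(EventTime) AS day, count() AS c
FROM hits
GROUP BY day
ORDER BY day;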

Implementation details

If a query omits the DISTINCT, GROUP BY and ORDER BY clauses and the IN and JOIN subqueries, the query will be completely stream processed, using O(1) amount of RAM.

Otherwise, if appropriate restrictions are not specified, the query may consume a lot of RAM. Relevant settings:

  • max_memory_usage
  • max_rows_to_group_by
  • max_rows_to_sort
  • max_rows_in_distinct
  • max_bytes_in_distinct
  • max_rows_in_set
  • max_bytes_in_set
  • max_rows_in_join
  • max_bytes_in_join
  • max_bytes_before_external_sort
  • max_bytes_before_external_group_by

For more information, see general settings.

SELECT modifiers

You can use the following modifiers in a SELECT query.

APPLY

Allows you to invoke a function for each row returned by the query's outer table expression.

SELECT <expr> APPLY( <func> ) FROM [db.]table_name
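
The modifier examples below operate on a columns_transformers table that is not created in the text; a minimal sketch of how such a table could look (the exact column types are an assumption):

CREATE TABLE columns_transformers (i Int64, j Int16, k Int64) ENGINE = TinyLog;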
 

INSERT INTO columns_transformers VALUES (100, 10, 324), (120, 8, 23);
SELECT * APPLY(sum) FROM columns_transformers;
┌─sum(i)─┬─sum(j)─┬─sum(k)─┐
│    220 │     18 │    347 │
└────────┴────────┴────────┘

EXCEPT

Excludes one or more columns from the result. All matching column names are omitted from the output.

SELECT <expr> EXCEPT ( col_name1 [, col_name2, col_name3, ...] ) FROM [db.]table_name

SELECT * EXCEPT (i) from columns_transformers;
┌──j─┬───k─┐
│ 10 │ 324 │
│  8 │  23 │
└────┴─────┘

REPLACE

Specifies one or more expression aliases. Each alias must match a column name in the SELECT * statement.

In the output column list, the column that matches the alias is replaced by the expression in that REPLACE. This modifier does not change the names or order of columns, but it can change the values and the value types.

SELECT <expr> REPLACE( <expr> AS col_name) from [db.]table_name

SELECT * REPLACE(i + 1 AS i) from columns_transformers;
┌───i─┬──j─┬───k─┐
│ 101 │ 10 │ 324 │
│ 121 │  8 │  23 │
└─────┴────┴─────┘

Modifier Combinations

You can use each modifier individually or in combination.

--Use the same modifier multiple times.
SELECT COLUMNS('[jk]') APPLY(toString) APPLY(length) APPLY(max) from columns_transformers;
┌─max(length(toString(j)))─┬─max(length(toString(k)))─┐
│                        2 │                        3 │
└──────────────────────────┴──────────────────────────┘

--Use multiple modifiers in a single query.
SELECT * REPLACE(i + 1 AS i) EXCEPT (j) APPLY(sum) from columns_transformers;
┌─sum(plus(i, 1))─┬─sum(k)─┐
│             222 │    347 │
└─────────────────┴────────┘

 

ARRAY JOIN clause

This is a common operation for tables that contain an array column: it produces a new table that has a column with each individual array element of the initial column, while the values of the other columns are duplicated. This is the basic case of what the ARRAY JOIN clause does.

It can be seen as performing a JOIN with an array or nested data structure. The intent is similar to the arrayJoin function, but the capabilities of the clause are broader.

Syntax

SELECT <expr_list>
FROM <left_subquery>
[LEFT] ARRAY JOIN <array>
[WHERE|PREWHERE <expr>]
...

You can specify only one ARRAY JOIN clause in a SELECT query.

The supported ARRAY JOIN types are listed below:

  • ARRAY JOIN - in the basic case, empty arrays are not included in the result of the JOIN.
  • LEFT ARRAY JOIN - the result of the JOIN contains rows with empty arrays. The value for an empty array is set to the default value for the array element type (usually 0, an empty string, or NULL).

Examples

The following example demonstrates the usage of the ARRAY JOIN and LEFT ARRAY JOIN clauses.

CREATE TABLE arrays_test
(
    s String,
    arr Array(UInt8)
) ENGINE = Memory;

INSERT INTO arrays_test
VALUES ('Hello', [1,2]), ('World', [3,4,5]), ('Goodbye', []);
┌─s───────────┬─arr─────┐
│ Hello       │ [1,2]   │
│ World       │ [3,4,5] │
│ Goodbye     │ []      │
└─────────────┴─────────┘

--Example using ARRAY JOIN:
SELECT s, arr
FROM arrays_test
ARRAY JOIN arr;
┌─s─────┬─arr─┐
│ Hello │   1 │
│ Hello │   2 │
│ World │   3 │
│ World │   4 │
│ World │   5 │
└───────┴─────┘

--Example using LEFT ARRAY JOIN:
SELECT s, arr
FROM arrays_test
LEFT ARRAY JOIN arr;
┌─s───────────┬─arr─┐
│ Hello       │   1 │
│ Hello       │   2 │
│ World       │   3 │
│ World       │   4 │
│ World       │   5 │
│ Goodbye     │   0 │
└─────────────┴─────┘

Using aliases

You can specify an alias for the array in the ARRAY JOIN clause. In this case, you can use this alias to access the array items, or you can use the original name to access the array itself. Example:

SELECT s, arr, a
FROM arrays_test
ARRAY JOIN arr AS a;
┌─s─────┬─arr─────┬─a─┐
│ Hello │ [1,2]   │ 1 │
│ Hello │ [1,2]   │ 2 │
│ World │ [3,4,5] │ 3 │
│ World │ [3,4,5] │ 4 │
│ World │ [3,4,5] │ 5 │
└───────┴─────────┴───┘

--Using an alias, you can perform ARRAY JOIN with an external array.
SELECT s, arr_external
FROM arrays_test
ARRAY JOIN [1, 2, 3] AS arr_external;
┌─s───────────┬─arr_external─┐
│ Hello       │            1 │
│ Hello       │            2 │
│ Hello       │            3 │
│ World       │            1 │
│ World       │            2 │
│ World       │            3 │
│ Goodbye     │            1 │
│ Goodbye     │            2 │
│ Goodbye     │            3 │
└─────────────┴──────────────┘

--Multiple arrays can be comma-separated in the ARRAY JOIN clause. In this case, the JOIN is performed with them simultaneously (the direct sum, not the Cartesian product). Note that all the arrays must have the same size. Example:
SELECT s, arr, a, num, mapped
FROM arrays_test
ARRAY JOIN arr AS a, arrayEnumerate(arr) AS num, arrayMap(x -> x + 1, arr) AS mapped;
┌─s─────┬─arr─────┬─a─┬─num─┬─mapped─┐
│ Hello │ [1,2]   │ 1 │   1 │      2 │
│ Hello │ [1,2]   │ 2 │   2 │      3 │
│ World │ [3,4,5] │ 3 │   1 │      4 │
│ World │ [3,4,5] │ 4 │   2 │      5 │
│ World │ [3,4,5] │ 5 │   3 │      6 │
└───────┴─────────┴───┴─────┴────────┘


--The following example uses the arrayEnumerate function:
SELECT s, arr, a, num, arrayEnumerate(arr)
FROM arrays_test
ARRAY JOIN arr AS a, arrayEnumerate(arr) AS num;
┌─s─────┬─arr─────┬─a─┬─num─┬─arrayEnumerate(arr)─┐
│ Hello │ [1,2]   │ 1 │   1 │ [1,2]               │
│ Hello │ [1,2]   │ 2 │   2 │ [1,2]               │
│ World │ [3,4,5] │ 3 │   1 │ [1,2,3]             │
│ World │ [3,4,5] │ 4 │   2 │ [1,2,3]             │
│ World │ [3,4,5] │ 5 │   3 │ [1,2,3]             │
└───────┴─────────┴───┴─────┴─────────────────────┘

ARRAY JOIN with nested data structure

CREATE TABLE nested_test
(
    s String,
    nest Nested(
    x UInt8,
    y UInt32)
) ENGINE = Memory;

INSERT INTO nested_test
VALUES ('Hello', [1,2], [10,20]), ('World', [3,4,5], [30,40,50]), ('Goodbye', [], []);
┌─s───────┬─nest.x──┬─nest.y─────┐
│ Hello   │ [1,2]   │ [10,20]    │
│ World   │ [3,4,5] │ [30,40,50] │
│ Goodbye │ []      │ []         │
└─────────┴─────────┴────────────┘

SELECT s, `nest.x`, `nest.y`
FROM nested_test
ARRAY JOIN nest;
┌─s─────┬─nest.x─┬─nest.y─┐
│ Hello │      1 │     10 │
│ Hello │      2 │     20 │
│ World │      3 │     30 │
│ World │      4 │     40 │
│ World │      5 │     50 │
└───────┴────────┴────────┘

--When the name of a nested data structure is specified in ARRAY JOIN, the meaning is the same as an ARRAY JOIN with all the array elements that it consists of. Example:
SELECT s, `nest.x`, `nest.y`
FROM nested_test
ARRAY JOIN `nest.x`, `nest.y`;
┌─s─────┬─nest.x─┬─nest.y─┐
│ Hello │      1 │     10 │
│ Hello │      2 │     20 │
│ World │      3 │     30 │
│ World │      4 │     40 │
│ World │      5 │     50 │
└───────┴────────┴────────┘


--Another variation:

SELECT s, `nest.x`, `nest.y`
FROM nested_test
ARRAY JOIN `nest.x`;
┌─s─────┬─nest.x─┬─nest.y─────┐
│ Hello │      1 │ [10,20]    │
│ Hello │      2 │ [10,20]    │
│ World │      3 │ [30,40,50] │
│ World │      4 │ [30,40,50] │
│ World │      5 │ [30,40,50] │
└───────┴────────┴────────────┘

--An alias can be used for a nested data structure, in order to select either the JOIN result or the source arrays. Example:
SELECT s, `n.x`, `n.y`, `nest.x`, `nest.y`
FROM nested_test
ARRAY JOIN nest AS n;
┌─s─────┬─n.x─┬─n.y─┬─nest.x──┬─nest.y─────┐
│ Hello │   1 │  10 │ [1,2]   │ [10,20]    │
│ Hello │   2 │  20 │ [1,2]   │ [10,20]    │
│ World │   3 │  30 │ [3,4,5] │ [30,40,50] │
│ World │   4 │  40 │ [3,4,5] │ [30,40,50] │
│ World │   5 │  50 │ [3,4,5] │ [30,40,50] │
└───────┴─────┴─────┴─────────┴────────────┘


--Example using the arrayEnumerate function:
SELECT s, `n.x`, `n.y`, `nest.x`, `nest.y`, num
FROM nested_test
ARRAY JOIN nest AS n, arrayEnumerate(`nest.x`) AS num;
┌─s─────┬─n.x─┬─n.y─┬─nest.x──┬─nest.y─────┬─num─┐
│ Hello │   1 │  10 │ [1,2]   │ [10,20]    │   1 │
│ Hello │   2 │  20 │ [1,2]   │ [10,20]    │   2 │
│ World │   3 │  30 │ [3,4,5] │ [30,40,50] │   1 │
│ World │   4 │  40 │ [3,4,5] │ [30,40,50] │   2 │
│ World │   5 │  50 │ [3,4,5] │ [30,40,50] │   3 │
└───────┴─────┴─────┴─────────┴────────────┴─────┘

Implementation details

The query execution order is optimized when ARRAY JOIN is run.

Although ARRAY JOIN must always be specified before the WHERE/PREWHERE clause in a query, technically it can be performed in any order, unless the result of the ARRAY JOIN is used for filtering.

The processing order is controlled by the query optimizer.

DISTINCT clause

If SELECT DISTINCT is specified, only unique rows remain in the query result. Thus, only a single row remains out of each set of fully matching rows in the result.

NULL value processing

DISTINCT works with NULL as if NULL were a specific value, and NULL == NULL.

In other words, in the DISTINCT result, different combinations with NULL occur only once. This differs from NULL processing in most other contexts.

Differences from GROUP BY

It is possible to obtain the same result by applying GROUP BY on the same set of values as specified in SELECT, without using any aggregate functions. The differences are:

  • DISTINCT can be applied together with GROUP BY.
  • When ORDER BY is omitted and LIMIT is defined, the query will stop running immediately after reading the required number of different rows.
  • Data blocks are output during processing without waiting for the entire query to finish running.

Limitations

DISTINCT is not supported if SELECT has at least one array column.

Examples

--ClickHouse supports using DISTINCT and ORDER BY with different columns in one query. The DISTINCT clause is executed before the ORDER BY clause.
--Example table:
┌─a─┬─b─┐
│ 2 │ 1 │
│ 1 │ 2 │
│ 3 │ 3 │
│ 2 │ 4 │
└───┴───┘

--Executing the query SELECT DISTINCT a FROM t1 ORDER BY b ASC gives the following result:
┌─a─┐
│ 2 │
│ 1 │
│ 3 │
└───┘

--If we change the sorting direction with SELECT DISTINCT a FROM t1 ORDER BY b DESC, we get the following result:
┌─a─┐
│ 3 │
│ 1 │
│ 2 │
└───┘
Row (2, 4) was cut before sorting.

FORMAT clause

ClickHouse supports a wide range of serialization formats that can be used for query results, among other things. There are multiple ways to choose the output format of a SELECT; one of them is to specify FORMAT at the end of the query to get the result data in a specific format.

For convenience, specific formats can be used to integrate with other systems or improve performance.

Default format

If the FORMAT clause is omitted, the default FORMAT is used, depending on the settings and the interface used to access the ClickHouse server.

  • For HTTP interface and command-line client in batch mode, the default format is TabSeparated.
  • For command-line clients in interactive mode, the default format is PrettyCompact.

Implementation details

When using the command-line client, data is always transmitted over the network in an internal efficient format (Native). The client independently interprets the FORMAT clause of the query and formats the data itself (thus relieving the network and the server of the extra load).

Examples

--Using the TabSeparated format
vm:) select * from rmTest format TabSeparated;

SELECT *
FROM rmTest
FORMAT TabSeparated

1	lisi	2020-10-10	90
1	zhangsan	2020-10-10	88
1	zhangsan	2020-10-11	89
15	lmt	2020-11-24	1
16	lmt	2020-11-24	1

5 rows in set. Elapsed: 0.011 sec. 

--Using the default format
vm:) select * from rmTest;

SELECT *
FROM rmTest

┌─id─┬─name─┬───────date─┬─score─┐
│ 15 │ lmt  │ 2020-11-24 │     1 │
│ 16 │ lmt  │ 2020-11-24 │     1 │
└────┴──────┴────────────┴───────┘
┌─id─┬─name─────┬───────date─┬─score─┐
│  1 │ zhangsan │ 2020-10-11 │    89 │
└────┴──────────┴────────────┴───────┘
┌─id─┬─name─────┬───────date─┬─score─┐
│  1 │ lisi     │ 2020-10-10 │    90 │
│  1 │ zhangsan │ 2020-10-10 │    88 │
└────┴──────────┴────────────┴───────┘

5 rows in set. Elapsed: 0.006 sec.

FROM clause

The FROM clause specifies the source to read data from: a table, a subquery, or a table function.

The JOIN and ARRAY JOIN clauses can also be used to extend the functionality of the FROM clause.

A subquery is another SELECT query that can be specified in parentheses in the FROM clause.

The FROM clause can contain multiple data sources separated by commas, which is equivalent to performing CROSS JOIN on them.

FINAL modifier

When FINAL is specified, ClickHouse fully merges the data before returning the result, and thus performs all the data transformations that happen during merges for the given table engine.

FINAL modifier support:

  • Tables using the MergeTree engine family (except GraphiteMergeTree)
  • Replicated version of the MergeTree engine
  • View, Buffer, Distributed and MaterializedView engines that operate over other engines, provided they were created over MergeTree-engine tables

Drawbacks

Queries using FINAL are slightly slower than similar queries without FINAL because:

  • Data is merged during query execution.
  • In addition to the columns specified in the query, queries with FINAL also read primary key columns.

In most cases, avoid using FINAL.

The common approach is to use different queries that assume the background merge processes of the MergeTree engine have not happened yet, and deal with it by applying aggregation (for example, to discard duplicates).
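
As a sketch, reusing the rmTest table from the FORMAT examples below and assuming its engine is a MergeTree-family engine that collapses duplicates in the background (for example ReplacingMergeTree), FINAL applies that merge logic at query time:

SELECT * FROM rmTest FINAL ORDER BY id;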

Implementation details

If the FROM clause is omitted, data is read from the system.one table. The system.one table contains exactly one row (this table serves the same purpose as the DUAL table found in other DBMSs).

To execute a query, all the columns listed in the query are extracted from the appropriate table; any columns not needed for the outer query are thrown out of the subqueries.

If a query does not list any columns (for example, SELECT count() FROM t), some column is extracted from the table anyway (the smallest one is preferred) in order to count the number of rows.

GROUP BY clause

The GROUP BY clause switches the SELECT query to aggregation mode. Its working mode is as follows:

  • The GROUP BY clause contains a list of expressions (or a single expression, which is treated as a list of length one). This list acts as a grouping key, and each individual expression is referred to as a key expression.
  • All expressions in the SELECT, HAVING and ORDER BY clauses must be calculated based on key expressions or on aggregate functions over non-key expressions (plain columns). In other words, each column selected from the table must be used either in a key expression or inside an aggregate function, but not both.
  • Aggregating the result of GROUP BY gives as many rows as there are distinct values of the grouping key, which usually reduces the result by several orders of magnitude. However, if all the grouping key values are different, the result size is not reduced.

If a query contains table columns only inside aggregate functions, the GROUP BY clause can be omitted, and aggregation over an empty set of keys is assumed.
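
A minimal sketch, reusing the t_null_big table shown in the next subsection: the selected columns appear only inside aggregate functions, so GROUP BY is omitted and a single aggregated row is returned.

SELECT sum(x), count() FROM t_null_big;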

NULL value processing

For grouping, ClickHouse interprets NULL as a value and NULL == NULL.  

If there is a table:
┌─x─┬────y─┐
│ 1 │    2 │
│ 2 │ ᴺᵁᴸᴸ │
│ 3 │    2 │
│ 3 │    3 │
│ 3 │ ᴺᵁᴸᴸ │
└───┴──────┘

SELECT sum(x), y FROM t_null_big GROUP BY y:
┌─sum(x)─┬────y─┐
│      4 │    2 │
│      3 │    3 │
│      5 │ ᴺᵁᴸᴸ │
└────────┴──────┘

If you pass several keys to GROUP BY, the result will give you all the combinations of the selection, as if NULL were a specific value.

WITH ROLLUP modifier

The WITH ROLLUP modifier is used to calculate subtotals for the key expressions, based on their order in the GROUP BY list.

The subtotals are calculated in the reverse order: first subtotals are calculated for the last key expression in the list, then for the previous one, and so on up to the first key expression.

In the subtotal rows, the values of key expressions that have already been "rolled up" are set to 0 or an empty string.

Note that the HAVING clause can affect the subtotal results.

Examples

--Consider the table:
┌─year─┬─month─┬─day─┐
│ 2019 │     1 │   5 │
│ 2019 │     1 │  15 │
│ 2020 │     1 │   5 │
│ 2020 │     1 │  15 │
│ 2020 │    10 │   5 │
│ 2020 │    10 │  15 │
└──────┴───────┴─────┘

SELECT year, month, day, count(*) FROM t GROUP BY year, month, day WITH ROLLUP;

--Because the GROUP BY section has three key expressions, the result contains four tables with subtotals "rolled up" from right to left:
--GROUP BY:
--year, month, day
--year, month (the day column is filled with zeros)
--year (the month and day columns are filled with zeros)
--grand total (all key columns are filled with zeros)
┌─year─┬─month─┬─day─┬─count()─┐
│ 2020 │    10 │  15 │       1 │
│ 2020 │     1 │   5 │       1 │
│ 2019 │     1 │   5 │       1 │
│ 2020 │     1 │  15 │       1 │
│ 2019 │     1 │  15 │       1 │
│ 2020 │    10 │   5 │       1 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│ 2019 │     1 │   0 │       2 │
│ 2020 │     1 │   0 │       2 │
│ 2020 │    10 │   0 │       2 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│ 2019 │     0 │   0 │       2 │
│ 2020 │     0 │   0 │       4 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│    0 │     0 │   0 │       6 │
└──────┴───────┴─────┴─────────┘

WITH CUBE modifier

The WITH CUBE modifier is used to calculate subtotals for every combination of the key expressions in the GROUP BY list. The subtotal rows are added after the result table.

In the subtotal rows, the values of all "grouped" key expressions are set to 0 or an empty string.

Note that the HAVING clause can affect the subtotal results.

Examples

--Consider the table:
┌─year─┬─month─┬─day─┐
│ 2019 │     1 │   5 │
│ 2019 │     1 │  15 │
│ 2020 │     1 │   5 │
│ 2020 │     1 │  15 │
│ 2020 │    10 │   5 │
│ 2020 │    10 │  15 │
└──────┴───────┴─────┘

SELECT year, month, day, count(*) FROM t GROUP BY year, month, day WITH CUBE;

--Because the GROUP BY section has three key expressions, the result contains eight tables with subtotals for all combinations of key expressions:
--GROUP BY : 
--year, month, day
--year, month
--year, day
--year
--month, day
--month
--day
--grand total
--Columns excluded from GROUP BY are filled with zeros.
┌─year─┬─month─┬─day─┬─count()─┐
│ 2020 │    10 │  15 │       1 │
│ 2020 │     1 │   5 │       1 │
│ 2019 │     1 │   5 │       1 │
│ 2020 │     1 │  15 │       1 │
│ 2019 │     1 │  15 │       1 │
│ 2020 │    10 │   5 │       1 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│ 2019 │     1 │   0 │       2 │
│ 2020 │     1 │   0 │       2 │
│ 2020 │    10 │   0 │       2 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│ 2020 │     0 │   5 │       2 │
│ 2019 │     0 │   5 │       1 │
│ 2020 │     0 │  15 │       2 │
│ 2019 │     0 │  15 │       1 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│ 2019 │     0 │   0 │       2 │
│ 2020 │     0 │   0 │       4 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│    0 │     1 │   5 │       2 │
│    0 │    10 │  15 │       1 │
│    0 │    10 │   5 │       1 │
│    0 │     1 │  15 │       2 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│    0 │     1 │   0 │       4 │
│    0 │    10 │   0 │       2 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│    0 │     0 │   5 │       3 │
│    0 │     0 │  15 │       3 │
└──────┴───────┴─────┴─────────┘
┌─year─┬─month─┬─day─┬─count()─┐
│    0 │     0 │   0 │       6 │
└──────┴───────┴─────┴─────────┘

WITH TOTALS modifier

If the WITH TOTALS modifier is specified, another row is evaluated.

The row contains the key columns with default values (zeros or empty strings) and the aggregate function columns with values calculated across all the rows (the "total" values).

This extra row is produced only in JSON*, TabSeparated* and Pretty* formats, separately from the other rows:

  • In JSON* formats, this row is output as a separate "totals" field.
  • In TabSeparated* formats, the row comes after the main result, preceded by an empty row (after the other data).
  • In Pretty* formats, the row is output as a separate table after the main result.
  • It is not output in other formats.

WITH TOTALS can be run in different ways when HAVING is present. The behavior depends on the totals_mode setting.
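
A minimal sketch, reusing the year/month/day table t from the WITH ROLLUP example above; in Pretty* formats the extra totals row is shown as a separate table with year = 0 and count() = 6:

SELECT year, count() FROM t GROUP BY year WITH TOTALS;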

Totals processing configuration

before_having

This is the documented default (although in version 20.8.3 the observed default was 'after_having_exclusive'). In this mode, "totals" is calculated across all rows, including the ones that do not pass through HAVING and max_rows_to_group_by.

The other alternatives include only the rows that pass through HAVING in "totals", and behave differently with the max_rows_to_group_by and group_by_overflow_mode = 'any' settings.

after_having_exclusive

Do not include in "totals" the rows that did not pass through max_rows_to_group_by.

after_having_inclusive

Include in "totals" the rows that did not pass through max_rows_to_group_by.

after_having_auto

Count the number of rows that passed through HAVING. If it is more than a certain amount (50% by default), include in "totals" the rows that did not pass through max_rows_to_group_by; otherwise, do not include them.

totals_auto_threshold

The default is 0.5. The coefficient for after_having_auto.

If max_rows_to_group_by and group_by_overflow_mode = 'any' are not used, all variations of after_having are the same, and you can use any of them (for example, after_having_auto).

You can use WITH TOTALS in subqueries, including subqueries in the JOIN clause (in this case, the individual totals will be aggregated).

Implementation details

Aggregation is one of the most important features of a column-oriented DBMS, so its implementation is one of the most heavily optimized parts of ClickHouse.

By default, aggregation is done in memory using hash tables. There are more than 40 specializations, chosen automatically depending on the "grouping key" data types.

GROUP BY in external memory

You can enable the dump of temporary data to disk to constrain the memory usage of GROUP BY.

The max_bytes_before_external_group_by setting determines the threshold RAM consumption for dumping GROUP BY temporary data to disk. If set to 0 (the default), it is disabled.

When using max_bytes_before_external_group_by, it is recommended to set max_memory_usage to about twice as high.

This is because aggregation has two phases: (1) reading data and forming intermediate data; (2) merging the intermediate data. Dumping data to disk can only happen during phase 1. If the temporary data was not dumped, phase 2 might require up to the same amount of memory as phase 1.

For example, if max_memory_usage was set to 10000000000 and you want to use external aggregation, it makes sense to set max_bytes_before_external_group_by to 10000000000 and max_memory_usage to 20000000000. When external aggregation is triggered (if there was at least one dump of temporary data), the maximum RAM consumption is only slightly more than max_bytes_before_external_group_by.

With distributed query processing, external aggregation is performed on the remote servers. In order for the requester server to use only a small amount of RAM, set distributed_aggregation_memory_efficient to 1.

When merging data flushed to disk, as well as when merging results from remote servers with distributed_aggregation_memory_efficient enabled, up to 1/256 * the_number_of_threads of the total amount of RAM is consumed.

When external aggregation is enabled, if the data was less than max_bytes_before_external_group_by (i.e. the data was not flushed to disk), the query runs just as fast as without external aggregation. If any temporary data was flushed, the run time will be several times longer (approximately three times).

If there is an ORDER BY with a LIMIT after GROUP BY, the amount of RAM used depends on the amount of data in the LIMIT, not on the whole table. However, if ORDER BY does not have a LIMIT, do not forget to enable external sorting (max_bytes_before_external_sort).
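
A sketch of enabling external aggregation for a session, following the sizing guideline above (the threshold values are illustrative only):

SET max_bytes_before_external_group_by = 10000000000;  -- about 10 GB
SET max_memory_usage = 20000000000;                    -- about 20 GB, roughly twice the threshold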

HAVING clause

Allows filtering the aggregation results produced by GROUP BY. It is similar to the WHERE clause, but the difference is that WHERE is performed before aggregation, while HAVING is performed after it.

Aggregation results can be referenced in the HAVING clause by their alias from the SELECT clause. Alternatively, the HAVING clause can filter on the results of additional aggregates that are not returned in the query results.
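
A minimal sketch, reusing the hits table from other examples in this article (the URL column is an assumption); the cnt alias from the SELECT clause is referenced in HAVING:

SELECT domainWithoutWWW(URL) AS domain, count() AS cnt
FROM hits
GROUP BY domain
HAVING cnt > 100
ORDER BY cnt DESC;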

Limitations

HAVING cannot be used if aggregation is not performed. Use WHERE instead.

INTO OUTFILE clause

Add the INTO OUTFILE filename clause (where filename is a string literal) to a SELECT query to redirect its output to the specified file on the client side.

Implementation details

  • This feature is available in the command-line client and clickhouse-local. A query sent via the HTTP interface will fail.
  • If a file with the same file name already exists, the query will fail.
  • The default output format is TabSeparated (similar to the command line client batch mode).
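
A minimal sketch run from the command-line client, reusing the rmTest table from the FORMAT examples (the file name is illustrative):

SELECT * FROM rmTest INTO OUTFILE 'rmTest_dump.tsv' FORMAT TabSeparated;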

JOIN clause

JOIN produces a new table by combining columns from one or more tables, using values common to each. It is a common operation in databases with SQL support and corresponds to the relational algebra join. The special case of joining a table with itself is often called a "self join".

Expressions from the ON clause and columns from the USING clause are called "join keys". Unless otherwise noted, a JOIN produces a Cartesian product from rows with matching join keys, which may produce a result with many more rows than the source tables.

Syntax

SELECT <expr_list>
FROM <left_table>
[GLOBAL] [INNER|LEFT|RIGHT|FULL|CROSS] [OUTER|SEMI|ANTI|ANY|ASOF] JOIN <right_table>
(ON <expr_list>)|(USING <column_list>) ...

Supported join types

All standard SQL JOIN types are supported:

  • INNER JOIN, which returns only matching rows.
  • LEFT OUTER JOIN, which returns non-matching rows from the left table in addition to matching rows.
  • RIGHT OUTER JOIN returns unmatched rows from the right table in addition to the matched rows.
  • FULL OUTER JOIN returns unmatched rows in two tables in addition to matching rows.
  • CROSS JOIN, which generates Cartesian product of the whole table without specifying "join key".

A JOIN without a type specified implies INNER. The keyword OUTER can be safely omitted.

Another syntax for CROSS JOIN is to specify multiple tables in the FROM clause, separated by commas.

Other join types available in ClickHouse:

  • LEFT SEMI JOIN and RIGHT SEMI JOIN: a whitelist on "join keys", without producing a Cartesian product.
  • LEFT ANTI JOIN and RIGHT ANTI JOIN: a blacklist on "join keys", without producing a Cartesian product.
  • LEFT ANY JOIN, RIGHT ANY JOIN and INNER ANY JOIN: partially (for the opposite side of LEFT and RIGHT) or completely (for INNER and FULL) disable the Cartesian product for the standard JOIN types.
  • ASOF JOIN and LEFT ASOF JOIN: join sequences with a non-exact match. ASOF JOIN usage is described below.

Settings

The default join type can be overridden using the join_default_strictness setting. The behavior of the ClickHouse server for ANY JOIN operations depends on the any_join_distinct_right_table_keys setting.

ASOF JOIN usage

ASOF JOIN is useful when you need to join records that have no exact match.

The algorithm requires a special column in the tables. This column must:

  • Contain an ordered sequence.
  • Be one of these types: Int*, UInt*, Float*, Date, DateTime, Decimal*.
  • Not be the only column in the JOIN clause.

SELECT expressions_list
FROM table_1
ASOF LEFT JOIN table_2
ON equi_cond AND closest_match_cond
--You can use any number of equality conditions and one closest match condition
SELECT count() FROM table_1 ASOF LEFT JOIN table_2 ON table_1.a == table_2.b AND table_2.t <= table_1.t

Supported conditions for the closest match: >, >=, <, <=

--ASOF JOIN uses equi_columnX for the equality join
--ASOF JOIN uses asof_column for the closest match join, with a condition like table_1.asof_column >= table_2.asof_column
--The asof_column is always the last column in the USING clause

SELECT expressions_list
FROM table_1
ASOF JOIN table_2
USING (equi_column1, ... equi_columnN, asof_column)

Examples

ASOF JOIN can take a timestamp from table_1 and find the event in table_2 whose timestamp is closest according to the match condition. Equal timestamp values are the closest, if available.

Here, the user_id column can be used for the equality join, and the ev_time column for the closest match join. In this example, event_1_1 can be joined with event_2_1, and event_1_2 can be joined with event_2_3, but event_2_2 cannot be joined.

     table_1                           table_2
  event   | ev_time | user_id       event   | ev_time | user_id
----------|---------|----------   ----------|---------|----------
              ...                               ...
event_1_1 |  12:00  |  42         event_2_1 |  11:59  |   42
              ...                 event_2_2 |  12:30  |   42
event_1_2 |  13:00  |  42         event_2_3 |  13:00  |   42
              ...                               ...

ASOF joins are not supported in the Join table engine.

Distributed JOIN

There are two ways to perform joins involving distributed tables:

  • When using a normal JOIN, the query is sent to the remote servers. Subqueries are run on each of them to build the right table, and the join is performed with that table. In other words, the right table is formed on each server separately.
  • When using GLOBAL ... JOIN, the requester server first runs a subquery to calculate the right table. This temporary table is passed to every remote server, and queries are run on them using the transmitted temporary table.

Be careful when using GLOBAL; for more information, see the section on distributed subqueries.
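
As a sketch of the two forms (the distributed_hits and distributed_users table names and the UserID column are illustrative):

--normal JOIN: the right table is built separately on every remote server
SELECT count()
FROM distributed_hits AS h
INNER JOIN (SELECT UserID FROM distributed_users) AS u ON h.UserID = u.UserID;

--GLOBAL JOIN: the right table is built once on the requester server and sent to every remote server
SELECT count()
FROM distributed_hits AS h
GLOBAL INNER JOIN (SELECT UserID FROM distributed_users) AS u ON h.UserID = u.UserID;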

Usage recommendations

NULL value processing

When joining tables, empty cells may appear. The join_use_nulls setting defines how ClickHouse fills these cells.

If the JOIN keys are Nullable fields, rows where at least one of the keys has the value NULL are not joined.

Syntax

The column specified in USING must have the same name in both subqueries, while the names of other columns must be different. You can use aliases to change column names in subqueries.

The USING clause specifies one or more columns to join, which establishes the equality of these columns. The list of columns is set without brackets. More complex join conditions are not supported.

Syntax limitations

For multiple JOIN clauses in a single SELECT query:

  • The asterisk (*) can be used only when tables are joined, not subqueries.
  • The PREWHERE clause is not available.

For ON, WHERE and GROUP BY clauses:

  • Arbitrary expressions cannot be used in the ON, WHERE and GROUP BY clauses, but you can define an expression in the SELECT clause and then use it in these clauses via an alias.

Performance

When running a JOIN, the execution order is not optimized with respect to the other stages of the query. The JOIN (the search in the right table) is run before the filtering in WHERE and before aggregation.

Each time a query is run with the same JOIN, the subquery is run again because the result is not cached. To avoid this, use the special Join table engine.

In some cases, it is more efficient to use IN instead of JOIN.

If you need a JOIN for joining with dimension tables (relatively small tables that contain dimension properties, such as names of advertising campaigns), a JOIN might not be very convenient because the right table is re-accessed for every query.

In this case, the "external dictionary" function should be used instead of JOIN. For more information, see the external dictionary section.

Memory limit

By default, ClickHouse uses the hash join algorithm: it takes the <right_table> and creates a hash table for it in RAM. After a certain memory-consumption threshold is reached, ClickHouse falls back to the merge join algorithm.

If you need to restrict the memory consumption of the JOIN operation, use the max_rows_in_join and max_bytes_in_join settings.

When any of these limits is reached, ClickHouse acts as the join_overflow_mode setting instructs.

Examples

SELECT
    CounterID,
    hits,
    visits
FROM
(
    SELECT
        CounterID,
        count() AS hits
    FROM test.hits
    GROUP BY CounterID
) ANY LEFT JOIN
(
    SELECT
        CounterID,
        sum(Sign) AS visits
    FROM test.visits
    GROUP BY CounterID
) USING CounterID
ORDER BY hits DESC
LIMIT 10
┌─CounterID─┬───hits─┬─visits─┐
│   1143050 │ 523264 │  13665 │
│    731962 │ 475698 │ 102716 │
│    722545 │ 337212 │ 108187 │
│    722889 │ 252197 │  10547 │
│   2237260 │ 196036 │   9522 │
│  23057320 │ 147211 │   7689 │
│    722818 │  90109 │  17847 │
│     48221 │  85379 │   4652 │
│  19762435 │  77807 │   7026 │
│    722884 │  77492 │  11056 │
└───────────┴────────┴────────┘

LIMIT clause

LIMIT m allows you to select the first m rows from the results.

LIMIT n, m allows you to select m rows from the result after skipping the first n rows. The LIMIT m OFFSET n syntax is equivalent.

n and m must be non negative integers.

If there is no ORDER BY clause that explicitly sorts the result, the choice of rows for the result may be arbitrary and non-deterministic.

LIMIT... WITH TIES modifier

If you set the WITH TIES modifier for LIMIT n (or LIMIT n, m) and specify ORDER BY expr_list, the result contains the first n (or n, m) rows plus all rows whose ORDER BY field values tie with the last of those rows.

This modifier can also be used in conjunction with the ORDER BY... WITH FILL modifier.

SELECT * FROM (
    SELECT number%50 AS n FROM numbers(100)
) ORDER BY n LIMIT 0,5

┌─n─┐
│ 0 │
│ 0 │
│ 1 │
│ 1 │
│ 2 │
└───┘

--Adding the WITH TIES modifier

SELECT * FROM (
    SELECT number%50 AS n FROM numbers(100)
) ORDER BY n LIMIT 0,5 WITH TIES

┌─n─┐
│ 0 │
│ 0 │
│ 1 │
│ 1 │
│ 2 │
│ 2 │
└───┘
--Row 6 is included because its n field has the same value "2" as row 5

LIMIT BY clause

A query with a LIMIT n BY expressions clause selects the first n rows for each different expression value. The key of LIMIT BY can contain any number of expressions.

ClickHouse supports the following syntax variants:

  • LIMIT [offset_value, ]n BY expressions
  • LIMIT n OFFSET offset_value BY expressions

During query processing, ClickHouse selects data sorted by sort key.

The sorting key is set explicitly using an ORDER BY clause, or implicitly as a property of the table engine.

ClickHouse then applies LIMIT n BY expressions and returns up to N lines for each different expression combination.

If offset_value is specified, then for each data block belonging to a distinct combination of expressions, ClickHouse skips offset_value rows from the beginning of the block and returns at most n rows as a result. If offset_value is greater than the number of rows in the data block, ClickHouse returns zero rows from the block.

LIMIT BY is not related to LIMIT. They can both be used in the same query.

Examples

--Create a table
CREATE TABLE limit_by(id Int, val Int) ENGINE = Memory;
INSERT INTO limit_by VALUES (1, 10), (1, 11), (1, 12), (2, 20), (2, 21);
SELECT * FROM limit_by ORDER BY id, val LIMIT 2 BY id
┌─id─┬─val─┐
│  1 │  10 │
│  1 │  11 │
│  2 │  20 │
│  2 │  21 │
└────┴─────┘
SELECT * FROM limit_by ORDER BY id, val LIMIT 1, 2 BY id
┌─id─┬─val─┐
│  1 │  11 │
│  1 │  12 │
│  2 │  21 │
└────┴─────┘
--The query SELECT * FROM limit_by ORDER BY id, val LIMIT 2 OFFSET 1 BY id returns the same result

--The following query returns the top 5 referrers for each domain, device_type pair, with a maximum of 100 rows in total (LIMIT n BY + LIMIT).
SELECT
    domainWithoutWWW(URL) AS domain,
    domainWithoutWWW(REFERRER_URL) AS referrer,
    device_type,
    count() cnt
FROM hits
GROUP BY domain, referrer, device_type
ORDER BY cnt DESC
LIMIT 5 BY domain, device_type
LIMIT 100

ORDER BY clause

The ORDER BY clause contains a list of expressions, and each expression can use the DESC (descending) or ASC (ascending) modifier to determine the sorting direction. If no direction is specified, ASC is used, so it is usually omitted.

The sorting direction applies to a single expression, not to the entire list. Example: ORDER BY Visits DESC, SearchPhrase

Rows with the same value in the sort expression list are output in any order, which may be uncertain (different each time).

If the ORDER BY clause is omitted, the order of rows is undefined and the returned result is uncertain.

Special value sorting

NaN and NULL values can be sorted in two ways:

  • NULLS LAST modifier (the default): first the other values, then NaN, then NULL.
  • NULLS FIRST modifier: first NULL, then NaN, then the other values.

Examples

┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 2 │    2 │
│ 1 │  nan │
│ 2 │    2 │
│ 3 │    4 │
│ 5 │    6 │
│ 6 │  nan │
│ 7 │ ᴺᵁᴸᴸ │
│ 6 │    7 │
│ 8 │    9 │
└───┴──────┘

SELECT * FROM t_null_nan ORDER BY y NULLS FIRST
┌─x─┬────y─┐
│ 1 │ ᴺᵁᴸᴸ │
│ 7 │ ᴺᵁᴸᴸ │
│ 1 │  nan │
│ 6 │  nan │
│ 2 │    2 │
│ 2 │    2 │
│ 3 │    4 │
│ 5 │    6 │
│ 6 │    7 │
│ 8 │    9 │
└───┴──────┘

When sorting floating-point numbers, NaN is separated from other values. NaN comes last regardless of the sort order.  

In other words, for ascending sort, they are larger than all other numbers, and for descending sort, they are smaller than the rest.

Collation support

For sorting by String values, you can specify a collation (comparison). Example: ORDER BY SearchPhrase COLLATE 'tr' sorts by keyword in ascending order using the Turkish alphabet, case insensitively, assuming that strings are UTF-8 encoded.

COLLATE can be specified or not for each expression in ORDER BY independently. If ASC or DESC is specified, COLLATE is specified after it. When using COLLATE, sorting is always case-insensitive.

We only recommend using COLLATE for the final sorting of a small number of rows, because sorting with COLLATE is not as efficient as regular sorting by bytes.
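
A minimal sketch (the collate_test table and its String column s are hypothetical):

SELECT * FROM collate_test ORDER BY s ASC COLLATE 'tr' LIMIT 100;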

Implementation details

If a sufficiently small LIMIT is specified in addition to ORDER BY, less RAM is used. Otherwise, the amount of memory spent is proportional to the amount of data to be sorted.

For distributed query processing, if GROUP BY is omitted, the sorting is partially completed on the remote server, and then the results are merged on the requester server. This means that for distributed sorting, the amount of data to be sorted may be greater than the amount of memory on a single server.

If there is not enough RAM, sorting can be performed in external storage (creating temporary files on disk). Use the max_bytes_before_external_sort setting for this; it is disabled (set to 0) by default.

If it is enabled, when the volume of data to sort reaches the specified number of bytes, the collected data is sorted and dumped into a temporary file. After all the data has been read, all the sorted files are merged and the results are output. The temporary file path is set by the tmp_path server configuration parameter.

Running a query may use more memory than max_bytes_before_external_sort. For this reason, this setting must have a value significantly smaller than max_memory_usage. For example, if your server has 128 GB of RAM and you need to run a single query, set max_memory_usage to 100 GB and max_bytes_before_external_sort to 80 GB.

External sorting is far less efficient than RAM sorting.
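
A sketch of enabling external sorting for a session, matching the 128 GB server example above (the values are illustrative):

SET max_memory_usage = 100000000000;               -- about 100 GB
SET max_bytes_before_external_sort = 80000000000;  -- about 80 GB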

Optimize data reading

If a prefix of the ORDER BY expression matches the table sorting key, you can optimize the query by using the optimize_read_in_order setting.

When the optimize_read_in_order setting is enabled, the ClickHouse server uses the table index and reads the data in order of the ORDER BY key. This makes it possible to avoid reading all the data when a LIMIT is specified, so queries on big data with a small limit are processed faster.

The optimization works with both ASC and DESC, but it does not work together with the GROUP BY clause or the FINAL modifier.

When optimize_read_in_order is disabled, the ClickHouse server does not use the table index while processing SELECT queries.

Consider disabling optimize_read_in_order manually when running queries that have an ORDER BY clause, a large LIMIT, and a WHERE condition that requires reading a huge amount of records before the queried data is found.

The following table engine supports Optimization:

  • MergeTree engine
  • Merge, Buffer and MaterializedView table engines over MergeTree engine tables

In MaterializedView engine tables, the optimization works with views like SELECT ... FROM merge_tree_table ORDER BY pk. However, it is not supported for queries like SELECT ... FROM view ORDER BY pk if the view query does not have an ORDER BY clause.
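
A minimal sketch using a per-query SETTINGS clause; merge_tree_table and its sorting key pk are taken from the view example above and are illustrative:

--read in ORDER BY key order and stop early thanks to LIMIT
SELECT * FROM merge_tree_table ORDER BY pk LIMIT 10 SETTINGS optimize_read_in_order = 1;

--manually disable the optimization for a heavily filtered query with a large LIMIT
SELECT * FROM merge_tree_table WHERE pk > 1000000 ORDER BY pk LIMIT 1000000 SETTINGS optimize_read_in_order = 0;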

ORDER BY Expr WITH FILL modifier

This modifier can also be used in conjunction with the LIMIT... WITH TIES modifier.

The WITH FILL modifier can be set after ORDER BY expr using the optional parameters FROM expr, TO expr and STEP expr.

All missing values for the expr column will be populated sequentially, while other columns will be populated with default values.

Fill multiple columns with the following syntax, and then add the WITH FILL modifier and optional parameters after each field name in the ORDER BY section.

ORDER BY expr [WITH FILL] [FROM const_expr] [TO const_expr] [STEP const_numeric_expr], ... exprN [WITH FILL] [FROM expr] [TO expr] [STEP numeric_expr]

 

WITH FILL applies only to numeric (all types of floating point, decimal, integer) or date / datetime type fields.

  • When FROM const_expr is not defined, the fill sequence uses the minimum value of the expr field from ORDER BY.
  • When TO const_expr is not defined, the fill sequence uses the maximum value of the expr field from ORDER BY.
  • When STEP const_numeric_expr is defined, const_numeric_expr is interpreted as-is for numeric types, as days for the Date type, and as seconds for the DateTime type.
  • When STEP const_numeric_expr is omitted, the fill sequence uses 1.0 for numeric types, 1 day for the Date type, and 1 second for the DateTime type.

Examples

SELECT n, source FROM (
   SELECT toFloat32(number % 10) AS n, 'original' AS source
   FROM numbers(10) WHERE number % 3 = 1
) ORDER BY n
┌─n─┬─source───┐
│ 1 │ original │
│ 4 │ original │
│ 7 │ original │
└───┴──────────┘

--use WITH FILL
SELECT n, source FROM (
   SELECT toFloat32(number % 10) AS n, 'original' AS source
   FROM numbers(10) WHERE number % 3 = 1
) ORDER BY n WITH FILL FROM 0 TO 5.51 STEP 0.5
┌───n─┬─source───┐
│   0 │          │
│ 0.5 │          │
│   1 │ original │
│ 1.5 │          │
│   2 │          │
│ 2.5 │          │
│   3 │          │
│ 3.5 │          │
│   4 │ original │
│ 4.5 │          │
│   5 │          │
│ 5.5 │          │
│   7 │ original │
└─────┴──────────┘

When there are multiple ORDER BY fields:

SELECT
    toDate((number * 10) * 86400) AS d1,
    toDate(number * 86400) AS d2,
    'original' AS source
FROM numbers(10)
WHERE (number % 3) = 1
ORDER BY
    d2 WITH FILL,
    d1 WITH FILL STEP 5;
┌───d1───────┬───d2───────┬─source───┐
│ 1970-01-11 │ 1970-01-02 │ original │
│ 1970-01-01 │ 1970-01-03 │          │
│ 1970-01-01 │ 1970-01-04 │          │
│ 1970-02-10 │ 1970-01-05 │ original │
│ 1970-01-01 │ 1970-01-06 │          │
│ 1970-01-01 │ 1970-01-07 │          │
│ 1970-03-12 │ 1970-01-08 │ original │
└────────────┴────────────┴──────────┘
SELECT
    toDate((number * 10) * 86400) AS d1,
    toDate(number * 86400) AS d2,
    'original' AS source
FROM numbers(10)
WHERE (number % 3) = 1
ORDER BY
    d1 WITH FILL STEP 5,
    d2 WITH FILL;

┌───d1───────┬───d2───────┬─source───┐
│ 1970-01-11 │ 1970-01-02 │ original │
│ 1970-01-16 │ 1970-01-01 │          │
│ 1970-01-21 │ 1970-01-01 │          │
│ 1970-01-26 │ 1970-01-01 │          │
│ 1970-01-31 │ 1970-01-01 │          │
│ 1970-02-05 │ 1970-01-01 │          │
│ 1970-02-10 │ 1970-01-05 │ original │
│ 1970-02-15 │ 1970-01-01 │          │
│ 1970-02-20 │ 1970-01-01 │          │
│ 1970-02-25 │ 1970-01-01 │          │
│ 1970-03-02 │ 1970-01-01 │          │
│ 1970-03-07 │ 1970-01-01 │          │
│ 1970-03-12 │ 1970-01-08 │ original │
└────────────┴────────────┴──────────┘

OFFSET FETCH modifier

OFFSET and FETCH allow you to retrieve data in portions, specifying a block of rows that you want to get with a single query.

[OFFSET offset_row_count {ROW | ROWS}] [FETCH {FIRST | NEXT} fetch_row_count {ROW | ROWS} {ONLY | WITH TIES}]

The offset_row_count and fetch_row_count values can be a number or a literal constant. You can omit fetch_row_count; by default, it equals 1.

OFFSET specifies the number of rows to skip from the beginning before starting to return rows, and FETCH specifies the maximum number of rows that can be in the query result. The ONLY option returns the rows that immediately follow the rows omitted by OFFSET.

In this case, FETCH is an alternative to the LIMIT clause.

Examples

--The following two queries return the same results
SELECT * FROM test_fetch ORDER BY a OFFSET 1 ROW FETCH FIRST 3 ROWS ONLY;

SELECT * FROM test_fetch ORDER BY a LIMIT 3 OFFSET 1;

The WITH TIES option is used to additionally return any rows that tie for the last place in the result set according to the ORDER BY clause. For example, if fetch_row_count is set to 5, but two additional rows match the ORDER BY column values of the fifth row, the result set will contain seven rows.

According to the standard, if both exist, the OFFSET clause must precede the FETCH clause.

--Consider table
┌─a─┬─b─┐
│ 1 │ 1 │
│ 2 │ 1 │
│ 3 │ 4 │
│ 1 │ 3 │
│ 5 │ 4 │
│ 0 │ 6 │
│ 5 │ 7 │
└───┴───┘

--use ONLY
SELECT * FROM test_fetch ORDER BY a OFFSET 3 ROW FETCH FIRST 3 ROWS ONLY;
┌─a─┬─b─┐
│ 2 │ 1 │
│ 3 │ 4 │
│ 5 │ 4 │
└───┴───┘

--use WITH TIES
SELECT * FROM test_fetch ORDER BY a OFFSET 3 ROW FETCH FIRST 3 ROWS WITH TIES;
┌─a─┬─b─┐
│ 2 │ 1 │
│ 3 │ 4 │
│ 5 │ 4 │
│ 5 │ 7 │
└───┴───┘

PREWHERE clause

PREWHERE is an optimization that applies filtering more efficiently. It is enabled by default even when the PREWHERE clause is not specified explicitly: it works by automatically moving part of the WHERE condition to the PREWHERE stage.

The point of the PREWHERE clause is to control this optimization manually, if you think you can do it better than it happens by default.

Manually controlling PREWHERE

This clause has the same meaning as the WHERE clause.

The difference is in which data is read from the table. When manually controlling PREWHERE, use it for filtering conditions on a minority of the columns in the query that nevertheless provide strong data filtering. This reduces the volume of data to read.

The query can specify both PREWHERE and WHERE. In this case, PREWHERE works before WHERE.
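
A minimal sketch, reusing the hits table from other examples in this article (the Title, CounterID and EventDate columns are assumptions): the cheap, highly selective condition goes into PREWHERE so the remaining columns are read only for rows that pass it.

SELECT Title, count() AS PageViews
FROM hits
PREWHERE CounterID = 34
WHERE EventDate >= '2020-01-01'
GROUP BY Title;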

If the optimize_move_to_prewhere setting is set to 0, the automatic moving of expression parts from WHERE to PREWHERE is disabled.

Limitations

PREWHERE is only supported by tables from the MergeTree family of engines.

SAMPLE clause

The SAMPLE clause allows approximate SELECT query processing.

When data sampling is enabled, the query is not performed on all the data, but only on a certain fraction of it (a sample). For example, if you need to calculate statistics for all visits, it is enough to execute the query on 1/10 of all visits and then multiply the result by 10.

Approximate query processing is used in the following scenarios:

  • If the original data is not accurate, the approximation will not significantly reduce the quality.
  • The business requirements target approximate results (for cost-effectiveness, or to market exact results to premium users).
  • You have strict timing requirements (like <100 ms), but you cannot justify the cost of additional hardware resources to meet them.

You can use sampling on tables in the MergeTree family only if a sampling expression is specified during table creation (see the MergeTree engine).

The functions of data sampling are as follows:

  • Data sampling is a deterministic mechanism. The result of the same SELECT ... SAMPLE query is always the same.
  • For tables with a single sampling key, a sample with the same coefficient always selects the same subset of data.
  • Samples can be used in subqueries in the IN clause. You can also join samples using the JOIN clause.
  • Sampling allows reading less data from disk. Note that you must specify the sampling key correctly.

Usage:

  • SAMPLE k, where k is a number from 0 to 1: SAMPLE 0.1 runs the query on 10% of the data.
  • SAMPLE n, where n is a sufficiently large integer: SAMPLE 100 runs the query on a sample of at least 100 rows.
  • SAMPLE k OFFSET m, where k and m are numbers from 0 to 1: the query is executed on a sample of the k fraction of the data, offset by the m fraction.

SAMPLE K

--Take 10% of the data. Because aggregate function values are not corrected automatically, to get an approximate result, multiply count() by 10 manually.
SELECT
    Title,
    count() * 10 AS PageViews
FROM hits_distributed
SAMPLE 0.1
WHERE
    CounterID = 34
GROUP BY Title
ORDER BY PageViews DESC LIMIT 1000

SAMPLE N

Here, n is an integer large enough. For example, SAMPLE 10000000. Run the query on at least 10000000 rows.

Since the minimum unit of data reading is one granule (its size is set by the index_granularity setting), it makes sense to set a sample that is much larger than the size of a granule.

When using the SAMPLE n clause, you do not know which relative percentage of data was processed, so you do not know the coefficient the aggregate functions should be multiplied by. Use the _sample_factor virtual column to get the approximate result. The _sample_factor column contains relative coefficients that are calculated dynamically. This column is created automatically when you create a table with the specified sampling key. Usage examples of the _sample_factor column are shown below.

SELECT sum(PageViews * _sample_factor)
FROM visits
SAMPLE 10000000

SELECT sum(_sample_factor)
FROM visits
SAMPLE 10000000

--The average does not require a factor
SELECT avg(Duration)
FROM visits
SAMPLE 10000000

SAMPLE K OFFSET M

--A sample of 1/10 of all the data is taken:
SAMPLE 1/10
[++------------]

--A sample of 10% is taken from the second half of the data:
SAMPLE 1/10 OFFSET 1/2
[------++------]
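
A minimal sketch, reusing the visits table from the SAMPLE N examples: the query runs on a 10% sample taken from the second half of the data, and the result is scaled manually as in the SAMPLE K section.

SELECT count() * 10 AS approx_rows
FROM visits
SAMPLE 1/10 OFFSET 1/2;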

UNION clause

By default, UNION behaves the same as UNION DISTINCT, but you can specify the union mode with the union_default_mode setting. Its value can be 'ALL', 'DISTINCT' or an empty string.

However, if you use UNION with union_default_mode set to an empty string, an exception will be thrown.

UNION ALL

You can use UNION ALL to combine any number of SELECT queries by extending their results.

SELECT CounterID, 1 AS table, toInt64(count()) AS c
    FROM test.hits
    GROUP BY CounterID

UNION ALL

SELECT CounterID, 2 AS table, sum(Sign) AS c
    FROM test.visits
    GROUP BY CounterID
    HAVING c > 0

Result columns are matched by their index (order inside SELECT). If the column names do not match, the names for the final result are taken from the first query.

Type casting is performed for the union. For example, if two queries being combined have the same field with non-Nullable and Nullable types of a compatible type, the resulting UNION ALL field will have a Nullable type.

Queries that are parts of UNION ALL cannot be enclosed in parentheses. ORDER BY and LIMIT are applied to the individual queries, not to the final result. If you need to apply a transformation to the final result, you can put all the queries with UNION ALL in a subquery in the FROM clause.

UNION DISTINCT

The difference between UNION ALL and UNION DISTINCT is that UNION DISTINCT performs a distinct transform on the union result; it is equivalent to SELECT DISTINCT from a subquery containing UNION ALL.

Implementation details

Queries belonging to UNION / UNION ALL / UNION DISTINCT can be run simultaneously, and their results can be mixed together.

WHERE clause

The WHERE clause allows you to filter the data coming from the FROM clause of SELECT.

If there is a WHERE clause, it must contain an expression of type UInt8. This is usually an expression with comparison and logical operators. Rows where this expression evaluates to 0 are excluded from further transformations and from the result.

WHERE expressions are evaluated with respect to the ability to use indexes and partition pruning, if the underlying table engine supports that.

There is also a filtering optimization called PREWHERE.
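
A minimal sketch using the built-in numbers table function: the condition is a UInt8 expression, and rows where it evaluates to 0 are filtered out.

SELECT number FROM numbers(10) WHERE number % 2 = 1;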

WITH Clause

ClickHouse supports Common Table Expressions (CTEs): the results of a WITH clause can be used in the rest of the SELECT query.

Named subqueries can be included in the current and child query contexts in any place where table objects are allowed.

Recursion is prevented by hiding the current-level CTEs from the WITH expression.

Syntax

WITH <expression> AS <identifier>
WITH <identifier> AS <subquery expression>

Examples

--Use constant expressions as variables
WITH '2019-08-01 15:23:00' as ts_upper_bound
SELECT *
FROM hits
WHERE
    EventDate = toDate(ts_upper_bound) AND
    EventTime <= ts_upper_bound;

--Evicting the sum(bytes) expression result from the SELECT clause column list
WITH sum(bytes) as s
SELECT
    formatReadableSize(s),
    table
FROM system.parts
GROUP BY table
ORDER BY s;

--Using the result of a scalar subquery
/* this example would return TOP 10 of most huge tables */
WITH
    (
        SELECT sum(bytes)
        FROM system.parts
        WHERE active
    ) AS total_disk_usage
SELECT
    (sum(bytes) / total_disk_usage) * 100 AS table_disk_usage,
    table
FROM system.parts
GROUP BY table
ORDER BY table_disk_usage DESC
LIMIT 10;


--Reuse expressions in subqueries
WITH test1 AS (SELECT i + 1, j + 1 FROM test1) 
SELECT * FROM test1;
