Clickhouse Metabase



In 2018, I wrote an article about Clickhouse. That piece is still pretty popular across the internet and has even been translated a few times. More than two years have passed since, and the pace of Clickhouse development is not slowing down: 800 merged PRs just during the last month! Didn't that blow your mind? Check out the full changelog, for example for 2020: https://clickhouse.tech/docs/en/whats-new/changelog/2020/. Just reading the descriptions of each year's new features may take an hour.



For the sake of honest comparison, the ElasticSearch repo has a jaw-dropping 1076 PRs merged for the same month, and in terms of features their pace is very impressive as well!

We are using Clickhouse for log storage and analytics in the ApiRoad.net project (an API marketplace where developers sell their APIs, still in active development), and we are happy with the results so far. As an API developer myself, I know how important observability and analysis of the HTTP request/response cycle are for maintaining quality of service and quickly detecting bugs; this is especially true for a pure API service. (If you are an API author and want to use the ApiRoad analytics & billing platform to sell API subscriptions, drop me a message at contact@apiroad.net with your API description - I will be happy to chat!)

We also use the ELK stack (ElasticSearch, Logstash, Filebeat, Kibana) on other projects, for very similar purposes: collecting HTTP and mail logs for later analysis and search via Kibana.

And, of course, we use MySQL. Everywhere!

This post is about the major reasons why we chose Clickhouse, and not ElasticSearch (or MySQL), as the storage solution for ApiRoad.net's essential data - request logs. (Important note: we still use MySQL there, for OLTP purposes.)

1. SQL support, JSON and arrays as first-class citizens

SQL is a perfect language for analytics. I love the SQL query language, and an SQL schema is a perfect example of boring tech that I recommend using as the source of truth for all data in 99% of projects: if the project code is not perfect, you can improve it relatively easily as long as your database state is strongly structured. If your database state is a huge JSON blob (NoSQL) and no one can fully grasp the structure of the data, such refactoring usually gets much more problematic.


I saw this happen, especially in older projects with MongoDB, where every new analytics report and every new refactoring involving data migration is a big pain. Starting such projects is fun - you don't need to spend time carefully designing the complete project schema, you just 'see how it goes' - but maintaining them is not fun!


But it is important to note that this rule of thumb - 'use a strict schema' - is not that critical for log storage use cases. That's why ElasticSearch is so successful: it has many strong sides, and a flexible schema is one of them.


Back to JSON: traditional RDBMSs are still catching up with NoSQL databases in terms of JSON querying and syntax, and we should admit that JSON is a very convenient format for dynamic structures (like log storage).


Clickhouse is a modern engine that was designed and built when JSON was already a thing (unlike MySQL and Postgres). Clickhouse does not have to carry the luggage of backward compatibility and the strict SQL standards of these super-popular RDBMSs, so the Clickhouse team can move fast in terms of features and improvements, and they indeed move fast. The developers of Clickhouse had more opportunity to hit a sweet balance between strict relational schemas and JSON flexibility, and I think they did a good job here. Clickhouse tries to compete with Google BigQuery and other big players in the analytics field, so it got many improvements over 'standard' SQL, which makes its syntax a killer combo for analytics and various calculation purposes, in many cases much better than what you get in a traditional RDBMS.

Some basic examples:

In MySQL, you can extract JSON fields, but complex JSON processing, like joining relational data onto JSON data, became available only recently, in version 8 with the JSON_TABLE function. In PostgreSQL the situation is even worse - no direct JSON_TABLE alternative until PostgreSQL 12!
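For illustration, here is a minimal MySQL 8 sketch of JSON_TABLE turning a JSON array into rows that can be joined onto relational data; the users table and tags_json column are hypothetical:

    -- MySQL 8 sketch: users is a hypothetical table with a JSON column tags_json, e.g. '["api", "billing"]'
    SELECT u.name, j.tag
    FROM users AS u,
         JSON_TABLE(
             u.tags_json,
             '$[*]' COLUMNS (tag VARCHAR(64) PATH '$')
         ) AS j;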

Compare that to Clickhouse's JSON and array feature set - it is just miles ahead. Links:

These are useful in a lot of cases where you would use generate_series() in PostgreSQL. A concrete example from ApiRoad: we need to map the number of requests onto a Chart.js timeline. If you do a regular SELECT ... GROUP BY day, you will get gaps for days that did not have any queries. And we don't need gaps, we need zeros there, right? This is exactly where the generate_series() function is useful in PostgreSQL. In MySQL, the recommendation is to create a calendar stub table and join on it... not too elegant, huh?

Here is how to do it in ElasticSearch: https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket-datehistogram-aggregation.html#_missing_value_2

Regarding the query language: I am still not comfortable with the verbosity and approach of ElasticSearch's Lucene syntax, the HTTP API, and all those JSON structures you need to write just to retrieve some data. SQL is my preferred choice.

Here is the Clickhouse solution for dates gap filling:
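A sketch of such a query, assuming a hypothetical logs table with a DateTime timestamp column:

    -- generate the last 30 days and LEFT JOIN the per-day counts onto them;
    -- days with no requests get cnt = 0 (the column's default value)
    SELECT day, cnt AS requests
    FROM
    (
        SELECT arrayJoin(arrayMap(x -> today() - x, range(30))) AS day
    ) AS calendar
    LEFT JOIN
    (
        SELECT toDate(timestamp) AS day, count() AS cnt
        FROM logs
        WHERE timestamp > now() - INTERVAL 30 DAY
        GROUP BY day
    ) AS stats USING (day)
    ORDER BY day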

Here, we generate a virtual table via a lambda function and a range loop, and then LEFT JOIN it onto the results from the logs table grouped by day.

I think the arrayJoin + arrayMap + range functions allow more flexibility than generate_series() in Postgres or the ElasticSearch approach. There is also the WITH FILL modifier available for a more concise syntax.
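For comparison, a minimal WITH FILL sketch, under the same assumptions about the logs table:

    SELECT toDate(timestamp) AS day, count() AS cnt
    FROM logs
    WHERE timestamp > now() - INTERVAL 30 DAY
    GROUP BY day
    ORDER BY day WITH FILL  -- fills the missing days between the first and last day returned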

2. Flexible schema - but strict when you need it

For log storage tasks, the exact data schema often evolves during the project's lifetime, and ElasticSearch allows you to put a huge JSON blob into an index and figure out field types and indexing later. Clickhouse allows the same approach: you can put data into a String field containing JSON and filter it relatively quickly, though it won't be quick at terabyte scale. Then, when you see that you often need fast queries on a specific field, you add a materialized column to your logs table, and that column extracts the value from the existing JSON on the fly. This allows much faster queries on terabytes of data.
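A minimal sketch of adding such a materialized column, assuming the raw log line is kept in a String column called raw_json and we want fast filtering on the HTTP status code:

    -- extract the status code from the JSON once, at insert time, instead of on every query
    ALTER TABLE logs
        ADD COLUMN status_code UInt16
        MATERIALIZED toUInt16(JSONExtractUInt(raw_json, 'status'))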

I recommend this video from Altinity on the topic of JSON vs Tabular schema for log data storage:

3. Storage and Query Efficiency

Clickhouse is very fast at SELECTs; this was discussed in the previous article.

What is interesting, there is evidence that Clickhouse can be 5-6 times more efficient in storage compared to ElasticSearch, while also being literally an order of magnitude faster in terms of queries. There is another comparison (in Russian) as well.

There are no direct benchmarks - at least I could not find any - I believe because Clickhouse and ElasticSearch are very different in terms of query syntax, cache implementations, and their overall nature.

If we talk about MySQL, any imperfect query or missing index on a table with a mere 100 million rows of log data can make your server crawl and swap; MySQL is not really suited for large-scale log queries. In terms of storage, though, compressed InnoDB tables are surprisingly not that bad. Of course, compression is much worse than Clickhouse's (sorry, no benchmark URLs to support the claim this time) due to the row-based storage, but it still often reduces cost significantly without a big performance hit. We use compressed InnoDB tables in some cases for small-scale log purposes.

4. Statistics functions

Getting the median and 99th-percentile latency of 404 responses is easy in Clickhouse:
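Something along these lines, assuming a logs table with status and response_time_ms columns:

    SELECT
        quantileTiming(0.5)(response_time_ms)  AS median_ms,
        quantileTiming(0.99)(response_time_ms) AS p99_ms
    FROM logs
    WHERE status = 404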

Notice the usage of the quantileTiming function and how currying is elegantly used here. Clickhouse has a generic quantile function, but quantileTiming is optimized for working with sequences that describe distributions like web page load times or backend response times.

There is more. Want a weighted arithmetic mean? Want to calculate a linear regression? It's easy - just use the specialized function.
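A quick sketch with hypothetical column names:

    SELECT
        avgWeighted(response_time_ms, response_size)                  AS weighted_avg_latency,
        simpleLinearRegression(toUInt32(timestamp), response_time_ms) AS latency_trend  -- returns (k, b) of y = k*x + b
    FROM logs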

Here is a full list of statistics functions of Clickhouse:

Most of these are problematic to get in MySQL.

ElasticSearch is much better at this than MySQL; it has both quantiles and weighted medians, but it still does not have linear regression.

5. MySQL and Clickhouse tight integration

MySQL and Clickhouse have integrations on multiple levels, which makes it easy to use them together with a minimum of data duplication (see the sketch after the list):

  • MySQL table function, to connect to a MySQL table in a specific SELECT query
  • MySQL table engine, to describe a specific MySQL table statically in a CREATE TABLE statement
  • MySQL database engine - similar to the previous one but dynamic, without binlog
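A sketch of all three, with hypothetical hosts, databases, and credentials:

    -- table function: query a remote MySQL table ad hoc
    SELECT * FROM mysql('mysql-host:3306', 'shop', 'customers', 'user', 'password') LIMIT 10;

    -- table engine: describe one MySQL table statically
    CREATE TABLE customers_mysql (id UInt64, email String)
    ENGINE = MySQL('mysql-host:3306', 'shop', 'customers', 'user', 'password');

    -- database engine: expose the whole remote MySQL database dynamically
    CREATE DATABASE shop_mirror ENGINE = MySQL('mysql-host:3306', 'shop', 'user', 'password');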

I can't say for sure how fast and stable the dynamic database and table engines are on JOINs - this definitely requires benchmarks - but the concept is very appealing: you have a full, up-to-date clone of your MySQL tables in your Clickhouse database, and you don't have to deal with cache invalidation and reindexing.

Regarding using MySQL with Elasticsearch, my limited experience says that these two technologies are just too different; my impression is that they speak foreign languages and do not play 'together'. What I usually did was simply JSONify all the data I needed to index in ElasticSearch and send it there. Then, after some migration or any other UPDATE/REPLACE happened on the MySQL data, I had to figure out the re-indexing part on the Elasticsearch side. Here is an article on the Logstash-powered approach to syncing MySQL and ElasticSearch. I should say I don't really enjoy Logstash, for its mediocre performance and RAM requirements, and because it is another moving part that can break. This syncing and re-indexing task is often a significant stop factor for us when considering Elasticsearch in simple projects with MySQL.

6. New Features

Want to attach a CSV stored in S3 and treat it as a table in Clickhouse? Easy.
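A sketch with a hypothetical public bucket and column list:

    SELECT *
    FROM s3(
        'https://my-bucket.s3.amazonaws.com/logs/2021-01.csv',
        'CSVWithNames',
        'date Date, endpoint String, status UInt16'
    )
    LIMIT 10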

Want to update or delete log rows to be compliant with GDPR? Now this is easy!

There was no clean way to delete or update data in Clickhouse in 2018 when my first article was written, and it was a real downside. Now it's not an issue anymore: Clickhouse supports custom SQL syntax to delete rows:
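For example, with a hypothetical user_id column:

    ALTER TABLE logs DELETE WHERE user_id = 42;
    -- updates use the same mutation mechanism:
    ALTER TABLE logs UPDATE user_agent = '' WHERE user_id = 42;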

It is implemented this way to make it explicit that deleting is still a pretty expensive operation for Clickhouse (and other columnar databases), and you should not do it every second in production.

7. Cons


There are cons to Clickhouse compared to ElasticSearch. First of all, if you build internal analytics on top of log storage, you want the best GUI tool out there, and Kibana is good for this purpose nowadays when you compare it to Grafana (at least this point of view is very popular on the Internet; the Grafana UI is not always that slick). If you use Clickhouse, you have to stick with Grafana or Redash. (Metabase, which we adore, also got Clickhouse support!)

But in our case, in the ApiRoad.net project, we are building customer-facing analytics, so we have to build the analytics GUI from scratch anyway (we are using a wonderful stack of Laravel, Inertia.js, Vue.js, and Chart.js to implement the customer portal, by the way).

Another issue is related to the ecosystem: the selection of tools to consume and process data and send it to Clickhouse is somewhat limited. For Elasticsearch there are Logstash and Filebeat, tools native to the Elastic ecosystem and designed to work well together. Luckily, Logstash can also be used to ship data to Clickhouse, which mitigates the issue. At ApiRoad we use our own custom-built Node.js log shipper, which aggregates logs and sends them to Clickhouse in batches (because Clickhouse likes big batches and does not like small INSERTs).

What I also don't like in Clickhouse is the weird naming of some functions, which are there because Clickhouse was created for Yandex.Metrika (a Google Analytics competitor): e.g. visitParamHas() is a function that checks whether a key exists in JSON. Generic purpose, bad non-generic name. I should mention that there is a bunch of fresh JSON functions with good names, e.g. JSONHas(), with one interesting detail: they use a different JSON parsing engine which is more standards-compliant but a bit slower, as far as I understand.
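A quick side-by-side of the two naming styles:

    SELECT
        visitParamHas('{"status":200,"path":"/v1/users"}', 'status') AS old_name,
        JSONHas('{"status":200,"path":"/v1/users"}', 'status')       AS new_name
    -- both return 1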

Conclusion

ElasticSearch is a very powerful solution, but I think its strongest side is still huge setups with 10+ nodes, used for large-scale full-text search and facets, complex indexing, and score calculation - this is where ElasticSearch shines. When we talk about time series and log storage, my feeling is that there are better solutions, and Clickhouse is one of them. The ElasticSearch API is enormous, and in a lot of cases it's hard to remember how to do one exact thing without copy-pasting the HTTP request from the documentation; it just feels 'enterprisey' and 'Java-flavored'. Both Clickhouse and ElasticSearch are memory-hungry apps, but the RAM requirement for a minimal Clickhouse production installation is 4GB, while for ElasticSearch it is around 16GB. I also think the Elastic team's focus is getting pretty wide and blurred with all the new amazing machine-learning features they ship; my humble opinion is that, while these features sound very modern and trendy, such an enormous feature set is just impossible to support and improve, no matter how many developers and how much money you have, so ElasticSearch more and more falls into the 'jack of all trades, master of none' category for me. Maybe I am wrong.


Clickhouse just feels different. Setup is easy. SQL is easy. The console client is wonderful. Everything feels light and makes sense, even for smaller setups, but rich features, replicas, and shards for terabytes of data are there when you need them.

Good external links with further info on Clickhouse:

UPD: this post hit #1 on Hacker News - useful comments there, as usual!

Best comments:

ClickHouse is incredible. It has also replaced a large, expensive and slow Elasticsearch cluster at Contentsquare. We are actually starting an internal team to improve it and upstream patches, email me if interested!
I'm happy that more people are 'discovering' ClickHouse. ClickHouse is an outstanding product, with great capabilities that serve a wide array of big data use cases. It's simple to deploy, simple to operate, simple to ingest large amounts of data, simple to scale, and simple to query. We've been using ClickHouse to handle 100's of TB of data for workloads that require ranking on multi-dimensional timeseries aggregations, and we can resolve most complex queries in less than 500ms under load.

Also from HN:

ClickHouse is known as a data analytics processing engine. ClickHouse is an open-source column-oriented database management system capable of generating analytical data reports in real time using SQL queries.

Clickhouse has come a long way since its inception three years ago.

Why Mydbops recommends ClickHouse for Analytics?

  • ClickHouse is a columnar store built for SORT/SEARCH query performance on very large volumes of data.
  • In columnar database systems, the values from different columns are stored separately, and data from the same column is stored together - this benefits the performance of analytical queries (ORDER BY / GROUP BY and aggregation SQL).
  • Columnar stores are best suited for analytics because they can retrieve just the needed columns instead of reading all rows and filtering out unneeded data, which makes data access faster.
  • Easy integration with MySQL and other DB engines (MySQL and Clickhouse data migration).

Need for Backup and Restore:

  • As DBAs, it is our responsibility to back up the data regularly for safety.
  • If the database crashes or some fatal error happens, a backup is the only way to restore the data and keep the loss to a minimum.

There are multiple ways of taking backups, but they all have their own shortcomings. We will discuss the two methods below and how to perform backup and restoration with them.

  • Clickhouse Client
  • Clickhouse backup tool

Method 1 (Using ClickHouse Client):

ClickHouse Client is a simple way to back up the data and restore it in ClickHouse without any additional tooling. We are going to back up the metadata and the data separately here.


Metadata Backup:

In this example, I am taking a dump of the structure of the table “test_table” from the database “testing” in the TabSeparatedRaw format. This format is only appropriate for outputting a query result, not for parsing (retrieving data to insert into a table), i.e. rows are written without escaping.
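A sketch of the statement, using the database and table names from the text; it is run through clickhouse-client with the output redirected to a file such as test_table.sql:

    -- e.g. clickhouse-client --query="..." > test_table.sql
    SHOW CREATE TABLE testing.test_table
    FORMAT TabSeparatedRaw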

Metadata Restore:

I have created a database named “testing1” and will try to restore the metadata backup taken earlier.
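Roughly like this, with the dump replayed through clickhouse-client:

    CREATE DATABASE testing1;
    -- then adjust the database name inside test_table.sql to testing1 and replay it,
    -- e.g. clickhouse-client --multiquery < test_table.sql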

Restoring the backup:

Metadata Validation:

Here is a comparison of the table structure from the dump file and from the restored table:

Data Backup:

Before taking the dump of the data, let us validate the count of records that are going to be backed up. We can validate them with a simple count query.
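For example, with the same database and table as above:

    SELECT count() FROM testing.test_table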

Here I’m taking the dump of the table “test_table” in the TabSeparated (TSV) format. In the tab-separated format, data is written row by row; each row contains values separated by tabs. Values are written in text format, without enclosing quotation marks, and with special characters escaped.
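A sketch of the dump, again run through clickhouse-client with the output redirected to a file such as test_table.tsv:

    -- e.g. clickhouse-client --query="..." > test_table.tsv
    SELECT * FROM testing.test_table
    FORMAT TabSeparated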

Data Restore:

We need to ensure the database and the table (metadata) have been created, and the table format should be the same as the source table's. Once the metadata is restored, the data dump can be restored.
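A sketch of the restore, feeding the dump back on stdin:

    -- e.g. clickhouse-client --query="INSERT INTO testing1.test_table FORMAT TabSeparated" < test_table.tsv
    INSERT INTO testing1.test_table FORMAT TabSeparated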

Once the data dump is restored, I cross-checked the count of the restored data in the database against the dump file.


We can write an automated program to take the metadata and data backups of each table. The recovery also has to be scripted.

Method 2 (clickhouse-backup):

clickhouse-backup is a tool for easy backup and restore with S3 (AWS) and GCS support. It is an open-source tool available on GitHub.

Features

  • Supports full and incremental backups.
  • Supports AWS, GCS, and Alibaba cloud object stores.
  • Easy configuration with environment variables.
  • Supports backup administration tasks like list, delete, and download.

Run the clickhouse-backup tool as the root user or the clickhouse user.

GLOBAL OPTIONS:

Default Backup Path:

The default backup path is /var/lib/clickhouse/backup/.

Important Note:

We shouldn’t change the file permissions for the default path /var/lib/clickhouse/backup/, as this path contains hard links. If we change the permissions or ownership of a hard link under the default path, it changes in the clickhouse data directory too, which can lead to data corruption.

Config File:

All options can be overwritten via environment variables.

  1. Backing up the data with the tool:

From the backup tool, I used the “create” option to create a new backup.

By default, when creating a backup with this tool, it creates the folders metadata and shadow under the backup directory.

The metadata directory contains the metadata file, i.e. the table structure.

The shadow directory contains the data files.

The default dump location is /var/lib/clickhouse/backup/.

[root@mydbopslabs202 testing]# cat test_table.sql

  2. Listing the dump files:

We can check the list of backups using the “list” option of the backup tool. It shows the dump files with their creation date and time.

  3. Restoring the dump file using the clickhouse-backup tool:

“Restore” is the option to restore the data from a dump file into the ClickHouse server.

When restoring with the clickhouse-backup tool, it first restores the metadata (the structure of the table) from the dump file in the metadata directory. Once the metadata is restored, it prepares the data by restoring the data files present in the shadow directory. Finally, it does an ALTER TABLE ... ATTACH PART; simply put, it adds the data to the table from the detached directory.

Validating the restore from the backup tool logs:

One of the pros of the backup tool is that it separates the metadata and the data files into separate folders - the metadata directory and the shadow directory under the backup directory. As mentioned earlier, the table structure is available in the metadata directory and the data files are available in the shadow directory.


There are some cons to the backup tool: the maximum backup size on remote storage is 5TB, and the tool supports only MergeTree-family table engines.

These are simple ways to back up and restore data on a ClickHouse server. We can choose the backup method based on our requirements and the size of the data in our environment. ClickHouse-Copier is another way to take backups. In upcoming posts, we will discuss Clickhouse further.