Thursday 24 April 2014

columns (2)

This is my second post about this topic: storing data in a column-oriented form. You can find the first post here.

In the first post the way the data is stored was not changed; it was still a MyISAM file in its pure form. With the algorithm presented in this post this will change, but only a little bit: I will add one more file in which the data is stored in a column-oriented form.

warning

The goal is to show how to speed up SELECT statements (and whether this is possible at all). The code presented here is not production-ready; it is, as written before, a proof of concept. It does not contain as many hardcoded details of the query as the code of the first shot did, but it is still not generally usable. It lacks error checking, handling of all possible cases, handling of concurrent access (no locking) and so on. And it is not thoroughly tested.

Nevertheless I hope to show what we can expect if we follow this route.

algorithm

In addition to the datafile of a MyISAM table I want to store the data of one (or more) column(s) in an extra file.

In MySQL/MariaDB the data is organized as databases (each a folder), and within a database as tables; each table is represented by its own set of files. This is how the MyISAM engine does it; other engines can handle this differently. We will base our test engine on MyISAM, so we will follow this route.
On my PC here the data is organized on the disk in this directory: ~/MariaDB/data/
and in this folder I have a database TestCol with tables TestBig and TestSmall. So one can find the table TestBig in ~/MariaDB/data/TestCol/TestBig.MYD.

For storing the additional data I will create a subdirectory in this folder, give it the name of the table and append '.cols'. In this folder I will create a file with the name of the column, appended by '.column', for storing the values of the column in a different form. Let's assume a columnar storage of the values of the column Hersteller of the table TestMyCols1Big; then the location of the new file will be: ~/MariaDB/data/TestCol/TestMyCols1Big.cols/Hersteller.column. This gives us unique path names for the files.
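The path construction can be sketched in a few lines of Python (a sketch only; the engine itself builds these paths in C++ inside the server, and the function name here is illustrative):

```python
import os

def column_file_path(datadir: str, db: str, table: str, column: str) -> str:
    """Build the path of the extra column file:
    <datadir>/<db>/<table>.cols/<column>.column"""
    return os.path.join(datadir, db, table + '.cols', column + '.column')

# example: the Hersteller column of TestMyCols1Big in database TestCol
path = column_file_path('data', 'TestCol', 'TestMyCols1Big', 'Hersteller')
```

Because the database name, table name and column name are each unique within their level, the resulting path is unique too.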

In such a column-file the data will be stored in entries with this structure:
  • len of the value = 1 byte
  • the value in the internal representation = no. of bytes as described before
  • the file-offset of the record in the MYD-file = 4 bytes
There are some restrictions introduced here: this algorithm can only handle VARCHARs up to 255 bytes in length. And it can only handle datafiles up to 4 GB in size, not bigger ones (MyISAM can). For our tests this should be sufficient.

Let me demonstrate this approach with an example (the relevant part is the column Hersteller, the last one):
MariaDB [TestCol]> select * from TestMyCols1Big;
+----------+--------+--------+-------+--------------------------------------+----------------------------+------------+
| Id       | PZN    | EVP    | HAP   | ArtikelBez                           | ArtikelText                | Hersteller |
+----------+--------+--------+-------+--------------------------------------+----------------------------+------------+
| 01000000 | 999999 | 132.36 | 0.00  | UROSAFE BEINBEUTEL 7732C      1  L   | SPRING DE LU AM K1 NO TOP5 | 25677      |
| 01000001 | 535818 | 0.00   | 18.91 | PK FUER TIERE    250 ML              | URETERSCHIENEAC4P66        | 36367      |
| 01000002 | 454652 | 0.00   | 0.00  | GLANDULA SUP SUIS INJ O    100 ST    | STIBIUM ARSENICOSUM  D10   | 10070      |
| 01000003 | 999999 | 7.93   | 11.10 | MEASURIT URIMETER 2K 200ML      1 ST | SPRING STYROPORKISS 30X40  | 26058      |
| 01000004 | 999999 | 0.00   | 23.58 | BORT KNOECHELST PO WE R S      1 ST  | NOVACOM BABY BETTN JUN HB  | 87240      |
| 01000005 | 999999 | 0.00   | 3.04  | MEZEREUM C 6     80 ST               | AZELAINSAEURE              | 36245      |
| 01000006 | 999999 | 0.00   | 0.00  | FUCUS VESICUL D 2    200 ST          | LUMB GURT M EINSA 09PE M   | 36260      |
| 01000007 | 999999 | 12.70  | 6.26  | CARBO ANIMALIS Q18     10 ML         | ZITRONE AETHERISCHES OEL   | 02484      |
| 01000008 | 999999 | 6.91   | 0.00  | GILOFA FEIN AM GR5 MARBEL      1 ST  | ORTEL C3 CERV 2391 G2 B    | 87240      |
| 01000009 | 999999 | 0.00   | 0.00  | SUPRIMA KRANK HEMD44/46 71      1 ST | CYPRIPEDIUM PUBESC D12     | 18017      |
+----------+--------+--------+-------+--------------------------------------+----------------------------+------------+
10 rows in set (0.00 sec)

If we store the values of the column Hersteller in the form as described the additional file will have this content:
august@AMD4:~/MariaDB/data/TestCol/TestMyCols1Big.cols$ od -t x1 -Ax Hersteller.column 
000000 05                         len of the entry following
       32 35 36 37 37             25677
       00 00 00 00                position of corresponding record in MYD-file

       05                         len
       33 36 33 36 37             36367
       64 00 00 00                position

       05                         len
       31 30 30 37 30             10070
       b4 00 00 00                position

       05 
       32 36 30 35 38             26058
       14 01 00 00 

       05 
       38 37 32 34 30             87240
       78 01 00 00 
       and so on
So every entry in this file is 10 bytes long; the exact size depends on how the table is created.
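The entry layout shown in the dump can be sketched in Python (a sketch only; the engine does this in C++, and the function names are illustrative). Note that the MYD offset is stored little-endian, which is why the second record's offset appears as 64 00 00 00 (= 100) in the dump:

```python
import struct

def pack_entry(value: bytes, myd_offset: int) -> bytes:
    """Pack one column-file entry: 1 length byte, the value,
    then the record's MYD offset as 4 little-endian bytes."""
    assert len(value) <= 255       # restriction: VARCHARs up to 255 bytes
    assert myd_offset < 2**32      # restriction: datafiles up to 4 GB
    return struct.pack('B', len(value)) + value + struct.pack('<I', myd_offset)

def unpack_entry(buf: bytes, pos: int):
    """Read one entry starting at pos; return (value, myd_offset, next_pos)."""
    length = buf[pos]
    value = buf[pos + 1:pos + 1 + length]
    (myd_offset,) = struct.unpack_from('<I', buf, pos + 1 + length)
    return value, myd_offset, pos + 1 + length + 4
```

For the 5-byte Hersteller values this gives 1 + 5 + 4 = 10 bytes per entry, exactly what the od output shows.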

Scanning through the column Hersteller can be done by reading the full file TestMyCols1Big.MYD (the usual way) or by scanning through the file Hersteller.column, which is much smaller. If a match is found we take the value of the position, go to the MYD-file, read the full record and hand it back to the database server.

code

So I created a storage engine MYCOLS1. For some internal reasons it was created as a static engine, not as a loadable plugin. You will find the code here. You may also look at this post for a description of how to add an engine: DUMMY.

As this storage engine is derived from the MyISAM engine we inherit a lot of code from it, but some things must be implemented in an extended way. First of all we need a way to mark a column that should be handled in our way: in the CREATE statement, add a comment containing the word columnar_storage to the definition of this column (I will show this in an example a few lines later). And naturally we have to modify the create() function to detect and handle this.
We must also modify the code for writing a row to the datafile: write_row() must be implemented. The same is valid for update_row() and delete_row() (for the latter a constant is defined to mark a deleted entry in the column-file).

For executing a SELECT-statement we must implement our own functions
  • rnd_init() for initializing all the structures needed
  • rnd_next() for scanning through the column-file
  • rnd_end() for clean-up

For this engine I didn't want to put the condition of the query into the code (as I did for the first test); I wanted a more flexible solution. So I need some code for analyzing the WHERE part of the SQL statement. You may look at this post for a description of this: WHERE.
I implemented cond_push() for analyzing the WHERE part. This code is restricted to handling a single condition, not the full WHERE clause: either a standalone condition or one that is part of an AND condition. In this routine the information gathered from the WHERE part is stored in a class ColumnarField.

As the WHERE part is analyzed in cond_push(), the real comparison is done in rnd_next() using the information from the class ColumnarField (if an instance exists). For the representation of the columnar data the class ColumnBuffer is used; it handles the data in a column-file.

testing

For a test I created 2 tables in the database TestCol using my own storage engine:
MariaDB [TestCol]> create table TestMyCols1Big (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5)  comment 'columnar_storage' )
    ->                 engine=MYCOLS1
    -> ;
Query OK, 0 rows affected (0.10 sec)

MariaDB [TestCol]> create table TestMyCols1Small (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez char(40),
    ->                 ArtikelText char(26),
    ->                 Hersteller  char(5)  comment 'columnar_storage' )
    ->                 engine=MYCOLS1
    -> ;
Query OK, 0 rows affected (0.36 sec)

MariaDB [TestCol]> 
Please look at the definition of the column Hersteller: the comment tells my storage engine to store the data of this column in the extra file.

And here you can see how the contents of the folder TestCol look after creating the tables (a bit modified):
august@AMD4:~/MariaDB/data/TestCol$ ls -l
insgesamt 1799756
-rw-rw---- 1 august august        65 Apr 27 10:04 db.opt
.........................
drwxrw---- 2 august august      4096 Mai  7 15:27 TestMyCols1Big.cols
-rw-rw---- 1 august august       698 Mai  7 15:34 TestMyCols1Big.frm
-rw-rw---- 1 august august         0 Mai  7 15:34 TestMyCols1Big.MYD
-rw-rw---- 1 august august      1024 Mai  7 15:34 TestMyCols1Big.MYI
drwxrw---- 2 august august      4096 Mai  7 15:34 TestMyCols1Small.cols
-rw-rw---- 1 august august       698 Mai  7 15:34 TestMyCols1Small.frm
-rw-rw---- 1 august august         0 Mai  7 15:34 TestMyCols1Small.MYD
-rw-rw---- 1 august august      1024 Mai  7 15:34 TestMyCols1Small.MYI
august@AMD4:~/MariaDB/data/TestCol$ 

august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols1Big.cols
insgesamt 0
-rw-rw---- 1 august august 0 Mai  7 15:34 Hersteller.column
august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols1Small.cols
insgesamt 0
-rw-rw---- 1 august august 0 Mai  7 15:34 Hersteller.column
august@AMD4:~/MariaDB/data/TestCol$ 
As you can see, 2 additional folders are created: TestMyCols1Big.cols and TestMyCols1Small.cols.
In these folders an additional file is created: Hersteller.column. In our tests some data will be written into these files.

Let's put the usual data into our tables:
MariaDB [TestCol]> insert into TestMyCols1Big  select * from TestOK.ABDAOK; 
Query OK, 10000000 rows affected (6 min 14.84 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> insert into TestMyCols1Small  select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (11.57 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> 
Because we have some additional work to do, these operations are a bit slower than before. And the usual optimizations are not built into this engine. We can accept this; we are interested in other operations here, not in writing.

Again we look at the folder TestCol (=our database):
august@AMD4:~/MariaDB/data/TestCol$ ls -l
insgesamt 2699604
...................
drwxrw---- 2 august august      4096 Mai  7 15:27 TestMyCols1Big.cols
-rw-rw---- 1 august august       698 Mai  7 15:40 TestMyCols1Big.frm
-rw-rw---- 1 august august 895389376 Mai  7 15:48 TestMyCols1Big.MYD
-rw-rw---- 1 august august      1024 Mai  7 15:48 TestMyCols1Big.MYI
drwxrw---- 2 august august      4096 Mai  7 15:34 TestMyCols1Small.cols
-rw-rw---- 1 august august       698 Mai  7 15:34 TestMyCols1Small.frm
-rw-rw---- 1 august august  26049412 Mai  7 15:49 TestMyCols1Small.MYD
-rw-rw---- 1 august august      1024 Mai  7 15:49 TestMyCols1Small.MYI
august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols1Big.cols
insgesamt 97664
-rw-rw---- 1 august august 100000000 Mai  7 15:48 Hersteller.column
august@AMD4:~/MariaDB/data/TestCol$ ls -l TestMyCols1Small.cols
insgesamt 2940
-rw-rw---- 1 august august 3010360 Mai  7 15:49 Hersteller.column
august@AMD4:~/MariaDB/data/TestCol$ 
As you can see the column-files now contain data. As the Big table contains 10 million records and our entry structure is 10 bytes long (1 for the len, 5 for the column value, 4 for the offset), the whole file is 100 MB, much smaller than the roughly 900 MB of the MYD-file.
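The arithmetic behind this file size is simple enough to verify directly:

```python
# entry layout for the Hersteller column:
# 1 length byte + 5 value bytes + 4 offset bytes
entry_size = 1 + 5 + 4
rows = 10_000_000

column_file_size = rows * entry_size
print(column_file_size)  # 100000000 bytes, matching the ls -l output above
```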

A reminder: TestSmall and TestBig are tables of the (pure) MyISAM-type, they are here for comparison and were created in the first post. TestMyCols1Big and TestMyCols1Small are created using the new storage-engine.
So let's do the tests:
MariaDB [TestCol]> select count(*) from TestSmall where hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.43 sec)

MariaDB [TestCol]> select count(*) from TestBig where hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (13.25 sec)

MariaDB [TestCol]> set optimizer_switch='engine_condition_pushdown=on';
Query OK, 0 rows affected (0.00 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Small where Hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.03 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Big where hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (1.53 sec)

MariaDB [TestCol]> 
Do not forget to set the optimizer switch (if you are using MariaDB, as I did for my tests), otherwise the function cond_push() will not be called by the server.

As you can see it's faster. This is because reading the column-file is about 10 times faster than reading the MYD-file, simply due to the difference in file sizes.

Let's look at the more complex query:
MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (3 min 15.80 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Small A  join TestMyCols1Big B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (3 min 6.12 sec)

MariaDB [TestCol]> 
Similar result as in the first post: a bit faster, but again the improvement is negligible.

Let's do some tests at the boundaries (the value ABCDEF does not exist in our data):
MariaDB [TestCol]> select count(*) from TestBig where hersteller = 'ABCDEF';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (12.23 sec)

MariaDB [TestCol]> select count(*) from TestBig where hersteller <> 'ABCDEF'
    -> ;
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (16.39 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Big where hersteller = 'ABCDEF';
+----------+
| count(*) |
+----------+
|        0 |
+----------+
1 row in set (0.68 sec)

MariaDB [TestCol]> select count(*) from TestMyCols1Big where hersteller <> 'ABCDEF';
+----------+
| count(*) |
+----------+
| 10000000 |
+----------+
1 row in set (25.26 sec)
From 12 seconds down to 0.7 seconds: wow!
From 16 seconds up to 25 seconds: pooh.

This can easily be explained: in the first case everything is handled in rnd_next(). As no match is found, the corresponding records are never read, and the search in the column-file simply continues until the end of the file (rnd_next() is called only once). In the second case every entry is a match, so the corresponding record must be read from the MYD-file and given back to the server, and this read operation is not optimized in our engine.
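The asymmetry can be captured in a crude cost model (the numbers below are illustrative assumptions, not measurements):

```python
def scan_cost(column_bytes: int, matches: int, record_read_cost: int) -> int:
    """Illustrative model: the scan touches every byte of the column
    file once; each match additionally triggers a full-record read
    from the MYD file."""
    return column_bytes + matches * record_read_cost

# column file: 100 MB; assume one record read costs as much as reading 90 bytes
no_match  = scan_cost(100_000_000, 0, 90)            # hersteller = 'ABCDEF'
all_match = scan_cost(100_000_000, 10_000_000, 90)   # hersteller <> 'ABCDEF'
```

With zero matches the cost is just the column-file scan; with 10 million matches the per-record reads dominate, which is why the `<>` query ends up slower than a plain MyISAM scan.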

Something I have avoided so far is the use of an index. Let's create some indexes. As the join is done via the column PZN, let's create an index on this column in every table used in our tests:
MariaDB [TestCol]> create index pzn on TestSmall (pzn);
Query OK, 301036 rows affected (1.57 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> create index pzn on TestBig (pzn);
Query OK, 10000000 rows affected (1 min 19.46 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (19.67 sec)

MariaDB [TestCol]> 

And now the same test for our engine:
MariaDB [TestCol]> create index pzn on TestMyCols1Small (pzn);
Query OK, 301036 rows affected (10.96 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> create index pzn on TestMyCols1Big (pzn);
Query OK, 10000000 rows affected (6 min 33.83 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> select count(*) from TestMyCols1Small A  join TestMyCols1Big B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (8.89 sec)
Looks good: a speed-up by a factor of approximately 2.

The usual way to improve the handling of such a query is to create an index on the column used for filtering: Hersteller. OK, let's do this:
MariaDB [TestCol]> create index hersteller on TestSmall (hersteller);
Query OK, 301036 rows affected (2.40 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> create index hersteller on TestBig (hersteller);
Query OK, 10000000 rows affected (2 min 15.97 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> select count(*) from TestSmall where hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.01 sec)

MariaDB [TestCol]> select count(*) from TestBig where hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (0.58 sec)

MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (0.06 sec)

MariaDB [TestCol]> 
OK, I will stop here, because I can't beat this value.