an unsystematic journey through the source-code of MySQL and MariaDB: columns

In a database the data is usually stored in row-oriented form. But why? Can't we store it column-oriented?

Wikipedia has an article about that topic, you can find it here.

row oriented

Let me give you an example: think of the data organized in the form of a spreadsheet:

In the case of a MyISAM-file (a file with the extension MYD) of type static this data would be stored like this:

 1074892 | 809032 | 11.31   | 5.04    | KOCHSALZ 0.9% ISOTON GL 250 ML       | KOCHSALZ 0.9% ISOTON GL    | 08650      | 1286800 | 252888 | 516.92  | 310.00  | FREKA BUTTON EINZELS 1.7CM 1X1 ST    | FREKA BUTTON EINZELS 1.7CM | 08650      | 1007388 | 44173  | 12.04   | 7.10    | EASY BAG SYSTEM SCHWERKRAF 1X1 ST    | EASY BAG SYSTEM SCHWERKRAF | 08650      |
and so on

Take this picture as a symbol, data is stored in an internal format which usually is not human-readable. And different engines have different forms of storing the data but in general the usual form looks like the data above (I've added the character "|" as a delimiter, it's not stored in the datafile).

On a hard-disk data is stored in the form of a stream of bytes, an organized form of storing the data must be added on top of this stream. So storing the data row by row is the easiest way. In the notation of the spreadsheet above the data is stored in the following order of the cells:

B3  C3  D3  E3  F3  G3  H3  B4  C4  D4  E4  F4  G4  H4  B5  C5  and so on

I hope this picture shows this:

Starting with row 1 (at cell B3) simply follow the direction of the arrow in line 3 of the sheet. At the end of row 1 (=at cell H3) continue with row 2 (in line 4 at cell B4 of the sheet) and so on.

column-oriented

How does it look when the data ist stored column-wise? Look at this picture:

Simply start with column B of the sheet (at cell B3 again) and follow the arrow. At the end of the data in column B (=at cell B7) switch over to column C, start with cell C3 and follow the arrow in this column down to cell C7, then switch to column D and so on. So the data would be stored in the datafile like this:

1074892 |  1286800 | 1007388 | 1074700 | 1074856 | 809032 | 252888 | 44173  | 41447  | 541598 | 11.31   | 516.92  | 12.04   | 79.76   | 104.86  | 5.04    | 310.00  | 7.10    | 43.64   | 
and so on.

In the cell-notation of the spreadsheet the data will be stored in this form:

B3  B4  B5  B6  B7  C3  C4  C5  C6  C7  D3 D4  D5  D6  D7  E3  E4  and so on

These pictures and the descriptions should show the different approaches in storing the data in a datafile. But this is simplified.

advantage

When the data is stored in column-oriented fashion the values in one column are stored close to each other. By filtering the data we do not need to read the full record to decide if this record fulfills the search-criteria. Reading less data (=reading only one column) should be much faster.

disadvantage

Immediately you can see one problem: when a record is added to this table it can simply be written at the end of the datafile in the first case (row-oriented) but in the second case (columns-oriented) this looks like a problem:

B3  B4  B5  B6  B7  B8  C3  C4  C5  C6  C7  C8  D3 D4  D5  D6  D7  D8  E3  E4  and so on

And it is a problem: look at the cell-names marked in bold which describe the new values. For these values we have to make room in the datafile to store the first column of this row (and we have to move the bytes again for storing the value of column 2 of this record and again for column 3 and so on).

So we easily found a disadvantage but are there any advantages for storing the data in columnar form? And the answer is: it depends.
Let's look at some examples.

code

When we change the orientation of the data into a column-oriented format we do need to change a lot of things in the code of the database-server. So let's not start with the full picture but instead start wit a trivial approach and continue improving this. I want to start with the existing format (=row-oriented) and simply modify the access by filtering the records on the level of the storage-engine. So I will stay with the row-oriented format for the first shot.

Sequentially reading a table is done by the member-function rnd_next() of any storage-engine. So I will modify this routine a bit.

The code shown here is for demonstration purpose only. It's a proof of concept, do not add it to any production version of MySQL/MariaDB.
It is based on the MINTEMPL-engine which is presented here. You will find the code of this engine here. The handling of adding an engine to MySQL/MaraDB is described here.

You have to modify the code of the engine MINTEMPL:
In the file ha_mintempl.h please add this line:

 int rnd_next(uchar *buf);

In the file ha_mintempl.cc please add these lines:

int ha_mintempl::rnd_next(uchar *buf)
{
  DBUG_ENTER("ha_mintempl::rnd_next");
  int retCode;

  if ( 0 == strncmp(table_share->table_name.str, "TestColsBig",11) )
  {
      // this is table TestColsBig
      Field **ptr2Col=table->field;           // get the pointers to the data of the record
      ptr2Col += 6;                           // now it points to the column Hersteller
      void *ptrData = (*ptr2Col)->ptr;
      while ( true )
      {
          retCode = ha_myisam::rnd_next(buf); // fetch the original record
          if ( retCode != 0 )                 // in case of EOF
              DBUG_RETURN( retCode );
          // check for: WHERE hersteller = '36440'
          if ( 0 == strncmp( (const char*)ptrData, "36440", 5) )
              DBUG_RETURN( 0 );               // found one record -> return it
          // not found: continue reading
      }
  } else if ( 0 == strncmp( table_share->table_name.str, "TestColsSmall",11) )
  {
      // this is the table TestColsSmall
      Field **ptr2Col=table->field;           // get the pointers to the data of the record
      ptr2Col += 6;                           // now it points to the column Hersteller
      void *ptrData = (*ptr2Col)->ptr;
      while ( true )
      {
       retCode = ha_myisam::rnd_next(buf); // fetch the original record
       if ( retCode != 0 )                 // in case of EOF
        DBUG_RETURN( retCode );
       // WHERE hersteller = '08650'
       if ( 0 == strncmp(  (const char*)ptrData, "08650", 5) )
        DBUG_RETURN( 0 );               // found one record -> return it
       // not found: continue reading
      }
  }
  // any other table: read the record and return
  DBUG_RETURN( ha_myisam::rnd_next(buf) );
}

Now compile everything and we have another storage-engine which we could install and will use in our tests.

filtering

Our tests will be done on tables without any indexes, so the tables contain only data. Indexes can speed-up the operation dramatically but I didn't use them here; I wanted to demonstrate the results of different approaches of the execution of a table-scan.

When executing a table-scan the db-server-software calls rnd_next() to fetch one row of a table, so this function is called once for every row in the table. The approach in the existing code of the MyISAM-engine is to read a record and give it back to the caller where the filtering will be done. The idea here is to not act like this but instead filter the records in the storage-engine and give only those rows back to the caller (= higher level of the db-server) that match the condition (=the WHERE-clause). As I will show you soon I have a table which contains approx. 10 mio. records but only 3% of these records match the given condition. In another table only 0.5% of the 300K records match the condition. So wouldn't it be helpful to return only those records back the the db-server? It would be a much smaller amount of data that the server-software has to handle.

So let's take this little code and see what happens.

preparations

As MINTEMPL is dervied from MyISAM (precise: it is MYISAM) I created a database TestCol and in this database I created 2 tables of type MYISAM for comparison:

MariaDB [(none)]> create database TestCol;
Query OK, 1 row affected (0.00 sec)

MariaDB [(none)]> use TestCol;
Database changed
MariaDB [TestCol]> create table TestBig (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCol]> create table TestSmall (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MYISAM
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCol]> insert into TestSmall select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (1.48 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> insert into TestBig select * from TestOK.ABDAOK;
Query OK, 10000000 rows affected (49.73 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]>

Then I created 2 tables of type MINTEMPL to do my tests:

MariaDB [TestCol]> create table TestColsSmall (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MINTEMPL
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCol]> create table TestColsBig (                                                                                                                                                                             
    ->                 Id   char(8),
    ->                 PZN  char(7),
    ->                 EVP  char(8),
    ->                 HAP  char(8),
    ->                 ArtikelBez varchar(40),
    ->                 ArtikelText varchar(26),
    ->                 Hersteller  char(5) )
    ->                 engine=MINTEMPL
    -> ;
Query OK, 0 rows affected (0.04 sec)

MariaDB [TestCol]> insert into TestColsSmall select * from TestOK.ABDAPZN;
Query OK, 301036 rows affected (1.53 sec)
Records: 301036  Duplicates: 0  Warnings: 0

MariaDB [TestCol]> insert into TestColsBig select * from TestOK.ABDAOK; 
Query OK, 10000000 rows affected (49.40 sec)
Records: 10000000  Duplicates: 0  Warnings: 0

MariaDB [TestCol]>

OK, now we have 2 tables of type MyISAM and 2 tables of type MINTEMPL, and for each storage engine we have a small table (with 300K records) and a big table (with 10 mio. records). In the tables the data is identical meaning that the data in the smaller tables contain the same data and so for the bigger tables, so we can do some tests on each and compare the results.

testing

Let's do a quick test:

MariaDB [TestCol]> select count(*) from TestSmall where Hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.40 sec)

MariaDB [TestCol]> select count(*) from TestColsSmall where Hersteller = '08650';
+----------+
| count(*) |
+----------+
|     1334 |
+----------+
1 row in set (0.24 sec)

MariaDB [TestCol]> select count(*) from TestBig where Hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (12.32 sec)

MariaDB [TestCol]> select count(*) from TestColsBig where Hersteller = '36440';
+----------+
| count(*) |
+----------+
|   315857 |
+----------+
1 row in set (7.41 sec)

As you can see I did a simple SELECT on each table: on the table TestSmall (=standard MyISAM, contains 300K records) and TestBig (=standard MyISAM, contains 10 mio. records) and also on the tables TestColsSmall (=of type MINTEMPL, approx. 300K records, our modified code handles the access) and TestColsBig (=of type MINTEMPL, 10 mio. records, our modified code handles the access). Compare these values and it looks like a speed-up of about 35 to 40%.

As this is such a simple operation let's do something a bit more complex:

MariaDB [TestCol]> MariaDB [TestCol]> select count(*) from TestSmall A  join TestBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (4 min 7.80 sec)

MariaDB [TestCol]> select count(*) from TestColsSmall A  join TestColsBig B on (A.PZN = B.PZN) where A.Hersteller = '08650' and B.Hersteller = '36440';
+----------+
| count(*) |
+----------+
|      112 |
+----------+
1 row in set (4 min 2.77 sec)

MariaDB [TestCol]>

In this case I joined the 2 tables of type MyISAM together and got the first result. For comparison I joined the 2 tables of type MINTEMPL together. You can compare and review the time-values given.

conclusions

If you compare the values of SELECTing from a standard-table to the values of SELECTing from a MINTEMPL-table then you will see a noticeable speed-up in the case of a simple SELECT, but this type of operation is not of interest. In the case of JOINing 2 tables the improvement looks like a negligible value.

So I would say: forget this approach (and the code). We have to think about another algorithm.

a remark

If you look at the code I presented some lines above you will see that this code is unusable in any environment except for this test. It contains a lot of hardcoded constants like the table-names (TestColsBig, TestColsSmall), the position of the field Hersteller (no. 6) and the values to check for ("36440", "08650").

The intention for this is to have a quick check if this approach does make sense.

For your own tests you must modify these entries so that they fit to your data.

Thursday 24 April 2014

columns