Friday, January 14, 2011

About InnoDB Index Size Limitations

This is mostly a reflection on a limitation in InnoDB that, in my opinion, has persisted for too long. I founded while reviewing the Amarok media player. The player uses MySQL in the backend, embedded or regular server, so it makes for a great source of real life data.

The Issue

By default, Amarok uses MyISAM tables. This means that if it crashes or stops unexpectedly (a logout while playing music may cause this), the latest updates to the DB are all lost. So I've been looking into using InnoDB instead to avoid loosing my playlists or player statistics.

The Problem

The limitation that bothers me is this one: "Index key prefixes can be up to 767 bytes" which has been in place for several years.
Take this Amarok table for example:
CREATE TABLE urls (
    id int(11) NOT NULL AUTO_INCREMENT,
    deviceid int(11) DEFAULT NULL,
    rpath varchar(324) COLLATE utf8_bin NOT NULL,
    directory int(11) DEFAULT NULL,
    uniqueid varchar(128) COLLATE utf8_bin DEFAULT NULL,

    PRIMARY KEY (id),
    UNIQUE KEY uniqueid (uniqueid),
    UNIQUE KEY urls_id_rpath (deviceid, rpath),
    KEY urls_uniqueid (uniqueid)
) ENGINE=MyISAM AUTO_INCREMENT=314
DEFAULT CHARSET=utf8 COLLATE=utf8_bin
The result of an ALTER TABLE to convert it to InnoDB:
alter table urls engine=InnoDB;
ERROR 1071 (42000): Specified key was too long; max key length is 767 bytes

The Rant

Note that the maximum key length is in bytes, not characters. So lets review the rpath column. This column stores the path to the media file in the catalog, which could easily be something like: /very/long/path/to/find/a/file/with/music.mp3. If it only uses English alphabet characters it's not very long, but as soon as you start using some multi-byte characters (ie: ñ, ç, ü, etc) the length of the string starts to increase in bytes (ie: ó = 4 bytes). A simple query shows the diference:

select id, rpath, bit_length(rpath) / 8 as bytes, 
char_length(rpath) as chars 
from urls limit 1;
+----+-----------------------------------------+---------+-------+
| id | rpath                                   | bytes   | chars |
+----+-----------------------------------------+---------+-------+
|  1 | ./home/gnarvaja/Music/Dodododódodo.mp3 | 41.0000 |    39 |
+----+-----------------------------------------+---------+-------+

So how big can the index be in bytes?

I let MySQL answer this question for me. I created a similar test table with only a PRIMARY and then recreated the index on rpath. Here's the command sequence and it's output:
CREATE TABLE urls_test (
  id int(11) NOT NULL AUTO_INCREMENT,
  rpath varchar(324) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`)
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin;

CREATE INDEX rpath_idx ON urls_test (rpath);
Query OK, 0 rows affected, 2 warnings (0.32 sec)

show warnings;
+---------+------+---------------------------------------------------------+
| Level   | Code | Message                                                 |
+---------+------+---------------------------------------------------------+
| Warning | 1071 | Specified key was too long; max key length is 767 bytes |
| Warning | 1071 | Specified key was too long; max key length is 767 bytes |
+---------+------+---------------------------------------------------------+
2 rows in set (0.00 sec)

SHOW CREATE TABLE urls_test\G
*************************** 1. row ***************************
       Table: urls_test
Create Table: CREATE TABLE `urls_test` (
  `id` int(11) NOT NULL AUTO_INCREMENT,
  `rpath` varchar(324) COLLATE utf8_bin NOT NULL,
  PRIMARY KEY (`id`),
  KEY `rpath_idx` (`rpath`(255))
) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin
1 row in set (0.00 sec)

So down to 255 characters from the original 324. In this case (a music player) it may not be a significant loss, but in a world where applications should support an increasing number of International character sets, in particular Asian ones, this limitation could potentially become serious.

I'm not a source code expert, so I'm not sure what it would take to remove, or at least expand the maximum key size. InnoDB's maximum key length (see URL quoted at the beginning of the article) seems like a good number: 3500 bytes ... or I may be overestimating the need for keys bigger than 255 bytes ... your call.

9 comments:

  1. Keep in mind that in InnoDB all data is organized around an index. InnoDB has limitations about how much data fits on a page and secondary index pages can't contain off-page blobs.

    ReplyDelete
  2. I believe these limitations are in place for performance reasons, MSSQL has a 900 byte prefix limit I believe.

    In general you should not just change from MyISAM to InnoDB and just expect it to work. They are optimized very differently. Also the application will very likely need modifying for performance tuning.

    There are very simple schema modifications you could make so that this would work well, but they would also require query modifications.

    ReplyDelete
  3. You can always index a prefix of a column so if you indexed the first 255 chars on your rpath it might be enough.

    Also you seem to have a duplicate index:
    UNIQUE KEY uniqueid (uniqueid),
    KEY urls_uniqueid (uniqueid)

    There's no reason to have both

    ReplyDelete
  4. It sure looks to me like you have a duplicate key?

    UNIQUE KEY uniqueid (uniqueid),
    KEY urls_uniqueid (uniqueid)

    Is this a typo? You really don't want TWO keys on the same field .....

    ReplyDelete
  5. @LinuxJedi, yes I am aware of the issues. In this case, the DB is very low traffic: 90% of the time it's feeding playlists for the media player running on a local machine and tracking some basic statistics. It's just a convenient platform for testing without having to generate data through complicated scripts.

    However, I'm aware that for a real world facing application it's not such a trivial task to move from MyISAM to InnoDB.

    ReplyDelete
  6. The original schema was taken as-is from the Amarok version distributed in Kubuntu 10.04, so the duplicate key might be something worth reporting back to the Amarok team.

    I'm not sure who in the Amarok team came up with VARCHAR(324), given that it's such a specific string length.

    You can always define the INDEX to be shorter and / or combine it with a numeric PRIMARY KEY to make sure it's UNIQUE. However this is a decision that the developer is forced to make and I suspect there are other applications out there that may need alphabetical keys longer than 255 bytes ... then again, I may be wrong on this.

    Thank you all for your feedback.

    ReplyDelete
  7. I would try using MariaDb with aria eninge instead.

    http://kb.askmonty.org/v/aria-faq

    Monty in the url there, is the original developer of Mysql, who has forked MariaDb off of Mysql ( but kept it mostly compatible), and continued the development of MyIsam under the moniker Aria ( formerly also called maria). The current release of the engine is crash safe, so it should prevent the problem you are having with losing data after a crash. "Eventually" it will also be transactional, but don't hold your breath on that.

    ReplyDelete
  8. @William The Amarok team is evaluating MariaDB as an alternative to MySQL.

    ReplyDelete
  9. I would suggest using MySQL 5.5 with the options: innodb_file_format=barracuda
    innodb_large_prefix=1

    You can create indexes up to 3072 bytes!

    mysql> show create table urls \G
    *************************** 1. row ***************************
    Table: urls
    Create Table: CREATE TABLE `urls` (
    `id` int(11) NOT NULL AUTO_INCREMENT,
    `deviceid` int(11) DEFAULT NULL,
    `rpath` varchar(324) COLLATE utf8_bin NOT NULL,
    `directory` int(11) DEFAULT NULL,
    `uniqueid` varchar(128) COLLATE utf8_bin DEFAULT NULL,
    PRIMARY KEY (`id`),
    UNIQUE KEY `uniqueid` (`uniqueid`),
    UNIQUE KEY `urls_id_rpath` (`deviceid`,`rpath`),
    KEY `bigkey` (`rpath`,`uniqueid`)
    ) ENGINE=InnoDB DEFAULT CHARSET=utf8 COLLATE=utf8_bin ROW_FORMAT=DYNAMIC
    1 row in set (0.01 sec)

    ReplyDelete