MongoDB data conversion phase II complete

The data conversion from MySQL to MongoDB is finally done. Phase I was to move everything except cache logs and was completed as part of the upgrade to 0.8. Phase II was to move the cache logs out of the old database and into the new database.

Since there is not enough disk space to support two full copies of the logs (for 32 million logs, 8-11 GB per copy depending on indexes), I planned to free up disk space from the old database periodically. That turned out to be impossible: deleting logs from MySQL was too expensive, and freeing up the space requires making a new copy of the data before the old one is deleted… did I mention I don’t have enough disk space?

Instead I migrated about a third of the cache logs (about 10 million), then dumped the rest to an off-line archive. After that I took the MySQL DB offline, freed up the space, and started loading the archived logs into MongoDB.

Now for something a little more technical…

In the MySQL database, cache logs were stored in a single table with the cache log ID as the primary key and a secondary index on cache ID + log time. That structure requires index storage in the 2+ GB range. That’s a lot of index data to traverse, especially when looking up logs for many caches at once (as the synchronization API does). With limited memory resources it also means that very little of the index can be cached in memory, so MOST cache log lookups require quite a lot of disk seeks.
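A rough sketch of that flat layout, using a sorted in-memory list to stand in for the (cache ID, log time) secondary index. The field names and values are illustrative, not the actual schema; the point is that there is one index entry per log, and fetching logs for a cache means a range scan where each hit can cost a disk seek:

```python
from bisect import bisect_left

# Hypothetical sample of the flat table: one row per log, keyed by log ID.
logs = [
    {"log_id": 1, "cache_id": "GC100", "log_time": 10, "text": "Found it"},
    {"log_id": 2, "cache_id": "GC200", "log_time": 11, "text": "DNF"},
    {"log_id": 3, "cache_id": "GC100", "log_time": 12, "text": "TFTC"},
]

# Secondary index on (cache_id, log_time): one entry per log row,
# so 32 million logs means 32 million index entries.
index = sorted((l["cache_id"], l["log_time"], l["log_id"]) for l in logs)

def logs_for_cache(cache_id):
    """Range-scan the index; every matching entry may be a separate seek."""
    i = bisect_left(index, (cache_id, float("-inf"), -1))
    ids = []
    while i < len(index) and index[i][0] == cache_id:
        ids.append(index[i][2])
        i += 1
    return ids
```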

But, since cache logs are always accessed with the cache ID available, a much more efficient structure would be to group all cache logs in a “table” organized by cache ID. For the current 1.5 million cache records, the index storage would be less than 200 MB.
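A back-of-envelope check of those numbers, taking the per-entry cost implied by the figures in the post (roughly 2 GB of index across 32 million logs) and applying it to one entry per cache instead of one per log:

```python
# Illustrative arithmetic only; the real per-entry cost varies with the engine.
n_logs = 32_000_000
index_bytes_flat = 2 * 1024**3           # the "2+ GB" secondary index
per_entry = index_bytes_flat / n_logs    # ~67 bytes per index entry

n_caches = 1_500_000                     # one index entry per cache instead
index_bytes_grouped = n_caches * per_entry
grouped_mb = index_bytes_grouped / 1024**2   # ~96 MB, well under 200 MB
```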

Unlike MySQL, MongoDB allows for unstructured data storage, so it would be easy to do this as a collection of cache documents (one for each cache ID), where each document contains an array of all the related cache logs. Cache log retrieval would then require a single index lookup per cache, followed by (approximately) one disk seek to retrieve ALL of that cache’s logs. The downside to this structure is that when additional logs are added to a cache document, the document grows and may have to be moved (on disk) if there is not enough free space at its current location. Even so, the significantly reduced index size gives this structure a strong advantage and will free up memory resources for other indexes and data.
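A minimal sketch of that per-cache document model, using a plain dict to stand in for the MongoDB collection (in MongoDB itself, `add_log` would be an update with `$push` on the `logs` array). Field names and values are illustrative:

```python
# _id (the cache ID) -> cache document with an embedded array of logs.
caches = {}

def add_log(cache_id, log):
    """Append a log to its cache document (like a $push update).
    If the document grows past its allocated space on disk, the real
    database may have to relocate it -- the downside noted above."""
    doc = caches.setdefault(cache_id, {"_id": cache_id, "logs": []})
    doc["logs"].append(log)

def get_logs(cache_id):
    """One index lookup per cache returns ALL of its logs at once."""
    return caches.get(cache_id, {"logs": []})["logs"]

add_log("GC100", {"log_time": 10, "text": "Found it"})
add_log("GC100", {"log_time": 12, "text": "TFTC"})
```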

That WAS my original plan, but it was taking so long to extract the logs from MySQL by cache ID that I ended up migrating the data using the SAME structure, organized by cache LOG ID. As a result, performance under the new database is only marginally better than before.

So now there is a “Phase III” where the old structure will be converted and migrated to the new structure.

Stay tuned.

