
MongoDB data conversion phase II complete

The data conversion from MySQL to MongoDB is finally done. Phase I was to move everything except cache logs and was completed as part of the upgrade to 0.8. Phase II was to move the cache logs out of the old database and into the new database.

Since there is not enough disk space to support two full copies of the logs (for 32 million logs, 8-11 GB per copy depending on indexes), I planned to free up disk space from the old database periodically. That turned out to be impossible: deleting logs from MySQL was too expensive, and reclaiming the space requires making a new copy of the data before the old one is deleted… did I mention I don’t have enough disk space?

Instead, I migrated about a third of the cache logs (about 10 million) and dumped the rest to an offline archive. Then I took the MySQL DB offline, freed up the space, and started loading the archived logs into MongoDB.

Now for something a little more technical…

In the MySQL database, cache logs were stored in a single table with the cache log ID as the primary key and a secondary index on cache ID + log time. That structure requires more than 2 GB of index storage. That’s a lot of index data to traverse, especially when looking up logs for many caches at once (as is the case for the synchronization API). With limited memory, very little of the index can be cached in RAM, so MOST cache log lookups require quite a lot of disk seeks.
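To make that concrete, here’s roughly what a per-log layout looks like. This is only a sketch in Python with pymongo; the collection and field names are my own illustrations, not bcaching’s actual schema.

```python
# Sketch of the per-log layout: one document per cache log, mirroring the
# MySQL table. Collection and field names are illustrative assumptions.
from pymongo import ASCENDING, DESCENDING, MongoClient

logs = MongoClient().bcaching.cache_logs

# The equivalent of the secondary index on cache ID + log time. With
# ~32 million entries, this index alone runs into the gigabytes.
logs.create_index([("cache_id", ASCENDING), ("log_time", DESCENDING)])

logs.insert_one({
    "_id": 123456789,                    # cache log ID (the primary key)
    "cache_id": "GC1ABCD",
    "log_time": "2011-05-01T12:00:00Z",
    "finder": "some_user",
    "type": "Found it",
    "text": "TFTC!",
})

# Looking up the logs for one cache walks the big compound index.
for log in logs.find({"cache_id": "GC1ABCD"}).sort("log_time", DESCENDING):
    print(log["log_time"], log["finder"])
```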

But since cache logs are always accessed with the cache ID available, a much more efficient structure would be to group all cache logs in a “table” organized by cache ID. For the current 1.5 million cache records, the index storage would be less than 200 MB.

Unlike MySQL, MongoDB allows unstructured data storage, so it would be easy to model this as a collection of cache documents (one for each cache ID), where each document contains an array of all the related cache logs. Retrieving cache logs would then require a single index lookup per cache, followed by (approximately) one disk seek to fetch ALL of them. The downside to this structure is that when additional logs are added to a cache document, the document grows and may have to be moved (on disk) if there is not enough free space at the document’s current location. Even so, the significantly reduced index size gives this structure a strong advantage and frees up memory for other indexes and data.
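Here’s a sketch of that per-cache layout for comparison (same caveat: the names are assumptions): one document per cache holding an embedded array of logs, appended with $push.

```python
# Sketch of the per-cache layout: one document per cache, with all of its
# logs embedded in an array. Names are illustrative assumptions.
from pymongo import MongoClient

caches = MongoClient().bcaching.caches

# Appending a log grows the cache document in place; if the document
# outgrows its current slot, Mongo has to relocate it on disk (the
# downside mentioned above).
caches.update_one(
    {"_id": "GC1ABCD"},
    {"$push": {"logs": {
        "log_id": 123456789,
        "log_time": "2011-05-01T12:00:00Z",
        "finder": "some_user",
        "type": "Found it",
        "text": "TFTC!",
    }}},
    upsert=True,
)

# Retrieval is one lookup on the default _id index (~1.5 million entries
# instead of ~32 million) plus roughly one seek for ALL of a cache's logs.
doc = caches.find_one({"_id": "GC1ABCD"})
all_logs = doc["logs"] if doc else []
```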

That WAS my original plan, but it was taking so long to extract the logs from MySQL by cache ID that I ended up migrating the data using the SAME structure, organized by cache LOG ID. As a result, performance under the new database is only marginally better than before.

So now there is a “Phase III” where the old structure will be converted and migrated to the new structure.

Stay tuned.


What’s new database?

Every month bcaching grows a little bigger, with more users and more data. The database has grown to over 12 GB to accommodate 1.5 million geocache records and over 32 million finder logs. One of the persistent problems has been keeping that geocache data up-to-date as efficiently as possible while running fast enough on reasonably priced hosting (currently a single medium-sized Windows-based VPS).

Over 700 GPX files are processed every day. Over the past 24 hours, 890 GPX files were processed, containing 388,891 geocaches, 1,902,996 finder logs, 106,832 waypoints, and 85,612 travel bugs. It took 17 hours and 41 minutes of processing time to read and load those files. That’s about 2,341 data objects per minute, which is too slow even after several performance improvements over the years. It also takes resources away from serving web and API requests, even though GPX processing runs at a lower priority than other work.
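For the curious, here’s where that per-minute figure comes from (numbers straight from the paragraph above):

```python
# Throughput check using the figures above.
objects = 388_891 + 1_902_996 + 106_832 + 85_612  # caches + logs + waypoints + travel bugs
minutes = 17 * 60 + 41                            # 17 hours, 41 minutes
print(objects // minutes)                         # -> 2341 objects per minute
```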

For the past couple of months I have spent some time investigating, testing, and implementing a completely new back-end database using MongoDB to replace MySQL. Mongo is fast. It sacrifices features that could slow it down and requires the application to take on more responsibility, but it provides excellent performance in return.

The jump from a traditional SQL database to a schema-less database required a complete rewrite of all the data access logic, but it simplified some of that logic as well (especially the GPX file processing). Some of the data model also had to be reorganized to best take advantage of certain Mongo features.
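As a rough illustration of why the GPX side gets simpler (a sketch only; the element mapping and collection names are my guesses, not bcaching’s actual code): a parsed waypoint can be stored more or less as-is, instead of being split across several normalized tables.

```python
# Sketch: with a schema-less store, a parsed GPX waypoint can be saved
# nearly as-is. The field mapping here is illustrative, not actual code.
import xml.etree.ElementTree as ET
from pymongo import MongoClient

NS = {"gpx": "http://www.topografix.com/GPX/1/0"}
caches = MongoClient().bcaching.caches

def load_gpx(path):
    for wpt in ET.parse(path).getroot().findall("gpx:wpt", NS):
        doc = {
            "_id": wpt.findtext("gpx:name", namespaces=NS),   # e.g. "GC1ABCD"
            "lat": float(wpt.get("lat")),
            "lon": float(wpt.get("lon")),
            "desc": wpt.findtext("gpx:desc", namespaces=NS),
        }
        # One upsert per cache; logs and child waypoints could be nested
        # in the same document rather than written to separate tables.
        caches.replace_one({"_id": doc["_id"]}, doc, upsert=True)
```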

One of the remaining problems is how to migrate from the old database to the new one. Normally a release includes only minor database changes and can be completed in a couple of hours or less, but a full database migration would take the better part of a day and I’m not willing to shut down the site for that long. Another approach would be to synchronize the two databases while the site is live, then shut down the site only long enough for a final synchronization before switching over. That is not an option either, because there is not enough disk space to support two full databases at the same time.

The remaining option (and current plan) is to synchronize some of the data, then switch to the new database but continue to use the old database in a temporary hybrid mode until the remaining data can be moved. The new database is now being synchronized with everything but finder logs (logs take up the largest percentage of space), and the new application can load logs from both databases and merge the results. There is a definite performance hit to loading the logs this way, but it’s not much worse than before and it will get better after the migration is complete.
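For anyone curious what that hybrid mode looks like, the general shape is below. Again, just a sketch: fetch_mysql_logs is a hypothetical stand-in for the legacy data access code.

```python
# Sketch of the hybrid read path: newer logs come from MongoDB while the
# rest still live in MySQL, and the two result sets are merged.
from pymongo import MongoClient

mongo_logs = MongoClient().bcaching.cache_logs

def fetch_mysql_logs(cache_id):
    """Hypothetical stand-in for the legacy MySQL lookup (not shown)."""
    return []

def get_logs(cache_id):
    new_logs = list(mongo_logs.find({"cache_id": cache_id}))
    old_logs = fetch_mysql_logs(cache_id)
    # Merging and re-sorting is the performance hit mentioned above; it
    # disappears once all the logs have been migrated.
    return sorted(new_logs + old_logs,
                  key=lambda log: log["log_time"], reverse=True)
```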

There is still more testing to do, so I haven’t scheduled a release date, but I wanted to give everyone a heads-up. There will also be at least one new feature with this release: support for Cache Attributes! If you’re interested in testing the new site, you can use your existing credentials at http://test.bcaching.com, but any uploads or logs may be overwritten by the nightly sync from the main database.

Questions or comments are welcome here or in the forums.