Tuesday, November 28, 2006

Premature Optimisation

Here is an example of C.A.R. Hoare's statement that premature optimization is the root of all evil.

In implementing the Lock Manager in SimpleDBM, I was concerned about the impact thread synchronisation would have on the scalability of the Lock Manager. The Lock Manager uses a hash table to speed up lock lookups. Many Lock Manager implementations synchronise the entire hash table while manipulating locks. There is an issue within the Apache Derby project discussing the scalability problems posed by such global synchronisation. The SimpleDBM implementation does not use a global lock. Instead it uses a custom hash table in which every hash bucket contains a ReentrantLock. The hash bucket lock is acquired before any traversal or modification is performed on the bucket's chain. I went even further and put a ReentrantLock into each Lock Header object. A Lock Header is required for each active lock in the system, and it maintains a queue of lock requests (clients) for that lock. Lock Headers are the objects that get put on the hash chains.
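Roughly, the arrangement looks like this (a simplified sketch with illustrative names, not the actual SimpleDBM classes):

import java.util.concurrent.locks.ReentrantLock;

// Illustrative sketch only: each hash bucket and each Lock Header
// carries its own ReentrantLock.
class LockHeader {
    final ReentrantLock lock = new ReentrantLock(); // per-Lock-Header latch
    Object lockable;                                // the item being locked
    LockHeader nextInChain;                         // link in the bucket's hash chain
    // ... the queue of lock requests (clients) would hang off this object

    LockHeader(Object lockable) {
        this.lockable = lockable;
    }
}

class Bucket {
    final ReentrantLock lock = new ReentrantLock(); // per-bucket latch
    LockHeader chainHead;                           // head of the hash chain
}

class LockTable {
    final Bucket[] buckets;

    LockTable(int size) {
        buckets = new Bucket[size];
        for (int i = 0; i < size; i++) {
            buckets[i] = new Bucket();
        }
    }

    Bucket bucketFor(Object lockable) {
        return buckets[(lockable.hashCode() & 0x7fffffff) % buckets.length];
    }
}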

The idea was that the hash bucket lock could be released early by obtaining a lock on the Lock Header, once the Lock Header had been found. This would increase concurrency by reducing the duration for which the hash bucket needs to be kept locked.
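In code, the early-release idea looks roughly like this (again a simplified illustration, reusing the classes sketched above):

// The bucket latch is held only for the chain traversal; once the
// Lock Header's own latch is acquired, the bucket latch can be dropped.
class LockLookup {
    static LockHeader findAndLockHeader(LockTable table, Object lockable) {
        Bucket bucket = table.bucketFor(lockable);
        bucket.lock.lock();
        try {
            for (LockHeader h = bucket.chainHead; h != null; h = h.nextInChain) {
                if (lockable.equals(h.lockable)) {
                    h.lock.lock();   // latch the Lock Header first...
                    return h;        // ...the bucket latch is released in finally
                }
            }
            return null;             // no active lock for this item
        } finally {
            bucket.lock.unlock();    // bucket latch released early
        }
    }
}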

In my attempt to further reduce locking, I tried to find ways of reducing the time for which the hash bucket lock was held. One such optimisation involved setting a field in the Lock Header to null rather than manipulating the hash bucket chain when releasing a lock. This code worked fine until the day I tried testing it in a multi-processor environment (Core Duo), when one of my test cases started failing. The test case was designed to stress concurrent operations on the same hash bucket. The problem was that in attempting to optimise I had broken the code. The optimisation was flawed because one section of the code modified the field in the Lock Header while holding only the Lock Header lock, while another section searched the hash bucket while holding only the hash bucket lock, and that search inspected the very field being updated. With no common lock between the two sections, nothing ordered the write against the read.
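A simplified reconstruction of the race looks like this (which field was being cleared, and the names, are only for illustration):

// Release clears a field in the Lock Header instead of unlinking it from
// the chain, holding only the Lock Header latch; the search scans the chain
// holding only the bucket latch and reads that same field. The two threads
// synchronise on different locks, so this is a data race that may only show
// up reliably on a multi-processor machine.
class FlawedOptimisation {
    static void release(LockHeader h) {
        h.lock.lock();
        try {
            h.lockable = null;        // mark the header free; chain untouched
        } finally {
            h.lock.unlock();
        }
    }

    static LockHeader search(Bucket bucket, Object lockable) {
        bucket.lock.lock();
        try {
            for (LockHeader h = bucket.chainHead; h != null; h = h.nextInChain) {
                if (lockable.equals(h.lockable)) {   // unsynchronised read of the field
                    return h;
                }
            }
            return null;
        } finally {
            bucket.lock.unlock();
        }
    }
}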

A lesson from this is that multi-threaded code must be tested on a multi-processor machine.

Anyway, when I hit the bug described above, I decided that it was time to revisit the thread synchronisation code in the Lock Manager. I realised that I had been attacking the wrong problem in my tuning efforts, because there were bigger scalability issues that needed fixing than the one I was trying to fix.

The first problem was memory usage. I was using too many ReentrantLocks: one per hash bucket, plus one for each Lock Header. In a system with 10,000 concurrent active locks, the total number of ReentrantLocks would be 10,000 plus the number of buckets in the hash table. If I wanted to make the Lock Manager scalable, I needed to reduce the amount of memory used per lock.

The second problem was that the hash table was not dynamic. Its size was fixed when the Lock Manager was created. This was likely to cause severe performance and scalability problems due to hash collisions. It would be virtually impossible to choose the correct hash table size statically, and an undersized table would cause more hash collisions, leading to long hash chains, which in turn would mean greater contention between threads trying to access the same hash bucket.

I decided that locking at the level of Lock Headers was not giving me sufficient bang for the buck, whereas if I made the hash table dynamic and used Java's built-in monitors instead of ReentrantLocks to synchronise the buckets, the system's overall efficiency and scalability would be greater than what my current code was achieving. The increase in the number of hash buckets would reduce hash collisions and thereby lock contention amongst threads. The reduction in the number of ReentrantLocks would reduce memory usage and therefore make the Lock Manager more scalable. The refactored code would also be simpler and easier to maintain.
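Roughly, the refactored direction looks like this (the rebuild check shown here is just one simple way to make a dynamic table safe, and is a sketch rather than the final SimpleDBM code):

// Buckets guarded by Java's built-in monitors, with a table that can be
// rebuilt when it gets too full. The retry after re-checking the array
// reference assumes a resize protocol that locks the old buckets before
// swapping in a new array.
class MonitorLockTable {
    static class LockHeader {
        final Object lockable;
        LockHeader next;
        LockHeader(Object lockable) { this.lockable = lockable; }
    }

    static class Bucket {
        LockHeader chainHead;    // guarded by this bucket's monitor
    }

    private volatile Bucket[] buckets = makeBuckets(64);

    private static Bucket[] makeBuckets(int n) {
        Bucket[] b = new Bucket[n];
        for (int i = 0; i < n; i++) {
            b[i] = new Bucket();
        }
        return b;
    }

    LockHeader findOrCreate(Object lockable) {
        while (true) {
            Bucket[] tab = buckets;
            Bucket bucket = tab[(lockable.hashCode() & 0x7fffffff) % tab.length];
            synchronized (bucket) {
                if (tab != buckets) {
                    continue;    // table was rebuilt while we waited; retry
                }
                for (LockHeader h = bucket.chainHead; h != null; h = h.next) {
                    if (h.lockable.equals(lockable)) {
                        return h;
                    }
                }
                LockHeader h = new LockHeader(lockable);
                h.next = bucket.chainHead;
                bucket.chainHead = h;
                return h;
            }
        }
    }
}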
