reducing memory usage in a graph structure using Scala
A straightforward implementation of Ukkonen's suffix tree algorithm in Scala
is optimized to reduce memory usage
by using a custom, open-addressing hash map implementation,
along with other tricks that minimize the number of object instances.
motivation
In the Fall of 2008, for a biosequence analysis course that I was auditing,
I coded an implementation of Ukkonen's suffix tree algorithm in Java.
A C++ implementation already existed,
but Java was the language that I had been using lately,
and I wanted to see how it would perform on this sort of graph-structured data.
My Java implementation hit a limit on the string length that it could handle, something well short of the 10⁹ characters that I wanted for nucleotide sequences.
In early 2010, I started toying with Scala.
Then, and again in early 2011, I tried applying some of Scala's capabilities to simplify the data structures that Java creates.
Specifically, I thought that the @specialized annotation and trait mix-ins could help.
I didn't have much time for it then, so I only got as far as porting some of the code.
In the meantime, macros were introduced to Scala.
Testing of the initial implementation
got bogged down in garbage collection shortly after processing 27 Mbp.
An analysis of the memory usage (on a smaller data set),
using the excellent Java Object Layout tool,
yielded the summary in listing 1.
The memory is used primarily by the internal nodes and the hash maps that they contain:
InternalNode
InternalNode$$anon$1 (a mutable.HashMap subclass)
Array[HashEntry]
DefaultEntry
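For reference, summaries like these take only a few lines of JOL to produce. Here is a minimal sketch; buildTree is a hypothetical stand-in for the actual tree construction, which isn't shown here:

    import org.openjdk.jol.info.{ClassLayout, GraphLayout}

    object MemoryReport {
      // hypothetical stand-in for the suffix-tree construction being measured
      def buildTree(s: String): AnyRef = ???

      def main(args: Array[String]): Unit = {
        val tree = buildTree("ACGTACGT")
        // per-class instance counts and total sizes over the whole object graph
        println(GraphLayout.parseInstance(tree).toFootprint)
        // header, field, and padding layout of a single object
        println(ClassLayout.parseInstance(tree).toPrintable)
      }
    }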
merge the HashTable into InternalNode
Clearly there's a lot of overhead in the object headers, which run 8-12 bytes,
but also in the object alignment padding,
as can be seen in listings 2 and 3.
(These exaggerate the loss to alignment padding compared to the average sizes in the actual object footprint.)
Combining these classes was easy to do,
and should remove one object header per internal node
by eliminating the separate InternalNode$$anon$1 instances.
We can see that it does, in listings 4 and 5.
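As a sketch of the idea, assuming the Scala 2.12 collections (where HashTable and DefaultEntry are still public), with illustrative names rather than the article's actual code:

    import scala.collection.mutable.{DefaultEntry, HashTable}

    // The node *is* its own hash table: mixing in HashTable removes the
    // separate HashMap object (and its header) from every internal node.
    class InternalNode[I] extends HashTable[I, DefaultEntry[I, InternalNode[I]]] {

      // HashTable calls this when findOrAddEntry needs to create a new entry
      protected def createNewEntry[B](key: I, value: B): DefaultEntry[I, InternalNode[I]] =
        new DefaultEntry(key, value.asInstanceOf[InternalNode[I]])

      def child(c: I): Option[InternalNode[I]] =
        Option(findEntry(c)).map(_.value)

      def addChild(c: I, node: InternalNode[I]): Unit = {
        val existing = findOrAddEntry(c, node) // null means a fresh entry was added
        if (existing != null) existing.value = node
      }
    }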
But this only gives a 6% reduction in memory use,
and tree construction stops shortly after processing 29 Mbp.
open addressing
Of the remaining memory usage, 58% goes to Array[HashEntry] and DefaultEntry instances,
whose layouts are in listings 6 and 7.
The HashTable chaining structure doesn't work well with the JVM's object overhead.
We can see in figure 1 how much space is wasted.
Much of this could be eliminated by overriding def initialSize = 8,
but there's a better data structure for this.
Open addressing
avoids the extra pointers and should eliminate the DefaultEntrys, about 25% of the memory use.
But some overhead may be added to maintain a lower load factor in the hash table.
Starting from the naive implementation again,
but replacing the InternalNode's member HashMap with an OpenHashMap,
we see the memory footprint in listing 8.
It's actually worse than the naive implementation!
The combined size of the Array[OpenEntry]s and OpenEntrys
is comparable to that of the Array[HashEntry]s and DefaultEntrys.
The Array[OpenEntry]s are smaller due to the use of initialSize = 8;
however, each OpenEntry has an additional hash: Int member,
and the implementation wraps the value in an Option instance in each OpenEntry as well.
The problem is that this implementation doesn't take advantage of the possibility of storing the value reference
directly in the hash table.
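Here is a minimal sketch of that idea (my illustration, not a library class): keys and values live directly in two parallel arrays, with no per-entry objects and no Option wrappers. Resizing and deletion are omitted for brevity.

    // Minimal open-addressing map storing values directly in the table.
    // Simplified: fixed power-of-two capacity, no resizing or deletion.
    final class FlatMap[K <: AnyRef, V <: AnyRef](capacity: Int) {
      require(Integer.bitCount(capacity) == 1, "capacity must be a power of two")
      private val keys = new Array[AnyRef](capacity)
      private val vals = new Array[AnyRef](capacity)
      private val mask = capacity - 1

      // linear probing: scan from the hash slot to the key or an empty slot
      private def slotOf(key: K): Int = {
        var i = key.hashCode & mask
        while (keys(i) != null && keys(i) != key) i = (i + 1) & mask
        i
      }

      def update(key: K, value: V): Unit = {
        val i = slotOf(key)
        keys(i) = key
        vals(i) = value
      }

      def get(key: K): Option[V] = {
        val i = slotOf(key)
        if (keys(i) == null) None else Some(vals(i).asInstanceOf[V])
      }
    }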
As we did with HashTable, we could eliminate the map's header overhead by merging Debox's Map into InternalNode.
It's not as nice as with the HashTable trait,
because Debox makes its members public, so they would become public in the InternalNode API.
And that would still leave us with substantial header overhead
in the Map.keys, .vals, and .buckets array instances.
Instead, we can replace all the maps with one map
that's keyed off both the internal node identity and the first edge label character.
The simple approach would be to key the map on a Tuple2 of (InternalNode[I], I)
(where I is the character type).
But it's often necessary to traverse all of the children of a given node,
and this keying scheme wouldn't cluster together the children of each node.
So what's needed is a custom map structure that does this.
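To see why, consider a sketch of the rejected design (Node is a stand-in for InternalNode): looking up one child is a single probe, but enumerating a node's children degenerates into a scan of the whole map, because the (node, character) keys hash to scattered slots.

    import scala.collection.mutable.OpenHashMap

    object Tuple2Keying {
      class Node // stand-in for InternalNode

      val children = OpenHashMap.empty[(Node, Char), Node]

      // a single hash lookup
      def child(parent: Node, c: Char): Option[Node] =
        children.get((parent, c))

      // must scan the whole map: this node's children are not clustered together
      def childrenOf(parent: Node): Iterator[Node] =
        children.iterator.collect { case ((p, _), n) if p eq parent => n }
    }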
TwoKeyOpenHashMap
A simple design would be to hash the internal node to a reference
to a second-level hash map from the edge character to the child node.
But this would create a (second-level hash map) instance for each internal node,
incurring the instance overhead as in the naive implementation.
A better design would divide the hash table into "bucket" sub-tables, one per internal node key,
into which the edge character would be hashed.
In the special case in which the alphabet is small,
the bucket may be able to efficiently contain a slot for each character in the alphabet.
But in the case of a general alphabet:
(1) we need to plan for the case of bucket overflow; and
(2) the allocation of one bucket per internal node would waste space,
since the number of child nodes per internal node will vary widely.
The solution that I use here is
to apply open addressing to the first key (internal node ID) at the bucket table level,
and again to the second key (first edge character) within a bucket.
If the slot in the bucket is in use, linear probing finds another.
When no space can be found in a bucket, double hashing on the first key finds the next bucket.
When no space can be found in any bucket, resizing occurs.
(Actually, in this implementation, resizing occurs to keep the load factor below 50%.)
Any bucket size will work, but it is optimally chosen to contain a whole alphabet.
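Here is a sketch of just the probe sequence under that scheme (illustrative code with hypothetical names; storage and resizing are omitted):

    // Probe order for a two-key open-addressing table: the table is an array
    // of fixed-size buckets; key1 selects the bucket, key2 a slot within it.
    final class TwoKeyProbe(numBuckets: Int, bucketSize: Int) {
      require(Integer.bitCount(numBuckets) == 1, "power-of-two bucket count")

      // Flat slot indices to try for the two key hashes, in probe order.
      def probes(key1Hash: Int, key2Hash: Int): Iterator[Int] =
        Iterator
          .iterate(key1Hash & (numBuckets - 1)) { bucket =>
            // bucket exhausted: double hashing on key1 picks the next bucket
            // (an odd stride is coprime with the power-of-two bucket count)
            (bucket + (key1Hash | 1)) & (numBuckets - 1)
          }
          .flatMap { bucket =>
            // within a bucket: start at key2's slot and probe linearly
            val start = (key2Hash & Int.MaxValue) % bucketSize
            (0 until bucketSize).iterator.map(i => bucket * bucketSize + ((start + i) % bucketSize))
          }
    }

A lookup walks this sequence until it finds its key or an empty slot; a real implementation bounds the walk and resizes, as described above, to keep the load factor below 50%.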
I call the map implementation TwoKeyOpenHashMap.
Its effect on memory usage is shown in listing 10.
It exhibits the least usage of all the structures so far, but not quite the savings that one would have expected.
This is because it needs to use a map internally (to track the number of values stored per internal node),
and the implementation I chose was OpenHashMap, which has the shortcomings that we've seen already.
There are two other open-addressing map implementation classes in the Scala library:
LongMap and AnyRefMap.
The former only accepts Longs as keys, while the latter accepts only AnyRefs.
For my needs, since the InternalNode keys satisfy the latter,
I can, for the time being, place an upper bound of <: AnyRef on the first key's type parameter.
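A sketch of what that bound buys (an illustrative skeleton, not the actual class):

    import scala.collection.mutable.AnyRefMap

    class TwoKeyOpenHashMap[K1 <: AnyRef, K2, V] {
      // the K1 <: AnyRef bound lets AnyRefMap, which creates no per-entry
      // objects, handle the per-first-key bookkeeping
      private val countsByFirstKey = AnyRefMap.empty[K1, Int]

      protected def incrementCount(k1: K1): Unit =
        countsByFirstKey(k1) = countsByFirstKey.getOrElse(k1, 0) + 1

      def valueCount(k1: K1): Int = countsByFirstKey.getOrElse(k1, 0)
    }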
With this,
we see the memory usage in listing 11
and the layout in figure 2.
Now we're starting to see significant reductions, 27% less than the naive solution,
and tree construction stops shortly after processing 40 Mbp—but still not the number I'm aiming for.
conclusions
The total memory usage for the various implementations is shown in table 1.
Further improvements are the subject of future articles.
Table 1. Memory usage in the various implementations