Wednesday, September 23, 2015

reducing memory usage in a graph structure using Scala

A straightforward implementation of Ukkonen's suffix tree algorithm in Scala is optimized to reduce memory usage, using a custom open-addressing hash map along with other tricks to minimize the number of object instances.

motivation

In the Fall of 2008, for a biosequence analysis course that I was auditing, I coded an implementation of Ukkonen's suffix tree algorithm in Java. A C++ implementation already existed, but Java was the language I had been using lately, and I wanted to see how it would perform on this sort of graph-structured data.

My Java implementation hit a limit on the string length that it could handle, something well short of the 10⁹ characters that I wanted for nucleotide sequences.

In early 2010, I started toying with Scala. Then, and again in early 2011, I tried applying some of Scala's capabilities to simplify the data structures that the Java implementation creates. Specifically, I thought that the @specialized annotation and trait mix-ins could help. I didn't have much time for it then, so I only got as far as porting some of the code. In the meantime, macros were introduced to Scala.

naive implementation

Recently I found time to port the basic suffix tree algorithm to Scala. I performed tests on a virtual machine with 8GB RAM, using the OpenJDK 1.8.0_45 release in 64-bit server mode, with "-Xmx6g". I used the 139 Mbp D. melanogaster genome as my test data.

Testing of the initial implementation got bogged down in garbage collection shortly after processing 27 Mbp. An analysis of the memory usage (on a smaller data set), using the excellent Java Object Layout tool, yielded the summary in listing 1.

com.performantdata.suffixtree.SuffixTree@3098cf3bd footprint:
     COUNT       AVG       SUM   DESCRIPTION
         5      3321     16608   [B
        37       492     18208   [C
         6    349544   2097264   [Ljava.lang.Object;
...
    320437        80  25634960   [Lscala.collection.mutable.HashEntry;
    320436        32  10253952   com.performantdata.suffixtree.InternalNode
    320437        40  12817480   com.performantdata.suffixtree.InternalNode$$anon$1
    499950        24  11998800   com.performantdata.suffixtree.Leaf
         1        16        16   com.performantdata.suffixtree.NucleotideAlphabet
         1        32        32   com.performantdata.suffixtree.RootNode
         1        72        72   com.performantdata.suffixtree.SuffixTree
...
         5        16        80   java.lang.Character
...
         1        24        24   scala.collection.mutable.ArrayBuffer
    820386        24  19689264   scala.collection.mutable.DefaultEntry
...
   2281845            82531592   (total)
Listing 1. Major memory users in naive implementation

The memory is used primarily by the internal nodes and the hash maps that they contain:

  • InternalNode
  • InternalNode$$anon$1 (a mutable.HashMap subclass)
  • Array[HashEntry]
  • DefaultEntry
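The shape of this structure can be reconstructed roughly as follows (class names follow listing 1; the exact fields are assumptions):

```scala
import scala.collection.mutable

// Sketch of the naive node structure (class names from listing 1; field
// details are assumptions). Each internal node owns its own mutable.HashMap,
// so every node pays for a HashMap header, an Array[HashEntry] table, and
// one DefaultEntry per child.
abstract class Node[I](val parent: InternalNode[I], val edgeStart: Int)

final class Leaf[I](parent: InternalNode[I], edgeStart: Int)
  extends Node[I](parent, edgeStart)

class InternalNode[I](parent: InternalNode[I], edgeStart: Int, var edgeEnd: Int)
  extends Node[I](parent, edgeStart) {
  /** Children, keyed by the first element of each child's edge label. */
  val childrenByElement: mutable.Map[I, Node[I]] = mutable.HashMap.empty
  var suffixLink: InternalNode[I] = null
}
```

The per-node HashMap gives constant-time child lookup, but as listing 1 shows, the maps and their entry objects dominate the footprint.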

merge the HashTable into InternalNode

Clearly there's a lot of overhead in the object headers, which run 12 bytes apiece in these listings, but also in the object alignment padding, as can be seen in listings 2 and 3. (These exaggerate the loss to alignment padding compared to the average sizes in the actual object footprint.)

com.performantdata.suffixtree.InternalNode object internals:
 OFFSET  SIZE         TYPE DESCRIPTION                    VALUE
      0     4              (object header)                19 3b cf 98 (0001 1001 0011 1011 1100 1111 1001 1000)
      4     4              (object header)                30 00 00 00 (0011 0000 0000 0000 0000 0000 0000 0000)
      8     4              (object header)                05 a8 02 f8 (0000 0101 1010 1000 0000 0010 1111 1000)
     12     4          int Node._edgeStart                (access denied)
     16     4 InternalNode Node.parent                    (access denied)
     20     4          int InternalNode._edgeEnd          (access denied)
     24     4          Map InternalNode.childrenByElement (access denied)
     28     4 InternalNode InternalNode._suffixLink       (access denied)
     32    40              (loss due to the next object alignment)
Instance size: 72 bytes (reported by Instrumentation API)
Space losses: 0 bytes internal + 40 bytes external = 40 bytes total
Listing 2. InternalNode class layout
scala.collection.mutable.HashMap object internals:
 OFFSET  SIZE        TYPE DESCRIPTION                    VALUE
      0     4             (object header)                19 3b cf 98 (0001 1001 0011 1011 1100 1111 1001 1000)
      4     4             (object header)                30 00 00 00 (0011 0000 0000 0000 0000 0000 0000 0000)
      8     4             (object header)                05 a8 02 f8 (0000 0101 1010 1000 0000 0010 1111 1000)
     12     4         int HashMap._loadFactor            (access denied)
     16     4         int HashMap.tableSize              (access denied)
     20     4         int HashMap.threshold              (access denied)
     24     4         int HashMap.seedvalue              (access denied)
     28     4 HashEntry[] HashMap.table                  (access denied)
     32     4       int[] HashMap.sizemap                (access denied)
     36    36             (loss due to the next object alignment)
Instance size: 72 bytes (reported by Instrumentation API)
Space losses: 0 bytes internal + 36 bytes external = 36 bytes total
Listing 3. HashMap class layout

Combining these classes was easy to do, and should remove an object header per internal node by eliminating the separate InternalNode$$anon$1 instances. Listings 4 and 5 confirm that it does.

com.performantdata.suffixtree.SuffixTree@63070babd footprint:
     COUNT       AVG       SUM   DESCRIPTION
...
    320436        56  17944416   com.performantdata.suffixtree.InternalNode
    499950        24  11998800   com.performantdata.suffixtree.Leaf
...
   1961408            77404600   (total)
Listing 4. Memory use changes from merging InternalNode
com.performantdata.suffixtree.InternalNode object internals:
 OFFSET  SIZE         TYPE DESCRIPTION                    VALUE
      0     4              (object header)                09 ab 0b 07 (0000 1001 1010 1011 0000 1011 0000 0111)
      4     4              (object header)                63 00 00 00 (0110 0011 0000 0000 0000 0000 0000 0000)
      8     4              (object header)                2d a5 02 f8 (0010 1101 1010 0101 0000 0010 1111 1000)
     12     4          int Node._edgeStart                (access denied)
     16     4 InternalNode Node.parent                    (access denied)
     20     4          int InternalNode._edgeEnd          (access denied)
     24     4          int InternalNode._loadFactor       (access denied)
     28     4          int InternalNode.tableSize         (access denied)
     32     4          int InternalNode.threshold         (access denied)
     36     4          int InternalNode.seedvalue         (access denied)
     40     4 InternalNode InternalNode._suffixLink       (access denied)
     44     4  HashEntry[] InternalNode.table             (access denied)
     48     4        int[] InternalNode.sizemap           (access denied)
     52    20              (loss due to the next object alignment)
Instance size: 72 bytes (reported by Instrumentation API)
Space losses: 0 bytes internal + 20 bytes external = 20 bytes total
Listing 5. Merged InternalNode class layout

But this only gives a 6% reduction in memory use, and tree construction stops shortly after processing 29 Mbp.
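In outline, the merge pulls the hash table's fields into the node itself. The hand-rolled chained table below models the effect (the actual code presumably mixes Scala's mutable.HashTable trait into InternalNode; all names here are illustrative):

```scala
// One chain link per child, as in mutable.DefaultEntry.
final class Entry[I](val key: I, var value: AnyRef, var next: Entry[I])

// Sketch of a node that *is* its own hash table: the table fields live
// directly in the node, saving one map-object header per internal node.
class MergedInternalNode[I](var edgeStart: Int, var edgeEnd: Int) {
  private[this] var table = new Array[Entry[I]](8) // small initial size
  private[this] var size = 0

  private def index(key: I): Int = (key.hashCode & 0x7fffffff) % table.length

  def putChild(key: I, child: AnyRef): Unit = {
    var e = table(index(key))
    while (e != null && e.key != key) e = e.next
    if (e != null) e.value = child // existing key: overwrite
    else {
      if (size + 1 > table.length * 3 / 4) resize() // keep load factor < 0.75
      val i = index(key)
      table(i) = new Entry(key, child, table(i))
      size += 1
    }
  }

  def child(key: I): Option[AnyRef] = {
    var e = table(index(key))
    while (e != null && e.key != key) e = e.next
    Option(e).map(_.value)
  }

  private def resize(): Unit = {
    val old = table
    table = new Array[Entry[I]](old.length * 2)
    for (head <- old) {
      var e = head
      while (e != null) { // rehash each chained entry into the new table
        val nxt = e.next
        val i = index(e.key)
        e.next = table(i)
        table(i) = e
        e = nxt
      }
    }
  }
}
```

This removes the map header and the anonymous-subclass instances, but the entry array and the chain-link objects remain, which is why the savings stop at 6%.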

open addressing

Of the remaining memory usage, 58% goes to Array[HashEntry] and DefaultEntry instances, whose layouts are in listings 6 and 7.

[Lscala.collection.mutable.HashEntry; object internals:
 OFFSET  SIZE                               TYPE DESCRIPTION                                     VALUE
      0     4                                    (object header)                                 19 3b cf 98 (0001 1001 0011 1011 1100 1111 1001 1000)
      4     4                                    (object header)                                 30 00 00 00 (0011 0000 0000 0000 0000 0000 0000 0000)
      8     4                                    (object header)                                 05 a8 02 f8 (0000 0101 1010 1000 0000 0010 1111 1000)
     12     4                                int [Lscala.collection.mutable.HashEntry;.length    N/A
     16     0 scala.collection.mutable.HashEntry [Lscala.collection.mutable.HashEntry;.<elements> N/A
     16    56                                    (loss due to the next object alignment)
Instance size: 72 bytes (reported by Instrumentation API)
Space losses: 0 bytes internal + 56 bytes external = 56 bytes total
Listing 6. Array[HashEntry] class layout
scala.collection.mutable.DefaultEntry object internals:
 OFFSET  SIZE   TYPE DESCRIPTION                    VALUE
      0     4        (object header)                19 3b cf 98 (0001 1001 0011 1011 1100 1111 1001 1000)
      4     4        (object header)                30 00 00 00 (0011 0000 0000 0000 0000 0000 0000 0000)
      8     4        (object header)                05 a8 02 f8 (0000 0101 1010 1000 0000 0010 1111 1000)
     12     4 Object DefaultEntry.key               (access denied)
     16     4 Object DefaultEntry.value             (access denied)
     20     4 Object DefaultEntry.next              (access denied)
     24    48        (loss due to the next object alignment)
Instance size: 72 bytes (reported by Instrumentation API)
Space losses: 0 bytes internal + 48 bytes external = 48 bytes total
Listing 7. DefaultEntry class layout

The HashTable chaining structure doesn't work well with the JVM's object overhead. We can see in figure 1 how much space is wasted. Much of this could be eliminated by setting def initialSize = 8, but there's a better data structure for this.

[Figure: an InternalNode (_edgeStart, parent, _edgeEnd, children, _suffixLink) referencing a mutable.HashMap/HashTable (_loadFactor, tableSize, threshold, seedvalue, table, sizemap, plus padding), whose Array[HashEntry] table chains mutable.DefaultEntry instances (key, value, next) keying boxed Characters to child Nodes.]
Figure 1. Naive implementation, with HashMap

Open addressing avoids the extra pointers and should eliminate the DefaultEntry instances, about 25% of memory use. But some overhead may be added to maintain a lower load factor in the hash table.
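The idea, in miniature (a sketch of open addressing with linear probing, not OpenHashMap's actual code):

```scala
// Keys and values live in flat arrays; a collision moves to the next slot
// instead of allocating a chain entry. Assumes the table never fills
// completely (a real map resizes to hold the load factor down).
class ProbingMap[V <: AnyRef](capacity: Int) {
  private val keys = new Array[Any](capacity)  // null = empty slot
  private val vals = new Array[AnyRef](capacity)

  def put(k: Char, v: V): Unit = {
    var i = k.hashCode % capacity
    while (keys(i) != null && keys(i) != k) i = (i + 1) % capacity // probe
    keys(i) = k
    vals(i) = v
  }

  def get(k: Char): Option[V] = {
    var i = k.hashCode % capacity
    while (keys(i) != null) {
      if (keys(i) == k) return Some(vals(i).asInstanceOf[V])
      i = (i + 1) % capacity
    }
    None // hit an empty slot: the key was never inserted
  }
}
```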

Starting from the naive implementation again, but replacing the InternalNode's member HashMap with an OpenHashMap, we see the memory footprint in listing 8.

com.performantdata.suffixtree.SuffixTree@2118cddfd footprint:
     COUNT       AVG       SUM   DESCRIPTION
...
    320437        61  19587504   [Lscala.collection.mutable.OpenHashMap$OpenEntry;
    320436        32  10253952   com.performantdata.suffixtree.InternalNode
    320437        40  12817480   com.performantdata.suffixtree.InternalNode$$anon$1
...
    820386        16  13126176   scala.Some
...
    820386        32  26252352   scala.collection.mutable.OpenHashMap$OpenEntry
...
   3102231            96173400   (total)
Listing 8. Memory footprint with OpenHashMap instead of HashMap

It's actually worse than the naive implementation! The combined size of the Array[OpenEntry]s and OpenEntrys is comparable to that of the Array[HashEntry]s and DefaultEntrys. The Array[OpenEntry] is smaller, thanks to initialSize = 8, but an OpenEntry carries an additional hash: Int member, and the implementation allocates an Option instance for each OpenEntry as well. The underlying problem is that this implementation doesn't take advantage of the possibility of storing the value references directly in the hash table.

Debox

Debox is an open addressing hash map implementation that does this. In listing 9, we see that the memory footprint with Debox is improved over the naive case.

com.performantdata.suffixtree.SuffixTree@5552768bd footprint:
     COUNT       AVG       SUM   DESCRIPTION
    320442        24   7707096   [B
    320623        32  10284320   [C
    320437        48  15380976   [Lcom.performantdata.suffixtree.Node;
...
    320436        32  10253952   com.performantdata.suffixtree.InternalNode
    499950        24  11998800   com.performantdata.suffixtree.Leaf
...
    320437        56  17944472   debox.Map
...
   2103075            75742344   (total)
Listing 9. Memory footprint with Debox Map

As we did with HashTable, we could merge the map into InternalNode to eliminate the Map header overhead. It wouldn't be as clean as with the HashTable trait, because Debox makes its members public, so they would become public in the InternalNode API. And it would still leave substantial header overhead in the Map.keys, .vals, and .buckets array instances. Instead, we can replace all the maps with one map that's keyed on both the internal node identity and the first edge-label character.

The simple approach would be to key the map on a Tuple2 of (InternalNode[I], I) (where I is the character type). But it's often necessary to traverse all of the children of a given node, and this keying scheme wouldn't cluster together the children of each node. So what's needed is a custom map structure that does this.

TwoKeyOpenHashMap

A simple design would be to hash the internal node to a reference to a second-level hash map from the edge character to the child node. But this would create a (second-level hash map) instance for each internal node, incurring the instance overhead as in the naive implementation.

A better design would divide the hash table into "bucket" sub-tables, one per internal node key, into which the edge character would be hashed. In the special case in which the alphabet is small, the bucket may be able to efficiently contain a slot for each character in the alphabet. But in the case of a general alphabet: (1) we need to plan for the case of bucket overflow; and (2) the allocation of one bucket per internal node would waste space, since the number of child nodes per internal node will vary widely.

The solution that I use here is to apply open addressing to the first key (internal node ID) at the bucket table level, and again to the second key (first edge character) within a bucket. If the slot in the bucket is in use, linear probing finds another. When no space can be found in a bucket, double hashing on the first key finds the next bucket. When no space can be found in any bucket, resizing occurs. (Actually, in this implementation, resizing occurs to keep the load factor below 50%.) Any bucket size will work, but it is optimally chosen to contain a whole alphabet.
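The scheme just described can be modeled in miniature as follows (a simplified sketch, not the actual implementation; names and sizes are illustrative, resizing and deletion are omitted):

```scala
// The flat table is divided into fixed-size buckets: key1 chooses a bucket
// (with a double-hash stride on overflow), and key2 is probed linearly
// within the bucket. Requires nBuckets >= 2.
final class TwoKeyMap[K1 <: AnyRef, K2, V <: AnyRef](bucketSize: Int, nBuckets: Int) {
  private val key1s = new Array[AnyRef](nBuckets * bucketSize)
  private val key2s = new Array[Any](nBuckets * bucketSize)
  private val vals  = new Array[AnyRef](nBuckets * bucketSize)

  private def hash(k: Any): Int = k.hashCode & 0x7fffffff

  /** Insert or overwrite; returns false if no slot was found (caller would resize). */
  def put(k1: K1, k2: K2, v: V): Boolean = {
    var bucket = hash(k1) % nBuckets
    val stride = 1 + hash(k1) % (nBuckets - 1)    // double-hash step for key1
    var bucketsTried = 0
    while (bucketsTried < nBuckets) {
      val base = bucket * bucketSize
      var i = hash(k2) % bucketSize
      var probes = 0
      while (probes < bucketSize) {
        val s = base + i
        if (key1s(s) == null) {                   // empty slot: claim it
          key1s(s) = k1; key2s(s) = k2; vals(s) = v
          return true
        }
        if ((key1s(s) eq k1) && key2s(s) == k2) { // same keys: overwrite
          vals(s) = v
          return true
        }
        i = (i + 1) % bucketSize                  // linear probe within the bucket
        probes += 1
      }
      bucket = (bucket + stride) % nBuckets       // bucket full: next bucket
      bucketsTried += 1
    }
    false
  }

  def get(k1: K1, k2: K2): Option[V] = {
    var bucket = hash(k1) % nBuckets
    val stride = 1 + hash(k1) % (nBuckets - 1)    // same probe sequence as put
    var bucketsTried = 0
    while (bucketsTried < nBuckets) {
      val base = bucket * bucketSize
      var i = hash(k2) % bucketSize
      var probes = 0
      while (probes < bucketSize) {
        val s = base + i
        if (key1s(s) == null) return None         // probe chain ends: absent
        if ((key1s(s) eq k1) && key2s(s) == k2)
          return Some(vals(s).asInstanceOf[V])
        i = (i + 1) % bucketSize
        probes += 1
      }
      bucket = (bucket + stride) % nBuckets
      bucketsTried += 1
    }
    None
  }
}
```

Because a node's children land in a run of slots within one bucket, traversing a node's children touches contiguous memory, which the Tuple2 keying wouldn't provide.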

I call the map implementation TwoKeyOpenHashMap. Its effect on memory usage is shown in listing 10.

com.performantdata.suffixtree.SuffixTree@499d5c4bd footprint:
     COUNT       AVG       SUM   DESCRIPTION
         6    352296   2113776   [B
       209     20262   4234808   [C
         1   8388624   8388624   [Lcom.performantdata.suffixtree.InternalNode;
         1   8388624   8388624   [Lcom.performantdata.suffixtree.Node;
...
         6    349544   2097264   [Ljava.lang.Object;
...
         1   8388624   8388624   [Lscala.collection.mutable.OpenHashMap$OpenEntry;
    320436        32  10253952   com.performantdata.suffixtree.InternalNode
    499950        24  11998800   com.performantdata.suffixtree.Leaf
...
    320437        16   5126992   scala.Some
...
    320437        32  10253984   scala.collection.mutable.OpenHashMap$OpenEntry
...
   1462305            71305504   (total)
Listing 10. Memory footprint with TwoKeyOpenHashMap

It exhibits the least usage of all the structures so far, but not quite the savings that one would have expected. This is because it needs to use a map internally (to track the number of values stored per internal node), and the implementation I chose was OpenHashMap, which has the shortcomings that we've seen already.

There are two other open-addressing map implementations in the Scala library: LongMap and AnyRefMap. The former accepts only Long keys, the latter only AnyRef keys. Since the InternalNode keys satisfy the latter, I can, for the time being, place an upper bound of <: AnyRef on the first key's type parameter. With this, we see the memory usage in listing 11, with the layout as in figure 2. Now we're starting to see significant reductions, 27% less than the naive solution, and tree construction stops shortly after processing 40 Mbp, but still not the number I'm aiming for.

com.performantdata.suffixtree.SuffixTree@3fe9ee68d footprint:
     COUNT       AVG       SUM   DESCRIPTION
         6    352296   2113776   [B
       210     20164   4234536   [C
         1   4194320   4194320   [I
         1   8388624   8388624   [Lcom.performantdata.suffixtree.InternalNode;
         1   8388624   8388624   [Lcom.performantdata.suffixtree.Node;
...
         8   1310738  10485904   [Ljava.lang.Object;
...
    320436        32  10253952   com.performantdata.suffixtree.InternalNode
    499950        24  11998800   com.performantdata.suffixtree.Leaf
...
    821444            60119320   (total)
Listing 11. Memory footprint with TwoKeyOpenHashMap using AnyRefMap
[Figure: an InternalNode (_edgeStart, parent, _edgeEnd, children, _suffixLink) pointing into the TwoKeyOpenHashMap, whose parallel arrays (key1s: Array[InternalNode], vals: Array[Node], key2s: Array[Char], states: Array[Byte] of Empty/Occupied slots) plus an AnyRefMap of key1Sizes replace the per-node maps.]
Figure 2. TwoKeyOpenHashMap implementation, with AnyRefMap
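For illustration, the per-node size bookkeeping with AnyRefMap might look like this (String stands in for the InternalNode key; an assumed detail, not the actual code):

```scala
import scala.collection.mutable.AnyRefMap

// AnyRefMap is open-addressed: it stores keys, values, and hashes in flat
// arrays rather than allocating a chained entry object per mapping.
val key1Sizes = AnyRefMap.empty[String, Int]

// Bump the child count for a node when a child is inserted.
def recordChild(node: String): Unit =
  key1Sizes(node) = key1Sizes.getOrElse(node, 0) + 1

recordChild("node1")
recordChild("node1")
recordChild("node2")
```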

conclusions

The total memory usage for the various implementations is shown in table 1. Further improvements are the subject of future articles.

Table 1. Memory usage in the various implementations
implementation                         bytes used   reduction
naive/initial                            82531592
merged map/InternalNode                  77404600          6%
OpenHashMap instead of HashMap           96173400        -17%
Debox Map                                75742344          8%
TwoKeyOpenHashMap using OpenHashMap      71305504         14%
TwoKeyOpenHashMap using AnyRefMap        60119320         27%
