The Bupstash Garbage Collector

Overview

My backup tool bupstash stores backups in a repository as an ever-growing set of encrypted data trees that use content addressing and structural sharing to deduplicate data. In order to delete unused backups we need to do something very similar to what many programming languages do to free unreferenced memory - garbage collection. This post explains the evolution and implementation of the bupstash garbage collector for the curious.
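
To make the terms concrete, here is a toy sketch in Rust of what content addressing and deduplication mean: each chunk's address is a cryptographic hash of its bytes, so storing the same chunk twice costs nothing extra. This is only an illustration of the idea (it assumes the blake3 crate for hashing), not bupstash's actual format, which also encrypts chunks.

use std::collections::HashMap;

/// A 32-byte content address derived from a chunk's bytes.
type Address = [u8; 32];

/// Toy content-addressed chunk store: identical chunks are stored once.
struct ChunkStore {
    chunks: HashMap<Address, Vec<u8>>,
}

impl ChunkStore {
    fn new() -> Self {
        ChunkStore { chunks: HashMap::new() }
    }

    /// Store a chunk and return its address. Storing the same bytes
    /// again hits the existing entry, which is where deduplication
    /// and structural sharing between backups come from.
    fn put(&mut self, data: &[u8]) -> Address {
        let addr: Address = *blake3::hash(data).as_bytes();
        self.chunks.entry(addr).or_insert_with(|| data.to_vec());
        addr
    }

    fn get(&self, addr: &Address) -> Option<&[u8]> {
        self.chunks.get(addr).map(|v| v.as_slice())
    }
}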

Stop, mark and sweep.

The initial version of the bupstash garbage collector was a naive stop-the-world collector: walk every backup data tree to build the set of reachable chunk addresses, then delete any chunk that is not in that set.

The algorithm can be described as:

  • Stop the world.
  • Mark.
  • Sweep.
  • Restart the world.

Here is the pseudo code:

lock_repository()
reachable_addresses = empty_set()

for data_tree in all_backups():
  
  work_list = new_work_list_from_backup(data_tree.root_address)
  
  until work_list.is_empty():
    
    node_height, node_address = work_list.pop()
    reachable_addresses.add(node_address)
    
    if node_height != 0:
      add_child_nodes_to_worklist(work_list, node_address)

for chunk_address in repository:
  if not reachable_addresses.has(chunk_address):
    delete_chunk(chunk_address)

unlock_repository()
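
For a slightly more concrete picture, here is roughly the same stop-the-world collector sketched in Rust. The Repository trait and its methods are hypothetical stand-ins for the real repository operations, not bupstash's actual API.

use std::collections::HashSet;

/// A 32-byte content address of a chunk.
type Address = [u8; 32];

/// The repository operations the collector needs. These names are
/// hypothetical stand-ins, not bupstash's real API.
trait Repository {
    /// The root node of every backup's data tree, with its height.
    fn backup_roots(&self) -> Vec<(u64, Address)>;
    /// The children of an interior (height > 0) tree node.
    fn child_nodes(&self, addr: &Address) -> Vec<(u64, Address)>;
    /// Every chunk address currently stored in the repository.
    fn all_chunks(&self) -> Vec<Address>;
    fn delete_chunk(&mut self, addr: &Address);
    fn lock(&mut self);
    fn unlock(&mut self);
}

/// Naive stop-the-world mark and sweep.
fn garbage_collect(repo: &mut impl Repository) {
    repo.lock();

    // Mark: walk every backup's data tree, recording reachable addresses.
    let mut reachable: HashSet<Address> = HashSet::new();
    for root in repo.backup_roots() {
        let mut work_list = vec![root];
        while let Some((height, addr)) = work_list.pop() {
            reachable.insert(addr);
            if height != 0 {
                work_list.extend(repo.child_nodes(&addr));
            }
        }
    }

    // Sweep: delete every chunk that was never marked.
    for addr in repo.all_chunks() {
        if !reachable.contains(&addr) {
            repo.delete_chunk(&addr);
        }
    }

    repo.unlock();
}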

This algorithm is short and sweet, but if the repository is very large it becomes unavailable for a potentially long time. The next version shortens that downtime significantly.

Mark, stop, mark and sweep.

The bupstash v0.6.4 garbage collector takes advantage of two facts:

  • Because many backups share data with each other, we can memoize the walk phase to skip work.
  • Because backups are immutable while the repository is unlocked, we can walk most of them without stopping the world.

This time we walk the repository once without holding the lock, then lock it and walk it again, using our memoization to quickly complete the job. If any new backups appeared during the first walk, we are guaranteed to mark them now that the repository lock is held. Doing the majority of the slow mark phase without the lock greatly increases the repository's availability for other operations.

The algorithm can be described as:

  • Memoized mark repository.
  • Stop the world.
  • Memoized mark repository.
  • Sweep.
  • Restart the world.

Here is the pseudo code:

walked_trees = empty_set()
reachable_addresses = empty_set()

func walk_repository():

  for data_tree in all_backups():

    if walked_trees.has(data_tree):
      continue

    walked_trees.add(data_tree)
    
    work_list = new_work_list_from_backup(data_tree.root_address)
    
    until work_list.is_empty():
      
      node_height, node_address = work_list.pop()
      already_walked = reachable_addresses.has(node_address)
      reachable_addresses.add(node_address)
      
      if node_height != 0 and not already_walked:
        add_child_nodes_to_worklist(work_list, node_address)

walk_repository()
lock_repository()
walk_repository()

for chunk_address in repository:
  if not reachable_addresses.has(chunk_address):
    delete_chunk(chunk_address)

unlock_repository()
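
In terms of the earlier Rust sketch, the memoized mark phase only needs two extra pieces of state, a set of already walked trees and a set of already visited addresses, and the collector simply runs it once unlocked and once locked. This again uses the hypothetical Repository trait from above and replaces the earlier garbage_collect.

/// Memoized mark phase: cheap to re-run because trees and addresses
/// that were already walked are skipped.
fn mark(
    repo: &impl Repository,
    walked_trees: &mut HashSet<Address>,
    reachable: &mut HashSet<Address>,
) {
    for (height, root) in repo.backup_roots() {
        // Identify a tree by its root address; skip trees we already walked.
        if !walked_trees.insert(root) {
            continue;
        }
        let mut work_list = vec![(height, root)];
        while let Some((height, addr)) = work_list.pop() {
            // If an earlier walk reached this node, its whole subtree is
            // already marked thanks to structural sharing.
            let already_walked = !reachable.insert(addr);
            if height != 0 && !already_walked {
                work_list.extend(repo.child_nodes(&addr));
            }
        }
    }
}

/// Mark once without the lock, then lock and mark again to pick up any
/// backups that appeared in the meantime, then sweep.
fn garbage_collect(repo: &mut impl Repository) {
    let mut walked_trees = HashSet::new();
    let mut reachable = HashSet::new();

    mark(repo, &mut walked_trees, &mut reachable);
    repo.lock();
    mark(repo, &mut walked_trees, &mut reachable);

    for addr in repo.all_chunks() {
        if !reachable.contains(&addr) {
            repo.delete_chunk(&addr);
        }
    }

    repo.unlock();
}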

Mark, stop, mark, bloom adjust, and sweep.

The next version of the garbage collector is a work in progress and has not been released yet.

In bupstash each data chunk address is 32 bytes, which means we need at least 32 bytes of RAM per chunk in the repository to successfully perform a garbage collection. For a repository containing 100 million chunks or more that is already more than 3 gigabytes of addresses, which could easily exhaust memory on a busy or small system.

If we are willing to accept a very low probability of retaining some extra data, we can reduce this down to a few bits per chunk, reducing memory usage by 64x or more. To do this we use a probabilistic data structure called a bloom filter to track reachable addresses.

Bloom filters have two downsides:

  • They can have false positives. For us this means we might accidentally keep a data chunk that could actually have been freed.
  • We must presize them. They cannot grow if we got the size wrong.

Luckily for us, these downsides are not so bad:

  • Because memory use per address is so low, we can generously size our bloom filter, reducing false positives to less than one percent even for large repositories (see the estimate after this list).
  • We can easily detect when a bloom filter gets overly full and produces many false positives, and thus adjust its size for future garbage collections.
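
To put rough numbers on the first point: the usual approximation for a bloom filter of m bits holding n addresses with k hash probes is a false positive rate of about (1 - e^(-k*n/m))^k. At around 10 bits per address with k = 7 probes that works out to roughly 0.8 percent, still a small fraction of the 256 bits needed to store each address exactly. These particular numbers are only illustrative; the size bupstash actually uses is adjusted between collections, as described below.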

As a bonus, the bupstash bloom filter is actually simpler than a hash table, with the implementation weighing in at around 30 lines of code plus tests and helpers.
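
A minimal sketch of such a filter in Rust might look something like the following. It is an illustration rather than the actual bupstash code, and for simplicity it derives its probe positions from the standard library's DefaultHasher.

use std::collections::hash_map::DefaultHasher;
use std::hash::Hasher;

/// A 32-byte content address of a chunk.
type Address = [u8; 32];

/// A simple bloom filter over chunk addresses.
struct BloomFilter {
    bits: Vec<u8>,
    num_probes: u64,
}

impl BloomFilter {
    fn new(num_bytes: usize, num_probes: u64) -> Self {
        BloomFilter { bits: vec![0; num_bytes], num_probes }
    }

    /// Derive the i'th probe position by hashing the probe index
    /// together with the address.
    fn probe(&self, i: u64, addr: &Address) -> (usize, u8) {
        let mut h = DefaultHasher::new();
        h.write_u64(i);
        h.write(addr);
        let bit = h.finish() as usize % (self.bits.len() * 8);
        (bit / 8, 1 << (bit % 8))
    }

    fn add(&mut self, addr: &Address) {
        for i in 0..self.num_probes {
            let (byte, mask) = self.probe(i, addr);
            self.bits[byte] |= mask;
        }
    }

    /// May return true for an address that was never added (a false
    /// positive), but never returns false for one that was added.
    fn probably_contains(&self, addr: &Address) -> bool {
        (0..self.num_probes).all(|i| {
            let (byte, mask) = self.probe(i, addr);
            self.bits[byte] & mask != 0
        })
    }

    /// Fraction of bits set; a crude utilization measure the collector
    /// can use to grow or shrink the filter for the next run.
    fn utilization(&self) -> f64 {
        let ones: u32 = self.bits.iter().map(|b| b.count_ones()).sum();
        ones as f64 / (self.bits.len() * 8) as f64
    }
}

The utilization figure is what lets the collector notice an overly full or mostly empty filter and pick a better size for the next garbage collection, as the pseudo code below does.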

The algorithm can be described as:

  • Memoized mark repository.
  • Stop the world.
  • Memoized mark repository.
  • Adjust bloom filter size.
  • Sweep.
  • Restart the world.

Here is the pseudo code:

walked_trees = empty_set()
walked_addresses = empty_cache()
reachable_addresses = empty_bloom_filter(repository_bloom_size())

func walk_repository():

  for data_tree in all_backups():

    if walked_trees.has(data_tree):
      continue

    walked_trees.add(data_tree)
    work_list = new_work_list_from_backup(data_tree.root_address)
    
    until work_list.is_empty():
      
      node_height, node_address = work_list.pop()
      already_walked = walked_addresses.has(node_address)
      reachable_addresses.add(node_address)
      walked_addresses.add(node_address)
      
      if node_height != 0 and not already_walked:
        add_child_nodes_to_worklist(work_list, node_address)

walk_repository()
lock_repository()
walk_repository()

if bloom_filter_overutilized(reachable_addresses):
  increase_bloom_filter_size_for_next_gc()
else if bloom_filter_underutilized(reachable_addresses):
  decrease_bloom_filter_size_for_next_gc()

for chunk_address in repository:
  if not reachable_addresses.has(chunk_address):
    delete_chunk(chunk_address)

unlock_repository()

One thing to note about this implementation is that because the bloom filter has false positives, we cannot use it to decide which addresses we have already processed; instead we introduce a separate cache (walked_addresses above) for that purpose. A cache also lets us put a fixed upper bound on memory usage for large repositories.
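
As an illustration of how simple such a cache can be, here is a sketch (not the exact bupstash data structure) of a fixed-size, direct-mapped cache of addresses where a colliding entry simply overwrites the previous one. A cache miss only means some work is repeated; it never causes data to be wrongly deleted.

/// A 32-byte content address of a chunk.
type Address = [u8; 32];

/// Fixed-size "have we walked this address already?" cache.
/// Collisions evict the previous entry, so a lookup can miss for an
/// address we did walk; the only cost is re-walking a subtree.
struct WalkedCache {
    slots: Vec<Option<Address>>,
}

impl WalkedCache {
    fn new(num_slots: usize) -> Self {
        WalkedCache { slots: vec![None; num_slots] }
    }

    /// Pick a slot from the first 8 bytes of the (already uniformly
    /// distributed) address.
    fn slot(&self, addr: &Address) -> usize {
        let mut prefix = [0u8; 8];
        prefix.copy_from_slice(&addr[..8]);
        u64::from_le_bytes(prefix) as usize % self.slots.len()
    }

    fn contains(&self, addr: &Address) -> bool {
        self.slots[self.slot(addr)] == Some(*addr)
    }

    fn add(&mut self, addr: &Address) {
        let s = self.slot(addr);
        self.slots[s] = Some(*addr);
    }
}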

Overall, for large repositories the addition of a bloom filter reduces RAM requirements from the gigabyte range down to tens of megabytes, while increasing the repository size by only a fraction of a percent due to falsely retained chunks.

Conclusion and the future

The bupstash garbage collector tries hard to keep the repository available as much as possible while also providing excellent performance with a small memory profile. If you are implementing a content-addressed storage system, this post will hopefully provide you with some new ideas.

In the future bupstash could follow the same path as programming language garbage collectors to find more improvements:

  • Parallelism in the mark phase.
  • Parallelism in the sweep phase.
  • Incremental collection.
  • Object generations.
  • Write barriers.
  • Yet to be invented techniques…

As always, thanks for reading and please feel free to give bupstash a try.