Sunday, October 18, 2015

Backshift is not that slow, and for good reason





  • Backshift is a deduplicating backup program written in Python.
  • At http://burp.grke.org/burp2/08results1.html you can find a performance comparison between some backup applications.


  • The comparison did not include backshift, because backshift was believed to have prohibitively slow deduplication.
  • Backshift is admittedly no speed demon. It is designed to:
    1. minimize storage requirements
    2. minimize bandwidth requirements
    3. favor, to some extent, parallel performance (concurrent backups of different computers)
    4. allow expiration of old data that is no longer needed
  • Also, it was almost certainly not backshift's deduplication that was slow. The likely causes were:
    1. backshift's variable-length, content-based blocking algorithm. This makes Python inspect every byte of the backup, one byte at a time.
    2. backshift's use of xz compression. xz compresses files very tightly, reducing storage and bandwidth requirements, but it is known to be slower than something like gzip, which doesn't compress as well.
  • Also, while the initial full save is slow, subsequent backups are much faster because they do not reblock or recompress any file that still has the same mtime and size as recorded in one of (up to) three previous backups.
  • Also, if you run backshift on PyPy, its variable-length, content-based blocking algorithm is many times faster than if you run it on CPython. PyPy is not only faster than CPython, it's also much faster than CPython augmented with Cython.
  • I sent G. P. E. Keeling an e-mail about this some time ago (the date of this writing is October 2015), but never received a response.
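
To make the variable-length, content-based blocking point above more concrete, here is a minimal sketch of the general technique in Python. It is not backshift's actual code: the function names (split_chunks, store_chunks, restore), the shift-and-add hash, and every constant are assumptions picked for illustration. It shows why such an algorithm has to touch every byte, and why bytes inserted at the front of a file need not force the rest of the file to be stored again. Chunks are kept once per SHA-256 digest and compressed with xz through Python's standard lzma module.

```python
import hashlib
import lzma
import random

# All constants are illustrative assumptions, not backshift's real settings.
MIN_CHUNK = 32          # never cut a chunk shorter than this
MAX_CHUNK = 4096        # always cut once a chunk reaches this length
BOUNDARY_MASK = 0xFF    # cut when the low 8 bits of the hash are zero


def split_chunks(data):
    """Split data into variable-length chunks whose boundaries depend on content."""
    chunks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):  # one byte at a time: the costly part in Python
        # Shift-and-add hash; its low 8 bits depend only on the last 8 bytes seen.
        rolling = ((rolling << 1) + byte) & 0xFFFFFFFF
        length = i - start + 1
        at_boundary = length >= MIN_CHUNK and (rolling & BOUNDARY_MASK) == 0
        if at_boundary or length >= MAX_CHUNK:
            chunks.append(data[start:i + 1])
            start = i + 1
            rolling = 0
    if start < len(data):
        chunks.append(data[start:])  # trailing partial chunk
    return chunks


def store_chunks(chunks, store):
    """Add chunks to store (digest -> xz bytes); return the ordered digest list."""
    digests = []
    for chunk in chunks:
        digest = hashlib.sha256(chunk).hexdigest()
        if digest not in store:                   # deduplication: keep each chunk once
            store[digest] = lzma.compress(chunk)  # default container format is xz
        digests.append(digest)
    return digests


def restore(digests, store):
    """Rebuild the original bytes from an ordered list of chunk digests."""
    return b"".join(lzma.decompress(store[d]) for d in digests)


rng = random.Random(0)
original = bytes(rng.getrandbits(8) for _ in range(20000))
edited = b"a few new bytes" + original  # the insertion shifts every later byte

store = {}
first = store_chunks(split_chunks(original), store)
second = store_chunks(split_chunks(edited), store)
print("chunks in first backup: ", len(first))
print("chunks in second backup:", len(second))
print("chunks actually stored: ", len(store))
```

With fixed-size blocks, the inserted prefix would shift every block and nothing would deduplicate. Because the cut points here depend only on nearby bytes, the two chunk streams fall back into step after the insertion, so the store should end up holding noticeably fewer chunks than the two backups contain in total. The per-byte loop in split_chunks is the kind of code the CPython versus PyPy remark above is about.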