Backshift not that slow, and for good reason
Backshift is a deduplicating backup program written in Python.
At http://burp.grke.org/burp2/08results1.html you can find a performance comparison between some backup applications.
The comparison did not include backshift, because backshift was believed to have prohibitively slow deduplication.
Backshift is admittedly not a speed demon. It is designed to:
- minimize storage requirements
- minimize bandwidth requirements
- emphasize, to some extent, parallel performance (concurrent backups of different computers)
- allow expiration of old data that is no longer needed
Also, it was almost certainly not backshift's deduplication that was slow. It was:
- backshift's variable-length, content-based blocking algorithm. This makes Python inspect every byte of the backup, one byte at a time.
- backshift's use of xz compression. xz packs files very hard, reducing storage and bandwidth requirements, but it is known to be slower than something like gzip, which doesn't compress as well.
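To give a feel for why content-based blocking is expensive in pure Python, here is a minimal sketch of the general technique. It is not backshift's actual algorithm: the function name, the simple rolling sum, and every parameter (`window`, `mask`, `min_size`, `max_size`) are assumptions picked for illustration. The point is only that the loop has to touch every byte to decide where one block ends and the next begins.

```python
def content_defined_chunks(data, window=16, mask=0xFF, min_size=64, max_size=4096):
    """Yield variable-length chunks of a bytes object.

    Illustrative only: a chunk ends where the sum of the last `window`
    bytes matches a bit pattern, so boundaries depend on content rather
    than on fixed offsets. All parameters are hypothetical defaults.
    """
    start = 0      # index where the current chunk begins
    rolling = 0    # sum of the most recent bytes (at most `window` of them)
    for i, byte in enumerate(data):
        rolling += byte
        if i - start >= window:
            # Drop the byte that just slid out of the window.
            rolling -= data[i - window]
        length = i - start + 1
        at_boundary = length >= min_size and (rolling & mask) == mask
        if at_boundary or length >= max_size:
            yield data[start:i + 1]
            start = i + 1
            rolling = 0
    if start < len(data):
        # Whatever is left over becomes the final (possibly short) chunk.
        yield data[start:]


if __name__ == "__main__":
    sample = bytes((i * 7 + i // 13) % 251 for i in range(20000))
    sizes = [len(chunk) for chunk in content_defined_chunks(sample)]
    print(len(sizes), "chunks, largest", max(sizes))
```

Each chunk would then typically be hashed and stored only if that hash has not been seen before. The per-byte loop above is exactly the kind of code that an interpreter runs slowly and a JIT runs quickly.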
Also, while the initial full save is slow, subsequent backups are much faster because they
do not reblock or recompress any files that still have the same mtime and size as found in one of (up to) three
previous backups.
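The check described above can be sketched roughly as follows. This is an assumption-laden illustration, not backshift's code: `needs_reblocking` is a hypothetical name, and I represent each previous backup as a plain dict mapping a path to an `(mtime, size)` tuple, newest backup first.

```python
import os


def needs_reblocking(path, previous_backups, max_lookback=3):
    """Return True if `path` must be reblocked and recompressed.

    `previous_backups` is a hypothetical structure: a list of dicts,
    newest first, each mapping a path to the (mtime, size) recorded
    for it in that backup. Only the first `max_lookback` are consulted.
    """
    st = os.stat(path)
    current = (st.st_mtime, st.st_size)
    for backup in previous_backups[:max_lookback]:
        if backup.get(path) == current:
            # Same mtime and size as a recent backup: reuse its blocks.
            return False
    return True
```

With a check like this, an incremental run only pays the per-byte blocking and xz costs for files that are new or have changed.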
Also, if you run backshift on PyPy, its variable-length, content-based blocking algorithm is
many times faster than if you run it on CPython. For this algorithm, PyPy is not only faster than CPython, it's also much
faster than CPython augmented with Cython.
I sent G. P. E. Keeling an e-mail about this some time ago (the date of this writing is October 2015), but
never received a response.
Hello Dan,
This is Graham Keeling.
I replied to your email on the 28th of August 2015, about a day and a half after you sent it.
Maybe my email went into your spam folder, or something?
Would you like me to resend it?
Yes, please.
OK, done.