Friday, January 17, 2014

Python dictionary-like trees with O(log₂ n) find_min and find_max, and O(n) ordered traversal

I've made some changes to the tree comparison I did in 2012.

I've added a few more dictionary-like tree datastructures.

I also changed the methodology: instead of running 4 million operations for every datastructure, each test now runs to 8 million operations or 2 minutes, whichever comes first.  This allows comparing datastructures that perform well at one workload and poorly at another.

I also told it to use both random and sequential workloads this time - some datastructures are good at one and poor at the other.

The chief reasons to use a tree instead of a standard dictionary are:

  • You get find_min and find_max methods, which run in O(log₂ n) time.  Standard dictionaries do this in O(n) time, which is much slower for large n.
  • You get ordered iteration in O(n) time.  Standard dictionaries do this in O(n log n) time (the keys must first be sorted), which is also much slower for large n.
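To sketch where those bounds come from, here is a minimal dictionary-like binary search tree.  This is an illustration I wrote for this post, not one of the datastructures in the comparison: it is unbalanced, so its O(log₂ n) bounds only hold when insertions arrive in a reasonably random order; the trees in the comparison (red-black, AVL, treap, etc.) rebalance to guarantee them.  The names BSTDict and Node are mine.

```python
class Node:
    """One key/value pair in the tree."""
    __slots__ = ('key', 'value', 'left', 'right')

    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None


class BSTDict:
    """Toy dict-like binary search tree (unbalanced sketch)."""

    def __init__(self):
        self.root = None

    def __setitem__(self, key, value):
        if self.root is None:
            self.root = Node(key, value)
            return
        node = self.root
        while True:
            if key == node.key:
                node.value = value          # overwrite, like a dict
                return
            side = 'left' if key < node.key else 'right'
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(key, value))
                return
            node = child

    def __getitem__(self, key):
        node = self.root
        while node is not None:
            if key == node.key:
                return node.value
            node = node.left if key < node.key else node.right
        raise KeyError(key)

    def find_min(self):
        # Follow left links to the bottom: O(depth), which is
        # O(log2 n) in a balanced tree.  A plain dict would need
        # to scan all n keys.
        if self.root is None:
            raise KeyError('find_min from empty tree')
        node = self.root
        while node.left is not None:
            node = node.left
        return node.key, node.value

    def find_max(self):
        # Mirror image: follow right links to the bottom.
        if self.root is None:
            raise KeyError('find_max from empty tree')
        node = self.root
        while node.right is not None:
            node = node.right
        return node.key, node.value

    def __iter__(self):
        # Iterative in-order traversal: yields keys in sorted
        # order, visiting each node once, so O(n) total -- no
        # separate O(n log n) sort as with a plain dict's keys.
        stack, node = [], self.root
        while stack or node is not None:
            while node is not None:
                stack.append(node)
                node = node.left
            node = stack.pop()
            yield node.key
            node = node.right
```

For example, after inserting the keys 5, 2, 8, 1, 9, find_min() returns the pair for key 1, find_max() the pair for key 9, and iterating yields 1, 2, 5, 8, 9 without ever sorting.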

The latest version of the comparison is here, as a series of graphs and as an HTML click-to-expand hierarchy.  It's also linked from the URL at the top of this entry.

Friday, January 3, 2014

Backshift announcement

Backshift is a deduplicating filesystem backup tool in Python that compresses input data very hard and supports removing old data to make room for new data.

Backshift is thoroughly and automatically tested; is thoroughly documented; runs on a variety of Python interpreters; runs on a wide assortment of operating systems; and passes pylint.

Files to back up are selected by piping "find /dir -print0" into backshift, and files are restored by piping backshift's output into "tar xvfp".  Usage examples can be found here.

Here's a table comparing backshift and related tools.

Here's a list of backshift's features and misfeatures.

My hope is it will be useful to some of you.

Results of the Python 2.x vs 3.x survey

The results are at