Friday, January 17, 2014

Python dictionary-like trees with O(log2n) find_min and find_max, and O(n) ordered traversal


I've made some changes to the tree comparison I did in 2012.

I've added a few more dictionary-like tree datastructures.

I also changed the methodology: Instead of running 4 million operations for all datastructures, I've told it to run each test to 8 million ops, or 2 minutes, whichever comes first.  This allowed comparison of datastructures that perform well at one workload and poorly at another.

I also told it to use both random and sequential workloads this time - some datastructures are good at one and poor at the other.

The chief reasons to use a tree instead of a standard dictionary, are:

  • You get find_min and find_max methods, which run in O(log2n) time.  Standard dictionaries do this in O(n) time, which is much slower for large n.
  • You get an ordered iteration in O(n) time.  Standard dictionaries do this in O(nlogn) time, which is also much slower for large n.

The latest version of the comparison is here, as a series of graphs and as an HTML click-to-expand hierarchy.  It's also linked from the URL at the top of this entry.

Friday, January 3, 2014

Backshift announcement


Backshift is a deduplicating filesystem backup tool in Python that compresses input data very hard and supports removing old data to make room for new data.

Backshift is thoroughly, automatically tested; is thoroughly documentedruns on a variety of Python interpreters; runs on a wide assortment of operating systems, and passes pylint.

Files to backup are selected by piping "find /dir -print0" into backshift, and files are restored by piping backshift to tar xvfp.  Usage examples can be found here.

Here's a table comparing backshift and related tools.

Here's a list of backshift's features and misfeatures.

My hope is it will be useful to some of you.

Results of the Python 2.x vs 3.x survey


The results are at https://wiki.python.org/moin/2.x-vs-3.x-survey

Thanks.

Tuesday, December 31, 2013

I put up a page listing some of the datastructures I've worked on over the years.  Most are in Python.  It can be found here.

Listed are a few dictionary-like datastructures (including one that wraps others to allow duplicates), a linked list, a hash table (in C), and a bloom filter.

Monday, December 30, 2013

Python 2.x vs 3.x usage survey


I've put together a 9-multiple-choice-question survey about Python 2.x vs 3.x at https://www.surveymonkey.com/s/N5N5PG2 .

If you have a free minute, I hope you'll take it.

Wednesday, July 10, 2013


I put together a comparison of several Python-in-the-browser technologies, for my local Python user group. They almost all "transpile" (compile to a different language) from Python to JavaScript. Some transpile once for each pageload, while most transpile once, like a separate compilation step.

I hope it helps others as well.

It can be found here.

Saturday, June 23, 2012

Python and tree datastructures: A performance comparison


At http://stromberg.dnsalias.org/~strombrg/python-tree-and-heap-comparison/ I've put together a comparison of a few in-memory tree datastructures for Python.  I hope to add heaps to the comparison someday, hence the URL.

The trees have each been made to run on cpython 2.x, cpython 3.x, pypy and jython, and they are all passing pylint now.

I tossed out two of the datastructures I wanted to compare, for the following reasons:

  1. 2-3 trees were tossed because they were giving incorrect results.  This  may or may not be my fault.  I had to add a simple class to make them work like a dictionary.
  2. Red-Black trees were tossed because they were very slow with large trees, so much so that they were making the other datastructures in which red-black trees appeared hard to read - that is, the things with better performance were getting all squished together and kinda merging into one fat line.  The problem seemed to be related to garbage collection - perhaps the red-black tree implementation was creating and discarding lots of temporary variables or something.  This too may or may not be my fault.
The top three datastructures I continued to examine were:
  1. Splay trees - sometimes first in performance.
  2. AVL trees - sometimes first in performance.
  3. Treaps - sometimes first in performance, never worse than second in performance.