Sunday, June 5, 2016

from-table

I've put from-table here.

It's a small Python 3 script that extracts one or more HTML tables as CSV data.  You can give it a URL or a file, and it can write to stdout or to a series of numbered filenames (one file per table).
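The core idea can be sketched with just the standard library. This is an illustrative sketch, not from-table's actual code; from-table additionally handles URLs, multiple tables, and numbered output files:

```python
# Illustrative sketch of HTML-table-to-CSV extraction using only the
# standard library; this is not from-table's actual source.
import csv
import io
from html.parser import HTMLParser

class TableExtractor(HTMLParser):
    """Collect the text of each <td>/<th> cell, row by row."""
    def __init__(self):
        super().__init__()
        self.rows = []
        self.row = []
        self.cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.row = []
        elif tag in ("td", "th"):
            self.cell = []

    def handle_data(self, data):
        if self.cell is not None:
            self.cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self.cell is not None:
            self.row.append("".join(self.cell).strip())
            self.cell = None
        elif tag == "tr" and self.row:
            self.rows.append(self.row)

html = "<table><tr><th>a</th><th>b</th></tr><tr><td>1</td><td>2</td></tr></table>"
extractor = TableExtractor()
extractor.feed(html)
out = io.StringIO()
csv.writer(out).writerows(extractor.rows)
print(out.getvalue(), end="")
```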

I hope folks find it useful.

Sunday, October 18, 2015

Backshift: not that slow, and for good reason

  • Backshift is a deduplicating backup program in Python.
  • At http://burp.grke.org/burp2/08results1.html you can find a performance comparison between some backup applications.
  • The comparison did not compare backshift, because backshift was believed to have prohibitively slow deduplication.
  • Backshift is truly not a speed-demon. It is designed to:
    1. minimize storage requirements
    2. minimize bandwidth requirements
    3. emphasize parallel (concurrent backups of different computers) performance to some extent
    4. allow expiration of old data that is no longer needed
  • Also, it was almost certainly not backshift's deduplication that was slow; it was:
    1. backshift's variable-length, content-based blocking algorithm, which makes Python inspect every byte of the backup, one byte at a time.
    2. backshift's use of xz compression. xz compresses very aggressively, reducing storage and bandwidth requirements, but it is known to be slower than something like gzip, which doesn't compress as well.
  • Also, while the initial full save is slow, subsequent backups are much faster, because they do not re-block or re-compress any file whose mtime and size match what was found in one of (up to) three previous backups.
  • Also, if you run backshift on PyPy, its variable-length, content-based blocking algorithm is many times faster than on CPython. PyPy is not only faster than CPython, it's also much faster than CPython augmented with Cython.
  • I sent G. P. E. Keeling an e-mail about this some time ago (the date of this writing is October 2015), but never received a response.
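The variable-length, content-based blocking mentioned above can be sketched as follows. This is an illustrative toy; backshift's actual hash, window, and block-size parameters differ:

```python
# Toy content-defined chunking: roll a cheap hash over the input and
# cut a block boundary whenever its low bits are all zero.  Boundaries
# depend only on nearby content, so an insertion early in a file does
# not shift every later block - which is what makes dedup work well.
import random

def content_blocks(data, min_block=64, mask=0xFF):
    blocks = []
    start = 0
    rolling = 0
    for i, byte in enumerate(data):
        # Old bytes age out of the hash as they shift off the top 32 bits.
        rolling = ((rolling << 1) ^ byte) & 0xFFFFFFFF
        if i - start >= min_block and (rolling & mask) == 0:
            blocks.append(data[start:i + 1])
            start = i + 1
    if start < len(data):
        blocks.append(data[start:])
    return blocks

random.seed(0)
data = bytes(random.randrange(256) for _ in range(20000))
blocks = content_blocks(data)
print(len(blocks))               # many variable-sized blocks
assert b"".join(blocks) == data  # blocking is lossless
```

Inspecting every byte in a loop like this is exactly the kind of work where PyPy's JIT shines and CPython crawls.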
    Wednesday, August 12, 2015

    Latest Python sorted dictionary comparison


    I recently completed another sorted dictionary comparison, and thought I'd share the results.

    This time I've eliminated the different mixes of get/set.  It's all 95% set and 5% get now.

    Also, I added sorteddict, which proved to be an excellent performer.

    I also added standard deviation to the graph, and collapsible detail sections.
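For concreteness, the 95% set / 5% get workload looks roughly like this. It's an illustrative harness; the real comparison's key distribution, op counts, and timing details differ:

```python
import random
import time

def run_workload(mapping, ops=100_000, seed=42):
    """Time a 95% set / 5% get mix against any dict-like mapping."""
    rng = random.Random(seed)
    start = time.perf_counter()
    for _ in range(ops):
        key = rng.randrange(10_000)
        if rng.random() < 0.95:
            mapping[key] = key      # 95% of operations are sets
        else:
            mapping.get(key)        # 5% are gets
    return time.perf_counter() - start

# A plain dict as the baseline; any sorted mapping (e.g. sorteddict)
# supporting __setitem__ and get can be dropped in instead.
print(f"dict: {run_workload({}):.4f} seconds")
```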

    The latest comparison can be found here.

    HTH someone.

    fdupes and equivs3e


    I recently saw an announcement of fdupes on linuxtoday.

    Upon investigating it a bit, I noticed that it uses almost exactly the same algorithm as my equivs3e program.

    Both are intended to find duplicate files in a filesystem, quickly.
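The shared algorithm can be sketched like this. Both programs refine it further (for example with partial-hash passes), and this is not either program's actual code:

```python
# Group files by size first (cheap stat calls), then hash only the
# files whose sizes collide.  Illustrative sketch of the common
# fdupes/equivs3e approach.
import hashlib
import os
import tempfile
from collections import defaultdict

def find_duplicates(paths):
    by_size = defaultdict(list)
    for path in paths:
        by_size[os.path.getsize(path)].append(path)
    duplicate_groups = []
    for same_size in by_size.values():
        if len(same_size) < 2:
            continue  # a unique size can't be a duplicate
        by_hash = defaultdict(list)
        for path in same_size:
            with open(path, "rb") as handle:
                by_hash[hashlib.sha256(handle.read()).hexdigest()].append(path)
        duplicate_groups.extend(g for g in by_hash.values() if len(g) > 1)
    return duplicate_groups

# Tiny demonstration with three temporary files, two of them identical:
tmpdir = tempfile.mkdtemp()
paths = []
for name, content in (("a", b"same"), ("b", b"same"), ("c", b"other")):
    path = os.path.join(tmpdir, name)
    with open(path, "wb") as handle:
        handle.write(content)
    paths.append(path)
print(find_duplicates(paths))  # one group containing .../a and .../b
```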

    The main difference seems to be that fdupes is in C, and equivs3e is in Python.  Also, fdupes accepts a directory in argv (like tar), while equivs3e expects to have "find /directory -type f -print0" piped into it (like cpio).

    However, a quick performance comparison showed that fdupes is quite a bit faster on large collections of small files, while equivs3e is quite a bit faster on collections of large files.  I really don't know why the Python code sometimes outperforms the C code, given that they're so similar internally.

    I've added a "related work" section on my equivs3e page that compares equivs3e and fdupes.

    Anyway, I hope people find one or both of these programs useful.

    Tuesday, May 19, 2015


    Python compared to C++ - machine performance

    Someone posted to LinkedIn's Python Professionals forum with the subject "Why Python is so slow?" and said Python is 10x slower than C++ (on one specific microbenchmark).  You can guess what the rest of it was like.

    So:
    • even though machine efficiency rarely matters anymore
    • even though algorithm improvements are usually better than worrying about using the fastest language available (assembler anyone?  I used to love it, but not anymore)
    • even though microbenchmarks are poor indicators of overall performance
    • even though the innermost loop of a program is usually the only part that makes a difference (if any part at all) for performance
    ...I decided to go ahead and compare raw performance for almost the same problem, across 2 C++ compilers and a smattering of Pythons.

    First off, most of us probably know by now that Python != CPython anymore.

    The code I used for this comparison can be viewed using a web browser or checked out using Subversion at http://stromberg.dnsalias.org/svn/why-is-python-slow/trunk/ .

    BTW, the OP used range in Python 2.x; this should of course be xrange.  range is fine in 3.x, but in 2.x it's awful.
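Python 3's range behaves like Python 2's xrange - a lazy sequence rather than a materialized list. The difference is easy to see:

```python
import sys

# In Python 3, range() is lazy, like Python 2's xrange(); list(range())
# materializes everything, like Python 2's range().
lazy = range(10**6)
eager = list(range(10**6))

print(sys.getsizeof(lazy))    # a few dozen bytes, regardless of length
print(sys.getsizeof(eager))   # megabytes of list storage
assert sys.getsizeof(lazy) < 100 < sys.getsizeof(eager)
```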

    Anyway, here are the results.  Lower numbers are better:
    1.000000 ./g++-t
    0.974000 ./clang++-t
    4.993000 ./cython2-t
    5.981000 ./cython3-t
    4.341000 ./cython2_types_t
    4.917000 ./cython3_types_t
    4.850000 ./cpython2-t
    5.563000 ./cpython3-t
    5.567000 ./cpython2+numba-t
    5.491000 ./cpython3+numba-t
    1.957000 ./pypy-t
    1.979000 ./pypy3-t
    1.152000 ./shedskin-t


    Some interesting things to note (all relative to g++ -O3):
    • clang++ was slightly faster than g++ sometimes.  The two C++'s were practically the same.
    • Naive Cython was slower than CPython.
    • Typed Cython was faster than CPython.
    • CPython was around 4.9x and 5.6x slower on this particular microbenchmark, not 10x (though perhaps using the wrong range would have made it 10x; I didn't check that).
    • Numba was faster than CPython once and slower once - but not by a lot in either case.
    • PyPy was a touch less than 2x slower - and that's for pure Python code!
    • Shed Skin was only a hair slower than the pure C++ compilers.
    I hope that helps someone put things in machine-efficiency-context.

    We perhaps should also measure how long it takes to debug undefined memory references in the two languages - that's developer efficiency, and it's usually more important than machine efficiency.

    Saturday, September 6, 2014

    Fibonacci Heap implementation in Pure Python


    I've put a Pure Python Fibonacci Heap (priority queue) implementation at:

    1. https://pypi.python.org/pypi/fibonacci-heap-mod
    2. http://stromberg.dnsalias.org/~strombrg/fibonacci-heap-mod/
    It passes pylint and pep8, is thoroughly unit tested, and runs on CPython 2.[67], CPython 3.[01234], PyPy 2.3.1, PyPy3 2.3.1 and Jython 2.7b3.

    It's similar to the standard library's heapq module.  The main differences are that Fibonacci heaps have better big-O bounds for some operations, and that they support decrease-key without having to remove and re-add.
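To make the decrease-key difference concrete, here is the usual workaround you need with heapq - "lazy deletion", sketched below. A Fibonacci heap supports decrease-key directly, with no stale entries left in the structure:

```python
import heapq

# heapq has no decrease_key; the common workaround ("lazy deletion",
# sketched here) pushes a replacement entry and marks the old one stale.
heap = []
entries = {}  # task -> [priority, task, valid?]

def push(task, priority):
    entry = [priority, task, True]
    entries[task] = entry
    heapq.heappush(heap, entry)

def decrease_key(task, new_priority):
    entries[task][2] = False   # invalidate the old entry...
    push(task, new_priority)   # ...and re-add at the lower priority

def pop_min():
    while heap:
        priority, task, valid = heapq.heappop(heap)
        if valid:
            del entries[task]
            return task, priority
    raise IndexError("pop from an empty heap")

push("a", 5)
push("b", 3)
decrease_key("a", 1)
print(pop_min())  # ('a', 1)
```

The stale entries cost extra memory and pop-time work; that's part of what a real Fibonacci heap's O(1) amortized decrease-key buys you.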

    This Fibonacci heap implementation is also better abstracted than heapq.

    Friday, January 17, 2014

    Python dictionary-like trees with O(log2 n) find_min and find_max, and O(n) ordered traversal


    I've made some changes to the tree comparison I did in 2012.

    I've added a few more dictionary-like tree datastructures.

    I also changed the methodology: Instead of running 4 million operations for all datastructures, I've told it to run each test to 8 million ops, or 2 minutes, whichever comes first.  This allowed comparison of datastructures that perform well at one workload and poorly at another.

    I also told it to use both random and sequential workloads this time - some datastructures are good at one and poor at the other.

    The chief reasons to use a tree instead of a standard dictionary are:

    • You get find_min and find_max methods, which run in O(log2 n) time.  Standard dictionaries do this in O(n) time, which is much slower for large n.
    • You get ordered iteration in O(n) time.  Standard dictionaries do this in O(n log n) time, which is also much slower for large n.
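A stripped-down illustration of where those bounds come from (a toy unbalanced BST; the trees in the comparison are self-balancing, which is what keeps the height at O(log2 n) in the worst case):

```python
# Minimal unbalanced binary search tree, sketched to show why trees
# give O(log n) find_min and O(n) ordered traversal.
class Node:
    __slots__ = ("key", "value", "left", "right")
    def __init__(self, key, value):
        self.key, self.value = key, value
        self.left = self.right = None

class BSTDict:
    def __init__(self):
        self.root = None

    def __setitem__(self, key, value):
        if self.root is None:
            self.root = Node(key, value)
            return
        node = self.root
        while True:
            if key == node.key:
                node.value = value
                return
            side = "left" if key < node.key else "right"
            child = getattr(node, side)
            if child is None:
                setattr(node, side, Node(key, value))
                return
            node = child

    def find_min(self):
        # Walk the left spine: O(height), i.e. O(log n) when balanced.
        node = self.root
        while node.left is not None:
            node = node.left
        return node.key, node.value

    def items(self):
        # In-order traversal yields keys in sorted order in O(n).
        def walk(node):
            if node is not None:
                yield from walk(node.left)
                yield (node.key, node.value)
                yield from walk(node.right)
        return walk(self.root)

t = BSTDict()
for k in (5, 2, 8, 1, 9):
    t[k] = str(k)
print(t.find_min())               # (1, '1')
print([k for k, _ in t.items()])  # [1, 2, 5, 8, 9]
```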

    The latest version of the comparison is here, as a series of graphs and as an HTML click-to-expand hierarchy.  It's also linked from the URL at the top of this entry.