Automating The Tests (for buildbot)

An important trick -- perhaps the most important trick -- in our approach to agile testing, and continuous integration in particular, is automating the tests. Test automation lets you leverage everyone's test code with no extra effort. Unfortunately it also turns out to be a pretty tricky thing to do, technically! Here is a simple narrative (with links to the source!) showing what we did and how we did it.

(Yes, all of our tests are automated and run frequently via buildbot. If you don't believe us, visit our buildbot pages.)

General Automation Rules

One huge benefit of automating your tests and running them via buildbot is that you quickly figure out what parts of your app and which tests are environment, permission, or path-dependent. This becomes even more apparent when you try to automate tests that someone else has written using a test framework you've never used. (Fun, fun, fun!)

The three short rules we have to offer are:

  1. Don't depend on PATH. Each developer account may set it differently, and (more importantly) each machine your tests are running on may have it set differently. Especially Windows machines.
  2. Don't depend on sys.path or PYTHONPATH being set correctly.
  3. If you find yourself using a '/' at the front of a temporary filename or a path to a data file, stop yourself. Figure out how to write every filename that's part of your package as a filename relative to your root package directory.

The most commonly used path trick in our arsenal was the use of __file__ to figure out what your local directory path is; e.g. see bin/_mypath. Any command-line script in bin/ could 'import _mypath' and the lib/ directory would automatically be placed into sys.path; nifty!
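For the curious, here's roughly what such a module looks like -- a minimal sketch, not the actual contents of bin/_mypath:

    # _mypath.py -- put the package's lib/ directory onto sys.path,
    # no matter where the script is run from or how PATH is set.
    import os
    import sys

    # __file__ is the path to this module, so its directory is bin/ ...
    bindir = os.path.dirname(os.path.abspath(__file__))

    # ... and lib/ lives next to bin/, under the package root.
    libdir = os.path.abspath(os.path.join(bindir, '..', 'lib'))

    if libdir not in sys.path:
        sys.path.insert(0, libdir)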

Automating unit tests

OK, this is the easy one. Unit tests are supposed to be automated to begin with, right? Yes, but sometimes they're awfully slow, and then you should pick a subset that can be run quickly by developers while they're working. The slow tests can be run just prior to check-in, or (if you're appropriately lazy) in buildbot.

For MOS, we had a bunch of different unit tests all running under nose. The proper way to run them was either with the command line 'nosetests' or (better) 'python setup.py test'. Unfortunately as soon as we added the more heavyweight twill functional tests into the test suite, the tests started to take 1-2 minutes to run. That's waaaay too slow.

So what did we do?

First off, we hacked the unittest framework to display a simple duration (in seconds) for each test; see run-timed-unit-tests for the hack. This immediately told us which tests needed to be disabled for fast running.
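If you're curious what that kind of hack looks like, here's a minimal sketch of the idea; the real run-timed-unit-tests pokes at unittest itself, whereas this just walks a suite and times each test individually:

    import sys
    import time
    import unittest

    def run_timed(suite):
        """Run each test in the suite by itself and print its duration."""
        for test in suite:
            if isinstance(test, unittest.TestSuite):
                run_timed(test)                 # recurse into nested suites
            else:
                result = unittest.TestResult()
                start = time.time()
                test.run(result)
                print('%-60s %.2fs' % (test.id(), time.time() - start))

    if __name__ == '__main__':
        # pass test module names on the command line, e.g. tests.test_config
        suite = unittest.defaultTestLoader.loadTestsFromNames(sys.argv[1:])
        run_timed(suite)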

Now, all of our tests run through mostest.py. At the top of mostest.py, we stuck a boolean flag, "DEBUG_AVOID_SLOW"; any test that took more than a second or two imported mostest and simply bailed out if the flag was set to True.
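The pattern looks more or less like this (a sketch; the test name below is made up, and the real tests may bail out slightly differently):

    # at the top of mostest.py:
    DEBUG_AVOID_SLOW = True        # buildbot sets this to False for full runs

    # in a slow test module:
    import mostest

    def test_search_big_mailbox():          # hypothetical slow test
        if mostest.DEBUG_AVOID_SLOW:
            return                          # bail out during fast dev runs
        # ... the actual, slow test body goes here ...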

Then, in buildbot, we ran run-timed-unit-tests with the DEBUG_AVOID_SLOW flag set to False -- even if it took an extra 5 minutes, it didn't matter there.

mostest.py is worth looking at for some other reasons, too: it has hooks to enable code coverage analysis and it also has some simple code to display all of the relevant results.

Profiling

We didn't get much of a chance to work on profiling, because we were too busy setting up all of the tests. However, I'm very happy with what I've seen of statprof, a statistical profiler written by Andy Wingo. statprof uses the itimer signal to interrupt the main process and periodically sample the current stack; from this it assembles a statistical picture of where the program is spending its time.
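Using it is pleasantly simple; here's a minimal sketch based on statprof's start/stop/display interface (check its docs for the real details):

    import statprof

    def interesting_code():
        # stand-in for whatever you actually want to profile
        return sum(i * i for i in range(100000))

    statprof.start()              # begin sampling the stack via itimer
    try:
        interesting_code()
    finally:
        statprof.stop()           # stop sampling before reporting

    statprof.display()            # print the statistical call summary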

It turns out that profiling all the unit tests is a bit silly, because they do a lot of stuff that's not "normal". However, the twill functional tests (discussed below) did a decent job of loading in a bunch of e-mail and searching it. Look at profile-twill-tests to see how we used it.

Like run-timed-unit-tests, profile-twill-tests is primarily run via buildbot where we can go check out the results whenever we feel like it.

One caveat: profile-twill-tests periodically dies unexpectedly; it seems like some component of urllib (which is used deep inside the twill tests) doesn't properly handle itimer interrupts. Oh well.

Note that in Python 2.5, cProfile has been added. Woo-hoo!

An interesting medium-term idea is to do Web profiling with twill and wsgi_intercept: set up a bunch of functional or acceptance tests with twill, run them in-process with wsgi_intercept, and profile them. Then, minus the (probably nearly 0) overhead of twill etc., you can get an accurate idea of where in your code the tests are taking a long time. More on this anon.

Running twill sensibly

twill is a simple domain-specific scripting language layered on top of mechanize. Basically, it's a simple functional testing system that lets you write scripts to browse Web sites without user intervention. (You can check out our twill tests here.)
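To give you the flavor, here's an illustrative twill script (not one of the actual MOS tests -- the URL and form names are made up):

    # browse the front page and run a search, checking responses as we go
    go http://localhost:8080/
    code 200

    follow "search"
    formvalue 1 query "agile testing"
    submit
    code 200
    find "results"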

OK, you say, it's built for automation. How hard can it be to run it in a buildbot setup!?

Well, it's actually somewhat tricky to test Web sites in a completely automated script. We mostly used the wsgi_intercept trick to mock the network interface, thus avoiding actually setting up a server or binding a port; see test-via-wsgi.py for the code. This worked pretty well for most twill tests, although of course we also had a "hello, world" style test where we fired up HTTP in the normal way (see test-http-server.py for this code). Remember, somewhere you should check to make sure an HTTP port is actually bound!
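For reference, here's a rough sketch of the wsgi_intercept trick in a nose-style test module; the app factory below is a trivial stand-in for the real MOS application, so see test-via-wsgi.py for what we actually do:

    import twill

    def create_app():
        # placeholder WSGI app standing in for the real MOS application
        def app(environ, start_response):
            start_response('200 OK', [('Content-Type', 'text/html')])
            return [b'<html><body>hello, world</body></html>']
        return app

    def setup_module():
        # route twill's "network" traffic for this host/port straight into
        # the WSGI app, in-process: no server started, no port bound
        twill.add_wsgi_intercept('localhost', 8080, create_app)

    def teardown_module():
        twill.remove_wsgi_intercept('localhost', 8080)

    def test_front_page():
        twill.commands.go('http://localhost:8080/')
        twill.commands.code(200)
        twill.commands.find('hello')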

Automating Selenium

Selenium is a fantastically cool JavaScript test framework that lets you drive your browser through your Web site with relatively simple scripts. The tests are written entirely in HTML and can be "shipped" with the project; see our Selenium tests page for links to the various tests (e.g. MOSCommentary.html).

One big advantage that Selenium has over twill is that you can test a full JavaScript-y or AJAX-y Web interface with it, because it fully groks JavaScript; another great advantage to Selenium tests is that they run in multiple browsers, so you can test IE, Mozilla, Konqueror, and Safari compliance with one set of tests. Of course, a big disadvantage is that it requires a full browser to run, and those can be nasty to automate.

We did manage to automate the Selenium tests on our Linux buildslaves, and even though it involved big nasty packages (VNC to set up the X display, and an automatically-started Firefox browser) it was relatively easy.

You can see our hacked solution in run-selenium.py. Basically, we:

  1. start an X server in VNC and grab the display name;
  2. fork a child process and run firefox with a Magic URL that tells Firefox to start running the tests;
  3. run the MailOnnaStick Web server (with coverage analysis);
  4. at the end of the tests, Selenium tells Firefox to hit a particular URL with the results, and that URL then tells the MailOnnaStick Web server to exit with an appropriate error code;
  5. kill the X server.

Most of the trickiness is in #2 (running Firefox with the Magic URL) and #4 (doing clever things with the results URL).

The Magic URL turns out to be a known feature of Selenium: by passing specific parameters in the initial URL, Selenium will run and post its results to a particular location. That location can be pretty much anywhere; in this case, we built a page in MailOnnaStick that took the results and did things with them.
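To make that concrete, here's a rough orchestration sketch of the run-selenium.py dance; the display number, command lines, URL parameters, and server script name below are all illustrative rather than the real configuration:

    import os
    import signal
    import subprocess
    import sys

    DISPLAY = ':42'                              # a spare display number

    # 1. start an X server under VNC and export the display
    xvnc = subprocess.Popen(['Xvnc', DISPLAY])
    os.environ['DISPLAY'] = DISPLAY

    # 2. in a child process, give the server a few seconds to come up, then
    #    point Firefox at the Selenium TestRunner "magic URL"
    magic_url = ('http://localhost:8080/selenium/TestRunner.html'
                 '?auto=true&resultsUrl=/post_selenium_results')
    firefox = subprocess.Popen(['sh', '-c',
                                'sleep 5 && firefox "%s"' % magic_url])

    # 3. run the MailOnnaStick web server in the foreground; it exits with the
    #    right status once Selenium posts its results (step 4, below)
    status = subprocess.call([sys.executable, 'bin/run-mos-server'])  # hypothetical

    # 5. clean up the browser and the X server
    os.kill(firefox.pid, signal.SIGTERM)
    os.kill(xvnc.pid, signal.SIGTERM)

    sys.exit(status)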

This special results URL is implemented in test_pages.py, function post_selenium_results. Ignoring the billboard stuff -- more about that later -- this function takes in a bunch of form parameters, writes the passed-in result pages into the var/tmp directory, examines the report for success/failure, and exits with an appropriate error code. Simple, eh?
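Stripped of the billboard bits, the shape of it is something like this -- a framework-neutral sketch; the actual form field names Selenium posts, and the exact exit logic, live in test_pages.py:

    import os
    import sys

    def post_selenium_results(result, **form):
        # keep the raw Selenium report around so failures can be inspected later
        outdir = os.path.join('var', 'tmp')
        if not os.path.isdir(outdir):
            os.makedirs(outdir)
        report = open(os.path.join(outdir, 'selenium-report.html'), 'w')
        report.write(form.get('suite', ''))
        report.close()

        # Selenium tells us whether the run passed or failed...
        ok = (result == 'passed')

        # ...and we turn that into an exit code buildbot can understand
        if ok:
            sys.exit(0)
        else:
            sys.exit(1)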

OK, it took some time to get it to work, but it sure was fun when it finally worked!

A side benefit of writing out the results pages is that when things fail, you can figure out exactly what test failed. This is because Selenium writes out the results pages with the proper annotations, so it's easy to figure out which tests passed and which tests failed.

(Incidentally, one cute future hack might be to use the VNC-to-Flash recorder (vnc2swf) to record each Selenium session. hmm.)

While invaluable, Selenium did turn out to be more annoying than we thought it would. First of all, writing the tests was not particularly fun; they're in a kind of hacky HTML format which grows old quite quickly. Secondly, the tests were (and are) a bit brittle in the face of slow computers. (Asynchronous AJAX stuff sounds great right up until your tests fail because the call didn't return in time. We've since fixed that in changeset:313 by switching to waitForCondition.)

Our experience with Selenium suggests that twill is a better way to test your basic Web app; leave Selenium for the stuff that can't be done any other way, e.g. your AJAX user interface code.

One other mildly nifty thing we did with twill and Selenium: we used twill to do the complicated mailbox setup for the Selenium tests. Because filling out forms in twill is pretty simple, this turned out to be more efficient than filling out the forms in Selenium. Go check out our Selenium test setup page to see where we used mosweb.twillscripts to call the load-test-mboxes twill script for our environment setup.

TextTest

TextTest is an acceptance-testing or regression-testing framework -- we're not entirely sure how to classify it -- that lets you record logging output and compare it to a "gold standard".

Installing texttest turned out to be a bit tricky, for several reasons. Unlike most Python software, texttest comes as a heap o' Python files, rather than as a Python package; there's also no setup.py, so it's not as easy as 'python setup.py install'. In the end, you need to:

  1. Pick a place to install the source;
  2. Pick a place to install the tests (this should be writeable by the person who wants to test the texttest install);
  3. Run install.py;
  4. Put the source install directory into your PATH, and set $TEXTTEST_HOME to the location of the tests.
  5. Run 'texttest' to test itself. (In a non-intuitive move, there should be differences -- read the manual to see why. You're supposed to run the tests to generate a gold standard for your machine, and then (I guess) run the tests again to compare against the gold standard you just created. However, as I found out by failing to have /usr/bin/time installed, empty always equals empty, so this isn't always the most useful of tests...)

Note that $TEXTTEST_HOME is going to change when you start writing your own tests -- that's what points texttest itself at the tests!

So, how does texttest work? Well, you set up a bunch of shell commands for it to run, tell it what to record and compare, and then run it to generate your gold standard. After that (until it's time to generate a new gold standard), you run texttest periodically (before each check-in, or via buildbot) and it compares the gold standard to the current output. (You can think of it as a set of fixtures for running diff commands systematically, if you like.)

You can see our test fixtures in a few places: first, the test-* files in bin/ are what are actually run by our texttest tests, which in turn live in tests/texttest. (You can run these tests yourself by setting $TEXTTEST_HOME to that location, if you download MOS.) The magic that ties the texttest tests into the MOS code itself is in config.mos -- yep, that line there at the top, 'binary=...'.

Running texttest via buildbot was easy, once we sorted out what all the paths were. See run-texttest for the script that sets everything appropriately and runs it.
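The gist of it is just environment setup; a rough sketch (the texttest source location below is an assumption, and the real run-texttest may differ):

    import os
    import subprocess
    import sys

    topdir = os.path.dirname(os.path.abspath(__file__))

    # point texttest at our tests...
    os.environ['TEXTTEST_HOME'] = os.path.join(topdir, 'tests', 'texttest')

    # ...and make sure the texttest source install is on PATH
    os.environ['PATH'] = '/opt/texttest' + os.pathsep + os.environ['PATH']

    sys.exit(subprocess.call(['texttest']))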

Our end comment on TextTest was that it's not well suited to the way we develop, or at least not yet. Apart from the various installation and execution difficulties, we had to regenerate the gold standard log files on a regular basis -- overall, it was unclear how this was supposed to catch unexpected errors, because we never really looked seriously at the diffs; perhaps we just have attention deficit disorder. Nonetheless, it may be a useful tool in some situations; Magnus Lycka has a good post here. In particular, as we build more code, having good regression tests will almost certainly become more useful!

FitNesse

FitNesse is a "customer-facing" acceptance test framework that lets customers interact with test fixtures via a Wiki. Essentially, users write in the expected results for tests (via the Wiki) and then run the tests automagically via the Web interface.

Our test fixtures are under tests/MOS_FIT, and the Wiki pages are under tests/MOS_FIT/MosAcceptanceTests. (To look at and/or edit the tests, you'll need to install FitNesse and link in the MosAcceptanceTests directory under the Wiki directory.)

Now, FitNesse is written in Java, and it has to talk to Python somehow. It does this via PyFit (more about that here), so you need to install PyFit as well. Once PyFit is installed, you can run the Python tests in two ways: manually, or scripted.

When you run the tests manually, you tell FitNesse to run a specific Python server program that "feeds" the FitNesse client requests being sent from the Wiki. This requires either setting up the paths etc. in the Wiki itself (with special config commands) or making sure that your PYTHONPATH is set properly prior to running the FitNesse Wiki code.

When you run the tests in scripted mode, you can use the PyFit TestRunner to pull the tests from the FitNesse client; this means you don't actually need to manually interact with any part of the FitNesse Web site, which is a boon for buildbot automation. TestRunner can also be executed from within Python, which means that you can do all your path setup etc. without resorting to system() calls.

Our buildbot script for automating all of this is run-fitnesse.py, and it does the following:

  1. links the tests into the FitNesse install;
  2. fixes the various paths;
  3. picks a random port to run the FitNesse wiki on;
  4. forks & runs FitNesse in the child pid;
  5. executes the PyFit TestRunner code.

TestRunner then contacts FitNesse, pulls down the tests, runs them via the MOS_FIT test fixtures, and records the results. Not so simple, but it all works ;).
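If you're curious what that orchestration looks like in practice, here's a rough sketch; every path, port range, and command line below is an assumption (and the real run-fitnesse.py calls the PyFit TestRunner in-process rather than shelling out):

    import os
    import random
    import signal
    import subprocess
    import sys
    import time

    FITNESSE_DIR = '/opt/fitnesse'                 # assumed install location

    # 1. link our wiki pages into the FitNesse install
    link = os.path.join(FITNESSE_DIR, 'FitNesseRoot', 'MosAcceptanceTests')
    if not os.path.exists(link):
        os.symlink(os.path.abspath('tests/MOS_FIT/MosAcceptanceTests'), link)

    # 2. make sure the MOS_FIT fixtures are importable by the PyFit code
    os.environ['PYTHONPATH'] = os.path.abspath('tests/MOS_FIT')

    # 3./4. pick a random port and start FitNesse in a child process
    port = random.randint(20000, 30000)
    fitnesse = subprocess.Popen(['java', '-jar', 'fitnesse.jar', '-p', str(port)],
                                cwd=FITNESSE_DIR)
    time.sleep(15)                                 # give FitNesse time to come up

    # 5. pull the tests down and run them via the PyFit TestRunner
    status = subprocess.call([sys.executable, 'TestRunner.py',
                              'localhost', str(port), 'MosAcceptanceTests'])

    os.kill(fitnesse.pid, signal.SIGTERM)
    sys.exit(status)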

egg creation and installation

Python eggs are a nifty new way to package Python programs; they're basically created completely out of the information in setup.py. One of our goals for MailOnnaStick is to (eventually) distribute it as a completely standalone egg, so we wanted to test egg production & installation.

Now, there's no really good way to test your eggs without actually installing the application somewhere. So, we hacked together some scripts to build a Python egg and install it in a private subdirectory. Check out build_egg and install_egg for the dirt.

There are one or two hacks here worth mentioning.

First off, since we wanted to test this with both Python 2.3 and Python 2.4, we had to figure out how to run easy_install (to install the egg) without actually using the version-specific easy_install script installed by setuptools. Our solution? Write our own Python-version-agnostic easy_install.py and drop it into the buildbot scripts directory. (Yes, that's why it's in there.)

Second, setuptools has its own rules regarding package installation. We found it necessary to put the egg install directory into sys.path/PYTHONPATH prior to installing the egg; otherwise, easy_install would be upset by its inability to find the package it had just installed. Kinda schizophrenic, huh?
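To tie those two hacks together, here's a sketch of both pieces; the directory names are made up, and the real build_egg/install_egg scripts may differ. First, the version-agnostic wrapper:

    # easy_install.py -- thin, version-agnostic wrapper around setuptools;
    # run it with whichever python you're testing against.
    import sys
    from setuptools.command.easy_install import main

    if __name__ == '__main__':
        sys.exit(main(sys.argv[1:]))

and then the build-and-install dance that uses it:

    import glob
    import os
    import subprocess
    import sys

    private = os.path.abspath('tmp/egg-install')      # throwaway install dir
    if not os.path.isdir(private):
        os.makedirs(private)

    # build the egg with whichever python we're testing against
    subprocess.check_call([sys.executable, 'setup.py', 'bdist_egg'])
    egg = glob.glob('dist/*.egg')[0]

    # the install dir has to be importable *before* easy_install runs,
    # or it complains that it can't find the package it just installed
    os.environ['PYTHONPATH'] = private

    subprocess.check_call([sys.executable, 'easy_install.py',
                           '--install-dir', private, egg])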

The billboard stuff

Peppered throughout the code you may see references to billboard -- in particular, in the Selenium test setup stuff and in the twill tests. This is basically a quick hack to set up something like DART's reporting infrastructure, but without all of the Java complications. (See "Record Keeping", below, for some justification for this.) The overall goal was to set up something that would retain e.g. raw coverage data in such a way that we could write custom reporting scripts, but we haven't written any such scripts yet ;(.

If we move forward with billboard, we'll talk more about it here. We may just switch to doing clever things with Trac instead, because the Trac wiki is a good place to post test summaries & twill can easily handle the mechanics of posting. We'll see.

General annoyances

When the tests worked, it was great. When tests failed, quite often we could figure out exactly what was broken by referring to the name of the test -- "oh, yeah, I just checked in the configuration stuff, so it makes sense that test_config might break". However, there were three problems that cropped up regularly.

PYTHONPATH and sys.path modifications

This is an obvious one: Python packages were installed in various places, and they needed access to various modules, and we had to provide it to them. PyFit, texttest, and twill all had their own picky path requirements. There's no general way to deal with this, of course, but since we had no structured way of handling path modifications (shrug), it was probably more painful than it had to be.

stdout and stderr vs logging

Retrieving stdout and stderr was annoying. Most test frameworks assume you don't want to see lots of output, so they capture stdout/stderr and dump it somewhere. Sometimes your test code traps output too; for example, we really didn't want to see CherryPy server output in the middle of our tests, so we trapped that. All this trapping works out really well -- you get nice, pretty, minimalistic output -- right up until the first complex test failure. Then we spent a lot of time scrambling around the code looking for the output to figure out what went wrong where.

The solution turned out to be to log it. Log it all, and then sort it out later.

With the logging functions, you always knew where the output was going -- into a log. No capture framework could take that away from you, and it was always there -- invisible until needed. Logs turn out to be much better for debugging than stdout/stderr stuff, which was often trapped in odd or inconsistent ways.
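In case it's not obvious what we mean, here's a minimal sketch of the "log it all" approach with the stdlib logging module (the filename and logger name are just examples, not our actual setup):

    import logging

    # send everything, at every level, to one well-known file
    logging.basicConfig(filename='var/tmp/mos-tests.log',
                        level=logging.DEBUG,
                        format='%(asctime)s %(name)s %(levelname)s %(message)s')

    log = logging.getLogger('mos.tests')
    log.info('starting the twill functional tests')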

Record keeping

The third problem was related to the logging issue: record keeping. When tests failed, you wanted to be able to figure out where; but sometimes tests failed inconsistently, and the logs etc. from the last buildbot run weren't kept around. (This was especially a problem with the profiling tests and Selenium, which tended to fail every so often.) Selenium, TextTest, and FitNesse would spew out tons of reports, often in a convenient HTML form -- and we wanted that information.

(Ironically this was a "problem" only because we had a continuous integration framework running. We wouldn't have even seen the intermittent errors as a real problem if we'd just been running the tests by hand.)

Our solution was to keep a semi-permanent record of everything by outputting it into a tmp directory and then time-stamping and storing the tmp directory in a Web-accessible location at the end of each buildbot run. This also rendered the results amenable to a scripted analysis, which could be convenient in the future.
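The mechanics are nothing fancy; a small sketch, assuming the results live in var/tmp and the web-accessible archive area is /var/www/test-archives (both names are assumptions):

    import os
    import shutil
    import time

    # time-stamp this run and copy the whole results directory under the
    # web-accessible archive area
    stamp = time.strftime('%Y-%m-%d-%H%M%S')
    shutil.copytree('var/tmp', os.path.join('/var/www/test-archives', stamp))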

Comments and Feedback

We're happy to receive your feedback on these tips & tricks; just drop either of us a line. You can also post a comment on Grig's blog entry.