pygr + Google Summer of Code project page, GSoC 2008


pygr is a Python-based graph database for bioinformatics. It is a general toolkit for storing and retrieving biological sequence relationships and annotations. Unlike other Python+biology toolkits (BioPython, corebio), pygr provides a high-level abstraction layer for working with these objects, which makes it particularly suitable for genome-scale analysis (in our opinion :).

In general, Python seems to be underrepresented in bioinformatics compared to other research fields; this is partly because Perl and R are dominant, but it is still startling given the depth of Python projects available for e.g. numerical work. pygr itself is being primarily used by a small subset of labs at UCLA, Caltech, Michigan State, and in Korea.

The aim of these projects for GSoC '08 is to help bring pygr to a wider community by increasing the base quality level of the codebase, improving new user and expert user documentation, providing more and better pre-fab data sets, and adding more generally useful features.

Possible Mentors

  • Chris Lee, UCLA
  • Marek Szuba, UCLA
  • Titus Brown, Caltech/MSU
  • Jenny Qian, BCGSC

Subproject ideas

Also see

pygr codebase maintenance

A major consideration for me (Titus) is that I don't have a sense of what computation & retrieval demands are placed on which classes. So, when building an NLMSA on a small genome, I tend not to worry about how the features are stored because it's fast; but then when I scale up to a big genome, there's a slowdown due to e.g. unpickling. Examples and information on these issues need to be available somewhere, preferably in the code and in the docs.

Interface expectations between classes seem undocumented; build some simple in-memory implementations, or stub implementations, that "pass" the interface requirements. And/or specify interfaces in some other way.
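As a rough illustration of the stub idea, a dict-backed stand-in plus a small conformance check could look like the sketch below. Everything in it is hypothetical: `REQUIRED_METHODS`, `InMemorySeqDB`, and `missing_methods` are illustration names, and the method list is an assumption rather than pygr's actual interface.

```python
# Hypothetical sketch: none of these names come from pygr itself, and the
# list of required methods is an assumed example, not a documented contract.
REQUIRED_METHODS = ("__getitem__", "__len__", "__iter__", "keys")


class InMemorySeqDB:
    """Smallest dict-backed stand-in for a sequence database."""

    def __init__(self, sequences):
        self._seqs = dict(sequences)

    def __getitem__(self, seq_id):
        return self._seqs[seq_id]

    def __len__(self):
        return len(self._seqs)

    def __iter__(self):
        return iter(self._seqs)

    def keys(self):
        return list(self._seqs)


def missing_methods(obj, required=REQUIRED_METHODS):
    """Return the names in `required` that obj does not provide as callables."""
    return [name for name in required
            if not callable(getattr(obj, name, None))]
```

A test suite could then assert that `missing_methods(candidate)` is empty for every class that claims to implement the interface, with the stub serving as the reference that trivially passes.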

pygr contains large swathes of completely uncommented code, both Python and Pyrex/C. This code should be gone through and automated tests built.

Similarly, some of the code is pretty evolved and could use some systematic refactoring, post-test-addition.

Benchmark and profile a few of the large-data-set tests?
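A minimal starting point for profiling, using only the standard library's `cProfile` and `pstats`, might look like this. `profile_call` and `build_index` are hypothetical names; `build_index` is a placeholder workload, not a pygr test.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, **kwargs):
    """Run func under cProfile; return (result, text report of top entries)."""
    profiler = cProfile.Profile()
    result = profiler.runcall(func, *args, **kwargs)
    buf = io.StringIO()
    stats = pstats.Stats(profiler, stream=buf)
    # Sort by cumulative time and keep the report to the ten costliest entries.
    stats.sort_stats("cumulative").print_stats(10)
    return result, buf.getvalue()


def build_index(n):
    # Placeholder workload standing in for a large-data-set test.
    return {i: str(i) for i in range(n)}
```

Calling `profile_call(build_index, 100000)` returns the built dict together with a printable report; swapping in a real large-data-set test would show where time goes (e.g. unpickling).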

pygr.Data additions

"download=True" option to download data sets to local machine if not present (item #12 on mailing list posting, above)

Better dependency detection/hooks for running custom "unpickling" function for installation/detection of dependencies

pygr.Data server Web interface to list available resources

Object deletion schema (?):

If obj has edge relations to other objects managed under the pygr.Data schema, what should happen when that object is deleted? These become data integrity rules in the schema: e.g. just delete its edges; or delete objects that it has a certain relation to (e.g. its "child

Fix pickle (in)security. Signed pickles, e.g. with GPG? TrustedPickle? Prototype? See item #13 on mailing list posting, above.
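To make the signed-pickle idea concrete, here is a minimal sketch that uses an HMAC with a shared secret key from the standard library. This is a swapped-in technique, not GPG or TrustedPickle, and `dumps_signed` / `loads_signed` are hypothetical names.

```python
import hashlib
import hmac
import pickle

DIGEST_SIZE = hashlib.sha256().digest_size  # 32 bytes for SHA-256


def dumps_signed(obj, key):
    """Pickle obj and prepend an HMAC-SHA256 tag computed with key (bytes)."""
    payload = pickle.dumps(obj)
    tag = hmac.new(key, payload, hashlib.sha256).digest()
    return tag + payload


def loads_signed(blob, key):
    """Verify the tag before unpickling; raise ValueError on mismatch."""
    tag, payload = blob[:DIGEST_SIZE], blob[DIGEST_SIZE:]
    expected = hmac.new(key, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        raise ValueError("signature mismatch; refusing to unpickle")
    return pickle.loads(payload)
```

Note the limits of this approach: it only shows that the pickle was produced by someone holding the key. A shared secret suits a single lab; public distribution would need asymmetric signatures such as GPG, as the item suggests.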

"dry run" query: ability to see how pygr.Data would fulfill a specific request, without actually getting it

Documentation, user help, and community building

The Developer Forum has *tons* of examples. Many of these could be written up as text-file doctests or otherwise made "executable", and then run as part of the continuous integration, standard tests, etc.
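As a small illustration of making an example "executable", the sketch below runs a doctest embedded in plain text using the standard `doctest` module. The example text, `revcomp`, and `run_text_example` are hypothetical stand-ins; a real forum example would exercise pygr objects instead.

```python
import doctest

# Stand-in for a forum example that has been written up as text.
EXAMPLE = '''
Reverse-complementing a short DNA string:

>>> revcomp("AACG")
'CGTT'
'''


def revcomp(seq):
    """Toy helper so the example has something to call."""
    pairs = {"A": "T", "C": "G", "G": "C", "T": "A"}
    return "".join(pairs[base] for base in reversed(seq))


def run_text_example(text, namespace):
    """Run the doctest examples found in text; return (failed, attempted)."""
    parser = doctest.DocTestParser()
    test = parser.get_doctest(text, dict(namespace), "forum_example", None, 0)
    runner = doctest.DocTestRunner(verbose=False)
    runner.run(test)
    return runner.summarize(verbose=False)
```

For examples stored as separate text files, `doctest.testfile` does the same job directly, which makes it straightforward to hook them into a regular test run.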

Installation etc.

Make easy_install-able, build binary eggs for Windows, Mac OS X

Go through and test Windows installer

Add packages/make packages for Debian, Fink, Red Hat


From start to finish:

  • Microbial genome annotations
  • Importing WormBase annotations
  • Importing UCSC, Ensembl (much work already done)
  • Using SQL db as a backend for the above

Data sets

Post, maintain UCSC alignment data sets

Post leelab databases

Ensembl annotation databases

Expansion of feature set

Build tools & libraries to wrap/import a variety of alignment types and larger data sets, e.g. CLUSTALW, blastz, LAGAN, ENSEMBL, etc. (#10 & 11 on mailing list post, above)

Serve BLAST/MegaBLAST via XML-RPC (#2 on mailing list posting, above)

Serve AnnotationDB via XML-RPC (#1 on mailing list posting, above)
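For a sense of what serving annotations over XML-RPC involves, here is a minimal sketch built on the standard library's `xmlrpc.server`. `AnnotationService` and `make_server` are hypothetical names, and the service wraps a plain dict rather than a real pygr AnnotationDB.

```python
from xmlrpc.server import SimpleXMLRPCServer


class AnnotationService:
    """Hypothetical read-only wrapper exposing a dict of annotations."""

    def __init__(self, annotations):
        self._annotations = annotations

    def list_ids(self):
        return sorted(self._annotations)

    def get(self, annot_id):
        # Return plain dicts: XML-RPC can only marshal basic types.
        return dict(self._annotations[annot_id])


def make_server(service, host="127.0.0.1", port=0):
    """Create (but do not start) a server; port=0 picks a free port."""
    server = SimpleXMLRPCServer((host, port), logRequests=False,
                                allow_none=True)
    # Public methods of the instance become remotely callable.
    server.register_instance(service)
    return server
```

A caller would run `server.serve_forever()` (typically in its own thread or process) and connect with `xmlrpc.client.ServerProxy`. The main design question for a real implementation is how to flatten annotation objects into the basic types XML-RPC can carry.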

Fast NLMSA joins (#4, #6)

Fast result filtering (#5)

Store BLAST edge info (#7)

Nucleotide-to-AA annotation and separate coordinates (#8)

TBLASTN and BLASTX support (#9)