Blog

Thomas Wanschik on February 01, 2011

JOINs for NoSQL databases via django-dbindexer - First steps

At the moment, we have a lot to do in terms of finishing our diploma thesis. However we're so excited about the early results of the currently refactored django-dbindexer that we couldn't hold back and keep django-dbindexer's simple JOIN support for non-relational databases a secret anymore. Yes, you didn't read wrongly, this post is about JOIN support for NoSQL databases! We'll show how to use in-memory JOINs and how to get JOINs working if the joined field's value doesn't change. So let's unpack our delayed Christmas present. ;)

Let's rock!

Let's take the example of a photo-user relationship, just like in our post JOINs via denormalization for NoSQL coders, Part 1:

# photo/models.py:

from django.contrib.auth.models import User
from django.db import models

class Photo(models.Model):
    owner = models.ForeignKey(User, null=True, blank=True)
    popularity = models.IntegerField(choices=[(0, 'low'), (1, 'medium'), (2, 'high')],
                                     default=0)
    published = models.DateField(auto_now_add=True)
    title = models.CharField(max_length=500)

Here, each photo entity is linked to a user via a many-to-one relationship and has a popularity tag assigned to it. Additionally we store the date when the photo has been published and give the photo a title. Now let's say you want to get all of a user's photos. In order to do so, NoSQL databases force us to use ugly workarounds. One possible way is to first get the user (or better the corresponding primary key in order to avoid fetching the whole user entity out of the database) and thereafter the user's photos:

def get_user_photos(first_name, last_name):
    user_pk = User.objects.values('pk').get(first_name=first_name, last_name=last_name)
    return Photo.objects.filter(owner__pk=user_pk)

However with our reborn baby, django-dbindexer, you can just use Django's spanning relationship syntax (double underscores) to which you are used to:

def get_user_photos(first_name, last_name):
    return Photo.objects.filter(owner__first_name=first_name, owner__last_name=last_name)

Feels a lot more comfortable, right? :) To get this working all you have to do is to install the django-dbindexer and add the following index definition:

# photo/dbindexes.py:

from models import Photo
from dbindexer.lookups import StandardLookup
from dbindexer.api import register_index

register_index(Photo, {'owner__first_name': StandardLookup(),
                       'owner__last_name': StandardLookup(),
})

and one of the new join resolvers (InMemoryJOINResolver or ConstantFieldJOINResolver) to DBINDEXER_BACKENDS in your settings:

# settings.py:

DBINDEXER_BACKENDS = (
    'dbindexer.backends.BaseResolver',
    'dbindexer.backends.FKNullFix',
    'dbindexer.backends.InMemoryJOINResolver',
)

That's it. django-dbindexer will handle anything else magically for you. :) Under the hook the refactored django-dbindexer now uses backends to resolve lookups. The BaseResolver is responsible for resolving lookups like __iexact or __regex for example and the FKNullFix backend will make __isnull queries work correctly on ForeignKey. In this example we choose the InMemoryJOINResolver which is used to resolve JOINs in-memory, just like we did manually in the example above. On the other hand, the ConstantFieldJOINResolver would denormalize the user's first_name and last_name into the Photo model and use these denormalized properties to execute the query. Anything described in JOINs via denormalization for NoSQL coders, Part 1 is then done automatically by the ConstantFieldJOINResolver for you. :)

JOINs aren't limited to simple examples like the one above. Of course you can combine a query with filters on the Photo model:

def get_popular_user_photos(first_name, last_name):
    return Photo.objects.filter(popularity=2, owner__first_name=first_name,
                                owner__last_name=last_name)

You can even span filters over multiple relationships if you have more complex models, just use multiple double underscores. If the photo model contains a ForeignKey to a group which itself contains a ForeignKey to a creator, a query getting all photos whose group creator is a specific user would look like this:

def get_creators_groups_photos(first_name, last_name):
    return Photo.objects.filter(group__creator__first_name=first_name,
                                group__creator__last_name=last_name)

Note that this query can return photos belonging to more than just one group because a user could have created multiple groups.

And as a nice side effect of the refactoring process, you can combine JOINs with other index definitions like an __iexact lookup on the user's first_name and last_name in combination with a __month lookup on published:

def get_user_photos_created_in_month(first_name, last_name, month):
    return Photo.objects.filter(popularity=2, published__month=month,
                                owner__first_name__iexact=first_name,
                                owner__last_name__iexact=last_name)

Just add the needed index definitions to your index module:

# photo/dbindexes.py:

from models import Photo
from dbindexer.lookups import StandardLookup
from dbindexer.api import register_index

register_index(Photo, {'owner__first_name': (StandardLookup(), 'iexact'),
                       'owner__last_name': (StandardLookup(), 'iexact'),
                       'published': 'month',
})

It couldn't be easier than that, right? :) First we hardly believed our own eyes when we tested JOINs on App Engine on production but it really works. :P Just give it a try and download the django-dbindexer-testapp. :) If you wanna see even more complex examples then take a look at the unit tests of django-dbindexer.

InMemoryJOINResolver or ConstantFieldJOINResolver?

At the moment you have to choose globally between either the InMemoryJOINResolver or the ConstantFieldJOINResolver in you settings. All queries involving JOINs will then use the join resolver chosen. This will be changed in a future release of django-dbindexer. For now, you may wonder which join resolver to use. In general, in-memory JOINs are useful for simple cases for which you know that you get only a small set of entities on the joined side i.e. the User side. This includes examples like the one above as well as one-to-one relationships. If you know that you have to deal with larger sets of entities on the joined side you have to use the ConstantFieldJOINResolver because in-memory JOINs don't scale in such cases.

Currently you can't combine in-memory JOINs with OR-queries and exclude-queries. Anything else should work. In-memory JOINs will try to query as efficiently as possible. On App Engine for example dbindexer will batch-get and use efficient values('pk') filters (which become a keys_only filter) where possible. Additionally, fields on the joined side won't be created for StandardLookup.

The ConstantFieldJOINResolver is more efficient than the InMemoryJOINResolver and doesn't have limitations neither for OR-queries nor for exclude-queries i.e. you can execute the following query for example:

return Photo.objects.exclude(Q(owner__first_name__iexact='Danzo') |
                             Q(owner__last_name__iexact='Shimura'))

Additionally it is useful for more than one-to-one-relations. However, at the moment it only works for constant fields i.e. if the user's first_name and the last_name doesn't change. In the future we will add support for non-constant values too.

For both resolvers, JOINs only work in the ForeignKey direction. No reverse lookups supported at the moment. :(

Expressive NoSQL - Beyond scalability

So the first steps towards JOIN support for NoSQL databases are done. We plan to extend the backends to handle JOINs for non-constant fields in a scaleable way, just like described in our JOINs via denormalization for NoSQL coders series. Additionally, django-dbindexer's API will be extended to allow to specify a join resolver on a per field/filter bases. With the refactored django-dbindexer and its backend system even support for aggregates becomes possible. This will allow you to write complex queries in a few minutes instead of hours (including unit tests and debugging ;). No more hand-written denormalization and map/reduce and aggregate code. Just tell the indexer what you want to do and it'll handle it for you in a way you specify (via a backend)! Developers still have to think about how to design their code. However, django-dbindexer allows them to use Django's expressive ORM and thus being much more productive. Another possible extension is a backend for nonrel-search which could be used in order to add full-text search functionality for the admin interface. :) It's hard to think of something that would not be possible! :P If you want to help with the next exciting phase of development you can drop us a mail: http://groups.google.com/group/django-non-relational