Waldemar Kornewald on February 22, 2010

4 things to know for NoSQL Django coders

Update 2: Take a look at django-dbindexer. Since its release you can use "__month=" queries, so you no longer need the date range workaround described in this post. django-dbindexer also adds support for simple JOINs.

Update: MongoDB backend is now available too :)

This is the first post in a series that should give you an impression of what non-relational / NoSQL model code looks like with django-nonrel. As mentioned in the previous post, you can see django-nonrel in action on our new website (we use it ourselves in the spirit of dogfooding).

While everything discussed here should work on all nonrel DBs, we currently only have an App Engine backend; a MongoDB backend will follow soon (more on that once it's finished). If you want to help with other backends (Redis, SimpleDB, CouchDB, etc.), please join the discussion group.

We'll dive into the source of our website, which contains a very simple "CMS" and a blog app that can host multiple independent blogs. It runs the admin interface unmodified, but with some limitations. Overall, the code is surprisingly similar to normal Django code, but you'll also find that nonrel-compatible models require a different way of thinking. What does it take to write a website that is nonrel-compatible? Let's get rolling.

Setting up the environment

You need Python 2.5 or 2.6. Just download and unpack allbuttonspressed.zip and adjust settings.py. This package contains Django nonrel and all other dependencies except for the App Engine SDK. If you want to use the latest repository code you can alternatively take a look at the manual installation instructions. That page also describes how to install the latest App Engine SDK on different operating systems.

You can now simply use manage.py syncdb, manage.py createsuperuser, and manage.py runserver as usual. Finally, you can deploy the site to App Engine via manage.py deploy. This will automatically run syncdb on the production server. If you need to execute a command on the production server, prefix it with "remote". For example: "manage.py remote shell". This only works once you've deployed your website, though.

Now, let's dive into the actual code and experience the differences.

JOINs are evil

As our first app we wrote a minimal CMS. Now you might ask, why can't we just use django.contrib.flatpages?

Obvious JOINs ...

Unfortunately, flatpages has a many-to-many relation to Site which means we need JOINs (any query that spans model relations via "__" needs JOINs). Nonrel DBs normally only allow you to specify simple ANDed filter rules on a single model ("table"). We could do an in-memory JOIN by running multiple queries and manually combining the results, but this would make our code unnecessarily complex. I'll talk about alternative solutions later. Since we only have one domain we don't need Site, so let's make our own flatpages alternative.
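To make the cost of such an in-memory JOIN concrete, here is a minimal sketch in plain Python. It uses lists of dicts in place of real querysets; the sample data and the filter_rows and pages_for_domain helpers are hypothetical and not part of django-nonrel or flatpages.

```python
# Hypothetical stand-ins for three "tables"; a real app would query its backend.
SITES = [{'id': 1, 'domain': 'example.com'}, {'id': 2, 'domain': 'example.org'}]
PAGES = [{'id': 10, 'url': '/about/'}, {'id': 11, 'url': '/contact/'}]
# The many-to-many relation lives in its own link "table".
PAGE_SITES = [{'page_id': 10, 'site_id': 1}, {'page_id': 11, 'site_id': 2},
              {'page_id': 11, 'site_id': 1}]


def filter_rows(rows, **conditions):
    """Emulate a simple ANDed filter on a single table."""
    return [row for row in rows
            if all(row[key] == value for key, value in conditions.items())]


def pages_for_domain(domain):
    """Emulate Page.objects.filter(sites__domain=domain) without a JOIN."""
    # Query 1: find the site.
    sites = filter_rows(SITES, domain=domain)
    if not sites:
        return []
    # Query 2: find the link rows for that site.
    links = filter_rows(PAGE_SITES, site_id=sites[0]['id'])
    page_ids = set(link['page_id'] for link in links)
    # Query 3: fetch the pages and combine the results in memory.
    return [page for page in PAGES if page['id'] in page_ids]
```

One lookup turns into three queries plus glue code, which is the complexity we wanted to avoid for a site with a single domain.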

... and not so obvious JOINs

We just copied the flatpages middleware and started with a fresh model (note, the code is simplified to make the important parts easier to spot):

# minicms/models.py:

from django.db import models

# Define an abstract base class which can be reused for our blog posts
class BaseContent(models.Model):
    title = models.CharField(max_length=200)
    content = models.TextField(blank=True)

    class Meta:
        abstract = True

class Page(BaseContent):
    url = models.CharField('URL', max_length=200)

The only thing to note is that we define an abstract base class (class Meta: abstract = True) for the Page model and later also the blog Post model. This base class will take care of generating the wiki markup, so the code can be shared by Page and Post later. The important point is that the base class must be abstract because otherwise Django will use multi-table inheritance (i.e., the base model has its own table and the child gets a separate table containing only the additional fields). Since this would require JOINs we have to keep our models "flat" by only using abstract base models.

No surprises in simple code

Well, I just have to point this out. Let's take a quick look at the show view (which gets used by the modified flatpages middleware):

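# minicms/views.py:

from django.shortcuts import get_object_or_404
from django.views.generic.simple import direct_to_template
from minicms.models import Page
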
def show(request, url):
    if not url.startswith('/'):
        url = '/' + url
    page = get_object_or_404(Page, url=url)
    return direct_to_template(request, 'minicms/page_detail.html',
        {'page': page})

As you can see, there's no magic here. Simple queries look familiar. You probably guessed this already. Doh! :)

Date queries can bite you

OK, we also want to have multiple blogs on our website, so we need both a Post and a Blog model:

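# blog/models.py:

from django.contrib.auth.models import User
from django.db import models
from minicms.models import BaseContent
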
class Blog(models.Model):
    base_url = models.CharField('Base URL', max_length=200,
        help_text='Example: With base URL "personal" your blog posts would '
                  'be below /blog/personal/...')
    title = models.CharField(max_length=200)
    description = models.CharField(max_length=500, blank=True)

class Post(BaseContent):
    blog = models.ForeignKey(Blog, related_name='posts')
    published = models.BooleanField(default=False)
    author = models.ForeignKey(User, related_name='posts')
    url = models.CharField('URL', blank=True, max_length=200)
    published_on = models.DateTimeField(null=True, blank=True)

As you can see, Post derives from BaseContent, so it inherits wiki markup handling and the title and content fields. Everything here looks pretty normal. The difference is in the views:

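# blog/views.py:

from datetime import datetime

from django.http import Http404
from django.shortcuts import get_object_or_404
from django.views.generic.list_detail import object_list
from django.views.generic.simple import direct_to_template

from blog.models import Blog, Post
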
def show(request, blog_url, year, month, post_url):
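    try:
        year, month = int(year), int(month)
        start = datetime(year, month, 1)
        # December rolls over to January of the following year
        if month == 12:
            end = datetime(year + 1, 1, 1)
        else:
            end = datetime(year, month + 1, 1)
    except ValueError:
        raise Http404('Date format incorrect')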
    blog = get_object_or_404(Blog, base_url=blog_url)
    post = get_object_or_404(Post, url=post_url, blog=blog,
        published_on__gte=start, published_on__lt=end)
    return direct_to_template(request, 'blog/post_detail.html',
        {'post': post, 'blog': blog})

def browse(request, blog_url):
    blog = get_object_or_404(Blog, base_url=blog_url)
    query = Post.objects.filter(blog=blog, published=True)
    query = query.order_by('-published_on')
    return object_list(request, query, paginate_by=POSTS_PER_PAGE,
        extra_context={'blog': blog})

The show view does something unusual. With SQL you'd normally formulate the date query as

post = get_object_or_404(Post, url=post_url, blog=blog,
    published_on__year=year, published_on__month=month)

There is no simple way to express a "__month=" query with most non-relational backends. This particular query could be detected and executed because it also has a "__year=" filter, but that would add non-trivial complexity to the backend, so we decided not to support it that way. For now, we have to use the uglier version with a date range query. We'll provide a flexible solution for all nonrel backends in the future, though.
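The date range that stands in for a "__month=" filter can be computed by a small helper. This is a minimal sketch; month_range is a hypothetical name and not part of the site's code or of django-nonrel. The one case that needs care is December, where the end of the range falls in the next year.

```python
from datetime import datetime


def month_range(year, month):
    """Return (start, end) so that start <= value < end covers the month."""
    year, month = int(year), int(month)
    start = datetime(year, month, 1)  # raises ValueError for an invalid month
    if month == 12:
        # December rolls over to January of the following year.
        end = datetime(year + 1, 1, 1)
    else:
        end = datetime(year, month + 1, 1)
    return start, end
```

A view could then call start, end = month_range(year, month) inside its try block and filter with published_on__gte=start and published_on__lt=end.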

There's something else you might have noticed. In show and browse we first retrieve the blog and then the post(s). That's not necessarily unusual, but with SQL we could've just retrieved the Post directly via Post.objects.filter(blog__base_url=blog_url, ...). This would need JOIN support, though.

Neat-o features emulation

Finally, the last point: We have no select_related() in our queries, even though it would be a very useful optimization. It isn't supported yet. So, what can you do? Currently, everyone solves this by hacking around the limitations and writing more code. :'(

However, most of these problems can be solved transparently. We plan to add an independent layer which sits on top of the nonrel backend and emulates many of the neat-o features you like from SQL (e.g., JOINs and date queries). How will it scale, though? This layer could take care of automatic denormalization and MapReduce, so your reads could scale, but complex relationships can take a lot of processing power to keep updated. You'll still have to design your model code carefully and not everything will be practical. It does make a lot more possible, though.
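What denormalization buys you, and what it costs, can be illustrated by hand. The following is a minimal sketch in plain Python and not the planned layer's API: it copies a blog's base_url onto each post so that a lookup which would otherwise need a JOIN becomes a filter on a single "table". All names (BLOGS, POSTS, blog_base_url, save_post, posts_for_base_url, rename_blog) are hypothetical.

```python
BLOGS = {}   # blog id -> blog dict
POSTS = []   # list of post dicts


def save_post(post):
    """Store a post along with a denormalized copy of its blog's base_url."""
    post['blog_base_url'] = BLOGS[post['blog_id']]['base_url']
    POSTS.append(post)


def posts_for_base_url(base_url):
    """Single-table filter that replaces a blog__base_url JOIN."""
    return [post for post in POSTS if post['blog_base_url'] == base_url]


def rename_blog(blog_id, new_base_url):
    """Changing the source field means rewriting every copy of it."""
    BLOGS[blog_id]['base_url'] = new_base_url
    updated = 0
    for post in POSTS:
        if post['blog_id'] == blog_id:
            post['blog_base_url'] = new_base_url
            updated += 1
    return updated
```

Reads stay cheap, but rename_blog has to touch every post of the blog. That is the update cost mentioned above, and on a real nonrel DB those rewrites would typically happen in the background, which is where eventual consistency comes in.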

That's where the fun begins because you could run seemingly SQL-only Django code (with JOINs and aggregates) efficiently on a nonrel DB - at the price of eventual consistency. Still, the possibilities are huge! That's the actual reason why we work on django-nonrel.

So, what are the take-aways?

  • Simply avoid JOINs and OR queries (as if that were simple ;). They can appear in "unexpected" places like model inheritance, though.
  • Also, date lookups such as "__month=" don't work. Use a date range query instead.
  • Simple queries can be the same as always.
  • Django's ORM will allow for transparent emulation of SQL features in the future.

There's more to know. Watch our blog for news and more tutorials. Do you have any questions or improvement suggestions? Place your comments below!