Monday, September 19, 2011

Notes from "Blending MongoDB and RDBMS for eCommerce"

github.com/spf13
spf13.com
@spf13
OpenSky eCommerce platform

products in mongo using custom fields for different product types (e.g. actors for a movie, tracklist for an album, etc.)

- ordered items need to be fixed (frozen) at the time of purchase; table inheritance is a bad fit for this

## 3 things for e-commerce

- optimistic concurrency: update only if the document is still current, then retry if it is not (sketch below)
- assumes an environment with low data contention
- works well for Amazon, with its long-tail product catalogue
- works badly for eBay, Groupon, or anything with flash sales (high data contention)
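
A minimal sketch of that update-if-current pattern using Casbah (the Scala MongoDB driver mentioned elsewhere in these notes); the collection and the version field are illustrative assumptions, not details from the talk:

import com.mongodb.casbah.Imports._

val carts = MongoConnection()("shop")("carts") // hypothetical db/collection

// Optimistic concurrency: only apply the change if the version we originally read is still current.
def updateIfCurrent(cartId: ObjectId, expectedVersion: Int, newStatus: String): Boolean = {
  val query  = MongoDBObject("_id" -> cartId, "version" -> expectedVersion)
  val change = MongoDBObject(
    "$set" -> MongoDBObject("status" -> newStatus),
    "$inc" -> MongoDBObject("version" -> 1))
  carts.update(query, change).getN > 0 // 0 means someone got there first: re-read and try again
}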

# Commerce is ACID in real-life

purchasing something from a physical store deals with this without concurrency issues, as each physical item can only be held by one customer at a time

## MongoDB e-Commerce

- each item (not SKU) has its own document
- contains
-- reference to the SKU
-- state
-- other metadata (timestamp, order ref, etc.)

the add-to-cart action is difficult to do atomically, but in Mongo changing the state on an item document (e.g. to 'in-cart') makes it unavailable to other customers
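
A hedged Casbah sketch of that state flip: claim an item for a cart by matching only items still in an 'available' state, so the check and the change happen in one atomic operation (collection and field names are my assumptions):

import com.mongodb.casbah.Imports._

val items = MongoConnection()("shop")("items") // hypothetical per-item collection

// Claim one available item for a cart: the state check and the state change happen in a
// single atomic findAndModify, so two customers can never hold the same physical item.
def addToCart(sku: ObjectId, cartId: ObjectId): Option[DBObject] =
  items.findAndModify(
    MongoDBObject("sku" -> sku, "state" -> "available"),
    MongoDBObject("$set" -> MongoDBObject("state" -> "in-cart", "cartId" -> cartId)))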

## Blending

Doctrine (open-source ORM/ODM), modelled on Hibernate

in SQL
-- product inventory, sellable inventory, orders

- inventory is transient in Mongo collections
- inventory kept in sync with listeners

for financial transactions we want the security and comfort of an RDBMS

## Playing Nice

products are stored as documents
orders are entities stored in the relational DB
store references, not relationships, across the two databases
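
A rough Scala sketch of what 'references, not relationships' can look like; the Order type and keeping the Mongo ObjectId as a plain string column are illustrative assumptions:

import com.mongodb.casbah.Imports._

// The relational Order row keeps only an opaque reference to the product document:
// no foreign key and no join across the two databases, just the ObjectId as a string.
case class Order(id: Long, productDocumentId: String, quantity: Int)

def productFor(order: Order, products: MongoCollection): Option[DBObject] =
  products.findOne(MongoDBObject("_id" -> new ObjectId(order.productDocumentId)))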

jwage.com/2010/08/25/blending-the-doctrine-orm-and-mongdb-odm/

Notes from "Schema design at scale"

# Eliot Horowitz

## Embedding sub-documents vs separate collection

e.g. a blog post and its comments

- embedded
-- Something like a million sub-documents is going to be unwieldy
-- the whole document has to be moved (expensive) if it gets large
- Not embedded
-- a million comments in separate documents means lots of reads
- Hybrid (sketch below)
-- one document for the core metadata
-- separate documents holding the comments in arrays, split into buckets (e.g. 100 comments per bucket)
-- reduces potential seeks (with 100 per bucket, say 500 seeks drop to 5)
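
A hedged Casbah sketch of the bucket approach; the collection name, the bucket size of 100 and the seq / 100 bucket id are all illustrative assumptions:

import com.mongodb.casbah.Imports._

val comments = MongoConnection()("blog")("comment_buckets") // hypothetical collection

// Append the nth comment of a post to its bucket document (100 comments per bucket).
// The upsert flag creates a new bucket document the first time it is needed.
def addComment(postId: ObjectId, seq: Int, comment: DBObject) {
  val bucket = MongoDBObject("post" -> postId, "bucket" -> seq / 100)
  comments.update(bucket, MongoDBObject("$push" -> MongoDBObject("comments" -> comment)),
    true, false) // upsert, not multi
}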

## Indexes

- Right-balanced index access on the B-tree
-- only a small portion of the index has to be kept in RAM
-- works when keys are time-based, ObjectId, or auto-increment
- Keep the data you need in the index (covered index; sketch below)
-- create an index with just the fields you need so the data can be retrieved straight from the index
-- the index is bigger
-- good for reads like this
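
A minimal Casbah sketch of a covered query, under assumed field names (ts, status): the index contains every field the query touches, and the projection excludes _id so the result can be served from the index alone:

import com.mongodb.casbah.Imports._

val events = MongoConnection()("app")("events") // hypothetical collection

// The index holds both the field we filter on and the field we return.
events.ensureIndex(MongoDBObject("ts" -> 1, "status" -> 1))

val anHourAgo = new java.util.Date(System.currentTimeMillis - 60 * 60 * 1000)

// Ask only for indexed fields and exclude _id, so Mongo can answer from the index alone.
val recent = events.find(
  MongoDBObject("ts" -> MongoDBObject("$gte" -> anHourAgo)),
  MongoDBObject("status" -> 1, "_id" -> 0))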

## Shard Key

- determines how data is partitioned
- hard to change
- the most important performance decision
- data is broken into chunks by key range
- you want to distribute writes evenly but stay right-balanced, e.g. month() + md5(something) (sketch below)
- if sharding for logs, you need to think about:
-- why do you want to scale (for writes or for reads)?
-- 'I want to see the last 1000 messages for my app across the system'
-- take advantage of parallelising commands across shards (index by machine, read by app name)
- no right answer
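
A small plain-Scala sketch (my own field and function names) of computing that kind of compound shard key value at write time: a coarse time prefix keeps inserts right-balanced, while an md5 component spreads load within the current month:

import java.security.MessageDigest
import java.util.Calendar

// Coarse, monotonically growing prefix: keeps inserts right-balanced.
def monthKey(): Int = {
  val now = Calendar.getInstance()
  now.get(Calendar.YEAR) * 100 + (now.get(Calendar.MONTH) + 1)
}

// md5 of something stable (e.g. a user id) spreads documents within the month across chunks.
def md5Hex(s: String): String =
  MessageDigest.getInstance("MD5").digest(s.getBytes("UTF-8")).map("%02x".format(_)).mkString

// Shard key fields written into every document, e.g. Map("month" -> 201109, "hash" -> "9f86d0...")
def shardKeyFields(userId: String) = Map("month" -> monthKey(), "hash" -> md5Hex(userId))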

## Lessons

A range query and an anchored regex (one starting with ^, essentially 'starts with') have roughly the same performance.
If you have a genuinely unique id of your own, use it as _id instead of the generated ObjectId.

Notes from "Scaling MongoDB for Real-Time Analytics"

* Scaling MongoDB for Real-Time Analytics, a Shortcut Around the Mistakes I've Made - Theo Hultberg, Chief Architect, Burt

40GB and 50 million documents per day

use of mongo
- virtual memory (throwaway)
- short-term storage (throwaway)
- full-time storage

why
- sharding makes writes scale
- secondary indexes

using JRuby

mongo 1.8.1
- updates incrementing a single number: near 50% write lock
- pushes: near 80% write lock

mongo 2.0
- updates: near 15% write lock
- pushes: near 50% write lock
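
For context, a hedged Casbah sketch of the two write shapes being measured, an in-place counter increment versus pushing onto a growing array; database, collection and field names are my assumptions:

import com.mongodb.casbah.Imports._

val sessions = MongoConnection()("analytics")("sessions") // hypothetical collection

// In-place increment of a single number: a small change, but it still takes the global write lock.
def countPageview(sessionId: String) =
  sessions.update(MongoDBObject("_id" -> sessionId),
    MongoDBObject("$inc" -> MongoDBObject("pageviews" -> 1)), true, false) // upsert, not multi

// Pushing events onto a growing array: the document grows, so the lock is held for longer.
def recordEvent(sessionId: String, event: DBObject) =
  sessions.update(MongoDBObject("_id" -> sessionId),
    MongoDBObject("$push" -> MongoDBObject("events" -> event)), true, false) // upsert, not multi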

iterations
- 1
-- one document per session
-- update as new data comes along
-- 1000% write lock!
-- lesson: everything is about working around the global write lock
- 2
-- multiple documents using same id prefix
-- not as much lock, but still poor performance; couldn't remove data at the same pace as it was inserted
-- lesson: everything is about working around the global write lock
-- lesson: put more thought into designing the primary key (same prefix for a group)
- 3
-- wrote to a new collection every hour
-- lots of complicated code, bugs
-- enables fragmented database files on disk
-- lesson: make sure you can remove old data
- 4
-- sharding
-- higher write performance
-- lots of problems (in 1.8) and ops time spent debugging
-- lesson: everything is about working around the global write lock; sharding is a way around it (4 shards means 4 global write locks)
-- lesson: sharding not a silver bullet (it's buggy, avoid it if you can)
-- lesson: it will fail, design for failure (infrastructure)
- 5
-- move things to separate clusters for different usage patterns (e.g. high-write-then-drop vs. incrementing a document)
-- lesson: one database with one usage pattern per cluster
-- lesson: monitor everything (found a replica that was 12 hours behind; useless!)
- 6
-- upgraded to monster servers (AWS High-Memory Quadruple Extra Large), later downgraded to Extra Large machines: 6 machines running mongod with 12GB RAM, 3 machines for mongo config servers and 3 for arbiters, spread across three availability zones; writing 2000 documents per second and reading 4000 documents per second
-- lesson: call the experts when you're out of ideas
- 7
-- partitioning again, plus pre-chunking (you need to know your data and the range of your keys)
-- partition by database, with a new db each day (sketch after this list)
-- decreased the size of documents (less to keep in RAM)
-- no more problems removing data
-- lesson: smaller objects mean smaller documents
-- lesson: think about your primary key
-- lesson: everything is about working around the global write lock
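
A minimal sketch of the new-database-per-day idea with Casbah; the naming scheme and the whole-database drop are illustrative assumptions, not details from the talk:

import com.mongodb.casbah.Imports._
import java.text.SimpleDateFormat
import java.util.Date

val conn = MongoConnection()
val dayFormat = new SimpleDateFormat("yyyyMMdd")

// All writes for a given day go to that day's database, e.g. metrics_20110919.
def eventsFor(day: Date): MongoCollection =
  conn("metrics_" + dayFormat.format(day))("events")

// Removing old data becomes a whole-database drop instead of millions of deletes.
def dropDay(day: Date) {
  conn("metrics_" + dayFormat.format(day)).dropDatabase()
}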

Q&A
- would you recommend it?
-- best for high read loads; high write load is not a solved problem yet
- EC2
-- you have replicas, why also have EBS?
-- use ephemeral disks; they come included and have predictable performance (with caveats, like spreading across availability zones to avoid losing a data centre)
-- use RAID 10 if using EBS
- monitoring
-- mongostat
-- plugins that are available
-- MMS from 10gen
-- Server Density (here at the Mongo conference)
- map reduce
-- moved away from it to be more real-time (was using Hadoop)

Sunday, June 05, 2011

My first scala app

Well, I say first, but that's not quite the whole story. I've worked on existing Scala code bases, but recently got the opportunity to start a new project, and I got to see how the experiences on those projects and conversations with colleagues influenced design decisions. I wanted to note down some of the tricks I'd picked up.

Tools



We start with the tools, and first up is SBT, version 0.7.x. We depend on the IDEA plugin to generate IDEA project files. We also use the Scalariform plugin, which formats code on every compile within SBT, and the Sources plugin, which downloads the source JARs without having to stipulate the 'withSources' directive on the dependencies.

The IDEA and Sources plugins are run as processors and are not checked into the SBT project file, whereas the Scalariform plugin is. The distinction is that a developer might choose to use a different IDE or not download the sources, but we should always try to conform to the same coding practices, and, being lazy, having a plugin do it for you is one less thing to worry about.

Inversion of control / Dependency injection



One of the interesting developments early on was the choice of dependency injection framework. Would it be Spring, Guice or something custom? Actually, we decided not to have any and just use plain old servlets through our routing framework, Scalatra. I think this has actually led us to write better code; for instance, we have taken advantage of companion objects for classes instead of relying on repository classes that would need to be injected.

Routing



We started with the routing framework, Scalatra, which is similar to Sinatra and other simple routing frameworks. It provides a clean way of exposing your application to the web and, combined with the following, enables all of the methods to be only a few very focused lines of code. We used Scalate and SSP for template rendering, which can feel a little verbose with the variable declarations at the top, but when you can pre-compile the templates you understand why those are required.
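
As a rough illustration (not our actual routes), a minimal Scalatra servlet looks something like this; the paths and fields are made up for the example:

import org.scalatra.ScalatraServlet

class ArtistsApp extends ScalatraServlet {

  // GET /artists/:id - params("id") picks the path segment out for us
  get("/artists/:id") {
    <p>Artist { params("id") }</p> // in the real app this would render an SSP template
  }

  // POST /artists - create something from a form field, then bounce back to a GET
  post("/artists") {
    val name = params("name")
    redirect("/artists/" + name)
  }
}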

Databasing



All new projects are using MongoDB, and we've been using the Scala driver, Casbah, for a while, but this gave us an opportunity to try out Salat, the case class wrapper. This is probably as close as I've seen to ORM-like features in Scala while maintaining the MongoDB query sugar. As mentioned, using companion objects instead of repository classes allowed us to do tidy things like:

case class Artist(_id: ObjectId = new ObjectId, name: String) {
  def save() = Artist.saveArtist(this)
}

object Artist extends SalatDAO[Artist, ObjectId](...mongo config...) {
  def saveArtist(artist: Artist) = save(artist) // SalatDAO saves/upserts by _id
}

And use them like:

val artist = Artist(name = "My new favourite band")
artist.save()


This meant we were using a repository-like model, but hiding it behind the case class and the companion object. It makes for very neat code further upstream.

Webifying



The app has to call out to other web APIs, and for this we use dispatch, which wraps the Apache HttpClient project with a bow, making it really simple to get resources and convert them to a string, XML or JSON. We were already familiar with lift-json, so we would get the response as a string and parse the JSON into case classes. Once again, XML is almost a joy to work with in Scala compared to other languages, but lift-json still wins in simplicity, so we use JSON-formatted APIs where we can.
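
A small sketch of the lift-json side of that, with a made-up payload and case class standing in for a real API response; the extraction into case classes is the part we lean on:

import net.liftweb.json._

// A hypothetical response body from one of the APIs we call
val body = """{"name": "My new favourite band", "listeners": 12345}"""

case class ArtistInfo(name: String, listeners: Int)

implicit val formats = DefaultFormats

// parse gives back a JValue; extract maps it straight onto the case class
val info = parse(body).extract[ArtistInfo]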

What’s next

I’ve got my eye on Lift and Akka. I’ve had a play with Akka, but I feel I’d like to understand it a little more before using it in a project. Other than that, I think I’m generally getting my head around Scala and functional programming all the more and I’m rather enjoying it. I still feel like I’m just scratching the surface, but it is my first app remember?

Sunday, April 10, 2011

Building Curseleague.com


I really wanted an excuse to play with NodeJS once again, and after a conversation with blagdaross in the office I had my idea. I had also been looking for an excuse to stretch the legs of Joyent's no.de hosting service for NodeJS. I had all of the pieces needed:


Once again I turned to the trusty toolset of ExpressJS for routing and EJS for template rendering. I wanted to build straight out of Twitter lists, so the first thing was to hit Twitter to get the members of a list. As this is an open call, no OAuth handshaking has to take place, which is nice. However, getting the list of members returns a whole bunch of metadata about each member on the list. I didn't need any of this information, and I couldn't fetch all of the members without pagination. This is where I started to have some fun writing asynchronous code for very much a synchronous web request.


Once I had the list of members of a list, I could do a lookup on Cursebird for the points they have given those members. I have to say the Cursebird API is just fantastic! Talk about the API being the website, just stick .json on the end of any page you visit and get the data you need. No API key nonsense for open data is also a great move. Well done to Cursebird on their efforts. After sorting the list according to points, all it takes is rendering.


I recognised that crunching all of this data with potentially a few hundred server side calls to Twitter and Cursebird would be slow, so I introduced a really dumb key-value pair cache and use that cache at every level, from caching a single request to Twitter or Cursebird, to caching a whole dataset for a webpage.


Onto hosting, and no.de has this really interesting deployment technique where you do a git push into a repository you're given on the virtual server you have. From there a post-push hook is fired which deploys your app and runs a named file (server.js). The no.de real time analytics are really impressive, although it'd be great to get historical information out of that too.


The one ongoing grievance I'm dealing with is that locally I deploy on port 3000 but in production I deploy on port 80, and currently I'm changing this on commits as a workaround. I had a way of doing this through a node module I'd built a while ago to get properties from a JSON file, but up until now I'd been using nDistro to manage my dependencies, and with no.de I'd have to do this a little differently. I could push the properties project into the NPM repository, but I'm also thinking of building a node module that can read nDistro files and add dependencies at run-time. It's food for thought.


Finally a big thank you to JT for working up the styling.


Cheers,


Robbie

Monday, February 14, 2011

Guardian SXSW Hack Review

What I built and why



I really wanted to scratch my own itch at the hack weekend with SXSW in mind. There are going to be over a thousand bands playing at the music festival, and many of those will be trying to break through and make it. This means there's a fair chance I won't have heard of many of the bands. Also, the sheer number of bands playing means it'll be difficult for me to do any quick research to find out who I should go and see play.

I've used Last.FM for nearly four years and have something approaching 20,000 scrobbled tracks in their dataset, so they have a good impression of the type of music and artists I like listening to, and I wanted to tap into that data. Bearing in mind what I just said about not knowing any of the bands playing, I needed a different way to look at the data. Using Matt Andrew's band listing API, I used the Last.FM API to find around 20 similar artists for each band playing at the festival, leaving me with a dataset of about 20,000 artists that are like those playing at the festival. My thinking here was that there would be a good chance bigger bands would be amongst this dataset, and I might have more of a chance of finding a match.

Now, using my top artists from the Last.FM API, I could do an intersection of the artists I like and the artists similar to those playing at SXSW. I could then do a lookup to see which bands playing are similar to bands I like, helping me discover new music at the festival without investing too much time researching all of the artists. Yes, it gave me a bit of a headache too.
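
The core of it boils down to a set intersection and a reverse lookup. A rough sketch in plain Scala, with made-up names and trivially small data standing in for the real API responses:

// similarTo: for each SXSW band, the ~20 artists Last.FM says are similar to it
val similarTo: Map[String, Set[String]] = Map(
  "Unknown Band A" -> Set("Radiohead", "Portishead"),
  "Unknown Band B" -> Set("The National"))

// myTopArtists: my own top artists according to Last.FM
val myTopArtists = Set("Radiohead", "LCD Soundsystem")

// Artists similar to any SXSW band that I already listen to...
val overlap = similarTo.values.flatten.toSet intersect myTopArtists

// ...mapped back to the SXSW bands that produced the overlap: these are the ones to go and see.
val bandsToSee = similarTo.collect {
  case (band, similar) if (similar intersect myTopArtists).nonEmpty => band
}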

I only really got this working by lunchtime on the second day, and although I had intended to build a simple HTML layer on top of the data, I just didn't have the energy. I went for a coffee, a chat and a nice sit down instead. Not having any visuals didn't help the presentation tremendously, but I think a few people saw the potential of the application and the data behind it.

What’s next



The hack day is done, but I want to push this project on a little more. The code certainly needs a lot of love and I wasn’t using a complete dataset over the weekend, so I need to import that to get more comprehensive results. I’d also like to make this more generic for any festival, be it SXSW or Glastonbury.

Lessons, observations and stories



The hack weekend was only my second hack event, and the first one I went to was only for the first day of a two-day event, so I suppose this was my first full hack event. I've been to plenty of small meet-ups, larger Bar Camp events and mammoth conferences, but a hack day definitely has a different and more intimate vibe to it. The number of attendees seemed pretty optimal and it was certainly a good mix of developers, designers and journalists, although this only really shone through in the final presentations.

Although I had an idea of what I was intending to build at the event, and had even done some thinking about how the application might work (but no coding before the pistol on Saturday morning, as that's cheating!), it still took me a while to get started. I probably would have coded my idea up in Node.JS, as it's quickly becoming my language of choice in the 'get something done fast' category, but as I was building something that I hoped would eventually be brought into the Guardian for festival coverage, I thought I should build the application in the fast-becoming language of choice at the Guardian, Scala.

Scala, although fantastically awesome on one hand, still runs on the JVM and still requires Java-like setup with web.xml files and the like. This isn't a problem in itself, but it means some non-trivial time is spent just setting up a project the way you like it.

This leads nicely into lesson one:
If you intend to use a language that requires some investment in setup, look for a way of reducing it, perhaps by keeping a library of "hello world" applications pre-configured for repeat use.

Keeping with Scala, as a relative newbie to the language I know I was doing certain things in an inefficient way, and that was compounded by the timescales of an event like this. When you're working on a project and you suffer a setback, either because you don't know how to do something or because something you thought would work just doesn't, you generally have time to investigate, ask around and try a few things out. However, that is a real luxury when trying to build things in hours and not weeks.

Lesson Two:
If you’re going to use a hack day to experiment with a new technology, expect frustrations and delays. Even a half hour delay can feel catastrophic.

Just to finish the Scala points, and this may be my lack of knowledge of the language, but I wish the JSON and HTTP support were a little better. Compared to the XML support, which is an excellent XPath-like implementation, the JSON support felt clunky. I actually had to change the data I was getting (I was reading from a file, so I could do this) to remove some parts which I couldn't get the code to parse. As for the HTTP support, I had to bring in the Jetty HTTP client (which didn't seem to recognise 'utf-8' as a character encoding), then bring in the Apache Commons HTTP client to request data from the Last.FM API. One post on Stack Overflow I was reading while looking for answers to a Scala problem suggested having a personal library for wrapping functions you wish were supported better.
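
For contrast, the XML side that felt pleasant: a tiny sketch using Scala's built-in XML support on a made-up fragment loosely shaped like an artist listing (the element names are mine, not the real API's):

// Illustrative XML, loosely the shape of an artist listing
val response =
  <artists>
    <artist><name>Radiohead</name></artist>
    <artist><name>The National</name></artist>
  </artists>

// XPath-like navigation: \\ searches all descendants, \ selects direct children
val names = (response \\ "artist").map(a => (a \ "name").text) // Seq("Radiohead", "The National")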

Lesson Three:
Knowing how to do common things really well, and fast, is essential. In my case, that means consuming web-based JSON APIs and rendering JSON in turn.

One technology I did use which I'm now using on a day-to-day basis is MongoDB, and this is where knowing something about a technology really came into its own. Getting stuff into and out of the database was as easy as it should be. I used the effective, but perhaps slightly noisy, Casbah Scala driver to talk to MongoDB. I was also using the Last.FM API to get information about bands, and I realised that the Last.FM API was probably one of the first public web APIs I used, thanks to a Paul Downey workshop when I was a fresh-faced graduate. In a good way, the API doesn't look to have changed, which should be great, as anything I built in that workshop might have a chance of working today. However, the web has moved on a bit from RPC and XML, and I'd like to see Last.FM offering more of a RESTful JSON-based API. That might just have influenced my decision to use Node.JS instead of Scala, due to the support for JSON over XML.
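
The 'as easy as it should be' part, roughly, as a hedged Casbah sketch with made-up database, collection and field names:

import com.mongodb.casbah.Imports._

val artists = MongoConnection()("sxsw")("artists") // hypothetical db/collection

// in: store a band along with the artists Last.FM considers similar to it
artists.insert(MongoDBObject("name" -> "Unknown Band A",
                             "similar" -> MongoDBList("Radiohead", "Portishead")))

// out: find every band that lists a given artist as similar
val matches = artists.find(MongoDBObject("similar" -> "Radiohead")).toList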

As this was a music project, the sensible thing was to use MusicBrainz ids for the bands, and although this worked in many instances, some of the bands playing SXSW don't yet have MusicBrainz ids, and perhaps more surprisingly the Last.FM API doesn't seem to provide MusicBrainz ids for all of the bands in its API; even top-100 chart bands can be without one. The algorithm I built depended on this data, and although I could do a best effort and call MusicBrainz directly, it would have been nice if this was covered off by the larger provider, Last.FM.

Update: I've already found out Last.FM does actually offer JSON in its API; that would have been handy to know, and it sort of proves the lesson of knowing what you're doing before starting.