Monday, February 14, 2011

Guardian SXSW Hack Review

What I built and why



I really wanted to scratch my own itch at the hack weekend with SXSW in mind. There is going to be over a thousand bands playing at the music festival, and many of those will be trying to break through and make it. This means, there’s a fair chance I won’t have heard of many of the bands. Also, the sheer number of bands playing means it’ll be difficult for me to do any quick research about those bands to find out who I should go and see play.

I’ve used Last.FM for nearly four years and have something like approaching 20,000 scrobbled tracks in their dataset, so they have a good impression of the type of music and artists I like listening too and I wanted to tap into that data. Bearing in mind what I just said about not knowing any of the bands playing, I needed a different way to look at the data. Using Matt Andrew’s band listing API, I used the Last.FM API to find around 20 similar artists for each band playing at the festival, leaving me with a dataset of about 20,000 artists that are like those playing at the festival. My thinking here was there would be a good chance of bigger bands would be amongst this dataset and I might have more of a chance of finding a match.

Now, using the top artists from the Last.FM API, I could do an intersection of the artists I like and artists similar to those playing at SXSW. I could then do a look up to see what bands are similar to bands I like, helping me discover new music at the festival without investing too much time doing any research on all of the artists. Yes, it gave me a bit of a headache too.

I only really got this working by lunch time on the second day, and although I had intended to build a simple HTML layer on the data I’d built, I just didn’t have the energy. I went for a coffee, a chat and a nice sit down instead. I don’t think it helped tremendously with the presentation not having any visuals, but I think a few people saw the potential of the application and the data behind it.

What’s next



The hack day is done, but I want to push this project on a little more. The code certainly needs a lot of love and I wasn’t using a complete dataset over the weekend, so I need to import that to get more comprehensive results. I’d also like to make this more generic for any festival, be it SXSW or Glastonbury.

Lessons, observations and stories



The hack weekend was only my second hack event, and the first one I went too was only for the first day of a two day event, so I suppose this was my first full hack event. I’ve been to plenty of small meet-ups, larger Bar Camp events and mammoth conferences, but a hack day definitely has a different and more intimate vibe too it. The number of attendees seemed pretty optimal and it was certainly a good mix of developers, designers and journalists although this only really shone through in the final presentations.

Although having an idea of what I was intending to build at the event, and even doing some thinking about how the application might work (but no coding before the pistol on Saturday morning, as that’s cheating!) it still took me a while to get started. I probably would have coded my idea up in Node.JS as it’s quickly becoming my language of choice in the ‘get something done fast’ category, but as I was building something that I hoped would eventually be brought into the Guardian for festival coverage I thought I should build the application in the fast becoming language of choice at the Guardian, scala.

Scala, although fantastically awesome on one hand, still runs on the JVM and still requires Java like setup with web.xml files and the like. This isn’t a problem in itself, but it means some non-trivial time is spent just setting up a project the way you like it.

This leads nicely into lesson one:
If intending to use a language that requires some investment in setup, look for a way of reducing this, perhaps by having a library of “hello world” applications pre-configured for repeat use.

Keeping with Scala, as a relative newbie to the language I know I was doing certain things in an inefficient way and that was compounded by the timescales of an event like this. When working on a project and you suffer a setback, either you don’t know how to do something or something you thought would work just doesn’t, you generally have time to investigate, ask around and try a few things out. However, that is a real luxury when trying to build things in hours and not weeks.

Lesson Two:
If you’re going to use a hack day to experiment with a new technology, expect frustrations and delays. Even a half hour delay can feel catastrophic.

Just to finish the Scala points, and this may be my lack of knowledge of the language, but I wish the JSON and HTTP support were a little better. Compared to the XML support, which is an excellent xpath like implementation, the JSON support felt clunky. I actually had to change the data I was getting (I was reading from a file, so I could do this) to remove some parts which I couldn’t get the code to parse. As for the HTTP support, I had to bring in the Jetty HTTP client (which didn’t seem to recognise ‘utf-8’ as a character encoding), then bring in the Apache Commons HTTP client to request data from the Last.FM API. One post on Stack Overflow I was reading while I was looking for answers to a Scala problem suggesting having a personal library for wrapping functions you wish were supported better.

Lesson Three:
Knowing how to do common things really well, and fast is essential. In my case, using web based APIs using JSON and rendering JSON in turn.

One technology I did use which I’m now using on a day to day basis is MongoDB and this is where knowing something about a technology really came into it’s own. Getting stuff into and out of the database was as easy as it should be. I used the effective but perhaps slightly noisy Casbah Scala driver to talk to MongoDB. I was also using the Last.FM API to get information about bands and I realised that the Last.FM API was probably one of the first public web APIs I used thanks to a Paul Downey workshop when I was a fresh faced graduate. In a good way the API doesn’t look to have changed, which should be great as anything I built in that workshop might have a chance of working today. However, the web has moved on a bit from RPC and XML and I’d like to see Last.FM offering more of a RESTful JSON based API. That might just have influenced my decision to use Node.JS instead of Scala due to the support of JSON over XML.

As this was a music project, the sensible thing was using Music Brainz ids for the bands, and although this worked in many instances some of the bands playing SXSW don’t yet have Music Brainz ids and perhaps more surprisingly the Last.FM API doesn’t seem to provide Music Brainz ids for all of the bands in it’s API, even top 100 chart bands can be without one. The algorithm I built depended on this data, and although I could do a best effort and call Music Brainz directly, it would have been nice if this was covered off by the larger provider, Last.FM.

Update: I've already found out Last.FM does actually offer JSON in it's API, that would have been handy to know, and sorta proves the lesson of know what you're doing before starting.