Monday, September 19, 2011

Notes from "Schema design at scale"

# Eliot Horowitz

## Embedding sub-documents vs separate collection

i.e. blog post and comments

- embedded
-- Something like a million sub-documents is going to be unwieldy
-- need to move whole document (expensive) if gets large
- Not embedded
-- Million 'comments' in separate documents means lots of reads
- Hybrid
-- one document for core meta-data
-- separate document for n comments with array separated by buckets (i.e. 100 in each bucket)
-- reduces potential seeks (if 100 in each, reduces from say 500 to 5)

## Indexes

- Right-balanced index access on the B-Tree
-- only have to keep small portion in RAM
-- time based, object id, auto-increment
- Keep data sequential in index (covered index)
-- create an index with just the fields you need so you can retrieve the data straight from the index
-- index is bigger
-- good for reads like this

## Shard Key

- determines how data is partitioned
- hard to change
- most important performance decision
- broken into chunks by range
- want to distribute evenly but be right balanced, like month() + md5(something)
- if sharding for logs, need to think
-- why do you want to scale (for write or reading)?
-- 'want to see the last 1000 messages for my app across the system'
-- take advantage of parallelising commands across shares (index by machine, read by app-name)
- no right answer

## Lessons

Range query vs regex (that uses ^ - essentially 'starts-with') is about same performance
If you have a genuinely unique id, use that instead of the ObjectId

No comments: