Architecting for Scaling

Our backend is written in Ruby, and we can handle over 5,000 messages per second.  Per messaging server.

Twitter, everyone’s poster child for poor scaling capabilities, is written in Ruby, and using Rails, and because they’ve been having such a terrible time of keeping their system alive, a large majority of people have extrapolated that Ruby and/or Rails is at fault. This is an incorrect assumption.

Where real scaling issues come into play is the architecture, not the language or implementation, assuming sane languages or implementations. Of course there are always the crazies who write some horribly inefficient monstrosity in BASIC that’s too slow by a factor of ten, but by and large, the same problem solved in different languages will generally be in the same order of magnitude.

A friend of mine from the WWU CS department actually performed some tests on this exact issue. Benson Kalahar created a simple test with basic arithmetic operations and data I/O in a variety of languages, and tested them on inputs of varying sizes.


As you can see, different languages have different levels of overhead, but they all tend towards linearization as the input increases. As would be expected, C is the fastest, a Bash script has very little overhead but is computationally slower, interpreted languages are slower than compiled languages, etc. Ruby does show some nasty tendencies as inputs get very large. I’m going to guess this is the GC doing it’s thing.  We’ll see what happens with Ruby 1.9 and YARV, but even if YARV doesn’t live up to the hype, it’s not a blocker.

The important thing to take away from this graph is that all the languages tend towards linearization, which is where we’re interested for building an application that can scale to the sizes that we want.  No matter how lightning-fast our language might be, eventually it’s going to be too slow to run on a single machine alone.

The real solution to making something capable of handling lots of load is to make it capable of scaling out, rather than scaling up. Scaling up, which means to add additional resources to a single node, will always have an upper limit, at which point you simply cannot add more RAM, there are no faster processors available, or you’re maxing out your gigabit ethernet link.

Scaling out, on the other hand, if architected well, has no upper bound. If the system as a whole can handle it, it’s simply a matter of adding another node, potentially adjusting some configuration, and letting it rip.

This is where Twitter went wrong, and this is where our messaging queue is right. 5,000 text messages might sound like a lot right now, but we’ve seen spikes where we have 10,000 messages in the queue, and that 2 seconds of latency is just on the edge of acceptability to us. The solution? Add another server, put it behind the load balancer, add it to the in-queue balancing config, and suddenly we can handle 10,000 messages per second. This is linear all the way out until we start to saturate the network.

Obviously if we had written the backend in C, we might be able to pull 10,000 messages per server, or even more. Unfortunately, this has a significant cost tradeoff. Hardware is cheap - we can add another messaging node for well under $1000. Developer time, on the other hand, isn’t as cheap. That $1000 is less than a week’s worth of a developer’s time. Instead, we opted to spend less developer time, have a more flexible final product, and be able to scale that product out very easily. We can have an extra compute or messaging node online within 24 hours, including hardware acquisition time.

I’m looking forward to seeing what our new architecture can do. It’s loads faster than the existing messaging queue, which has allowed us to send out up to 5 million messages a month since January, with lots of transient load.  I’m excited for launch on the 15th, it’s going to be a great new product!

  1. 6 Responses to “Architecting for Scaling”

  2. But what’s the point of sending 5,000 messags a second if your aggregator is only giving you 10 or 20? The best aggregators in the US provide 1,000 messages a second and it’s insanely expensive. What is your actual throughput thru the aggreagtor?

    By Jason on Jul 4, 2008

  3. Hi Jason.

    Plenty of reasons;
    We’re not just running MT traffic through our messaging server(s), we’re using them for logging, IPC, and a few caching tasks.

    Those numbers of 10-20 messages per second are guaranteed throughput - in reality we’re bursting much more.
    We’re also not running through just one aggregator. Yes, it is fairly expensive.

    I’m sure the gents over at Twitter said “what’s the point” a lot too.

    By Adrian Pike on Jul 5, 2008

  4. I doubt the engineering team at twitter say “what’s the point” at all.

    Your assertion that you can process 5,000 messages per second and horizontally scale is misleading and a ridiculous statement. Even if you’re connecting to multiple aggregators, each we’ll let you bind in at 20-40 MPS via SMPP and post to their web service at 50-100 MPS. You can load up your queues, but they’re still going to at the rate for which can pass messages to your aggregation partner(s), and they’re only going to send them out as fast as they can drain their own queues to each carrier (their binds to each carrier range from 200-2000 MPS across all clients).

    By that's outrageous! on Jul 19, 2008

  5. Hi “that’s”,

    You raise some good points.

    You’ve got the web services vs XMPP rates backwards, we’re able to saturate all of our aggregators HTTP pipes at about 30 a pop before they start either 500′ing or timing out, we’re seeing much more than that through XMPP.

    As I mentioned above, we’re not just running MT traffic through our messaging queue - it’s flexible enough that we’re routing a bunch of other types of traffic between, everything from task scheduling to distributed logging between our rack.

    The key point in all this isn’t the specific usage of the platform or tool, but rather that so often, we try to take the easy way out, and blame the language or framework, when in reality as you scale out, the architecture is really what we need to be paying attention to.

    By Adrian Pike on Jul 31, 2008

  6. Hi,

    I have been reading this blog for some time now but never bothered to comment until today. Wanted to let you know that I am a fan and enjoy your work.

    Thanks,

    By trediuptuggiz on Aug 4, 2008

  7. Brilliant!

    By AppopeImpots on Aug 10, 2008

Post a Comment