Network Failure Handling Roundup

Nov 26 2013

Seth Proctor

When we dropped our 2.0 release back in October, we did it with some fanfare. I mean, you know, two-dot-oh. Exciting! On Thursday, when we updated to 2.0.1, we kept it a little more on the down low. While a point release like this may not seem sexy, it brings with it a couple of pretty cool things we do want to talk about.

One of those things is stability and correctness in the face of network partition. I’ve recently written some thoughts about trade-offs in distributed computing and how to react to failure and partition. That’s all well and good, but it’s no substitute for actual running code.

Several months back I had the pleasure of hanging out with Kyle Kingsbury, an all-around sharp guy and a lot of fun to riff with on distributed computing. He was going through the process of evaluating several SQL and NoSQL databases to see how they react to failures like network partition. The world needs more of these tests that are about correctness in the face of failure! While I’d argue that NuoDB actually did pretty well compared to several of the other systems he looked at, we didn’t exactly come out with full marks.

A lot of what he wrote (and subsequently presented) I agree with … some of it not so much. We take issues around CAP and database availability very seriously here at NuoDB; I’ll be writing more about that soon. The bottom line, however, is that he uncovered a few stability issues in our database and highlighted a default failure mode that’s harder to work with in many deployments. We take that seriously too.

What followed here at NuoDB is the subject of several blog posts you’ll see this week. Dan is going to talk about failure-detection schemes and how you can tune NuoDB to react based on how you’re deploying across systems. Trek is going to talk about how we’ve actually been running Jepsen internally to stress our system. He’s also going to lay out the recipes for how you can try it too. To Kyle I say “thank you!” Your test framework gave us a way to measure failure modes, and we can now demonstrate how to pass the tests without any of the stop-the-world latencies you had observed. Cool stuff. Oh, and Trek’s also going to write about some contributions we’re offering back to help others run this test suite.

While you’re reading these posts, keep in mind that this is just the start. We’re a relational database here at NuoDB, so Consistency is paramount. That said, we have some pretty cool ideas about what Availability can mean in a distributed system, especially when you can reason formally about data authority the way our architecture lets us do. What we’re laying out this week is the foundation for what comes next. So please check out the flurry of activity on our blog this week, download NuoDB, and let us know what you think.
