Jan 27, 2016

Failover Strategies & Techniques in Distributed Applications

Systems fail. Networks fail. Processes fail. The secret: let it fail, but fix it fast. If service is restored before the user notices, did the failure really occur?

Broadly there are four types of failover strategies -
  1. No Failover
  2. Cold Failover
  3. Warm Failover
  4. Hot Failover
These strategies vary in their recovery time, cost and impact. Often a combination of them is used: for example, a hot failover strategy for high availability and a cold failover strategy for disaster recovery.

Next come the techniques. Those typically used for hot failover are -
  1. Client-based failover
  2. DNS-based failover
  3. Network-based failover
  4. IP Address takeover
  5. Gratuitous ARP based failover
  6. Server-based failover
Some failover techniques, like DNS-based failover, are provided as a service by many cloud providers. At the other extreme, techniques like Gratuitous ARP based failover are not at all cloud friendly.
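To make the first technique concrete, here is a minimal sketch of client-based failover: the client holds an ordered list of endpoints and moves on to the next one when a call fails. The names `call_with_failover`, `AllEndpointsFailed`, `fake_request` and the `*.example.internal` hosts are made up for illustration, and a real client would also need timeouts, retries with backoff and health tracking.

```python
from typing import Callable, Iterable, TypeVar

T = TypeVar("T")


class AllEndpointsFailed(Exception):
    """Raised when every endpoint in the list has been tried and failed."""


def call_with_failover(endpoints: Iterable[str], request: Callable[[str], T]) -> T:
    """Try each endpoint in order and return the first successful response."""
    errors = []
    for endpoint in endpoints:
        try:
            return request(endpoint)
        except (ConnectionError, TimeoutError) as exc:
            # Treat connection problems as "this endpoint is down" and move on.
            errors.append((endpoint, exc))
    raise AllEndpointsFailed(errors)


# Hypothetical transport used only to demonstrate the behaviour.
def fake_request(endpoint: str) -> str:
    if endpoint == "primary.example.internal":
        raise ConnectionError("primary is down")
    return f"served by {endpoint}"


result = call_with_failover(
    ["primary.example.internal", "standby.example.internal"], fake_request
)
print(result)  # served by standby.example.internal
```

The failover decision lives entirely in the client here, so no shared infrastructure has to change, but every client has to know the full endpoint list.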

Distributed systems are all about trade-offs. Each failover strategy mentioned above has its own pros and cons. The same goes for failover techniques. In the rest of the post we dive deeper into these details. At the end we touch on some of the challenges in failover, like tug-of-war, split-brain, quorums etc.
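As a small taste of the quorum idea, here is a sketch of the common majority rule, which is one way (not the only one) to guard against split-brain: a node may promote itself only if strictly more than half of the cluster agrees. The function name `has_quorum` and its parameters are assumptions made for this example.

```python
def has_quorum(votes_received: int, cluster_size: int) -> bool:
    """Return True only when strictly more than half of the cluster has voted yes."""
    return votes_received > cluster_size // 2


# A 5-node cluster partitioned 3/2: only the larger side may elect a new primary.
print(has_quorum(3, 5))  # True
print(has_quorum(2, 5))  # False
# An even split of a 4-node cluster leaves neither side with a majority.
print(has_quorum(2, 4))  # False
```

Because two disjoint groups cannot both hold a strict majority, at most one side of a partition can take over.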

Jan 23, 2016

Designing Messaging for Scalable Distributed Systems - Part 2

Recently I had a few discussions on messaging. I thought it would be an interesting follow-up to explore some additional considerations that are important but typically don't surface in the early phases of messaging system design.

The focus of this post is on the following aspects -
  1. Which scenarios in system design are better suited for async messaging?
  2. What issues should we consider when implementing distributed messaging functionality?
  3. How can we make the distributed messaging solution easier to monitor, debug & support?

Jan 11, 2016

Designing Messaging for Scalable Distributed Systems

The cloud has considerably changed the scale of distributed systems. As systems grow in size, it becomes increasingly difficult to design them and keep them running. To avoid those difficulties, most large-scale architectures use loosely coupled technologies.

The vehicle that is often used in this journey to the paradise (or inferno) of scaling the system is the message bus. If implemented properly, I think messaging is a highly valuable element in an architecture for near-infinite scale.


If implemented haphazardly...


You get the point...:-)

Jan 6, 2016

Scalable Distributed Systems - Introduction

In recent years, a couple of factors have become increasingly important in the design of distributed systems: scalability and reliability. Over time I have picked up a few things related to these factors. This post series is an attempt to share my modest knowledge of the scalability aspects.

What is Scalability?
Simply put, it is the ability of a system to handle increasing load, whether from the addition of users, resources, or both. Typically the scale of a system has 3 dimensions -
  1. The quantity dimension i.e., the number of users, resources, objects etc. that are part of the system
  2. The distribution dimension i.e., the geographical distribution of servers, services, data etc.
  3. The administrative dimension i.e., the number of organizations, multi-tenancy etc.
These dimensions in turn affect a whole host of components that are needed for a distributed system. 

Building a scalable system does not happen by accident. Similarly, a distributed system is not automatically a scalable system. So it is important to consider the effects of scale along these dimensions early on.