Single Point of Failure.
There are probably a lot of really nice definitions out there, but I'd like to use my own. In my world, a single point of failure is:
... a component, hardware or software based, which when it fails will cause the entire system, or an entire subsystem, to become unavailable to the users ...
So, let's give some examples:
- An application that only runs on a single web server has the web server as a single point of failure.
- An application which uses only a single database server (non-clustered) has the database server as a single point of failure.
- An application that relies on the Internet but has only a single connection has its ISP connection as a single point of failure.
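To make those examples a little more concrete, here is a minimal sketch in Python of how you might flag them in an inventory. Everything in it is an assumption for illustration: the `Component` class, the `single_points_of_failure` function, and the instance counts are made up, not taken from any real tool or from our actual environment.

```python
from dataclasses import dataclass


@dataclass
class Component:
    name: str
    instances: int  # hypothetical count of independent, redundant instances


def single_points_of_failure(components):
    """Return the names of components that have no redundant instance."""
    return [c.name for c in components if c.instances < 2]


# Hypothetical inventory mirroring the three examples above,
# plus one component that does have a second instance.
inventory = [
    Component("web server", 1),
    Component("database server", 1),
    Component("ISP connection", 1),
    Component("DNS server", 2),
]

for name in single_points_of_failure(inventory):
    print(f"single point of failure: {name}")
```

Counting instances like this is a simplification. It only catches the obvious cases, since two instances that depend on the same underlying thing can still fail together.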
While we try to cover many of these different aspects when we design applications and infrastructures, sometimes things still don't work. For instance, in Production we've got clustered web servers, clustered database servers, multiple Ethernet connections, redundant DNS servers, RAID disk storage and dozens of other redundant systems. Sometimes, though, things just go south really fast and in a really bad way. Recently we had an air conditioning problem in our server room. We have redundant units, each with multiple air conditioners. Through a sad set of circumstances we ended up with only 1 of 4 units working.
No matter what anyone does, there is no such thing as a foolproof system. There will always be some avenue whereby a single point of failure exists. The goal is to identify those areas and work on putting in redundancy, one step at a time. It is a long process, but nothing worthwhile is ever accomplished quickly.