The concepts of scalability and redundancy go hand-in-hand. Building an environment that is capable of scaling out offers the ability to fine-tune how much failure you can withstand. There is a dizzying number of approaches to redundancy (power, network, storage, server, data, backup and replication, disaster recovery, load balancing, site redundancy), but for today we’re going to hit the basics of one of the most fundamental: storage. More specifically, RAID: a Redundant Array of Independent Disks.
RAID provides a lot of bang for the buck. For what is in most cases a small investment compared to other options, you can provide significant protection against one of the most common forms of failure.
If your power or network fails without redundancy, your site is down. Outages like this can be extremely expensive, but they’re also typically fast to recover from. Redundancy for entire servers and sites is important to consider, but requires significant planning to address fully. Backup and replication are important too, but they bring complexity of their own and can have extended restoration times.
However, if you have a disk that fails without RAID, you’ve just lost data.
Replacing that drive won’t bring the data back; you’ll need backups (you do have tested, up-to-date backups, right?), you’ll need a plan to rebuild the OS and restore those backups, and you’ll have extended downtime while you rebuild. A non-redundant drive loss in turn means you likely have a server outage, as a server without its data or OS is a very expensive brick. RAID is one of the most cost-effective improvements you can make to an environment: for the cost of disks and an adapter (all typically a fraction of the cost of a server), you can protect against failure, add capacity, and improve performance.
RAID isn’t a panacea – there’s still a lot that can go wrong that it won’t protect you from. Even just within the realm of storage, human error, application bugs, or filesystem corruption can still make your disk array useless (RAID is not a replacement for backups). Much like power conditioning, it won’t replace the need for backup systems, but it can lessen your reliance on them and exposure in case of failure. There are also some applications that don’t lend themselves to RAID by their nature – Hadoop, ZFS, and other self-managing systems typically need direct disk access and provide their own redundancy and scalability features.
Capacity, Redundancy, Performance
Single disks suffer from many problems:
- They’re low in capacity, maxing out at a “mere” few terabytes.
- They’re failure-prone, especially traditional spinning hard disks with their tight tolerances and moving parts, but even solid state drives fail over time.
- Single disks can also only provide so much performance. Hard drives have been a known bottleneck for many years, and even SSDs can be constrained by their interface and internal design, with upper limits that are easy to hit with modern applications.
All of these limitations are unacceptable for critical business infrastructure.
RAID allows us to address these concerns by spreading storage across a number of disks, harnessing their combined capacity and performance and enabling redundancy. Multiple disks are teamed together, providing features greater than the sum of their parts. The price, in the redundant levels, is usable storage that is less than the sum of the disks' raw capacities, because some space is set aside for redundancy. Redundancy is provided via a number of algorithms and methods, but the ultimate goal is to write additional data that allows reconstructing any data lost due to a drive failure. Different layouts, or RAID levels, allow one to optimize storage for a specific purpose. There are several different RAID levels in use today, the most common being 0, 1, 5, 6, and 10.
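To make the "write additional data" idea a little more concrete, here is a minimal Python sketch of XOR parity, one common way of reconstructing a lost disk, alongside a rough usable-capacity rule for the common levels. Treat it as an illustration rather than how any particular controller works: the function names (`xor_blocks`, `usable_disks`) and the sample data are made up for this example, RAID 10 is assumed to use two-way mirrored pairs, all disks are assumed to be the same size, and minimum disk counts per level are not checked.

```python
from functools import reduce


def xor_blocks(blocks):
    """XOR equal-length byte blocks together, column by column."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))


def usable_disks(level, n):
    """Disks' worth of usable space for n equal disks (hypothetical helper)."""
    if level == 0:
        return n          # striping only, no redundancy
    if level == 1:
        return 1          # every disk holds the same copy
    if level == 5:
        return n - 1      # one disk's worth of parity
    if level == 6:
        return n - 2      # two disks' worth of parity
    if level == 10:
        return n // 2     # striped two-way mirrors (assumption)
    raise ValueError(f"unsupported RAID level: {level}")


# Three data blocks, as if striped across three disks (contents are made up).
data = [b"AAAA", b"BBBB", b"CCCC"]
parity = xor_blocks(data)  # the extra data written for redundancy

# Simulate losing the second disk, then rebuild it from the survivors plus parity.
survivors = [data[0], data[2], parity]
rebuilt = xor_blocks(survivors)
assert rebuilt == data[1]

# Four disks in RAID 5 leave three disks' worth of usable space.
assert usable_disks(5, 4) == 3
```

The rebuild works because XOR-ing a value with itself cancels out, so combining the parity with every surviving block leaves only the missing one. A single parity block can only recover one lost disk per stripe, which is why levels that tolerate more failures give up more capacity.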
In closing, RAID provides a lot of benefits for a relatively small investment. You’re protected against one of the most common types of failure, and the one with the worst consequences: loss of data. Not only does it protect data, it can also enhance the scalability and performance of your underlying storage. There are many different ways to deploy RAID, each with its own set of tradeoffs, which we’ll examine next.