
Chris Pepper: Isilon Cluster

Our old bulk storage is Apple Xserve RAIDs. They are discontinued and service contracts are expiring, so we have been evaluating small-to-medium storage options for some time. Our more modern stuff is a mix of Solaris 10 (ZFS) on Sun X4500/X4540 chassis (48 * 1tb SATA; discontinued), and Nexsan SATABeasts (42 SATA drives, either 1tb or 2tb) attached to Linux hosts, with ext3 filesystems. We are not buying any more Sun hardware or switching to FreeBSD for ZFS, and ext4 does not yet support filesystems over 16tb. Breaking up a nice large array into a bunch of 16tb filesystems is annoying, but moving (large) directories between filesystems is really irritating.

We eventually decided on a 4-node cluster of Isilon IQ 32000X-SSD nodes. Each ISI36 chassis is a 4U (7" tall) server with 24 3.5" drive bays on the front and 12 on the back. In our 32000X-SSD models, bays #1-4 are filled with SSDs (apparently 100gb each, currently usable only for metadata) and the other 32 bays hold 1tb SATA drives, thus the name. Each of our nodes has 2 GE ports on the motherboard and a dual-port 10GE card.

Isilon's OneFS operating system is based on FreeBSD, with their proprietary cluster filesystem and extra bits added. OneFS is cache coherent: inter-node lookups are handled over an InfiniBand (DDR?) back end, so any node can serve any request; most RAM on the nodes is used as cache. Rather than traditional RAID 5 or 6, the Isilon cluster stripes data 'vertically' across nodes, so it can continue to operate despite loss of an entire node. This means an Isilon cluster must consist of at least 3 matching nodes, just like a RAID5 must consist of at least 3 disks. Unfortunately, this increases the initial purchase cost considerably, but cost per terabyte decreases as node count grows, and the incremental system administration burden per node is much better than linear.
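The economics scale roughly like single-parity RAID, but across nodes. As a hypothetical back-of-the-envelope sketch (assuming one node's worth of protection data per stripe, and ignoring SSDs and metadata overhead -- this is not Isilon's actual FlexProtect accounting):

```python
# Hypothetical model: assume each stripe dedicates `parity` nodes' worth of
# space to protection, so the usable fraction of raw capacity is
# (nodes - parity) / nodes, which grows toward 1 as the cluster grows.
def usable_fraction(nodes: int, parity: int = 1) -> float:
    if nodes <= parity:
        raise ValueError("a cluster needs more nodes than parity units")
    return (nodes - parity) / nodes

if __name__ == "__main__":
    for n in (3, 4, 8, 16):
        print(f"{n:2d} nodes: {usable_fraction(n):.0%} of raw capacity usable")
```

Under these assumptions, a minimum 3-node cluster gets 67% of raw capacity, a 4-node cluster 75%, and a 16-node cluster 94% -- which is the sense in which cost per terabyte improves with node count.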

Routine administration is managed through the web interface, although esoteric options require the command line. Isilon put real work into the Tab completion dictionaries. This is quite helpful when exploring the command line interface, but the (zsh-based) completions are not complete -- and unfortunately, neither are the --help messages or the manual pages.

There are many good things about Isilon.

Pros

  • Single filesystem & namespace. This sounds minor but is essential for coping with large data sets. Folders can be arbitrarily large and all capacity is available to all users/shares, subject to quotas.
  • Cost per terabyte decreases with node count, as parity data becomes a smaller proportion of total disk capacity.
  • Aggregate performance increases with node count -- total cache increases, and number of clients per server is reduced.
  • Administration burden is fairly flat with cluster growth.
  • The FlexProtect system (based on classic RAID striping-with-parity and mirroring, but between nodes rather than within nodes/shelves) is flexible and protects against whole-node failure.
  • NFS and CIFS servers are included in the base price.
  • Isilon's web UI is reasonably simple, but exposes significant power.
  • The command line environment is quite capable, and Tab completion improves discoverability.
  • Quotas are well designed, and flexible enough to use without too much handholding for exceptions.
  • Snapshots are straightforward and very useful. They are comparable to ZFS snapshots -- much better than Linux LVM snapshots (ext3 does not support snapshots directly).
  • The nodes include NVRAM and battery backup for safe high-speed writes.
  • Nodes are robust under load. Performance degrades predictably as load climbs, and we don't have to worry about pushing so hard the cluster falls over.
  • Isilon generally handles multiple network segments with aplomb.
  • The storage nodes provide complete services -- they do not require Linux servers to front-end services, or additional high availability support.
  • The disks are hot swap, and an entire chassis can be removed for service without disrupting cluster services.
  • Because the front end is gigabit Ethernet (or 10GE), an Isilon storage cluster can serve an arbitrarily large number of clients without expensive fibre channel HBAs and switches.

And, of course, some things are less good.

Cons

  • Initial/minimum investment is high: 3 matching nodes, 2 InfiniBand switches, and licenses.
  • Several additional licenses are required for full functionality.
  • Isilon is not perfectionistic about the documentation.
  • Isilon is not as invested in the supporting command-line environment as I had hoped.
  • The round-robin load balancing works by delegating a subdomain to the Isilon cluster. Organizationally, this might be complicated.
  • CIFS integration requires AD access for accounts. This might also be logistically difficult.
  • Usable capacity is unpredictable and varies based on data composition.
  • There are always two different disk utilization numbers: actual data size, and including protection. This is confusing compared to classic RAID, where users only see unique data size.
  • There is no good way for users to identify which node they're connected to. Administrators can determine this, but only awkwardly, and it is generally not worth going beyond the basic web charts.
  • Support can be frustrating.
    • We often get responses from many people on the same case, and rehashing the background repeatedly wastes time.
    • Some reps are very good; but some are poor, with wrong answers, pointless instructions, and a disappointing lack of knowledge about the technology and products.
    • We are frequently asked for system name & serial number, and asked to upload a status report with isi_gather_info -- even when this is all already on file.
    • Minor events trigger email asking if we need help, even when we're in the middle of scheduled testing.
  • The cluster is built of off-the-shelf parts, and the integration is not always complete. For instance, we are not alerted of problems with an InfiniBand switch, because things like a faulted PSU are not visible to the nodes and not logged.
  • Many commands truncate output to 80 columns -- even when the terminal is wider. To see the full output, add -w.
  • When the system is fully up, the VGA console does not show a prompt. This makes it harder to determine whether a node has booted successfully.
  • There is only one bit of administrative access control: when users log in, they either have access to the full web interface and command-line tools, or they don't. There is no read-only or 'operator' mode.
  • Running out of space (or even low on space) is apparently dangerous.
  • One suggestion was to reserve one node's worth of disks as free space, so the whole cluster can run with a dead node. In a 4-node configuration, reserving 25% of raw space for robustness (in addition to 25% for parity) would mean 50% utilization at best, which is generally not feasible. In fairness, it is rare for a storage array to even attempt to work around a whole shelf failure, but most (non-Isilon) storage shelves are simple enclosures with fewer and simpler failure modes...
  • SmartConnect is implemented as a DNS server, but it's incomplete -- it only responds to A record requests, which causes errors when programs like host attempt other queries.
  • The front panels are finicky. The controls are counterintuitive, the LED system is prone to bizarre (software) failure modes, and removing the front panel to access the disks raises an obscure but scary alert.
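The two-utilization-numbers issue shows up directly in du, where apparent (logical) size and allocated (block) size already differ on ordinary filesystems; on an Isilon mount the gap also includes protection overhead. A minimal demonstration with GNU coreutils du on a Linux client (the temporary directory is just for illustration):

```shell
# Create a tiny file and compare the two sizes du can report.
dir=$(mktemp -d)
head -c 100 /dev/zero > "$dir/tiny"      # a 100-byte file

du -s --apparent-size "$dir"             # logical bytes, shown in 1K units
du -s "$dir"                             # allocated blocks (typically larger)

rm -rf "$dir"
```

The second number is what classic tools show by default, and it is the one that balloons once per-file protection data is counted.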

Notes

  • On Isilon nodes, use du -Sl to get size without protection overhead. On Linux clients, use du --apparent-size.
  • Client load balancing is normally managed via DNS round robin, with the round robin addresses automatically redistributed in case of a node failure. This is less granular and balanced than you'd get from a full load balancer, but much simpler.
  • EMC has bought Isilon. I'm not sure what the impact will be, but I am not confident this will be a good thing over the long term.
  • In BIND (named), subdomain delegation is incompatible with forwarding. Workaround: add "forwarders { };" to the zone containing the Isilon NS record.
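The BIND workaround above, sketched as a hypothetical named.conf fragment (the zone and file names are placeholders, not our real configuration):

```
// Parent zone containing the NS record that delegates a subdomain to the
// Isilon SmartConnect service. The empty forwarders statement overrides any
// global forwarders, which would otherwise prevent BIND from following the
// delegation.
zone "example.com" {
    type master;
    file "db.example.com";
    forwarders { };    // disable forwarding for this zone
};
```

The delegation itself is an ordinary NS record in that zone file, pointing the storage subdomain at the cluster's SmartConnect service address.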

Future

  • All that said, we are getting more Isilon storage -- it seems like the best fit for our requirements.
