Thursday, November 17, 2011

HAST and ZFS

There’s a nice tutorial on using HAST (Highly Available STorage) with UFS and ucarp.  That’s very nice, but in my failover scenario I can’t use UFS; a fsck would take too long, and a background fsck would be most likely to lose the data I’m most likely to need.  And FreeBSD comes with a kernel-side CARP implementation; why would I use the userland implementation instead?  So: the tutorial is great, except it doesn’t do what I want.  I’ll attack this problem in two phases:  One, get HAST with ZFS running, and experiment with it.  Two, get CARP failover to trigger HAST failover automatically.  (I believe I can use devd for CARP-initiated failover, but I’ll need to do further research on that.  That’ll be another posting.)  Today I’m experimenting with HAST and ZFS.
This is for a skunkworks project, so the machines are named “skunk1” and “skunk2”.  Each has two network interfaces:  one facing the public and the other connected with a crossover cable to the other machine.  The public interface will provide public services and CARP failover, while the private interface is reserved for HAST and pfsync.  The machines each have two disk slices:  da0s1 has the operating system, while da0s2 is reserved for ZFS.
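For concreteness, the private-link addressing works out to something like this in rc.conf; the interface name (em1) and the /24 netmask are just placeholders for whatever your hardware and network actually use:
# skunk1:
ifconfig_em1="inet 192.168.0.1 netmask 255.255.255.0"
# skunk2:
ifconfig_em1="inet 192.168.0.2 netmask 255.255.255.0"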
Both machines have an identical /etc/hast.conf:
resource mirror {
    on skunk1 {
        local /dev/da0s2
        remote 192.168.0.2
    }
    on skunk2 {
        local /dev/da0s2
        remote 192.168.0.1
    }
}
On both machines, I must create the HAST resource and start the HAST daemon.
skunk1# hastctl create mirror
skunk1# hastd
Tell one node that it’s primary on the resource “mirror”, and the other that it’s secondary.
skunk1# hastctl role primary mirror
skunk2# hastctl role secondary mirror
Your initial sync shouldn’t take long at all: there’s no data yet.  Use hastctl status, shown below, to check your mirror.  Before that, though, be sure HAST starts after a reboot by setting hastd_enable=YES in /etc/rc.conf on both machines:
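hastd_enable="YES"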
skunk1# hastctl status mirror
mirror:
  role: primary
  provname: mirror
  localpath: /dev/da0s2
  extentsize: 2097152
  keepdirty: 64
  remoteaddr: 192.168.0.2
  replication: memsync
  status: complete
  dirty: 17142120448 bytes
The “status” shows complete.  We’re good.
Now put a filesystem on that device.  On the primary node ONLY, create a zpool.
skunk1# zpool create failover /dev/hast/mirror
This creates a ZFS at /failover, on top of a HAST device.
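Before the reboot test, a quick sanity check on the new pool doesn’t hurt (just the commands; I’ll spare you the output):
skunk1# zpool status failover
skunk1# df -h /failover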
To check my work, I reboot and check on my ZFS device:
skunk1# zfs list
no datasets available
skunk1# zpool list
NAME       SIZE   USED  AVAIL    CAP  HEALTH  ALTROOT
failover      -      -      -      -  FAULTED  -
skunk1#
Oh.  This isn’t good.  It turns out that the /dev/hast directory is missing: skunk1 doesn’t know that it’s in charge after the reboot.  I rerun “hastctl role primary mirror”, /dev/hast reappears, and once I import the pool my ZFS comes back.  Verify that your ZFS shows up in “df” output as a final confirmation, and copy some files to that partition.
To failover, I forcibly export the ZFS pool, switch the HAST roles, and forcibly import the ZFS on the backup machine.
skunk1# zpool export -f failover
skunk1# hastctl role secondary mirror
skunk2# hastctl role primary mirror
skunk2# zpool import -f failover
skunk2# df -h
Filesystem     Size    Used   Avail Capacity  Mounted on
failover        39G     55M     39G     0%    /failover
My /failover filesystem is now on the backup node.  My test files are on it.
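Failing back later is the same dance in reverse, assuming both nodes are healthy and have finished syncing:
skunk2# zpool export -f failover
skunk2# hastctl role secondary mirror
skunk1# hastctl role primary mirror
skunk1# zpool import -f failover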
Making something work under carefully controlled conditions is one thing:  what happens when things go bad?  For example, let’s reboot the backup node and make some changes on the primary node while the backup is rebooting.  Does the cluster recover?  During the secondary’s reboot, “hastctl status” on the primary reports “status: degraded”, which is what I’d expect.  Once the secondary comes up, I have to log in and run “hastctl role secondary mirror” before the synchronization begins.  Again, this is what I’d expect.
Now for a worse failure scenario.  I yank the power on the primary.  What happens?  Nothing.  I have to run “hastctl role primary mirror” for the secondary to pick up, and then run “zpool import -f failover” to get my ZFS to appear.  But the files are there.  And when I restore power to the old primary and tell it to be the secondary, everything works.
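Spelled out, the manual takeover on the survivor, and the rejoin once the old primary has power again, look like this:
skunk2# hastctl role primary mirror
skunk2# zpool import -f failover
...and once skunk1 is back up:
skunk1# hastctl role secondary mirror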
HAST with ZFS seems to be just fine.
In learning about HAST, I more than once got the error “Split-brain detected.”  This meant that I failed to consistently maintain one host as primary and one as secondary.  To fix this, run “hastctl create mirror” on the backup node, as sketched below.  This deletes and recreates the secondary HAST node, and the primary then syncs the data to the backup from scratch.  For a 40GB pool, this took about 25 minutes.  Both hosts were running on VMware, however, and writing to the same disk.  Be sure to test this with real hardware in your environment for realistic numbers.  The lesson here is:  don’t mirror anything larger than you must, because in case of trouble you must drag the whole image across the network to the other machine.
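On the backup node, that recovery amounts to something like this (the second command tells hastd it’s the secondary again so the full resync can begin):
skunk2# hastctl create mirror
skunk2# hastctl role secondary mirror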
