Heh, yeah, I've thought the same kind of thing in the past. The
problem is that the argument doesn't really work for system admins.
As far as I'm concerned, the 7000 series is a new hardware platform,
with relatively untested drivers, running a software solution that I
know is prone to locking up when hardware faults are handled badly by
drivers. Fair enough, that actual solution is out of our price range,
but I would still be very dubious about purchasing it. At the very
least I'd be waiting a year for other people to work the kinks out of
Which is a shame, because ZFS has so many other great features it's
easily our first choice for a storage platform. The one and only
concern we have is its reliability. We have snv_106 running as a test
platform now. If I felt I could trust ZFS 100% I'd roll it out
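
To make concrete the kind of timeout-and-drop layer argued for in the quoted message below, here is a rough sketch in Python. It is only an illustration under my own assumptions, not how ZFS or any Solaris driver actually works: `call_with_timeout`, `GuardedMirror`, `timeout_s` and `max_failures` are all made-up names, and each "device" is just a Python callable standing in for a read request.

```python
import threading
import time


def call_with_timeout(fn, timeout_s):
    """Run fn() in a daemon thread; raise TimeoutError if it does not finish in time."""
    result = {}

    def worker():
        try:
            result["value"] = fn()
        except Exception as exc:  # pass device errors back to the caller
            result["error"] = exc

    t = threading.Thread(target=worker, daemon=True)
    t.start()
    t.join(timeout_s)
    if t.is_alive():
        # The request is abandoned rather than waited on forever.
        raise TimeoutError("device did not respond in time")
    if "error" in result:
        raise result["error"]
    return result["value"]


class GuardedMirror:
    """Hypothetical mirror reader: a slow or failing device is skipped and,
    after max_failures consecutive failures, marked faulted and no longer used."""

    def __init__(self, devices, timeout_s=0.1, max_failures=2):
        self.devices = dict(devices)  # name -> callable performing the read
        self.timeout_s = timeout_s
        self.max_failures = max_failures
        self.failures = {name: 0 for name in self.devices}
        self.faulted = set()

    def read(self):
        for name, fn in self.devices.items():
            if name in self.faulted:
                continue
            try:
                data = call_with_timeout(fn, self.timeout_s)
            except Exception:
                self.failures[name] += 1
                if self.failures[name] >= self.max_failures:
                    self.faulted.add(name)  # drop the bad device
                continue
            self.failures[name] = 0
            return data
        raise IOError("no healthy device responded")


if __name__ == "__main__":
    def hung_disk():
        time.sleep(0.5)  # simulates a driver that never answers in time
        return b"late"

    mirror = GuardedMirror({"disk0": hung_disk, "disk1": lambda: b"data"},
                           timeout_s=0.05)
    print(mirror.read(), mirror.faulted)
    print(mirror.read(), mirror.faulted)
```

The point of the sketch is only that a hung device costs the caller one timeout per attempt and is then taken out of rotation, so the rest of the pool keeps serving reads. A real implementation would also have to cancel or fence the outstanding request, which this toy does not attempt.
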
On Thu, Feb 12, 2009 at 4:25 PM, Tim <tim@xxxxxxxxx> wrote:
> On Thu, Feb 12, 2009 at 9:25 AM, Ross <myxiplx@xxxxxxxxxxxxxx> wrote:
>> This sounds like exactly the kind of problem I've been shouting about for
>> 6 months or more. I posted a huge thread on availability on these forums
>> because I had concerns over exactly this kind of hanging.
>> ZFS doesn't trust hardware or drivers when it comes to your data -
>> everything is checksummed. However, when it comes to seeing whether devices
>> are responding, and checking for faults, it blindly trusts whatever the
>> hardware or driver tells it. Unfortunately, that means ZFS is vulnerable to
>> any unexpected bug or error in the storage chain. I've encountered at least
>> two hang conditions myself (and I'm not exactly a heavy user), and I've seen
>> several others on the forums, including a few on x4500s.
>> Now, I do accept that errors like this will be few and far between, but
>> they still mean you have the risk that a badly handled error condition can
>> hang your entire server, instead of just one drive. Solaris can handle
>> things like CPUs or memory going faulty, for crying out loud. Its RAID
>> storage system had better be able to handle a disk failing.
>> Sun seem to be taking the approach that these errors should be dealt with
>> in the driver layer. And while that's technically correct, a reliable
>> storage system had damn well better be able to keep the server limping along
>> while we wait for patches to the storage drivers.
>> ZFS absolutely needs an error handling layer between the volume manager
>> and the devices. It needs to time out items that are not responding, and it
>> needs to drop bad devices if they could cause problems elsewhere.
>> And yes, I'm repeating myself, but I can't understand why this is not
>> being acted on. Right now the error checking appears to be such that if an
>> unexpected or badly handled error condition occurs in the driver stack, the
>> pool or server hangs, whereas the expected behavior would be for just one
>> drive to fail. The absolute worst case scenario should be that an entire
>> controller has to be taken offline (and I would hope that the controllers in
>> an x4500 would be running separate instances of the driver software).
>> None of those conditions should be fatal. Good storage designs cope
>> with them all, and good error handling at the ZFS layer is absolutely vital
>> when you have projects like Comstar introducing more and more types of
>> storage device for ZFS to work with.
>> Each extra type of storage introduces yet more software into the equation,
>> and increases the risk of finding faults like this. While they will be
>> rare, they should be expected, and ZFS should be designed to handle them.
> I'd imagine for the exact same reason short-stroking/right-sizing isn't a
> "We don't have this problem in the 7000 series, perhaps you should buy one
> of those".
zfs-discuss mailing list