On Wed, Nov 26, 2008 at 07:02:11PM -0500, Miles Nordin wrote:
> (2) The FMA model of collecting telemetry, taking it into
> user-space, chin-strokingly contemplating it for a while, then
> decreeing a diagnosis, is actually a rather limited one. I can
> think of two kinds of limit:

As mentioned previously, this is not an accurate description of what's
going on. FMA allows diagnosis to happen at the detector when the
telemetry is conclusive and cross-domain or predictive analysis isn't
required. This is exactly what ZFS does on recent nevada builds. If a
drive is pathologically broken (i.e. a reopen fails, or reads and writes
to the label fail), it will *immediately* fail the drive and not wait
for any further diagnosis from FMA.

For drives that randomly fail I/Os or take a long time, but otherwise
respond to basic requests, ZFS is often in no better position to perform
a diagnosis in the kernel. And as of build 101, ZFS behaves much better
in these circumstances by not aggressively retrying commands before
exhausting all other options.
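
To make the split described above concrete, here is a rough sketch in
Python. It is not the ZFS or FMA code: ProbeResult, Verdict, classify and
all field names are hypothetical, invented only to mirror the two cases
(conclusive failure at the detector versus ambiguous telemetry handed to
FMA).

```python
from dataclasses import dataclass
from enum import Enum


class Verdict(Enum):
    FAULT_NOW = "fault the drive immediately at the detector"
    DEFER_TO_FMA = "post telemetry and let FMA diagnose"
    HEALTHY = "no action"


@dataclass
class ProbeResult:
    # Hypothetical fields; they only mirror the cases named in the text.
    reopen_ok: bool       # did reopening the device succeed?
    label_io_ok: bool     # did reads and writes to the label succeed?
    io_errors: int = 0    # sporadic I/O failures observed
    slow_ios: int = 0     # I/Os that took unusually long


def classify(probe: ProbeResult) -> Verdict:
    # Conclusive telemetry: the drive cannot even be reopened or
    # labelled, so no further analysis is needed.
    if not probe.reopen_ok or not probe.label_io_ok:
        return Verdict.FAULT_NOW
    # Ambiguous telemetry: the drive answers basic requests but
    # misbehaves, so the decision is left to the diagnosis engine.
    if probe.io_errors > 0 or probe.slow_ios > 0:
        return Verdict.DEFER_TO_FMA
    return Verdict.HEALTHY
```

For example, classify(ProbeResult(reopen_ok=False, label_io_ok=True))
yields Verdict.FAULT_NOW, while a drive that reopens fine but reports a
few failed I/Os yields Verdict.DEFER_TO_FMA.
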
Are you running your experiments on build 101 or later? And what
experiments are you running? Drawing conclusions from previous
experience or reports is basically pointless given the amount of change
that has occurred recently (Jeff's putback wasn't nicknamed "SPA 3.0"
for nothing). While there are no doubt more rough edges, we have
incorporated much of the previous feedback into new behavior that should
provide a much improved experience.

P.S. I'm also not sure that B_FAILFAST behaves in the way you think it
does. My reading of sd.c seems to imply that much of what you
suggest is actually how it currently behaves, but you should
probably bring up the issue on storage-discuss where you will find
more experts in this area.

Eric Schrock, Fishworks http://blogs.sun.com/eschrock