On Jun 17, 2010, at 4:35 PM, Garrett D'Amore wrote:
> I actually started with DKIOCGSTATE as my first approach, modifying
> sd.c. But I had problems because what I found is that nothing was
> issuing this ioctl properly except for removable/hotpluggable media (and
> the SAS/SATA controllers/frameworks are not indicating this). I tried
> overriding that in sd.c but I still found that there was another bug
> where the HAL module that does the monitoring does not monitor devices
> that are present and in use (mounted filesystems) during boot. I think
> HAL was designed for removable media that would not be automatically
> mounted by zfs during boot. I didn't analyze this further.
ZFS issues the ioctl() from vdev_disk.c. It is up to the HBA drivers to
correctly represent the DEV_GONE state (and is known to work with a variety of
> Is "sd.c" considered a legacy driver? It's what is responsible for the
> vast majority of disks. That said, perhaps the problem is the HBA
It's the HBA drivers.
> So how do we distinguish "removed on purpose" as opposed to "removed by
> accident, faulted cable, or other non administrative issue?" I presume
> that a removal initiated via cfgadm or some other tool could put the ZFS
> vdev into an offline state, and this would prevent the logic from
> accidentally marking the device FAULTED. (Ideally it would also mark
> the device "REMOVED".)
If there is no physical connection (detected to the best of the driver's
ability), then it is removed (REMOVED is different from OFFLINE). Surprise
device removal is not a fault - Solaris is designed to support removal of disks
at any time without administrative intervention. A fault is defined as broken
hardware, which is not the case for a removed device.
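
The distinction above can be sketched in a few lines of C. This is only an
illustration of the policy as described here, not the actual ZFS or FMA code:
every type, field, and function name below is hypothetical, and the choice to
check the administrative offline flag first is an assumption of the sketch.

```c
#include <stdbool.h>

/* Hypothetical status values for illustration; not the real ZFS vdev states. */
typedef enum {
    DEV_HEALTHY,
    DEV_OFFLINE,  /* taken offline on purpose by an administrator */
    DEV_REMOVED,  /* no physical connection, as far as the driver can tell */
    DEV_FAULTED   /* present, but diagnosed as broken hardware */
} dev_status_t;

/* Hypothetical summary of what the driver and administrator have reported. */
typedef struct {
    bool admin_offline;       /* explicit administrative offline request */
    bool physically_present;  /* driver's best-effort presence detection */
    bool hardware_broken;     /* a diagnosis concluded the hardware is bad */
} dev_observation_t;

/*
 * Classify a device. A missing device is REMOVED, never FAULTED: a fault
 * requires hardware that is present and broken.
 */
dev_status_t classify_device(const dev_observation_t *obs)
{
    if (obs->admin_offline)
        return DEV_OFFLINE;   /* assumption: admin intent takes precedence */
    if (!obs->physically_present)
        return DEV_REMOVED;   /* surprise removal is not a fault */
    if (obs->hardware_broken)
        return DEV_FAULTED;
    return DEV_HEALTHY;
}
```

Note that a device that is both absent and previously suspected broken still
comes out as REMOVED in this sketch, since nothing can be diagnosed about
hardware that is not there.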
There are projects underway to a) generate faults for devices that are
physically present but unable to attach and b) perform topology-based diagnosis
to detect bad cables, expanders, etc. This is a complicated problem and not
always tractable, but can be solved reasonably well for modern systems and
A completely orthogonal feature is the ability to represent extended periods of
device removal as a defect. While removing a disk is not itself a defect,
leaving your pool running minus one disk for hours/days/weeks is clearly broken.
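
As a rough sketch of that idea, the check could be as simple as comparing the
time since removal against a threshold. The function name and the one-hour
threshold below are hypothetical, chosen only for illustration; no such
threshold is specified anywhere in this thread.

```c
#include <stdbool.h>
#include <time.h>

/* Hypothetical threshold: how long a device may stay removed before the
 * absence itself is reported as a defect. One hour is an arbitrary choice. */
#define REMOVAL_DEFECT_THRESHOLD_SECS (60 * 60)

/*
 * Return true if a device removed at 'removed_at' has been gone long enough,
 * as of 'now', to count as a defect. The removal itself is never a defect;
 * only its duration is.
 */
bool removal_is_defect(time_t removed_at, time_t now)
{
    return difftime(now, removed_at) >= REMOVAL_DEFECT_THRESHOLD_SECS;
}
```

A real implementation would presumably make the threshold tunable and reset
the clock when the device is reinserted.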
If you have a solution that correctly detects devices as REMOVED for a new
class of HBAs/drivers, that'd be more than welcome. If you choose to represent
missing devices as faulted in your own third-party system, that's your own
prerogative, but it's not the current Solaris FMA model.
Hope that helps,
Eric Schrock, Fishworks http://blogs.sun.com/eschrock
zfs-discuss mailing list