On 9-Nov-07, at 2:45 AM, can you guess? wrote:
>>> Au contraire: I estimate its worth quite accurately from the
>>> undetected error rates reported in the CERN "Data Integrity" paper
>>> published last April (first hit if you Google 'cern "data
>>> integrity"').
>>>
>>>> While I have yet to see any checksum error reported by ZFS on
>>>> Symmetrix arrays or FC/SAS arrays, with some other "cheap" HW
>>>> I've seen many of them
>>>
>>> While one can never properly diagnose anecdotal issues off the cuff
>>> in a Web forum, given CERN's experience you should probably check
>>> your configuration very thoroughly for things like marginal
>>> connections: unless you're dealing with a far larger data set than
>>> CERN was, you shouldn't have seen 'many' checksum errors.
>>
>> Well, single-bit errors may be rare in hard drives under normal
>> operation, but from a systems perspective, data can be corrupted
>> anywhere between disk and CPU.
>
> The CERN study found that such errors (if they found any at all,
> which they couldn't really be sure of) were far less common than
> the manufacturer's spec rate for plain old detectable but unrecoverable
> bit errors, or than the one hardware problem that they discovered (a
> disk firmware bug that appeared related to the unusual demands and
> perhaps negligent error reporting of their RAID controller and
> caused errors at a rate about an order of magnitude higher than the
> nominal spec for detectable but unrecoverable errors).
>
> This suggests that in a ZFS-style installation without a hardware
> RAID controller they would have experienced at worst a bit error
> about every 10^14 bits or 12 TB
And how about FAULTS? hw/firmware/cable/controller/ram/...
> (the manufacturer's spec rate for detectable but unrecoverable
> errors) - though some studies suggest that the actual incidence of
> 'bit rot' is considerably lower than such specs. Furthermore,
> simply scrubbing the disk in the background (as I believe some open-
> source LVMs are starting to do and for that matter some disks are
> starting to do themselves) would catch virtually all such errors in
> a manner that would allow a conventional RAID to correct them,
> leaving a residue of something more like one error per PB that ZFS
> could catch better than anyone else save WAFL.
>
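For scale, the figures quoted above can be worked out directly. A rough sketch in Python, assuming decimal units (1 TB = 10^12 bytes) and taking the 10^14-bit spec rate at face value; the helper name is made up for illustration:

```python
# Hypothetical helper: average number of bytes read per unrecoverable
# bit error, given a spec of "one error per N bits".
def bytes_per_error(bits_per_error: float) -> float:
    return bits_per_error / 8

TB = 10**12  # decimal terabyte (assumption)
PB = 10**15  # decimal petabyte (assumption)

spec_bytes = bytes_per_error(1e14)

# Expected number of errors when reading one full petabyte at the spec rate.
errors_per_pb = PB / spec_bytes

print(f"one error per {spec_bytes / TB:.1f} TB read")
print(f"about {errors_per_pb:.0f} expected errors per PB read")
```

So "every 10^14 bits" is roughly 12.5 TB, and the same spec rate implies on the order of 80 such errors per PB read.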
>> I know you're not interested in anecdotal evidence,
>
> It's less that I'm not interested in it than that I don't find it
> very convincing when actual quantitative evidence is available that
> doesn't seem to support its importance. I know very well that
> things like lost and wild writes occur, as well as the kind of
> otherwise undetected bus errors that you describe, but the
> available evidence seems to suggest that they occur in such small
> numbers that catching them is of at most secondary importance
> compared to many other issues. All other things being equal, I'd
> certainly pick a file system that could do so, but when other
> things are *not* equal I don't think it would be a compelling
> attraction.
>
>> but I had a box that was randomly corrupting blocks during DMA.
>> The errors showed up when doing a ZFS scrub and I caught the
>> problem in time.
>
> Yup - that's exactly the kind of error that ZFS and WAFL do a
> perhaps uniquely good job of catching.
WAFL can't catch them all: it's distantly isolated from the CPU end.
> Of course, buggy hardware can cause errors that trash your data
> in RAM beyond any hope of detection by ZFS, but (again, other
> things being equal) I agree that the more ways you have to detect
> them, the better. That said, it would be interesting to know who
> made this buggy hardware.
>
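The reason a scrub catches this class of error is that the checksum is kept apart from the block it covers, so a block damaged in flight no longer matches on verification. A minimal Python sketch of the idea, not ZFS's actual code: the function names are made up, and SHA-256 merely stands in for whatever checksum the filesystem is configured to use.

```python
import hashlib
from typing import Tuple

def write_block(data: bytes) -> Tuple[bytes, bytes]:
    """Return the block plus a checksum that is stored separately from it."""
    return data, hashlib.sha256(data).digest()

def verify_block(data: bytes, checksum: bytes) -> bool:
    """Recompute the checksum on read (or scrub) and compare."""
    return hashlib.sha256(data).digest() == checksum

block, cksum = write_block(b"A" * 4096)

# Simulate a single bit flipped somewhere between disk and CPU.
corrupted = bytearray(block)
corrupted[100] ^= 0x01

print(verify_block(block, cksum))             # True
print(verify_block(bytes(corrupted), cksum))  # False
```

This does nothing for data that is already corrupt in RAM before the checksum is computed, which is the limitation conceded in the quoted paragraph.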
> ...
>
>> Like others have said for big business: as a consumer I can
>> reasonably comfortably buy off-the-shelf cheap controllers and
>> disks, and know that should any part of the system be flaky enough
>> to cause data corruption the software layer will catch it, which
>> both saves money and creates peace of mind.
>
> CERN was using relatively cheap disks
Don't forget every other component in the chain.
> and found that they were more than adequate (at least for any
> normal consumer use) without that additional level of protection:
> the incidence of errors, even including the firmware errors which
> presumably would not have occurred in a normal consumer
> installation lacking hardware RAID, was on the order of 1 per TB -
> and given that it's really, really difficult for a consumer to come
> anywhere near that much data without most of it being video files
> (which just laugh and keep playing when they discover small errors)
> that's pretty much tantamount to saying that consumers would
> encounter no *noticeable* errors at all.
>
> Your position is similar to that of an audiophile enthused about a
> measurable but marginal increase in music quality and trying to
> convince the hoi polloi that no other system will do: while other
> audiophiles may agree with you, most people just won't consider it
> important - and in fact won't even be able to distinguish it at all.
Data integrity *is* important.
--Toby
>
> - bill
>
>
> This message posted from opensolaris.org
> _______________________________________________
> zfs-discuss mailing list
> zfs-discuss@xxxxxxxxxxxxxxx
> http://mail.opensolaris.org/mailman/listinfo/zfs-discuss