On Mon, 20 Dec 2010 11:27:41 PST Erik Trimble <erik.trimble@xxxxxxxxxx> wrote:
> The problem boils down to this:
> When ZFS does a resilver, it walks the METADATA tree to determine what
> order to rebuild things from. That means, it resilvers the very first
> slab ever written, then the next oldest, etc. The problem here is that
> slab "age" has nothing to do with where that data physically resides on
> the actual disks. If you've used the zpool as a WORM device, then, sure,
> there should be a strict correlation between increasing slab age and
> locality on the disk. However, in any reasonable case, files get
> deleted regularly. This means that, for a slab B written immediately
> after slab A, there is a high probability that it WON'T be physically
> near slab A.
> In the end, the problem is that using metadata order, while reducing the
> total amount of work to do in the resilver (as you only resilver live
> data, not every bit on the drive), increases the physical inefficiency
> for each slab. That is, seek time between cylinders begins to dominate
> your slab reconstruction time. In RAIDZ, this problem is magnified by
> both the much larger average vdev size vs mirrors, and the necessity
> that all drives containing a slab's information return that data before
> the corrected data can be written to the resilvering drive.
> Thus, current ZFS resilvering tends to be seek-time limited, NOT
> throughput limited. This is really the "fault" of the underlying media,
> not ZFS. For instance, if you have a raidZ of SSDs (where seek time is
> negligible, but throughput isn't), they resilver really, really fast.
> In fact, they resilver at the maximum write throughput rate. However,
> HDs are severely seek-limited, so that dominates HD resilver time.
You guys may be interested in a solution I used in a totally
different situation. There, an identical tree data structure
had to be maintained on every node of a distributed system.
When a new node was added, it needed to be initialized with
an identical copy before it could be put in operation. But
this had to be done while the rest of the system was
operational, and there might even be updates from a central node
during the `mirroring' operation. Some of these updates could
completely change the tree! Starting at the root was not
going to work since a subtree that was being copied may stop
existing in the middle and its space reused! In a way this is
a similar problem (but worse!). I needed something foolproof.
My algorithm started copying sequentially from the start. If
N blocks were already copied when an update came along,
updates of any block with block# > N were ignored (since the
sequential copy would get to them eventually). Updates of
any block# <= N were queued up (a further update to the same
block would overwrite the queued one, to reduce work).
Periodically they would be flushed out to the new node. This
was paced so as not to affect normal operation much.
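
To make the scheme above concrete, here is a minimal Python sketch of it.
This is an illustration, not the original code: the class and method
names are hypothetical, blocks are modelled as list entries, and block
numbers are 0-based, so "already copied" means block# < N rather than
block# <= N.

```python
class SequentialMirror:
    """Copy blocks to a new node in order while the source stays live.

    Hypothetical sketch: `source` and `target` are equal-length lists
    standing in for the live node and the node being initialized.
    """

    def __init__(self, source, target):
        self.source = source
        self.target = target
        self.copied = 0      # N: blocks [0, N) have been copied so far
        self.pending = {}    # block# -> latest data for copied blocks

    def copy_chunk(self, count):
        """Sequentially copy up to `count` more blocks."""
        end = min(self.copied + count, len(self.source))
        self.target[self.copied:end] = self.source[self.copied:end]
        self.copied = end

    def on_update(self, block_no, data):
        """Apply a live update to the source and track it if needed."""
        self.source[block_no] = data
        if block_no < self.copied:
            # Already copied: queue it. A later update to the same
            # block simply replaces the queued one.
            self.pending[block_no] = data
        # Otherwise ignore it: the sequential copy will reach it.

    def flush(self):
        """Push queued updates out to the new node."""
        for block_no, data in self.pending.items():
            self.target[block_no] = data
        self.pending.clear()

    def done(self):
        return self.copied == len(self.source) and not self.pending
```

A caller would alternate copy_chunk() and flush() at whatever pace
leaves enough bandwidth for normal operation, until done() is true.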
I should think a variation would work for active filesystems.
You sequentially read some amount of data from all the disks
from which data for the new disk is to be prepared, and write it
out sequentially. Each time, read enough data so that reading
time dominates any seek time. Handle concurrent updates as
above. If you dedicate N% of time to resilvering, the total
time to complete resilver will be 100/N times sequential read
time of the whole disk. (For example, 1TB disk, 100MBps io
speed, 25% for resilver => under 12 hours). How much worse
this gets depends on the amount of updates during resilvering.
At the time of resilvering your FS is more likely to be near
full than near empty so I wouldn't worry about optimizing the
mostly empty FS case.
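
The 100/N estimate above can be checked with a few lines of Python.
The function name is made up for illustration, and the sketch assumes
decimal units (1TB = 10^12 bytes, 100MBps = 10^8 bytes/s) and no
extra cost from concurrent updates.

```python
def resilver_hours(disk_bytes, bytes_per_sec, resilver_fraction):
    """Estimated hours to resilver a whole disk sequentially.

    resilver_fraction is the share of I/O time given to resilvering
    (0.25 means 25%), so the total is the sequential read time of the
    disk divided by that fraction.
    """
    sequential_secs = disk_bytes / bytes_per_sec
    return sequential_secs / resilver_fraction / 3600


# 1TB disk, 100MBps, 25% of the time dedicated to resilvering
hours = resilver_hours(1e12, 100e6, 0.25)
print(f"{hours:.1f} hours")  # prints "11.1 hours"
```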
zfs-discuss mailing list