zfs-discuss@opensolaris.org

Subject: Re: [zfs-discuss] zfs streams & data corruption
From: Miles Nordin
Date: Wed, 25 Feb 2009 13:08:38 -0500
>>>>> "jm" == Moore, Joe <joe.moore@xxxxxxxxxxx> writes:

    jm> This is correct.  The general term for these sorts of
    jm> point-in-time backups is "crash consistent".

phew, thanks, glad I wasn't talking out my ass again.

    jm> In-flight transactions (ones that have not been committed) at
    jm> the database level are rolled back.  Applications using the
    jm> database will be confused by this in a recovery scenario,
    jm> since transactions that were reported as committed are gone
    jm> when the database comes back.  But that's the case any time a
    jm> database moves "backward" in time.

hm.  I thought a database would not return success to the app until it
was actually certain the data was on disk, via fsync() or the like,
and that this is why databases like NVRAMs and slogs.  Are you saying
it's a common ``optimisation'' for a DBMS to worry about write
barriers only, not about flushing?
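For comparison, the durable-commit pattern I'd expect from a DBMS looks
roughly like this at the shell level (just a sketch; /tmp/commit.log is a
made-up illustrative path).  GNU dd's conv=fsync does not exit until the
data has been pushed to stable storage, which is the point at which a
database should report success to the client:

```shell
# Sketch: don't report "committed" until the data is known durable.
# /tmp/commit.log is an illustrative stand-in for a database WAL file.
printf 'BEGIN; INSERT ...; COMMIT;\n' |
    dd of=/tmp/commit.log conv=fsync status=none  # dd fsync()s before exiting
echo "commit durable"                             # only now tell the client
cat /tmp/commit.log
```

A DBMS that skipped the fsync step and only ordered its writes (barriers
without flushing) would be exactly the ``optimisation'' asked about above.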

    jm> Snapshots of a virtual disk are also crash-consistent.  If the
    jm> VM has not written its transactionally-committed data to disk
    jm> and is still holding it in volatile memory, that VM is not
    jm> maintaining its ACID requirements, and that's a bug in either
    jm> the database or in the OS running on the VM.

I'm betting mostly ``the OS running inside the VM'' and ``the virtualizer
itself''.  For the latter, from Toby's thread:

-----8<-----
If desired, the virtual disk images (VDI) can be flushed when the
guest issues the IDE FLUSH CACHE command. Normally these requests are
ignored for improved performance.
To enable flushing, issue the following command:
 VBoxManage setextradata VMNAME \
  "VBoxInternal/Devices/piix3ide/0/LUN#[x]/Config/IgnoreFlush" 0
-----8<-----

Virtualizers are able to take snapshots themselves without help from
the host OS, so I would expect at least those to work, and host
snapshots to be fixable.  VirtualBox has a ``pause'' feature---it
could pretend it's received a flush command from the guest, and flush
whatever internal virtualizer buffers it has to the host OS when
paused.

Also, a host snapshot is a little more forgiving than a host cord-yank,
because the snapshot will capture things applications like VBox have
written to files but not yet fsync()d.  So it's okay for snapshots,
but not for cord-yanks, if VBox never bothers to call fsync().  It's
just not okay that VBox might buffer data internally sometimes.

Even if that's all sorted, though, ``the OS running inside the
VM''---neither UFS nor ext3 sends these cache flush commands to
virtual drives.  At least for ext3, the story is pretty long:

 http://lwn.net/Articles/283161/
  So, for those that wish to enable them, barriers apparently are
  turned on by giving "barrier=1" as an option to the mount(8) command,
  either on the command line or in /etc/fstab:
   mount -t ext3 -o barrier=1 <device> <mount point>
  (but, does not help at all if using LVM2 because LVM2 drops the barriers)
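For reference, the /etc/fstab form of the same option would look something
like this (a sketch; the device and mount point are placeholders, not from
the article):

```shell
# Illustrative /etc/fstab entry enabling ext3 write barriers:
# <device>   <mount>  <type>  <options>             <dump> <pass>
/dev/sda3    /data    ext3    defaults,barrier=1    0      2
```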

ext3 gets away with it because drive write buffers are small enough
that flushing only the journal mostly works, and the journal is
written in LBA order, so except when it wraps around there's little
incentive for drives to re-order it.  But ext3's supposed ability to
mostly work okay without barriers depends on assumptions about
physical disks---the size of the write cache being <32MB, the
reordering algorithm being elevator-like---that probably don't apply
to a virtual disk, so a Linux guest OS very likely is ``broken''
w.r.t. taking these crash-consistent virtual disk snapshots.

And also a Solaris guest: we've been told UFS+logging expects the
write cache to be *off* for correctness.  I don't know if UFS is
worse at evading the problem than ext3, or if Solaris users are just
more conservative.  But with a virtual disk the write cache will
always be effectively on, no matter what simon-sez flags you pass to
that awful 'format' tool.  That was never on the bargaining table,
because there's no other way it can have remotely reasonable
performance.

Possibly the ``pause'' command would be a workaround for this because
it could let you force a barrier into the write stream yourself (one
the guest OS never sent) and then take a snapshot right after the
barrier with no writes allowed between barrier and snapshot.  If the
fake barrier is inserted into the stack right at the guest/VBox
boundary, then it should make the overall system behave as well as the
guest running on a drive with the write cache disabled.  I'm not sure
such a barrier is actually implied by VBox ``pause'' but if I were
designing the pause feature it would be.
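Concretely, on a ZFS host the workaround would look something like the
following.  This is a sketch built on an unverified assumption---that
``pause'' drains VBox's internal write buffers down to the host OS---and
VMNAME and tank/vms are placeholders, though `VBoxManage controlvm
... pause/resume` are real subcommands:

```shell
# Assumed workaround: quiesce the VM, snapshot the host filesystem, resume.
# ASSUMPTION (unverified): "pause" flushes VBox's internal write buffers.
VBoxManage controlvm VMNAME pause
sync                                  # push host-side dirty data too
zfs snapshot tank/vms@prepause-$(date +%Y%m%d-%H%M%S)
VBoxManage controlvm VMNAME resume
```

If pause does not imply such a flush, the snapshot is no better than the
plain crash-consistent one discussed above.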
_______________________________________________
zfs-discuss mailing list
zfs-discuss@xxxxxxxxxxxxxxx
http://mail.opensolaris.org/mailman/listinfo/zfs-discuss