Subject: [zfs-discuss] zpool import hangs any zfs-related programs, eats all RAM and dies in swapping hell
From: Jim Klimov
Date: Fri, 10 Jun 2011 13:51:59 +0400
The subject says it all, more or less: due to some problems
with a pool (e.g. deferred deletes a month ago, and possibly
something similar now), a running "zpool import" hangs any
other ZFS-related programs, including "zfs", "zpool",
"bootadm" and sometimes "df".

After several hours of disk-thrashing, all 8 GB of RAM in the
system is consumed (by the kernel, I guess, because "prstat"
and "top" don't show any huge processes) and the system dies
in swapping hell: scan rates for available pages were seen to
go into the millions, and CPU context switches reach
200-300k/sec on a single dual-core P4. The last stable-free
1-2 GB of RAM is eaten within a minute, after which the
system responds to nothing except the reset button.

ZDB walks were seen to take up over 20 GB of VM, but since
ZDB is a userland process, it could swap. I guess the kernel
is doing something similarly greedy during import - but
kernel memory cannot be swapped out.

So regarding the hung ZFS-related programs, I think there's
some bad locking involved (shouldn't I be able to see or
configure other pools besides the one being imported?), and
the VM depletion without swapping seems like a kernel problem.

Part of the problem is that the box is on "remote support":
while it is a home NAS, I am away from home for months, so
my neighbor assists by walking in to push reset. While
troubleshooting the problem I wrote a watchdog program based
on vmstat, which catches bad conditions and calls uadmin(2)
to force an ungraceful software reboot. Quite often it does
not have enough time to react, though - 1-second strobes into
kernel VM stats are a very long period :(
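In case it helps anyone, the core idea can be sketched roughly as below. This is a minimal, hypothetical illustration, not the actual freeram-watchdog code linked later in this mail; the threshold is made up, and the "sr" column position assumes illumos vmstat output:

```shell
#!/bin/sh
# Minimal sketch of a vmstat-driven freeze watchdog.
# NOTE: hypothetical illustration, not the real freeram-watchdog tool;
# the threshold and the column layout are assumptions.

SCAN_THRESHOLD=100000   # pages/sec; scan rates "in the millions" precede a hang

# Succeed (exit 0) if the given page scan rate exceeds the threshold.
scanrate_is_critical() {
    [ "$1" -gt "$SCAN_THRESHOLD" ]
}

watchdog_loop() {
    # 1-second strobes, as described above -- often still too slow to react.
    vmstat 1 | while read line; do
        set -- $line
        sr="${12}"                     # "sr" is the 12th column on illumos vmstat
        case "$sr" in
            ''|*[!0-9]*) continue ;;   # skip header lines
        esac
        if scanrate_is_critical "$sr"; then
            # uadmin(2), here via uadmin(1M): A_REBOOT/AD_BOOT -- an
            # ungraceful, immediate software reboot (no sync, no shutdown).
            uadmin 1 1
        fi
    done
}
```

A real tool would of course also watch free memory and swap, not just the scan rate, and would need to be locked in RAM to get a chance to run at all once paging starts.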

The least I can say is that this is very annoying - to the
point that I am not sure which variant of Solaris to build
my customers' and friends' NASes with. This box is currently
on OI_148a with the updated ZFS package from Mar 2011, and
while I am away I am not sure I can safely update it remotely.

Actually, I wrote about this situation in detail on the forums,
but that was before web posts were forwarded to email, so I
never got any feedback. There's a lot of detailed text in
those threads, so I won't go over all of it again now:
* http://opensolaris.org/jive/thread.jspa?threadID=138604&tstart=0
* http://opensolaris.org/jive/thread.jspa?threadID=138740&tstart=0

Back then it took about a week of reboots for "pool" to finally
get imported, with no visible progress tracker except running ZDB
to see that the deferred-free list was decreasing, and wondering
whether it was the culprit (in the end, it was). I was also lucky
that this cleanup of deferred-free blocks was cumulative, so the
progress gained survived across reboots. Currently I have little
idea what the problem is with my "dcpool" (it lives in a volume
in "pool" and mounts over iSCSI): ZDB has not finished yet, and
two days of reboots every 3 hours have not fixed the problem -
"dcpool" still does not import.

Since my box's OS is OpenIndiana, I filed a few bugs to track
these problems as well, with little activity from other posters:
* https://www.illumos.org/issues/841
* https://www.illumos.org/issues/956

The current version of my software watchdog, which saves my
assistant some trouble by catching near-freeze conditions,
is here:

* http://thumper.cos.ru/~jim/freeram-watchdog-20110610-v0.11.tgz

I guess it is time for questions now :)

What methods can I use (besides 20-hour-long ZDB walks) to
gain quick insight into the cause of the problems - why
doesn't the pool import quickly? Does the import make and keep
any progress across the numerous reboots? How much is left?
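For what it's worth, these are the kinds of quick probes I have in mind - a sketch of standard illumos mdb/zdb invocations, with no guarantee of how useful they are on a box that is already nearly hung ("pool" stands for the pool name):

```shell
# Kernel memory summary: is the kernel (ZFS metadata?) eating the RAM?
echo "::memstat" | mdb -k

# Kernel stacks of threads in the zfs module: where is the import stuck?
echo "::stacks -m zfs" | mdb -k

# Userland stack of the hung zpool process:
pstack `pgrep -x zpool`

# Offline block statistics of the not-yet-imported pool (slow, but the
# deferred-free numbers can be compared between attempts):
zdb -e -bb pool
```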

Are there any tunables I did not try yet? Currently I have
the following settings to remedy different performance and
stability problems of this box:

# cat /etc/system | egrep -v '\*|^$'
set zfs:aok = 1
set zfs:zfs_recover = 1
set zfs:zfs_resilver_delay = 0
set zfs:zfs_resilver_min_time_ms = 20000
set zfs:zfs_scrub_delay = 0
set zfs:zfs_arc_max=0x1a0000000
set zfs:arc_meta_limit = 0x180000000
set zfs:zfs_arc_meta_limit = 0x180000000
set zfs:metaslab_min_alloc_size = 0x8000
set zfs:metaslab_smo_bonus_pct = 0xc8
set zfs:zfs_write_limit_override = 0x18000000
set zfs:zfs_txg_timeout = 30
set zfs:zfs_txg_synctime = 30
set zfs:zfs_vdev_max_pending = 5
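As a side note (again a sketch, assuming mdb is available): /etc/system values only take effect at boot, so it may be worth confirming the live values in the running kernel, e.g.:

```shell
# Print live values of a few ZFS tunables from the running kernel
# (/J = 64-bit hex, /D = 32-bit decimal; sizes per variable type):
echo "zfs_arc_max/J"     | mdb -k
echo "arc_meta_limit/J"  | mdb -k
echo "zfs_txg_timeout/D" | mdb -k
```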

Are my guesses that "this is a kernel problem" at all correct?

Did any related fixes by chance make their way into the
development versions of newer OpenIndianas (148b, 151, the
pkg-dev repository)?

Thanks for any comments, condolences, insights, bugfixes ;)
//Jim Klimov
