netbsd-bugs@netbsd.org

Subject: kern/38019: some kind of undetected deadlock slowly kills NetBSD-4.0_STABLE GENERIC.MP
From: "Greg A. Woods"
Date: Wed, 13 Feb 2008 17:40:01 +0000 UTC
>Number:         38019
>Category:       kern
>Synopsis:       some kind of undetected deadlock slowly kills NetBSD-4.0_STABLE GENERIC.MP
>Confidential:   no
>Severity:       critical
>Priority:       high
>Responsible:    kern-bug-people
>State:          open
>Class:          sw-bug
>Submitter-Id:   net
>Arrival-Date:   Wed Feb 13 17:40:01 +0000 2008
>Originator:     Greg A. Woods
>Release:        NetBSD 4.0_STABLE 2008/02/10
>Organization:
Planix, Inc.; Toronto, Ontario; Canada
>Environment:
System: NetBSD 4.0_STABLE GENERIC.MP
Architecture: i386
Machine: i386
>Description:

        I've been experiencing regular hangs and unkillable processes on
        my Dell PE2650 running NetBSD-4.0_STABLE GENERIC.MP.

        The most regular problem is triggered during the big "find" runs
        invoked by /etc/daily et al.

        The system managed to make it through its nightly cron jobs
        without hanging last night and I managed to use it (lightly) for
        several hours this morning before problems began to appear.  I
        had been investigating problems with the apcupsd package and had
        unpacked it and built it once or twice, then suddenly the
        "extract" phase hung, but the make process was interruptible, so
        I tried it again, with the same result.  Soon I discovered the
        gzcat and tar processes from both attempts were still present,
        and they were unkillable.

        Interestingly it doesn't seem to be access to the file gzcat is
        reading which causes problems.  Clearly the second run of
        "digest" didn't hang, and manual access with 'cat' and 'dd'
        works without hanging too.

        When the nightly "find" is the one that locks up, it seems all
        access to the same device (and filesystem?) soon causes all
        (useful) processes to lock up as well.

        The common denominator between the frequent nightly hangs and
        this currently more minor hang is that some of the stuck
        processes are in "vmmapva" and they are unkillable.
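
        A quick way to pick such processes out of a ps listing is to
        filter on the "D" (uninterruptible wait) state.  The following
        is only a sketch: the function name stuck_procs is hypothetical,
        and it assumes the input columns are pid, stat, wchan, command
        in that order.

```shell
#!/bin/sh
# Hypothetical helper: print only processes in uninterruptible wait ("D").
# Expects lines of the form "PID STAT WCHAN COMMAND..." on stdin
# (assumed column order; the header line is dropped because its second
# field, STAT, does not start with "D").
stuck_procs() {
    awk '$2 ~ /^D/ { print }'
}
```

        It could be fed with something like
        "ps -ax -o pid,stat,wchan,command | stuck_procs".  Applied to
        the listing in this report it would select the two gzcat
        processes in "vmmapva" and the two tar processes in "pipecl".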

        This may or may not be related to the cause of the problem in
        my PR#37993.

        One difference between this machine and others I know are
        running very similar kernels is that this machine's local
        filesystems are all accessed via the built-in Dell PERC/3Di
        controller and the aac(4) (and ld(4)) drivers.  The aac(4)
        driver is known to be rather buggy, the worst problem of which
        is that it seems to miss some interrupts, but a regular job
        reading a block from the raw /dev/rld0d device seems to wake it
        up enough to keep things running normally.  See the 'sh' process
        running 'dd' in a loop in the info below.
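
        For reference, a loop of roughly the following shape is what is
        meant.  This is a sketch under assumptions, not the exact
        command in use: the ps listing truncates the real command line,
        so the block size, the count, and the sleep interval here are
        guesses, and the function names poke_disk and keepalive are
        hypothetical.

```shell
#!/bin/sh
# Sketch only: bs, count, and the sleep interval are assumptions;
# the ps listing truncates the command actually in use.

# Read one block from the given device and discard it.
poke_disk() {
    dd if="$1" of=/dev/null bs=512 count=1 2>/dev/null
}

# Poke the device ($1) forever, pausing $2 seconds between reads.
keepalive() {
    while :; do
        poke_disk "$1"
        sleep "$2"
    done
}
```

        It would be started in the background with something like
        "keepalive /dev/rld0d 5 &".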

        I'll try to capture a crash dump from the system as I reboot it.

        I'll also try to find and back-port some of the aac(4) fixes
        that were discussed some time ago when I first noted that the
        driver is buggy and is also way out of date w.r.t. its original
        sources in FreeBSD.

12:07 [2291] # df                              
Filesystem                    512-blocks       Used      Avail %Cap Mounted on
/dev/ld0a                        4032824    2294492    1536692  59% /
/dev/ld0e                       10089944    2127724    7457724  22% /var
/dev/ld0f                       16130008   10325496    4998012  67% /usr/pkg
/dev/ld0g                      231522520  131320032   88626364  59% /rest
mfs:413                          2024220     142308    1780704   7% /tmp
kernfs                                 2          2          0 100% /kern
most:/var/spool/ftp/pub/mirror   8749688    1896624    6415576  22% /var/package-distfiles
12:08 [2292] # mount  
/dev/ld0a on / type ffs (local)
/dev/ld0e on /var type ffs (nosuid, nodev, NFS exported, local)
/dev/ld0f on /usr/pkg type ffs (nodev, soft dependencies, local)
/dev/ld0g on /rest type ffs (nosuid, nodev, NFS exported, local)
mfs:413 on /tmp type mfs (synchronous, nosuid, nodev, local)
kernfs on /kern type kernfs (local)
most:/var/spool/ftp/pub/mirror on /var/package-distfiles type nfs (nosuid, nodev)
12:08 [2293] # ps -la
 UID   PID  PPID  CPU PRI NI  VSZ  RSS WCHAN   STAT TTY       TIME COMMAND
1000 12225 10492    0   3  0  604    4 ttyin   IWs+ ttyp0  0:00.48 -ksh 
1000  3359  3354    0  18  0  568    4 pause   IWs  ttyp1  0:00.09 -ksh 
   0 11549  3359    0   3  0  604  588 ttyin   I+   ttyp1  0:00.12 ksh 
1000 11156 10187   91   3  0  572  584 ttyin   Is+  ttyp2  0:00.28 -ksh 
1000 10696 10694    0   3  0  580  592 ttyin   Is+  ttyp3  0:00.20 -ksh 
   0  7610 17451    0 -18  0  424  368 vmmapva D    ttyp4  0:00.74 /usr/bin/gzcat /var/package-distfiles//apcupsd-3.
   0 10067 27688    0  28  0  364  324 -       R+   ttyp4  0:00.00 ps -la
1000 12435 29241 1746  18  0  568    4 pause   IWs  ttyp4  0:00.12 -ksh
   0 15005 27796    0 -18  0  424  368 vmmapva D    ttyp4  0:00.83 /usr/bin/gzcat /var/package-distfiles//apcupsd-3.
   0 15626 27796    0   2  0  424  360 pipecl  D    ttyp4  0:00.01 /bin/tar -xf -
   0 17451     1    0  10  0  472  436 wait    I    ttyp4  0:00.00 (sh)
   0 17645 17451    0   2  0  424  360 pipecl  D    ttyp4  0:00.00 /bin/tar -xf -
   0 27688 12435    0  18  0  684  704 pause   S    ttyp4  0:00.62 ksh
   0 27796     1    0  10  0  472  436 wait    I    ttyp4  0:00.00 (sh)
1000   532  2485 1258  18  0  568    4 pause   IWs  ttyp5  0:00.13 -ksh
1000  7340   532    0   2  0 1512 1072 select  I+   ttyp5  0:01.18 slogin whome.planix.com (ssh2)
1000 22241 12282    0   3  0  564  560 ttyin   Is+  ttyp6  0:00.31 -ksh
1000  8897  6756    0  18  0  568  572 pause   Is   ttyp7  0:00.12 -ksh
   0 27674  8897    0   3  0  604  624 ttyin   I+   ttyp7  0:00.08 ksh
   0   350     1    0  10  0  456  364 wait    S    tty00- 1:37.04 sh -c while : ; do dd if=/dev/rld0d of=/dev/null
   0  2098     1    0   3  0  256    4 ttyin   IWs+ tty00  0:00.07 /usr/libexec/getty default constty
   0  1582     1 1348   3  0  256    4 ttyin   IWs+ ttyE0  0:00.07 /usr/libexec/getty Pc ttyE0
   0  1511     1    0   3  0  436    4 ttyin   IWs+ ttyE1  0:00.16 -ksh
   0  2103     1 1348   3  0  256    4 ttyin   IWs+ ttyE2  0:00.28 /usr/libexec/getty Pc ttyE2
   0  2066     1 1348   3  0  256    4 ttyin   IWs+ ttyE3  0:00.08 /usr/libexec/getty Pc ttyE3
   0  2041     1 1348   3  0  256    4 ttyin   IWs+ ttyE4  0:00.08 /usr/libexec/getty Pc ttyE4
   0  1980     1 1348   3  0  256    4 ttyin   IWs+ ttyE5  0:00.08 /usr/libexec/getty Pc ttyE5
   0  1981     1 1348   3  0  256    4 ttyin   IWs+ ttyE6  0:00.15 /usr/libexec/getty Pc ttyE6
12:09 [2294] # top
12:10 [2295] # fstat -p 7610
USER     CMD          PID   FD MOUNT       INUM MODE         SZ|DV R/W
root     gzcat       7610   wd /rest    9070540 drwxr-xr-x     512 r 
root     gzcat       7610    0 /         162324 crw-------   ttyp4 rw
root     gzcat       7610    1* pipe 0xd7b282bc -> 0xd7b284b0 w
root     gzcat       7610    2 /         162324 crw-------   ttyp4 rw
root     gzcat       7610    3 /var/package-distfiles     121 -rw-r--r--  4356614 r
root     gzcat       7610    5 -         -        none    -
12:16 [2296] # fstat -p 15005
USER     CMD          PID   FD MOUNT       INUM MODE         SZ|DV R/W
root     gzcat      15005   wd /rest    9070539 drwxr-xr-x       0 r 
root     gzcat      15005    0 /         162324 crw-------   ttyp4 rw
root     gzcat      15005    1* pipe 0xd7b28258 -> 0xd7b28064 w
root     gzcat      15005    2 /         162324 crw-------   ttyp4 rw
root     gzcat      15005    3 /var/package-distfiles     121 -rw-r--r--  4356614 r
root     gzcat      15005    5 -         -        none    -
12:16 [2297] # kill -9  15626 7610 15005 17645 
12:18 [2298] # kill -9  15626 7610 15005 17645 
12:18 [2399] # kill -9  15626 7610 15005 17645 
12:18 [2300] # fstat -p 15626      
USER     CMD          PID   FD MOUNT       INUM MODE         SZ|DV R/W
root     tar        15626   wd /rest    9070539 drwxr-xr-x       0 r 
root     tar        15626    1 /         162324 crw-------   ttyp4 rw
root     tar        15626    2 /         162324 crw-------   ttyp4 rw
root     tar        15626    3 /         160798 crw-rw-rw-     tty rw
root     tar        15626    4 /rest    9070539 drwxr-xr-x       0 r 
root     tar        15626    5 -         -        none    -
12:20 [2301] # fstat -p 17645
USER     CMD          PID   FD MOUNT       INUM MODE         SZ|DV R/W
root     tar        17645   wd /rest    9070540 drwxr-xr-x     512 r 
root     tar        17645    1 /         162324 crw-------   ttyp4 rw
root     tar        17645    2 /         162324 crw-------   ttyp4 rw
root     tar        17645    3 /         160798 crw-rw-rw-     tty rw
root     tar        17645    4 /rest    9070540 drwxr-xr-x     512 r 
root     tar        17645    5 -         -        none    -
12:18 [2302] # 

>How-To-Repeat:

>Fix:
