Marco Peereboom wrote:
Hmmm yeah this is one of those "should never happen" things. I need to
think about this because the firmware has a completion guarantee that
apparently isn't guaranteed. Can I see the dmesg so that I know what
ballpark the firmware level is in?
I'd certainly agree that it really ought not to be happening. I couldn't
find anyone else reporting this sort of problem. I don't know if that's
an indication that it's a symptom of something else screwing up on my
end or simply that there aren't many people running OpenBSD with these
controllers under a certain kind of load while also running bioctl with
great frequency.
Below are the relevant bits of the dmesg on two of the machines that I've
seen it on. I didn't personally do the firmware update, but I'm under
the impression that the second one is the most recent available.
mfi0 at pci4 dev 14 function 0 "Symbios Logic MegaRAID SAS 1064R" rev
0x00: apic 8 int 18 (irq 5), 0x35078086
mfi0: logical drives 1, version 5.1.1-0038, 128MB RAM
scsibus0 at mfi0: 1 targets, initiator 64
sd0 at scsibus0 targ 0 lun 0: <LSI, MegaRAID SAS RMB, 1.03> SCSI3
0/direct fixed
mfi0 at pci4 dev 14 function 0 "Symbios Logic MegaRAID SAS 1064R" rev
0x00: apic 8 int 18 (irq 5)
mfi0: logical drives 1, version 7.0.1-0054, 128MB RAM
scsibus0 at mfi0: 1 targets
sd0 at scsibus0 targ 0 lun 0: <INTEL, SROMBSAS18E, 1.12> SCSI3 0/direct
fixed
Thanks
On Thu, Mar 12, 2009 at 06:11:30PM -0400, tanner wrote:
Recently I observed an undesirable behavior on a controller that uses
the mfi driver, a Symbios Logic MegaRAID SAS 1064R.
The machine has a cron job that runs bioctl periodically and logs the
state of the RAID. Under disk load, bioctl would get caught in biowait.
Subsequent calls to bioctl would result in yet another bioctl process
stuck in biowait. This eventually led the machine to do Bad Things after
a couple of days of automatically generating bioctl processes. Disk
operation continued unimpeded, other than that.
It appeared that the controller would sometimes not respond to a
command, so the tsleep in mfi_mgmt never woke up. On top of that, it
sleeps with sc->sc_lock, so subsequent calls to bioctl would jam in
mfi_ioctl, waiting for the rwlock.
I wrote a quick patch to add a timeout to the tsleep call, which
prints a debugging message and goes to done if tsleep returns
EWOULDBLOCK. This was on multiple machines with different firmware,
so it doesn't appear that dropping commands is a flaw with one piece of
hardware. The patch has run on several heavily loaded machines for about
a week now with no issues.
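As a rough illustration of the control flow the patch introduces, here is a
small userspace model. It is not kernel code: fake_tsleep, struct fake_ccb,
and the tick-based "firmware" are hypothetical stand-ins, and only the shape
of the loop (sleep with a timeout, give up on EWOULDBLOCK) mirrors the diff
further down.

```c
#include <assert.h>
#include <errno.h>
#include <stdio.h>

/* Hypothetical stand-ins for the driver's ccb states. */
enum fake_state { FAKE_CCB_RUNNING, FAKE_CCB_DONE };

struct fake_ccb {
	enum fake_state state;
	int complete_at;	/* tick at which the "firmware" answers; -1 = never */
};

int now;			/* simulated clock, in ticks */

/*
 * Model of a sleep with a timeout: advance the clock one tick at a time
 * until the command completes (return 0) or timo ticks have passed
 * (return EWOULDBLOCK). Unlike the real tsleep(9), where a timeout of 0
 * means "sleep forever", this model requires timo > 0.
 */
int
fake_tsleep(struct fake_ccb *ccb, int timo)
{
	int deadline;

	assert(timo > 0);
	deadline = now + timo;
	while (now < deadline) {
		now++;
		if (ccb->complete_at >= 0 && now >= ccb->complete_at) {
			ccb->state = FAKE_CCB_DONE;
			return 0;
		}
	}
	return EWOULDBLOCK;
}

/* Returns 0 if the command completed, EWOULDBLOCK if the wait was abandoned. */
int
fake_mgmt(struct fake_ccb *ccb, int timo)
{
	while (ccb->state != FAKE_CCB_DONE) {
		if (fake_tsleep(ccb, timo) == EWOULDBLOCK) {
			printf("fake_mgmt: timed out after %d ticks\n", timo);
			return EWOULDBLOCK;
		}
	}
	return 0;
}
```

With a ccb whose complete_at is a small positive tick, fake_mgmt returns 0;
with complete_at set to -1 (a dropped command), it returns EWOULDBLOCK
instead of looping forever, which is the behavior the timeout is meant to buy.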
That said, this may not be the Right Solution. There's a risk that,
after we stick the ccb back on the freeq, it could be reused and the
controller could subsequently complete the command and hose things.
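To picture that reuse hazard, and one possible guard against it, consider the
toy model below. This is not how the mfi driver works: struct slot, the
generation counter, and the slot_* functions are all hypothetical. The idea
is that each reuse of a slot bumps a generation number, so a late completion
carrying an old number can be recognized and dropped.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical command slot; the real ccb has no generation field. */
struct slot {
	unsigned gen;	/* bumped every time the slot is handed out */
	int in_use;
	int done;
};

/* Hand the slot to a new command; returns the tag the "firmware" echoes back. */
unsigned
slot_get(struct slot *s)
{
	assert(!s->in_use);
	s->in_use = 1;
	s->done = 0;
	return ++s->gen;
}

/* Return the slot to the free pool, e.g. after a timeout. */
void
slot_put(struct slot *s)
{
	s->in_use = 0;
}

/*
 * Completion path. Returns 1 if the completion was accepted, 0 if it was
 * stale (the slot was freed or reused since the command was issued).
 */
int
slot_complete(struct slot *s, unsigned tag)
{
	if (!s->in_use || tag != s->gen) {
		printf("dropping stale completion (tag %u, current %u)\n",
		    tag, s->gen);
		return 0;
	}
	s->done = 1;
	return 1;
}
```

If a command times out, its slot is put back and handed out again, a late
completion for the first command arrives with the old tag and is rejected,
while the completion for the second command is accepted. Whether the
hardware offers anything usable as such a tag is a separate question.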
Thanks,
Tanner
--- sys/dev/ic/mfi.c Tue Nov 25 11:44:59 2008
+++ sys/dev/ic/mfi.c Wed Mar 11 17:38:08 2009
@@ -1187,9 +1187,15 @@ mfi_mgmt(struct mfi_softc *sc, uint32_t opc, uint32_t
mfi_post(sc, ccb);
DNPRINTF(MFI_D_MISC, "%s: mfi_mgmt sleeping\n", DEVNAME(sc));
- while (ccb->ccb_state != MFI_CCB_DONE)
- tsleep(ccb, PRIBIO, "mfi_mgmt", 0);
-
+ while (ccb->ccb_state != MFI_CCB_DONE) {
+ if(tsleep(ccb, PRIBIO, "mfi_mgmt",
+ MFI_MGMT_TIMEOUT * hz) == EWOULDBLOCK) {
+#ifdef MFI_DEBUG
+ printf("mfi_mgmt time out ccb");
+#endif
+ goto done;
+ }
+ }
if (ccb->ccb_flags & MFI_CCB_F_ERR)
goto done;
}
--- sys/dev/ic/mfivar.h Tue Nov 25 11:44:59 2008
+++ sys/dev/ic/mfivar.h Wed Mar 11 17:39:07 2009
@@ -35,6 +35,8 @@ extern uint32_t mfi_debug;
#define DNPRINTF(n,x...)
#endif
+#define MFI_MGMT_TIMEOUT 30
+
struct mfi_mem {
bus_dmamap_t am_map;
bus_dma_segment_t am_seg;
--
Tanner Beck
The Linux Box
734.761.4689