ciss: Don't panic on null CR from ciss_dequeue_notify

Apparently, sometimes on hot plug/unplug, a null cr comes back from ciss_dequeue_notify. This is clearly a bug, and by ignoring it we're papering over that bug. We only ever wake the thread after enqueuing a notification or setting a bit about killing the thread, so once we check the bit isn't the cause, cr can't be NULL unless something else has dequeued it.

Ideally, this would be fixed, rather than papered over, but this makes a very old card somewhat more useable for external enclosures. I suspect it's a race when we set CISS_THREAD_SHUT and another flag (the latter without ciss_mtx held), but I don't see it, and without hardware to reproduce it, it would be hard to know for sure.

PR: 246279
Reviewed by: imp
Tested by: Marek Zarychta
Differential Revision: https://reviews.freebsd.org/D25155
2024-11-30 04:22:44 +00:00 · 2024-10-13 22:01:33 -06:00 · 2024-10-13 22:01:33 -06:00 · b339ab1491
commit b339ab1491
parent fd95966af5
1 changed file with 14 additions and 2 deletions
--- a/sys/dev/ciss/ciss.c
+++ b/sys/dev/ciss/ciss.c
@@ -4207,8 +4207,20 @@ ciss_notify_thread(void *arg)

 	cr = ciss_dequeue_notify(sc);

-	if (cr == NULL)
-		panic("cr null");
+	if (cr == NULL) {
+		/*
+		 * We get a NULL message sometimes when unplugging/replugging
+		 * stuff. But this indicates a bug, since we only wake this thread
+		 * when we (a) set the THREAD_SHUT flag, or (b) we have enqueued
+		 * something. Since it's reported around errors, it may be a
+		 * locking bug related to ciss_flags being modified in multiple
+		 * threads, some without ciss_mtx held. Or there's some other
+		 * way we either fail to sleep or corrupt the ciss_flags.
+		 */
+		ciss_printf(sc, "Driver bug: NULL notify event received\n");
+		continue;
+	}
+
 	cn = (struct ciss_notify *)cr->cr_data;

 	switch (cn->class) {
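
The reasoning in the commit message can be modeled outside the kernel. The following is a minimal userspace sketch, not the driver's code: `struct softc`, `struct request`, `THREAD_SHUT`, `enqueue_notify`, `dequeue_notify`, and `notify_step` are all hypothetical stand-ins, and locking and sleeping are deliberately left out. It only shows the decision made after a wakeup: check the shutdown flag first, then treat an empty dequeue as a counted, non-fatal event instead of a panic.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for driver state; not the real ciss structures. */
#define THREAD_SHUT 0x1u		/* models CISS_THREAD_SHUT */

struct request {
	int event_class;		/* models the notify event class */
	struct request *next;
};

struct softc {
	unsigned flags;
	struct request *head, *tail;	/* models the notify queue */
	int spurious;			/* wakeups that found nothing queued */
	int handled;			/* events actually processed */
};

enum step { STEP_SHUTDOWN, STEP_SPURIOUS, STEP_HANDLED };

/* Append a request to the tail of the queue. */
static void
enqueue_notify(struct softc *sc, struct request *r)
{
	assert(r != NULL);
	r->next = NULL;
	if (sc->tail != NULL)
		sc->tail->next = r;
	else
		sc->head = r;
	sc->tail = r;
}

/* Remove and return the head of the queue, or NULL if it is empty. */
static struct request *
dequeue_notify(struct softc *sc)
{
	struct request *r = sc->head;

	if (r != NULL) {
		sc->head = r->next;
		if (sc->head == NULL)
			sc->tail = NULL;
	}
	return (r);
}

/* One pass of the loop body after a wakeup (no locking in this model). */
static enum step
notify_step(struct softc *sc)
{
	struct request *r;

	/* Wakeup cause (a): the shutdown flag was set. */
	if (sc->flags & THREAD_SHUT)
		return (STEP_SHUTDOWN);

	/* Wakeup cause (b): something should have been enqueued. */
	r = dequeue_notify(sc);
	if (r == NULL) {
		/* Neither cause applies: record it and keep going. */
		sc->spurious++;
		return (STEP_SPURIOUS);
	}

	sc->handled++;
	return (STEP_HANDLED);
}
```

In this model a caller would loop on `notify_step` and keep going on `STEP_SPURIOUS`, which corresponds to the `continue` in the patch. The sketch says nothing about why the queue was empty; the suspected flag race is not reproduced here.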