Unexplained reboots and kernel panic

Molecular5869@feddit.org · 3 days ago (edited)

Gyroplast@pawb.social · 3 days ago (edited)

Looking at the call trace:

[ 1641.073507] RIP: 0010:rb_erase+0x199/0x3b0
...
[ 1641.073601] Call Trace:
[ 1641.073608]  <TASK>
[ 1641.073615]  timerqueue_del+0x2e/0x50
[ 1641.073632]  tmigr_update_events+0x1b5/0x340
[ 1641.073650]  tmigr_inactive_up+0x84/0x120
[ 1641.073663]  tmigr_cpu_deactivate+0xc2/0x190
[ 1641.073680]  __get_next_timer_interrupt+0x1c2/0x2e0
[ 1641.073698]  tick_nohz_stop_tick+0x5f/0x230
[ 1641.073714]  tick_nohz_idle_stop_tick+0x70/0xd0
[ 1641.073728]  do_idle+0x19f/0x210
[ 1641.073745]  cpu_startup_entry+0x29/0x30
[ 1641.073757]  start_secondary+0x11e/0x140
[ 1641.073768]  common_startup_64+0x13e/0x141
[ 1641.073794]  </TASK>

Reading the trace from the bottom up, what leads to the panic is start_secondary followed by cpu_startup_entry and do_idle, eventually ending up in the idle tick handling (tick_nohz_*) and timer migration (tmigr_*) code, which gives a context of “waking up/sleeping an idle CPU”. I’ve had a few systems in my life where somewhat aggressive power-saving settings in the BIOS were not cleanly communicated to Linux, causing such issues.

This area is notorious for being subtly borked, but you can test this hypothesis easily by either disabling a setting akin to “Global C States” in your BIOS, which effectively disables idle power-saving for your CPUs, or booting with the roughly equivalent kernel arguments processor.max_cstate=1 intel_idle.max_cstate=0, or even cpuidle.off=1.

This obviously costs you the power-saving capability of your CPUs, but if your system runs stable that way, you’re likely in the right ballpark and can look for a specific solution to that issue, possibly in a BIOS/firmware update. Here’s a not too shabby gist roughly explaining what C-states are. Don’t read too many of the comments, they’re more confusing than enlightening.
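If you go the kernel-argument route, it is worth confirming that the running kernel really booted with them. A minimal Python sketch, assuming a Linux system that exposes /proc/cmdline; the helper name idle_limits is made up for illustration, and how you add the arguments in the first place depends on your bootloader:

```python
from pathlib import Path

# The three parameters named in the comment above.
IDLE_PARAMS = ("processor.max_cstate", "intel_idle.max_cstate", "cpuidle.off")


def idle_limits(cmdline: str) -> dict:
    """Return the idle-related parameters found in a kernel command line."""
    found = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        if sep and key in IDLE_PARAMS:
            found[key] = value
    return found


if __name__ == "__main__":
    proc_cmdline = Path("/proc/cmdline")
    if proc_cmdline.exists():
        limits = idle_limits(proc_cmdline.read_text())
        for name in IDLE_PARAMS:
            print(f"{name}: {limits.get(name, 'not set')}")
    else:
        print("/proc/cmdline not found; is this a Linux system?")
```

If a parameter shows up as "not set" after a reboot, the bootloader configuration probably was not regenerated.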

The kernel docs I linked to above are comprehensive, and utterly indecipherable for a layperson. Instead of fumbling about in sysfs, try the cpupower tool/package to visualize the CPU idle settings, and try increasing enabled idle states until your system crashes again, to find out if a specific (deep) sleep state triggers your issue, and disable exactly that if you cannot find a bugfix/BIOS update.
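For the curious, this is roughly the information cpupower idle-info summarizes. A sketch, assuming the usual sysfs layout /sys/devices/system/cpu/cpuN/cpuidle/stateM/ with name, latency and disable files; read_idle_states is a hypothetical helper, and it only reads, it changes nothing:

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def read_idle_states(cpu: int = 0, root: Path = CPU_ROOT) -> list:
    """Return one dict per idle state of the given CPU, ordered by state index."""
    cpuidle = root / f"cpu{cpu}" / "cpuidle"
    # Sort numerically so that state10 comes after state2.
    state_dirs = sorted(cpuidle.glob("state[0-9]*"), key=lambda p: int(p.name[5:]))
    states = []
    for state_dir in state_dirs:
        states.append({
            "index": int(state_dir.name[5:]),
            "name": (state_dir / "name").read_text().strip(),
            "latency_us": int((state_dir / "latency").read_text()),
            "disabled": (state_dir / "disable").read_text().strip() != "0",
        })
    return states


if __name__ == "__main__":
    for state in read_idle_states(cpu=0):
        flag = "disabled" if state["disabled"] else "enabled"
        print(f"state{state['index']}: {state['name']} "
              f"(exit latency {state['latency_us']} us, {flag})")
```

To actually switch a state off or on, cpupower idle-set -d <index> and cpupower idle-set -e <index> (as root) are the less fiddly route.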

If this is your problem, to reproduce the panic, try leaving your system as idle as possible after bootup. If a panic happens regularly that way, try starting processes exercising all your CPUs - if the hypothesis holds, this should not panic at any time, as no CPU is ever idle.
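A dedicated stress tester is not needed for the "no CPU is ever idle" half of that experiment; a plain busy loop per core should do. A minimal Python sketch, with keep_cpus_busy as a made-up helper name. It assumes the scheduler spreads the workers across all cores, which is typical but not guaranteed, since nothing here pins them:

```python
import os
import subprocess
import sys
import time
from typing import Optional


def keep_cpus_busy(seconds: float, workers: Optional[int] = None) -> int:
    """Run one busy-looping Python process per CPU for `seconds`; return how many ran."""
    count = workers or os.cpu_count() or 1
    # Each child spins forever until we terminate it below.
    procs = [
        subprocess.Popen([sys.executable, "-c", "while True: pass"])
        for _ in range(count)
    ]
    try:
        time.sleep(seconds)
    finally:
        for proc in procs:
            proc.terminate()
        for proc in procs:
            proc.wait()
    return count
```

Calling keep_cpus_busy(3600) keeps every core occupied for an hour; Ctrl+C ends it early, and the finally block still cleans up the workers.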

Molecular5869@feddit.org · 3 days ago

Thanks, please check my updated post. I have disabled the relevant setting in my BIOS, installed cpupower and increased the idle state to the maximum value of 2. I have also tried states 0 & 1. Do I need to run the machine for longer or should it have crashed right away according to your hypothesis? I also can’t tell you if the BIOS setting already fixed my issue since I still can’t reproduce it.

About your last paragraph: the system has had these issues mostly while idle, but that’s probably because my system is idle most of the time anyway. I have also had the issue during low to medium loads, like transcoding audio via Jellyfin. But I haven’t methodically run a process on all CPUs. How would I go about running a load that uses all cores? I don’t particularly want to run a stress test for hours (because loud), but at this point I’m really open to trying anything.

Some time ago I also enabled an option in my BIOS that generates a dummy load, because some forum post had suggested that a PSU issue could be at fault for unexplained reboots. I have a 500W PSU that is way overkill for my components, and some users suggested that some PSUs can turn off when the load is too low. The option did not fix my problem. I have since connected a weaker 220W PSU, which also didn’t help.

Ænima@lemm.ee · 2 days ago

Just my two cents as someone who does this a lot myself: only change one thing at a time when testing troubleshooting suggestions. I know the reply suggested a few things in succession, but that was showing progressive steps to confirm and identify the underlying cause. Doing them all at once fails to identify the root cause at best, and at worst may introduce new problems.

I say this again, as someone who notoriously does this all the time. It’s a time-saver reflex, but one that will bite you in the ass eventually.

Molecular5869@feddit.org · 2 days ago

Yes, I went too fast because I have been sitting on this for months now. Normally I would only change one thing at a time, but with this situation it can take anywhere from 5 minutes to multiple days to test one single thing. If it doesn’t crash for 48 hours, it might be because I fixed the issue, or it might just be a coincidence and it will crash in hour 49 ¯\_(ツ)_/¯.

But you’re right, I will attempt it the right way when I find the time, even though it will probably take weeks 😮‍💨.
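To put a rough number on the "hour 49" worry: if crashes are assumed to arrive at random with a constant average rate (a Poisson process, which is an assumption for illustration and not something measured in this thread), the chance of a crash-free stretch of t hours happening by pure luck is exp(-t / m), where m is the mean time between crashes. A small Python sketch; the 24-hour mean in the example is hypothetical:

```python
import math


def chance_of_quiet_run(quiet_hours: float, mean_hours_between_crashes: float) -> float:
    """Probability of seeing no crash for `quiet_hours` even though nothing was fixed."""
    return math.exp(-quiet_hours / mean_hours_between_crashes)


def hours_for_confidence(mean_hours_between_crashes: float, max_luck: float) -> float:
    """Crash-free hours needed before 'just luck' is less likely than `max_luck`."""
    return -mean_hours_between_crashes * math.log(max_luck)


if __name__ == "__main__":
    # Hypothetical: the machine used to crash about once a day on average.
    print(f"{chance_of_quiet_run(48, 24):.3f}")   # about 0.135
    print(f"{hours_for_confidence(24, 0.05):.1f}")  # about 71.9
```

Under those assumptions, 48 quiet hours would still be luck roughly one time in seven, while about three quiet days would bring that below 5%.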

Ænima@lemm.ee · 2 days ago

I know it sucks, but I’m glad you seem to have corrected the problem. As someone who does more harm than good with Linux systems myself, fixing a Linux issue without completely reinstalling the OS is impressive, and you should be proud to have accomplished such a feat!

Molecular5869@feddit.org · 2 days ago

Well I’ve not fixed anything yet😅. It was sadly just a hypothetical. Sorry if that wasn’t clear from the comment.

Ænima@lemm.ee · 2 days ago

Well I’m still rooting for your success!

Gyroplast@pawb.social · 2 days ago

“Do I need to run the machine for longer or should it have crashed right away according to your hypothesis?”

Sorry for muddying the waters with my verbosity. It should not crash anymore. I believe your kernel panic was caused when an idle CPU 6 was sent to sleep. Disabling C-states, or limiting them to C0 or C1, prevents your CPUs from going into (deep) sleep. Thus, with C-states disabled or limited, the kernel panic should not happen anymore.

I haven’t found a way to explicitly put a core into a specific C-state of your choosing, so the best I can recommend now is to keep your C-states disabled or limited to C1, and just use your computer normally. If this kernel panic shows up again, and you’re sure your C-state setting was effective, then I would consider my C-state hypothesis falsified.
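One way to be reasonably sure the setting was effective is to watch the kernel’s per-state entry counters: a state that is really disabled should stop counting. A sketch, assuming the sysfs files cpuN/cpuidle/stateM/name and usage are present; usage_snapshot and states_entered are hypothetical helper names:

```python
from pathlib import Path

CPU_ROOT = Path("/sys/devices/system/cpu")


def usage_snapshot(root: Path = CPU_ROOT) -> dict:
    """Map (cpu, idle state name) to the kernel's entry counter for that state."""
    snapshot = {}
    for usage_file in root.glob("cpu[0-9]*/cpuidle/state[0-9]*/usage"):
        state_dir = usage_file.parent
        cpu = state_dir.parent.parent.name  # .../cpuN/cpuidle/stateM -> cpuN
        name = (state_dir / "name").read_text().strip()
        snapshot[(cpu, name)] = int(usage_file.read_text())
    return snapshot


def states_entered(before: dict, after: dict) -> dict:
    """Return how often each state was entered between two snapshots (non-zero only)."""
    return {
        key: after[key] - before.get(key, 0)
        for key in after
        if after[key] > before.get(key, 0)
    }
```

Take one snapshot, leave the machine idle for a minute, take another, and pass both to states_entered. If only the shallow states show up (the exact names depend on the idle driver), the limit is doing its job.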

If, however, your system runs normally for a few days, or “long enough for you to feel good about it” with disabled C-states, that would be a strong indication of some kind of issue when entering deeper sleep modes. You may then try increasing the C-state limit again until your system becomes unstable. Then you at least know a workaround at the cost of some power savings, and you can try to find specific issues with your CPU or mainboard concerning the faulty sleep mode on Linux.

Best of luck!

Molecular5869@feddit.org · 2 days ago

Thank you very much for your help so far, I will test the different methods and settings suggested in this thread over the next few weeks. I probably won’t find the time or motivation to methodically figure out the specific issue. That means that if at some point my system seems stable again, I will just leave everything as is and try to just be happy about it.

But when my life gets less busy I’ll maybe have time to see this completely through.

Anyways thanks to everyone, especially you, for taking the time to help me. I will update this post should I ever figure it out.