Four Months Debugging a DRAM-less NVMe on a Raspberry Pi 5

In January I mounted a Crucial P3 Plus 4TB NVMe on a Raspberry Pi 5 inside an Argon Neo 5 M.2 case and moved everything onto it — Immich photo library, motionEye camera recordings, Docker volumes, a few databases, the odd SFTP sync target. It worked. Then it froze. A week later, it froze again. Then every few days. Then every few hours.

What followed was four months of hunting a ghost. If you are running a DRAM-less NVMe (P3 Plus, WD Blue SN580, Kingston NV2/NV3, HP EX900, many others) on a Pi 5, this post is written for you.

TL;DR. The Pi 5’s firmware disables Host Memory Buffer by default; DRAM-less drives need HMB to stay alive under load. There is no fix that makes a DRAM-less drive truly stable on a Pi 5 — only mitigations (HMB on, PCIe Gen 1, throttle stack) that reduce the wedge rate without eliminating it. The Practical Checklist below is the mitigation list. The full post is the four-month version of how I got there — and why the only real answer is to replace the drive with one that has on-board DRAM.

The Symptom

Every so often, under no obvious load pattern, the kernel would log this:

nvme nvme0: controller is down; will reset: CSTS=0xffffffff, PCI_STATUS=0x10
nvme nvme0: Does your device have a faulty power saving mode enabled?
nvme nvme0: Try "nvme_core.default_ps_max_latency_us=0 pcie_aspm=off" and report a bug
Buffer I/O error on dev nvme0n1p1, async page read
EXT4-fs (nvme0n1p1): Remounting filesystem read-only

The drive would drop off the PCIe bus entirely. lsblk showed nvme0n1 at 0 bytes. Admin commands returned “Resource temporarily unavailable.” The only recovery was a power cycle. Because the root filesystem lives on the SD card and only /data is on the NVMe, SSH stayed up during these wedges — which was the single luckiest decision I made in this whole story.
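For anyone who wants to count these events instead of eyeballing the log, here is a minimal sketch that pulls the wedge signature out of kernel log text. The function names and the sample string are my own illustration. The only thing taken from the log above is the message format itself, and an all-ones CSTS is read here as "the register read got no answer from the device".

```python
import re

# Matches the "controller is down" line quoted above.
WEDGE_RE = re.compile(
    r"nvme (?P<ctrl>nvme\d+): controller is down; will reset: "
    r"CSTS=(?P<csts>0x[0-9a-fA-F]+), PCI_STATUS=(?P<pci>0x[0-9a-fA-F]+)"
)


def find_wedges(log_text):
    """Return one (controller, csts, pci_status) tuple per wedge line."""
    hits = []
    for line in log_text.splitlines():
        match = WEDGE_RE.search(line)
        if match:
            hits.append(
                (
                    match.group("ctrl"),
                    int(match.group("csts"), 16),
                    int(match.group("pci"), 16),
                )
            )
    return hits


def is_all_ones(csts):
    """True when the status register read back as 0xffffffff."""
    return csts == 0xFFFFFFFF


# Hypothetical sample, shaped like the log excerpt above.
SAMPLE = (
    "nvme nvme0: controller is down; will reset: "
    "CSTS=0xffffffff, PCI_STATUS=0x10\n"
    "EXT4-fs (nvme0n1p1): Remounting filesystem read-only\n"
)

if __name__ == "__main__":
    for ctrl, csts, pci in find_wedges(SAMPLE):
        state = "all-ones" if is_all_ones(csts) else "partial"
        print(ctrl, hex(csts), hex(pci), state)
```

In practice the input would be the text output of dmesg or journalctl -k, captured over SSH while the drive is wedged.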

Chasing the Wrong Suspects (Three Months)

The kernel’s suggestion — disable NVMe power management, disable PCIe ASPM — is the first thing everyone tries. I tried it. It did not help. So I went down the usual list, more or less in this order, chasing one plausible theory at a time:

The thermal pad was crushing the controller. The Argon Neo 5 case ships with a thermal pad that presses the NVMe controller against the case as a heatsink. It seemed conceivable that the clamping pressure was flexing the board and causing intermittent contact. I opened the case, inspected the pad, re-seated it. No change.
Temperature. With the pad re-seated I watched the drive temperature during wedges. It sat around 40°C at crash time, well under the drive’s throttling threshold. Not thermal.
The HAT (the case’s PCIe carrier board). I read forum threads about incompatible M.2 HATs, looked at mine, talked myself into ordering a second opinion, then talked myself out of it because nothing said the Argon Neo 5 was known-bad.
The ribbon cable. Replaced the PCIe flex cable between the Pi and the HAT. The Pi 5’s M.2 interface uses a short flat flex; a damaged cable is a common cause of link errors. No change.
PCIe generation. People online claimed Gen 3 works on the Pi 5 if you force it in config.txt. I had Gen 3 set. I stepped back to Gen 2. The drive still crashed, this time after thirty-one minutes instead of forty-four seconds.
The drive itself. SMART was pristine: zero media errors, 100% spare, 0% used, only a few terabytes of writes. The drive worked perfectly in an external USB enclosure attached to a laptop. Not the drive.
Power supply. Official 27W USB-C adapter. Fine.

Each of these felt plausible for the hour I tested it. Each failed to hold. The wedges got less frequent whenever I restarted — which I now recognise as the giveaway that the controller was recovering after a cold reset and then slowly running out of internal state again — but I kept misreading it as “the last change helped a little.”

After three months I stepped down to PCIe Gen 1. The drive lasted longer. Not forever — but longer. I told myself this was good enough and moved on.

The Real Root Cause: DRAM-less + PCIe Gen 2/3 + No HMB

The answer, when it finally arrived, was a three-ingredient cocktail that by itself explains every one of those three months of wedges:

The drive is DRAM-less.
The Pi 5’s PCIe link runs at Gen 2 by default, or Gen 3 if you force it in config.txt, and either is fast enough to flood the controller.
The Pi 5 firmware ships with HMB disabled by default.

Most consumer NVMe drives have a small DRAM chip on board that caches the flash translation layer — the table mapping logical blocks to physical NAND pages. The P3 Plus does not. It is a DRAM-less design. Instead, it uses a feature called Host Memory Buffer (HMB), where the drive borrows 32–64 MiB of system RAM through a DMA region and uses it as its FTL cache.

Without HMB, every random write has to fetch its FTL entry from NAND before it can proceed. At Gen 1 (roughly 250 MB/s on the Pi 5’s single lane), the controller can just about keep up if the workload stays light. At Gen 2 (roughly 500 MB/s) or Gen 3 (close to 1 GB/s), the host happily pushes commands faster than the controller can resolve them without a cache. The internal command queue fills faster than it drains. Eventually it deadlocks.
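The per-generation figures for a single-lane link can be worked out from the line rate and the encoding overhead. This is a back-of-the-envelope sketch: it gives the raw link ceiling before protocol overhead, so real throughput is lower, and the function name is my own.

```python
# (line rate in GT/s, payload bits, total bits on the wire) per PCIe generation.
# Gen 1 and Gen 2 use 8b/10b encoding, Gen 3 uses 128b/130b.
PCIE_GENS = {
    1: (2.5, 8, 10),
    2: (5.0, 8, 10),
    3: (8.0, 128, 130),
}


def link_mb_per_s(gen, lanes=1):
    """Raw link bandwidth in MB/s (10**6 bytes) for a given generation."""
    line_rate_gt, payload_bits, total_bits = PCIE_GENS[gen]
    payload_bits_per_s = line_rate_gt * 1e9 * payload_bits / total_bits
    return payload_bits_per_s / 8 / 1e6 * lanes


if __name__ == "__main__":
    for gen in sorted(PCIE_GENS):
        print(f"Gen {gen} x1: about {link_mb_per_s(gen):.0f} MB/s")
```

Each step from Gen 1 to Gen 3 roughly doubles how fast the host can push commands at a controller whose ability to resolve them has not changed.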

From the outside this looks exactly like a PCIe link drop — same CSTS=0xffffffff error, same failed controller reset, same need for a power cycle.

The Raspberry Pi 5 firmware injects nvme.max_host_mem_size_mb=0 into the kernel command line by default, disabling HMB allocation. A Raspberry Pi engineer on the forums explained the reason: in their tests HMB did not noticeably improve performance and it consumed scarce CMA (contiguous) memory. On most drives this is merely a small performance hit. On a DRAM-less drive like the P3 Plus it is the difference between a working SSD and a drive that routinely commits suicide under load.

The first mitigation is to override the firmware-injected parameter by appending a later value to /boot/firmware/cmdline.txt. The kernel takes the last occurrence of each parameter:

nvme.max_host_mem_size_mb=128

The P3 Plus advertises an HMB preference of 32 MiB; 128 is a safe cap for this drive and others. After a reboot, dmesg confirmed “allocated 32 MiB host memory buffer.” A 170 GB torture test (a fifty-gigabyte sequential write through the QLC fold, ten minutes of QD256 4K random, five minutes of mixed reads and writes) passed with zero errors.
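The "last occurrence wins" behaviour is easy to model. The sketch below is an illustration only: it splits on whitespace and ignores quoted values, the helper names are mine, and the sample command line is hypothetical. It shows how appending a parameter changes the effective value without touching the earlier one.

```python
def effective_params(cmdline):
    """Map each parameter to the value of its last occurrence."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None
    return params


def append_param(cmdline, key, value):
    """Return the command line with key=value appended, still on one line."""
    return f"{cmdline.strip()} {key}={value}"


# Hypothetical combined command line: a firmware-injected zero, then user text.
BEFORE = "nvme.max_host_mem_size_mb=0 console=tty1 root=/dev/mmcblk0p2 rootwait"

if __name__ == "__main__":
    after = append_param(BEFORE, "nvme.max_host_mem_size_mb", 128)
    print(effective_params(BEFORE)["nvme.max_host_mem_size_mb"])
    print(effective_params(after)["nvme.max_host_mem_size_mb"])
```

Running the same parse over /proc/cmdline after a reboot is a quick way to confirm which value actually reached the kernel.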

The drive is not defective. It works fine in a desktop because desktop kernels allocate HMB by default. It does not work fine on a Pi 5 because the Pi 5 firmware explicitly disables the feature the drive depends on. The problem is at the boundary between three things nobody owns individually: Crucial’s firmware assumes HMB, Raspberry Pi’s firmware disables HMB, and the NVMe subsystem’s failure mode for “FTL deadlock” looks identical to “drive fell off the bus.” Three months of wrong suspects because the failure hid the cause.

Three More Wedges, Three More Lessons

HMB enabled, Gen 1 locked in, power management disabled, everything green. For about four days. Then the wedges started again — less frequent, but the same fingerprint. The next month was a cascade of increasingly specific failure modes, each one peeling off one more layer of the onion.

Wedge Three: ext4 Lazy Init

The controller started dying roughly fifty seconds after every boot, deterministically. The trigger turned out to be ext4lazyinit, the background process that finishes initialising block group bitmaps on a freshly mounted filesystem. It reads the bitmaps in a tight loop — on a 4TB partition that is hundreds of 2 GB-spaced bitmap reads queued back-to-back. The P3 Plus controller could not sustain the burst.

The mitigation is a mount option I had never seen before:

UUID=... /data ext4 defaults,noatime,init_itable=3000,nofail 0 2

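init_itable=3000 sets the lazyinit wait multiplier to 3000: after initialising one block group, the thread waits 3000 times as long as that work took before moving on (the default multiplier is 10). The filesystem still gets fully initialised, just over hours instead of a thirty-second burst. The controller stopped wedging at boot.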
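To get a feel for why the initialisation stretches to hours, here is a rough estimate. It assumes the kernel documentation's description of init_itable (wait n times as long as the previous block group took), a 4 KiB block size, a partition of exactly 4 * 10**12 bytes, and a purely hypothetical 1 ms of work per block group. None of these numbers were measured on my drive.

```python
import math

BLOCK_SIZE = 4096                  # assumed ext4 block size in bytes
BLOCKS_PER_GROUP = 8 * BLOCK_SIZE  # one bitmap block tracks 8 * 4096 blocks


def block_groups(partition_bytes):
    """Number of block groups, at 128 MiB per group with 4 KiB blocks."""
    group_bytes = BLOCK_SIZE * BLOCKS_PER_GROUP
    return math.ceil(partition_bytes / group_bytes)


def total_wait_seconds(partition_bytes, multiplier, work_seconds):
    """Total pacing delay if every group waits multiplier * work_seconds."""
    return block_groups(partition_bytes) * multiplier * work_seconds


if __name__ == "__main__":
    size = 4 * 10**12      # hypothetical partition size
    per_group = 0.001      # hypothetical 1 ms of work per group
    for mult in (10, 3000):
        hours = total_wait_seconds(size, mult, per_group) / 3600
        print(f"multiplier {mult}: about {hours:.1f} hours of pacing delay")
```

The point of the sketch is the ratio: the same amount of work is spread over a window 300 times longer than with the default multiplier.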

Wedge Four: rsync Over SSH

A few days later, during a large rsync pull from my laptop to the Pi, the controller went down again. Same fingerprint. init_itable only helped at mount time; user-driven write bursts were a different problem with the same shape. If boot bitmap reads could wedge the drive in thirty seconds, sustained rsync writes at line speed could do the same.

The only viable path was to throttle writes to something the controller could actually digest. That became a multi-layer project.

The Throttle Stack

Linux has several independent layers where you can cap I/O. Each has its own gotchas.

Writeback sysctls. The default kernel allows up to 20% of RAM (~1.6 GB on this host) to accumulate as dirty pages before it is forced to flush. When that flush happens, it is a huge burst. Shrinking the dirty pool to 64 MiB soft / 256 MiB hard spreads writes out in time.
Block-layer queue depth. The default queue depth (nr_requests=1023) lets the kernel queue far more commands than the controller’s internal queue can track. Lowering it to 128 and switching from none to mq-deadline caps concurrency and adds fair dispatch.
Ext4 commit interval. The default journal commits every five seconds. Under continuous writes this is twelve flush events per minute. commit=60 drops it to one.
cgroup I/O limits per service. systemd supports per-unit IOReadBandwidthMax and IOWriteBandwidthMax, enforced via the cgroup v2 io controller. Docker gets 40/100 MB/s, motionEye 10/10, cron and ssh 10/10 each.
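The numbers in the layers above translate into byte values like this. The sketch makes two assumptions: that the soft and hard dirty limits map to vm.dirty_background_bytes and vm.dirty_bytes, and that "MB/s" means MiB/s, which matches the 10485760 used for a 10 MB/s cap. The helper names are hypothetical.

```python
MIB = 1024 * 1024


def writeback_sysctls(soft_mib=64, hard_mib=256):
    """Byte values for the dirty-page limits (assumed sysctl mapping)."""
    return {
        "vm.dirty_background_bytes": soft_mib * MIB,
        "vm.dirty_bytes": hard_mib * MIB,
    }


def io_dropin(device, read_mib_s, write_mib_s, section="Service"):
    """Render a systemd drop-in body with per-device bandwidth caps."""
    lines = [
        f"[{section}]",
        "IOAccounting=yes",
        f"IOReadBandwidthMax={device} {read_mib_s * MIB}",
        f"IOWriteBandwidthMax={device} {write_mib_s * MIB}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    for key, value in writeback_sysctls().items():
        print(f"{key} = {value}")
    # The 10/10 cap used for motionEye, rendered for the NVMe device.
    print(io_dropin("/dev/nvme0n1", 10, 10), end="")
```

Generating the values rather than typing them by hand avoids the easy mistake of mixing decimal megabytes with MiB across layers.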

This stack caught everything running as a system service. It did not catch my rsync.

Wedges Five and Six: systemd Has Sharp Edges

I had put a 10 MB/s cap on ssh.service, confident that this would throttle my rsync transfers. It did not. The controller wedged again thirty minutes after my next test.

The reason is structural. When OpenSSH accepts a login, systemd-logind moves the session out of ssh.service and into a new transient scope under the user’s own slice: /user.slice/user-1000.slice/session-N.scope. Any process you launch from that SSH session (rsync, scp, sftp-server, your shell) runs under user-1000.slice, not ssh.service. My carefully configured cap applied to sshd itself and nothing else.

The correct target is a drop-in on the template user slice:

# /etc/systemd/system/user-.slice.d/nvme-throttle.conf
[Slice]
IOAccounting=yes
IOReadBandwidthMax=/dev/nvme0n1 10485760
IOWriteBandwidthMax=/dev/nvme0n1 10485760

The user-.slice.d directory is systemd’s template pattern for all user-UID.slice instances.

And then — because the controller wedged again after applying this — I discovered the second hidden trap: slice units require explicit IOAccounting=yes. Service units implicitly enable accounting when you set any IO* directive, but slice units do not. Without the accounting line, the IO*BandwidthMax= values are parsed, stored, and silently ignored.

systemctl show shows them configured. /sys/fs/cgroup/... shows an empty io.max. No errors. No warnings. Six hours of thinking I had fixed the problem when nothing was enforced.

A 200 MB dd from an SSH session now took twenty seconds. The cap was real. The wedges got rarer — not gone, but rarer.
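Checking enforcement can be scripted. The sketch below parses the text of a cgroup v2 io.max file and reports whether a byte-per-second limit is actually set. It also works out how long a transfer should take under the cap. The device number 259:0 in the sample is an assumption (look up the real one with lsblk), and the function names are mine.

```python
def parse_io_max(text):
    """Parse io.max content into {"MAJ:MIN": {"rbps": int or None, ...}}."""
    limits = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        entry = {}
        for pair in fields[1:]:
            key, _, value = pair.partition("=")
            entry[key] = None if value == "max" else int(value)
        limits[fields[0]] = entry
    return limits


def is_enforced(text, device, key="wbps"):
    """True only if the device has a numeric limit for the given key."""
    return parse_io_max(text).get(device, {}).get(key) is not None


def transfer_seconds(nbytes, limit_bytes_per_s):
    """Minimum time for nbytes to pass through a bandwidth cap."""
    return nbytes / limit_bytes_per_s


# Hypothetical io.max content with the 10 MiB/s caps applied.
SAMPLE_IO_MAX = "259:0 rbps=10485760 wbps=10485760 riops=max wiops=max\n"

if __name__ == "__main__":
    print("enforced:", is_enforced(SAMPLE_IO_MAX, "259:0"))
    print("empty file enforced:", is_enforced("", "259:0"))
    print(f"200 MB at cap: {transfer_seconds(200 * 10**6, 10485760):.0f} s")
```

An empty io.max is exactly the silent failure described above: the unit reports its configured values, and nothing is enforced.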

The Final Polish

With the stack in place I went back to the forums and rpi-linux issue tracker to check whether I had missed anything. The community consensus for “controller is down” wedges on a Pi 5 is essentially the two kernel parameters the kernel itself suggests, the HMB workaround, and Gen 1 as a last resort. Nobody else seemed to have diagnosed the controller queue flooding pattern. I found nobody with a throttle stack anywhere near this depth. Two small gaps remained.

NVMe firmware. My drive shipped with firmware P9CR40A. Crucial’s current revision is P9CR40D. The release notes only advertise a Lenovo ThinkPad BIOS fix and “error handling enhancement” — not a ringing endorsement for my specific bug — but flashing is quick, and the error-handling change might make recovery a bit more graceful. The P3 Plus has only one firmware slot, so the flash is single-shot with no rollback. Update via nvme fw-download --xfer=16384 (the default transfer chunk is rejected by the drive) and nvme fw-commit --slot=0 --action=3. The -v verbose flag that one popular walkthrough uses is not accepted by Debian’s nvme-cli; I learned this the expensive way by rebooting with a staged-but-not-committed firmware image, which is promptly discarded, and ending up exactly where I started.
CMA pool size. The Pi 5’s default contiguous memory pool is 64 MiB. HMB’s 32 MiB plus the VC4 display driver’s allocation left just 22 MiB free. One memory spike and the 32 MiB HMB allocation could fail at boot — which would take me straight back to the original deadlock scenario. Adding dtoverlay=cma,cma-192 to config.txt triples the pool to 192 MiB. Post-reboot: 153 MiB free, HMB still allocates cleanly.
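The CMA arithmetic is simple enough to sanity-check in a few lines. The 10 MiB attributed to the display driver below is inferred from the numbers above (64 minus 32 leaves 32, and only 22 were free), not measured separately, so treat it as an assumption.

```python
def cma_free_mib(pool_mib, hmb_mib, other_mib):
    """CMA left over after the HMB and other allocations."""
    return pool_mib - hmb_mib - other_mib


def hmb_fits(pool_mib, hmb_mib, other_mib, spike_mib=0):
    """Would the HMB still fit if another consumer grabbed spike_mib first?"""
    return pool_mib - other_mib - spike_mib >= hmb_mib


if __name__ == "__main__":
    vc4_mib = 10  # inferred share of the display driver, an assumption
    for pool in (64, 192):
        free = cma_free_mib(pool, 32, vc4_mib)
        survives = hmb_fits(pool, 32, vc4_mib, spike_mib=30)
        print(f"pool {pool} MiB: {free} MiB free, survives 30 MiB spike: {survives}")
```

With the default pool a modest spike from another CMA consumer is enough to push the HMB allocation out. With 192 MiB it is not.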

Practical Checklist for DRAM-less NVMe on a Pi 5

If you are reading this because a P3 Plus, Kingston NV2/NV3, WD Blue SN580, HP EX900, or any other DRAM-less drive is misbehaving on your Pi 5, here is the order I wish someone had handed me in January. Each step is safe in isolation; applying them in order is cheap compared to three months of guessing.

Confirm the drive is DRAM-less. Search the datasheet for “HMB” or “DRAM cache.” If HMB is mentioned, it is DRAM-less and needs HMB to perform reliably under load.
Enable HMB. Append nvme.max_host_mem_size_mb=128 to the end of /boot/firmware/cmdline.txt. The kernel takes the last occurrence of the parameter, overriding the firmware’s default of zero. Verify with dmesg | grep "host memory buffer" — you want to see allocated N MiB host memory buffer.
Drop to PCIe Gen 1. Add dtparam=pciex1_gen=1 to /boot/firmware/config.txt. Yes, you are giving up bandwidth. In exchange you are giving the controller room to breathe. On a home server doing photo ingestion, surveillance footage, and the occasional rsync, you will not notice the difference.
Disable NVMe and PCIe power management. Append nvme_core.default_ps_max_latency_us=0 pcie_aspm=off pcie_port_pm=off to the same cmdline.txt. The Pi is always on; you have nothing to gain from power-saving transitions and plenty to lose when a transition coincides with a command in flight.
Bump the CMA pool. Add dtoverlay=cma,cma-192 to config.txt. The default 64 MiB is enough for HMB most of the time, but leaves no margin if the display driver grows its allocation. 192 MiB is a free safety net.
Update the drive firmware. For the P3 Plus: download the Crucial P9CR40D package, unzip, run sudo nvme fw-download --fw=P9CR40D/1.bin --xfer=16384 /dev/nvme0 (the --xfer is mandatory — the default chunk size is rejected with error 0x2002), then sudo nvme fw-commit --slot=0 --action=3 /dev/nvme0, then reboot. Do not pass -v; Debian’s nvme-cli rejects it and your staged firmware is lost on reboot.
Mount with init_itable=3000. Only relevant on large (>2 TB) ext4 partitions. Prevents the lazy-init bitmap burst from wedging the controller roughly fifty seconds after boot.
Throttle writers that push sustained bursts. If you run Docker, a surveillance recorder, or accept large rsync transfers, add IOReadBandwidthMax= and IOWriteBandwidthMax= drop-ins. For SSH-initiated transfers, target the template slice user-.slice.d/ — with IOAccounting=yes explicitly set, or the cap is silently ignored. Verify the enforcement landed with cat /sys/fs/cgroup/user.slice/user-1000.slice/io.max.

The first four are the essentials. If your workload is light, those alone may be enough. The last four are what you reach for if the wedges return under real load.
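Steps two to five of the checklist can be audited mechanically. The sketch below takes the text of cmdline.txt and config.txt and reports which settings are present. It is a rough check with names of my own choosing: it does exact string matching on config.txt lines and ignores comments, so an equivalent setting written differently would be reported as missing.

```python
POWER_PARAMS = {
    "nvme_core.default_ps_max_latency_us": "0",
    "pcie_aspm": "off",
    "pcie_port_pm": "off",
}


def audit(cmdline, config_txt):
    """Return a dict of checklist item -> bool for the two boot files."""
    params = {}
    for token in cmdline.split():
        key, sep, value = token.partition("=")
        params[key] = value if sep else None  # last occurrence wins

    config_lines = {
        line.strip()
        for line in config_txt.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    }

    hmb = params.get("nvme.max_host_mem_size_mb")
    return {
        "hmb_enabled": hmb is not None and hmb.isdigit() and int(hmb) > 0,
        "pcie_gen1": "dtparam=pciex1_gen=1" in config_lines,
        "power_mgmt_off": all(
            params.get(key) == value for key, value in POWER_PARAMS.items()
        ),
        "cma_192": "dtoverlay=cma,cma-192" in config_lines,
    }


if __name__ == "__main__":
    # Hypothetical file contents for illustration.
    cmdline = (
        "console=tty1 rootwait nvme_core.default_ps_max_latency_us=0 "
        "pcie_aspm=off pcie_port_pm=off nvme.max_host_mem_size_mb=128"
    )
    config = "# NVMe mitigations\ndtparam=pciex1_gen=1\ndtoverlay=cma,cma-192\n"
    for item, ok in audit(cmdline, config).items():
        print(f"{item}: {'ok' if ok else 'MISSING'}")
```

On the Pi itself the two inputs would come from /boot/firmware/cmdline.txt and /boot/firmware/config.txt.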

What I Would Do Differently

Three months of misdiagnosis is painful to write down, but the lessons are specific enough to be useful next time.

Read the datasheet. Had I known on day one that the P3 Plus is DRAM-less, the whole story would have unlocked immediately. DRAM-less drives have well-known HMB dependencies. The Pi 5 disables HMB by default. Those two facts belong in the same sentence from day one. I never connected them because I bought the drive for price per terabyte and did not read past the capacity.

Trust SMART. The drive reported zero media errors, 100% spare, and 0% wear throughout the entire saga. Every wedge was a link-level or controller-level event, never a media event. If SMART is clean, the bytes you wrote are safe, and you are looking for a transport or firmware problem — not a failing drive.

The failure mode is not the cause. “Controller is down” looks like a drive failure. It can also be the controller deadlocking on its own queue, or the PCIe link resetting because the controller stopped answering, or the FTL not being reachable because HMB was never allocated. Same error text, completely different causes. Whenever the immediate error is ambiguous, spend an hour on the upstream state (queue depth, HMB, CMA, cgroup limits) before buying replacement hardware.

systemd slice semantics are not service semantics. I had used IOWriteBandwidthMax= on services before and assumed it “just worked.” On slices, it needs explicit IOAccounting=yes or it silently no-ops. This is not in the man page examples, and systemctl show happily reports the configured values even when the cgroup enforcement is empty. Always verify /sys/fs/cgroup/.../io.max directly.

Mitigations stack up fast when the underlying problem is unfixable. The path from “add one kernel parameter” to “six-layer I/O throttle stack with cgroup accounting, sysctl tuning, udev block queue configuration, ext4 mount options, NVMe firmware flash, and CMA pool resizing” is not obvious at step one. Each layer was individually justified by a specific wedge. None of them, alone or together, made the drive truly stable — they only stretched the time between wedges.

As of today the Pi has been up for three minutes with every layer active and the new firmware in place. Ask me in a month whether it still is.

Update: I Sold It

You asked. The stack did not hold. The wedges came back — less often, but the same fingerprint. The throttle layers reduced the rate; they did not eliminate it.

To retire the drive gracefully I moved it to an external USB-NVMe enclosure (RTL9210 bridge) and plugged it into a laptop, expecting at least a clean copy-off. It wedged there too. Same fingerprint, different host. Which is when I learned the thing I should have known months earlier: USB-NVMe bridges do not pass HMB through. The bridge controller talks to the drive in a basic command set that has no concept of host memory buffers, so the drive runs DRAM-less and HMB-less anywhere it sits behind a USB bridge — and eventually deadlocks under sustained writes, exactly as it did on the Pi before I enabled HMB. The earlier “worked perfectly in a USB enclosure” test in this very post turned out to be a short test that did not run long enough to provoke the failure.

That moved my mental model from “the Pi 5 firmware breaks this drive” to “this drive is fragile anywhere HMB is not allocated, and that includes a non-trivial chunk of the real world.” Which is when I gave up.

I sold it. Listed it on the local marketplace with full disclosure of the bug, and replaced it with a drive that has DRAM on board. No kernel parameters, no slice drop-ins, no init_itable=3000, no firmware acrobatics. The throttle stack came back out. The Pi is boring again.

If you are reading this because your DRAM-less drive is wedging on a Pi 5, the eight-step checklist above is honest work and it will reduce your wedge rate. But the one-line answer, after four months and a USB enclosure that also failed, is: sell the drive and buy one with DRAM. I wish someone had told me that in January.