### TL;DR
1. **Back‑up first** – you can’t rebuild a broken disk, so copy your data now.
2. **Lower the job count** – `-j2` (or `-j$(($(nproc)/2))`) keeps the drive from being hammered.
3. **Throttle I/O** – `nice`/`ionice` (or pointing `PKGDIR`/`PORTAGE_TMPDIR` at a different disk) makes every write cheaper.
4. **Use binary packages** – `emerge --buildpkg` with `PKGDIR` on a healthy disk lets you finish a build in one go and re‑use it later.
5. **If it still dies, drop to a single job** (`-j1`) until you can swap in a new drive.
Below is a practical step‑by‑step “wait‑for‑Monday” plan that keeps your Gentoo system alive while you’re
waiting for a new disk.
---
## 1. Quick backup (before you touch the disk)
```bash
# 1. Mount an external USB / second internal disk
mkdir -p /mnt/backup
mount /dev/sdb1 /mnt/backup # change to your device
# 2. Copy everything you care about (home, config, /etc)
rsync -aAXv --exclude={"/dev/*","/proc/*","/sys/*","/tmp/*","/run/*","/mnt/*","/media/*","/lost+found"} \
    / /mnt/backup/
```
> **Tip** – If you’re on a live USB, do the backup from there. That way the system itself isn’t using the
failing disk.
---
## 2. Set a low‑job build profile
Open `/etc/portage/make.conf` and add / replace:
```makeconf
# 4-core system → 2 jobs (drop to -j1 if the disk still struggles)
MAKEOPTS="-j2"

# Also limit how many packages Portage builds in parallel
# (separate from MAKEOPTS, which controls jobs *within* a single build)
EMERGE_DEFAULT_OPTS="--jobs=1 --load-average=2"

# Optionally lower memory pressure: set up a small swap file or zram
# device if RAM is tight, so builds never get OOM-killed.
```
> If the disk is already saturated, go even lower (`-j1`). It takes longer, but it avoids crashes. You can confirm what Portage actually picked up with the check below.
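To double-check which value Portage actually uses, a quick sketch with standard Portage tooling; the second command shows how to override the setting for a single build without editing `make.conf`:

```bash
# Show the MAKEOPTS value Portage will use for the next build
emerge --info | grep MAKEOPTS

# Or override just one command without touching make.conf at all
MAKEOPTS="-j1" emerge --oneshot <package>
```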
---
## 3. Give the disk a gentle hand
### a. Reduce I/O priority
```bash
nice -n 10 ionice -c 3 emerge -j2 <package>
```
* `nice -n 10` – push the CPU priority down a bit.
* `ionice -c 3` – put the process in the *idle* I/O class; it only gets disk bandwidth when nothing else is using it. (A build that is already running can be reprioritised after the fact – see the sketch below.)
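If a build is already running and hammering the disk, you can lower its priority in place instead of restarting it – a minimal sketch using standard `ionice`/`renice` flags (replace `<pid>` with a real PID):

```bash
# Find the emerge process and its compiler children
pgrep -a -f 'emerge|cc1|cc1plus'

# Move a running process into the idle I/O class and lower its CPU priority
sudo ionice -c 3 -p <pid>
sudo renice -n 10 -p <pid>
```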
### b. Build packages on a separate, healthy disk (if you have one)
```bash
mkdir -p /mnt/other
mount /dev/sdc1 /mnt/other                  # healthy disk
mkdir -p /mnt/other/ebuilds /mnt/other/tmp
```
Then run:
```bash
# PKGDIR = where the finished binary packages go;
# PORTAGE_TMPDIR = where the (much heavier) compile scratch space lives.
PKGDIR=/mnt/other/ebuilds PORTAGE_TMPDIR=/mnt/other/tmp \
    nice -n 10 ionice -c 3 emerge --buildpkg -j2 <package>
```
* The compile scratch space and the finished binary packages are written to `/mnt/other`.
* If the system crashes, the *original* disk has seen *far fewer writes* (mostly just the installed files and `emerge` metadata, not the heavy compile data).
> **If you don’t have a second disk, you can still point `PKGDIR` at a folder on the failing disk. You lose the "separate device" benefit, but you still get resumable binary packages – just combine it with the I/O throttling above.** To make the setting permanent rather than passing it on every command, put it in `make.conf` as sketched below.
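A minimal `make.conf` fragment for making the redirection permanent (the `/mnt/other/...` paths are just the example mount points used above):

```makeconf
# Store binary packages and the compile scratch space on the healthy disk
PKGDIR="/mnt/other/ebuilds"
PORTAGE_TMPDIR="/mnt/other/tmp"

# FEATURES is incremental, so this simply adds buildpkg to the defaults:
# every successful build also produces a binary package
FEATURES="buildpkg"
```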
---
## 4. Prefer binary packages over source‑only
If you’ve got a stable (but low‑job) build, use **binary packages** to finish fast and re‑use them later:
```bash
# Pre-download the sources once so a crash never forces a re-download
# (they are written to $DISTDIR, /var/cache/distfiles by default)
emerge -n --fetchonly <package>          # or use eix to build a list

# Build it once; without PKGDIR the binary packages land in /var/cache/binpkgs
# 1. Create a package directory (on the healthy disk if you have one)
mkdir -p /mnt/other
mount /dev/sdc1 /mnt/other
mkdir -p /mnt/other/ebuilds

# 2. Build & store the binary package there
PKGDIR=/mnt/other/ebuilds nice -n 10 ionice -c 3 \
    emerge --buildpkg -j2 <package>
```
> After you swap in the new disk, copy (or mount) that `/mnt/other/ebuilds` directory into the new system's `PKGDIR` (`/var/cache/binpkgs` by default) and install with `emerge --usepkgonly`. Everything you just built is usable immediately – no re‑compile needed. A short sketch of that step follows.
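A hedged sketch of the re-use step on the new install (paths assume the defaults mentioned above):

```bash
# Make the previously built packages visible to Portage
mkdir -p /var/cache/binpkgs
rsync -a /mnt/other/ebuilds/ /var/cache/binpkgs/

# If Portage doesn't see them, regenerate the binary package index
emaint --fix binhost

# Install straight from the binary packages – no compiler involved
emerge --ask --usepkgonly <package>
```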
---
## 5. What if it still crashes on the low‑job schedule?
1. **Drop to a single job** (`emerge -j1` or set `MAKEOPTS="-j1"`).
2. **Use `cpulimit`** if the CPU spikes are also a problem:
   ```bash
   cpulimit -l 50 -p $(pgrep emerge) # keep CPU ≤ 50 %
   ```
3. **Switch to a different storage medium temporarily** – e.g. a network share (NFS/SMB) or a second internal disk, even a small SSD. Mount it as `/mnt/fast` and point `PKGDIR` (and ideally `PORTAGE_TMPDIR`) at it, as in the sketch below.
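A minimal sketch, assuming a hypothetical NFS export `fileserver:/export/gentoo` (swap in whatever share or spare disk you actually have):

```bash
# Mount the temporary build target
mkdir -p /mnt/fast
mount -t nfs fileserver:/export/gentoo /mnt/fast    # or: mount /dev/sdd1 /mnt/fast

# Send both the compile scratch space and the binary packages there
mkdir -p /mnt/fast/tmp /mnt/fast/ebuilds
PORTAGE_TMPDIR=/mnt/fast/tmp PKGDIR=/mnt/fast/ebuilds \
    nice -n 10 ionice -c 3 emerge --buildpkg -j1 <package>
```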
---
## 6. Keep an eye on the drive while you wait
```bash
# Monitor real‑time I/O
iotop -o # (install if missing: `emerge iotop`)
# Watch kernel logs for errors
dmesg | grep -iE 'sd|ata|scsi|error|fail'
```
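For a rough early-warning signal while you wait, a hedged sketch with smartmontools (`emerge sys-apps/smartmontools` if it's missing); it logs to the backup disk so the failing drive isn't touched:

```bash
# One-off overall health verdict
smartctl -H /dev/sda

# Re-check every 10 minutes and keep a timestamped log on the backup disk
while true; do
    date >> /mnt/backup/smart.log
    smartctl -H /dev/sda >> /mnt/backup/smart.log 2>&1
    sleep 600
done
```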
If you see *no* errors but the machine still powers off, the problem may be *power/heat* rather than a pure
read‑write failure. In that case:
* Check temperatures (`sensors`).
* Check that the PSU is adequate and that all power/data cables are firmly seated.
* Add a small fan or a better heatsink if you can.
---
## 7. Final “hold‑em‑until‑new‑disk” recipe
```bash
# 1. Keep a minimal `make.conf`
MAKEOPTS="-j2"
# 2. Throttle everything
nice -n 10 ionice -c 3 emerge -j2 <large‑package> # e.g. sys-devel/make
# 3. If you get a crash, stop and retry with `-j1`.
# (It’s slower, but safe while the disk is dying.)
```
---
### What to do on Monday
* Swap the old disk for the new one.
* Restore the backup (`rsync` or `tar`).
* If you built binary packages into a separate directory, copy that directory onto the new disk and point `PKGDIR` (or `PORTAGE_BINHOST`, if you serve them over the network) at it – see the restore sketch below.
* Reset `MAKEOPTS` back to a higher value (e.g., `-j4` or `-j$(nproc)`) once you’re confident the new disk
is healthy.
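A hedged restore sketch from a live USB, assuming the new root partition is `/dev/nvme1n1p2` (a placeholder – use your real layout) and the step‑1 backup is still mounted at `/mnt/backup`:

```bash
# Mount the freshly prepared root filesystem and restore the backup
mkdir -p /mnt/newroot
mount /dev/nvme1n1p2 /mnt/newroot      # hypothetical new root partition
rsync -aAXv /mnt/backup/ /mnt/newroot/

# Bring the pre-built binary packages along
mkdir -p /mnt/newroot/var/cache/binpkgs
rsync -a /mnt/other/ebuilds/ /mnt/newroot/var/cache/binpkgs/
```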
---
#### Bottom line
You can keep Gentoo running for the next few days by:
1. **Backing up** immediately.
2. **Using a very low job count** (`-j2` or `-j1`).
3. **Throttling I/O** with `nice`/`ionice` or building on a different disk.
4. **Preferring binary packages** so the heavy compile happens only once.
When Monday arrives and you plug in the new drive, just restore the backup and you’ll be back on track—fast
and without crashes. Happy building!
>>> i have super computer with 2 graphic super cards and a lot of ram. now it have 64 ddr5 and can accept 192, and nvidia have 16 .. is new computer.. just hard drive nvme have problems..
### Quick recap of what you told me
* **Brand-new, high-end build** – two GPUs, 64 GB of DDR5 RAM (expandable to 192 GB), NVIDIA cards with 16 GB of VRAM each.
* **NVMe drive** – the one you’re using as the system’s primary or data storage is acting up.
* **Everything else (CPU, GPU, RAM, cooling, PSU)** is fine.
So the only “red‑flag” you’re seeing is the NVMe drive.
Below is a practical, step‑by‑step checklist that should let you:
1. **Confirm that the drive is really the culprit** (not a driver, firmware, or port issue).
2. **Back‑up your data safely** before you do anything destructive.
3. **Fix or replace** the drive, or at least keep the system running until you can swap it.
---
## 1. Gather diagnostics
| Goal | Command / file | What to look for |
|------|----------------|------------------|
| Identify the NVMe device | `nvme list` | `/dev/nvme0n1` (or similar). |
| Read the health log | `nvme smart-log /dev/nvme0n1` | `critical_warning`, `temperature`, `available_spare`, `percentage_used`, `media_errors`, `num_err_log_entries` – any non‑zero error counter is a red flag. |
| Run a *long* self‑test | `smartctl -t long /dev/nvme0n1` (or `nvme device-self-test`) | Takes roughly 1‑2 h and reports errors at the end. |
| Kernel messages | `dmesg \| grep -i nvme` | Look for "I/O error", "command timeout", "read/write error", "TRIM error". |
| System logs | `sudo journalctl -k \| grep -i nvme` | Same as above, but with timestamps kept across boots. |
| PCIe lane check | `sudo lspci -vv` (find the *Non‑Volatile memory controller* entry) | The `LnkSta:` line should report *Width x4* (or better), not x1. |
| Driver/firmware | `modinfo nvme` & `nvme fw-log /dev/nvme0` | Shows which driver and firmware revision you are running. |
> **Tip** – If the drive vendor publishes a firmware update (via its own tool or `fwupd`), apply it before you start chasing kernel‑level fixes.
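To capture everything in one pass, a minimal sketch that simply bundles the commands from the table above into a single report file (the output path is arbitrary):

```bash
#!/bin/bash
# Collect NVMe diagnostics into one report; run as root.
DEV=/dev/nvme0n1
OUT=/tmp/nvme-report.txt

{
    echo "== nvme list ==";   nvme list
    echo "== smart-log ==";   nvme smart-log "$DEV"
    echo "== kernel log =="; dmesg | grep -i nvme
    echo "== smartctl ==";    smartctl -a "$DEV"
} > "$OUT" 2>&1

echo "Report written to $OUT"
```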
---
## 2. Quick sanity checks
| Check | Why it matters |
|-------|----------------|
| **Cable / slot** – Make sure the NVMe sits in a *PCIe x4* M.2 slot (often labelled "M.2 PCIe"), not a SATA‑only M.2 or an x1 slot. | Insufficient lanes = reduced throughput and, on some boards, random timeouts. |
| **Power** – M.2 sockets are fed 3.3 V from the motherboard; an unstable rail or marginal PSU can still cause stalls under load. | A weak or noisy power source will cause read/write stalls. |
| **BIOS/UEFI** – Update the motherboard firmware and the SSD firmware to the latest versions. | Vendor firmware updates regularly fix NVMe stability and compatibility bugs. |
| **Kernel** – Use a reasonably recent kernel; NVMe has been supported for years, but newer kernels handle error recovery and power‑state (APST) quirks much better. | Known controller quirks are often worked around only in newer kernels. |
| **TRIM/discard** – Temporarily stop periodic TRIM to see whether the drive stabilises (see the sketch below this table). | Some drives misbehave when large discard requests hit a nearly full file system. |
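A hedged sketch of the TRIM pause on a systemd install (on OpenRC, drop the periodic `fstrim` cron job and the `discard` mount option instead):

```bash
# Stop the weekly batched TRIM run
sudo systemctl disable --now fstrim.timer

# See whether any filesystem is mounted with continuous discard
grep discard /etc/fstab /proc/mounts

# If so, remount without it for now (example: the root filesystem)
sudo mount -o remount,nodiscard /
```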
---
## 3. What the logs are telling you
- **`dmesg | grep -i "nvme.*error"`** – If you see *"I/O error"*, *"Command timeout"*, or *"Read/Write
failed"*, the controller is having trouble talking to the flash.
- **`nvme smart-log /dev/nvme0n1`** – Pay special attention to `available_spare`,
`available_spare_threshold`, `percentage_used`.
- **`smartctl -a /dev/nvme0n1`** – Look at the *Health Status* field.
> If any of those show "Error" or "Failed", or the `media_errors` counter keeps climbing, the drive is most likely *going bad*.
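A quick way to pull only the critical counters out of the smart log (field names as printed by `nvme-cli`):

```bash
# Show just the health-critical fields
nvme smart-log /dev/nvme0n1 | \
    grep -E 'critical_warning|available_spare|percentage_used|media_errors|num_err_log_entries'
```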
---
## 4. Backup‑and‑clone – if you want to salvage data
1. **Boot a live‑USB** (any Linux distro that supports NVMe).
2. **Clone with ddrescue** (or `Clonezilla` if you prefer a GUI):
```bash
sudo apt install gddrescue     # Debian/Ubuntu live media; on Gentoo: emerge sys-fs/ddrescue
# Clone the whole failing disk onto a healthy disk of at least the same size
sudo ddrescue -d -r3 /dev/nvme0n1 /dev/sdb /mnt/backup/nvme-rescue.map
```
> Replace `/dev/sdb` with a healthy SATA SSD / second NVMe (the *whole* device, not a partition), or with an image file on a large enough file system.
> Keep the map file on a third, healthy file system; it lets ddrescue resume after interruptions and shows exactly how many sectors it had to skip.
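If you rescued into an image file instead of a raw disk, a minimal sketch for pulling files back out of it (the image path and partition number are examples):

```bash
# Expose the partitions inside the rescued image as loop devices
sudo losetup -fP --show /mnt/backup/nvme.img    # prints e.g. /dev/loop0

# Mount the partition you need read-only and copy files out
sudo mkdir -p /mnt/rescued
sudo mount -o ro /dev/loop0p2 /mnt/rescued
rsync -av /mnt/rescued/home/ /mnt/backup/home/
```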
---
## 5. If the drive is fine, but you still see issues
| Possible culprit | Fix |
|------------------|-----|
| **Out‑of‑date NVMe firmware** | Download the latest firmware from the vendor and flash it with `nvme fw-download` + `nvme fw-commit` (see the sketch below) or with the vendor's own tool. |
| **PCIe lane mis‑allocation** | In the BIOS, set the M.2/NVMe slot to *"PCIe x4"* or *"Auto"*, and check that it isn't sharing lanes with a populated SATA port. |
| **Over‑temperature** | Make sure the drive sits in a well‑ventilated spot or add an M.2 heatsink; watch the temperature in `nvme smart-log` under load. |
| **Kernel bug / quirk** | Upgrade to a current kernel (on Gentoo e.g. `emerge sys-kernel/gentoo-kernel-bin`) or try the latest LTS series. |
| **File‑system‑level damage** | Run `fsck` on the partition from a live USB (ext4), or a scrub for Btrfs/ZFS (`btrfs scrub start /` or `zpool scrub <pool>`); don't disable the journal on a suspect drive. |
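A hedged sketch of the firmware flash with `nvme-cli`, assuming the vendor ships a plain `fw.bin` image and the drive uses firmware slot 1 (always follow the vendor's instructions first – a bad flash can brick the drive):

```bash
# Show the current firmware slots and active revision
sudo nvme fw-log /dev/nvme0

# Transfer the image to the controller, then commit it to slot 1
# (--action=1 activates it at the next controller reset)
sudo nvme fw-download /dev/nvme0 --fw=fw.bin
sudo nvme fw-commit   /dev/nvme0 --slot=1 --action=1

# Most drives need a full power cycle before the new firmware is live
```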
---
## 6. Bottom line for your “super‑computer”
1. **Back‑up** – Always keep a copy of all your important data.
2. **Diagnose** – Run `nvme list`, `nvme smart-log`, `dmesg`, `smartctl`.
3. **Fix** – Update firmware/BIOS, check power/cable, maybe change the NVMe slot.
4. **Replace** – If the logs keep flagging errors or SMART shows “Failed”, get a new NVMe SSD or a SATA SSD
as a temporary / long‑term solution.
5. **Use redundancy** – If the machine will stay in production for months, consider a ZFS mirror/RAID‑Z or Btrfs RAID1 across multiple SSDs for data safety (a one‑liner example follows below).
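A minimal sketch of the redundancy idea with ZFS, assuming two healthy NVMe drives (device names are placeholders, and the command destroys whatever is on them):

```bash
# Create a mirrored pool called "tank" across two NVMe drives
sudo zpool create tank mirror /dev/nvme1n1 /dev/nvme2n1

# Periodically verify every block against its checksum
sudo zpool scrub tank
sudo zpool status tank
```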
Feel free to share the exact error messages from `dmesg` or the SMART log, and I can drill down on the specific issue. Good luck, and I hope your GPU‑heavy workload runs smoothly once the drive is sorted!