learning_ai_common_plat/AI.dev/CHEATSHEETS/long-running-jobs.md

9.0 KiB

🌙 Overnight / Long-Running Agent Runs — Cheat Sheet

What it is: how to run a multi-hour, unattended AI coding job (Devin / Claude Code / Codex / Copilot CLI) non-stop without it dying when the machine sleeps or the terminal closes — plus the ByteLyst best practices that make an unattended run safe. Best for: delegating a scoped roadmap (e.g. a 10-hour Phase build) and walking away overnight. Companion docs: the per-CLI sheets (devin-cli.md, claude-code-cli.md, codex-cli.md) and the canonical ../SKILLS/agent-behavior-guidelines.md.

🧠 The whole problem in one line: an unattended run has two enemies — the machine sleeping and the terminal/session dying. caffeinate kills the first, tmux the second, and a checkpoint file saves you from everything else (reboot, crash, power loss).


The two failure modes (and what fixes each)

Failure mode What happens Fix
Machine sleeps (idle timeout / lid closed) CPU + network suspend → job freezes mid-step, connections drop caffeinate (macOS) / systemd-inhibit (Linux) — keep it plugged in, lid open
Terminal/session dies (window closed, app crash/update, SSH drop, logout) Job receives SIGHUP and is killed tmux / screen — the job is owned by the tmux server, not the window
Reboot / crash / power loss Everything dies, including tmux Checkpoint + resume — the job writes progress to a file; re-running resumes it

caffeinate and tmux are complementary, not alternatives — use both for a real overnight run.

caffeinate (macOS) — stop the Mac from sleeping

caffeinate -dimsu <your-command>      # stays awake only while <command> runs
Flag Prevents
-d display sleep
-i idle sleep
-m disk sleep
-s system sleep (on AC power)
-u declares user active

⚠️ Lid caveat: caffeinate keeps the Mac awake, but closing the lid still sleeps most Macs unless on AC power with an external display (clamshell) or the right pmset. For overnight: keep it plugged in and the lid open.

tmux — survive a closed terminal / disconnect

tmux new -s phase3        # start a named, detachable session
#  ... launch your job inside it ...
# Ctrl-b  then  d         # DETACH — job keeps running with no terminal attached
tmux attach -t phase3     # reattach later (after reconnect, or a new window)
tmux ls                   # list running sessions
  • Local Mac: still useful — the job survives closing the terminal app, an app crash/auto-update, or logout (a plain terminal job dies on all of these). It does nothing for sleep — that's caffeinate's job.
  • Over SSH: essential — the job survives the SSH connection dropping.

Linux equivalents: systemd-inhibit --what=idle:sleep <cmd> (sleep), screen or tmux (session), or nohup <cmd> & + disown (cheapest detach, no reattach).

Putting it together

tmux new -s phase3
# inside the session:
caffeinate -dimsu <agent-cli> <all-allowed flag> "<the overnight prompt>" 2>&1 | tee ~/phase3.log
# Ctrl-b d to detach and walk away;  tmux attach -t phase3 in the morning

| tee ~/phase3.log captures the full run to disk so you have the log even if the scrollback is gone.


Best practices for an unattended agent run

  1. Green baseline first. Run the verify gate (pnpm build && pnpm test, browsers installed) before launching, so the agent starts from a known-clean state — never debug a pre-existing red suite at 2am.
  2. Self-contained roadmap, slice-by-slice. Point the agent at a doc that encodes scope, per-slice verify commands, and a done-definition (see ../PROMPTS/). One slice = one commit + push.
  3. Checkpoint + resume. The job must write progress to a file (e.g. docs/<phase>-progress.md: slice, status, commit SHA, gate result). If it dies for any reason, re-running the same prompt resumes from the first not-DONE slice. This is your safety net for the failures caffeinate/tmux can't catch.
  4. Failure protocol, not thrash. Tell it: max N honest attempts per slice, then commit wip(...) BLOCKED: + mark FAILED + move to the next independent slice. Order slices so the independent ones come first.
  5. Never merge unattended. Push to a feature branch and open one PR — a human reviews + merges in the morning. The agent must never touch main.
  6. Tests are sacred. Never weaken/skip a test to go green — fix the code (see the canonical guidelines). State this explicitly in the prompt.
  7. Scope + push-auth ready. Cache git credentials / gh auth login first so the overnight push never blocks on a prompt. Confirm push rights to the target remote.
  8. All-allowed flag. Launch with the CLI's auto-approve flag so it never pauses for permission (--permission-mode dangerous for Devin, --dangerously-skip-permissions for Claude Code, --dangerously-bypass-approvals-and-sandbox for Codex — confirm with <cli> --help).

ByteLyst-specific gotchas (the ones that silently fail overnight)

  • Corp proxy. pnpm install and Playwright browser downloads will TLS-fail behind the intercepting proxy unless NODE_EXTRA_CA_CERTS (proxy CA) is set. Prefer running off-corp for overnight jobs — simplest reliable path. (A proxy-blocked pnpm install is the classic cause of an overnight job that "did nothing".)
  • Playwright e2e. Slices with browser tests need npx playwright install --with-deps before the run, or the e2e gate fails.
  • Workspace, not registry. @bytelyst/* link via workspace:* from sibling packages/* — run from the monorepo so they resolve locally; no registry needed for the build itself.
  • Side-by-side repos. If the prompt file lives in a sibling repo (e.g. the gigafactory job lives in learning_ai_devops_tools but runs in learning_ai_common_plat), clone both under the same parent so the ../ relative path resolves.

Worked example — the gigafactory Phase 3 overnight run

# 1) bootstrap a clean baseline (once)
cd ~/code/mygh/learning_ai_common_plat && git pull && corepack enable \
  && pnpm install && pnpm build && pnpm test \
  && (cd dashboards/tracker-web && npx playwright install --with-deps)
cd ~/code/mygh/learning_ai_devops_tools && git pull   # prompt + roadmap repo

# 2) launch non-stop
cd ~/code/mygh/learning_ai_common_plat
tmux new -s phase3
caffeinate -dimsu <agent-cli> <all-allowed flag> \
  "Read ~/code/mygh/learning_ai_devops_tools/agent-queue/docs/jobs/phase3-overnight.md and execute it exactly: branch feat/gigafactory-phase3 off origin/main, implement Phase 3 slice-by-slice, verify each slice green before the next, checkpoint to docs/gigafactory-phase3-progress.md, commit+push per slice, open ONE PR and NEVER merge, never weaken tests (use the 3-attempt failure protocol), end with the consolidated report." \
  2>&1 | tee ~/phase3.log
# Ctrl-b d to detach

In the morning: tmux attach -t phase3, read the consolidated report, then review + merge the PR slice-by-slice.

Troubleshooting

Symptom Likely cause Fix
Job froze partway, no error Mac slept (idle or lid closed) Use caffeinate -dimsu; plug in; lid open
Job vanished when I closed Terminal No tmux — SIGHUP killed it Run inside tmux; reattach with tmux attach
"Did nothing" / 0 commits overnight pnpm install TLS-failed on corp proxy Run off-corp, or set NODE_EXTRA_CA_CERTS
e2e slice failed immediately Playwright browsers not installed npx playwright install --with-deps before launch
Push at end blocked / hung git/gh auth not cached gh auth login / cache credentials first
Re-ran prompt, it redid finished work No checkpoint file read Ensure the job reads *-progress.md and resumes

Quick-reference card

caffeinate -dimsu <cmd>          # macOS: stay awake while <cmd> runs (lid open + on AC!)
systemd-inhibit --what=idle:sleep <cmd>   # Linux equivalent
tmux new -s <name>               # detachable session  (Ctrl-b d = detach)
tmux attach -t <name>            # reattach            (tmux ls = list)
... | tee ~/run.log              # capture full output to disk
# launch:  tmux  ->  caffeinate -dimsu <cli> <auto-approve flag> "<prompt>" | tee log
# safety net:  checkpoint file + re-run prompt to resume;  push to branch, never merge

Related: devin-cli.md · claude-code-cli.md · codex-cli.md · ../PROMPTS/ · ../SKILLS/agent-behavior-guidelines.md

Last updated: 2026-05-30 · confirm CLI auto-approve flags against your installed version (<cli> --help).