learning_ai_common_plat/docs/runbooks/GITEA_VM_SETUP.md
saravanakumardb1 77b074f3c0 feat(gitea): docker-mode env hygiene + document containerized job migration
- add-host-runner.sh docker mode now strips host-specific envs (HOME, PATH,
  PNPM_HOME) that leak macOS paths into Linux containers and override workflow
  env (broke $HOME-relative writes)
- GITEA_VM_SETUP.md 11.5: reference pattern + 5 gotchas for migrating a real
  job (docker-lint) onto the docker runner: Actions secret (not token file),
  doctor.sh token-file requirement, host-env leakage, env_file token override,
  proxy bypass. Validated green on M-…-4.
2026-05-28 19:16:52 -07:00


Gitea Cloud VM Setup — Runbook

Status: Active runbook · Last verified: 2026-05-28

Use this when: You have provisioned a cloud VM (Azure / wherever), Gitea is installed and running on :3300, repos are cloned, and you need to wire the npm registry end-to-end with your laptop.

Assumes you're SSH'd into the VM (or running commands on the VM) and the sibling learning_ai_common_plat repo is at ~/code/mygh/learning_ai_common_plat/ on both the VM and your laptop. Adjust paths as needed.


Prerequisites checklist

Before starting, confirm all of these on the VM:

# 1. Gitea container is running and healthy
sudo docker ps | grep gitea
curl -fsS http://localhost:3300/api/v1/version
# Expected: {"version":"1.X.X"}

# 2. Port 3300 is reachable from your laptop
#    (run this FROM YOUR LAPTOP, not the VM)
#    curl -fsS http://<VM_HOST>:3300/api/v1/version

# 3. Repos cloned on the VM
ls ~/code/mygh/learning_ai_common_plat
# Expected: packages/  services/  scripts/  ...

If any of these fail, fix them first. Common gotchas:

  • Port 3300 blocked: Azure NSG → VM → Networking → "Add inbound port rule" → TCP 3300 from your home IP
  • Gitea registry disabled: Edit app.ini inside /var/lib/gitea/conf/, add [packages]\nENABLED = true, then sudo docker restart gitea
  • Hostname mismatch: the VM resolves its own hostname to localhost, while your laptop reaches it via public DNS. Note both; API calls made on the VM only need localhost.

Step 1 — Create Gitea admin user (skip if you already have one)

Run on the VM:

# Check if any admin exists
sudo docker exec gitea gitea admin user list

# If empty (or you don't have admin creds), create one:
ADMIN_USER="gitea-admin"
ADMIN_PASS="$(openssl rand -base64 24 | tr -dc 'A-Za-z0-9' | head -c 24)"
echo "  Admin password (save this!): $ADMIN_PASS"

sudo docker exec gitea gitea admin user create \
  --username "$ADMIN_USER" \
  --password "$ADMIN_PASS" \
  --email "admin@bytelyst.local" \
  --admin \
  --must-change-password=false

SAVE the admin password somewhere safe (1Password / Bitwarden / macOS Keychain). You'll need it only for token rotation and bootstrapping; day-to-day work uses the npm token.
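The openssl pipeline above strips + and / after base64-encoding, so in principle the result can come out shorter than 24 characters. A minimal sketch of a generator that over-samples before trimming; the function name gen_password is hypothetical and not part of the repo scripts:

```shell
# Hypothetical helper: print a random alphanumeric password of length $1 (default 24).
gen_password() {
  local len="${1:-24}"
  # 96 random bytes become 128 base64 characters; after dropping '+', '/',
  # '=' and newlines there is still far more than enough left to cut from.
  head -c 96 /dev/urandom | base64 | tr -dc 'A-Za-z0-9' | cut -c "1-${len}"
}
```

Usage would look like ADMIN_PASS="$(gen_password)", with the rest of the step unchanged.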

Optionally store in macOS Keychain on your laptop:

# Run on your LAPTOP (Mac)
security add-generic-password \
  -s 'gitea-admin' \
  -a 'gitea-admin' \
  -w '<paste-the-password>'

Then scripts/gitea/token.sh rotate can auto-discover it later.


Step 2 — Create the npm owner user

The npm registry is namespaced by owner. The canonical owner is learning_ai_user.

Run on the VM:

ADMIN_USER="gitea-admin"
ADMIN_PASS="<paste-admin-password-from-step-1>"
NPM_USER="learning_ai_user"
NPM_PASS="$(openssl rand -base64 24 | tr -dc 'A-Za-z0-9' | head -c 24)"

# Create the user via admin API
curl -fsS -u "$ADMIN_USER:$ADMIN_PASS" \
  -X POST "http://localhost:3300/api/v1/admin/users" \
  -H 'Content-Type: application/json' \
  -d "{
    \"username\": \"$NPM_USER\",
    \"email\": \"npm@bytelyst.local\",
    \"password\": \"$NPM_PASS\",
    \"must_change_password\": false
  }"
echo ""
echo "  NPM user '$NPM_USER' created"
echo "  NPM password (needed only to mint tokens): $NPM_PASS"

Save NPM_PASS temporarily — you need it for Step 3. After that the npm token replaces it for all day-to-day use.

If the call fails because the user already exists (Gitea's message is "user already exists"; with curl -f you may see only the HTTP error status), that's fine: skip ahead. Use the existing NPM_PASS (or reset it via sudo docker exec gitea gitea admin user change-password --username learning_ai_user --password <new>).


Step 3 — Mint the npm token

Run on the VM:

NPM_USER="learning_ai_user"
NPM_PASS="<paste-from-step-2>"
TOKEN_NAME="npm-$(date +%Y%m%d-%H%M%S)-$(hostname -s)"

RESPONSE=$(curl -fsS -u "$NPM_USER:$NPM_PASS" \
  -X POST "http://localhost:3300/api/v1/users/$NPM_USER/tokens" \
  -H 'Content-Type: application/json' \
  -d "{
    \"name\": \"$TOKEN_NAME\",
    \"scopes\": [\"write:package\", \"read:package\"]
  }")
echo "$RESPONSE"

# Extract token (works for both newer "token" and older "sha1" field names)
TOKEN=$(echo "$RESPONSE" | grep -oE '"(sha1|token)":"[^"]+' | head -1 | sed 's/.*":"//')
echo ""
echo "════════════════════════════════════════"
echo "  NPM TOKEN (copy this NOW):"
echo "  $TOKEN"
echo "════════════════════════════════════════"

Copy the token immediately. Gitea never displays a token's secret value after the first response.
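The grep/sed extraction above leaves TOKEN silently empty when the response is an error body. A small sketch that wraps the same extraction in a function and fails loudly instead; extract_token is a hypothetical name, and the JSON in the usage note is illustrative rather than a captured response:

```shell
# Hypothetical helper: read a token-creation response on stdin, print the secret.
# Accepts either the "sha1" or the "token" field name, as in the command above.
extract_token() {
  local token
  token="$(grep -oE '"(sha1|token)":"[^"]+' | head -n 1 | sed 's/.*":"//')"
  if [ -z "$token" ]; then
    echo "no token field found in response" >&2
    return 1
  fi
  printf '%s\n' "$token"
}
```

With this in place, TOKEN="$(echo "$RESPONSE" | extract_token)" || exit 1 stops the step before an empty value is copied anywhere.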


Step 4 — Wire the token into your laptop

Run on your LAPTOP (not the VM):

# Paste the token from Step 3 here:
TOKEN="<paste-token-from-step-3>"

# Write to the home-network-specific token file
echo -n "$TOKEN" > ~/.gitea_npm_token_home
chmod 600 ~/.gitea_npm_token_home

# Also update the catch-all file (used by some scripts as fallback)
echo -n "$TOKEN" > ~/.gitea_npm_token
chmod 600 ~/.gitea_npm_token

ls -la ~/.gitea_npm_token*

Expected output: two files, both -rw-------, 40 chars.
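The same expectation (mode 600, 40 characters) can be checked mechanically rather than by eye. A sketch, assuming the 40-character length stated above holds for your Gitea version; check_token_file is a hypothetical helper name:

```shell
# Hypothetical helper: succeed only if the file is mode 600 and holds exactly 40 bytes.
check_token_file() {
  local file="$1" mode size
  [ -f "$file" ] || { echo "missing: $file" >&2; return 1; }
  # GNU stat uses -c, BSD/macOS stat uses -f; try one, then the other.
  mode="$(stat -c '%a' "$file" 2>/dev/null || stat -f '%Lp' "$file")"
  # Arithmetic expansion strips the padding some wc implementations add.
  size=$(( $(wc -c < "$file") ))
  if [ "$mode" != "600" ]; then echo "$file: mode $mode, expected 600" >&2; return 1; fi
  if [ "$size" -ne 40 ]; then echo "$file: $size bytes, expected 40" >&2; return 1; fi
  echo "$file: ok"
}
```

Run it against both files, for example check_token_file ~/.gitea_npm_token_home && check_token_file ~/.gitea_npm_token.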


Step 5 — Tell switch-network.sh about the VM hostname

Run on your LAPTOP:

# Replace with your VM's public DNS or IP
echo "<VM_HOST>" > ~/.gitea_vm_host
# Example: echo "bytelyst-vm.eastus.cloudapp.azure.com" > ~/.gitea_vm_host

cat ~/.gitea_vm_host

Then refresh your shell environment so GITEA_NPM_HOST is exported:

source ~/.zshrc
echo "NETWORK=$NETWORK HOST=$GITEA_NPM_HOST OWNER=$GITEA_NPM_OWNER"

If you're on home network, expected output:

NETWORK=home HOST=<VM_HOST> OWNER=learning_ai_user

If NETWORK=corp, the host will be localhost (SSH tunnel mode). That's expected for corp; the VM workflow assumes you're on home network.
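A typo in ~/.gitea_vm_host only surfaces later as a DNS failure, so a quick format check can save a round trip. This sketch assumes the file should hold a bare hostname or IPv4 address with no scheme or port, as in the example above; valid_vm_host is a hypothetical name:

```shell
# Hypothetical helper: accept a bare hostname or IPv4 address; reject
# schemes, ports, paths, whitespace, and empty values.
valid_vm_host() {
  printf '%s' "$1" | grep -Eq '^[A-Za-z0-9]([A-Za-z0-9.-]*[A-Za-z0-9])?$'
}
```

For example: valid_vm_host "$(cat ~/.gitea_vm_host)" || echo "check ~/.gitea_vm_host".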


Step 6 — Pre-flight verification

Run on your LAPTOP:

bash ~/code/mygh/learning_ai_common_plat/scripts/gitea/doctor.sh --probe @bytelyst/errors

Expected output:

✓ NETWORK=home
✓ GITEA_NPM_HOST=<VM_HOST>
✓ GITEA_NPM_OWNER=learning_ai_user
✓ Token consistent (env matches ~/.gitea_npm_token_home, 40 chars)
✓ Registry HTTP 200 on @bytelyst/errors
✗ @bytelyst/errors not found in registry (HTTP 404)

The 404 on @bytelyst/errors is EXPECTED at this point — the registry is empty. We fix that in Step 7.

If you see any other failure (token rejected, registry unreachable, owner 404), debug before moving on:

| Failure | Likely cause | Fix |
| --- | --- | --- |
| Registry unreachable | Port 3300 not open to your laptop | Open NSG inbound rule for TCP 3300 |
| Token rejected (HTTP 401) | Token typo or scope missing | Re-run Step 3 |
| Owner 'learning_ai_user' not found | Step 2 was skipped or used wrong name | Re-run Step 2 |
| DNS does not resolve | ~/.gitea_vm_host has typo | Re-check value |
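The table above can be turned into a small lookup for use in scripts. A sketch only: diagnose_status is a hypothetical helper, it is not part of doctor.sh, and the 404 branch deliberately covers both the owner and the package case because the status code alone cannot tell them apart:

```shell
# Hypothetical helper: map a curl HTTP status (000 means no connection was
# made) to the matching failure and fix from the table.
diagnose_status() {
  case "$1" in
    200) echo "ok" ;;
    000) echo "Registry unreachable: open NSG inbound rule for TCP 3300" ;;
    401) echo "Token rejected: re-run Step 3" ;;
    404) echo "Owner or package not found: re-run Step 2, or publish (Step 7)" ;;
    *)   echo "Unexpected HTTP $1" ;;
  esac
}
```

One way to feed it is diagnose_status "$(curl -s -o /dev/null -w '%{http_code}' "http://$GITEA_NPM_HOST:3300/api/v1/version")".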

Step 7 — Publish @bytelyst/* packages to the new VM

Run on the VM (so we publish from canonical sources, not your laptop):

cd ~/code/mygh/learning_ai_common_plat

# Build all packages first
pnpm install --frozen-lockfile
pnpm build

# Set env so publish targets the local Gitea
export GITEA_NPM_HOST=localhost
export GITEA_NPM_OWNER=learning_ai_user
export GITEA_NPM_TOKEN="<paste-token-from-step-3>"

# Publish every @bytelyst/* package
bash scripts/gitea/publish-local-packages.sh

Takes ~2-3 min for ~60 packages. You'll see one + @bytelyst/<name>@<version> line per package.

If you see npm ERR! 409 Conflict lines, that's fine — those packages were already published (idempotent).
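With around 60 packages it is easy to miss a real failure among the expected conflicts. A sketch that tallies the two line shapes described above; summarize_publish_log is a hypothetical helper and assumes the publish script's output matches those shapes:

```shell
# Hypothetical helper: summarise publish output read from stdin.
# Counts "+ @bytelyst/<name>@<version>" lines as fresh publishes and lines
# containing "409 Conflict" as versions that were already present.
summarize_publish_log() {
  awk '
    /^\+ @bytelyst\//  { published++ }
    /409 Conflict/     { existing++ }
    END { printf "published=%d already_present=%d\n", published, existing }
  '
}
```

For example: bash scripts/gitea/publish-local-packages.sh 2>&1 | tee publish.log | summarize_publish_log. If the two counts do not add up to the number of packages, inspect publish.log for other errors.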


Step 8 — End-to-end verification

Run on your LAPTOP:

# Re-run doctor — package probe should now succeed
bash ~/code/mygh/learning_ai_common_plat/scripts/gitea/doctor.sh --probe @bytelyst/errors

Expected:

✓ @bytelyst/errors resolvable (latest versions: 0.1.10)
✅ All Gitea pre-flight checks passed

Then smoke-test a real product install:

cd ~/code/mygh/learning_ai_notes
rm -rf node_modules backend/node_modules web/node_modules
pnpm install

Should complete in ~15-30s. If you see ERR_PNPM_NO_MATCHING_VERSION, you've hit the historical-version gap — proceed to Step 9.


Step 9 — (Optional) Backfill historical versions

Some @bytelyst/* packages pin older versions transitively (e.g. @bytelyst/auth@0.1.5 pins @bytelyst/errors@0.1.5). The publish script only publishes the current version of each package; older versions need backfilling.

Run on the VM:

cd ~/code/mygh/learning_ai_common_plat
export GITEA_NPM_HOST=localhost
export GITEA_NPM_OWNER=learning_ai_user
export GITEA_NPM_TOKEN="<paste-token-from-step-3>"

bash scripts/gitea/publish-outdated-packages.sh

This walks pnpm view <pkg> versions --json for every @bytelyst/* package, checks out the matching git tag, builds, and publishes any version not yet in the registry. Slow (~10-15 min for full backfill) but only runs once.

After this, the learning_ai_notes install should complete without errors.
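At its core the backfill is a set difference: versions that should exist minus versions already in the registry. The sketch below shows only that step and is not the actual publish-outdated-packages.sh; missing_versions and the two input files are hypothetical:

```shell
# Hypothetical helper: print versions listed in file $1 (wanted, one per line)
# that do not appear in file $2 (already published, one per line).
missing_versions() {
  # Nothing published yet: every wanted version is missing.
  if [ ! -s "$2" ]; then cat "$1"; return 0; fi
  # -F fixed strings, -x whole-line match, -v invert, -f patterns from file.
  # grep exits 1 when nothing is missing, so mask that with "|| true".
  grep -Fxv -f "$2" "$1" || true
}
```

Whole-line fixed-string matching matters here: without -x, a published 0.1.10 would also hide a missing 0.1.1.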


Step 10 — Persist environment for future shells

Confirm on your laptop that a new shell picks up the variables exported by switch-network.sh:

# Already in switch-network.sh, but verify they're picked up
echo $GITEA_NPM_HOST     # should be your VM hostname on home network
echo $GITEA_NPM_OWNER    # should be learning_ai_user
echo $GITEA_NPM_TOKEN    # should be the 40-char token

If any are missing, ensure your ~/.zshrc has:

export NETWORK=home   # or corp; switch-network.sh keys on this
source "$HOME/code/mygh/learning_ai_common_plat/scripts/switch-network.sh"
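Echoing $GITEA_NPM_TOKEN in full leaves the secret in terminal scrollback. A sketch of a safer check that prints only the length and the last four characters; mask_token is a hypothetical helper:

```shell
# Hypothetical helper: describe a token without printing it in full.
mask_token() {
  local t="$1" n tail
  n="${#t}"
  if [ "$n" -eq 0 ]; then echo "unset"; return 1; fi
  if [ "$n" -le 4 ]; then echo "${n} chars"; return 0; fi
  # Keep only the final four characters for comparison against the token file.
  tail="$(printf '%s' "$t" | cut -c "$((n - 3))-")"
  echo "${n} chars, ends in ${tail}"
}
```

mask_token "$GITEA_NPM_TOKEN" is then enough to confirm the 40-character token is loaded.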

Step 11 — Gitea Actions runner (CI)

The npm registry (Steps 1-10) is independent of CI. To run the docker-lint job (and the rest of each repo's .gitea/workflows/*.yml) you need an Actions runner registered against Gitea. This section makes that reproducible; the original runner was registered by hand.

11.1 — Enable Actions on the instance

In app.ini (inside the Gitea container/conf dir):

[actions]
ENABLED = true

Then sudo docker restart gitea. Confirm Gitea is back up with curl -fsS http://localhost:3300/api/v1/version, and check that the repo's Settings → Actions toggle is available.

11.2 — Install and register the runner

# macOS host runner (laptop) — install
brew install act_runner

# Register reproducibly (fetches a registration token via the admin API,
# registers with the agreed labels + capacity). Host mode is the default.
GITEA_ADMIN_USER=gitea-admin GITEA_ADMIN_PASS='<admin-pass>' \
  bash scripts/gitea/register-runner.sh --name bytelyst-mac --capacity 2

# Containerized runner (better isolation; requires Docker on the host):
GITEA_ADMIN_USER=gitea-admin GITEA_ADMIN_PASS='<admin-pass>' \
  bash scripts/gitea/register-runner.sh --mode docker --capacity 2

register-runner.sh is idempotent: if a runner is already registered it prints the current identity and exits. Pass --force to re-register (this invalidates the old runner row in Gitea).

11.3 — Host mode vs. containerized mode

| | Host mode (ubuntu-latest:host) | Docker mode (ubuntu-latest:docker://…) |
| --- | --- | --- |
| Isolation | None (jobs run directly on macOS) | Each job in a fresh container |
| Speed | Fast (no image pull) | Slower first run (pulls catthehacker/ubuntu) |
| Reproducibility | Depends on host toolchain | Pinned image, matches GitHub closely |
| Best for | Single-operator laptop / corp proxy | Shared/VM runners, untrusted PRs |

We run host mode on the laptop because combining the corp proxy with Docker-in-Docker is fragile, and the jobs are trusted (own repos). Prefer docker mode on the VM.

11.4 — Secrets: never inline the token

The npm token must not live inline in config.yaml. Externalise it into a gitignored runner.env referenced by runner.env_file:

# /opt/homebrew/etc/act_runner/config.yaml
runner:
  capacity: 2 # parallel jobs
  env_file: '/opt/homebrew/etc/act_runner/runner.env'
  envs:
    # GITEA_NPM_TOKEN is loaded from env_file — never inline here.
    NODE_ENV: test

# /opt/homebrew/etc/act_runner/runner.env  (chmod 600, never committed)
GITEA_NPM_TOKEN=<token>

After editing config, reload the daemon:

brew services restart act_runner            # or, if brew name mismatches:
launchctl kickstart -k gui/$(id -u)/homebrew.mxcl.act_runner
tail -f /opt/homebrew/var/log/act_runner.log   # expect "declare successfully"

11.5 — Concurrency

runner.capacity controls parallel jobs on one runner. With capacity: 1 the lightweight docker-lint job queues behind slow backend/web/mobile/E2E jobs (observed: ~13 min wait). capacity: 2 lets docker-lint run alongside one heavy job. For more parallelism, register additional runners rather than pushing capacity high on a single laptop — each runner gets its own workdir and process, so failures/timeouts stay isolated.

Add more host runners (reproducible):

# Stand up runners #2 and #3 (each capacity 2) as their own launchd services.
# Shares the canonical runner.env token; separate config/.runner/workdir.
bash scripts/gitea/add-host-runner.sh 2 2
bash scripts/gitea/add-host-runner.sh 3 2

add-host-runner.sh <N> [capacity]:

  • derives a per-runner config.yaml from the canonical one (preserves proxy env + env_file), overriding runner.file, runner.capacity, and a unique host.workdir_parent (~/.cache/act-<N>)
  • fetches a one-time registration token via the admin API, authenticating with the PAT in ~/.gitea_c5_pat
  • registers as $(hostname -s)-<N> with host-mode labels
  • writes + loads ~/Library/LaunchAgents/com.bytelyst.act_runner-<N>.plist (RunAtLoad + KeepAlive)
  • idempotent: re-running just reloads the service

The Homebrew act_runner service is runner #1; add-host-runner.sh adds #2, #3, … Current fleet: 3 host runners × capacity 2 = 6 parallel host slots. Verified: pushing a multi-job workflow distributes jobs across all three runners simultaneously.
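Total parallel slots are simply the sum of each runner's capacity, which is worth recomputing whenever a runner is added or resized. A trivial sketch; fleet_slots is a hypothetical helper:

```shell
# Hypothetical helper: sum the per-runner capacities passed as arguments.
fleet_slots() {
  local total=0 c
  for c in "$@"; do
    total=$((total + c))
  done
  echo "$total"
}
```

Three host runners registered with capacity 2, as in the add-host-runner.sh commands above, give fleet_slots 2 2 2, which prints 6.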

Add a docker-mode runner (stronger isolation):

# Pull the act image once (≈2.3 GB; works through the corp proxy):
docker pull catthehacker/ubuntu:act-latest

# Stand up runner #4 in docker mode (capacity 1):
bash scripts/gitea/add-host-runner.sh 4 1 docker

Docker mode advertises a dedicated docker label (not ubuntu-latest), so it does not hijack the host-mode ubuntu-latest jobs. Opt a job in with runs-on: docker. The generated config:

  • sets runner.labels to docker:docker://<image> (act_runner reads labels from the config file, not register --labels)
  • container.docker_host: "-", force_pull: false (use the locally-pulled image), options: --add-host=host.docker.internal:host-gateway
  • adds host.docker.internal to NO_PROXY/no_proxy — without this, containerized jobs inherit the corp proxy env and route host.docker.internal:3300 through the proxy, getting an HTTP 504. Jobs must reach Gitea via host.docker.internal:3300 (not localhost) from inside the container.
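The NO_PROXY change in the last bullet has to be idempotent, otherwise re-running the script keeps appending the same host. A sketch of one way to do that; add_no_proxy is a hypothetical helper and not necessarily how add-host-runner.sh implements it:

```shell
# Hypothetical helper: append host $2 to the comma-separated list $1 unless it
# is already present; prints the resulting list.
add_no_proxy() {
  local list="$1" host="$2"
  case ",$list," in
    *",$host,"*) printf '%s\n' "$list" ;;        # already listed: unchanged
    ",,")        printf '%s\n' "$host" ;;        # empty list: host only
    *)           printf '%s\n' "$list,$host" ;;  # otherwise append
  esac
}
```

For example: NO_PROXY="$(add_no_proxy "$NO_PROXY" host.docker.internal)", and the same for no_proxy.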

Validated end-to-end: a runs-on: docker job runs in an Ubuntu 24.04 container and reaches Gitea (GET /api/v1/version → {"version":"…"}).

Migrating a real job onto the docker runner. The host-clone CI model (working-directory: /Users/… + git pull) does not work in a container. A containerized job must instead clone what it needs and reach Gitea via host.docker.internal. Reference pattern (the docker-lint job in learning_ai_clock):

docker-lint:
  runs-on: docker
  env:
    GITEA_NPM_HOST: host.docker.internal
    GITEA_NPM_OWNER: learning_ai_user
    GITEA_NPM_TOKEN: ${{ secrets.NPM_REGISTRY_TOKEN }} # GITEA_-prefixed names are reserved
    NO_PROXY: host.docker.internal,localhost,127.0.0.1
    no_proxy: host.docker.internal,localhost,127.0.0.1
  steps:
    - name: Fetch repo + canonical doctor scripts
      run: |
        G="http://host.docker.internal:3300/learning_ai_user"
        git clone --depth 1 "$G/${GITHUB_REPOSITORY##*/}.git" repo
        git clone --depth 1 "$G/learning_ai_common_plat.git" common-plat
    - name: gitea-doctor
      run: |
        printf '%s' "$GITEA_NPM_TOKEN" > "$HOME/.gitea_npm_token"   # doctor.sh needs a token file
        bash common-plat/scripts/gitea/doctor.sh --quiet
    - name: docker-doctor
      run: bash common-plat/scripts/docker-doctor.sh --repo "$PWD/repo" --quiet

Gotchas that cost real debugging time (all now handled by add-host-runner.sh docker mode + this pattern):

  1. Secret, not file. Job containers do not see the runner's ~/.gitea_npm_token. Provide the registry token as a Gitea Actions secret (NPM_REGISTRY_TOKEN, set at repo or user level via PUT /api/v1/repos/{owner}/{repo}/actions/secrets/{name}). GITEA_/GITHUB_ prefixes are reserved → secret create returns HTTP 400.
  2. doctor.sh requires a token file. Its stale-shell check errors if ~/.gitea_npm_token is absent even when the env token is set — so the job writes the secret to the file first.
  3. Host envs leak into the container. The runner injects HOME, PATH, PNPM_HOME (macOS paths) which override workflow env: and break $HOME-relative writes. add-host-runner.sh docker mode strips these so the container uses its image defaults (/root).
  4. env_file token overrides the secret. The runner's runner.env GITEA_NPM_TOKEN is injected into jobs and wins over ${{ secrets.* }}. Keep runner.env in sync with the current token (see §11.6) — a stale value there causes registry HTTP 401 even with a correct secret.
  5. Proxy bypass. host.docker.internal must be in NO_PROXY/no_proxy (handled by the runner config) or the corp proxy returns HTTP 504.
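Gotcha 1 is cheap to catch before calling the API. A sketch of a pre-check for the reserved prefixes named above; valid_secret_name is a hypothetical helper and covers only those two prefixes, not every rule Gitea may enforce:

```shell
# Hypothetical helper: reject secret names that are empty or start with a
# reserved prefix (GITEA_ or GITHUB_).
valid_secret_name() {
  case "$1" in
    GITEA_*|GITHUB_*) echo "reserved prefix: $1" >&2; return 1 ;;
    "")               echo "empty secret name" >&2; return 1 ;;
    *)                return 0 ;;
  esac
}
```

valid_secret_name NPM_REGISTRY_TOKEN succeeds, while valid_secret_name GITEA_NPM_TOKEN fails before any HTTP 400 is returned.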

Validated: docker-lint runs green on the docker runner (M-…-4) — git clone over host.docker.internal, gitea-doctor registry probe → 200, docker-doctor lint pass.

List + prune the fleet:

PAT=$(cat ~/.gitea_c5_pat)
curl -s -H "Authorization: token $PAT" \
  http://localhost:3300/api/v1/admin/actions/runners \
  | python3 -c "import json,sys; [print(r['id'], r['name'], r['status']) for r in json.load(sys.stdin)['runners']]"
# Delete a stale/offline runner by id:
curl -s -X DELETE -H "Authorization: token $PAT" \
  http://localhost:3300/api/v1/admin/actions/runners/<id>

Remove an extra runner entirely:

launchctl bootout "gui/$(id -u)/com.bytelyst.act_runner-2" 2>/dev/null || true
rm -f ~/Library/LaunchAgents/com.bytelyst.act_runner-2.plist
rm -rf "$HOME/Library/Application Support/act_runner-2" ~/.cache/act-2
# then DELETE its row via the admin API (above)

11.6 — Runner token rotation

The registration token (used once at register time) is separate from the npm token (used by jobs). To rotate:

# Registration token — just re-register; old token is single-use anyway:
bash scripts/gitea/register-runner.sh --force --name bytelyst-mac

# npm token — rotate via the existing helper, then update runner.env:
bash scripts/gitea/token.sh rotate
TOKEN=$(cat ~/.gitea_npm_token)
printf 'GITEA_NPM_TOKEN=%s\n' "$TOKEN" > /opt/homebrew/etc/act_runner/runner.env
chmod 600 /opt/homebrew/etc/act_runner/runner.env
brew services restart act_runner
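When runner.env does not exist yet, the redirect above creates it under the default umask and only then tightens it with chmod. A sketch that writes a private temp file and renames it into place, so the token is never readable by others and a failed write cannot leave a half-written file; write_runner_env is a hypothetical helper:

```shell
# Hypothetical helper: write GITEA_NPM_TOKEN=<token> to path $2 with mode 600.
write_runner_env() {
  local token="$1" dest="$2" tmp
  [ -n "$token" ] || { echo "refusing to write an empty token" >&2; return 1; }
  # Create the temp file next to the destination so mv stays on one filesystem.
  tmp="$(mktemp "${dest}.XXXXXX")" || return 1
  # mktemp already creates the file as 600; chmod makes that explicit.
  chmod 600 "$tmp"
  printf 'GITEA_NPM_TOKEN=%s\n' "$token" > "$tmp"
  mv "$tmp" "$dest"
}
```

Usage would be write_runner_env "$(cat ~/.gitea_npm_token)" /opt/homebrew/etc/act_runner/runner.env, followed by the same service restart.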

11.7 — Verify CI end-to-end

Push any repo that has a docker-lint job, then:

PAT=$(cat ~/.gitea_c5_pat)
R=learning_ai_clock
RID=$(curl -s -H "Authorization: token $PAT" \
  "http://localhost:3300/api/v1/repos/learning_ai_user/$R/actions/runs?limit=1" \
  | python3 -c "import json,sys; print(json.load(sys.stdin)['workflow_runs'][0]['id'])")
curl -s -H "Authorization: token $PAT" \
  "http://localhost:3300/api/v1/repos/learning_ai_user/$R/actions/runs/$RID/jobs" \
  | python3 -c "import json,sys; [print(j['status'], j.get('conclusion'), j['name']) for j in json.load(sys.stdin)['jobs']]"
# Expect: completed success Docker lint — gitea-doctor + docker-doctor

Troubleshooting

Doctor reports STALE TOKEN: env GITEA_NPM_TOKEN ≠ file

Your shell has an old token cached. Fix:

source ~/.zshrc
# Or to refresh just the token without sourcing everything:
eval "$(bash ~/code/mygh/learning_ai_common_plat/scripts/gitea/token.sh print --export)"

pnpm install fails with EAI_AGAIN or ETIMEDOUT

DNS or network. Verify:

ping <VM_HOST>
nslookup <VM_HOST>
curl -v http://<VM_HOST>:3300/api/v1/version

Need to rotate the token

# On laptop (assumes Keychain entry from Step 1):
bash ~/code/mygh/learning_ai_common_plat/scripts/gitea/token.sh rotate

Or manually re-run Step 3 on the VM.

Need to start over (nuke the Gitea data)

On the VM (destructive):

sudo docker stop gitea
sudo rm -rf /var/lib/gitea/*    # or wherever your data volume is
sudo docker start gitea
# Then re-run Steps 1-7

What's persistent vs. ephemeral

Item Where Survives VM reboot? Survives VM rebuild?
Gitea database /var/lib/gitea/data/gitea.db (snapshot the disk)
Published packages /var/lib/gitea/data/packages/ (re-publish via Step 7)
Admin/npm users inside Gitea DB (re-run Steps 1-2)
NPM tokens inside Gitea DB + your ~/.gitea_npm_token_home (re-run Step 3)
~/.gitea_vm_host your laptop n/a
~/.gitea_npm_token_home your laptop n/a
Actions runner registration Gitea DB + .runner file (re-run register-runner.sh)
Runner secrets act_runner/runner.env (chmod 600) (recreate from token)

For VM rebuilds: take a weekly Azure disk snapshot of /var/lib/gitea and restore it on rebuild. This avoids re-running Steps 1-7.


See also

  • scripts/gitea/doctor.sh — pre-flight validation (run before every deploy)
  • scripts/gitea/token.sh — token rotation helper
  • scripts/gitea/register-runner.sh — reproducible Actions runner registration (Step 11)
  • scripts/gitea/bootstrap-vm.sh — automates Steps 1-3 on a fresh VM
  • scripts/switch-network.sh — exports GITEA_NPM_* env vars per network
  • docker-build-optimization-roadmap.md (in learning_ai_devops_tools/docs/) — ecosystem-wide Docker build hardening that depends on this setup