learning_ai_common_plat/docs/runbooks/GITEA_VM_SETUP.md
saravanakumardb1 7e1a2ad660 feat(gitea): add-host-runner.sh for multi-runner CI parallelism
- scripts/gitea/add-host-runner.sh: stand up Nth independent host-mode runner
  as its own launchd service (separate config/.runner/workdir, shared
  runner.env token, admin-API registration token, idempotent reload)
- GITEA_VM_SETUP.md 11.5: document multi-runner setup, fleet list/prune,
  and removal; 3 runners x capacity 2 ~= 6 parallel slots (verified)

Live fleet: learning-ai-mac (brew) + 2 added runners, all online; stale
offline registrations pruned.
2026-05-28 18:31:57 -07:00


Gitea Cloud VM Setup — Runbook

Status: Active runbook · Last verified: 2026-05-28

Use this when: You have provisioned a cloud VM (Azure / wherever), Gitea is installed and running on :3300, repos are cloned, and you need to wire the npm registry end-to-end with your laptop.

Assumes you're SSH'd into the VM (or running commands on the VM) and the sibling learning_ai_common_plat repo is at ~/code/mygh/learning_ai_common_plat/ on both the VM and your laptop. Adjust paths as needed.


Prerequisites checklist

Before starting, confirm all of these on the VM:

# 1. Gitea container is running and healthy
sudo docker ps | grep gitea
curl -fsS http://localhost:3300/api/v1/version
# Expected: {"version":"1.X.X"}

# 2. Port 3300 is reachable from your laptop
#    (run this FROM YOUR LAPTOP, not the VM)
#    curl -fsS http://<VM_HOST>:3300/api/v1/version

# 3. Repos cloned on the VM
ls ~/code/mygh/learning_ai_common_plat
# Expected: packages/  services/  scripts/  ...

If any of these fail, fix them first. Common gotchas:

  • Port 3300 blocked: Azure NSG → VM → Networking → "Add inbound port rule" → TCP 3300 from your home IP
  • Gitea registry disabled: Edit app.ini inside /var/lib/gitea/conf/, add a [packages] section containing ENABLED = true, then run sudo docker restart gitea
  • Hostname mismatch: inside the VM the hostname resolves to localhost, but from your laptop you reach the VM via its public DNS name. Note both; API calls made from inside the VM only need localhost

Step 1 — Create Gitea admin user (skip if you already have one)

Run on the VM:

# Check if any admin exists
sudo docker exec gitea gitea admin user list

# If empty (or you don't have admin creds), create one:
ADMIN_USER="gitea-admin"
ADMIN_PASS="$(openssl rand -base64 24 | tr -dc 'A-Za-z0-9' | head -c 24)"
echo "  Admin password (save this!): $ADMIN_PASS"

sudo docker exec gitea gitea admin user create \
  --username "$ADMIN_USER" \
  --password "$ADMIN_PASS" \
  --email "admin@bytelyst.local" \
  --admin \
  --must-change-password=false

SAVE the admin password somewhere safe (1Password / Bitwarden / macOS Keychain). You'll need it only for token rotation and bootstrapping; day-to-day work uses the npm token.

Optionally store in macOS Keychain on your laptop:

# Run on your LAPTOP (Mac)
security add-generic-password \
  -s 'gitea-admin' \
  -a 'gitea-admin' \
  -w '<paste-the-password>'

Then scripts/gitea/token.sh rotate can auto-discover it later.


Step 2 — Create the npm owner user

The npm registry is namespaced by owner. The canonical owner is learning_ai_user.

Run on the VM:

ADMIN_USER="gitea-admin"
ADMIN_PASS="<paste-admin-password-from-step-1>"
NPM_USER="learning_ai_user"
NPM_PASS="$(openssl rand -base64 24 | tr -dc 'A-Za-z0-9' | head -c 24)"

# Create the user via admin API
curl -fsS -u "$ADMIN_USER:$ADMIN_PASS" \
  -X POST "http://localhost:3300/api/v1/admin/users" \
  -H 'Content-Type: application/json' \
  -d "{
    \"username\": \"$NPM_USER\",
    \"email\": \"npm@bytelyst.local\",
    \"password\": \"$NPM_PASS\",
    \"must_change_password\": false
  }"
echo ""
echo "  NPM user '$NPM_USER' created"
echo "  NPM password (needed only to mint tokens): $NPM_PASS"

Save NPM_PASS temporarily — you need it for Step 3. After that the npm token replaces it for all day-to-day use.

If you see {"message":"user already exists"} — that's fine, skip ahead. Use the existing NPM_PASS (or reset it via sudo docker exec gitea gitea admin user change-password --username learning_ai_user --password <new>).


Step 3 — Mint the npm token

Run on the VM:

NPM_USER="learning_ai_user"
NPM_PASS="<paste-from-step-2>"
TOKEN_NAME="npm-$(date +%Y%m%d-%H%M%S)-$(hostname -s)"

RESPONSE=$(curl -fsS -u "$NPM_USER:$NPM_PASS" \
  -X POST "http://localhost:3300/api/v1/users/$NPM_USER/tokens" \
  -H 'Content-Type: application/json' \
  -d "{
    \"name\": \"$TOKEN_NAME\",
    \"scopes\": [\"write:package\", \"read:package\"]
  }")
echo "$RESPONSE"

# Extract token (works for both newer "token" and older "sha1" field names)
TOKEN=$(echo "$RESPONSE" | grep -oE '"(sha1|token)":"[^"]+' | head -1 | sed 's/.*":"//')
echo ""
echo "════════════════════════════════════════"
echo "  NPM TOKEN (copy this NOW):"
echo "  $TOKEN"
echo "════════════════════════════════════════"

Copy the token immediately. Gitea never displays a token's secret value after the first response.
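Because the secret is shown only once, it can help to validate the extracted value before moving on. The following is a minimal sketch, not part of the repo: `extract_token` and `is_valid_token` are hypothetical helper names, the sample response is made up, and the 40-lowercase-hex-character check is an assumption based on the 40-char tokens this runbook expects.

```shell
# Hypothetical helper: pull the secret out of a Gitea token-creation response.
# Assumes the secret is in a "sha1" or "token" field, as the grep above does.
extract_token() {
  printf '%s' "$1" \
    | grep -oE '"(sha1|token)":"[^"]+' \
    | head -n 1 \
    | sed 's/.*":"//'
}

# Assumption: a valid token is exactly 40 lowercase hex characters.
is_valid_token() {
  printf '%s' "$1" | grep -qE '^[0-9a-f]{40}$'
}

# Made-up sample response for illustration only; not a real token.
SAMPLE='{"id":1,"name":"npm-demo","sha1":"0123456789abcdef0123456789abcdef01234567"}'
TOKEN="$(extract_token "$SAMPLE")"
if is_valid_token "$TOKEN"; then
  echo "token looks valid (${#TOKEN} chars)"
else
  echo "token missing or malformed; re-check the API response" >&2
fi
```

In real use, pass "$RESPONSE" from the curl call instead of SAMPLE. An empty or malformed result usually means the API returned an error body rather than a token.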


Step 4 — Wire the token into your laptop

Run on your LAPTOP (not the VM):

# Paste the token from Step 3 here:
TOKEN="<paste-token-from-step-3>"

# Write to the home-network-specific token file
echo -n "$TOKEN" > ~/.gitea_npm_token_home
chmod 600 ~/.gitea_npm_token_home

# Also update the catch-all file (used by some scripts as fallback)
echo -n "$TOKEN" > ~/.gitea_npm_token
chmod 600 ~/.gitea_npm_token

ls -la ~/.gitea_npm_token*

Expected output: two files, both -rw-------, 40 chars.
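The write-then-chmod sequence above leaves a brief window in which the file exists with default permissions. A minimal sketch of a tighter variant follows, assuming a POSIX shell; `write_token_file` and `token_file_summary` are hypothetical helpers, and the demo uses a throwaway file with a made-up value rather than your real token.

```shell
# Hypothetical helper: write a secret to a file that is never world-readable.
write_token_file() {
  # $1 = destination path, $2 = token value
  (
    umask 077                    # new files are created as rw-------
    printf '%s' "$2" > "$1"      # printf avoids echo -n portability quirks
  )
  chmod 600 "$1"                 # also tightens a pre-existing file
}

# Print "<permissions> <byte count>" for a quick visual check.
token_file_summary() {
  perms="$(ls -l "$1" | cut -c1-10)"
  size="$(wc -c < "$1" | tr -d ' ')"
  echo "$perms $size"
}

# Demo against a throwaway file with a made-up 40-character value.
DEMO_FILE="$(mktemp)"
write_token_file "$DEMO_FILE" "0123456789abcdef0123456789abcdef01234567"
token_file_summary "$DEMO_FILE"   # prints: -rw------- 40
```

For the real files, call `write_token_file ~/.gitea_npm_token_home "$TOKEN"` and repeat for ~/.gitea_npm_token.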


Step 5 — Tell switch-network.sh about the VM hostname

Run on your LAPTOP:

# Replace with your VM's public DNS or IP
echo "<VM_HOST>" > ~/.gitea_vm_host
# Example: echo "bytelyst-vm.eastus.cloudapp.azure.com" > ~/.gitea_vm_host

cat ~/.gitea_vm_host

Then refresh your shell environment so GITEA_NPM_HOST is exported:

source ~/.zshrc
echo "NETWORK=$NETWORK HOST=$GITEA_NPM_HOST OWNER=$GITEA_NPM_OWNER"

If you're on the home network, expected output:

NETWORK=home HOST=<VM_HOST> OWNER=learning_ai_user

If NETWORK=corp, the host will be localhost (SSH tunnel mode). That's expected for corp; the VM workflow assumes you're on the home network.
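The network-to-host mapping described here can be sketched as a small function. This is an illustration of the behaviour, not the contents of switch-network.sh: the name `resolve_gitea_host`, its arguments, and the error messages are all assumptions.

```shell
# Hypothetical sketch of the network-to-host mapping described above.
# This is NOT the real switch-network.sh; the function name and the
# optional host-file argument are assumptions made for illustration.
resolve_gitea_host() {
  # $1 = network name (home|corp), $2 = path to the VM host file
  network="$1"
  host_file="${2:-$HOME/.gitea_vm_host}"
  case "$network" in
    corp)
      echo "localhost"            # SSH tunnel mode
      ;;
    home)
      if [ -s "$host_file" ]; then
        tr -d '[:space:]' < "$host_file"   # strip the trailing newline
        echo
      else
        echo "missing or empty $host_file" >&2
        return 1
      fi
      ;;
    *)
      echo "unknown NETWORK '$network' (expected home or corp)" >&2
      return 1
      ;;
  esac
}

# Demo with a throwaway host file and a placeholder hostname.
DEMO_HOST_FILE="$(mktemp)"
echo "vm.example.test" > "$DEMO_HOST_FILE"
resolve_gitea_host home "$DEMO_HOST_FILE"   # prints: vm.example.test
resolve_gitea_host corp "$DEMO_HOST_FILE"   # prints: localhost
```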


Step 6 — Pre-flight verification

Run on your LAPTOP:

bash ~/code/mygh/learning_ai_common_plat/scripts/gitea/doctor.sh --probe @bytelyst/errors

Expected output:

✓ NETWORK=home
✓ GITEA_NPM_HOST=<VM_HOST>
✓ GITEA_NPM_OWNER=learning_ai_user
✓ Token consistent (env matches ~/.gitea_npm_token_home, 40 chars)
✓ Registry HTTP 200 on @bytelyst/errors
✗ @bytelyst/errors not found in registry (HTTP 404)

The 404 on @bytelyst/errors is EXPECTED at this point — the registry is empty. We fix that in Step 7.

If you see any other failure (token rejected, registry unreachable, owner 404), debug before moving on:

| Failure | Likely cause | Fix |
|---|---|---|
| Registry unreachable | Port 3300 not open to your laptop | Open NSG inbound rule for TCP 3300 |
| Token rejected (HTTP 401) | Token typo or scope missing | Re-run Step 3 |
| Owner 'learning_ai_user' not found | Step 2 was skipped or used wrong name | Re-run Step 2 |
| DNS does not resolve | ~/.gitea_vm_host has typo | Re-check value |
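If doctor.sh is unavailable, the failure modes above can be probed by hand. The sketch below is hypothetical and is not how doctor.sh is implemented: `registry_url` and `diagnose_status` are made-up names, and the URL layout (/api/packages/<owner>/npm/ with the scope separator encoded as %2F) is an assumption based on Gitea's documented npm registry path.

```shell
# Hypothetical helpers; names and URL layout are assumptions, not doctor.sh.
# Build the metadata URL for a package in a Gitea npm registry.
registry_url() {
  # $1 = host, $2 = owner, $3 = package name (e.g. @bytelyst/errors)
  pkg="$(printf '%s' "$3" | sed 's|/|%2F|')"   # encode the scope separator
  echo "http://$1:3300/api/packages/$2/npm/$pkg"
}

# Map a curl HTTP status (000 = no connection) to a next step.
diagnose_status() {
  case "$1" in
    200) echo "ok" ;;
    000) echo "unreachable: check DNS and the inbound rule for TCP 3300" ;;
    401) echo "token rejected: re-run Step 3" ;;
    404) echo "not found: owner missing (Step 2) or package unpublished (Step 7)" ;;
    *)   echo "unexpected HTTP $1" ;;
  esac
}

registry_url "localhost" "learning_ai_user" "@bytelyst/errors"
diagnose_status 401

# Live usage (needs network access and a token):
# code="$(curl -s -o /dev/null -w '%{http_code}' \
#   -H "Authorization: Bearer $GITEA_NPM_TOKEN" \
#   "$(registry_url "$GITEA_NPM_HOST" "$GITEA_NPM_OWNER" "@bytelyst/errors")")"
# diagnose_status "$code"
```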

Step 7 — Publish @bytelyst/* packages to the new VM

Run on the VM (so we publish from canonical sources, not your laptop):

cd ~/code/mygh/learning_ai_common_plat

# Build all packages first
pnpm install --frozen-lockfile
pnpm build

# Set env so publish targets the local Gitea
export GITEA_NPM_HOST=localhost
export GITEA_NPM_OWNER=learning_ai_user
export GITEA_NPM_TOKEN="<paste-token-from-step-3>"

# Publish every @bytelyst/* package
bash scripts/gitea/publish-local-packages.sh

Takes ~2-3 min for ~60 packages. You'll see one + @bytelyst/<name>@<version> line per package.

If you see npm ERR! 409 Conflict lines, that's fine — those packages were already published (idempotent).
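For reference, npm clients usually reach a Gitea registry through two .npmrc entries: a scope-to-registry mapping and an auth token keyed by the registry URL. The sketch below prints both lines, assuming the /api/packages/<owner>/npm/ layout that Gitea documents and plain HTTP on port 3300. `render_npmrc` is a hypothetical helper, and publish-local-packages.sh may configure npm differently.

```shell
# Hypothetical helper: print .npmrc lines for a scoped Gitea npm registry.
# Assumes plain HTTP on port 3300, as used throughout this runbook.
render_npmrc() {
  # $1 = host, $2 = owner, $3 = token
  base="$1:3300/api/packages/$2/npm/"
  echo "@bytelyst:registry=http://$base"
  echo "//$base:_authToken=$3"
}

# Placeholder token for illustration; use the real one from Step 3.
render_npmrc "localhost" "learning_ai_user" "<token>"
```

Comparing this output with the .npmrc that a failing publish actually used is a quick way to spot a wrong owner or host.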


Step 8 — End-to-end verification

Run on your LAPTOP:

# Re-run doctor — package probe should now succeed
bash ~/code/mygh/learning_ai_common_plat/scripts/gitea/doctor.sh --probe @bytelyst/errors

Expected:

✓ @bytelyst/errors resolvable (latest versions: 0.1.10)
✅ All Gitea pre-flight checks passed

Then smoke-test a real product install:

cd ~/code/mygh/learning_ai_notes
rm -rf node_modules backend/node_modules web/node_modules
pnpm install

Should complete in ~15-30s. If you see ERR_PNPM_NO_MATCHING_VERSION, you've hit the historical-version gap — proceed to Step 9.


Step 9 — (Optional) Backfill historical versions

Some @bytelyst/* packages pin older versions transitively (e.g. @bytelyst/auth@0.1.5 pins @bytelyst/errors@0.1.5). The publish script only publishes the current version of each package; older versions need backfilling.

Run on the VM:

cd ~/code/mygh/learning_ai_common_plat
export GITEA_NPM_HOST=localhost
export GITEA_NPM_OWNER=learning_ai_user
export GITEA_NPM_TOKEN="<paste-token-from-step-3>"

bash scripts/gitea/publish-outdated-packages.sh

This walks pnpm view <pkg> versions --json for every @bytelyst/* package, checks out the matching git tag, builds, and publishes any version not yet in the registry. Slow (~10-15 min for full backfill) but only runs once.
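The core of a backfill is a set difference: versions that exist minus versions already in the registry. A minimal sketch of that step follows, using made-up version lists. `missing_versions` is a hypothetical helper and does not reflect how publish-outdated-packages.sh is actually written.

```shell
# Hypothetical helper, not the real publish-outdated-packages.sh.
# Prints versions present in the first list but absent from the second.
missing_versions() {
  # $1 = file with all known versions, one per line
  # $2 = file with versions already in the registry, one per line
  sorted_all="$(mktemp)"
  sorted_pub="$(mktemp)"
  LC_ALL=C sort -u "$1" > "$sorted_all"
  LC_ALL=C sort -u "$2" > "$sorted_pub"
  LC_ALL=C comm -23 "$sorted_all" "$sorted_pub"   # lines unique to the first file
  rm -f "$sorted_all" "$sorted_pub"
}

# Demo with made-up version lists.
ALL="$(mktemp)"; PUBLISHED="$(mktemp)"
printf '%s\n' 0.1.5 0.1.9 0.1.10 > "$ALL"
printf '%s\n' 0.1.10 > "$PUBLISHED"
missing_versions "$ALL" "$PUBLISHED"   # prints 0.1.5 then 0.1.9
```

The output order is plain byte order, not semver order, which is fine for deciding what to publish but not for picking a "latest" version.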

After this, the learning_ai_notes install should complete without errors.


Step 10 — Persist environment for future shells

If your laptop's ~/.zshrc already sources switch-network.sh, nothing new needs to be added. Open a fresh shell and confirm the variables are exported:

# Already in switch-network.sh, but verify they're picked up
echo $GITEA_NPM_HOST     # should be your VM hostname on home network
echo $GITEA_NPM_OWNER    # should be learning_ai_user
echo $GITEA_NPM_TOKEN    # should be the 40-char token

If any are missing, ensure your ~/.zshrc has:

export NETWORK=home   # or corp; switch-network.sh keys on this
source "$HOME/code/mygh/learning_ai_common_plat/scripts/switch-network.sh"
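Echoing GITEA_NPM_TOKEN puts the secret into your terminal scrollback. A minimal sketch of the same check that reports only the token's length is shown below; `check_gitea_env` is a hypothetical helper, not something the repo provides.

```shell
# Hypothetical helper: report GITEA_NPM_* variables without leaking the token.
check_gitea_env() {
  rc=0
  for name in GITEA_NPM_HOST GITEA_NPM_OWNER GITEA_NPM_TOKEN; do
    eval "value=\${$name:-}"      # indirect lookup that works in sh, bash, zsh
    if [ -z "$value" ]; then
      echo "MISSING $name"
      rc=1
    elif [ "$name" = "GITEA_NPM_TOKEN" ]; then
      echo "ok      $name (${#value} chars, value hidden)"
    else
      echo "ok      $name=$value"
    fi
  done
  return "$rc"
}

check_gitea_env || echo "fix ~/.zshrc, then open a new shell"
```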

Step 11 — Gitea Actions runner (CI)

The npm registry (Steps 1-10) is independent of CI. To run the docker-lint job (and the rest of each repo's .gitea/workflows/*.yml) you need an Actions runner registered against Gitea. This section makes that reproducible; the original runner was registered by hand.

11.1 — Enable Actions on the instance

In app.ini (inside the Gitea container/conf dir):

[actions]
ENABLED = true

Then run sudo docker restart gitea. Confirm Gitea came back up with curl -fsS http://localhost:3300/api/v1/version, and check that the Actions toggle now appears under the repo's Settings → Actions.
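To check the setting without opening an editor, a small ini reader is enough. The sketch below is an assumption-laden illustration: `ini_get` is a hypothetical helper, it handles only simple `KEY = value` lines, and the demo runs against a throwaway file rather than your real app.ini.

```shell
# Hypothetical helper: print the value of KEY inside [SECTION] of an ini file.
ini_get() {
  # $1 = file, $2 = section name, $3 = key
  awk -v section="$2" -v key="$3" '
    /^[ \t]*\[/ {                          # section header line
      gsub(/^[ \t]*\[|\][ \t]*$/, "")      # strip brackets and padding
      current = $0
      next
    }
    current == section {
      split($0, kv, "=")
      k = kv[1]; gsub(/^[ \t]+|[ \t]+$/, "", k)
      if (k == key) {
        v = substr($0, index($0, "=") + 1)
        gsub(/^[ \t]+|[ \t]+$/, "", v)
        print v
        exit
      }
    }
  ' "$1"
}

# Demo with a throwaway ini file.
DEMO_INI="$(mktemp)"
printf '%s\n' '[packages]' 'ENABLED = true' '' '[actions]' 'ENABLED = true' > "$DEMO_INI"
ini_get "$DEMO_INI" actions ENABLED   # prints: true
```

Against the real instance, point it at wherever your app.ini lives, for example `ini_get /var/lib/gitea/conf/app.ini actions ENABLED`. Empty output means the key is absent from that section.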

11.2 — Install and register the runner

# macOS host runner (laptop) — install
brew install act_runner

# Register reproducibly (fetches a registration token via the admin API,
# registers with the agreed labels + capacity). Host mode is the default.
GITEA_ADMIN_USER=gitea-admin GITEA_ADMIN_PASS='<admin-pass>' \
  bash scripts/gitea/register-runner.sh --name bytelyst-mac --capacity 2

# Containerized runner (better isolation; requires Docker on the host):
GITEA_ADMIN_USER=gitea-admin GITEA_ADMIN_PASS='<admin-pass>' \
  bash scripts/gitea/register-runner.sh --mode docker --capacity 2

register-runner.sh is idempotent: if a runner is already registered it prints the current identity and exits. Pass --force to re-register (this invalidates the old runner row in Gitea).

11.3 — Host mode vs. containerized mode

| | Host mode (ubuntu-latest:host) | Docker mode (ubuntu-latest:docker://…) |
|---|---|---|
| Isolation | None; jobs run directly on macOS | Each job in a fresh container |
| Speed | Fast (no image pull) | Slower first run (pulls catthehacker/ubuntu) |
| Reproducibility | Depends on host toolchain | Pinned image, matches GitHub closely |
| Best for | Single-operator laptop / corp proxy | Shared/VM runners, untrusted PRs |

We run host mode on the laptop because the corp proxy + Docker-in-Docker is fragile, and the jobs are trusted (own repos). Prefer docker mode on the VM.

11.4 — Secrets: never inline the token

The npm token must not live inline in config.yaml. Externalise it into a gitignored runner.env referenced by runner.env_file:

# /opt/homebrew/etc/act_runner/config.yaml
runner:
  capacity: 2 # parallel jobs
  env_file: '/opt/homebrew/etc/act_runner/runner.env'
  envs:
    # GITEA_NPM_TOKEN is loaded from env_file — never inline here.
    NODE_ENV: test

# /opt/homebrew/etc/act_runner/runner.env  (chmod 600, never committed)
GITEA_NPM_TOKEN=<token>

After editing config, reload the daemon:

brew services restart act_runner            # or, if brew name mismatches:
launchctl kickstart -k gui/$(id -u)/homebrew.mxcl.act_runner
tail -f /opt/homebrew/var/log/act_runner.log   # expect "declare successfully"

11.5 — Concurrency

runner.capacity controls parallel jobs on one runner. With capacity: 1 the lightweight docker-lint job queues behind slow backend/web/mobile/E2E jobs (observed: ~13 min wait). capacity: 2 lets docker-lint run alongside one heavy job. For more parallelism, register additional runners rather than pushing capacity high on a single laptop — each runner gets its own workdir and process, so failures/timeouts stay isolated.

Add more host runners (reproducible):

# Stand up runners #2 and #3 (each capacity 2) as their own launchd services.
# Shares the canonical runner.env token; separate config/.runner/workdir.
bash scripts/gitea/add-host-runner.sh 2 2
bash scripts/gitea/add-host-runner.sh 3 2

add-host-runner.sh <N> [capacity]:

  • derives a per-runner config.yaml from the canonical one (preserves proxy env + env_file), overriding runner.file, runner.capacity, and a unique host.workdir_parent (~/.cache/act-<N>)
  • fetches a one-time registration token via the admin API, authenticating with the PAT stored in ~/.gitea_c5_pat
  • registers as $(hostname -s)-<N> with host-mode labels
  • writes + loads ~/Library/LaunchAgents/com.bytelyst.act_runner-<N>.plist (RunAtLoad + KeepAlive)
  • idempotent: re-running just reloads the service
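To make the launchd step concrete, here is a hedged sketch of what a per-runner LaunchAgent could look like. It is not the plist that add-host-runner.sh generates: `render_runner_plist` is a hypothetical helper, the binary path and the config.yaml location are assumptions supplied by the caller, and only the label pattern, RunAtLoad, and KeepAlive come from the description above.

```shell
# Hypothetical sketch of a per-runner LaunchAgent; not the plist that
# add-host-runner.sh actually writes. Binary and config paths are
# caller-supplied assumptions.
render_runner_plist() {
  # $1 = runner number, $2 = act_runner binary path, $3 = config.yaml path
  cat <<EOF
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.bytelyst.act_runner-$1</string>
  <key>ProgramArguments</key>
  <array>
    <string>$2</string>
    <string>daemon</string>
    <string>--config</string>
    <string>$3</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
EOF
}

# Print a plist for runner #2 (paths are illustrative assumptions).
render_runner_plist 2 "/opt/homebrew/bin/act_runner" \
  "$HOME/Library/Application Support/act_runner-2/config.yaml"
```

Printing the plist instead of writing it makes it easy to diff against the file the script actually installed under ~/Library/LaunchAgents/.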

The Homebrew act_runner service is runner #1; add-host-runner.sh adds #2, #3, … Three runners × capacity 2 ≈ 6 parallel job slots. Verified: pushing a multi-job workflow lights up all three runners simultaneously.

List + prune the fleet:

PAT=$(cat ~/.gitea_c5_pat)
curl -s -H "Authorization: token $PAT" \
  http://localhost:3300/api/v1/admin/actions/runners \
  | python3 -c "import json,sys; [print(r['id'], r['name'], r['status']) for r in json.load(sys.stdin)['runners']]"
# Delete a stale/offline runner by id:
curl -s -X DELETE -H "Authorization: token $PAT" \
  http://localhost:3300/api/v1/admin/actions/runners/<id>

Remove an extra runner entirely:

launchctl bootout "gui/$(id -u)/com.bytelyst.act_runner-2" 2>/dev/null || true
rm -f ~/Library/LaunchAgents/com.bytelyst.act_runner-2.plist
rm -rf "$HOME/Library/Application Support/act_runner-2" ~/.cache/act-2
# then DELETE its row via the admin API (above)

11.6 — Runner token rotation

The registration token (used once at register time) is separate from the npm token (used by jobs). To rotate:

# Registration token — just re-register; old token is single-use anyway:
bash scripts/gitea/register-runner.sh --force --name bytelyst-mac

# npm token — rotate via the existing helper, then update runner.env:
bash scripts/gitea/token.sh rotate
TOKEN=$(cat ~/.gitea_npm_token)
printf 'GITEA_NPM_TOKEN=%s\n' "$TOKEN" > /opt/homebrew/etc/act_runner/runner.env
chmod 600 /opt/homebrew/etc/act_runner/runner.env
brew services restart act_runner

11.7 — Verify CI end-to-end

Push any repo that has a docker-lint job, then:

PAT=$(cat ~/.gitea_c5_pat)
R=learning_ai_clock
RID=$(curl -s -H "Authorization: token $PAT" \
  "http://localhost:3300/api/v1/repos/learning_ai_user/$R/actions/runs?limit=1" \
  | python3 -c "import json,sys; print(json.load(sys.stdin)['workflow_runs'][0]['id'])")
curl -s -H "Authorization: token $PAT" \
  "http://localhost:3300/api/v1/repos/learning_ai_user/$R/actions/runs/$RID/jobs" \
  | python3 -c "import json,sys; [print(j['status'], j.get('conclusion'), j['name']) for j in json.load(sys.stdin)['jobs']]"
# Expect: completed success Docker lint — gitea-doctor + docker-doctor

Troubleshooting

Doctor reports STALE TOKEN: env GITEA_NPM_TOKEN ≠ file

Your shell has an old token cached. Fix:

source ~/.zshrc
# Or to refresh just the token without sourcing everything:
eval "$(bash ~/code/mygh/learning_ai_common_plat/scripts/gitea/token.sh print --export)"
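The stale-token condition is a plain comparison between the value in the environment and the value on disk. A minimal sketch of that comparison follows. `token_state` is a hypothetical helper and is not taken from doctor.sh; the demo uses placeholder strings, not real tokens.

```shell
# Hypothetical sketch of a stale-token comparison; not doctor.sh itself.
token_state() {
  # $1 = token from the environment, $2 = path to the token file
  if [ ! -r "$2" ]; then
    echo "no-file"
    return 2
  fi
  file_token="$(tr -d '[:space:]' < "$2")"   # ignore a stray trailing newline
  if [ "$1" = "$file_token" ]; then
    echo "consistent"
  else
    echo "stale"
    return 1
  fi
}

# Demo with a throwaway file and placeholder values.
DEMO_TOKEN_FILE="$(mktemp)"
printf '%s' "new-token-value" > "$DEMO_TOKEN_FILE"
token_state "old-token-value" "$DEMO_TOKEN_FILE" || true   # prints: stale
token_state "new-token-value" "$DEMO_TOKEN_FILE"           # prints: consistent
```

In a real shell the call would be `token_state "$GITEA_NPM_TOKEN" ~/.gitea_npm_token_home`.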

pnpm install fails with EAI_AGAIN or ETIMEDOUT

DNS or network. Verify:

ping <VM_HOST>
nslookup <VM_HOST>
curl -v http://<VM_HOST>:3300/api/v1/version

Need to rotate the token

# On laptop (assumes Keychain entry from Step 1):
bash ~/code/mygh/learning_ai_common_plat/scripts/gitea/token.sh rotate

Or manually re-run Step 3 on the VM.

Need to start over (nuke the Gitea data)

On the VM (destructive):

sudo docker stop gitea
sudo rm -rf /var/lib/gitea/*    # or wherever your data volume is
sudo docker start gitea
# Then re-run Steps 1-7

What's persistent vs. ephemeral

| Item | Where | Survives VM reboot? / Survives VM rebuild? |
|---|---|---|
| Gitea database | /var/lib/gitea/data/gitea.db | (snapshot the disk) |
| Published packages | /var/lib/gitea/data/packages/ | (re-publish via Step 7) |
| Admin/npm users | inside Gitea DB | (re-run Steps 1-2) |
| NPM tokens | inside Gitea DB + your ~/.gitea_npm_token_home | (re-run Step 3) |
| ~/.gitea_vm_host | your laptop | n/a |
| ~/.gitea_npm_token_home | your laptop | n/a |
| Actions runner registration | Gitea DB + .runner file | (re-run register-runner.sh) |
| Runner secrets | act_runner/runner.env (chmod 600) | (recreate from token) |

For VM rebuilds: take a weekly Azure Disk Snapshot of /var/lib/gitea and restore it on rebuild. This avoids re-running Steps 1-7.


See also

  • scripts/gitea/doctor.sh — pre-flight validation (run before every deploy)
  • scripts/gitea/token.sh — token rotation helper
  • scripts/gitea/register-runner.sh — reproducible Actions runner registration (Step 11)
  • scripts/gitea/bootstrap-vm.sh — automates Steps 1-3 on a fresh VM
  • scripts/switch-network.sh — exports GITEA_NPM_* env vars per network
  • docker-build-optimization-roadmap.md (in learning_ai_devops_tools/docs/) — ecosystem-wide Docker build hardening that depends on this setup