learning_ai_common_plat/AI.dev/SKILLS/debug-service.md

Debug Service Skill

Description: Systematic methodology for diagnosing and fixing failing services or endpoints across the entire stack.

When to Use

  • Any service is returning errors or unexpected behavior
  • Health checks are failing
  • Tests are failing in CI or locally
  • Users report broken functionality

Prerequisites

  • Access to service logs
  • Terminal with curl installed
  • Ability to run tests locally

Steps

1. Identify the Failing Service

Quickly locate which service is affected:

# Service locations reference:
# Backend API (Python/FastAPI) → backend/src/
# Billing Service (Fastify) → ../learning_ai_common_plat/services/billing-service/src/
# Growth Service (Fastify) → ../learning_ai_common_plat/services/growth-service/src/
# Platform Service (Fastify) → ../learning_ai_common_plat/services/platform-service/src/
# Tracker Service (Fastify) → ../learning_ai_common_plat/services/tracker-service/src/
# Admin Dashboard (Next.js) → admin-dashboard-web/src/
# User Dashboard (Next.js) → user-dashboard-web/src/
# Tracker Dashboard (Next.js) → tracker-dashboard-web/src/

2. Check Health Status

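# Check the two health endpoints used in this guide (ports 8000 and 4003).
# Run them separately: chained with &&, a failure on the first skips the second.
curl -sS http://localhost:8000/health; echo
curl -sS http://localhost:4003/health; echo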
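When more than a couple of endpoints are involved, a small helper can report each status code on its own line. The following is a minimal sketch, not part of the project: the function name `check_health` and the `name=url` argument format are assumptions made for illustration.

```shell
# Report the HTTP status of each health endpoint without stopping at the first failure.
# Arguments are name=url pairs; the names are free-form labels (hypothetical).
check_health() {
  failed=0
  for pair in "$@"; do
    name=${pair%%=*}   # text before the first '='
    url=${pair#*=}     # text after the first '='
    # -o /dev/null discards the body; -w prints only the status code.
    # curl reports 000 when no HTTP response was received.
    code=$(curl -s -o /dev/null -w '%{http_code}' --max-time 3 "$url") || code=${code:-000}
    if [ "$code" = "200" ]; then
      echo "OK   $name ($code)"
    else
      echo "FAIL $name ($code)"
      failed=1
    fi
  done
  return "$failed"
}
```

A possible invocation, assuming port 8000 is the backend API and port 4003 is the platform service: `check_health backend=http://localhost:8000/health platform=http://localhost:4003/health`. The function returns non-zero if any endpoint is unhealthy, so it can gate later commands.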

3. Examine Logs

For local development:

# Show the last 50 lines of the backend and platform-service logs
tail -n 50 .logs/backend.log .logs/platform-service.log 2>/dev/null

For Docker:

docker compose logs --tail=50 <service-name>
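To cut through noisy output, it can help to keep only the error-like lines from the tail of each log. This is a sketch under assumptions: the function name `recent_errors` and the keyword list are illustrative, and the right keywords depend on how each service formats its logs.

```shell
# Print error-like lines found in the last N lines of each readable log file,
# prefixed with the file name. Unreadable or missing files are skipped.
recent_errors() {
  lines=$1
  shift
  for file in "$@"; do
    [ -r "$file" ] || continue
    tail -n "$lines" "$file" \
      | grep -iE 'error|exception|traceback|fatal' \
      | while IFS= read -r line; do
          printf '%s: %s\n' "$file" "$line"
        done
  done
}
```

For example, `recent_errors 200 .logs/*.log` scans the last 200 lines of every local log. For Docker, the same filter can be applied by piping `docker compose logs --tail=200 <service-name>` into `grep -iE`.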

4. Reproduce the Issue

  • API errors: Use curl with verbose output

    curl -v -X POST http://localhost:8000/api/endpoint -H "Content-Type: application/json" -d '{"key":"value"}'
    
  • UI errors: Check browser console and network tab

  • Test failures: Run specific tests with verbose output

    # Python
    python -m pytest tests/test_specific.py -v -x
    
    # TypeScript/Vitest
    pnpm test --reporter=verbose specific.test.ts
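When reproducing an API error, the status code alone usually tells you where to look next. The sketch below maps a status code to a first guess; the function name `classify_status` and the hint wording are assumptions, and the hints are rules of thumb rather than a diagnosis.

```shell
# Map an HTTP status code to a first debugging hint.
# 000 is what curl reports via -w '%{http_code}' when no response was received.
classify_status() {
  case "$1" in
    000) echo "no response: service down or wrong port" ;;
    2??) echo "success: request handled" ;;
    4??) echo "client error: check payload, route, or auth header" ;;
    5??) echo "server error: check the service logs" ;;
    *)   echo "other: inspect the full response" ;;
  esac
}
```

One way to use it is to capture the code from the same request you are reproducing: `classify_status "$(curl -s -o /dev/null -w '%{http_code}' -X POST http://localhost:8000/api/endpoint -H 'Content-Type: application/json' -d '{"key":"value"}')"`.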
    

5. Fix Methodology

Follow this order to avoid common pitfalls:

  1. Read the test first - Understand what the expected behavior is
  2. Read the source code - Trace the execution path
  3. Fix the source, NOT the test - Unless the test itself is wrong
  4. Add a regression test - If none exists for this bug
  5. Run the full test suite - Ensure no new issues were introduced

6. Verify the Fix

# Python tests
python -m pytest tests/ backend/tests/ -v --tb=short -x

# TypeScript services (the subshell keeps your current directory unchanged)
(cd ../learning_ai_common_plat && pnpm --filter @lysnrai/<service-name> test)

# Dashboard builds
(cd admin-dashboard-web && npm run build)
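The verification commands above can be chained so that the run stops at the first failure and names the step that broke. This is a minimal sketch; `run_steps` is a hypothetical helper, and each argument is assumed to be a complete shell command string.

```shell
# Run each command string in order; stop and report at the first failure.
run_steps() {
  for step in "$@"; do
    echo "==> $step"
    # Each step runs in its own shell, so a 'cd' inside one step
    # does not affect the next.
    if ! sh -c "$step"; then
      echo "FAILED: $step"
      return 1
    fi
  done
  echo "All steps passed"
}
```

For instance: `run_steps "python -m pytest tests/ backend/tests/ --tb=short -x" "cd admin-dashboard-web && npm run build"`.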

7. Commit with Proper Format

git add .
git commit -m "fix(<scope>): <description>"
# Examples:
# fix(billing): handle null subscription in usage endpoint
# fix(platform): validate JWT token in auth middleware
# fix(admin): resolve dashboard loading state issue
git push
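The message format can be checked mechanically before committing. The sketch below validates only the `fix(<scope>): <description>` shape shown above; the allowed scope characters (lowercase letters, digits, hyphens) are an assumption, not a documented project rule.

```shell
# Succeed only if the first line of the message matches fix(<scope>): <description>.
# The scope character set is an assumption made for this sketch.
valid_fix_message() {
  printf '%s\n' "$1" | head -n 1 | grep -Eq '^fix\([a-z0-9-]+\): [^ ].*$'
}
```

Example: `valid_fix_message "fix(billing): handle null subscription in usage endpoint" && echo ok`. If the team wants this enforced, a check like this could be called from a Git `commit-msg` hook, which receives the path of the message file as its first argument.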

Common Patterns

Database Connection Issues

# Check Cosmos DB connectivity via the platform service health endpoint
curl -s http://localhost:4003/health | jq .
# Look for database errors in logs (-E enables portable alternation with |)
grep -iE "database|cosmos|connection" .logs/*.log

Authentication Issues

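# Decode the JWT payload to check its contents.
# JWT segments are base64url without padding, so convert and pad before decoding.
payload=$(echo "<jwt-token>" | cut -d. -f2 | tr '_-' '/+')
while [ $(( ${#payload} % 4 )) -ne 0 ]; do payload="${payload}="; done
echo "$payload" | base64 -d | jq .
# Check that the platform service accepts the token
curl -s http://localhost:4003/api/auth/me -H "Authorization: Bearer <token>"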
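A common cause of authentication failures is an expired token. The following sketch decodes the payload and compares the standard `exp` claim (seconds since the Unix epoch) with the current time. The function names are hypothetical, the signature is not verified, and the `sed` extraction assumes `exp` is a plain integer in the payload.

```shell
# Print the decoded payload (second segment) of a JWT.
jwt_payload() {
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url encoding omits.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Succeed if the token has an exp claim that is in the past.
# Returns non-zero when exp is missing or still in the future.
jwt_is_expired() {
  exp=$(jwt_payload "$1" | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p')
  [ -n "$exp" ] && [ "$exp" -le "$(date +%s)" ]
}
```

Usage: `jwt_is_expired "<jwt-token>" && echo "token expired, request a new one"`.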

Service Dependencies

# Check if dependent services are running
docker compose ps
# Verify service communication
curl -s http://localhost:4003/health | jq .  # consolidated platform service
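A dependency that is still starting can look like a failure if it is checked only once. The sketch below retries any command a fixed number of times; `wait_for` and its argument order (attempts, delay in seconds, then the command) are assumptions made for illustration.

```shell
# Retry a command until it succeeds or the attempt limit is reached.
# Usage: wait_for <attempts> <delay-seconds> <command> [args...]
wait_for() {
  attempts=$1
  delay=$2
  shift 2
  i=1
  while [ "$i" -le "$attempts" ]; do
    if "$@" >/dev/null 2>&1; then
      echo "ready after $i attempt(s)"
      return 0
    fi
    i=$((i + 1))
    # Sleep only if another attempt follows.
    [ "$i" -le "$attempts" ] && sleep "$delay"
  done
  echo "not ready after $attempts attempt(s)"
  return 1
}
```

For example, `wait_for 10 2 curl -sf http://localhost:4003/health` polls the platform service health endpoint every two seconds, up to ten times.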

Notes

  • Always check logs first - Most issues have clear error messages
  • Isolate the problem - Don't change multiple things at once
  • Document the fix - Add comments if the issue was non-obvious
  • Consider edge cases - Think about what might cause this to fail again