learning_ai_invt_trdg/docs/planned/TROUBLESHOOTING_GUIDE.md
Saravana Achu Mac 83bc3af260 docs: add comprehensive documentation for coding agents
2026-05-09 17:02:05 -07:00


Troubleshooting Guide

Purpose: Comprehensive troubleshooting guide for the trading dashboard, covering common issues, resolution steps, and escalation procedures.

Target Audience: Developers, operators, support engineers.




Backend Issues

Backend Won't Start

Symptoms:

  • Backend fails to start
  • Error message on startup
  • Process exits immediately

Common Causes:

  1. Missing environment variables
  2. Invalid configuration
  3. Database connection failure
  4. Port already in use

Resolution Steps:

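  1. Check environment variables:
     cd backend
     cat .env
     # Ensure all required variables are set (compare against .env.example)
  2. Verify configuration:
     cd backend
     npm run dev
     # Check error messages for specific issues
  3. Check database connection:
     # Test that the Cosmos DB endpoint is reachable
     # (COSMOS_ENDPOINT must be set in the current shell, not only in .env)
     curl -I "$COSMOS_ENDPOINT"
     # Any HTTP response, even an error status, means the endpoint is reachable
  4. Check port availability:
     lsof -i :4018
     # Stop the process using port 4018 if needed;
     # try a plain kill first and use -9 only if the process does not exit
     kill <PID>
     kill -9 <PID>
  5. Check logs:
     # Backend logs should show startup errors
     # Look for specific error messages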

Prevention:

  • Use .env.example as template
  • Validate environment variables at startup
  • Add configuration validation script
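
The configuration validation script suggested above could look roughly like the following. This is a minimal sketch, not the project's actual script: the function name `check_env_file` is hypothetical, and the list of required keys is an assumption based only on the variables this guide mentions.

```shell
# Report required keys that are missing or empty in an env file.
# REQUIRED_KEYS is an assumption drawn from variables named in this guide.
REQUIRED_KEYS="COSMOS_ENDPOINT COSMOS_KEY PLATFORM_API_URL"

check_env_file() {
  env_file="$1"
  missing=""
  if [ ! -f "$env_file" ]; then
    echo "env file not found: $env_file" >&2
    return 2
  fi
  for key in $REQUIRED_KEYS; do
    # Match KEY=<at least one character>; commented-out lines do not count.
    # Note: a quoted empty value such as KEY="" would still pass this check.
    if ! grep -Eq "^[[:space:]]*(export[[:space:]]+)?${key}=.+" "$env_file"; then
      missing="$missing $key"
    fi
  done
  if [ -n "$missing" ]; then
    echo "missing or empty:$missing"
    return 1
  fi
  echo "ok"
}
```

Running `check_env_file backend/.env` before `npm run dev` turns a vague startup crash into a named list of variables to fix.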

Backend Health Check Failing

Symptoms:

  • GET /health/live returns non-200 status
  • Health check endpoint unreachable
  • Docker healthcheck failing

Common Causes:

  1. Backend not running
  2. Health check endpoint not implemented
  3. Database connection issues
  4. Dependencies unhealthy

Resolution Steps:

  1. Check if backend is running:
     curl http://localhost:4018/health/live
  2. Check backend logs:
     # Look for health check errors
     # Check for database connection errors
  3. Check dependencies:
     # Check platform-service health
     curl http://localhost:4003/health
  4. Check database:
     # Verify Cosmos DB is accessible
     # Check database container exists
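
The first and third checks above can be combined into one small probe. The sketch below assumes only the two URLs already used in this guide; the helper names `classify_health` and `probe` are hypothetical, and treating any non-200 response as a problem follows the symptom list above rather than a confirmed contract.

```shell
# Map an HTTP status code (as printed by curl -w '%{http_code}') to a verdict.
# curl prints 000 when no HTTP response was received at all.
classify_health() {
  case "$1" in
    200) echo "healthy" ;;
    000) echo "unreachable" ;;
    5??) echo "unhealthy" ;;
    *)   echo "unexpected ($1)" ;;
  esac
}

# Probe one endpoint with a short timeout and print "<url> <verdict>".
probe() {
  # curl exits non-zero when the connection fails but still prints 000.
  code=$(curl -s -o /dev/null --max-time 5 -w '%{http_code}' "$1") || true
  echo "$1 $(classify_health "$code")"
}

# Usage:
#   probe http://localhost:4018/health/live
#   probe http://localhost:4003/health
```

Separating "unreachable" from "unhealthy" tells you whether to look at the process itself or at its dependencies.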

Prevention:

  • Implement comprehensive health checks
  • Add dependency health checks
  • Monitor health check failures

Trading Loop Not Running

Symptoms:

  • No trades executing
  • Bot state shows "idle"
  • Positions not updating

Common Causes:

  1. Trading disabled globally
  2. Profile disabled
  3. Market closed
  4. Configuration error
  5. Exchange API issue

Resolution Steps:

  1. Check trading control state:
     curl -H "Authorization: Bearer <token>" http://localhost:4018/api/trading/control
     # Check if globalTradingEnabled is true
  2. Check profile status:
     # Check if specific profile is enabled
     # Look for profile-level disable
  3. Check market hours:
     # Verify SessionRule is not blocking
     # Check if market is open
  4. Check exchange API:
     # Verify exchange API credentials
     # Test exchange API connectivity
  5. Check backend logs:
     # Look for trading loop errors
     # Check for strategy rule failures
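
To avoid reading the control response by eye, the `globalTradingEnabled` check from step 1 can be scripted. This is a sketch under assumptions: the response shape is not documented here, so the helper (a hypothetical `trading_flag`) only looks for the flag as a JSON boolean anywhere in the body and is not a real JSON parser.

```shell
# Read a JSON body on stdin and report the globalTradingEnabled flag.
# Assumption: the flag is serialized as a plain boolean. This is a
# grep-based check; prefer a JSON tool such as jq when one is installed.
trading_flag() {
  body=$(cat)
  if printf '%s' "$body" | grep -Eq '"globalTradingEnabled"[[:space:]]*:[[:space:]]*true'; then
    echo "enabled"
  elif printf '%s' "$body" | grep -Eq '"globalTradingEnabled"[[:space:]]*:[[:space:]]*false'; then
    echo "disabled"
  else
    echo "unknown"
  fi
}

# Usage:
#   curl -s -H "Authorization: Bearer <token>" \
#     http://localhost:4018/api/trading/control | trading_flag
```

An "unknown" result usually means the request itself failed (for example a 401 body), so check the token before looking at trading settings.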

Prevention:

  • Add trading loop status monitoring
  • Add exchange API health checks
  • Implement trading loop alerts

Orders Not Executing

Symptoms:

  • Orders submitted but not executed
  • Orders stuck in "pending" state
  • Exchange API errors

Common Causes:

  1. Exchange API issue
  2. Insufficient capital
  3. Risk management blocking
  4. Exchange maintenance
  5. Invalid order parameters

Resolution Steps:

  1. Check exchange API:
     # Test exchange API connectivity
     # Verify API credentials
     # Check exchange status page
  2. Check capital:
     # Verify sufficient capital available
     # Check capital ledger
  3. Check risk engine:
     # Check if risk management is blocking
     # Look for risk rule failures
  4. Check order parameters:
     # Verify order parameters are valid
     # Check symbol, quantity, price
  5. Check backend logs:
     # Look for order execution errors
     # Check for exchange API errors

Prevention:

  • Add order execution monitoring
  • Add exchange API health checks
  • Implement order execution alerts

Reconciliation Failures

Symptoms:

  • Reconciliation service failing
  • Data inconsistencies
  • Reconciliation errors in logs

Common Causes:

  1. Exchange API issue
  2. Database connection issue
  3. Data format mismatch
  4. Missing data
  5. Logic error

Resolution Steps:

  1. Check reconciliation logs:
     # Look for reconciliation errors
     # Check for specific failure reasons
  2. Run reconciliation manually:
     cd backend
     # Run the specific reconciliation script
     npm run reconcile:lifecycle-history
  3. Check exchange data:
     # Verify exchange API is returning correct data
     # Check for data format changes
  4. Check database:
     # Verify database is accessible
     # Check for data corruption

Prevention:

  • Add reconciliation monitoring
  • Add reconciliation alerts
  • Implement reconciliation retry logic

Web Issues

Web Won't Load

Symptoms:

  • Blank screen
  • Loading spinner indefinitely
  • Browser console errors

Common Causes:

  1. Backend API unreachable
  2. Authentication failure
  3. Build error
  4. JavaScript error
  5. Network issue

Resolution Steps:

  1. Check browser console:
     // Open browser devtools
     // Check for JavaScript errors
     // Check for network errors
  2. Check backend API:
     # Verify backend is running
     curl http://localhost:4018/health/live
  3. Check authentication:
     # Verify auth token is valid
     # Check platform-service is reachable
  4. Check build:
     cd web
     pnpm build
     # Verify build succeeds
  5. Check network:
     # Check network connectivity
     # Verify CORS settings
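
The CORS check in the last step can be reproduced outside the browser by sending a preflight request with curl. The sketch below is illustrative only: the helper names are hypothetical, and the default origin assumes the web app is served from the Vite dev-server default port 5173, which this guide does not confirm.

```shell
# Extract the Access-Control-Allow-Origin value from raw response headers on stdin.
allowed_origin() {
  tr -d '\r' | awk '{
    i = index($0, ":")
    if (tolower(substr($0, 1, i - 1)) == "access-control-allow-origin") {
      val = substr($0, i + 1)
      sub(/^[ \t]+/, "", val)
      print val
      exit
    }
  }'
}

# Send a preflight request the way a browser would and print the allowed origin.
# Both defaults are assumptions; pass your real origin and API URL.
check_cors() {
  origin="${1:-http://localhost:5173}"
  api="${2:-http://localhost:4018/health/live}"
  curl -s -o /dev/null -D - --max-time 5 -X OPTIONS \
    -H "Origin: $origin" -H "Access-Control-Request-Method: GET" "$api" | allowed_origin
}
```

If `check_cors` prints nothing, the backend did not return an `Access-Control-Allow-Origin` header for that origin, which matches a browser console CORS error.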

Prevention:

  • Add error boundaries
  • Add loading states
  • Implement graceful degradation

Authentication Failing

Symptoms:

  • Login fails
  • Session expires immediately
  • Auth token invalid
  • 401/403 errors

Common Causes:

  1. Platform-service unreachable or down
  2. Invalid credentials
  3. Token expired
  4. Configuration error

Resolution Steps:

  1. Check platform-service:
     # Verify platform-service is running
     curl http://localhost:4003/health
  2. Check credentials:
     # Verify username/password are correct
     # Check platform-service user exists
  3. Check token:
     # Verify JWT token is valid
     # Check token expiration
  4. Check configuration:
     cat web/.env.local
     # Verify VITE_PLATFORM_URL is correct
     # Verify auth configuration
  5. Check logs:
     # Check platform-service logs
     # Check backend auth logs

Prevention:

  • Add auth error handling
  • Implement token refresh
  • Add auth monitoring

WebSocket Connection Failing

Symptoms:

  • WebSocket won't connect
  • Connection drops repeatedly
  • No real-time updates
  • WebSocket errors in console

Common Causes:

  1. Backend unreachable
  2. Auth token invalid
  3. WebSocket blocked by firewall
  4. Namespace mismatch
  5. Connection limit exceeded

Resolution Steps:

  1. Check backend WebSocket:
     # Verify backend is running
     # Check WebSocket port is accessible
  2. Check auth token:
     # Verify JWT token is valid
     # Check token is included in connection
  3. Check firewall:
     # Verify WebSocket port is not blocked
     # Check network settings
  4. Check namespace:
     # Verify correct namespace (/trading)
     # Check namespace exists in backend
  5. Check logs:
     # Check backend WebSocket logs
     # Look for connection errors
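
Step 1 can be tested from the command line if the backend uses Socket.IO. That is an assumption suggested by the guide's use of namespaces and rooms, not something it states. In that case a plain HTTP request to the default `/socket.io/` path should return an Engine.IO v4 "open" packet, and the hypothetical helper below checks for it.

```shell
# Succeeds when the body on stdin looks like an Engine.IO v4 "open" packet:
# the character 0 followed by a JSON object that carries a session id.
is_open_packet() {
  body=$(cat)
  case "$body" in
    '0{'*'"sid"'*) return 0 ;;
    *) return 1 ;;
  esac
}

# Usage, assuming Socket.IO on its default path and the backend port from this guide:
#   curl -s "http://localhost:4018/socket.io/?EIO=4&transport=polling" \
#     | is_open_packet && echo "handshake ok"
```

A failed handshake here points at the server, path, or proxy configuration. A successful one moves suspicion to the client's namespace or auth token.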

Prevention:

  • Add WebSocket error handling
  • Implement reconnection logic
  • Add WebSocket monitoring

Data Not Updating

Symptoms:

  • Data stale
  • No real-time updates
  • Manual refresh required
  • WebSocket connected but no updates

Common Causes:

  1. WebSocket not receiving events
  2. Backend not broadcasting
  3. Event subscription issue
  4. Data cache issue
  5. Filter/mask issue

Resolution Steps:

  1. Check WebSocket connection:
     // Check WebSocket is connected
     // Check for connection errors
  2. Check backend broadcasting:
     # Check backend logs for broadcast events
     # Verify events are being sent
  3. Check event subscription:
     // Verify correct events are subscribed
     // Check for subscription errors
  4. Check data cache:
     // Clear cache if needed
     // Verify cache is not stale
  5. Check filters:
     // Verify data filters are correct
     // Check for masking issues

Prevention:

  • Add data staleness monitoring
  • Implement cache invalidation
  • Add data update alerts

Mobile Issues

Mobile App Won't Launch

Symptoms:

  • App crashes on launch
  • Blank screen
  • Loading spinner indefinitely
  • Expo Go won't load

Common Causes:

  1. Build error
  2. Configuration error
  3. Platform-service unreachable
  4. Backend unreachable
  5. Expo Go version mismatch

Resolution Steps:

  1. Check Expo logs:
     # Check Expo dev server logs
     # Look for build errors
  2. Check configuration:
     cat mobile/.env.local
     # Verify EXPO_PUBLIC_* variables are set
  3. Check platform-service:
     # Verify platform-service is running
     curl http://localhost:4003/health
  4. Check backend:
     # Verify backend is running
     curl http://localhost:4018/health/live
  5. Check Expo Go:
     # Verify Expo Go is latest version
     # Try clearing Expo Go cache
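
One configuration problem worth ruling out in step 2: on a physical device, `localhost` refers to the phone itself, so an `EXPO_PUBLIC_*` URL that points at `localhost` reaches neither the backend nor platform-service. The sketch below flags such entries. The helper name is hypothetical, and the variable names in the usage comment are illustrative only.

```shell
# List EXPO_PUBLIC_* entries in an env file whose value points at
# localhost or 127.0.0.1. Exits non-zero when none are found.
find_localhost_urls() {
  grep -E '^[[:space:]]*EXPO_PUBLIC_[A-Z0-9_]*=.*(localhost|127\.0\.0\.1)' "$1"
}

# Usage:
#   find_localhost_urls mobile/.env.local
# Any line printed should use the dev machine's LAN address instead
# when testing on a real device.
```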

Prevention:

  • Add error boundaries
  • Add launch error handling
  • Implement graceful degradation

Mobile Authentication Failing

Symptoms:

  • Login fails
  • Session won't restore
  • Auth token invalid
  • Secure storage issue

Common Causes:

  1. Platform-service unreachable or down
  2. Invalid credentials
  3. Token expired
  4. Secure storage error

Resolution Steps:

  1. Check platform-service:
     # Verify platform-service is running
     curl http://localhost:4003/health
  2. Check credentials:
     # Verify credentials are correct
     # Check platform-service user exists
  3. Check token:
     # Verify JWT token is valid
     # Check token expiration
  4. Check secure storage:
     # Verify secure storage is working
     # Check for storage errors
  5. Check logs:
     # Check mobile logs
     # Look for auth errors

Prevention:

  • Add auth error handling
  • Implement token refresh
  • Add secure storage error handling

Mobile WebSocket Connection Failing

Symptoms:

  • WebSocket won't connect
  • Connection drops repeatedly
  • No real-time updates
  • WebSocket errors in logs

Common Causes:

  1. Backend unreachable
  2. Auth token invalid
  3. WebSocket blocked by network
  4. Namespace mismatch
  5. Connection limit exceeded

Resolution Steps:

  1. Check backend WebSocket:
     # Verify backend is running
     # Check WebSocket port is accessible
  2. Check auth token:
     # Verify JWT token is valid
     # Check token is included in connection
  3. Check network:
     # Verify network allows WebSocket
     # Check mobile network settings
  4. Check namespace:
     # Verify correct namespace (/trading)
     # Check namespace exists in backend
  5. Check logs:
     # Check mobile logs
     # Look for connection errors

Prevention:

  • Add WebSocket error handling
  • Implement reconnection logic
  • Add polling fallback

Mobile Data Not Updating

Symptoms:

  • Data stale
  • No real-time updates
  • Manual refresh required
  • WebSocket connected but no updates

Common Causes:

  1. WebSocket not receiving events
  2. Backend not broadcasting
  3. Event subscription issue
  4. Data cache issue
  5. Polling fallback not working

Resolution Steps:

  1. Check WebSocket connection:
     // Check WebSocket is connected
     // Check for connection errors
  2. Check backend broadcasting:
     # Check backend logs for broadcast events
     # Verify events are being sent
  3. Check polling fallback:
     // Verify polling is working
     // Check polling interval
  4. Check data cache:
     // Clear cache if needed
     // Verify cache is not stale
  5. Check logs:
     # Check mobile logs
     # Look for update errors

Prevention:

  • Add data staleness monitoring
  • Implement polling fallback
  • Add data update alerts

Database Issues

Cosmos DB Connection Failing

Symptoms:

  • Backend can't connect to Cosmos
  • Connection timeout
  • Authentication error
  • Database not found

Common Causes:

  1. Invalid credentials
  2. Network issue
  3. Cosmos DB down
  4. Container not found
  5. Firewall blocking

Resolution Steps:

  1. Check credentials:
     cat backend/.env
     # Verify COSMOS_ENDPOINT and COSMOS_KEY are correct
  2. Test connection:
     # COSMOS_ENDPOINT must be set in the current shell, not only in .env
     curl -I "$COSMOS_ENDPOINT"
     # Any HTTP response, even an error status, means Cosmos DB is reachable
  3. Check database:
     # Verify database exists
     # Check container exists
  4. Check network:
     # Verify network allows connection
     # Check firewall settings
  5. Check logs:
     # Check backend logs for Cosmos errors
     # Look for connection errors

Prevention:

  • Add connection retry logic
  • Implement connection pooling
  • Add connection monitoring
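
The retry logic recommended above follows a common pattern: retry a failing operation a bounded number of times and double the wait after each failure. The shell sketch below shows the pattern only. The function name is hypothetical, and the backend would implement the equivalent in its own language.

```shell
# Run a command up to max_attempts times, doubling the delay after each failure.
# Usage: retry_with_backoff <max_attempts> <initial_delay_seconds> <command...>
retry_with_backoff() {
  max_attempts="$1"
  delay="$2"
  shift 2
  attempt=1
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "giving up after $attempt attempts" >&2
      return 1
    fi
    echo "attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}

# Usage: wait for the endpoint to answer at all (any HTTP status counts).
#   retry_with_backoff 5 1 curl -s -o /dev/null --max-time 5 "$COSMOS_ENDPOINT"
```

Bounding the attempts matters: an unbounded retry loop hides a real outage instead of surfacing it to monitoring.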

Cosmos DB Slow Performance

Symptoms:

  • Slow query response times
  • Timeout errors
  • Performance degradation

Common Causes:

  1. Large result sets
  2. Missing indexes
  3. High RU consumption
  4. Network latency
  5. Database throttling

Resolution Steps:

  1. Check query performance:
     # Check query execution time
     # Look for slow queries
  2. Check indexes:
     # Verify indexes exist
     # Check index usage
  3. Check RU consumption:
     # Monitor RU consumption
     # Check for throttling
  4. Optimize queries:
     # Add query filters
     # Use pagination
     # Reduce result set size

Prevention:

  • Add query performance monitoring
  • Implement query optimization
  • Add performance alerts

Data Inconsistency

Symptoms:

  • Data mismatch between surfaces
  • Stale data
  • Missing data
  • Duplicate data

Common Causes:

  1. Reconciliation failure
  2. Cache issue
  3. Race condition
  4. Data corruption
  5. Sync issue

Resolution Steps:

  1. Run reconciliation:
     cd backend
     # Run reconciliation script
     npm run reconcile:lifecycle-history
  2. Check cache:
     # Clear cache if needed
     # Verify cache is not stale
  3. Check sync:
     # Verify sync is working
     # Check for sync errors
  4. Check data integrity:
     # Verify data integrity
     # Check for corruption

Prevention:

  • Add data consistency monitoring
  • Implement reconciliation
  • Add data integrity checks

Authentication Issues

Platform-Service Unreachable

Symptoms:

  • Can't authenticate
  • Token validation fails
  • Platform-service health check fails

Common Causes:

  1. Platform-service down
  2. Network issue
  3. Wrong URL
  4. Firewall blocking
  5. DNS issue

Resolution Steps:

  1. Check platform-service:
     # Verify platform-service is running
     curl http://localhost:4003/health
  2. Check configuration:
     cat backend/.env
     # Verify PLATFORM_API_URL is correct
     cat web/.env.local
     # Verify VITE_PLATFORM_URL is correct
  3. Check network:
     # Verify network allows connection
     # Check firewall settings
  4. Check DNS:
     # Verify DNS resolution of the platform-service hostname
     # Check the hosts file (/etc/hosts) for stale entries

Prevention:

  • Add platform-service monitoring
  • Implement fallback auth
  • Add auth monitoring

JWT Token Invalid

Symptoms:

  • 401 errors
  • Token validation fails
  • Session expires immediately

Common Causes:

  1. Token expired
  2. Token malformed
  3. Wrong signing key
  4. Token revoked
  5. Clock skew

Resolution Steps:

  1. Check token expiration:
     # Decode JWT token
     # Check expiration time
  2. Check token format:
     # Verify JWT format is correct
     # Check for token errors
  3. Check signing key:
     # Verify signing key is correct
     # Check key rotation
  4. Check clock skew:
     # Verify system time is correct
     # Check NTP sync
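
Decoding the token and checking its expiration (step 1) needs no special tooling, because a JWT payload is base64url-encoded JSON. The sketch below uses hypothetical helper names and does not verify the signature, so use it for inspection only, never to decide whether a token is trustworthy.

```shell
# Print the decoded payload (second segment) of a JWT. Does NOT verify the signature.
# On older macOS releases, base64 may need -D instead of -d.
jwt_payload() {
  seg=$(printf '%s' "$1" | cut -d. -f2 | tr '_-' '/+')
  # Restore the '=' padding that base64url strips.
  case $(( ${#seg} % 4 )) in
    2) seg="$seg==" ;;
    3) seg="$seg=" ;;
  esac
  printf '%s' "$seg" | base64 -d
}

# Print the numeric exp claim (seconds since the epoch), or nothing if absent.
jwt_exp() {
  jwt_payload "$1" | sed -n 's/.*"exp"[[:space:]]*:[[:space:]]*\([0-9][0-9]*\).*/\1/p'
}

# Succeeds when the token's exp is at or before the given epoch time (default: now).
jwt_is_expired() {
  exp=$(jwt_exp "$1")
  now="${2:-$(date +%s)}"
  [ -n "$exp" ] && [ "$exp" -le "$now" ]
}
```

Comparing `jwt_exp "$TOKEN"` with `date +%s` on both the client and the server also exposes clock skew: a token that looks valid on one machine and expired on the other points at system time, not at the token.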

Prevention:

  • Implement token refresh
  • Add token validation
  • Add token monitoring

WebSocket Issues

WebSocket Connection Drops

Symptoms:

  • Connection drops repeatedly
  • Reconnection fails
  • Connection timeout

Common Causes:

  1. Network instability
  2. Backend restart
  3. Auth token expired
  4. Connection limit
  5. Firewall timeout

Resolution Steps:

  1. Check network:
     # Verify network stability
     # Check for packet loss
  2. Check backend:
     # Verify backend is not restarting
     # Check backend logs
  3. Check auth token:
     # Verify JWT token is valid
     # Check token expiration
  4. Check connection limit:
     # Verify connection limit not exceeded
     # Check for connection leaks

Prevention:

  • Implement reconnection logic
  • Add connection monitoring
  • Implement heartbeat

WebSocket Not Receiving Events

Symptoms:

  • WebSocket connected but no events
  • Events not arriving
  • Event subscription issue

Common Causes:

  1. Not subscribed to events
  2. Namespace mismatch
  3. Backend not broadcasting
  4. Event filtering
  5. Room mismatch

Resolution Steps:

  1. Check subscription:
     // Verify correct events are subscribed
     // Check for subscription errors
  2. Check namespace:
     # Verify correct namespace
     # Check namespace exists
  3. Check backend:
     # Verify backend is broadcasting
     # Check backend logs
  4. Check filtering:
     // Verify event filters are correct
     // Check for filtering errors

Prevention:

  • Add event monitoring
  • Implement event acknowledgment
  • Add event logging

Performance Issues

Backend Slow Response Times

Symptoms:

  • API responses slow
  • High latency
  • Timeout errors

Common Causes:

  1. Database query slow
  2. External API slow
  3. CPU bottleneck
  4. Memory bottleneck
  5. Network latency

Resolution Steps:

  1. Check database queries:
     # Check query performance
     # Look for slow queries
  2. Check external APIs:
     # Check exchange API latency
     # Verify API performance
  3. Check system resources:
     # Check CPU usage
     # Check memory usage
  4. Check network:
     # Check network latency
     # Verify network bandwidth
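
Before digging into queries or resources, it helps to see where a slow request spends its time. curl's `-w` write-out variables give a per-phase breakdown, as sketched below. The helper names are hypothetical, and any endpoint from this guide can be passed in.

```shell
# Print a per-phase timing breakdown (in seconds) for one request.
# Extra arguments, such as -H headers, are passed straight to curl.
time_request() {
  curl -s -o /dev/null --max-time 10 \
    -w 'dns=%{time_namelookup} connect=%{time_connect} first_byte=%{time_starttransfer} total=%{time_total}\n' \
    "$@"
}

# Pull one named field out of a line produced by time_request.
timing_field() {
  tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Usage:
#   time_request http://localhost:4018/health/live
#   time_request http://localhost:4018/health/live | timing_field first_byte
```

A large gap between `connect` and `first_byte` means the server is slow to produce the response (database or external API). A large `dns` or `connect` value points at the network instead.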

Prevention:

  • Add performance monitoring
  • Implement caching
  • Add performance alerts

Web Slow Load Times

Symptoms:

  • Slow page load
  • Large bundle size
  • Slow initial render

Common Causes:

  1. Large bundle size
  2. Too many requests
  3. Slow API responses
  4. Unoptimized assets
  5. No caching

Resolution Steps:

  1. Check bundle size:
     cd web
     pnpm build
     # Check the bundle sizes reported in the build output
  2. Check network requests:
     // Check network tab
     // Look for large requests
  3. Check API responses:
     # Check API response times
     # Look for slow endpoints
  4. Optimize assets:
     # Optimize images
     # Minify assets
     # Implement caching

Prevention:

  • Add performance monitoring
  • Implement code splitting
  • Add caching

Deployment Issues

Docker Build Failing

Symptoms:

  • Docker build fails
  • Build timeout
  • Dependency install fails

Common Causes:

  1. Invalid Dockerfile
  2. Missing dependencies
  3. Network issue
  4. Insufficient resources
  5. Build context too large

Resolution Steps:

  1. Check Dockerfile:
     # Verify Dockerfile syntax
     # Check for errors
  2. Check dependencies:
     # Verify package.json is valid
     # Check for missing dependencies
  3. Check network:
     # Verify network connectivity
     # Check registry access
  4. Check resources:
     # Verify sufficient disk space
     # Check memory availability
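
An oversized build context (cause 5 above) is often the result of large directories missing from `.dockerignore`. The sketch below checks for two usual suspects. The directory names are assumptions about a typical Node monorepo, and the helper name is hypothetical.

```shell
# Warn about large directories that exist but are not excluded from the build context.
check_build_context() {
  dir="${1:-.}"
  if [ ! -f "$dir/.dockerignore" ]; then
    echo "no .dockerignore in $dir"
    return 1
  fi
  rc=0
  for name in node_modules .git; do
    # Accept the common spellings: name, /name, name/, **/name
    if [ -d "$dir/$name" ] && ! grep -Eq "^/?(\*\*/)?${name}/?$" "$dir/.dockerignore"; then
      echo "$name exists but is not ignored"
      rc=1
    fi
  done
  return "$rc"
}

# Usage:
#   check_build_context .
#   du -sh .    # rough upper bound on what gets sent to the Docker daemon
```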

Prevention:

  • Add Docker build validation
  • Implement build caching
  • Add build monitoring

Docker Container Won't Start

Symptoms:

  • Container won't start
  • Container exits immediately
  • Container crashes

Common Causes:

  1. Invalid configuration
  2. Missing environment variables
  3. Port conflict
  4. Dependency unavailable
  5. Health check failing

Resolution Steps:

  1. Check container logs:
     docker logs <container-id>
     # Look for startup errors
  2. Check configuration:
     # Verify environment variables
     # Check configuration files
  3. Check ports:
     # Verify port availability
     # Check for conflicts
  4. Check dependencies:
     # Verify dependencies are available
     # Check service health
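
When the logs are empty or inconclusive, the container's exit code narrows the cause. The sketch below maps the conventional Docker and shell exit codes to a likely explanation. The helper name is hypothetical, and application-specific codes still need the logs.

```shell
# Translate a container exit code into a likely cause.
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly; the main process finished" ;;
    125) echo "docker run itself failed (bad flag or daemon error)" ;;
    126) echo "container command could not be invoked" ;;
    127) echo "container command not found" ;;
    137) echo "killed by SIGKILL (often the out-of-memory killer or docker kill)" ;;
    143) echo "terminated by SIGTERM (for example docker stop)" ;;
    *)   echo "application exit code $1; check docker logs" ;;
  esac
}

# Usage:
#   explain_exit_code "$(docker inspect --format '{{.State.ExitCode}}' <container-id>)"
#   docker inspect --format '{{.State.OOMKilled}}' <container-id>   # true if killed for memory
```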

Prevention:

  • Add container health checks
  • Implement dependency checks
  • Add startup validation

Escalation Procedures

When to Escalate

Escalate to on-call if:

  • Production outage
  • Data loss or corruption
  • Security breach
  • Critical bug affecting multiple users
  • Issue cannot be resolved within 30 minutes

Escalation Steps

  1. Document the issue:

    • Describe the problem
    • Include error messages
    • Include logs
    • Include request IDs
  2. Attempt resolution:

    • Follow troubleshooting steps
    • Document attempted solutions
    • Note what worked/didn't work
  3. Escalate:

    • Contact on-call engineer
    • Provide documentation
    • Provide context
    • Provide urgency
  4. Follow up:

    • Monitor resolution
    • Document resolution
    • Update runbooks
    • Share learnings

On-Call Contact

  • Primary: [On-call engineer]
  • Secondary: [Backup engineer]
  • Escalation: [Engineering manager]

Incident Response

  1. Detect: Monitoring alert or user report
  2. Acknowledge: Respond to the alert within 5 minutes
  3. Investigate: Gather information, check logs
  4. Mitigate: Implement temporary fix if needed
  5. Resolve: Implement permanent fix
  6. Post-mortem: Document incident, learnings, improvements

Common Error Messages

Backend Errors

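| Error | Cause | Resolution |
| --- | --- | --- |
| ECONNREFUSED | Nothing is listening at the target host and port (a dependency is down or misconfigured) | Check that the dependency is running and that the configured host and port are correct |
| EADDRINUSE | Port already in use | Check port, kill process |
| ETIMEDOUT | Connection timeout | Check network, increase timeout |
| Unauthorized | Invalid auth | Check credentials, token |
| Forbidden | Insufficient permissions | Check user role, permissions |
| InternalServerError | Server error | Check logs, fix bug |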

Web Errors

| Error | Cause | Resolution |
| --- | --- | --- |
| Network Error | Backend unreachable | Check backend, network |
| 401 Unauthorized | Invalid token | Refresh token, re-auth |
| 403 Forbidden | Insufficient permissions | Check user role |
| 404 Not Found | Resource not found | Check URL, resource exists |
| 500 Internal Server Error | Server error | Check logs, fix bug |

Mobile Errors

| Error | Cause | Resolution |
| --- | --- | --- |
| Network request failed | Network issue | Check network, backend |
| Auth failed | Invalid credentials | Check credentials, re-auth |
| WebSocket error | Connection issue | Check backend, network |
| Storage error | Secure storage issue | Check storage permissions |
| App crashed | Runtime error | Check logs, fix bug |

Monitoring and Alerts

Key Metrics to Monitor

Backend:

  • CPU usage
  • Memory usage
  • API response times
  • Error rate
  • Database query times
  • WebSocket connection count

Web:

  • Page load time
  • Bundle size
  • API response times
  • Error rate
  • WebSocket connection count

Mobile:

  • App launch time
  • API response times
  • Error rate
  • WebSocket connection count
  • Crash rate

Alert Thresholds

| Metric | Warning | Critical |
| --- | --- | --- |
| CPU usage | 70% | 90% |
| Memory usage | 70% | 90% |
| API response time | 1s | 5s |
| Error rate | 5% | 10% |
| Database query time | 500ms | 2s |
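
The thresholds above translate directly into a three-way classification. The sketch below assumes a level is reached when the value is greater than or equal to the threshold, which the table does not specify, and the helper name `classify` is hypothetical.

```shell
# classify <value> <warning> <critical>  ->  ok | warning | critical
# awk handles decimals, so the same helper works for percentages and seconds.
classify() {
  awk -v v="$1" -v w="$2" -v c="$3" 'BEGIN {
    if (v >= c) print "critical"
    else if (v >= w) print "warning"
    else print "ok"
  }'
}

# Usage with values from the table (units must match within one call):
#   classify 82 70 90      # CPU usage in percent
#   classify 0.35 0.5 2    # database query time in seconds
```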

Monitoring Tools

  • Backend: Winston logging, Prometheus metrics
  • Web: Browser devtools, Lighthouse
  • Mobile: Expo logs, Sentry

References

  • README.md - Project overview
  • ARCHITECTURE_DOCUMENTATION.md - System architecture
  • API_DOCUMENTATION_GUIDE.md - API documentation
  • FUNCTIONALITY_REVIEW.md - Functional gaps and issues
  • CUTOVER_WEB.md - Web cutover procedures
  • CUTOVER_MOBILE.md - Mobile cutover procedures