# Troubleshooting Guide **Purpose:** Comprehensive troubleshooting guide for the trading dashboard, covering common issues, resolution steps, and escalation procedures. **Target Audience:** Developers, operators, support engineers. --- ## Table of Contents - [Backend Issues](#backend-issues) - [Web Issues](#web-issues) - [Mobile Issues](#mobile-issues) - [Database Issues](#database-issues) - [Authentication Issues](#authentication-issues) - [WebSocket Issues](#websocket-issues) - [Performance Issues](#performance-issues) - [Deployment Issues](#deployment-issues) - [Escalation Procedures](#escalation-procedures) --- ## Backend Issues ### Backend Won't Start **Symptoms:** - Backend fails to start - Error message on startup - Process exits immediately **Common Causes:** 1. Missing environment variables 2. Invalid configuration 3. Database connection failure 4. Port already in use **Resolution Steps:** 1. **Check environment variables:** ```bash cd backend cat .env # Ensure all required variables are set ``` 2. **Verify configuration:** ```bash cd backend npm run dev # Check error messages for specific issues ``` 3. **Check database connection:** ```bash # Test Cosmos DB connection curl -I $COSMOS_ENDPOINT # Check if Cosmos DB is accessible ``` 4. **Check port availability:** ```bash lsof -i :4018 # Kill process using port 4018 if needed kill -9 ``` 5. **Check logs:** ```bash # Backend logs should show startup errors # Look for specific error messages ``` **Prevention:** - Use `.env.example` as template - Validate environment variables at startup - Add configuration validation script --- ### Backend Health Check Failing **Symptoms:** - `GET /health/live` returns non-200 status - Health check endpoint unreachable - Docker healthcheck failing **Common Causes:** 1. Backend not running 2. Health check endpoint not implemented 3. Database connection issues 4. Dependencies unhealthy **Resolution Steps:** 1. **Check if backend is running:** ```bash curl http://localhost:4018/health/live ``` 2. **Check backend logs:** ```bash # Look for health check errors # Check for database connection errors ``` 3. **Check dependencies:** ```bash curl http://localhost:4003/health # Check platform-service health ``` 4. **Check database:** ```bash # Verify Cosmos DB is accessible # Check database container exists ``` **Prevention:** - Implement comprehensive health checks - Add dependency health checks - Monitor health check failures --- ### Trading Loop Not Running **Symptoms:** - No trades executing - Bot state shows "idle" - Positions not updating **Common Causes:** 1. Trading disabled globally 2. Profile disabled 3. Market closed 4. Configuration error 5. Exchange API issue **Resolution Steps:** 1. **Check trading control state:** ```bash curl -H "Authorization: Bearer " http://localhost:4018/api/trading/control # Check if globalTradingEnabled is true ``` 2. **Check profile status:** ```bash # Check if specific profile is enabled # Look for profile-level disable ``` 3. **Check market hours:** ```bash # Verify SessionRule is not blocking # Check if market is open ``` 4. **Check exchange API:** ```bash # Verify exchange API credentials # Test exchange API connectivity ``` 5. **Check backend logs:** ```bash # Look for trading loop errors # Check for strategy rule failures ``` **Prevention:** - Add trading loop status monitoring - Add exchange API health checks - Implement trading loop alerts --- ### Orders Not Executing **Symptoms:** - Orders submitted but not executed - Orders stuck in "pending" state - Exchange API errors **Common Causes:** 1. Exchange API issue 2. Insufficient capital 3. Risk management blocking 4. Exchange maintenance 5. Invalid order parameters **Resolution Steps:** 1. **Check exchange API:** ```bash # Test exchange API connectivity # Verify API credentials # Check exchange status page ``` 2. **Check capital:** ```bash # Verify sufficient capital available # Check capital ledger ``` 3. **Check risk engine:** ```bash # Check if risk management is blocking # Look for risk rule failures ``` 4. **Check order parameters:** ```bash # Verify order parameters are valid # Check symbol, quantity, price ``` 5. **Check backend logs:** ```bash # Look for order execution errors # Check for exchange API errors ``` **Prevention:** - Add order execution monitoring - Add exchange API health checks - Implement order execution alerts --- ### Reconciliation Failures **Symptoms:** - Reconciliation service failing - Data inconsistencies - Reconciliation errors in logs **Common Causes:** 1. Exchange API issue 2. Database connection issue 3. Data format mismatch 4. Missing data 5. Logic error **Resolution Steps:** 1. **Check reconciliation logs:** ```bash # Look for reconciliation errors # Check for specific failure reasons ``` 2. **Run reconciliation manually:** ```bash cd backend npm run reconcile:lifecycle-history # Run specific reconciliation script ``` 3. **Check exchange data:** ```bash # Verify exchange API is returning correct data # Check for data format changes ``` 4. **Check database:** ```bash # Verify database is accessible # Check for data corruption ``` **Prevention:** - Add reconciliation monitoring - Add reconciliation alerts - Implement reconciliation retry logic --- ## Web Issues ### Web Won't Load **Symptoms:** - Blank screen - Loading spinner indefinitely - Browser console errors **Common Causes:** 1. Backend API unreachable 2. Authentication failure 3. Build error 4. JavaScript error 5. Network issue **Resolution Steps:** 1. **Check browser console:** ```javascript // Open browser devtools // Check for JavaScript errors // Check for network errors ``` 2. **Check backend API:** ```bash curl http://localhost:4018/health/live # Verify backend is running ``` 3. **Check authentication:** ```bash # Verify auth token is valid # Check platform-service is reachable ``` 4. **Check build:** ```bash cd web pnpm build # Verify build succeeds ``` 5. **Check network:** ```bash # Check network connectivity # Verify CORS settings ``` **Prevention:** - Add error boundaries - Add loading states - Implement graceful degradation --- ### Authentication Failing **Symptoms:** - Login fails - Session expires immediately - Auth token invalid - 401/403 errors **Common Causes:** 1. Platform-service unreachable 2. Invalid credentials 3. Token expired 4. Platform-service down 5. Configuration error **Resolution Steps:** 1. **Check platform-service:** ```bash curl http://localhost:4003/health # Verify platform-service is running ``` 2. **Check credentials:** ```bash # Verify username/password are correct # Check platform-service user exists ``` 3. **Check token:** ```bash # Verify JWT token is valid # Check token expiration ``` 4. **Check configuration:** ```bash cat web/.env.local # Verify VITE_PLATFORM_URL is correct # Verify auth configuration ``` 5. **Check logs:** ```bash # Check platform-service logs # Check backend auth logs ``` **Prevention:** - Add auth error handling - Implement token refresh - Add auth monitoring --- ### WebSocket Connection Failing **Symptoms:** - WebSocket won't connect - Connection drops repeatedly - No real-time updates - WebSocket errors in console **Common Causes:** 1. Backend unreachable 2. Auth token invalid 3. WebSocket blocked by firewall 4. Namespace mismatch 5. Connection limit exceeded **Resolution Steps:** 1. **Check backend WebSocket:** ```bash # Verify backend is running # Check WebSocket port is accessible ``` 2. **Check auth token:** ```bash # Verify JWT token is valid # Check token is included in connection ``` 3. **Check firewall:** ```bash # Verify WebSocket port is not blocked # Check network settings ``` 4. **Check namespace:** ```bash # Verify correct namespace (/trading) # Check namespace exists in backend ``` 5. **Check logs:** ```bash # Check backend WebSocket logs # Look for connection errors ``` **Prevention:** - Add WebSocket error handling - Implement reconnection logic - Add WebSocket monitoring --- ### Data Not Updating **Symptoms:** - Data stale - No real-time updates - Manual refresh required - WebSocket connected but no updates **Common Causes:** 1. WebSocket not receiving events 2. Backend not broadcasting 3. Event subscription issue 4. Data cache issue 5. Filter/mask issue **Resolution Steps:** 1. **Check WebSocket connection:** ```javascript // Check WebSocket is connected // Check for connection errors ``` 2. **Check backend broadcasting:** ```bash # Check backend logs for broadcast events # Verify events are being sent ``` 3. **Check event subscription:** ```javascript // Verify correct events are subscribed // Check for subscription errors ``` 4. **Check data cache:** ```javascript // Clear cache if needed // Verify cache is not stale ``` 5. **Check filters:** ```javascript // Verify data filters are correct # Check for masking issues ``` **Prevention:** - Add data staleness monitoring - Implement cache invalidation - Add data update alerts --- ## Mobile Issues ### Mobile App Won't Launch **Symptoms:** - App crashes on launch - Blank screen - Loading spinner indefinitely - Expo Go won't load **Common Causes:** 1. Build error 2. Configuration error 3. Platform-service unreachable 4. Backend unreachable 5. Expo Go version mismatch **Resolution Steps:** 1. **Check Expo logs:** ```bash # Check Expo dev server logs # Look for build errors ``` 2. **Check configuration:** ```bash cat mobile/.env.local # Verify EXPO_PUBLIC_* variables are set ``` 3. **Check platform-service:** ```bash curl http://localhost:4003/health # Verify platform-service is running ``` 4. **Check backend:** ```bash curl http://localhost:4018/health/live # Verify backend is running ``` 5. **Check Expo Go:** ```bash # Verify Expo Go is latest version # Try clearing Expo Go cache ``` **Prevention:** - Add error boundaries - Add launch error handling - Implement graceful degradation --- ### Mobile Authentication Failing **Symptoms:** - Login fails - Session won't restore - Auth token invalid - Secure storage issue **Common Causes:** 1. Platform-service unreachable 2. Invalid credentials 3. Token expired 4. Secure storage error 5. Platform-service down **Resolution Steps:** 1. **Check platform-service:** ```bash curl http://localhost:4003/health # Verify platform-service is running ``` 2. **Check credentials:** ```bash # Verify credentials are correct # Check platform-service user exists ``` 3. **Check token:** ```bash # Verify JWT token is valid # Check token expiration ``` 4. **Check secure storage:** ```bash # Verify secure storage is working # Check for storage errors ``` 5. **Check logs:** ```bash # Check mobile logs # Look for auth errors ``` **Prevention:** - Add auth error handling - Implement token refresh - Add secure storage error handling --- ### Mobile WebSocket Connection Failing **Symptoms:** - WebSocket won't connect - Connection drops repeatedly - No real-time updates - WebSocket errors in logs **Common Causes:** 1. Backend unreachable 2. Auth token invalid 3. WebSocket blocked by network 4. Namespace mismatch 5. Connection limit exceeded **Resolution Steps:** 1. **Check backend WebSocket:** ```bash # Verify backend is running # Check WebSocket port is accessible ``` 2. **Check auth token:** ```bash # Verify JWT token is valid # Check token is included in connection ``` 3. **Check network:** ```bash # Verify network allows WebSocket # Check mobile network settings ``` 4. **Check namespace:** ```bash # Verify correct namespace (/trading) # Check namespace exists in backend ``` 5. **Check logs:** ```bash # Check mobile logs # Look for connection errors ``` **Prevention:** - Add WebSocket error handling - Implement reconnection logic - Add polling fallback --- ### Mobile Data Not Updating **Symptoms:** - Data stale - No real-time updates - Manual refresh required - WebSocket connected but no updates **Common Causes:** 1. WebSocket not receiving events 2. Backend not broadcasting 3. Event subscription issue 4. Data cache issue 5. Polling fallback not working **Resolution Steps:** 1. **Check WebSocket connection:** ```javascript // Check WebSocket is connected // Check for connection errors ``` 2. **Check backend broadcasting:** ```bash # Check backend logs for broadcast events # Verify events are being sent ``` 3. **Check polling fallback:** ```javascript // Verify polling is working // Check polling interval ``` 4. **Check data cache:** ```javascript // Clear cache if needed # Verify cache is not stale ``` 5. **Check logs:** ```bash # Check mobile logs # Look for update errors ``` **Prevention:** - Add data staleness monitoring - Implement polling fallback - Add data update alerts --- ## Database Issues ### Cosmos DB Connection Failing **Symptoms:** - Backend can't connect to Cosmos - Connection timeout - Authentication error - Database not found **Common Causes:** 1. Invalid credentials 2. Network issue 3. Cosmos DB down 4. Container not found 5. Firewall blocking **Resolution Steps:** 1. **Check credentials:** ```bash cat backend/.env # Verify COSMOS_ENDPOINT and COSMOS_KEY are correct ``` 2. **Test connection:** ```bash curl -I $COSMOS_ENDPOINT # Verify Cosmos DB is accessible ``` 3. **Check database:** ```bash # Verify database exists # Check container exists ``` 4. **Check network:** ```bash # Verify network allows connection # Check firewall settings ``` 5. **Check logs:** ```bash # Check backend logs for Cosmos errors # Look for connection errors ``` **Prevention:** - Add connection retry logic - Implement connection pooling - Add connection monitoring --- ### Cosmos DB Slow Performance **Symptoms:** - Slow query response times - Timeout errors - Performance degradation **Common Causes:** 1. Large result sets 2. Missing indexes 3. High RU consumption 4. Network latency 5. Database throttling **Resolution Steps:** 1. **Check query performance:** ```bash # Check query execution time # Look for slow queries ``` 2. **Check indexes:** ```bash # Verify indexes exist # Check index usage ``` 3. **Check RU consumption:** ```bash # Monitor RU consumption # Check for throttling ``` 4. **Optimize queries:** ```bash # Add query filters # Use pagination # Reduce result set size ``` **Prevention:** - Add query performance monitoring - Implement query optimization - Add performance alerts --- ### Data Inconsistency **Symptoms:** - Data mismatch between surfaces - Stale data - Missing data - Duplicate data **Common Causes:** 1. Reconciliation failure 2. Cache issue 3. Race condition 4. Data corruption 5. Sync issue **Resolution Steps:** 1. **Run reconciliation:** ```bash cd backend npm run reconcile:lifecycle-history # Run reconciliation script ``` 2. **Check cache:** ```bash # Clear cache if needed # Verify cache is not stale ``` 3. **Check sync:** ```bash # Verify sync is working # Check for sync errors ``` 4. **Check data integrity:** ```bash # Verify data integrity # Check for corruption ``` **Prevention:** - Add data consistency monitoring - Implement reconciliation - Add data integrity checks --- ## Authentication Issues ### Platform-Service Unreachable **Symptoms:** - Can't authenticate - Token validation fails - Platform-service health check fails **Common Causes:** 1. Platform-service down 2. Network issue 3. Wrong URL 4. Firewall blocking 5. DNS issue **Resolution Steps:** 1. **Check platform-service:** ```bash curl http://localhost:4003/health # Verify platform-service is running ``` 2. **Check configuration:** ```bash cat backend/.env # Verify PLATFORM_API_URL is correct cat web/.env.local # Verify VITE_PLATFORM_URL is correct ``` 3. **Check network:** ```bash # Verify network allows connection # Check firewall settings ``` 4. **Check DNS:** ```bash # Verify DNS resolution # Check host file ``` **Prevention:** - Add platform-service monitoring - Implement fallback auth - Add auth monitoring --- ### JWT Token Invalid **Symptoms:** - 401 errors - Token validation fails - Session expires immediately **Common Causes:** 1. Token expired 2. Token malformed 3. Wrong signing key 4. Token revoked 5. Clock skew **Resolution Steps:** 1. **Check token expiration:** ```bash # Decode JWT token # Check expiration time ``` 2. **Check token format:** ```bash # Verify JWT format is correct # Check for token errors ``` 3. **Check signing key:** ```bash # Verify signing key is correct # Check key rotation ``` 4. **Check clock skew:** ```bash # Verify system time is correct # Check NTP sync ``` **Prevention:** - Implement token refresh - Add token validation - Add token monitoring --- ## WebSocket Issues ### WebSocket Connection Drops **Symptoms:** - Connection drops repeatedly - Reconnection fails - Connection timeout **Common Causes:** 1. Network instability 2. Backend restart 3. Auth token expired 4. Connection limit 5. Firewall timeout **Resolution Steps:** 1. **Check network:** ```bash # Verify network stability # Check for packet loss ``` 2. **Check backend:** ```bash # Verify backend is not restarting # Check backend logs ``` 3. **Check auth token:** ```bash # Verify JWT token is valid # Check token expiration ``` 4. **Check connection limit:** ```bash # Verify connection limit not exceeded # Check for connection leaks ``` **Prevention:** - Implement reconnection logic - Add connection monitoring - Implement heartbeat --- ### WebSocket Not Receiving Events **Symptoms:** - WebSocket connected but no events - Events not arriving - Event subscription issue **Common Causes:** 1. Not subscribed to events 2. Namespace mismatch 3. Backend not broadcasting 4. Event filtering 5. Room mismatch **Resolution Steps:** 1. **Check subscription:** ```javascript // Verify correct events are subscribed // Check for subscription errors ``` 2. **Check namespace:** ```bash # Verify correct namespace # Check namespace exists ``` 3. **Check backend:** ```bash # Verify backend is broadcasting # Check backend logs ``` 4. **Check filtering:** ```javascript // Verify event filters are correct # Check for filtering errors ``` **Prevention:** - Add event monitoring - Implement event acknowledgment - Add event logging --- ## Performance Issues ### Backend Slow Response Times **Symptoms:** - API responses slow - High latency - Timeout errors **Common Causes:** 1. Database query slow 2. External API slow 3. CPU bottleneck 4. Memory bottleneck 5. Network latency **Resolution Steps:** 1. **Check database queries:** ```bash # Check query performance # Look for slow queries ``` 2. **Check external APIs:** ```bash # Check exchange API latency # Verify API performance ``` 3. **Check system resources:** ```bash # Check CPU usage # Check memory usage ``` 4. **Check network:** ```bash # Check network latency # Verify network bandwidth ``` **Prevention:** - Add performance monitoring - Implement caching - Add performance alerts --- ### Web Slow Load Times **Symptoms:** - Slow page load - Large bundle size - Slow initial render **Common Causes:** 1. Large bundle size 2. Too many requests 3. Slow API responses 4. Unoptimized assets 5. No caching **Resolution Steps:** 1. **Check bundle size:** ```bash cd web pnpm build # Check bundle size ``` 2. **Check network requests:** ```javascript // Check network tab // Look for large requests ``` 3. **Check API responses:** ```bash # Check API response times # Look for slow endpoints ``` 4. **Optimize assets:** ```bash # Optimize images # Minify assets # Implement caching ``` **Prevention:** - Add performance monitoring - Implement code splitting - Add caching --- ## Deployment Issues ### Docker Build Failing **Symptoms:** - Docker build fails - Build timeout - Dependency install fails **Common Causes:** 1. Invalid Dockerfile 2. Missing dependencies 3. Network issue 4. Insufficient resources 5. Build context too large **Resolution Steps:** 1. **Check Dockerfile:** ```bash # Verify Dockerfile syntax # Check for errors ``` 2. **Check dependencies:** ```bash # Verify package.json is valid # Check for missing dependencies ``` 3. **Check network:** ```bash # Verify network connectivity # Check registry access ``` 4. **Check resources:** ```bash # Verify sufficient disk space # Check memory availability ``` **Prevention:** - Add Docker build validation - Implement build caching - Add build monitoring --- ### Docker Container Won't Start **Symptoms:** - Container won't start - Container exits immediately - Container crashes **Common Causes:** 1. Invalid configuration 2. Missing environment variables 3. Port conflict 4. Dependency unavailable 5. Health check failing **Resolution Steps:** 1. **Check container logs:** ```bash docker logs # Look for startup errors ``` 2. **Check configuration:** ```bash # Verify environment variables # Check configuration files ``` 3. **Check ports:** ```bash # Verify port availability # Check for conflicts ``` 4. **Check dependencies:** ```bash # Verify dependencies are available # Check service health ``` **Prevention:** - Add container health checks - Implement dependency checks - Add startup validation --- ## Escalation Procedures ### When to Escalate Escalate to on-call if: - Production outage - Data loss or corruption - Security breach - Critical bug affecting multiple users - Issue cannot be resolved within 30 minutes ### Escalation Steps 1. **Document the issue:** - Describe the problem - Include error messages - Include logs - Include request IDs 2. **Attempt resolution:** - Follow troubleshooting steps - Document attempted solutions - Note what worked/didn't work 3. **Escalate:** - Contact on-call engineer - Provide documentation - Provide context - Provide urgency 4. **Follow up:** - Monitor resolution - Document resolution - Update runbooks - Share learnings ### On-Call Contact - **Primary:** [On-call engineer] - **Secondary:** [Backup engineer] - **Escalation:** [Engineering manager] ### Incident Response 1. **Detect:** Monitoring alert or user report 2. **Acknowledge:** Acknowledge alert within 5 minutes 3. **Investigate:** Gather information, check logs 4. **Mitigate:** Implement temporary fix if needed 5. **Resolve:** Implement permanent fix 6. **Post-mortem:** Document incident, learnings, improvements --- ## Common Error Messages ### Backend Errors | Error | Cause | Resolution | |-------|-------|------------| | `ECONNREFUSED` | Port not available | Check port, kill process | | `ETIMEDOUT` | Connection timeout | Check network, increase timeout | | `Unauthorized` | Invalid auth | Check credentials, token | | `Forbidden` | Insufficient permissions | Check user role, permissions | | `InternalServerError` | Server error | Check logs, fix bug | ### Web Errors | Error | Cause | Resolution | |-------|-------|------------| | `Network Error` | Backend unreachable | Check backend, network | | `401 Unauthorized` | Invalid token | Refresh token, re-auth | | `403 Forbidden` | Insufficient permissions | Check user role | | `404 Not Found` | Resource not found | Check URL, resource exists | | `500 Internal Server Error` | Server error | Check logs, fix bug | ### Mobile Errors | Error | Cause | Resolution | |-------|-------|------------| | `Network request failed` | Network issue | Check network, backend | | `Auth failed` | Invalid credentials | Check credentials, re-auth | | `WebSocket error` | Connection issue | Check backend, network | | `Storage error` | Secure storage issue | Check storage permissions | | `App crashed` | Runtime error | Check logs, fix bug | --- ## Monitoring and Alerts ### Key Metrics to Monitor **Backend:** - CPU usage - Memory usage - API response times - Error rate - Database query times - WebSocket connection count **Web:** - Page load time - Bundle size - API response times - Error rate - WebSocket connection count **Mobile:** - App launch time - API response times - Error rate - WebSocket connection count - Crash rate ### Alert Thresholds | Metric | Warning | Critical | |--------|---------|----------| | CPU usage | 70% | 90% | | Memory usage | 70% | 90% | | API response time | 1s | 5s | | Error rate | 5% | 10% | | Database query time | 500ms | 2s | ### Monitoring Tools - Backend: Winston logging, Prometheus metrics - Web: Browser devtools, Lighthouse - Mobile: Expo logs, Sentry --- ## References - **README.md** - Project overview - **ARCHITECTURE_DOCUMENTATION.md** - System architecture - **API_DOCUMENTATION_GUIDE.md** - API documentation - **FUNCTIONALITY_REVIEW.md** - Functional gaps and issues - **CUTOVER_WEB.md** - Web cutover procedures - **CUTOVER_MOBILE.md** - Mobile cutover procedures