Troubleshooting Guide
Purpose: Comprehensive troubleshooting guide for the trading dashboard, covering common issues, resolution steps, and escalation procedures.
Target Audience: Developers, operators, support engineers.
Table of Contents
- Backend Issues
- Web Issues
- Mobile Issues
- Database Issues
- Authentication Issues
- WebSocket Issues
- Performance Issues
- Deployment Issues
- Escalation Procedures
- Common Error Messages
- Monitoring and Alerts
- References
Backend Issues
Backend Won't Start
Symptoms:
- Backend fails to start
- Error message on startup
- Process exits immediately
Common Causes:
- Missing environment variables
- Invalid configuration
- Database connection failure
- Port already in use
Resolution Steps:
- Check environment variables:
cd backend
cat .env
# Ensure all required variables are set
- Verify configuration:
cd backend
npm run dev
# Check error messages for specific issues
- Check database connection:
# Test Cosmos DB connectivity (export COSMOS_ENDPOINT from backend/.env first)
curl -I "$COSMOS_ENDPOINT"
# Any HTTP response, even an auth error, shows the endpoint is reachable
- Check port availability:
lsof -i :4018
# Stop the process using port 4018 if needed; try a plain kill first
kill <PID>
# Use kill -9 <PID> only if the process ignores SIGTERM
- Check logs:
# Backend logs should show startup errors
# Look for specific error messages
Prevention:
- Use `.env.example` as a template
- Validate environment variables at startup
- Add configuration validation script
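As a minimal sketch of startup validation, the helper below reports every missing variable in one error instead of failing on the first. `COSMOS_ENDPOINT`, `COSMOS_KEY`, and `PLATFORM_API_URL` are the names used elsewhere in this guide; the list is an assumption and the real set is likely longer. The function names are hypothetical.

```typescript
// Assumed list: only these three names appear in this guide; extend as needed.
const REQUIRED_VARS = ["COSMOS_ENDPOINT", "COSMOS_KEY", "PLATFORM_API_URL"] as const;

type Env = Record<string, string | undefined>;

/** Returns the names of required variables that are unset or blank. */
function findMissingEnv(env: Env, required: readonly string[] = REQUIRED_VARS): string[] {
  return required.filter((name) => {
    const value = env[name];
    return value === undefined || value.trim() === "";
  });
}

/** Throws a single error naming every missing variable. */
function assertEnv(env: Env, required: readonly string[] = REQUIRED_VARS): void {
  const missing = findMissingEnv(env, required);
  if (missing.length > 0) {
    throw new Error(`Missing required environment variables: ${missing.join(", ")}`);
  }
}
```

Calling `assertEnv(process.env)` as the first statement of the backend entry point would make a misconfigured start fail with a message that names the problem directly.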
Backend Health Check Failing
Symptoms:
- `GET /health/live` returns non-200 status
- Health check endpoint unreachable
- Docker healthcheck failing
Common Causes:
- Backend not running
- Health check endpoint not implemented
- Database connection issues
- Dependencies unhealthy
Resolution Steps:
- Check if backend is running:
curl http://localhost:4018/health/live
- Check backend logs:
# Look for health check errors
# Check for database connection errors
- Check dependencies:
curl http://localhost:4003/health
# Check platform-service health
- Check database:
# Verify Cosmos DB is accessible
# Check database container exists
Prevention:
- Implement comprehensive health checks
- Add dependency health checks
- Monitor health check failures
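One way to add dependency health checks is to fold per-dependency results into a single report. The sketch below is an assumption, not the backend's actual implementation: the dependency names (`cosmos`, `platformService`) and the report shape are hypothetical. It is usually safer to expose such a report on a separate readiness endpoint and keep `/health/live` cheap, so a slow dependency does not cause the process itself to be restarted.

```typescript
type DependencyStatus = "up" | "down";

interface HealthReport {
  status: "ok" | "degraded";
  httpStatus: 200 | 503;
  dependencies: Record<string, DependencyStatus>;
}

/**
 * Combines per-dependency results into one report.
 * Any dependency that is down turns the whole report into a 503.
 */
function buildHealthReport(dependencies: Record<string, DependencyStatus>): HealthReport {
  const allUp = Object.keys(dependencies).every((name) => dependencies[name] === "up");
  return {
    status: allUp ? "ok" : "degraded",
    httpStatus: allUp ? 200 : 503,
    dependencies,
  };
}
```

Because the report lists each dependency by name, a failing check points at the broken component rather than only signalling that something is wrong.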
Trading Loop Not Running
Symptoms:
- No trades executing
- Bot state shows "idle"
- Positions not updating
Common Causes:
- Trading disabled globally
- Profile disabled
- Market closed
- Configuration error
- Exchange API issue
Resolution Steps:
- Check trading control state:
curl -H "Authorization: Bearer <token>" http://localhost:4018/api/trading/control
# Check if globalTradingEnabled is true
- Check profile status:
# Check if specific profile is enabled
# Look for profile-level disable
- Check market hours:
# Verify SessionRule is not blocking
# Check if market is open
- Check exchange API:
# Verify exchange API credentials
# Test exchange API connectivity
- Check backend logs:
# Look for trading loop errors
# Check for strategy rule failures
Prevention:
- Add trading loop status monitoring
- Add exchange API health checks
- Implement trading loop alerts
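The resolution steps above amount to an ordered checklist, which could be encoded so that a status monitor reports the first blocking condition. This is a sketch under assumptions: only `globalTradingEnabled` is named in this guide, and the other fields are hypothetical stand-ins for whatever the backend exposes.

```typescript
// Hypothetical snapshot shape; only globalTradingEnabled comes from this guide.
interface TradingSnapshot {
  globalTradingEnabled: boolean;
  profileEnabled: boolean;
  marketOpen: boolean;
  exchangeReachable: boolean;
}

/** Returns the first blocking condition, checked in the same order as the steps above. */
function explainIdle(s: TradingSnapshot): string {
  if (!s.globalTradingEnabled) return "Trading is disabled globally";
  if (!s.profileEnabled) return "Profile is disabled";
  if (!s.marketOpen) return "Market is closed or a session rule is blocking";
  if (!s.exchangeReachable) return "Exchange API is unreachable";
  return "No blocking condition found; check backend logs for strategy rule failures";
}
```

Checking the cheapest and most common causes first keeps the alert message specific.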
Orders Not Executing
Symptoms:
- Orders submitted but not executed
- Orders stuck in "pending" state
- Exchange API errors
Common Causes:
- Exchange API issue
- Insufficient capital
- Risk management blocking
- Exchange maintenance
- Invalid order parameters
Resolution Steps:
- Check exchange API:
# Test exchange API connectivity
# Verify API credentials
# Check exchange status page
- Check capital:
# Verify sufficient capital available
# Check capital ledger
- Check risk engine:
# Check if risk management is blocking
# Look for risk rule failures
- Check order parameters:
# Verify order parameters are valid
# Check symbol, quantity, price
- Check backend logs:
# Look for order execution errors
# Check for exchange API errors
Prevention:
- Add order execution monitoring
- Add exchange API health checks
- Implement order execution alerts
Reconciliation Failures
Symptoms:
- Reconciliation service failing
- Data inconsistencies
- Reconciliation errors in logs
Common Causes:
- Exchange API issue
- Database connection issue
- Data format mismatch
- Missing data
- Logic error
Resolution Steps:
- Check reconciliation logs:
# Look for reconciliation errors
# Check for specific failure reasons
- Run reconciliation manually:
cd backend
npm run reconcile:lifecycle-history
# Run specific reconciliation script
- Check exchange data:
# Verify exchange API is returning correct data
# Check for data format changes
- Check database:
# Verify database is accessible
# Check for data corruption
Prevention:
- Add reconciliation monitoring
- Add reconciliation alerts
- Implement reconciliation retry logic
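Retry logic for reconciliation could look like the sketch below, which uses exponential backoff between attempts. The function names, the attempt count, and the delay values are assumptions chosen for illustration; nothing here reflects the existing reconciliation scripts.

```typescript
/** Delay before retry number `attempt` (1-based): base, 2x base, 4x base, ... capped at maxMs. */
function backoffDelayMs(attempt: number, baseMs = 1000, maxMs = 30000): number {
  return Math.min(maxMs, baseMs * 2 ** (attempt - 1));
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

/**
 * Runs `task` up to `maxAttempts` times, waiting with exponential backoff
 * between failures. Rethrows the last error if every attempt fails.
 */
async function retry<T>(
  task: () => Promise<T>,
  maxAttempts = 5,
  wait: (ms: number) => Promise<void> = sleep,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt < maxAttempts) await wait(backoffDelayMs(attempt));
    }
  }
  throw lastError;
}
```

Retries only help with transient causes such as an exchange API or database connection issue. A data format mismatch or logic error will fail the same way on every attempt, so it should be surfaced rather than retried.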
Web Issues
Web Won't Load
Symptoms:
- Blank screen
- Loading spinner indefinitely
- Browser console errors
Common Causes:
- Backend API unreachable
- Authentication failure
- Build error
- JavaScript error
- Network issue
Resolution Steps:
- Check browser console:
// Open browser devtools
// Check for JavaScript errors
// Check for network errors
- Check backend API:
curl http://localhost:4018/health/live
# Verify backend is running
- Check authentication:
# Verify auth token is valid
# Check platform-service is reachable
- Check build:
cd web
pnpm build
# Verify build succeeds
- Check network:
# Check network connectivity
# Verify CORS settings
Prevention:
- Add error boundaries
- Add loading states
- Implement graceful degradation
Authentication Failing
Symptoms:
- Login fails
- Session expires immediately
- Auth token invalid
- 401/403 errors
Common Causes:
- Platform-service unreachable
- Invalid credentials
- Token expired
- Platform-service down
- Configuration error
Resolution Steps:
- Check platform-service:
curl http://localhost:4003/health
# Verify platform-service is running
- Check credentials:
# Verify username/password are correct
# Check platform-service user exists
- Check token:
# Verify JWT token is valid
# Check token expiration
- Check configuration:
cat web/.env.local
# Verify VITE_PLATFORM_URL is correct
# Verify auth configuration
- Check logs:
# Check platform-service logs
# Check backend auth logs
Prevention:
- Add auth error handling
- Implement token refresh
- Add auth monitoring
WebSocket Connection Failing
Symptoms:
- WebSocket won't connect
- Connection drops repeatedly
- No real-time updates
- WebSocket errors in console
Common Causes:
- Backend unreachable
- Auth token invalid
- WebSocket blocked by firewall
- Namespace mismatch
- Connection limit exceeded
Resolution Steps:
- Check backend WebSocket:
# Verify backend is running
# Check WebSocket port is accessible
- Check auth token:
# Verify JWT token is valid
# Check token is included in connection
- Check firewall:
# Verify WebSocket port is not blocked
# Check network settings
- Check namespace:
# Verify correct namespace (/trading)
# Check namespace exists in backend
- Check logs:
# Check backend WebSocket logs
# Look for connection errors
Prevention:
- Add WebSocket error handling
- Implement reconnection logic
- Add WebSocket monitoring
Data Not Updating
Symptoms:
- Data stale
- No real-time updates
- Manual refresh required
- WebSocket connected but no updates
Common Causes:
- WebSocket not receiving events
- Backend not broadcasting
- Event subscription issue
- Data cache issue
- Filter/mask issue
Resolution Steps:
- Check WebSocket connection:
// Check WebSocket is connected
// Check for connection errors
- Check backend broadcasting:
# Check backend logs for broadcast events
# Verify events are being sent
- Check event subscription:
// Verify correct events are subscribed
// Check for subscription errors
- Check data cache:
// Clear cache if needed
// Verify cache is not stale
- Check filters:
// Verify data filters are correct
// Check for masking issues
Prevention:
- Add data staleness monitoring
- Implement cache invalidation
- Add data update alerts
Mobile Issues
Mobile App Won't Launch
Symptoms:
- App crashes on launch
- Blank screen
- Loading spinner indefinitely
- Expo Go won't load
Common Causes:
- Build error
- Configuration error
- Platform-service unreachable
- Backend unreachable
- Expo Go version mismatch
Resolution Steps:
- Check Expo logs:
# Check Expo dev server logs
# Look for build errors
- Check configuration:
cat mobile/.env.local
# Verify EXPO_PUBLIC_* variables are set
- Check platform-service:
curl http://localhost:4003/health
# Verify platform-service is running
- Check backend:
curl http://localhost:4018/health/live
# Verify backend is running
- Check Expo Go:
# Verify Expo Go is latest version
# Try clearing Expo Go cache
Prevention:
- Add error boundaries
- Add launch error handling
- Implement graceful degradation
Mobile Authentication Failing
Symptoms:
- Login fails
- Session won't restore
- Auth token invalid
- Secure storage issue
Common Causes:
- Platform-service unreachable
- Invalid credentials
- Token expired
- Secure storage error
- Platform-service down
Resolution Steps:
- Check platform-service:
curl http://localhost:4003/health
# Verify platform-service is running
- Check credentials:
# Verify credentials are correct
# Check platform-service user exists
- Check token:
# Verify JWT token is valid
# Check token expiration
- Check secure storage:
# Verify secure storage is working
# Check for storage errors
- Check logs:
# Check mobile logs
# Look for auth errors
Prevention:
- Add auth error handling
- Implement token refresh
- Add secure storage error handling
Mobile WebSocket Connection Failing
Symptoms:
- WebSocket won't connect
- Connection drops repeatedly
- No real-time updates
- WebSocket errors in logs
Common Causes:
- Backend unreachable
- Auth token invalid
- WebSocket blocked by network
- Namespace mismatch
- Connection limit exceeded
Resolution Steps:
- Check backend WebSocket:
# Verify backend is running
# Check WebSocket port is accessible
- Check auth token:
# Verify JWT token is valid
# Check token is included in connection
- Check network:
# Verify network allows WebSocket
# Check mobile network settings
- Check namespace:
# Verify correct namespace (/trading)
# Check namespace exists in backend
- Check logs:
# Check mobile logs
# Look for connection errors
Prevention:
- Add WebSocket error handling
- Implement reconnection logic
- Add polling fallback
Mobile Data Not Updating
Symptoms:
- Data stale
- No real-time updates
- Manual refresh required
- WebSocket connected but no updates
Common Causes:
- WebSocket not receiving events
- Backend not broadcasting
- Event subscription issue
- Data cache issue
- Polling fallback not working
Resolution Steps:
- Check WebSocket connection:
// Check WebSocket is connected
// Check for connection errors
- Check backend broadcasting:
# Check backend logs for broadcast events
# Verify events are being sent
- Check polling fallback:
// Verify polling is working
// Check polling interval
- Check data cache:
// Clear cache if needed
// Verify cache is not stale
- Check logs:
# Check mobile logs
# Look for update errors
Prevention:
- Add data staleness monitoring
- Implement polling fallback
- Add data update alerts
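Staleness monitoring and the polling fallback can share one decision: poll when the socket is down, or when it is connected but silent for too long. The sketch below assumes the client records the time of the last received event; the function name and the 15-second threshold are hypothetical.

```typescript
/**
 * Decides whether the client should fall back to polling.
 * Returns true when the socket is down, or when it is connected but has
 * delivered no event for longer than `staleAfterMs`.
 */
function shouldPoll(
  socketConnected: boolean,
  lastEventAtMs: number,
  nowMs: number,
  staleAfterMs = 15000, // assumed threshold, tune per data feed
): boolean {
  if (!socketConnected) return true;
  return nowMs - lastEventAtMs > staleAfterMs;
}
```

A quiet market can legitimately produce no events, so the threshold should be longer than the longest expected gap, or the backend should emit a periodic keep-alive event that resets the timer.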
Database Issues
Cosmos DB Connection Failing
Symptoms:
- Backend can't connect to Cosmos
- Connection timeout
- Authentication error
- Database not found
Common Causes:
- Invalid credentials
- Network issue
- Cosmos DB down
- Container not found
- Firewall blocking
Resolution Steps:
- Check credentials:
cat backend/.env
# Verify COSMOS_ENDPOINT and COSMOS_KEY are correct
- Test connection:
curl -I "$COSMOS_ENDPOINT"
# Any HTTP response, even an auth error, shows Cosmos DB is reachable
- Check database:
# Verify database exists
# Check container exists
- Check network:
# Verify network allows connection
# Check firewall settings
- Check logs:
# Check backend logs for Cosmos errors
# Look for connection errors
Prevention:
- Add connection retry logic
- Implement connection pooling
- Add connection monitoring
Cosmos DB Slow Performance
Symptoms:
- Slow query response times
- Timeout errors
- Performance degradation
Common Causes:
- Large result sets
- Missing indexes
- High RU consumption
- Network latency
- Database throttling
Resolution Steps:
- Check query performance:
# Check query execution time
# Look for slow queries
- Check indexes:
# Verify indexes exist
# Check index usage
- Check RU consumption:
# Monitor RU consumption
# Check for throttling
- Optimize queries:
# Add query filters
# Use pagination
# Reduce result set size
Prevention:
- Add query performance monitoring
- Implement query optimization
- Add performance alerts
Data Inconsistency
Symptoms:
- Data mismatch between surfaces
- Stale data
- Missing data
- Duplicate data
Common Causes:
- Reconciliation failure
- Cache issue
- Race condition
- Data corruption
- Sync issue
Resolution Steps:
- Run reconciliation:
cd backend
npm run reconcile:lifecycle-history
# Run reconciliation script
- Check cache:
# Clear cache if needed
# Verify cache is not stale
- Check sync:
# Verify sync is working
# Check for sync errors
- Check data integrity:
# Verify data integrity
# Check for corruption
Prevention:
- Add data consistency monitoring
- Implement reconciliation
- Add data integrity checks
Authentication Issues
Platform-Service Unreachable
Symptoms:
- Can't authenticate
- Token validation fails
- Platform-service health check fails
Common Causes:
- Platform-service down
- Network issue
- Wrong URL
- Firewall blocking
- DNS issue
Resolution Steps:
- Check platform-service:
curl http://localhost:4003/health
# Verify platform-service is running
- Check configuration:
cat backend/.env
# Verify PLATFORM_API_URL is correct
cat web/.env.local
# Verify VITE_PLATFORM_URL is correct
- Check network:
# Verify network allows connection
# Check firewall settings
- Check DNS:
# Verify DNS resolution
# Check the hosts file (e.g. /etc/hosts)
Prevention:
- Add platform-service monitoring
- Implement fallback auth
- Add auth monitoring
JWT Token Invalid
Symptoms:
- 401 errors
- Token validation fails
- Session expires immediately
Common Causes:
- Token expired
- Token malformed
- Wrong signing key
- Token revoked
- Clock skew
Resolution Steps:
- Check token expiration:
# Decode JWT token
# Check expiration time
- Check token format:
# Verify JWT format is correct
# Check for token errors
- Check signing key:
# Verify signing key is correct
# Check key rotation
- Check clock skew:
# Verify system time is correct
# Check NTP sync
Prevention:
- Implement token refresh
- Add token validation
- Add token monitoring
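The "decode the token and check its expiration" steps can be done with a few lines of code. This sketch decodes the payload without verifying the signature, so it is a debugging aid only and must not be used to decide whether a token is trusted. The 30-second skew allowance is an assumed value, and the payload is assumed to be ASCII JSON.

```typescript
interface JwtClaims {
  exp?: number; // expiry, in seconds since the Unix epoch
  [claim: string]: unknown;
}

/** Decodes the payload segment of a JWT WITHOUT verifying the signature. */
function decodeJwtPayload(token: string): JwtClaims {
  const parts = token.split(".");
  if (parts.length !== 3) {
    throw new Error("Malformed JWT: expected 3 dot-separated segments");
  }
  // Convert base64url to base64, then restore the stripped padding.
  const b64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  const padded = b64 + "=".repeat((4 - (b64.length % 4)) % 4);
  return JSON.parse(atob(padded)) as JwtClaims;
}

/** True if the token is expired, allowing `skewSeconds` of clock drift between hosts. */
function isExpired(claims: JwtClaims, nowSeconds: number, skewSeconds = 30): boolean {
  if (typeof claims.exp !== "number") return false; // no exp claim, so no time-based expiry
  return nowSeconds > claims.exp + skewSeconds;
}
```

If `isExpired` returns true only when the skew allowance is set to zero, clock drift between the issuing and validating hosts is the likely cause, which points back to the NTP check above.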
WebSocket Issues
WebSocket Connection Drops
Symptoms:
- Connection drops repeatedly
- Reconnection fails
- Connection timeout
Common Causes:
- Network instability
- Backend restart
- Auth token expired
- Connection limit
- Firewall timeout
Resolution Steps:
- Check network:
# Verify network stability
# Check for packet loss
- Check backend:
# Verify backend is not restarting
# Check backend logs
- Check auth token:
# Verify JWT token is valid
# Check token expiration
- Check connection limit:
# Verify connection limit not exceeded
# Check for connection leaks
Prevention:
- Implement reconnection logic
- Add connection monitoring
- Implement heartbeat
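A heartbeat helps with two of the causes above: it keeps traffic flowing so idle-connection timeouts in firewalls and proxies do not fire, and it detects dead connections quickly. The sketch below covers only the bookkeeping; sending the ping and wiring it to a socket library is left out. The class name and default values are hypothetical.

```typescript
/**
 * Tracks heartbeat replies for one connection. The caller sends a ping every
 * `intervalMs` and records each pong; the connection counts as dead once more
 * than `maxMissed` intervals pass without a pong.
 */
class HeartbeatMonitor {
  private lastPongAtMs: number;
  private readonly intervalMs: number;
  private readonly maxMissed: number;

  constructor(startedAtMs: number, intervalMs = 25000, maxMissed = 2) {
    this.lastPongAtMs = startedAtMs;
    this.intervalMs = intervalMs;
    this.maxMissed = maxMissed;
  }

  /** Call whenever a pong (or any message) arrives from the peer. */
  recordPong(nowMs: number): void {
    this.lastPongAtMs = nowMs;
  }

  /** True once the peer has been silent for longer than the allowed window. */
  isDead(nowMs: number): boolean {
    return nowMs - this.lastPongAtMs > this.intervalMs * this.maxMissed;
  }
}
```

When `isDead` returns true, the client can close the socket and start its reconnection logic instead of waiting for a TCP-level timeout.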
WebSocket Not Receiving Events
Symptoms:
- WebSocket connected but no events
- Events not arriving
- Event subscription issue
Common Causes:
- Not subscribed to events
- Namespace mismatch
- Backend not broadcasting
- Event filtering
- Room mismatch
Resolution Steps:
- Check subscription:
// Verify correct events are subscribed
// Check for subscription errors
- Check namespace:
# Verify correct namespace
# Check namespace exists
- Check backend:
# Verify backend is broadcasting
# Check backend logs
- Check filtering:
// Verify event filters are correct
// Check for filtering errors
Prevention:
- Add event monitoring
- Implement event acknowledgment
- Add event logging
Performance Issues
Backend Slow Response Times
Symptoms:
- API responses slow
- High latency
- Timeout errors
Common Causes:
- Database query slow
- External API slow
- CPU bottleneck
- Memory bottleneck
- Network latency
Resolution Steps:
- Check database queries:
# Check query performance
# Look for slow queries
- Check external APIs:
# Check exchange API latency
# Verify API performance
- Check system resources:
# Check CPU usage
# Check memory usage
- Check network:
# Check network latency
# Verify network bandwidth
Prevention:
- Add performance monitoring
- Implement caching
- Add performance alerts
Web Slow Load Times
Symptoms:
- Slow page load
- Large bundle size
- Slow initial render
Common Causes:
- Large bundle size
- Too many requests
- Slow API responses
- Unoptimized assets
- No caching
Resolution Steps:
- Check bundle size:
cd web
pnpm build
# Check bundle size
- Check network requests:
// Check network tab
// Look for large requests
- Check API responses:
# Check API response times
# Look for slow endpoints
- Optimize assets:
# Optimize images
# Minify assets
# Implement caching
Prevention:
- Add performance monitoring
- Implement code splitting
- Add caching
Deployment Issues
Docker Build Failing
Symptoms:
- Docker build fails
- Build timeout
- Dependency install fails
Common Causes:
- Invalid Dockerfile
- Missing dependencies
- Network issue
- Insufficient resources
- Build context too large
Resolution Steps:
- Check Dockerfile:
# Verify Dockerfile syntax
# Check for errors
- Check dependencies:
# Verify package.json is valid
# Check for missing dependencies
- Check network:
# Verify network connectivity
# Check registry access
- Check resources:
# Verify sufficient disk space
# Check memory availability
Prevention:
- Add Docker build validation
- Implement build caching
- Add build monitoring
Docker Container Won't Start
Symptoms:
- Container won't start
- Container exits immediately
- Container crashes
Common Causes:
- Invalid configuration
- Missing environment variables
- Port conflict
- Dependency unavailable
- Health check failing
Resolution Steps:
- Check container logs:
docker logs <container-id>
# Look for startup errors
- Check configuration:
# Verify environment variables
# Check configuration files
- Check ports:
# Verify port availability
# Check for conflicts
- Check dependencies:
# Verify dependencies are available
# Check service health
Prevention:
- Add container health checks
- Implement dependency checks
- Add startup validation
Escalation Procedures
When to Escalate
Escalate to on-call if:
- Production outage
- Data loss or corruption
- Security breach
- Critical bug affecting multiple users
- Issue cannot be resolved within 30 minutes
Escalation Steps
- Document the issue:
  - Describe the problem
  - Include error messages
  - Include logs
  - Include request IDs
- Attempt resolution:
  - Follow troubleshooting steps
  - Document attempted solutions
  - Note what worked/didn't work
- Escalate:
  - Contact on-call engineer
  - Provide documentation
  - Provide context
  - Provide urgency
- Follow up:
  - Monitor resolution
  - Document resolution
  - Update runbooks
  - Share learnings
On-Call Contact
- Primary: [On-call engineer]
- Secondary: [Backup engineer]
- Escalation: [Engineering manager]
Incident Response
- Detect: Monitoring alert or user report
- Acknowledge: Acknowledge alert within 5 minutes
- Investigate: Gather information, check logs
- Mitigate: Implement temporary fix if needed
- Resolve: Implement permanent fix
- Post-mortem: Document incident, learnings, improvements
Common Error Messages
Backend Errors
| Error | Cause | Resolution |
|---|---|---|
| `ECONNREFUSED` | Connection refused: nothing is listening on the target host/port | Check that the target service is running and the host/port are correct |
| `EADDRINUSE` | Port already in use | Check port, kill process |
| `ETIMEDOUT` | Connection timeout | Check network, increase timeout |
| `Unauthorized` | Invalid auth | Check credentials, token |
| `Forbidden` | Insufficient permissions | Check user role, permissions |
| `InternalServerError` | Server error | Check logs, fix bug |
Web Errors
| Error | Cause | Resolution |
|---|---|---|
| `Network Error` | Backend unreachable | Check backend, network |
| `401 Unauthorized` | Invalid token | Refresh token, re-auth |
| `403 Forbidden` | Insufficient permissions | Check user role |
| `404 Not Found` | Resource not found | Check URL, resource exists |
| `500 Internal Server Error` | Server error | Check logs, fix bug |
Mobile Errors
| Error | Cause | Resolution |
|---|---|---|
| `Network request failed` | Network issue | Check network, backend |
| `Auth failed` | Invalid credentials | Check credentials, re-auth |
| `WebSocket error` | Connection issue | Check backend, network |
| `Storage error` | Secure storage issue | Check storage permissions |
| `App crashed` | Runtime error | Check logs, fix bug |
Monitoring and Alerts
Key Metrics to Monitor
Backend:
- CPU usage
- Memory usage
- API response times
- Error rate
- Database query times
- WebSocket connection count
Web:
- Page load time
- Bundle size
- API response times
- Error rate
- WebSocket connection count
Mobile:
- App launch time
- API response times
- Error rate
- WebSocket connection count
- Crash rate
Alert Thresholds
| Metric | Warning | Critical |
|---|---|---|
| CPU usage | 70% | 90% |
| Memory usage | 70% | 90% |
| API response time | 1s | 5s |
| Error rate | 5% | 10% |
| Database query time | 500ms | 2s |
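The thresholds in the table above can be encoded as data so that alerting code and documentation stay in step. This is a sketch: the metric keys are hypothetical names, and treating each threshold as inclusive (a value equal to the threshold triggers the alert) is an assumption the table does not state.

```typescript
type Severity = "ok" | "warning" | "critical";

interface Threshold {
  warning: number;
  critical: number;
}

// Values copied from the table above. Units: percent for usage and error rate,
// milliseconds for response and query times.
const THRESHOLDS: Record<string, Threshold> = {
  cpuUsagePct: { warning: 70, critical: 90 },
  memoryUsagePct: { warning: 70, critical: 90 },
  apiResponseMs: { warning: 1000, critical: 5000 },
  errorRatePct: { warning: 5, critical: 10 },
  dbQueryMs: { warning: 500, critical: 2000 },
};

/** Maps a metric reading to a severity; thresholds are treated as inclusive. */
function classify(metric: string, value: number): Severity {
  const t = THRESHOLDS[metric];
  if (t === undefined) throw new Error(`Unknown metric: ${metric}`);
  if (value >= t.critical) return "critical";
  if (value >= t.warning) return "warning";
  return "ok";
}
```

For example, `classify("apiResponseMs", 1200)` falls between the 1s warning and 5s critical thresholds and yields `"warning"`.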
Monitoring Tools
- Backend: Winston logging, Prometheus metrics
- Web: Browser devtools, Lighthouse
- Mobile: Expo logs, Sentry
References
- README.md - Project overview
- ARCHITECTURE_DOCUMENTATION.md - System architecture
- API_DOCUMENTATION_GUIDE.md - API documentation
- FUNCTIONALITY_REVIEW.md - Functional gaps and issues
- CUTOVER_WEB.md - Web cutover procedures
- CUTOVER_MOBILE.md - Mobile cutover procedures