Created detailed documentation files to guide coding agents in implementing the fixes and improvements identified in the functionality review:

1. API_DOCUMENTATION_GUIDE.md
   - Complete API endpoint catalog
   - Authentication documentation
   - REST API endpoints (health, user, trading, orders, market data, research, backtesting, feature flags, admin)
   - WebSocket namespaces (/trading, /admin, /)
   - Error responses and error codes
   - Rate limiting recommendations
   - Request ID propagation
   - Deprecation policy
   - OpenAPI/Swagger generation guide
   - Documentation maintenance process
2. ARCHITECTURE_DOCUMENTATION.md
   - System overview and high-level architecture
   - Monorepo structure and directory layout
   - Backend architecture (component structure, data flow, trading loop)
   - Web architecture (component structure, data flow, UI architecture)
   - Mobile architecture (component structure, data flow)
   - Shared code architecture (shared modules, platform integration)
   - Data architecture (Cosmos DB containers, data flow)
   - Service boundaries (backend, platform-service, web, mobile responsibilities)
   - Integration points (backend ↔ platform-service, web ↔ backend, mobile ↔ backend, etc.)
   - Security architecture (authentication, authorization, tenant isolation)
   - Monitoring & observability (telemetry, logging)
   - Deployment architecture (Docker, environment variables)
   - Scalability considerations (horizontal scaling, WebSocket scaling)
3. TROUBLESHOOTING_GUIDE.md
   - Backend issues (won't start, health check failing, trading loop not running, orders not executing, reconciliation failures)
   - Web issues (won't load, authentication failing, WebSocket connection failing, data not updating)
   - Mobile issues (won't launch, authentication failing, WebSocket connection failing, data not updating)
   - Database issues (Cosmos DB connection failing, slow performance, data inconsistency)
   - Authentication issues (platform-service unreachable, JWT token invalid)
   - WebSocket issues (connection drops, not receiving events)
   - Performance issues (backend slow response times, web slow load times)
   - Deployment issues (Docker build failing, container won't start)
   - Escalation procedures (when to escalate, escalation steps, on-call contact, incident response)
   - Common error messages (backend, web, mobile)
   - Monitoring and alerts (key metrics, alert thresholds, monitoring tools)
4. ONBOARDING_GUIDE.md
   - Project overview (what is the trading dashboard, tech stack, key concepts)
   - Prerequisites (required software, required accounts, optional tools)
   - Development setup (clone repo, install dependencies, configure environment, verify setup)
   - Project structure (monorepo layout, key files)
   - Development workflow (starting development, making changes, commit message format, pushing changes)
   - Testing (backend tests, web tests, mobile tests, verification script)
   - Code review process (PR process, PR checklist, code review guidelines)
   - Common tasks (adding backend endpoint, adding web component, adding mobile screen, adding feature flag)
   - Resources (documentation, external documentation, tools)
   - Getting help (internal resources, common issues, asking questions)
   - Best practices
   - Next steps (first week, first month, ongoing)

These documentation files provide detailed guidance for coding agents to implement the fixes and improvements identified in FUNCTIONALITY_REVIEW.md without needing to understand the entire codebase from scratch.
# Troubleshooting Guide

**Purpose:** Comprehensive troubleshooting guide for the trading dashboard, covering common issues, resolution steps, and escalation procedures.

**Target Audience:** Developers, operators, support engineers.

---

## Table of Contents

- [Backend Issues](#backend-issues)
- [Web Issues](#web-issues)
- [Mobile Issues](#mobile-issues)
- [Database Issues](#database-issues)
- [Authentication Issues](#authentication-issues)
- [WebSocket Issues](#websocket-issues)
- [Performance Issues](#performance-issues)
- [Deployment Issues](#deployment-issues)
- [Escalation Procedures](#escalation-procedures)
- [Common Error Messages](#common-error-messages)
- [Monitoring and Alerts](#monitoring-and-alerts)
- [References](#references)

---

## Backend Issues

### Backend Won't Start

**Symptoms:**
- Backend fails to start
- Error message on startup
- Process exits immediately

**Common Causes:**
1. Missing environment variables
2. Invalid configuration
3. Database connection failure
4. Port already in use

**Resolution Steps:**

1. **Check environment variables:**
   ```bash
   cd backend
   cat .env
   # Ensure all required variables are set
   ```

2. **Verify configuration:**
   ```bash
   cd backend
   npm run dev
   # Check error messages for specific issues
   ```

3. **Check database connection:**
   ```bash
   # Test that the Cosmos DB endpoint is reachable
   curl -I "$COSMOS_ENDPOINT"
   # Any HTTP response means the endpoint is reachable; a timeout or DNS error means it is not
   ```

4. **Check port availability:**
   ```bash
   lsof -i :4018
   # Kill process using port 4018 if needed
   kill -9 <PID>
   ```

5. **Check logs:**
   ```bash
   # Backend logs should show startup errors
   # Look for specific error messages
   ```

**Prevention:**
- Use `.env.example` as template
- Validate environment variables at startup
- Add configuration validation script

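
The "validate environment variables at startup" item above could be sketched roughly as follows. This is a minimal illustration, not the backend's actual code: `findMissingEnv`, `assertEnv`, and the `REQUIRED_ENV` list are hypothetical, and the variable names are only the ones this guide mentions elsewhere.

```javascript
// Hypothetical list: these names appear in this guide, but the real
// backend may require a different set of variables.
const REQUIRED_ENV = ["COSMOS_ENDPOINT", "COSMOS_KEY", "PLATFORM_API_URL"];

// Returns the names of required variables that are unset or blank.
function findMissingEnv(env, required = REQUIRED_ENV) {
  return required.filter((name) => {
    const value = env[name];
    return typeof value !== "string" || value.trim() === "";
  });
}

// Throws a single error listing every missing variable, so a
// misconfigured backend fails fast with one readable message.
function assertEnv(env, required = REQUIRED_ENV) {
  const missing = findMissingEnv(env, required);
  if (missing.length > 0) {
    throw new Error(
      `Missing required environment variables: ${missing.join(", ")}`
    );
  }
}
```

Calling `assertEnv(process.env)` as the first statement of the backend entry point would surface a missing variable before any database or network connection is attempted.
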
---

### Backend Health Check Failing

**Symptoms:**
- `GET /health/live` returns non-200 status
- Health check endpoint unreachable
- Docker healthcheck failing

**Common Causes:**
1. Backend not running
2. Health check endpoint not implemented
3. Database connection issues
4. Dependencies unhealthy

**Resolution Steps:**

1. **Check if backend is running:**
   ```bash
   curl http://localhost:4018/health/live
   ```

2. **Check backend logs:**
   ```bash
   # Look for health check errors
   # Check for database connection errors
   ```

3. **Check dependencies:**
   ```bash
   curl http://localhost:4003/health
   # Check platform-service health
   ```

4. **Check database:**
   ```bash
   # Verify Cosmos DB is accessible
   # Check database container exists
   ```

**Prevention:**
- Implement comprehensive health checks
- Add dependency health checks
- Monitor health check failures

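
One way to approach "dependency health checks" is to collect one result per dependency and fold them into a single report. The sketch below assumes a hypothetical `summarizeHealth` helper and result shape; neither comes from the backend. Dependency-aware checks usually fit a readiness-style endpoint better than `/health/live`, which is normally kept independent of downstream services.

```javascript
// Folds per-dependency results into one health report.
// `checks` maps a dependency name to { ok: boolean } (hypothetical shape).
function summarizeHealth(checks) {
  const failing = Object.entries(checks)
    .filter(([, result]) => !result.ok)
    .map(([name]) => name);

  const healthy = failing.length === 0;
  return {
    status: healthy ? "ok" : "unhealthy",
    // 503 tells load balancers and Docker healthchecks to treat the
    // instance as unavailable.
    httpStatus: healthy ? 200 : 503,
    failing,
  };
}
```

Returning the list of failing dependencies in the response body makes it possible to tell from a single `curl` whether the problem is Cosmos DB, platform-service, or the backend itself.
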
---

### Trading Loop Not Running

**Symptoms:**
- No trades executing
- Bot state shows "idle"
- Positions not updating

**Common Causes:**
1. Trading disabled globally
2. Profile disabled
3. Market closed
4. Configuration error
5. Exchange API issue

**Resolution Steps:**

1. **Check trading control state:**
   ```bash
   curl -H "Authorization: Bearer <token>" "http://localhost:4018/api/trading/control"
   # Check if globalTradingEnabled is true
   ```

2. **Check profile status:**
   ```bash
   # Check if specific profile is enabled
   # Look for profile-level disable
   ```

3. **Check market hours:**
   ```bash
   # Verify SessionRule is not blocking
   # Check if market is open
   ```

4. **Check exchange API:**
   ```bash
   # Verify exchange API credentials
   # Test exchange API connectivity
   ```

5. **Check backend logs:**
   ```bash
   # Look for trading loop errors
   # Check for strategy rule failures
   ```

**Prevention:**
- Add trading loop status monitoring
- Add exchange API health checks
- Implement trading loop alerts

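
The checks above can be thought of as a chain of gates evaluated in order. A minimal sketch, assuming a hypothetical state object: only `globalTradingEnabled` is named in this guide, while `profileEnabled`, `marketOpen`, and `exchangeReachable` are illustrative field names.

```javascript
// Returns the first reason the trading loop would stay idle,
// or null when nothing is blocking it. The order mirrors the
// resolution steps above.
function tradingBlockReason(state) {
  if (!state.globalTradingEnabled) return "trading disabled globally";
  if (!state.profileEnabled) return "profile disabled";
  if (!state.marketOpen) return "market closed";
  if (!state.exchangeReachable) return "exchange API unreachable";
  return null;
}
```

Surfacing this reason in the bot state alongside "idle" would make it visible in the dashboard why no trades are executing.
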
---

### Orders Not Executing

**Symptoms:**
- Orders submitted but not executed
- Orders stuck in "pending" state
- Exchange API errors

**Common Causes:**
1. Exchange API issue
2. Insufficient capital
3. Risk management blocking
4. Exchange maintenance
5. Invalid order parameters

**Resolution Steps:**

1. **Check exchange API:**
   ```bash
   # Test exchange API connectivity
   # Verify API credentials
   # Check exchange status page
   ```

2. **Check capital:**
   ```bash
   # Verify sufficient capital available
   # Check capital ledger
   ```

3. **Check risk engine:**
   ```bash
   # Check if risk management is blocking
   # Look for risk rule failures
   ```

4. **Check order parameters:**
   ```bash
   # Verify order parameters are valid
   # Check symbol, quantity, price
   ```

5. **Check backend logs:**
   ```bash
   # Look for order execution errors
   # Check for exchange API errors
   ```

**Prevention:**
- Add order execution monitoring
- Add exchange API health checks
- Implement order execution alerts

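
The "check symbol, quantity, price" step can be automated before an order ever reaches the exchange. The following is a sketch under assumptions: the order shape (`symbol`, `quantity`, optional `price`) and the `validateOrder` name are hypothetical and may not match the backend's order model.

```javascript
// Returns a list of human-readable problems; an empty list means the
// order passed these basic checks.
function validateOrder(order) {
  const errors = [];

  if (typeof order.symbol !== "string" || order.symbol.trim() === "") {
    errors.push("symbol is required");
  }
  if (!Number.isFinite(order.quantity) || order.quantity <= 0) {
    errors.push("quantity must be a positive number");
  }
  // Assumption: market orders may omit a price, so it is only checked
  // when present.
  if (
    order.price !== undefined &&
    (!Number.isFinite(order.price) || order.price <= 0)
  ) {
    errors.push("price must be a positive number when provided");
  }

  return errors;
}
```

Logging the returned list together with the request ID makes "invalid order parameters" distinguishable from exchange-side rejections when reading backend logs.
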
---

### Reconciliation Failures

**Symptoms:**
- Reconciliation service failing
- Data inconsistencies
- Reconciliation errors in logs

**Common Causes:**
1. Exchange API issue
2. Database connection issue
3. Data format mismatch
4. Missing data
5. Logic error

**Resolution Steps:**

1. **Check reconciliation logs:**
   ```bash
   # Look for reconciliation errors
   # Check for specific failure reasons
   ```

2. **Run reconciliation manually:**
   ```bash
   cd backend
   npm run reconcile:lifecycle-history
   # Run specific reconciliation script
   ```

3. **Check exchange data:**
   ```bash
   # Verify exchange API is returning correct data
   # Check for data format changes
   ```

4. **Check database:**
   ```bash
   # Verify database is accessible
   # Check for data corruption
   ```

**Prevention:**
- Add reconciliation monitoring
- Add reconciliation alerts
- Implement reconciliation retry logic

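
Retry logic for transient failures (exchange API hiccups, brief database outages) is commonly built on exponential backoff. The sketch below is generic and not taken from the reconciliation service; `backoffDelayMs`, `retryWithBackoff`, and all default values are assumptions chosen for illustration.

```javascript
// Delay before retry number `attempt` (0-based): doubles each time,
// capped at maxMs.
function backoffDelayMs(attempt, baseMs = 500, maxMs = 30000) {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `operation` until it succeeds or maxAttempts is reached,
// sleeping with exponential backoff between failures.
// `sleep` is injectable so tests do not have to wait in real time.
async function retryWithBackoff(operation, options = {}) {
  const {
    maxAttempts = 5,
    baseMs = 500,
    maxMs = 30000,
    sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms)),
  } = options;

  let lastError;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation(attempt);
    } catch (error) {
      lastError = error;
      // No point sleeping after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt, baseMs, maxMs));
      }
    }
  }
  throw lastError;
}
```

Retries only help with transient errors. A data format mismatch or logic error fails the same way on every attempt, so those cases should be alerted on rather than retried.
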
---

## Web Issues

### Web Won't Load

**Symptoms:**
- Blank screen
- Loading spinner indefinitely
- Browser console errors

**Common Causes:**
1. Backend API unreachable
2. Authentication failure
3. Build error
4. JavaScript error
5. Network issue

**Resolution Steps:**

1. **Check browser console:**
   ```javascript
   // Open browser devtools
   // Check for JavaScript errors
   // Check for network errors
   ```

2. **Check backend API:**
   ```bash
   curl http://localhost:4018/health/live
   # Verify backend is running
   ```

3. **Check authentication:**
   ```bash
   # Verify auth token is valid
   # Check platform-service is reachable
   ```

4. **Check build:**
   ```bash
   cd web
   pnpm build
   # Verify build succeeds
   ```

5. **Check network:**
   ```bash
   # Check network connectivity
   # Verify CORS settings
   ```

**Prevention:**
- Add error boundaries
- Add loading states
- Implement graceful degradation

---

### Authentication Failing

**Symptoms:**
- Login fails
- Session expires immediately
- Auth token invalid
- 401/403 errors

**Common Causes:**
1. Platform-service unreachable
2. Invalid credentials
3. Token expired
4. Platform-service down
5. Configuration error

**Resolution Steps:**

1. **Check platform-service:**
   ```bash
   curl http://localhost:4003/health
   # Verify platform-service is running
   ```

2. **Check credentials:**
   ```bash
   # Verify username/password are correct
   # Check platform-service user exists
   ```

3. **Check token:**
   ```bash
   # Verify JWT token is valid
   # Check token expiration
   ```

4. **Check configuration:**
   ```bash
   cat web/.env.local
   # Verify VITE_PLATFORM_URL is correct
   # Verify auth configuration
   ```

5. **Check logs:**
   ```bash
   # Check platform-service logs
   # Check backend auth logs
   ```

**Prevention:**
- Add auth error handling
- Implement token refresh
- Add auth monitoring

---

### WebSocket Connection Failing

**Symptoms:**
- WebSocket won't connect
- Connection drops repeatedly
- No real-time updates
- WebSocket errors in console

**Common Causes:**
1. Backend unreachable
2. Auth token invalid
3. WebSocket blocked by firewall
4. Namespace mismatch
5. Connection limit exceeded

**Resolution Steps:**

1. **Check backend WebSocket:**
   ```bash
   # Verify backend is running
   # Check WebSocket port is accessible
   ```

2. **Check auth token:**
   ```bash
   # Verify JWT token is valid
   # Check token is included in connection
   ```

3. **Check firewall:**
   ```bash
   # Verify WebSocket port is not blocked
   # Check network settings
   ```

4. **Check namespace:**
   ```bash
   # Verify correct namespace (/trading)
   # Check namespace exists in backend
   ```

5. **Check logs:**
   ```bash
   # Check backend WebSocket logs
   # Look for connection errors
   ```

**Prevention:**
- Add WebSocket error handling
- Implement reconnection logic
- Add WebSocket monitoring

---

### Data Not Updating

**Symptoms:**
- Data stale
- No real-time updates
- Manual refresh required
- WebSocket connected but no updates

**Common Causes:**
1. WebSocket not receiving events
2. Backend not broadcasting
3. Event subscription issue
4. Data cache issue
5. Filter/mask issue

**Resolution Steps:**

1. **Check WebSocket connection:**
   ```javascript
   // Check WebSocket is connected
   // Check for connection errors
   ```

2. **Check backend broadcasting:**
   ```bash
   # Check backend logs for broadcast events
   # Verify events are being sent
   ```

3. **Check event subscription:**
   ```javascript
   // Verify correct events are subscribed
   // Check for subscription errors
   ```

4. **Check data cache:**
   ```javascript
   // Clear cache if needed
   // Verify cache is not stale
   ```

5. **Check filters:**
   ```javascript
   // Verify data filters are correct
   // Check for masking issues
   ```

**Prevention:**
- Add data staleness monitoring
- Implement cache invalidation
- Add data update alerts

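
Staleness monitoring can be as simple as recording when each feed last delivered an update and comparing that to the current time. This is an illustrative sketch: the feed names, the 15-second default, and the `isStale` / `staleFeeds` helpers are assumptions, not part of the web client.

```javascript
// True when the last update is older than maxAgeMs.
function isStale(lastUpdateMs, nowMs, maxAgeMs = 15000) {
  return nowMs - lastUpdateMs > maxAgeMs;
}

// Given a map of feed name -> last update timestamp (ms), returns the
// names of feeds that have gone quiet for longer than maxAgeMs.
function staleFeeds(lastUpdates, nowMs, maxAgeMs = 15000) {
  return Object.entries(lastUpdates)
    .filter(([, lastUpdateMs]) => isStale(lastUpdateMs, nowMs, maxAgeMs))
    .map(([feed]) => feed);
}
```

A client could call `staleFeeds(lastUpdates, Date.now())` on a timer and show a "data may be out of date" banner, or trigger a manual refetch, for any feed it returns. This catches the "WebSocket connected but no updates" case that a connection-status check alone misses.
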
---

## Mobile Issues

### Mobile App Won't Launch

**Symptoms:**
- App crashes on launch
- Blank screen
- Loading spinner indefinitely
- Expo Go won't load

**Common Causes:**
1. Build error
2. Configuration error
3. Platform-service unreachable
4. Backend unreachable
5. Expo Go version mismatch

**Resolution Steps:**

1. **Check Expo logs:**
   ```bash
   # Check Expo dev server logs
   # Look for build errors
   ```

2. **Check configuration:**
   ```bash
   cat mobile/.env.local
   # Verify EXPO_PUBLIC_* variables are set
   ```

3. **Check platform-service:**
   ```bash
   curl http://localhost:4003/health
   # Verify platform-service is running
   ```

4. **Check backend:**
   ```bash
   curl http://localhost:4018/health/live
   # Verify backend is running
   ```

5. **Check Expo Go:**
   ```bash
   # Verify Expo Go is latest version
   # Try clearing Expo Go cache
   ```

**Prevention:**
- Add error boundaries
- Add launch error handling
- Implement graceful degradation

---

### Mobile Authentication Failing

**Symptoms:**
- Login fails
- Session won't restore
- Auth token invalid
- Secure storage issue

**Common Causes:**
1. Platform-service unreachable
2. Invalid credentials
3. Token expired
4. Secure storage error
5. Platform-service down

**Resolution Steps:**

1. **Check platform-service:**
   ```bash
   curl http://localhost:4003/health
   # Verify platform-service is running
   ```

2. **Check credentials:**
   ```bash
   # Verify credentials are correct
   # Check platform-service user exists
   ```

3. **Check token:**
   ```bash
   # Verify JWT token is valid
   # Check token expiration
   ```

4. **Check secure storage:**
   ```bash
   # Verify secure storage is working
   # Check for storage errors
   ```

5. **Check logs:**
   ```bash
   # Check mobile logs
   # Look for auth errors
   ```

**Prevention:**
- Add auth error handling
- Implement token refresh
- Add secure storage error handling

---

### Mobile WebSocket Connection Failing

**Symptoms:**
- WebSocket won't connect
- Connection drops repeatedly
- No real-time updates
- WebSocket errors in logs

**Common Causes:**
1. Backend unreachable
2. Auth token invalid
3. WebSocket blocked by network
4. Namespace mismatch
5. Connection limit exceeded

**Resolution Steps:**

1. **Check backend WebSocket:**
   ```bash
   # Verify backend is running
   # Check WebSocket port is accessible
   ```

2. **Check auth token:**
   ```bash
   # Verify JWT token is valid
   # Check token is included in connection
   ```

3. **Check network:**
   ```bash
   # Verify network allows WebSocket
   # Check mobile network settings
   ```

4. **Check namespace:**
   ```bash
   # Verify correct namespace (/trading)
   # Check namespace exists in backend
   ```

5. **Check logs:**
   ```bash
   # Check mobile logs
   # Look for connection errors
   ```

**Prevention:**
- Add WebSocket error handling
- Implement reconnection logic
- Add polling fallback

---

### Mobile Data Not Updating

**Symptoms:**
- Data stale
- No real-time updates
- Manual refresh required
- WebSocket connected but no updates

**Common Causes:**
1. WebSocket not receiving events
2. Backend not broadcasting
3. Event subscription issue
4. Data cache issue
5. Polling fallback not working

**Resolution Steps:**

1. **Check WebSocket connection:**
   ```javascript
   // Check WebSocket is connected
   // Check for connection errors
   ```

2. **Check backend broadcasting:**
   ```bash
   # Check backend logs for broadcast events
   # Verify events are being sent
   ```

3. **Check polling fallback:**
   ```javascript
   // Verify polling is working
   // Check polling interval
   ```

4. **Check data cache:**
   ```javascript
   // Clear cache if needed
   // Verify cache is not stale
   ```

5. **Check logs:**
   ```bash
   # Check mobile logs
   # Look for update errors
   ```

**Prevention:**
- Add data staleness monitoring
- Implement polling fallback
- Add data update alerts

---

## Database Issues

### Cosmos DB Connection Failing

**Symptoms:**
- Backend can't connect to Cosmos DB
- Connection timeout
- Authentication error
- Database not found

**Common Causes:**
1. Invalid credentials
2. Network issue
3. Cosmos DB down
4. Container not found
5. Firewall blocking

**Resolution Steps:**

1. **Check credentials:**
   ```bash
   cat backend/.env
   # Verify COSMOS_ENDPOINT and COSMOS_KEY are correct
   ```

2. **Test connection:**
   ```bash
   curl -I "$COSMOS_ENDPOINT"
   # Verify Cosmos DB is accessible
   ```

3. **Check database:**
   ```bash
   # Verify database exists
   # Check container exists
   ```

4. **Check network:**
   ```bash
   # Verify network allows connection
   # Check firewall settings
   ```

5. **Check logs:**
   ```bash
   # Check backend logs for Cosmos errors
   # Look for connection errors
   ```

**Prevention:**
- Add connection retry logic
- Implement connection pooling
- Add connection monitoring

---

### Cosmos DB Slow Performance

**Symptoms:**
- Slow query response times
- Timeout errors
- Performance degradation

**Common Causes:**
1. Large result sets
2. Missing indexes
3. High RU consumption
4. Network latency
5. Database throttling

**Resolution Steps:**

1. **Check query performance:**
   ```bash
   # Check query execution time
   # Look for slow queries
   ```

2. **Check indexes:**
   ```bash
   # Verify indexes exist
   # Check index usage
   ```

3. **Check RU consumption:**
   ```bash
   # Monitor RU consumption
   # Check for throttling
   ```

4. **Optimize queries:**
   ```bash
   # Add query filters
   # Use pagination
   # Reduce result set size
   ```

**Prevention:**
- Add query performance monitoring
- Implement query optimization
- Add performance alerts

---

### Data Inconsistency

**Symptoms:**
- Data mismatch between surfaces
- Stale data
- Missing data
- Duplicate data

**Common Causes:**
1. Reconciliation failure
2. Cache issue
3. Race condition
4. Data corruption
5. Sync issue

**Resolution Steps:**

1. **Run reconciliation:**
   ```bash
   cd backend
   npm run reconcile:lifecycle-history
   # Run reconciliation script
   ```

2. **Check cache:**
   ```bash
   # Clear cache if needed
   # Verify cache is not stale
   ```

3. **Check sync:**
   ```bash
   # Verify sync is working
   # Check for sync errors
   ```

4. **Check data integrity:**
   ```bash
   # Verify data integrity
   # Check for corruption
   ```

**Prevention:**
- Add data consistency monitoring
- Implement reconciliation
- Add data integrity checks

---

## Authentication Issues

### Platform-Service Unreachable

**Symptoms:**
- Can't authenticate
- Token validation fails
- Platform-service health check fails

**Common Causes:**
1. Platform-service down
2. Network issue
3. Wrong URL
4. Firewall blocking
5. DNS issue

**Resolution Steps:**

1. **Check platform-service:**
   ```bash
   curl http://localhost:4003/health
   # Verify platform-service is running
   ```

2. **Check configuration:**
   ```bash
   cat backend/.env
   # Verify PLATFORM_API_URL is correct
   cat web/.env.local
   # Verify VITE_PLATFORM_URL is correct
   ```

3. **Check network:**
   ```bash
   # Verify network allows connection
   # Check firewall settings
   ```

4. **Check DNS:**
   ```bash
   # Verify DNS resolution
   # Check host file
   ```

**Prevention:**
- Add platform-service monitoring
- Implement fallback auth
- Add auth monitoring

---

### JWT Token Invalid

**Symptoms:**
- 401 errors
- Token validation fails
- Session expires immediately

**Common Causes:**
1. Token expired
2. Token malformed
3. Wrong signing key
4. Token revoked
5. Clock skew

**Resolution Steps:**

1. **Check token expiration:**
   ```bash
   # Decode JWT token
   # Check expiration time
   ```

2. **Check token format:**
   ```bash
   # Verify JWT format is correct
   # Check for token errors
   ```

3. **Check signing key:**
   ```bash
   # Verify signing key is correct
   # Check key rotation
   ```

4. **Check clock skew:**
   ```bash
   # Verify system time is correct
   # Check NTP sync
   ```

**Prevention:**
- Implement token refresh
- Add token validation
- Add token monitoring

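
For the "decode JWT token, check expiration time" step, the payload of a JWT is base64url-encoded JSON and can be inspected without the signing key. The sketch below is for diagnosis only: it does not verify the signature, so it must never be used to decide whether a token is trustworthy. The helper names and the 60-second skew allowance are assumptions; `atob` is a standard global in browsers and in Node.js 16 and later.

```javascript
// Decodes the payload (middle segment) of a JWT for inspection.
// Does NOT verify the signature.
function decodeJwtPayload(token) {
  const parts = token.split(".");
  if (parts.length !== 3) {
    throw new Error("malformed JWT: expected 3 dot-separated segments");
  }
  // Convert base64url to standard base64 and restore padding.
  const base64 = parts[1].replace(/-/g, "+").replace(/_/g, "/");
  const padded = base64 + "=".repeat((4 - (base64.length % 4)) % 4);
  return JSON.parse(atob(padded));
}

// True when the `exp` claim (seconds since epoch) has passed, allowing
// for a small clock-skew tolerance between machines.
function isExpired(payload, nowSeconds, skewSeconds = 60) {
  return (
    typeof payload.exp === "number" &&
    payload.exp + skewSeconds <= nowSeconds
  );
}
```

Comparing `payload.exp` and `payload.iat` against `Math.floor(Date.now() / 1000)` on both the client and the server quickly shows whether a "session expires immediately" symptom comes from a short token lifetime or from clock skew.
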
---

## WebSocket Issues

### WebSocket Connection Drops

**Symptoms:**
- Connection drops repeatedly
- Reconnection fails
- Connection timeout

**Common Causes:**
1. Network instability
2. Backend restart
3. Auth token expired
4. Connection limit
5. Firewall timeout

**Resolution Steps:**

1. **Check network:**
   ```bash
   # Verify network stability
   # Check for packet loss
   ```

2. **Check backend:**
   ```bash
   # Verify backend is not restarting
   # Check backend logs
   ```

3. **Check auth token:**
   ```bash
   # Verify JWT token is valid
   # Check token expiration
   ```

4. **Check connection limit:**
   ```bash
   # Verify connection limit not exceeded
   # Check for connection leaks
   ```

**Prevention:**
- Implement reconnection logic
- Add connection monitoring
- Implement heartbeat

---

### WebSocket Not Receiving Events

**Symptoms:**
- WebSocket connected but no events
- Events not arriving
- Event subscription issue

**Common Causes:**
1. Not subscribed to events
2. Namespace mismatch
3. Backend not broadcasting
4. Event filtering
5. Room mismatch

**Resolution Steps:**

1. **Check subscription:**
   ```javascript
   // Verify correct events are subscribed
   // Check for subscription errors
   ```

2. **Check namespace:**
   ```bash
   # Verify correct namespace
   # Check namespace exists
   ```

3. **Check backend:**
   ```bash
   # Verify backend is broadcasting
   # Check backend logs
   ```

4. **Check filtering:**
   ```javascript
   // Verify event filters are correct
   // Check for filtering errors
   ```

**Prevention:**
- Add event monitoring
- Implement event acknowledgment
- Add event logging

---

## Performance Issues

### Backend Slow Response Times

**Symptoms:**
- API responses slow
- High latency
- Timeout errors

**Common Causes:**
1. Database query slow
2. External API slow
3. CPU bottleneck
4. Memory bottleneck
5. Network latency

**Resolution Steps:**

1. **Check database queries:**
   ```bash
   # Check query performance
   # Look for slow queries
   ```

2. **Check external APIs:**
   ```bash
   # Check exchange API latency
   # Verify API performance
   ```

3. **Check system resources:**
   ```bash
   # Check CPU usage
   # Check memory usage
   ```

4. **Check network:**
   ```bash
   # Check network latency
   # Verify network bandwidth
   ```

**Prevention:**
- Add performance monitoring
- Implement caching
- Add performance alerts

---

### Web Slow Load Times

**Symptoms:**
- Slow page load
- Large bundle size
- Slow initial render

**Common Causes:**
1. Large bundle size
2. Too many requests
3. Slow API responses
4. Unoptimized assets
5. No caching

**Resolution Steps:**

1. **Check bundle size:**
   ```bash
   cd web
   pnpm build
   # Check bundle size
   ```

2. **Check network requests:**
   ```javascript
   // Check network tab
   // Look for large requests
   ```

3. **Check API responses:**
   ```bash
   # Check API response times
   # Look for slow endpoints
   ```

4. **Optimize assets:**
   ```bash
   # Optimize images
   # Minify assets
   # Implement caching
   ```

**Prevention:**
- Add performance monitoring
- Implement code splitting
- Add caching

---

## Deployment Issues

### Docker Build Failing

**Symptoms:**
- Docker build fails
- Build timeout
- Dependency install fails

**Common Causes:**
1. Invalid Dockerfile
2. Missing dependencies
3. Network issue
4. Insufficient resources
5. Build context too large

**Resolution Steps:**

1. **Check Dockerfile:**
   ```bash
   # Verify Dockerfile syntax
   # Check for errors
   ```

2. **Check dependencies:**
   ```bash
   # Verify package.json is valid
   # Check for missing dependencies
   ```

3. **Check network:**
   ```bash
   # Verify network connectivity
   # Check registry access
   ```

4. **Check resources:**
   ```bash
   # Verify sufficient disk space
   # Check memory availability
   ```

**Prevention:**
- Add Docker build validation
- Implement build caching
- Add build monitoring

---

### Docker Container Won't Start

**Symptoms:**
- Container won't start
- Container exits immediately
- Container crashes

**Common Causes:**
1. Invalid configuration
2. Missing environment variables
3. Port conflict
4. Dependency unavailable
5. Health check failing

**Resolution Steps:**

1. **Check container logs:**
   ```bash
   docker logs <container-id>
   # Look for startup errors
   ```

2. **Check configuration:**
   ```bash
   # Verify environment variables
   # Check configuration files
   ```

3. **Check ports:**
   ```bash
   # Verify port availability
   # Check for conflicts
   ```

4. **Check dependencies:**
   ```bash
   # Verify dependencies are available
   # Check service health
   ```

**Prevention:**
- Add container health checks
- Implement dependency checks
- Add startup validation

---

## Escalation Procedures

### When to Escalate

Escalate to the on-call engineer if any of the following applies:
- Production outage
- Data loss or corruption
- Security breach
- Critical bug affecting multiple users
- Issue cannot be resolved within 30 minutes

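
The escalation criteria above are simple enough to encode, for example in an alerting hook or an incident form. This sketch is illustrative only: the incident field names and the `escalationReasons` helper are hypothetical, and treating exactly 30 minutes as the cutoff is an assumption about how "within 30 minutes" should be read.

```javascript
// Returns every escalation criterion the incident meets.
// An empty list means escalation is not yet required.
function escalationReasons(incident) {
  const reasons = [];
  if (incident.productionOutage) reasons.push("production outage");
  if (incident.dataLossOrCorruption) reasons.push("data loss or corruption");
  if (incident.securityBreach) reasons.push("security breach");
  if (incident.criticalBug && incident.affectedUsers > 1) {
    reasons.push("critical bug affecting multiple users");
  }
  // Assumption: 30 minutes or more without a fix counts as unresolved.
  if (incident.minutesUnresolved >= 30) {
    reasons.push("unresolved after 30 minutes");
  }
  return reasons;
}
```

Returning the matched reasons, rather than a bare boolean, gives the on-call engineer the urgency context that the escalation steps ask for.
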
### Escalation Steps

1. **Document the issue:**
   - Describe the problem
   - Include error messages
   - Include logs
   - Include request IDs

2. **Attempt resolution:**
   - Follow troubleshooting steps
   - Document attempted solutions
   - Note what worked/didn't work

3. **Escalate:**
   - Contact on-call engineer
   - Provide documentation
   - Provide context
   - Provide urgency

4. **Follow up:**
   - Monitor resolution
   - Document resolution
   - Update runbooks
   - Share learnings

### On-Call Contact

- **Primary:** [On-call engineer]
- **Secondary:** [Backup engineer]
- **Escalation:** [Engineering manager]

### Incident Response

1. **Detect:** Monitoring alert or user report
2. **Acknowledge:** Acknowledge alert within 5 minutes
3. **Investigate:** Gather information, check logs
4. **Mitigate:** Implement temporary fix if needed
5. **Resolve:** Implement permanent fix
6. **Post-mortem:** Document incident, learnings, improvements

---

## Common Error Messages

### Backend Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `EADDRINUSE` | Port already in use | Check port, kill process |
| `ECONNREFUSED` | Nothing is listening at the target host/port (dependency down or wrong URL) | Check the dependency is running, verify URL and port |
| `ETIMEDOUT` | Connection timeout | Check network, increase timeout |
| `Unauthorized` | Invalid auth | Check credentials, token |
| `Forbidden` | Insufficient permissions | Check user role, permissions |
| `InternalServerError` | Server error | Check logs, fix bug |

### Web Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `Network Error` | Backend unreachable | Check backend, network |
| `401 Unauthorized` | Invalid token | Refresh token, re-auth |
| `403 Forbidden` | Insufficient permissions | Check user role |
| `404 Not Found` | Resource not found | Check URL, resource exists |
| `500 Internal Server Error` | Server error | Check logs, fix bug |

### Mobile Errors

| Error | Cause | Resolution |
|-------|-------|------------|
| `Network request failed` | Network issue | Check network, backend |
| `Auth failed` | Invalid credentials | Check credentials, re-auth |
| `WebSocket error` | Connection issue | Check backend, network |
| `Storage error` | Secure storage issue | Check storage permissions |
| `App crashed` | Runtime error | Check logs, fix bug |

---

## Monitoring and Alerts

### Key Metrics to Monitor

**Backend:**
- CPU usage
- Memory usage
- API response times
- Error rate
- Database query times
- WebSocket connection count

**Web:**
- Page load time
- Bundle size
- API response times
- Error rate
- WebSocket connection count

**Mobile:**
- App launch time
- API response times
- Error rate
- WebSocket connection count
- Crash rate

### Alert Thresholds

| Metric | Warning | Critical |
|--------|---------|----------|
| CPU usage | 70% | 90% |
| Memory usage | 70% | 90% |
| API response time | 1s | 5s |
| Error rate | 5% | 10% |
| Database query time | 500ms | 2s |

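
The thresholds in the table above translate directly into a lookup. In this sketch the numbers come from the table, while the metric key names, the `classifyMetric` helper, and the choice to treat a value exactly at a threshold as having crossed it are assumptions.

```javascript
// Thresholds from the Alert Thresholds table, normalized to
// percent and milliseconds. Key names are illustrative.
const ALERT_THRESHOLDS = {
  cpuPercent: { warning: 70, critical: 90 },
  memoryPercent: { warning: 70, critical: 90 },
  apiResponseMs: { warning: 1000, critical: 5000 },
  errorRatePercent: { warning: 5, critical: 10 },
  dbQueryMs: { warning: 500, critical: 2000 },
};

// Classifies a metric reading as "ok", "warning", or "critical".
function classifyMetric(name, value) {
  const limits = ALERT_THRESHOLDS[name];
  if (!limits) {
    throw new Error(`Unknown metric: ${name}`);
  }
  if (value >= limits.critical) return "critical";
  if (value >= limits.warning) return "warning";
  return "ok";
}
```

Keeping the thresholds in one data structure means the table and the alerting code can be reviewed side by side when a limit changes.
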
### Monitoring Tools

- Backend: Winston logging, Prometheus metrics
- Web: Browser devtools, Lighthouse
- Mobile: Expo logs, Sentry

---

## References

- **README.md** - Project overview
- **ARCHITECTURE_DOCUMENTATION.md** - System architecture
- **API_DOCUMENTATION_GUIDE.md** - API documentation
- **FUNCTIONALITY_REVIEW.md** - Functional gaps and issues
- **CUTOVER_WEB.md** - Web cutover procedures
- **CUTOVER_MOBILE.md** - Mobile cutover procedures