learning_ai_invt_trdg/docs/planned/TROUBLESHOOTING_GUIDE.md
Saravana Achu Mac 83bc3af260 docs: add comprehensive documentation for coding agents
Created detailed documentation files to guide coding agents in implementing
the fixes and improvements identified in the functionality review:

1. API_DOCUMENTATION_GUIDE.md
   - Complete API endpoint catalog
   - Authentication documentation
   - REST API endpoints (health, user, trading, orders, market data, research, backtesting, feature flags, admin)
   - WebSocket namespaces (/trading, /admin, /)
   - Error responses and error codes
   - Rate limiting recommendations
   - Request ID propagation
   - Deprecation policy
   - OpenAPI/Swagger generation guide
   - Documentation maintenance process

2. ARCHITECTURE_DOCUMENTATION.md
   - System overview and high-level architecture
   - Monorepo structure and directory layout
   - Backend architecture (component structure, data flow, trading loop)
   - Web architecture (component structure, data flow, UI architecture)
   - Mobile architecture (component structure, data flow)
   - Shared code architecture (shared modules, platform integration)
   - Data architecture (Cosmos DB containers, data flow)
   - Service boundaries (backend, platform-service, web, mobile responsibilities)
   - Integration points (backend ↔ platform-service, web ↔ backend, mobile ↔ backend, etc.)
   - Security architecture (authentication, authorization, tenant isolation)
   - Monitoring & observability (telemetry, logging)
   - Deployment architecture (Docker, environment variables)
   - Scalability considerations (horizontal scaling, WebSocket scaling)

3. TROUBLESHOOTING_GUIDE.md
   - Backend issues (won't start, health check failing, trading loop not running, orders not executing, reconciliation failures)
   - Web issues (won't load, authentication failing, WebSocket connection failing, data not updating)
   - Mobile issues (won't launch, authentication failing, WebSocket connection failing, data not updating)
   - Database issues (Cosmos DB connection failing, slow performance, data inconsistency)
   - Authentication issues (platform-service unreachable, JWT token invalid)
   - WebSocket issues (connection drops, not receiving events)
   - Performance issues (backend slow response times, web slow load times)
   - Deployment issues (Docker build failing, container won't start)
   - Escalation procedures (when to escalate, escalation steps, on-call contact, incident response)
   - Common error messages (backend, web, mobile)
   - Monitoring and alerts (key metrics, alert thresholds, monitoring tools)

4. ONBOARDING_GUIDE.md
   - Project overview (what is the trading dashboard, tech stack, key concepts)
   - Prerequisites (required software, required accounts, optional tools)
   - Development setup (clone repo, install dependencies, configure environment, verify setup)
   - Project structure (monorepo layout, key files)
   - Development workflow (starting development, making changes, commit message format, pushing changes)
   - Testing (backend tests, web tests, mobile tests, verification script)
   - Code review process (PR process, PR checklist, code review guidelines)
   - Common tasks (adding backend endpoint, adding web component, adding mobile screen, adding feature flag)
   - Resources (documentation, external documentation, tools)
   - Getting help (internal resources, common issues, asking questions)
   - Best practices
   - Next steps (first week, first month, ongoing)

These documentation files provide detailed guidance for coding agents to implement
the fixes and improvements identified in FUNCTIONALITY_REVIEW.md without needing
to understand the entire codebase from scratch.
2026-05-09 17:02:05 -07:00

1399 lines
24 KiB
Markdown

# Troubleshooting Guide
**Purpose:** Comprehensive troubleshooting guide for the trading dashboard, covering common issues, resolution steps, and escalation procedures.
**Target Audience:** Developers, operators, support engineers.
---
## Table of Contents
- [Backend Issues](#backend-issues)
- [Web Issues](#web-issues)
- [Mobile Issues](#mobile-issues)
- [Database Issues](#database-issues)
- [Authentication Issues](#authentication-issues)
- [WebSocket Issues](#websocket-issues)
- [Performance Issues](#performance-issues)
- [Deployment Issues](#deployment-issues)
- [Escalation Procedures](#escalation-procedures)
---
## Backend Issues
### Backend Won't Start
**Symptoms:**
- Backend fails to start
- Error message on startup
- Process exits immediately
**Common Causes:**
1. Missing environment variables
2. Invalid configuration
3. Database connection failure
4. Port already in use
**Resolution Steps:**
1. **Check environment variables:**
```bash
cd backend
cat .env
# Ensure all required variables are set
```
2. **Verify configuration:**
```bash
cd backend
npm run dev
# Check error messages for specific issues
```
3. **Check database connection:**
```bash
# Test Cosmos DB connection
curl -I $COSMOS_ENDPOINT
# Check if Cosmos DB is accessible
```
4. **Check port availability:**
```bash
lsof -i :4018
# Kill process using port 4018 if needed
kill -9 <PID>
```
5. **Check logs:**
```bash
# Backend logs should show startup errors
# Look for specific error messages
```
**Prevention:**
- Use `.env.example` as template
- Validate environment variables at startup
- Add configuration validation script
---
### Backend Health Check Failing
**Symptoms:**
- `GET /health/live` returns non-200 status
- Health check endpoint unreachable
- Docker healthcheck failing
**Common Causes:**
1. Backend not running
2. Health check endpoint not implemented
3. Database connection issues
4. Dependencies unhealthy
**Resolution Steps:**
1. **Check if backend is running:**
```bash
curl http://localhost:4018/health/live
```
2. **Check backend logs:**
```bash
# Look for health check errors
# Check for database connection errors
```
3. **Check dependencies:**
```bash
curl http://localhost:4003/health
# Check platform-service health
```
4. **Check database:**
```bash
# Verify Cosmos DB is accessible
# Check database container exists
```
**Prevention:**
- Implement comprehensive health checks
- Add dependency health checks
- Monitor health check failures
---
### Trading Loop Not Running
**Symptoms:**
- No trades executing
- Bot state shows "idle"
- Positions not updating
**Common Causes:**
1. Trading disabled globally
2. Profile disabled
3. Market closed
4. Configuration error
5. Exchange API issue
**Resolution Steps:**
1. **Check trading control state:**
```bash
curl -H "Authorization: Bearer <token>" http://localhost:4018/api/trading/control
# Check if globalTradingEnabled is true
```
2. **Check profile status:**
```bash
# Check if specific profile is enabled
# Look for profile-level disable
```
3. **Check market hours:**
```bash
# Verify SessionRule is not blocking
# Check if market is open
```
4. **Check exchange API:**
```bash
# Verify exchange API credentials
# Test exchange API connectivity
```
5. **Check backend logs:**
```bash
# Look for trading loop errors
# Check for strategy rule failures
```
**Prevention:**
- Add trading loop status monitoring
- Add exchange API health checks
- Implement trading loop alerts
---
### Orders Not Executing
**Symptoms:**
- Orders submitted but not executed
- Orders stuck in "pending" state
- Exchange API errors
**Common Causes:**
1. Exchange API issue
2. Insufficient capital
3. Risk management blocking
4. Exchange maintenance
5. Invalid order parameters
**Resolution Steps:**
1. **Check exchange API:**
```bash
# Test exchange API connectivity
# Verify API credentials
# Check exchange status page
```
2. **Check capital:**
```bash
# Verify sufficient capital available
# Check capital ledger
```
3. **Check risk engine:**
```bash
# Check if risk management is blocking
# Look for risk rule failures
```
4. **Check order parameters:**
```bash
# Verify order parameters are valid
# Check symbol, quantity, price
```
5. **Check backend logs:**
```bash
# Look for order execution errors
# Check for exchange API errors
```
**Prevention:**
- Add order execution monitoring
- Add exchange API health checks
- Implement order execution alerts
---
### Reconciliation Failures
**Symptoms:**
- Reconciliation service failing
- Data inconsistencies
- Reconciliation errors in logs
**Common Causes:**
1. Exchange API issue
2. Database connection issue
3. Data format mismatch
4. Missing data
5. Logic error
**Resolution Steps:**
1. **Check reconciliation logs:**
```bash
# Look for reconciliation errors
# Check for specific failure reasons
```
2. **Run reconciliation manually:**
```bash
cd backend
npm run reconcile:lifecycle-history
# Run specific reconciliation script
```
3. **Check exchange data:**
```bash
# Verify exchange API is returning correct data
# Check for data format changes
```
4. **Check database:**
```bash
# Verify database is accessible
# Check for data corruption
```
**Prevention:**
- Add reconciliation monitoring
- Add reconciliation alerts
- Implement reconciliation retry logic
---
## Web Issues
### Web Won't Load
**Symptoms:**
- Blank screen
- Loading spinner indefinitely
- Browser console errors
**Common Causes:**
1. Backend API unreachable
2. Authentication failure
3. Build error
4. JavaScript error
5. Network issue
**Resolution Steps:**
1. **Check browser console:**
```javascript
// Open browser devtools
// Check for JavaScript errors
// Check for network errors
```
2. **Check backend API:**
```bash
curl http://localhost:4018/health/live
# Verify backend is running
```
3. **Check authentication:**
```bash
# Verify auth token is valid
# Check platform-service is reachable
```
4. **Check build:**
```bash
cd web
pnpm build
# Verify build succeeds
```
5. **Check network:**
```bash
# Check network connectivity
# Verify CORS settings
```
**Prevention:**
- Add error boundaries
- Add loading states
- Implement graceful degradation
---
### Authentication Failing
**Symptoms:**
- Login fails
- Session expires immediately
- Auth token invalid
- 401/403 errors
**Common Causes:**
1. Platform-service unreachable
2. Invalid credentials
3. Token expired
4. Platform-service down
5. Configuration error
**Resolution Steps:**
1. **Check platform-service:**
```bash
curl http://localhost:4003/health
# Verify platform-service is running
```
2. **Check credentials:**
```bash
# Verify username/password are correct
# Check platform-service user exists
```
3. **Check token:**
```bash
# Verify JWT token is valid
# Check token expiration
```
4. **Check configuration:**
```bash
cat web/.env.local
# Verify VITE_PLATFORM_URL is correct
# Verify auth configuration
```
5. **Check logs:**
```bash
# Check platform-service logs
# Check backend auth logs
```
**Prevention:**
- Add auth error handling
- Implement token refresh
- Add auth monitoring
---
### WebSocket Connection Failing
**Symptoms:**
- WebSocket won't connect
- Connection drops repeatedly
- No real-time updates
- WebSocket errors in console
**Common Causes:**
1. Backend unreachable
2. Auth token invalid
3. WebSocket blocked by firewall
4. Namespace mismatch
5. Connection limit exceeded
**Resolution Steps:**
1. **Check backend WebSocket:**
```bash
# Verify backend is running
# Check WebSocket port is accessible
```
2. **Check auth token:**
```bash
# Verify JWT token is valid
# Check token is included in connection
```
3. **Check firewall:**
```bash
# Verify WebSocket port is not blocked
# Check network settings
```
4. **Check namespace:**
```bash
# Verify correct namespace (/trading)
# Check namespace exists in backend
```
5. **Check logs:**
```bash
# Check backend WebSocket logs
# Look for connection errors
```
**Prevention:**
- Add WebSocket error handling
- Implement reconnection logic
- Add WebSocket monitoring
---
### Data Not Updating
**Symptoms:**
- Data stale
- No real-time updates
- Manual refresh required
- WebSocket connected but no updates
**Common Causes:**
1. WebSocket not receiving events
2. Backend not broadcasting
3. Event subscription issue
4. Data cache issue
5. Filter/mask issue
**Resolution Steps:**
1. **Check WebSocket connection:**
```javascript
// Check WebSocket is connected
// Check for connection errors
```
2. **Check backend broadcasting:**
```bash
# Check backend logs for broadcast events
# Verify events are being sent
```
3. **Check event subscription:**
```javascript
// Verify correct events are subscribed
// Check for subscription errors
```
4. **Check data cache:**
```javascript
// Clear cache if needed
// Verify cache is not stale
```
5. **Check filters:**
```javascript
// Verify data filters are correct
# Check for masking issues
```
**Prevention:**
- Add data staleness monitoring
- Implement cache invalidation
- Add data update alerts
---
## Mobile Issues
### Mobile App Won't Launch
**Symptoms:**
- App crashes on launch
- Blank screen
- Loading spinner indefinitely
- Expo Go won't load
**Common Causes:**
1. Build error
2. Configuration error
3. Platform-service unreachable
4. Backend unreachable
5. Expo Go version mismatch
**Resolution Steps:**
1. **Check Expo logs:**
```bash
# Check Expo dev server logs
# Look for build errors
```
2. **Check configuration:**
```bash
cat mobile/.env.local
# Verify EXPO_PUBLIC_* variables are set
```
3. **Check platform-service:**
```bash
curl http://localhost:4003/health
# Verify platform-service is running
```
4. **Check backend:**
```bash
curl http://localhost:4018/health/live
# Verify backend is running
```
5. **Check Expo Go:**
```bash
# Verify Expo Go is latest version
# Try clearing Expo Go cache
```
**Prevention:**
- Add error boundaries
- Add launch error handling
- Implement graceful degradation
---
### Mobile Authentication Failing
**Symptoms:**
- Login fails
- Session won't restore
- Auth token invalid
- Secure storage issue
**Common Causes:**
1. Platform-service unreachable
2. Invalid credentials
3. Token expired
4. Secure storage error
5. Platform-service down
**Resolution Steps:**
1. **Check platform-service:**
```bash
curl http://localhost:4003/health
# Verify platform-service is running
```
2. **Check credentials:**
```bash
# Verify credentials are correct
# Check platform-service user exists
```
3. **Check token:**
```bash
# Verify JWT token is valid
# Check token expiration
```
4. **Check secure storage:**
```bash
# Verify secure storage is working
# Check for storage errors
```
5. **Check logs:**
```bash
# Check mobile logs
# Look for auth errors
```
**Prevention:**
- Add auth error handling
- Implement token refresh
- Add secure storage error handling
---
### Mobile WebSocket Connection Failing
**Symptoms:**
- WebSocket won't connect
- Connection drops repeatedly
- No real-time updates
- WebSocket errors in logs
**Common Causes:**
1. Backend unreachable
2. Auth token invalid
3. WebSocket blocked by network
4. Namespace mismatch
5. Connection limit exceeded
**Resolution Steps:**
1. **Check backend WebSocket:**
```bash
# Verify backend is running
# Check WebSocket port is accessible
```
2. **Check auth token:**
```bash
# Verify JWT token is valid
# Check token is included in connection
```
3. **Check network:**
```bash
# Verify network allows WebSocket
# Check mobile network settings
```
4. **Check namespace:**
```bash
# Verify correct namespace (/trading)
# Check namespace exists in backend
```
5. **Check logs:**
```bash
# Check mobile logs
# Look for connection errors
```
**Prevention:**
- Add WebSocket error handling
- Implement reconnection logic
- Add polling fallback
---
### Mobile Data Not Updating
**Symptoms:**
- Data stale
- No real-time updates
- Manual refresh required
- WebSocket connected but no updates
**Common Causes:**
1. WebSocket not receiving events
2. Backend not broadcasting
3. Event subscription issue
4. Data cache issue
5. Polling fallback not working
**Resolution Steps:**
1. **Check WebSocket connection:**
```javascript
// Check WebSocket is connected
// Check for connection errors
```
2. **Check backend broadcasting:**
```bash
# Check backend logs for broadcast events
# Verify events are being sent
```
3. **Check polling fallback:**
```javascript
// Verify polling is working
// Check polling interval
```
4. **Check data cache:**
```javascript
// Clear cache if needed
# Verify cache is not stale
```
5. **Check logs:**
```bash
# Check mobile logs
# Look for update errors
```
**Prevention:**
- Add data staleness monitoring
- Implement polling fallback
- Add data update alerts
---
## Database Issues
### Cosmos DB Connection Failing
**Symptoms:**
- Backend can't connect to Cosmos
- Connection timeout
- Authentication error
- Database not found
**Common Causes:**
1. Invalid credentials
2. Network issue
3. Cosmos DB down
4. Container not found
5. Firewall blocking
**Resolution Steps:**
1. **Check credentials:**
```bash
cat backend/.env
# Verify COSMOS_ENDPOINT and COSMOS_KEY are correct
```
2. **Test connection:**
```bash
curl -I $COSMOS_ENDPOINT
# Verify Cosmos DB is accessible
```
3. **Check database:**
```bash
# Verify database exists
# Check container exists
```
4. **Check network:**
```bash
# Verify network allows connection
# Check firewall settings
```
5. **Check logs:**
```bash
# Check backend logs for Cosmos errors
# Look for connection errors
```
**Prevention:**
- Add connection retry logic
- Implement connection pooling
- Add connection monitoring
---
### Cosmos DB Slow Performance
**Symptoms:**
- Slow query response times
- Timeout errors
- Performance degradation
**Common Causes:**
1. Large result sets
2. Missing indexes
3. High RU consumption
4. Network latency
5. Database throttling
**Resolution Steps:**
1. **Check query performance:**
```bash
# Check query execution time
# Look for slow queries
```
2. **Check indexes:**
```bash
# Verify indexes exist
# Check index usage
```
3. **Check RU consumption:**
```bash
# Monitor RU consumption
# Check for throttling
```
4. **Optimize queries:**
```bash
# Add query filters
# Use pagination
# Reduce result set size
```
**Prevention:**
- Add query performance monitoring
- Implement query optimization
- Add performance alerts
---
### Data Inconsistency
**Symptoms:**
- Data mismatch between surfaces
- Stale data
- Missing data
- Duplicate data
**Common Causes:**
1. Reconciliation failure
2. Cache issue
3. Race condition
4. Data corruption
5. Sync issue
**Resolution Steps:**
1. **Run reconciliation:**
```bash
cd backend
npm run reconcile:lifecycle-history
# Run reconciliation script
```
2. **Check cache:**
```bash
# Clear cache if needed
# Verify cache is not stale
```
3. **Check sync:**
```bash
# Verify sync is working
# Check for sync errors
```
4. **Check data integrity:**
```bash
# Verify data integrity
# Check for corruption
```
**Prevention:**
- Add data consistency monitoring
- Implement reconciliation
- Add data integrity checks
---
## Authentication Issues
### Platform-Service Unreachable
**Symptoms:**
- Can't authenticate
- Token validation fails
- Platform-service health check fails
**Common Causes:**
1. Platform-service down
2. Network issue
3. Wrong URL
4. Firewall blocking
5. DNS issue
**Resolution Steps:**
1. **Check platform-service:**
```bash
curl http://localhost:4003/health
# Verify platform-service is running
```
2. **Check configuration:**
```bash
cat backend/.env
# Verify PLATFORM_API_URL is correct
cat web/.env.local
# Verify VITE_PLATFORM_URL is correct
```
3. **Check network:**
```bash
# Verify network allows connection
# Check firewall settings
```
4. **Check DNS:**
```bash
# Verify DNS resolution
# Check host file
```
**Prevention:**
- Add platform-service monitoring
- Implement fallback auth
- Add auth monitoring
---
### JWT Token Invalid
**Symptoms:**
- 401 errors
- Token validation fails
- Session expires immediately
**Common Causes:**
1. Token expired
2. Token malformed
3. Wrong signing key
4. Token revoked
5. Clock skew
**Resolution Steps:**
1. **Check token expiration:**
```bash
# Decode JWT token
# Check expiration time
```
2. **Check token format:**
```bash
# Verify JWT format is correct
# Check for token errors
```
3. **Check signing key:**
```bash
# Verify signing key is correct
# Check key rotation
```
4. **Check clock skew:**
```bash
# Verify system time is correct
# Check NTP sync
```
**Prevention:**
- Implement token refresh
- Add token validation
- Add token monitoring
---
## WebSocket Issues
### WebSocket Connection Drops
**Symptoms:**
- Connection drops repeatedly
- Reconnection fails
- Connection timeout
**Common Causes:**
1. Network instability
2. Backend restart
3. Auth token expired
4. Connection limit
5. Firewall timeout
**Resolution Steps:**
1. **Check network:**
```bash
# Verify network stability
# Check for packet loss
```
2. **Check backend:**
```bash
# Verify backend is not restarting
# Check backend logs
```
3. **Check auth token:**
```bash
# Verify JWT token is valid
# Check token expiration
```
4. **Check connection limit:**
```bash
# Verify connection limit not exceeded
# Check for connection leaks
```
**Prevention:**
- Implement reconnection logic
- Add connection monitoring
- Implement heartbeat
---
### WebSocket Not Receiving Events
**Symptoms:**
- WebSocket connected but no events
- Events not arriving
- Event subscription issue
**Common Causes:**
1. Not subscribed to events
2. Namespace mismatch
3. Backend not broadcasting
4. Event filtering
5. Room mismatch
**Resolution Steps:**
1. **Check subscription:**
```javascript
// Verify correct events are subscribed
// Check for subscription errors
```
2. **Check namespace:**
```bash
# Verify correct namespace
# Check namespace exists
```
3. **Check backend:**
```bash
# Verify backend is broadcasting
# Check backend logs
```
4. **Check filtering:**
```javascript
// Verify event filters are correct
# Check for filtering errors
```
**Prevention:**
- Add event monitoring
- Implement event acknowledgment
- Add event logging
---
## Performance Issues
### Backend Slow Response Times
**Symptoms:**
- API responses slow
- High latency
- Timeout errors
**Common Causes:**
1. Database query slow
2. External API slow
3. CPU bottleneck
4. Memory bottleneck
5. Network latency
**Resolution Steps:**
1. **Check database queries:**
```bash
# Check query performance
# Look for slow queries
```
2. **Check external APIs:**
```bash
# Check exchange API latency
# Verify API performance
```
3. **Check system resources:**
```bash
# Check CPU usage
# Check memory usage
```
4. **Check network:**
```bash
# Check network latency
# Verify network bandwidth
```
**Prevention:**
- Add performance monitoring
- Implement caching
- Add performance alerts
---
### Web Slow Load Times
**Symptoms:**
- Slow page load
- Large bundle size
- Slow initial render
**Common Causes:**
1. Large bundle size
2. Too many requests
3. Slow API responses
4. Unoptimized assets
5. No caching
**Resolution Steps:**
1. **Check bundle size:**
```bash
cd web
pnpm build
# Check bundle size
```
2. **Check network requests:**
```javascript
// Check network tab
// Look for large requests
```
3. **Check API responses:**
```bash
# Check API response times
# Look for slow endpoints
```
4. **Optimize assets:**
```bash
# Optimize images
# Minify assets
# Implement caching
```
**Prevention:**
- Add performance monitoring
- Implement code splitting
- Add caching
---
## Deployment Issues
### Docker Build Failing
**Symptoms:**
- Docker build fails
- Build timeout
- Dependency install fails
**Common Causes:**
1. Invalid Dockerfile
2. Missing dependencies
3. Network issue
4. Insufficient resources
5. Build context too large
**Resolution Steps:**
1. **Check Dockerfile:**
```bash
# Verify Dockerfile syntax
# Check for errors
```
2. **Check dependencies:**
```bash
# Verify package.json is valid
# Check for missing dependencies
```
3. **Check network:**
```bash
# Verify network connectivity
# Check registry access
```
4. **Check resources:**
```bash
# Verify sufficient disk space
# Check memory availability
```
**Prevention:**
- Add Docker build validation
- Implement build caching
- Add build monitoring
---
### Docker Container Won't Start
**Symptoms:**
- Container won't start
- Container exits immediately
- Container crashes
**Common Causes:**
1. Invalid configuration
2. Missing environment variables
3. Port conflict
4. Dependency unavailable
5. Health check failing
**Resolution Steps:**
1. **Check container logs:**
```bash
docker logs <container-id>
# Look for startup errors
```
2. **Check configuration:**
```bash
# Verify environment variables
# Check configuration files
```
3. **Check ports:**
```bash
# Verify port availability
# Check for conflicts
```
4. **Check dependencies:**
```bash
# Verify dependencies are available
# Check service health
```
**Prevention:**
- Add container health checks
- Implement dependency checks
- Add startup validation
---
## Escalation Procedures
### When to Escalate
Escalate to on-call if:
- Production outage
- Data loss or corruption
- Security breach
- Critical bug affecting multiple users
- Issue cannot be resolved within 30 minutes
### Escalation Steps
1. **Document the issue:**
- Describe the problem
- Include error messages
- Include logs
- Include request IDs
2. **Attempt resolution:**
- Follow troubleshooting steps
- Document attempted solutions
- Note what worked/didn't work
3. **Escalate:**
- Contact on-call engineer
- Provide documentation
- Provide context
- Provide urgency
4. **Follow up:**
- Monitor resolution
- Document resolution
- Update runbooks
- Share learnings
### On-Call Contact
- **Primary:** [On-call engineer]
- **Secondary:** [Backup engineer]
- **Escalation:** [Engineering manager]
### Incident Response
1. **Detect:** Monitoring alert or user report
2. **Acknowledge:** Acknowledge alert within 5 minutes
3. **Investigate:** Gather information, check logs
4. **Mitigate:** Implement temporary fix if needed
5. **Resolve:** Implement permanent fix
6. **Post-mortem:** Document incident, learnings, improvements
---
## Common Error Messages
### Backend Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `ECONNREFUSED` | Port not available | Check port, kill process |
| `ETIMEDOUT` | Connection timeout | Check network, increase timeout |
| `Unauthorized` | Invalid auth | Check credentials, token |
| `Forbidden` | Insufficient permissions | Check user role, permissions |
| `InternalServerError` | Server error | Check logs, fix bug |
### Web Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `Network Error` | Backend unreachable | Check backend, network |
| `401 Unauthorized` | Invalid token | Refresh token, re-auth |
| `403 Forbidden` | Insufficient permissions | Check user role |
| `404 Not Found` | Resource not found | Check URL, resource exists |
| `500 Internal Server Error` | Server error | Check logs, fix bug |
### Mobile Errors
| Error | Cause | Resolution |
|-------|-------|------------|
| `Network request failed` | Network issue | Check network, backend |
| `Auth failed` | Invalid credentials | Check credentials, re-auth |
| `WebSocket error` | Connection issue | Check backend, network |
| `Storage error` | Secure storage issue | Check storage permissions |
| `App crashed` | Runtime error | Check logs, fix bug |
---
## Monitoring and Alerts
### Key Metrics to Monitor
**Backend:**
- CPU usage
- Memory usage
- API response times
- Error rate
- Database query times
- WebSocket connection count
**Web:**
- Page load time
- Bundle size
- API response times
- Error rate
- WebSocket connection count
**Mobile:**
- App launch time
- API response times
- Error rate
- WebSocket connection count
- Crash rate
### Alert Thresholds
| Metric | Warning | Critical |
|--------|---------|----------|
| CPU usage | 70% | 90% |
| Memory usage | 70% | 90% |
| API response time | 1s | 5s |
| Error rate | 5% | 10% |
| Database query time | 500ms | 2s |
### Monitoring Tools
- Backend: Winston logging, Prometheus metrics
- Web: Browser devtools, Lighthouse
- Mobile: Expo logs, Sentry
---
## References
- **README.md** - Project overview
- **ARCHITECTURE_DOCUMENTATION.md** - System architecture
- **API_DOCUMENTATION_GUIDE.md** - API documentation
- **FUNCTIONALITY_REVIEW.md** - Functional gaps and issues
- **CUTOVER_WEB.md** - Web cutover procedures
- **CUTOVER_MOBILE.md** - Mobile cutover procedures