Error Handling
kube-ingress-dash implements comprehensive error handling to provide a reliable and resilient user experience.
Architecture Overview
The error handling system consists of multiple layers:
- Error Classification - Categorizes errors as transient or permanent
- Retry Logic - Automatically retries transient errors with exponential backoff (implemented)
- Circuit Breaker - Prevents cascading failures during outages (implemented)
- Error Boundaries - Catches and displays UI errors gracefully
- User Feedback - Provides clear, actionable error messages
The core error handling infrastructure (classification, retry logic, and circuit breaker) has been implemented. Integration with all API routes is ongoing.
For detailed architecture diagrams including error handling flow, circuit breaker state machine, and retry logic, see the Production Features Architecture documentation.
Error Classification
Errors are automatically classified into two categories:
Transient Errors
Temporary failures that may succeed on retry:
- Network timeouts (ETIMEDOUT, ECONNRESET)
- Connection refused (ECONNREFUSED)
- DNS resolution failures (ENOTFOUND)
- HTTP 503 Service Unavailable
- HTTP 429 Too Many Requests
- Temporary Kubernetes API unavailability
Handling: Automatically retried with exponential backoff
Permanent Errors
Failures that won't succeed on retry:
- Authentication failures (401 Unauthorized)
- Permission denied (403 Forbidden)
- Resource not found (404 Not Found)
- Invalid requests (400 Bad Request)
- RBAC permission errors
Handling: Returned immediately to user with helpful message
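The transient/permanent split above can be expressed as a small predicate. The sketch below is illustrative, not the actual kube-ingress-dash code: it assumes Node.js-style error codes and an optional HTTP status on the error object, and the names `isTransient` and `ApiError` are hypothetical.

```typescript
// Illustrative classifier for the transient/permanent split described above.
// The ApiError shape and isTransient name are assumptions, not the real API.

const TRANSIENT_CODES = new Set(["ETIMEDOUT", "ECONNRESET", "ECONNREFUSED", "ENOTFOUND"]);
const TRANSIENT_STATUS = new Set([429, 503]);

interface ApiError {
  code?: string;        // Node.js network error code, if any
  statusCode?: number;  // HTTP status, if the request reached the server
}

function isTransient(err: ApiError): boolean {
  if (err.code && TRANSIENT_CODES.has(err.code)) return true;
  if (err.statusCode && TRANSIENT_STATUS.has(err.statusCode)) return true;
  return false; // 400/401/403/404 and anything unrecognized: fail fast
}
```

Anything not positively identified as transient is treated as permanent, so unknown failures surface to the user immediately rather than being retried blindly.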
Retry Logic
Transient errors are automatically retried using exponential backoff:
// Retry configuration
{
  maxAttempts: 3,
  initialDelay: 1000,    // 1 second
  maxDelay: 30000,       // 30 seconds
  backoffMultiplier: 2   // Double delay each retry
}
Retry Sequence
- First attempt - Immediate
- First retry - After 1 second
- Second retry - After 2 seconds
- Third retry - After 4 seconds
- Give up - Return error to user
Example
Attempt 1: Failed (ETIMEDOUT)
Wait 1s...
Attempt 2: Failed (ETIMEDOUT)
Wait 2s...
Attempt 3: Success ✓
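The backoff loop can be sketched as follows. This is a hypothetical implementation, not the project's actual code: it interprets `maxAttempts` as the number of retries after the initial attempt (matching the retry sequence above) and takes a pluggable `isTransient` check so permanent errors surface immediately.

```typescript
// Hypothetical exponential-backoff helper; not the actual kube-ingress-dash code.
interface RetryConfig {
  maxAttempts: number;       // retries after the initial attempt (assumption)
  initialDelay: number;      // ms
  maxDelay: number;          // ms
  backoffMultiplier: number;
}

const defaultRetryConfig: RetryConfig = {
  maxAttempts: 3,
  initialDelay: 1000,
  maxDelay: 30000,
  backoffMultiplier: 2,
};

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

async function withRetry<T>(
  fn: () => Promise<T>,
  isTransient: (err: unknown) => boolean,
  cfg: RetryConfig = defaultRetryConfig,
): Promise<T> {
  let delay = cfg.initialDelay;
  for (let retries = 0; ; retries++) {
    try {
      return await fn(); // first attempt is immediate
    } catch (err) {
      // Permanent errors and an exhausted retry budget surface to the caller.
      if (!isTransient(err) || retries >= cfg.maxAttempts) throw err;
      await sleep(delay);
      delay = Math.min(delay * cfg.backoffMultiplier, cfg.maxDelay); // 1s, 2s, 4s, ...
    }
  }
}
```

Capping the delay at `maxDelay` keeps the wait bounded even if the attempt budget were raised.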
Circuit Breaker
Protects the application and Kubernetes API from cascading failures.
States
Closed (Normal Operation)
- All requests pass through
- Failures are tracked
- Transitions to Open if failure threshold exceeded
Open (Failing Fast)
- Requests fail immediately without calling Kubernetes API
- Prevents overload during outages
- Transitions to Half-Open after timeout
Half-Open (Testing Recovery)
- Limited requests allowed through
- Tests if service has recovered
- Transitions to Closed if successful, back to Open if failed
Configuration
{
  failureThreshold: 0.5,  // Open at 50% failure rate
  windowSize: 30000,      // 30-second window
  openTimeout: 60000      // Wait 60s before testing recovery
}
Behavior Example
Time 0s: Circuit Closed - Normal operation
Time 10s: 15 failures out of 30 requests (50%)
Time 10s: Circuit Opens - Fail fast mode
Time 70s: Circuit Half-Open - Testing recovery
Time 71s: Test request succeeds
Time 71s: Circuit Closed - Back to normal
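The three states and transitions above can be sketched as a small class. This is a minimal illustration, not the dashboard's actual breaker: it assumes an in-memory sliding window and a single half-open trial request, and the injectable clock and minimum-sample guard are additions for testability rather than part of the documented configuration.

```typescript
// Minimal circuit-breaker sketch matching the documented states and thresholds.
type BreakerState = "closed" | "open" | "half-open";

class CircuitBreaker {
  private state: BreakerState = "closed";
  private outcomes: { at: number; ok: boolean }[] = [];
  private openedAt = 0;

  constructor(
    private failureThreshold = 0.5,        // open at 50% failure rate
    private windowSize = 30_000,           // 30-second sliding window (ms)
    private openTimeout = 60_000,          // wait 60s before testing recovery (ms)
    private minSamples = 10,               // assumption: don't trip on tiny samples
    private now: () => number = Date.now,  // injectable clock for testing
  ) {}

  getState(): BreakerState {
    if (this.state === "open" && this.now() - this.openedAt >= this.openTimeout) {
      this.state = "half-open"; // allow a trial request through
    }
    return this.state;
  }

  allowRequest(): boolean {
    return this.getState() !== "open"; // Open = fail fast, no Kubernetes API call
  }

  record(ok: boolean): void {
    const t = this.now();
    if (this.state === "half-open") {
      // A single trial decides: success closes the circuit, failure reopens it.
      this.state = ok ? "closed" : "open";
      this.openedAt = t;
      this.outcomes = [];
      return;
    }
    this.outcomes = this.outcomes.filter((o) => t - o.at < this.windowSize);
    this.outcomes.push({ at: t, ok });
    const failures = this.outcomes.filter((o) => !o.ok).length;
    if (
      this.outcomes.length >= this.minSamples &&
      failures / this.outcomes.length >= this.failureThreshold // e.g. 15/30 = 50%
    ) {
      this.state = "open";
      this.openedAt = t;
    }
  }
}
```

Callers would check `allowRequest()` before hitting the Kubernetes API and call `record()` with the outcome afterward.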
Error Boundaries
React Error Boundaries catch and handle UI errors gracefully.
Dashboard Error Boundary
Catches errors in the entire dashboard:
<DashboardErrorBoundary>
  <Dashboard />
</DashboardErrorBoundary>
Fallback: Full-page error screen with retry option
Component Error Boundaries
Catch errors in specific components:
- IngressListErrorBoundary - Ingress list rendering errors
- FiltersErrorBoundary - Filter component errors
Fallback: Component-level error message, rest of UI continues working
Benefits
- Prevents entire app from crashing
- Isolates errors to affected components
- Provides recovery options
- Maintains user context
User-Facing Error Messages
Clear, actionable error messages help users resolve issues:
Permission Errors
Permission Error
You don't have sufficient permissions to access Kubernetes resources.
Check your RBAC configuration.
[View RBAC Setup Documentation] [Retry]
API Errors
API Error
There was an issue connecting to the Kubernetes API.
Please check your cluster configuration.
[Retry]
Generic Errors
Something went wrong
An unexpected error occurred. Please try again.
[Retry]
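The three message categories above suggest a simple mapping from a classified error to user-facing copy. The sketch below is illustrative; the error shape and the `userMessage` helper are hypothetical names, not the dashboard's actual components.

```typescript
// Hypothetical mapping from a classified error to user-facing message copy.
interface UserMessage {
  title: string;
  body: string;
  actions: string[];
}

function userMessage(err: { statusCode?: number; code?: string }): UserMessage {
  // Auth/RBAC failures get actionable guidance plus a documentation link.
  if (err.statusCode === 401 || err.statusCode === 403) {
    return {
      title: "Permission Error",
      body: "You don't have sufficient permissions to access Kubernetes resources. Check your RBAC configuration.",
      actions: ["View RBAC Setup Documentation", "Retry"],
    };
  }
  // Network errors and server-side failures point at cluster connectivity.
  if (err.code || (err.statusCode && err.statusCode >= 500) || err.statusCode === 429) {
    return {
      title: "API Error",
      body: "There was an issue connecting to the Kubernetes API. Please check your cluster configuration.",
      actions: ["Retry"],
    };
  }
  // Everything else falls back to the generic message.
  return {
    title: "Something went wrong",
    body: "An unexpected error occurred. Please try again.",
    actions: ["Retry"],
  };
}
```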
Error Logging
Errors are logged with context for debugging:
{
  "level": "error",
  "message": "Kubernetes API request failed",
  "error": {
    "code": "ETIMEDOUT",
    "message": "Request timeout",
    "statusCode": null
  },
  "context": {
    "operation": "listIngresses",
    "namespace": "default",
    "attempt": 2,
    "timestamp": "2024-01-15T10:30:00.000Z"
  }
}
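A logger producing entries in that shape could look like the following sketch; `buildErrorEntry`, `logError`, and the field types are illustrative assumptions, not the project's actual logger.

```typescript
// Hypothetical helper producing log entries shaped like the example above.
interface ErrorDetails {
  code?: string;
  message: string;
  statusCode: number | null;
}

interface ErrorContext {
  operation: string;
  namespace?: string;
  attempt?: number;
}

function buildErrorEntry(message: string, error: ErrorDetails, context: ErrorContext) {
  return {
    level: "error" as const,
    message,
    error,
    // Timestamp is attached at log time so every entry is self-describing.
    context: { ...context, timestamp: new Date().toISOString() },
  };
}

// Emit one JSON object per line, e.g. for collection by the cluster's log pipeline.
function logError(message: string, error: ErrorDetails, context: ErrorContext): void {
  console.error(JSON.stringify(buildErrorEntry(message, error, context)));
}
```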
Monitoring and Alerts
Metrics to Monitor
- Error rate - Percentage of failed requests
- Circuit breaker state - Open/Closed/Half-Open
- Retry attempts - Number of retries per request
- Error types - Distribution of error categories
Recommended Alerts
- High error rate - Alert if error rate > 10% for 5 minutes
- Circuit breaker open - Alert when circuit opens
- Repeated failures - Alert if same error occurs > 10 times
- Permission errors - Alert on RBAC errors (may indicate misconfiguration)
Troubleshooting
High Error Rate
Symptoms: Many requests failing
Possible Causes:
- Kubernetes API unavailable or slow
- Network connectivity issues
- RBAC permissions incorrect
- Resource limits too low
Solutions:
- Check Kubernetes API health
- Verify network connectivity
- Review RBAC configuration
- Increase resource limits
Circuit Breaker Frequently Opening
Symptoms: Circuit breaker opens repeatedly
Possible Causes:
- Kubernetes API instability
- Application resource constraints
- Network issues
Solutions:
- Investigate Kubernetes API health
- Increase application resources
- Adjust circuit breaker thresholds
- Check network latency
Permission Errors
Symptoms: 403 Forbidden errors
Possible Causes:
- Missing RBAC permissions
- Incorrect service account
- Namespace restrictions
Solutions:
- Verify ClusterRole includes all required permissions
- Check ClusterRoleBinding references correct ServiceAccount
- Ensure ServiceAccount is mounted in pod
- Review RBAC Setup Guide
Best Practices
- Monitor error rates - Set up alerts for unusual error patterns
- Review logs regularly - Look for recurring errors
- Test error scenarios - Verify error handling works as expected
- Update RBAC - Ensure permissions match application needs
- Tune circuit breaker - Adjust thresholds based on your environment
Configuration
Error handling configuration is currently hardcoded in the implementation. Environment variable configuration is planned for a future release.
Current default configuration:
- Retry attempts: 3
- Initial delay: 1000ms
- Max delay: 30000ms
- Circuit breaker failure threshold: 50%
- Circuit breaker window: 30 seconds
- Circuit breaker timeout: 60 seconds