LLM Incident Manager is an enterprise-grade, production-ready incident management system built in Rust, designed specifically for LLM DevOps ecosystems. It provides intelligent incident detection, classification, enrichment, correlation, routing, escalation, and automated resolution capabilities for modern LLM infrastructure.
Available on:
- crates.io - Rust library and binaries
- npm - Server with npm CLI tooling
- npm types - TypeScript type definitions
- npm client - JavaScript/TypeScript client SDK
- High Performance: Built in Rust with async/await for maximum throughput and minimal latency
- ML-Powered Classification: Machine learning-based incident classification with confidence scoring
- Context Enrichment: Automatic enrichment with historical data, service info, and team context
- Intelligent Correlation: Groups related incidents to reduce alert fatigue
- Smart Escalation: Policy-based escalation with multi-level notification chains
- Persistent Storage: PostgreSQL and in-memory storage implementations
- Smart Routing: Policy-based routing with team and severity-based rules
- Multi-Channel Notifications: Email, Slack, PagerDuty, webhooks
- Automated Playbooks: Execute automated remediation workflows
- Complete Audit Trail: Full incident lifecycle tracking
- Multi-level escalation policies
- Time-based automatic escalation
- Configurable notification channels per level
- Target types: Users, Teams, On-Call schedules
- Pause/resume/resolve escalation flows
- Real-time escalation state tracking
- Documentation: ESCALATION_GUIDE.md
- PostgreSQL backend with connection pooling
- In-memory storage for testing/development
- Trait-based abstraction for extensibility
- Transaction support for data consistency
- Full incident lifecycle persistence
- Query optimizations and indexing
- Documentation: STORAGE_IMPLEMENTATION.md
- Time-window based correlation
- Multi-strategy correlation: Source, Type, Similarity, Tag, Service
- Dynamic correlation groups
- Configurable thresholds and windows
- Pattern detection across incidents
- Graph-based relationship tracking
- Documentation: CORRELATION_GUIDE.md
- Automated severity classification
- Multi-model ensemble architecture
- Feature extraction from incidents
- Confidence scoring
- Incremental learning with feedback
- Model versioning and persistence
- Real-time classification API
- Documentation: ML_CLASSIFICATION_GUIDE.md
- Historical incident analysis with similarity matching
- Service catalog integration (CMDB)
- Team and on-call information
- External API integrations (Prometheus, Elasticsearch)
- Parallel enrichment pipeline
- Intelligent caching with TTL
- Configurable enrichers and priorities
- Documentation: ENRICHMENT_GUIDE.md
- Fingerprint-based duplicate detection
- Time-window deduplication
- Automatic incident merging
- Alert correlation
- Multi-channel delivery (Email, Slack, PagerDuty)
- Template-based formatting
- Rate limiting and throttling
- Delivery confirmation
- Trigger-based playbook execution
- Step-by-step action execution
- Auto-execution on incident creation
- Manual playbook execution
- Rule-based incident routing
- Team assignment suggestions
- Severity-based routing
- Service-aware routing
- Sentinel Client: Monitoring & anomaly detection with ML-powered analysis
- Shield Client: Security threat analysis and mitigation planning
- Edge-Agent Client: Distributed edge inference with offline queue management
- Governance Client: Multi-framework compliance (GDPR, HIPAA, SOC2, PCI, ISO27001)
- Enterprise features: Exponential backoff retry, circuit breaker, rate limiting
- Comprehensive error handling and observability
- Full-featured GraphQL API alongside REST
- Real-time WebSocket subscriptions for incident updates
- Type-safe schema with queries, mutations, and subscriptions
- DataLoaders for efficient batch loading and N+1 prevention
- GraphQL Playground for interactive API exploration
- Support for filtering, pagination, and complex queries
- Documentation: GRAPHQL_GUIDE.md, WEBSOCKET_STREAMING_GUIDE.md
- Prometheus Integration: Native Prometheus metrics export on port 9090
- Real-time Performance Tracking: Request rates, latency, success/error rates
- Integration Metrics: Per-integration monitoring (Sentinel, Shield, Edge-Agent, Governance)
- System Metrics: Processing pipeline, correlation, enrichment, ML classification
- Zero-Overhead Collection: Lock-free atomic operations with <1μs recording time
- Grafana Dashboards: Pre-built dashboards for system overview and deep-dive analysis
- Alert Rules: Production-ready alerting for critical conditions
- Documentation: METRICS_GUIDE.md | Implementation | Runbook
- Resilience Pattern: Prevent cascading failures with automatic circuit breaking
- State Management: Closed, Open, and Half-Open states with intelligent transitions
- Per-Service Configuration: Individual circuit breakers for each external dependency
- Fast Failure: Millisecond response time when circuit is open (vs. 30s+ timeouts)
- Automatic Recovery: Self-healing with configurable recovery strategies
- Fallback Support: Graceful degradation with fallback mechanisms
- Comprehensive Metrics: Real-time state tracking and Prometheus integration
- Manual Control: API endpoints for operational override and testing
- Documentation: CIRCUIT_BREAKER_GUIDE.md | API Reference | Integration Guide | Operations
┌──────────────────────────────────────────────────────────────────────┐
│                         LLM Incident Manager                         │
├──────────────────────────────────────────────────────────────────────┤
│                                                                      │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐    │
│  │   REST API   │  │   gRPC API   │  │       GraphQL API        │    │
│  │ (HTTP/JSON)  │  │  (Protobuf)  │  │ (Queries/Mutations/Subs) │    │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────────────────┘    │
│         │                 │                 │                        │
│         └─────────────────┼─────────────────┘                        │
│                           ▼                                          │
│                ┌─────────────────────┐                               │
│                │  IncidentProcessor  │                               │
│                │  - Deduplication    │                               │
│                │  - Classification   │                               │
│                │  - Enrichment       │                               │
│                │  - Correlation      │                               │
│                └──────────┬──────────┘                               │
│                           │                                          │
│         ┌─────────────────┼─────────────────┐                        │
│         ▼                 ▼                 ▼                        │
│ ┌───────────────┐ ┌───────────────┐ ┌───────────────┐                │
│ │  Escalation   │ │ Notification  │ │   Playbook    │                │
│ │    Engine     │ │    Service    │ │    Service    │                │
│ └───────┬───────┘ └───────┬───────┘ └───────┬───────┘                │
│         │                 │                 │                        │
│         └─────────────────┼─────────────────┘                        │
│                           ▼                                          │
│                ┌─────────────────────┐                               │
│                │    Storage Layer    │                               │
│                │  - PostgreSQL       │                               │
│                │  - In-Memory        │                               │
│                └─────────────────────┘                               │
└──────────────────────────────────────────────────────────────────────┘
Alert → Deduplication → ML Classification → Context Enrichment
                                                    ↓
                                               Correlation
                                                    ↓
                        Routing ← ← ← ← ← ← ← ← ← ← ←
                           ↓
       ┌───────────────────┼───────────────────┐
       ▼                   ▼                   ▼
 Notifications        Escalation           Playbooks
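The Deduplication stage at the start of this flow is described elsewhere in this README as fingerprint-based with a time window (900 seconds in the sample configuration). The following is a minimal, std-only sketch of that idea, not the crate's `DeduplicationEngine`: the `AlertKey`, `fingerprint`, and `Deduplicator` names are hypothetical, the choice of fingerprint fields is an assumption, and the clock is passed in as plain seconds so the behaviour is easy to follow.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Hypothetical alert shape: only the fields used for fingerprinting.
pub struct AlertKey<'a> {
    pub source: &'a str,
    pub title: &'a str,
    pub severity: &'a str,
}

/// Hashes the identifying fields. The title is lower-cased so that
/// differences in letter case do not produce distinct fingerprints.
/// DefaultHasher is not guaranteed stable across Rust releases; a
/// persistent store would want a fixed hash function instead.
pub fn fingerprint(alert: &AlertKey) -> u64 {
    let mut hasher = DefaultHasher::new();
    alert.source.hash(&mut hasher);
    alert.title.to_lowercase().hash(&mut hasher);
    alert.severity.hash(&mut hasher);
    hasher.finish()
}

/// Remembers when each fingerprint was last seen.
pub struct Deduplicator {
    window_secs: u64,
    last_seen: HashMap<u64, u64>,
}

impl Deduplicator {
    pub fn new(window_secs: u64) -> Self {
        Self { window_secs, last_seen: HashMap::new() }
    }

    /// Returns true when an alert with the same fingerprint arrived less
    /// than `window_secs` ago. Every call refreshes the timestamp, so the
    /// window slides forward while duplicates keep arriving.
    pub fn is_duplicate(&mut self, alert: &AlertKey, now_secs: u64) -> bool {
        let fp = fingerprint(alert);
        let duplicate = match self.last_seen.get(&fp) {
            Some(&seen) => now_secs.saturating_sub(seen) < self.window_secs,
            None => false,
        };
        self.last_seen.insert(fp, now_secs);
        duplicate
    }
}
```

A sliding window is only one possible policy; a fixed window that starts at the first alert would stop suppressing a long-running alert storm after 15 minutes, which may or may not be what an operator wants.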
For Rust/Cargo:
- Rust 1.75+ (2021 edition)
- PostgreSQL 14+ (optional, for persistent storage)
- Redis (optional, for distributed caching)
For npm:
- Node.js 16.0+
- npm 7.0+
# Install from crates.io
cargo install llm-incident-manager
# Or add as dependency in Cargo.toml
[dependencies]
llm-incident-manager = "1.0.1"

# Install the server globally
npm install -g @llm-dev-ops/llm-incident-manager
# Build the Rust binaries
npm run build
# Start the server
npm start
# Or run directly
llm-incident-manager

# Install the WebSocket/GraphQL client
npm install @llm-dev-ops/incident-manager-client
# Install type definitions (TypeScript)
npm install @llm-dev-ops/incident-manager-types

# Clone repository
git clone https://github.com/globalbusinessadvisors/llm-incident-manager.git
cd llm-incident-manager
# Build with Cargo
cargo build --release
# Or build with npm
npm install
npm run build
# Run tests
cargo test --all-features
# Run with default configuration (in-memory storage)
cargo run --release

# From cargo installation
llm-incident-manager
# From npm installation
npm start
# Or with environment variables
DATABASE_URL=postgresql://localhost/incident_manager \
API_PORT=8080 \
GRPC_PORT=50051 \
llm-incident-manager

import { IncidentManagerClient } from '@llm-dev-ops/incident-manager-client';
const client = new IncidentManagerClient({
  wsUrl: 'ws://localhost:8080/graphql/ws',
  authToken: 'your-jwt-token'
});

// Subscribe to critical incidents (P0 and P1)
client.subscribeToCriticalIncidents((incident) => {
  console.log('Critical incident:', incident.title);
  console.log('  Severity:', incident.severity);
  // Trigger alerts, send to PagerDuty, etc.
  if (incident.severity === 'P0') {
    sendPagerDutyAlert(incident);
  }
});

// Subscribe to all incident updates
client.subscribeToIncidentUpdates(['P0', 'P1', 'P2'], (update) => {
  console.log('Incident update:', update.updateType);
  updateDashboard(update);
});

import type {
  Incident,
  Severity,
  IncidentStatus,
  CreateIncidentRequest,
  EscalationPolicy
} from '@llm-dev-ops/incident-manager-types';

const incident: Incident = {
  id: 'inc-123',
  severity: 'P1',
  status: 'NEW',
  title: 'High Latency Detected',
  // ... rest of incident fields
};

use llm_incident_manager::{
    Config,
    models::{Alert, Incident, Severity, IncidentType},
    processing::{IncidentProcessor, DeduplicationEngine},
    state::InMemoryStore,
    escalation::EscalationEngine,
    enrichment::EnrichmentService,
    correlation::CorrelationEngine,
    ml::MLService,
};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize storage
    let store = Arc::new(InMemoryStore::new());

    // Create deduplication engine
    let dedup_engine = Arc::new(DeduplicationEngine::new(store.clone(), 900));

    // Create incident processor
    let mut processor = IncidentProcessor::new(store.clone(), dedup_engine);

    // Optional: Add escalation engine
    let escalation_engine = Arc::new(EscalationEngine::new());
    processor.set_escalation_engine(escalation_engine);

    // Optional: Add ML classification
    let ml_service = Arc::new(MLService::new(Default::default()));
    ml_service.start().await?;
    processor.set_ml_service(ml_service);

    // Optional: Add context enrichment
    let enrichment_config = Default::default();
    let enrichment_service = Arc::new(
        EnrichmentService::new(enrichment_config, store.clone())
    );
    enrichment_service.start().await?;
    processor.set_enrichment_service(enrichment_service);

    // Optional: Add correlation engine
    let correlation_engine = Arc::new(
        CorrelationEngine::new(store.clone(), Default::default())
    );
    processor.set_correlation_engine(correlation_engine);

    // Process an alert
    let alert = Alert::new(
        "ext-123".to_string(),
        "monitoring".to_string(),
        "High CPU Usage".to_string(),
        "CPU usage exceeded 90% threshold".to_string(),
        Severity::P1,
        IncidentType::Infrastructure,
    );
    let ack = processor.process_alert(alert).await?;
    println!("Incident created: {:?}", ack.incident_id);
    Ok(())
}

# Database
DATABASE_URL=postgresql://user:password@localhost/incident_manager
DATABASE_POOL_SIZE=20
# Redis (optional)
REDIS_URL=redis://localhost:6379
# API Server
API_HOST=0.0.0.0
API_PORT=3000
# gRPC Server
GRPC_HOST=0.0.0.0
GRPC_PORT=50051
# Feature Flags
ENABLE_ML_CLASSIFICATION=true
ENABLE_ENRICHMENT=true
ENABLE_CORRELATION=true
ENABLE_ESCALATION=true
# Logging
RUST_LOG=info,llm_incident_manager=debug

instance_id: "standalone-001"
# Storage configuration
storage:
  type: "postgresql" # or "memory"
  connection_string: "postgresql://localhost/incident_manager"
  pool_size: 20
# ML Configuration
ml:
  enabled: true
  confidence_threshold: 0.7
  model_path: "./models"
  auto_train: true
  training_batch_size: 100
# Enrichment Configuration
enrichment:
  enabled: true
  enable_historical: true
  enable_service: true
  enable_team: true
  timeout_secs: 10
  cache_ttl_secs: 300
  async_enrichment: true
  max_concurrent: 5
  similarity_threshold: 0.5
# Correlation Configuration
correlation:
  enabled: true
  time_window_secs: 300
  min_incidents: 2
  max_group_size: 50
  enable_source: true
  enable_type: true
  enable_similarity: true
  enable_tags: true
  enable_service: true
# Escalation Configuration
escalation:
  enabled: true
  default_timeout_secs: 300
# Deduplication Configuration
deduplication:
  window_secs: 900
  fingerprint_enabled: true
# Notification Configuration
notifications:
  channels:
    - type: "email"
      enabled: true
    - type: "slack"
      enabled: true
      webhook_url: "https://hooks.slack.com/..."
    - type: "pagerduty"
      enabled: true
      integration_key: "..."

The LLM Incident Manager provides a GraphQL WebSocket API for real-time incident streaming. This allows clients to subscribe to incident events and receive immediate notifications.
Quick Start:
import { createClient } from 'graphql-ws';

const client = createClient({
  url: 'ws://localhost:8080/graphql/ws',
  connectionParams: {
    Authorization: 'Bearer YOUR_JWT_TOKEN'
  }
});

// Subscribe to critical incidents
client.subscribe(
  {
    query: `
      subscription {
        criticalIncidents {
          id
          title
          severity
          state
          createdAt
        }
      }
    `
  },
  {
    // graphql-ws passes the whole execution result; the payload is under `data`
    next: (result) => {
      console.log('Critical incident:', result.data?.criticalIncidents);
    },
    error: (error) => console.error('Subscription error:', error),
    complete: () => console.log('Subscription completed')
  }
);

Available Subscriptions:
- criticalIncidents - Subscribe to P0 and P1 incidents
- incidentUpdates - Subscribe to incident lifecycle events
- newIncidents - Subscribe to newly created incidents
- incidentStateChanges - Subscribe to state transitions
- alerts - Subscribe to incoming alert submissions
Documentation:
- WebSocket Streaming Guide - Architecture and overview
- WebSocket API Reference - Complete API documentation
- WebSocket Client Guide - Integration examples
- WebSocket Deployment Guide - Production setup
- Example Clients - TypeScript, Python, Rust examples
# Create an incident
curl -X POST http://localhost:3000/api/v1/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "source": "monitoring",
    "title": "High Memory Usage",
    "description": "Memory usage exceeded 85% threshold",
    "severity": "P2",
    "incident_type": "Infrastructure"
  }'

# Get incident
curl http://localhost:3000/api/v1/incidents/{incident_id}

# Acknowledge incident
curl -X POST http://localhost:3000/api/v1/incidents/{incident_id}/acknowledge \
  -H "Content-Type: application/json" \
  -d '{"actor": "user@example.com"}'

# Resolve incident
curl -X POST http://localhost:3000/api/v1/incidents/{incident_id}/resolve \
  -H "Content-Type: application/json" \
  -d '{
    "resolved_by": "user@example.com",
    "method": "Manual",
    "notes": "Restarted service",
    "root_cause": "Memory leak in application"
  }'

service IncidentService {
  rpc CreateIncident(CreateIncidentRequest) returns (CreateIncidentResponse);
  rpc GetIncident(GetIncidentRequest) returns (Incident);
  rpc UpdateIncident(UpdateIncidentRequest) returns (Incident);
  rpc StreamIncidents(StreamIncidentsRequest) returns (stream Incident);
  rpc AnalyzeCorrelations(AnalyzeCorrelationsRequest) returns (CorrelationResult);
}

The GraphQL API provides a flexible, type-safe interface with real-time subscriptions:
# Query incidents with advanced filtering
query GetIncidents {
  incidents(
    first: 20
    filter: {
      severity: [P0, P1]
      status: [NEW, ACKNOWLEDGED]
      environment: [PRODUCTION]
    }
    orderBy: { field: CREATED_AT, direction: DESC }
  ) {
    edges {
      node {
        id
        title
        severity
        status
        assignedTo {
          name
          email
        }
        sla {
          resolutionDeadline
          resolutionBreached
        }
      }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}

# Subscribe to real-time incident updates
subscription IncidentUpdates {
  incidentUpdated(filter: { severity: [P0, P1] }) {
    incident {
      id
      title
      status
    }
    updateType
    changedFields
  }
}

GraphQL Endpoints:
- Query/Mutation: POST http://localhost:8080/graphql
- Subscriptions: WS ws://localhost:8080/graphql
- Playground: GET http://localhost:8080/graphql/playground
Documentation:
- GraphQL API Guide - Complete API documentation with authentication, pagination, and best practices
- GraphQL Schema Reference - Full schema documentation with all types, queries, mutations, and subscriptions
- GraphQL Integration Guide - Client integration examples for Apollo Client, Relay, urql, and plain fetch
- GraphQL Development Guide - Implementation guide for extending the API
- GraphQL Examples - Common query patterns and real-world use cases
Create escalation policies and automatically escalate incidents based on time and severity:
use llm_incident_manager::escalation::{
    EscalationPolicy, EscalationLevel, EscalationTarget, TargetType,
};

// Define escalation policy
let policy = EscalationPolicy {
    name: "Critical Production Incidents".to_string(),
    levels: vec![
        EscalationLevel {
            level: 1,
            name: "L1 On-Call".to_string(),
            targets: vec![
                EscalationTarget {
                    target_type: TargetType::OnCall,
                    identifier: "platform-team".to_string(),
                }
            ],
            escalate_after_secs: 300, // 5 minutes
            channels: vec!["pagerduty".to_string(), "slack".to_string()],
        },
        EscalationLevel {
            level: 2,
            name: "Engineering Lead".to_string(),
            targets: vec![
                EscalationTarget {
                    target_type: TargetType::User,
                    identifier: "eng-lead@example.com".to_string(),
                }
            ],
            escalate_after_secs: 900, // 15 minutes
            channels: vec!["pagerduty".to_string(), "sms".to_string()],
        },
    ],
    // ... conditions
};
escalation_engine.register_policy(policy);

See ESCALATION_GUIDE.md for complete documentation.
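To make the timing in a policy like this concrete, here is a small std-only sketch that works out which level would be active a given number of seconds after an incident opens. It is not the crate's `EscalationEngine`: `Level` and `active_level` are hypothetical names, and it assumes `escalate_after_secs` means "how long this level waits without an acknowledgement before handing off to the next one", which the README does not spell out.

```rust
/// Simplified stand-in for an escalation level (hypothetical type).
pub struct Level {
    pub level: u32,
    /// Assumed meaning: seconds this level waits before escalating further.
    pub escalate_after_secs: u64,
}

/// Returns the level that should be handling an unacknowledged incident
/// `elapsed_secs` after it was opened, or None for an empty policy.
pub fn active_level(levels: &[Level], elapsed_secs: u64) -> Option<u32> {
    let mut deadline = 0u64;
    for lvl in levels {
        // Each level's timer starts when the previous level's timer expires.
        deadline = deadline.saturating_add(lvl.escalate_after_secs);
        if elapsed_secs < deadline {
            return Some(lvl.level);
        }
    }
    // All timers have expired: the incident stays with the last level.
    levels.last().map(|lvl| lvl.level)
}
```

Under that assumption, the two-level policy above keeps an incident at level 1 for the first 300 seconds and at level 2 from then on. If the real engine measures every level's timer from incident creation instead, the cumulative sum would be replaced by a direct comparison against each level's own value.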
Automatically enrich incidents with historical data, service information, and team context:
use llm_incident_manager::enrichment::{EnrichmentConfig, EnrichmentService};
let mut config = EnrichmentConfig::default();
config.enable_historical = true;
config.enable_service = true;
config.enable_team = true;
config.similarity_threshold = 0.5;
let service = EnrichmentService::new(config, store);
service.start().await?;
// The processor enriches incidents automatically; the service can also be called directly
let context = service.enrich_incident(&incident).await?;
// Access enriched data
if let Some(historical) = context.historical {
    println!("Found {} similar incidents", historical.similar_incidents.len());
}

See ENRICHMENT_GUIDE.md for complete documentation.
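The enrichment feature list mentions caching with a TTL, and the sample configuration sets `cache_ttl_secs: 300`. As an illustration of what such a cache involves, here is a minimal std-only sketch. `TtlCache` is a hypothetical type, not part of the crate, and the current time is supplied by the caller as plain seconds rather than read from a clock.

```rust
use std::collections::HashMap;

/// Hypothetical TTL cache keyed by string (for example a service name).
pub struct TtlCache<V> {
    ttl_secs: u64,
    /// key -> (time stored, cached value)
    entries: HashMap<String, (u64, V)>,
}

impl<V: Clone> TtlCache<V> {
    pub fn new(ttl_secs: u64) -> Self {
        Self { ttl_secs, entries: HashMap::new() }
    }

    /// Stores or overwrites a value, stamping it with the current time.
    pub fn put(&mut self, key: &str, value: V, now_secs: u64) {
        self.entries.insert(key.to_string(), (now_secs, value));
    }

    /// Returns a clone of the value if it is younger than the TTL.
    /// Expired entries are evicted on access.
    pub fn get(&mut self, key: &str, now_secs: u64) -> Option<V> {
        let expired = match self.entries.get(key) {
            Some((stored_at, _)) => now_secs.saturating_sub(*stored_at) >= self.ttl_secs,
            None => return None,
        };
        if expired {
            self.entries.remove(key);
            return None;
        }
        self.entries.get(key).map(|(_, value)| value.clone())
    }
}
```

Evicting lazily on read keeps the sketch short; a long-running service would likely also sweep expired entries periodically so that keys which are never read again do not accumulate.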
Group related incidents to reduce alert fatigue:
use llm_incident_manager::correlation::{CorrelationEngine, CorrelationConfig};
let mut config = CorrelationConfig::default();
config.time_window_secs = 300; // 5 minutes
config.enable_similarity = true;
config.enable_source = true;
let engine = CorrelationEngine::new(store, config);
let result = engine.analyze_incident(&incident).await?;
if result.has_correlations() {
    println!("Found {} related incidents", result.correlation_count());
}

See CORRELATION_GUIDE.md for complete documentation.
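Time-window correlation by source, one of the strategies listed for the correlation engine, can be sketched in a few lines of std-only Rust. This is an illustration under assumptions, not the crate's `CorrelationEngine`: `Incident`, `Group`, and `correlate_by_source` are hypothetical names, the input is assumed to be sorted by creation time, and the window is measured from the most recent incident in a group.

```rust
/// Hypothetical minimal incident record.
pub struct Incident<'a> {
    pub id: &'a str,
    pub source: &'a str,
    pub created_secs: u64,
}

/// A set of incidents from one source that arrived close together.
pub struct Group<'a> {
    pub source: &'a str,
    pub last_seen_secs: u64,
    pub ids: Vec<&'a str>,
}

/// Groups incidents that share a source and arrive within `window_secs`
/// of the previous incident in the same group. Groups smaller than
/// `min_incidents` are dropped. `incidents` must be sorted by time.
pub fn correlate_by_source<'a>(
    incidents: &[Incident<'a>],
    window_secs: u64,
    min_incidents: usize,
) -> Vec<Group<'a>> {
    let mut groups: Vec<Group<'a>> = Vec::new();
    for inc in incidents {
        // Look for the newest group of this source that is still within the window.
        let open = groups.iter_mut().rev().find(|g| {
            g.source == inc.source
                && inc.created_secs.saturating_sub(g.last_seen_secs) <= window_secs
        });
        match open {
            Some(group) => {
                group.ids.push(inc.id);
                group.last_seen_secs = inc.created_secs;
            }
            None => groups.push(Group {
                source: inc.source,
                last_seen_secs: inc.created_secs,
                ids: vec![inc.id],
            }),
        }
    }
    groups.retain(|g| g.ids.len() >= min_incidents);
    groups
}
```

With a 300-second window and `min_incidents` of 2 (the values in the sample configuration), three incidents from the same source at 0, 200, and 450 seconds form one group, while a fourth at 1000 seconds starts a new one that is discarded for being too small.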
Automatically classify incident severity using machine learning:
use llm_incident_manager::ml::{MLService, MLConfig};
let config = MLConfig::default();
let service = MLService::new(config);
service.start().await?;
// The processor classifies incidents automatically; the service can also be called directly
let prediction = service.predict_severity(&incident).await?;
println!("Predicted severity: {:?} (confidence: {:.2})",
    prediction.predicted_severity,
    prediction.confidence
);
// Train with feedback
service.add_training_sample(&incident).await?;
service.trigger_training().await?;

See ML_CLASSIFICATION_GUIDE.md for complete documentation.
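The README mentions a multi-model ensemble and a `confidence_threshold` (0.7 in the sample configuration) but does not say how votes are combined. The sketch below shows one plausible scheme, purely as an assumption: each model votes for a severity with a confidence, the ensemble score for a severity is the sum of its votes' confidences divided by the number of models, and a prediction below the threshold falls back to the severity reported with the alert. `Severity`, `Vote`, `ensemble`, and `classify` here are hypothetical and independent of the crate's types.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Severity {
    P0,
    P1,
    P2,
}

/// One model's prediction (hypothetical type).
pub struct Vote {
    pub severity: Severity,
    pub confidence: f64,
}

/// Combines votes into a single prediction. Ties go to the more severe
/// level because candidates are scanned from P0 downwards.
pub fn ensemble(votes: &[Vote]) -> Option<Vote> {
    if votes.is_empty() {
        return None;
    }
    let mut best: Option<Vote> = None;
    for candidate in [Severity::P0, Severity::P1, Severity::P2] {
        let total: f64 = votes
            .iter()
            .filter(|v| v.severity == candidate)
            .map(|v| v.confidence)
            .sum();
        let score = total / votes.len() as f64;
        if best.as_ref().map_or(true, |b| score > b.confidence) {
            best = Some(Vote { severity: candidate, confidence: score });
        }
    }
    best
}

/// Uses the ensemble result only when it clears the threshold;
/// otherwise keeps the severity that came with the alert.
pub fn classify(votes: &[Vote], reported: Severity, threshold: f64) -> Severity {
    match ensemble(votes) {
        Some(v) if v.confidence >= threshold => v.severity,
        _ => reported,
    }
}
```

Dividing by the total number of models means disagreement lowers the score: two models voting P1 at 0.9 and 0.8 plus one dissenting vote yield about 0.57 for P1, which would not clear a 0.7 threshold.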
Protect your system from cascading failures with automatic circuit breaking:
use llm_incident_manager::circuit_breaker::CircuitBreaker;
use std::time::Duration;

// Create circuit breaker for external service
let circuit_breaker = CircuitBreaker::new("sentinel-api")
    .failure_threshold(5)              // Open after 5 failures
    .timeout(Duration::from_secs(60))  // Wait 60s before testing recovery
    .success_threshold(2)              // Close after 2 successful tests
    .build();

// Execute request through circuit breaker
let result = circuit_breaker.call(|| async {
    sentinel_client.fetch_alerts(Some(10)).await
}).await;

match result {
    Ok(alerts) => {
        println!("Fetched {} alerts", alerts.len());
        // Return the alerts so every arm yields the same Result type
        Ok(alerts)
    }
    Err(e) if e.is_circuit_open() => {
        println!("Circuit breaker is open, using fallback");
        // Use cached data or alternative service
        let fallback_data = cache.get_alerts()?;
        Ok(fallback_data)
    }
    Err(e) => {
        println!("Request failed: {}", e);
        Err(e)
    }
}

- Three States:
  - Closed: Normal operation, requests flow through
  - Open: Service failing, requests fail immediately (< 1ms)
  - Half-Open: Testing recovery with limited requests
- Automatic Recovery:
  - Configurable timeout before recovery testing
  - Multiple recovery strategies (fixed, linear, exponential backoff)
  - Gradual traffic restoration
- Comprehensive Monitoring:
// Check circuit breaker state
let state = circuit_breaker.state().await;
println!("Circuit state: {:?}", state);
// Get detailed information
let info = circuit_breaker.info().await;
println!("Error rate: {:.2}%", info.error_rate * 100.0);
println!("Total requests: {}", info.total_requests);
println!("Failures: {}", info.failure_count);
// Health check
let health = circuit_breaker.health_check().await;

- Manual Control (for operations):
# Force open (maintenance mode)
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/open
# Force close (after maintenance)
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/close
# Reset circuit breaker
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/reset
# Get status
curl http://localhost:8080/v1/circuit-breakers/sentinel

- Configuration Example:
# config/circuit_breakers.yaml
circuit_breakers:
  sentinel:
    name: "sentinel-api"
    failure_threshold: 5
    success_threshold: 2
    timeout_secs: 60
    volume_threshold: 10
    recovery_strategy:
      type: "exponential_backoff"
      initial_timeout_secs: 60
      max_timeout_secs: 300
      multiplier: 2.0

- Prometheus Metrics:
circuit_breaker_state{name="sentinel"} 0 # 0=closed, 1=open, 2=half-open
circuit_breaker_requests_total{name="sentinel"}
circuit_breaker_requests_failed{name="sentinel"}
circuit_breaker_error_rate{name="sentinel"}
circuit_breaker_open_count{name="sentinel"}
See CIRCUIT_BREAKER_GUIDE.md for complete documentation.
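The Closed, Open, and Half-Open transitions described above can be captured in a small state machine. The sketch below is a std-only illustration, not the crate's `CircuitBreaker`: `Breaker`, `State`, `allow`, and `record` are hypothetical names, time is passed in as plain seconds, and it assumes the failure threshold counts consecutive failures.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum State {
    Closed,
    Open,
    HalfOpen,
}

/// Hypothetical breaker using the three thresholds from the example config.
pub struct Breaker {
    failure_threshold: u32,
    success_threshold: u32,
    timeout_secs: u64,
    state: State,
    failures: u32,
    successes: u32,
    opened_at: u64,
}

impl Breaker {
    pub fn new(failure_threshold: u32, success_threshold: u32, timeout_secs: u64) -> Self {
        Self {
            failure_threshold,
            success_threshold,
            timeout_secs,
            state: State::Closed,
            failures: 0,
            successes: 0,
            opened_at: 0,
        }
    }

    pub fn state(&self) -> State {
        self.state
    }

    /// Call before each request. Moves Open to HalfOpen once the timeout
    /// has elapsed, then reports whether the request may proceed.
    pub fn allow(&mut self, now_secs: u64) -> bool {
        if self.state == State::Open
            && now_secs.saturating_sub(self.opened_at) >= self.timeout_secs
        {
            self.state = State::HalfOpen;
            self.successes = 0;
        }
        self.state != State::Open
    }

    /// Call after each request that was allowed through.
    pub fn record(&mut self, success: bool, now_secs: u64) {
        match (self.state, success) {
            (State::Closed, true) => self.failures = 0,
            (State::Closed, false) => {
                self.failures += 1;
                if self.failures >= self.failure_threshold {
                    self.trip(now_secs);
                }
            }
            (State::HalfOpen, true) => {
                self.successes += 1;
                if self.successes >= self.success_threshold {
                    self.state = State::Closed;
                    self.failures = 0;
                }
            }
            // Any failure while probing reopens the circuit.
            (State::HalfOpen, false) => self.trip(now_secs),
            // Nothing should be recorded while open; ignore it if it is.
            (State::Open, _) => {}
        }
    }

    fn trip(&mut self, now_secs: u64) {
        self.state = State::Open;
        self.opened_at = now_secs;
    }
}
```

With thresholds of 5 failures, 2 successes, and a 60-second timeout, five consecutive failures open the circuit, requests are rejected for the next 60 seconds, and two successful probes after that close it again. A recovery strategy such as the exponential backoff in the configuration example would replace the fixed `timeout_secs` with a value that grows each time a probe fails.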
# Unit tests
cargo test --lib
# Integration tests
cargo test --test '*'
# All tests with coverage
cargo tarpaulin --all-features --workspace --timeout 120

- Unit Tests: 48 tests across all modules
- Integration Tests: 75+ tests covering end-to-end workflows
- Total Coverage: ~85%
| Operation | Latency (p95) | Throughput |
|---|---|---|
| Alert Processing | < 50ms | 10,000/sec |
| Incident Creation | < 100ms | 5,000/sec |
| ML Classification | < 30ms | 15,000/sec |
| Enrichment (cached) | < 5ms | 50,000/sec |
| Enrichment (uncached) | < 150ms | 3,000/sec |
| Correlation Analysis | < 80ms | 8,000/sec |
| Component | CPU | Memory | Notes |
|---|---|---|---|
| Core Processor | 2 cores | 512MB | Base requirements |
| ML Service | 2 cores | 1GB | With models loaded |
| Enrichment Service | 1 core | 256MB | With caching |
| PostgreSQL | 4 cores | 4GB | For production |
- Escalation Engine Guide - Complete escalation documentation
- Escalation Implementation - Technical details
- Storage Implementation - Storage layer details
- Correlation Guide - Correlation engine usage
- Correlation Implementation - Technical details
- ML Classification Guide - ML usage and training
- ML Implementation - Technical details
- Enrichment Guide - Context enrichment usage
- Enrichment Implementation - Technical details
- LLM Integrations Overview - Complete LLM integration guide
- LLM Architecture - Detailed architecture specs
- LLM Implementation Guide - Step-by-step implementation
- LLM Quick Reference - Fast lookup guide
- Metrics Guide - NEW: Complete metrics and observability documentation
- Metrics Implementation - NEW: Technical implementation details
- Metrics Operational Runbook - NEW: Operations and troubleshooting
- REST API: cargo doc --open
- gRPC API: See proto/ directory for Protocol Buffer definitions
- GraphQL API: Comprehensive documentation suite
- GraphQL API Guide - Complete API overview
- GraphQL Schema Reference - Full schema documentation
- GraphQL Integration Guide - Client integration examples
- GraphQL Development Guide - Implementation guide
- GraphQL Examples - Query patterns and use cases
llm-incident-manager/
├── src/
│   ├── api/                 # REST/gRPC/GraphQL APIs
│   ├── config/              # Configuration management
│   ├── correlation/         # Correlation engine
│   ├── enrichment/          # Context enrichment
│   │   ├── enrichers.rs     # Enricher implementations
│   │   ├── models.rs        # Data structures
│   │   ├── pipeline.rs      # Enrichment orchestration
│   │   └── service.rs       # Service management
│   ├── error/               # Error types
│   ├── escalation/          # Escalation engine
│   ├── grpc/                # gRPC service implementations
│   ├── integrations/        # LLM integrations (NEW)
│   │   ├── common/          # Shared utilities (client trait, retry, auth)
│   │   ├── sentinel/        # Sentinel monitoring client
│   │   ├── shield/          # Shield security client
│   │   ├── edge_agent/      # Edge-Agent distributed client
│   │   └── governance/      # Governance compliance client
│   ├── ml/                  # ML classification
│   │   ├── classifier.rs    # Classification logic
│   │   ├── features.rs      # Feature extraction
│   │   ├── models.rs        # Data structures
│   │   └── service.rs       # Service management
│   ├── models/              # Core data models
│   ├── notifications/       # Notification service
│   ├── playbooks/           # Playbook automation
│   ├── processing/          # Incident processor
│   └── state/               # Storage implementations
├── tests/                   # Integration tests
│   ├── integration_sentinel_test.rs     # Sentinel client tests
│   ├── integration_shield_test.rs       # Shield client tests
│   ├── integration_edge_agent_test.rs   # Edge-Agent client tests
│   └── integration_governance_test.rs   # Governance client tests
├── proto/                   # Protocol buffer definitions
├── migrations/              # Database migrations
└── docs/                    # Additional documentation
    ├── LLM_CLIENT_README.md                 # LLM integrations overview
    ├── LLM_CLIENT_ARCHITECTURE.md           # Detailed architecture
    ├── LLM_CLIENT_IMPLEMENTATION_GUIDE.md   # Implementation guide
    ├── LLM_CLIENT_QUICK_REFERENCE.md        # Quick reference
    └── llm-client-types.ts                  # TypeScript type definitions
We welcome contributions! Please see CONTRIBUTING.md for guidelines.
# Format code
cargo fmt
# Lint
cargo clippy --all-features
# Check
cargo check --all-features

# Development mode with hot reload
cargo watch -x run
# With debug logging
RUST_LOG=debug cargo run
# With specific features
cargo run --features "postgresql,redis"

FROM rust:1.75 as builder
WORKDIR /app
COPY . .
RUN cargo build --release
FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/llm-incident-manager /usr/local/bin/
EXPOSE 8080 50051 9090
CMD ["llm-incident-manager"]

Or build an image from the npm package:
FROM node:20-slim
# Install Rust for building
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"
# Install the server
RUN npm install -g @llm-dev-ops/llm-incident-manager
# Build Rust binaries
WORKDIR /app
RUN npm run build
EXPOSE 8080 50051 9090
CMD ["llm-incident-manager"]

apiVersion: apps/v1
kind: Deployment
metadata:
  name: incident-manager
spec:
  replicas: 3
  # apps/v1 Deployments require a selector that matches the pod template labels
  selector:
    matchLabels:
      app: incident-manager
  template:
    metadata:
      labels:
        app: incident-manager
    spec:
      containers:
        - name: incident-manager
          image: llm-incident-manager:latest
          ports:
            - containerPort: 3000
            - containerPort: 50051
          env:
            - name: DATABASE_URL
              valueFrom:
                secretKeyRef:
                  name: incident-manager-secrets
                  key: database-url

The system exposes comprehensive metrics on port 9090 (configurable via LLM_IM__SERVER__METRICS_PORT).
Integration Metrics (per LLM integration):
llm_integration_requests_total{integration="sentinel|shield|edge-agent|governance"}
llm_integration_requests_successful{integration="..."}
llm_integration_requests_failed{integration="..."}
llm_integration_success_rate_percent{integration="..."}
llm_integration_latency_milliseconds_average{integration="..."}
llm_integration_last_request_timestamp{integration="..."}
Core System Metrics:
incident_manager_alerts_processed_total
incident_manager_incidents_created_total
incident_manager_incidents_resolved_total
incident_manager_escalations_triggered_total
incident_manager_enrichment_duration_seconds
incident_manager_enrichment_cache_hit_rate
incident_manager_correlation_groups_created_total
incident_manager_ml_predictions_total
incident_manager_ml_prediction_confidence
incident_manager_notifications_sent_total
incident_manager_processing_duration_seconds
Quick Access:
# Prometheus format
curl http://localhost:9090/metrics
# JSON format
curl http://localhost:8080/v1/metrics/integrations

For complete metrics documentation, dashboards, and alerting:
- Metrics Guide - Metrics catalog and configuration
- Operational Runbook - Troubleshooting and alerts
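The feature list describes metric recording as lock-free atomic operations. A minimal std-only sketch of that approach follows; `IntegrationMetrics`, `record`, `average_latency_ms`, and `render` are hypothetical names rather than the crate's API, and only two of the integration metric names listed above are rendered, in Prometheus text exposition format.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-integration counters, shareable across threads
/// behind an Arc without any mutex.
#[derive(Default)]
pub struct IntegrationMetrics {
    requests_total: AtomicU64,
    requests_failed: AtomicU64,
    latency_ms_sum: AtomicU64,
}

impl IntegrationMetrics {
    /// Records one request. Relaxed ordering is enough because each
    /// counter is independent and only ever read for reporting.
    pub fn record(&self, success: bool, latency_ms: u64) {
        self.requests_total.fetch_add(1, Ordering::Relaxed);
        if !success {
            self.requests_failed.fetch_add(1, Ordering::Relaxed);
        }
        self.latency_ms_sum.fetch_add(latency_ms, Ordering::Relaxed);
    }

    /// Mean latency in whole milliseconds, or 0 before the first request.
    pub fn average_latency_ms(&self) -> u64 {
        let total = self.requests_total.load(Ordering::Relaxed);
        if total == 0 {
            return 0;
        }
        self.latency_ms_sum.load(Ordering::Relaxed) / total
    }

    /// Renders the two counters as Prometheus text lines.
    pub fn render(&self, integration: &str) -> String {
        format!(
            "llm_integration_requests_total{{integration=\"{0}\"}} {1}\n\
             llm_integration_requests_failed{{integration=\"{0}\"}} {2}\n",
            integration,
            self.requests_total.load(Ordering::Relaxed),
            self.requests_failed.load(Ordering::Relaxed),
        )
    }
}
```

Because the counters are read one at a time, a scrape that races with `record` can see a total that is one request ahead of the failure count; for monitoring purposes that small skew is usually acceptable and is the price of avoiding a lock.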
# Liveness probe
curl http://localhost:8080/health/live
# Readiness probe
curl http://localhost:8080/health/ready
# Full health status with metrics
curl http://localhost:8080/health

- API Key authentication
- mTLS for gRPC
- JWT tokens for WebSocket
- Encrypted at rest (PostgreSQL encryption)
- TLS 1.3 in transit
- Sensitive data redaction in logs
Please report security issues to: security@example.com
This project is licensed under the MIT License - see the LICENSE file for details.
- Rust - Systems programming language
- Tokio - Async runtime
- PostgreSQL - Primary database
- SQLx - SQL toolkit
- Tonic - gRPC implementation
- Axum - Web framework
- Serde - Serialization framework
- SmartCore - Machine learning library
- Tracing - Structured logging
Designed and implemented for enterprise-grade LLM infrastructure management with a focus on reliability, performance, and extensibility.
Status: Production Ready | Version: 1.0.1 | Language: Rust | Last Updated: 2025-11-14
Published Packages:
- Cargo: llm-incident-manager v1.0.1 (crates.io)
- npm Server: @llm-dev-ops/llm-incident-manager v1.0.1 (npmjs)
- npm Types: @llm-dev-ops/incident-manager-types v1.0.1 (npmjs)
- npm Client: @llm-dev-ops/incident-manager-client v1.0.1 (npmjs)
The complete incident management server with npm CLI tooling for easy installation and operation.
# Install globally
npm install -g @llm-dev-ops/llm-incident-manager
# Available commands
llm-im # CLI tool
llm-incident-manager # Start the server
npm run build # Build Rust binaries
npm run health # Check health status
npm run metrics # View Prometheus metrics
npm run graphql # Open GraphQL Playground

Features:
- Rust-based high-performance server
- npm wrapper for easy installation
- Automated build scripts
- Health check and metrics endpoints
- Full REST, gRPC, and GraphQL APIs
Comprehensive TypeScript type definitions (2,400+ lines) for the entire incident management system.
npm install @llm-dev-ops/incident-manager-types

import type {
  // Core incident types
  Incident,
  RawEvent,
  IncidentEvent,
  Severity,
  IncidentStatus,
  // LLM integration types
  LLMRequest,
  LLMResponse,
  SentinelLLMConfig,
  ShieldLLMConfig,
  EdgeAgentLLMConfig,
  GovernanceLLMConfig,
  // Policy & workflow types
  EscalationPolicy,
  NotificationTemplate,
  RoutingRule,
  Playbook,
  // Analytics types
  IncidentAnalytics,
  TeamMetrics,
  PostMortem
} from '@llm-dev-ops/incident-manager-types';

Includes:
- Complete incident management data models
- LLM client integration types (Sentinel, Shield, Edge-Agent, Governance)
- Escalation, notification, and routing types
- API request/response types
- Analytics and metrics types
- Zero dependencies, pure TypeScript
WebSocket/GraphQL client SDK for real-time incident streaming.
npm install @llm-dev-ops/incident-manager-client
# Node.js also requires ws
npm install ws

import { IncidentManagerClient } from '@llm-dev-ops/incident-manager-client';

const client = new IncidentManagerClient({
  wsUrl: 'ws://localhost:8080/graphql/ws',
  authToken: 'your-jwt-token',
  retryAttempts: 10
});

// Subscribe to critical incidents
client.subscribeToCriticalIncidents((incident) => {
  console.log('Critical incident:', incident);
});

// Subscribe to updates
client.subscribeToIncidentUpdates(['P0', 'P1'], (update) => {
  console.log('Update:', update);
});

// Subscribe to new incidents
client.subscribeToNewIncidents((incident) => {
  console.log('New incident:', incident);
});

// Subscribe to state changes
client.subscribeToStateChanges((change) => {
  console.log('State change:', change);
});

// Subscribe to all alerts
client.subscribeToAlerts((alert) => {
  console.log('Alert:', alert);
});

Features:
- Real-time WebSocket streaming
- Auto-reconnection with exponential backoff
- Full TypeScript support
- GraphQL subscriptions
- Works in browser and Node.js
- Multiple subscription helpers
npm install @llm-dev-ops/incident-manager-client @llm-dev-ops/incident-manager-types

import { useEffect, useState } from 'react';
import { IncidentManagerClient } from '@llm-dev-ops/incident-manager-client';
import type { Incident } from '@llm-dev-ops/incident-manager-types';

function IncidentDashboard() {
  const [criticalIncidents, setCriticalIncidents] = useState<Incident[]>([]);

  useEffect(() => {
    const client = new IncidentManagerClient({
      wsUrl: 'ws://your-server.com/graphql/ws',
      authToken: getAuthToken()
    });
    client.subscribeToCriticalIncidents((incident) => {
      setCriticalIncidents(prev => [...prev, incident]);
      showNotification(incident);
    });
    return () => client.close();
  }, []);

  return (
    <div>
      <h1>Critical Incidents</h1>
      {criticalIncidents.map(incident => (
        <IncidentCard key={incident.id} incident={incident} />
      ))}
    </div>
  );
}

- Published to crates.io: Rust library and binaries available via cargo install llm-incident-manager
- Published to npm (3 packages):
  - @llm-dev-ops/llm-incident-manager - Server with npm CLI tooling
  - @llm-dev-ops/incident-manager-types - TypeScript type definitions (2,400+ lines)
  - @llm-dev-ops/incident-manager-client - WebSocket/GraphQL client SDK
- All warnings resolved: Fixed 83 compiler warnings for clean crates.io publication
- Version sync: All packages aligned at v1.0.1
- Complete documentation: Updated README with installation options, examples, and ecosystem guide
- Implemented enterprise-grade LLM client integrations for Sentinel, Shield, Edge-Agent, and Governance
- 5,913 lines of production Rust code with comprehensive error handling
- 1,578 lines of integration tests (78 test cases)
- Multi-framework compliance support (GDPR, HIPAA, SOC2, PCI, ISO27001)
- gRPC bidirectional streaming for Edge-Agent
- Exponential backoff retry logic with jitter
- Complete documentation suite in /docs