LLM Incident Manager

Overview

LLM Incident Manager is an enterprise-grade, production-ready incident management system built in Rust, designed specifically for LLM DevOps ecosystems. It provides intelligent incident detection, classification, enrichment, correlation, routing, escalation, and automated resolution capabilities for modern LLM infrastructure.

Available on:

🦀 crates.io - Rust library and binaries
📦 npm - Server with npm CLI tooling
📘 npm types - TypeScript type definitions
🔌 npm client - JavaScript/TypeScript client SDK

Key Features

Core Capabilities

🚀 High Performance: Built in Rust with async/await for maximum throughput and minimal latency
🤖 ML-Powered Classification: Machine learning-based incident classification with confidence scoring
🔍 Context Enrichment: Automatic enrichment with historical data, service info, and team context
🔗 Intelligent Correlation: Groups related incidents to reduce alert fatigue
⚡ Smart Escalation: Policy-based escalation with multi-level notification chains
📊 Persistent Storage: PostgreSQL and in-memory storage implementations
🎯 Smart Routing: Policy-based routing with team and severity-based rules
🔔 Multi-Channel Notifications: Email, Slack, PagerDuty, webhooks
🤝 Automated Playbooks: Execute automated remediation workflows
📝 Complete Audit Trail: Full incident lifecycle tracking

Implemented Subsystems

1. Escalation Engine ✅

Multi-level escalation policies
Time-based automatic escalation
Configurable notification channels per level
Target types: Users, Teams, On-Call schedules
Pause/resume/resolve escalation flows
Real-time escalation state tracking
Documentation: ESCALATION_GUIDE.md

2. Persistent Storage ✅

PostgreSQL backend with connection pooling
In-memory storage for testing/development
Trait-based abstraction for extensibility
Transaction support for data consistency
Full incident lifecycle persistence
Query optimizations and indexing
Documentation: STORAGE_IMPLEMENTATION.md

3. Correlation Engine ✅

Time-window based correlation
Multi-strategy correlation: Source, Type, Similarity, Tag, Service
Dynamic correlation groups
Configurable thresholds and windows
Pattern detection across incidents
Graph-based relationship tracking
Documentation: CORRELATION_GUIDE.md

4. ML Classification ✅

Automated severity classification
Multi-model ensemble architecture
Feature extraction from incidents
Confidence scoring
Incremental learning with feedback
Model versioning and persistence
Real-time classification API
Documentation: ML_CLASSIFICATION_GUIDE.md

5. Context Enrichment ✅

Historical incident analysis with similarity matching
Service catalog integration (CMDB)
Team and on-call information
External API integrations (Prometheus, Elasticsearch)
Parallel enrichment pipeline
Intelligent caching with TTL
Configurable enrichers and priorities
Documentation: ENRICHMENT_GUIDE.md

6. Deduplication Engine ✅

Fingerprint-based duplicate detection
Time-window deduplication
Automatic incident merging
Alert correlation

7. Notification Service ✅

Multi-channel delivery (Email, Slack, PagerDuty)
Template-based formatting
Rate limiting and throttling
Delivery confirmation

8. Playbook Automation ✅

Trigger-based playbook execution
Step-by-step action execution
Auto-execution on incident creation
Manual playbook execution

9. Routing Engine ✅

Rule-based incident routing
Team assignment suggestions
Severity-based routing
Service-aware routing

10. LLM Integrations ✅

Sentinel Client: Monitoring & anomaly detection with ML-powered analysis
Shield Client: Security threat analysis and mitigation planning
Edge-Agent Client: Distributed edge inference with offline queue management
Governance Client: Multi-framework compliance (GDPR, HIPAA, SOC2, PCI, ISO27001)
Enterprise features: Exponential backoff retry, circuit breaker, rate limiting
Comprehensive error handling and observability

11. GraphQL API with WebSocket Streaming ✅

Full-featured GraphQL API alongside REST
Real-time WebSocket subscriptions for incident updates
Type-safe schema with queries, mutations, and subscriptions
DataLoaders for efficient batch loading and N+1 prevention
GraphQL Playground for interactive API exploration
Support for filtering, pagination, and complex queries
Documentation: GRAPHQL_GUIDE.md, WEBSOCKET_STREAMING_GUIDE.md

12. Metrics & Observability ✅

Prometheus Integration: Native Prometheus metrics export on port 9090
Real-time Performance Tracking: Request rates, latency, success/error rates
Integration Metrics: Per-integration monitoring (Sentinel, Shield, Edge-Agent, Governance)
System Metrics: Processing pipeline, correlation, enrichment, ML classification
Zero-Overhead Collection: Lock-free atomic operations with <1μs recording time
Grafana Dashboards: Pre-built dashboards for system overview and deep-dive analysis
Alert Rules: Production-ready alerting for critical conditions
Documentation: METRICS_GUIDE.md | Implementation | Runbook

13. Circuit Breaker Pattern ✅

Resilience Pattern: Prevent cascading failures with automatic circuit breaking
State Management: Closed, Open, and Half-Open states with intelligent transitions
Per-Service Configuration: Individual circuit breakers for each external dependency
Fast Failure: Millisecond response time when circuit is open (vs. 30s+ timeouts)
Automatic Recovery: Self-healing with configurable recovery strategies
Fallback Support: Graceful degradation with fallback mechanisms
Comprehensive Metrics: Real-time state tracking and Prometheus integration
Manual Control: API endpoints for operational override and testing
Documentation: CIRCUIT_BREAKER_GUIDE.md | API Reference | Integration Guide | Operations

Architecture

System Architecture

┌──────────────────────────────────────────────────────────────────────┐
│                        LLM Incident Manager                          │
├──────────────────────────────────────────────────────────────────────┤
│                                                                       │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────────────────┐  │
│  │   REST API   │  │   gRPC API   │  │    GraphQL API           │  │
│  │  (HTTP/JSON) │  │ (Protobuf)   │  │ (Queries/Mutations/Subs) │  │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────────────────┘  │
│         │                  │                  │                       │
│         └──────────────────┼──────────────────┘                      │
│                            ▼                                          │
│                 ┌─────────────────────┐                              │
│                 │ IncidentProcessor   │                              │
│                 │  - Deduplication    │                              │
│                 │  - Classification   │                              │
│                 │  - Enrichment       │                              │
│                 │  - Correlation      │                              │
│                 └─────────┬───────────┘                              │
│                           │                                           │
│         ┌─────────────────┼─────────────────┐                       │
│         ▼                 ▼                 ▼                        │
│  ┌─────────────┐  ┌─────────────┐  ┌─────────────┐                │
│  │  Escalation │  │ Notification │  │  Playbook   │                │
│  │   Engine    │  │   Service    │  │   Service   │                │
│  └─────────────┘  └─────────────┘  └─────────────┘                │
│         │                 │                 │                        │
│         └─────────────────┼─────────────────┘                       │
│                           ▼                                           │
│                 ┌─────────────────────┐                              │
│                 │   Storage Layer     │                              │
│                 │  - PostgreSQL       │                              │
│                 │  - In-Memory        │                              │
│                 └─────────────────────┘                              │
└──────────────────────────────────────────────────────────────────────┘

Data Flow

Alert → Deduplication → ML Classification → Context Enrichment
                                                     ↓
                                              Correlation
                                                     ↓
                        Routing ← ─ ─ ─ ─ ─ ─ ─ ─ ─ ┘
                           ↓
        ┌──────────────────┼──────────────────┐
        ▼                  ▼                   ▼
  Notifications      Escalation           Playbooks

Quick Start

Prerequisites

For Rust/Cargo:

Rust 1.75+ (2021 edition)
PostgreSQL 14+ (optional, for persistent storage)
Redis (optional, for distributed caching)

For npm:

Node.js 16.0+
npm 7.0+

Installation Options

Option 1: Install via Cargo (Rust)

# Install from crates.io
cargo install llm-incident-manager

# Or add as dependency in Cargo.toml
[dependencies]
llm-incident-manager = "1.0.1"

Option 2: Install via npm (Global CLI)

# Install the server globally
npm install -g @llm-dev-ops/llm-incident-manager

# Build the Rust binaries
npm run build

# Start the server
npm start

# Or run directly
llm-incident-manager

Option 3: Install Client SDK (TypeScript/JavaScript)

# Install the WebSocket/GraphQL client
npm install @llm-dev-ops/incident-manager-client

# Install type definitions (TypeScript)
npm install @llm-dev-ops/incident-manager-types

Option 4: Install from Source

# Clone repository
git clone https://github.com/globalbusinessadvisors/llm-incident-manager.git
cd llm-incident-manager

# Build with Cargo
cargo build --release

# Or build with npm
npm install
npm run build

# Run tests
cargo test --all-features

# Run with default configuration (in-memory storage)
cargo run --release

Quick Start Examples

Running the Server

# From cargo installation
llm-incident-manager

# From npm installation
npm start

# Or with environment variables
DATABASE_URL=postgresql://localhost/incident_manager \
API_PORT=8080 \
GRPC_PORT=50051 \
  llm-incident-manager

Using the Client SDK (JavaScript/TypeScript)

import { IncidentManagerClient } from '@llm-dev-ops/incident-manager-client';

const client = new IncidentManagerClient({
  wsUrl: 'ws://localhost:8080/graphql/ws',
  authToken: 'your-jwt-token'
});

// Subscribe to critical incidents (P0 and P1)
client.subscribeToCriticalIncidents((incident) => {
  console.log('🚨 Critical incident:', incident.title);
  console.log('   Severity:', incident.severity);

  // Trigger alerts, send to PagerDuty, etc.
  if (incident.severity === 'P0') {
    sendPagerDutyAlert(incident);
  }
});

// Subscribe to all incident updates
client.subscribeToIncidentUpdates(['P0', 'P1', 'P2'], (update) => {
  console.log('📊 Incident update:', update.updateType);
  updateDashboard(update);
});

Using Type Definitions (TypeScript)

import type {
  Incident,
  Severity,
  IncidentStatus,
  CreateIncidentRequest,
  EscalationPolicy
} from '@llm-dev-ops/incident-manager-types';

const incident: Incident = {
  id: 'inc-123',
  severity: 'P1',
  status: 'NEW',
  title: 'High Latency Detected',
  // ... rest of incident fields
};

Basic Usage

use llm_incident_manager::{
    Config,
    models::{Alert, Incident, Severity, IncidentType},
    processing::{IncidentProcessor, DeduplicationEngine},
    state::InMemoryStore,
    escalation::EscalationEngine,
    enrichment::EnrichmentService,
    correlation::CorrelationEngine,
    ml::MLService,
};
use std::sync::Arc;

#[tokio::main]
async fn main() -> Result<(), Box<dyn std::error::Error>> {
    // Initialize storage
    let store = Arc::new(InMemoryStore::new());

    // Create deduplication engine
    let dedup_engine = Arc::new(DeduplicationEngine::new(store.clone(), 900));

    // Create incident processor
    let mut processor = IncidentProcessor::new(store.clone(), dedup_engine);

    // Optional: Add escalation engine
    let escalation_engine = Arc::new(EscalationEngine::new());
    processor.set_escalation_engine(escalation_engine);

    // Optional: Add ML classification
    let ml_service = Arc::new(MLService::new(Default::default()));
    ml_service.start().await?;
    processor.set_ml_service(ml_service);

    // Optional: Add context enrichment
    let enrichment_config = Default::default();
    let enrichment_service = Arc::new(
        EnrichmentService::new(enrichment_config, store.clone())
    );
    enrichment_service.start().await?;
    processor.set_enrichment_service(enrichment_service);

    // Optional: Add correlation engine
    let correlation_engine = Arc::new(
        CorrelationEngine::new(store.clone(), Default::default())
    );
    processor.set_correlation_engine(correlation_engine);

    // Process an alert
    let alert = Alert::new(
        "ext-123".to_string(),
        "monitoring".to_string(),
        "High CPU Usage".to_string(),
        "CPU usage exceeded 90% threshold".to_string(),
        Severity::P1,
        IncidentType::Infrastructure,
    );

    let ack = processor.process_alert(alert).await?;
    println!("Incident created: {:?}", ack.incident_id);

    Ok(())
}

Configuration

Environment Variables

# Database
DATABASE_URL=postgresql://user:password@localhost/incident_manager
DATABASE_POOL_SIZE=20

# Redis (optional)
REDIS_URL=redis://localhost:6379

# API Server
API_HOST=0.0.0.0
API_PORT=3000

# gRPC Server
GRPC_HOST=0.0.0.0
GRPC_PORT=50051

# Feature Flags
ENABLE_ML_CLASSIFICATION=true
ENABLE_ENRICHMENT=true
ENABLE_CORRELATION=true
ENABLE_ESCALATION=true

# Logging
RUST_LOG=info,llm_incident_manager=debug

Configuration File (config.yaml)

instance_id: "standalone-001"

# Storage configuration
storage:
  type: "postgresql"  # or "memory"
  connection_string: "postgresql://localhost/incident_manager"
  pool_size: 20

# ML Configuration
ml:
  enabled: true
  confidence_threshold: 0.7
  model_path: "./models"
  auto_train: true
  training_batch_size: 100

# Enrichment Configuration
enrichment:
  enabled: true
  enable_historical: true
  enable_service: true
  enable_team: true
  timeout_secs: 10
  cache_ttl_secs: 300
  async_enrichment: true
  max_concurrent: 5
  similarity_threshold: 0.5

# Correlation Configuration
correlation:
  enabled: true
  time_window_secs: 300
  min_incidents: 2
  max_group_size: 50
  enable_source: true
  enable_type: true
  enable_similarity: true
  enable_tags: true
  enable_service: true

# Escalation Configuration
escalation:
  enabled: true
  default_timeout_secs: 300

# Deduplication Configuration
deduplication:
  window_secs: 900
  fingerprint_enabled: true

# Notification Configuration
notifications:
  channels:
    - type: "email"
      enabled: true
    - type: "slack"
      enabled: true
      webhook_url: "https://hooks.slack.com/..."
    - type: "pagerduty"
      enabled: true
      integration_key: "..."

API Examples

WebSocket Streaming (Real-Time Updates)

The LLM Incident Manager provides a GraphQL WebSocket API for real-time incident streaming. This allows clients to subscribe to incident events and receive immediate notifications.

Quick Start:

import { createClient } from 'graphql-ws';

const client = createClient({
  url: 'ws://localhost:8080/graphql/ws',
  connectionParams: {
    Authorization: 'Bearer YOUR_JWT_TOKEN'
  }
});

// Subscribe to critical incidents
client.subscribe(
  {
    query: `
      subscription {
        criticalIncidents {
          id
          title
          severity
          state
          createdAt
        }
      }
    `
  },
  {
    next: (data) => {
      console.log('Critical incident:', data.criticalIncidents);
    },
    error: (error) => console.error('Subscription error:', error),
    complete: () => console.log('Subscription completed')
  }
);

Available Subscriptions:

criticalIncidents - Subscribe to P0 and P1 incidents
incidentUpdates - Subscribe to incident lifecycle events
newIncidents - Subscribe to newly created incidents
incidentStateChanges - Subscribe to state transitions
alerts - Subscribe to incoming alert submissions

Documentation:

WebSocket Streaming Guide - Architecture and overview
WebSocket API Reference - Complete API documentation
WebSocket Client Guide - Integration examples
WebSocket Deployment Guide - Production setup
Example Clients - TypeScript, Python, Rust examples

REST API

# Create an incident
curl -X POST http://localhost:3000/api/v1/incidents \
  -H "Content-Type: application/json" \
  -d '{
    "source": "monitoring",
    "title": "High Memory Usage",
    "description": "Memory usage exceeded 85% threshold",
    "severity": "P2",
    "incident_type": "Infrastructure"
  }'

# Get incident
curl http://localhost:3000/api/v1/incidents/{incident_id}

# Acknowledge incident
curl -X POST http://localhost:3000/api/v1/incidents/{incident_id}/acknowledge \
  -H "Content-Type: application/json" \
  -d '{"actor": "user@example.com"}'

# Resolve incident
curl -X POST http://localhost:3000/api/v1/incidents/{incident_id}/resolve \
  -H "Content-Type: application/json" \
  -d '{
    "resolved_by": "user@example.com",
    "method": "Manual",
    "notes": "Restarted service",
    "root_cause": "Memory leak in application"
  }'

gRPC API

service IncidentService {
  rpc CreateIncident(CreateIncidentRequest) returns (CreateIncidentResponse);
  rpc GetIncident(GetIncidentRequest) returns (Incident);
  rpc UpdateIncident(UpdateIncidentRequest) returns (Incident);
  rpc StreamIncidents(StreamIncidentsRequest) returns (stream Incident);
  rpc AnalyzeCorrelations(AnalyzeCorrelationsRequest) returns (CorrelationResult);
}

GraphQL API

The GraphQL API provides a flexible, type-safe interface with real-time subscriptions:

# Query incidents with advanced filtering
query GetIncidents {
  incidents(
    first: 20
    filter: {
      severity: [P0, P1]
      status: [NEW, ACKNOWLEDGED]
      environment: [PRODUCTION]
    }
    orderBy: { field: CREATED_AT, direction: DESC }
  ) {
    edges {
      node {
        id
        title
        severity
        status
        assignedTo {
          name
          email
        }
        sla {
          resolutionDeadline
          resolutionBreached
        }
      }
    }
    pageInfo {
      hasNextPage
      endCursor
    }
  }
}

# Subscribe to real-time incident updates
subscription IncidentUpdates {
  incidentUpdated(filter: { severity: [P0, P1] }) {
    incident {
      id
      title
      status
    }
    updateType
    changedFields
  }
}

GraphQL Endpoints:

Query/Mutation: POST http://localhost:8080/graphql
Subscriptions: WS ws://localhost:8080/graphql
Playground: GET http://localhost:8080/graphql/playground

Documentation:

GraphQL API Guide - Complete API documentation with authentication, pagination, and best practices
GraphQL Schema Reference - Full schema documentation with all types, queries, mutations, and subscriptions
GraphQL Integration Guide - Client integration examples for Apollo Client, Relay, urql, and plain fetch
GraphQL Development Guide - Implementation guide for extending the API
GraphQL Examples - Common query patterns and real-world use cases

Feature Guides

1. Escalation Engine

Create escalation policies and automatically escalate incidents based on time and severity:

use llm_incident_manager::escalation::{
    EscalationPolicy, EscalationLevel, EscalationTarget, TargetType,
};

// Define escalation policy
let policy = EscalationPolicy {
    name: "Critical Production Incidents".to_string(),
    levels: vec![
        EscalationLevel {
            level: 1,
            name: "L1 On-Call".to_string(),
            targets: vec![
                EscalationTarget {
                    target_type: TargetType::OnCall,
                    identifier: "platform-team".to_string(),
                }
            ],
            escalate_after_secs: 300,  // 5 minutes
            channels: vec!["pagerduty".to_string(), "slack".to_string()],
        },
        EscalationLevel {
            level: 2,
            name: "Engineering Lead".to_string(),
            targets: vec![
                EscalationTarget {
                    target_type: TargetType::User,
                    identifier: "eng-lead@example.com".to_string(),
                }
            ],
            escalate_after_secs: 900,  // 15 minutes
            channels: vec!["pagerduty".to_string(), "sms".to_string()],
        },
    ],
    // ... conditions
};

escalation_engine.register_policy(policy);

See ESCALATION_GUIDE.md for complete documentation.

2. Context Enrichment

Automatically enrich incidents with historical data, service information, and team context:

use llm_incident_manager::enrichment::{EnrichmentConfig, EnrichmentService};

let mut config = EnrichmentConfig::default();
config.enable_historical = true;
config.enable_service = true;
config.enable_team = true;
config.similarity_threshold = 0.5;

let service = EnrichmentService::new(config, store);
service.start().await?;

// Enrichment happens automatically in the processor
let context = service.enrich_incident(&incident).await?;

// Access enriched data
if let Some(historical) = context.historical {
    println!("Found {} similar incidents", historical.similar_incidents.len());
}

See ENRICHMENT_GUIDE.md for complete documentation.

3. Correlation Engine

Group related incidents to reduce alert fatigue:

use llm_incident_manager::correlation::{CorrelationEngine, CorrelationConfig};

let mut config = CorrelationConfig::default();
config.time_window_secs = 300;  // 5 minutes
config.enable_similarity = true;
config.enable_source = true;

let engine = CorrelationEngine::new(store, config);
let result = engine.analyze_incident(&incident).await?;

if result.has_correlations() {
    println!("Found {} related incidents", result.correlation_count());
}

See CORRELATION_GUIDE.md for complete documentation.

4. ML Classification

Automatically classify incident severity using machine learning:

use llm_incident_manager::ml::{MLService, MLConfig};

let config = MLConfig::default();
let service = MLService::new(config);
service.start().await?;

// Classification happens automatically
let prediction = service.predict_severity(&incident).await?;
println!("Predicted severity: {:?} (confidence: {:.2})",
    prediction.predicted_severity,
    prediction.confidence
);

// Train with feedback
service.add_training_sample(&incident).await?;
service.trigger_training().await?;

See ML_CLASSIFICATION_GUIDE.md for complete documentation.

5. Circuit Breakers

Protect your system from cascading failures with automatic circuit breaking:

use llm_incident_manager::circuit_breaker::CircuitBreaker;
use std::time::Duration;

// Create circuit breaker for external service
let circuit_breaker = CircuitBreaker::new("sentinel-api")
    .failure_threshold(5)       // Open after 5 failures
    .timeout(Duration::from_secs(60))  // Wait 60s before testing recovery
    .success_threshold(2)       // Close after 2 successful tests
    .build();

// Execute request through circuit breaker
let result = circuit_breaker.call(|| async {
    sentinel_client.fetch_alerts(Some(10)).await
}).await;

match result {
    Ok(alerts) => {
        println!("Fetched {} alerts", alerts.len());
    }
    Err(e) if e.is_circuit_open() => {
        println!("Circuit breaker is open, using fallback");
        // Use cached data or alternative service
        let fallback_data = cache.get_alerts()?;
        Ok(fallback_data)
    }
    Err(e) => {
        println!("Request failed: {}", e);
        Err(e)
    }
}

Key Features

Three States:
- Closed: Normal operation, requests flow through
- Open: Service failing, requests fail immediately (< 1ms)
- Half-Open: Testing recovery with limited requests
Automatic Recovery:
- Configurable timeout before recovery testing
- Multiple recovery strategies (fixed, linear, exponential backoff)
- Gradual traffic restoration
Comprehensive Monitoring:

// Check circuit breaker state
let state = circuit_breaker.state().await;
println!("Circuit state: {:?}", state);

// Get detailed information
let info = circuit_breaker.info().await;
println!("Error rate: {:.2}%", info.error_rate * 100.0);
println!("Total requests: {}", info.total_requests);
println!("Failures: {}", info.failure_count);

// Health check
let health = circuit_breaker.health_check().await;

Manual Control (for operations):

# Force open (maintenance mode)
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/open

# Force close (after maintenance)
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/close

# Reset circuit breaker
curl -X POST http://localhost:8080/v1/circuit-breakers/sentinel/reset

# Get status
curl http://localhost:8080/v1/circuit-breakers/sentinel

Configuration Example:

# config/circuit_breakers.yaml
circuit_breakers:
  sentinel:
    name: "sentinel-api"
    failure_threshold: 5
    success_threshold: 2
    timeout_secs: 60
    volume_threshold: 10
    recovery_strategy:
      type: "exponential_backoff"
      initial_timeout_secs: 60
      max_timeout_secs: 300
      multiplier: 2.0

Prometheus Metrics:

circuit_breaker_state{name="sentinel"} 0           # 0=closed, 1=open, 2=half-open
circuit_breaker_requests_total{name="sentinel"}
circuit_breaker_requests_failed{name="sentinel"}
circuit_breaker_error_rate{name="sentinel"}
circuit_breaker_open_count{name="sentinel"}

See CIRCUIT_BREAKER_GUIDE.md for complete documentation.

Testing

Run All Tests

# Unit tests
cargo test --lib

# Integration tests
cargo test --test '*'

# All tests with coverage
cargo tarpaulin --all-features --workspace --timeout 120

Test Coverage

Unit Tests: 48 tests across all modules
Integration Tests: 75+ tests covering end-to-end workflows
Total Coverage: ~85%

Performance

Benchmarks

Operation	Latency (p95)	Throughput
Alert Processing	< 50ms	10,000/sec
Incident Creation	< 100ms	5,000/sec
ML Classification	< 30ms	15,000/sec
Enrichment (cached)	< 5ms	50,000/sec
Enrichment (uncached)	< 150ms	3,000/sec
Correlation Analysis	< 80ms	8,000/sec

Resource Requirements

Component	CPU	Memory	Notes
Core Processor	2 cores	512MB	Base requirements
ML Service	2 cores	1GB	With models loaded
Enrichment Service	1 core	256MB	With caching
PostgreSQL	4 cores	4GB	For production

Documentation

Implementation Guides

Escalation Engine Guide - Complete escalation documentation
Escalation Implementation - Technical details
Storage Implementation - Storage layer details
Correlation Guide - Correlation engine usage
Correlation Implementation - Technical details
ML Classification Guide - ML usage and training
ML Implementation - Technical details
Enrichment Guide - Context enrichment usage
Enrichment Implementation - Technical details
LLM Integrations Overview - Complete LLM integration guide
LLM Architecture - Detailed architecture specs
LLM Implementation Guide - Step-by-step implementation
LLM Quick Reference - Fast lookup guide
Metrics Guide - NEW: Complete metrics and observability documentation
Metrics Implementation - NEW: Technical implementation details
Metrics Operational Runbook - NEW: Operations and troubleshooting

API Documentation

REST API: cargo doc --open
gRPC API: See proto/ directory for Protocol Buffer definitions
GraphQL API: Comprehensive documentation suite
- GraphQL API Guide - Complete API overview
- GraphQL Schema Reference - Full schema documentation
- GraphQL Integration Guide - Client integration examples
- GraphQL Development Guide - Implementation guide
- GraphQL Examples - Query patterns and use cases

Project Structure

llm-incident-manager/
├── src/
│   ├── api/              # REST/gRPC/GraphQL APIs
│   ├── config/           # Configuration management
│   ├── correlation/      # Correlation engine
│   ├── enrichment/       # Context enrichment
│   │   ├── enrichers.rs  # Enricher implementations
│   │   ├── models.rs     # Data structures
│   │   ├── pipeline.rs   # Enrichment orchestration
│   │   └── service.rs    # Service management
│   ├── error/            # Error types
│   ├── escalation/       # Escalation engine
│   ├── grpc/             # gRPC service implementations
│   ├── integrations/     # LLM integrations (NEW)
│   │   ├── common/       # Shared utilities (client trait, retry, auth)
│   │   ├── sentinel/     # Sentinel monitoring client
│   │   ├── shield/       # Shield security client
│   │   ├── edge_agent/   # Edge-Agent distributed client
│   │   └── governance/   # Governance compliance client
│   ├── ml/               # ML classification
│   │   ├── classifier.rs # Classification logic
│   │   ├── features.rs   # Feature extraction
│   │   ├── models.rs     # Data structures
│   │   └── service.rs    # Service management
│   ├── models/           # Core data models
│   ├── notifications/    # Notification service
│   ├── playbooks/        # Playbook automation
│   ├── processing/       # Incident processor
│   └── state/            # Storage implementations
├── tests/                # Integration tests
│   ├── integration_sentinel_test.rs     # Sentinel client tests
│   ├── integration_shield_test.rs       # Shield client tests
│   ├── integration_edge_agent_test.rs   # Edge-Agent client tests
│   └── integration_governance_test.rs   # Governance client tests
├── proto/                # Protocol buffer definitions
├── migrations/           # Database migrations
└── docs/                 # Additional documentation
    ├── LLM_CLIENT_README.md                 # LLM integrations overview
    ├── LLM_CLIENT_ARCHITECTURE.md           # Detailed architecture
    ├── LLM_CLIENT_IMPLEMENTATION_GUIDE.md   # Implementation guide
    ├── LLM_CLIENT_QUICK_REFERENCE.md        # Quick reference
    └── llm-client-types.ts                  # TypeScript type definitions

Development

Contributing

We welcome contributions! Please see CONTRIBUTING.md for guidelines.

Code Style

# Format code
cargo fmt

# Lint
cargo clippy --all-features

# Check
cargo check --all-features

Running Locally

# Development mode with hot reload
cargo watch -x run

# With debug logging
RUST_LOG=debug cargo run

# With specific features
cargo run --features "postgresql,redis"

Deployment

Docker

FROM rust:1.75 as builder
WORKDIR /app
COPY . .
RUN cargo build --release

FROM debian:bookworm-slim
RUN apt-get update && apt-get install -y ca-certificates && rm -rf /var/lib/apt/lists/*
COPY --from=builder /app/target/release/llm-incident-manager /usr/local/bin/
EXPOSE 8080 50051 9090
CMD ["llm-incident-manager"]

Or use the pre-built image with npm:

FROM node:20-slim

# Install Rust for building
RUN curl --proto '=https' --tlsv1.2 -sSf https://sh.rustup.rs | sh -s -- -y
ENV PATH="/root/.cargo/bin:${PATH}"

# Install the server
RUN npm install -g @llm-dev-ops/llm-incident-manager

# Build Rust binaries
WORKDIR /app
RUN npm run build

EXPOSE 8080 50051 9090
CMD ["llm-incident-manager"]

Kubernetes

apiVersion: apps/v1
kind: Deployment
metadata:
  name: incident-manager
spec:
  replicas: 3
  template:
    spec:
      containers:
      - name: incident-manager
        image: llm-incident-manager:latest
        ports:
        - containerPort: 3000
        - containerPort: 50051
        env:
        - name: DATABASE_URL
          valueFrom:
            secretKeyRef:
              name: incident-manager-secrets
              key: database-url

Monitoring

Metrics (Prometheus)

The system exposes comprehensive metrics on port 9090 (configurable via LLM_IM__SERVER__METRICS_PORT).

Integration Metrics (per LLM integration):

llm_integration_requests_total{integration="sentinel|shield|edge-agent|governance"}
llm_integration_requests_successful{integration="..."}
llm_integration_requests_failed{integration="..."}
llm_integration_success_rate_percent{integration="..."}
llm_integration_latency_milliseconds_average{integration="..."}
llm_integration_last_request_timestamp{integration="..."}

Core System Metrics:

incident_manager_alerts_processed_total
incident_manager_incidents_created_total
incident_manager_incidents_resolved_total
incident_manager_escalations_triggered_total
incident_manager_enrichment_duration_seconds
incident_manager_enrichment_cache_hit_rate
incident_manager_correlation_groups_created_total
incident_manager_ml_predictions_total
incident_manager_ml_prediction_confidence
incident_manager_notifications_sent_total
incident_manager_processing_duration_seconds

Quick Access:

# Prometheus format
curl http://localhost:9090/metrics

# JSON format
curl http://localhost:8080/v1/metrics/integrations

For complete metrics documentation, dashboards, and alerting:

Metrics Guide - Metrics catalog and configuration
Operational Runbook - Troubleshooting and alerts

Health Checks

# Liveness probe
curl http://localhost:8080/health/live

# Readiness probe
curl http://localhost:8080/health/ready

# Full health status with metrics
curl http://localhost:8080/health

Security

Authentication

API Key authentication
mTLS for gRPC
JWT tokens for WebSocket

Data Protection

Encrypted at rest (PostgreSQL encryption)
TLS 1.3 in transit
Sensitive data redaction in logs

Vulnerability Reporting

Please report security issues to: security@example.com

License

This project is licensed under the MIT License - see the LICENSE file for details.

Built With

Rust - Systems programming language
Tokio - Async runtime
PostgreSQL - Primary database
SQLx - SQL toolkit
Tonic - gRPC implementation
Axum - Web framework
Serde - Serialization framework
SmartCore - Machine learning library
Tracing - Structured logging

Acknowledgments

Designed and implemented for enterprise-grade LLM infrastructure management with a focus on reliability, performance, and extensibility.

Status: Production Ready | Version: 1.0.1 | Language: Rust | Last Updated: 2025-11-14

Published Packages:

🦀 Cargo: llm-incident-manager v1.0.1 (crates.io)
📦 npm Server: @llm-dev-ops/llm-incident-manager v1.0.1 (npmjs)
📘 npm Types: @llm-dev-ops/incident-manager-types v1.0.1 (npmjs)
🔌 npm Client: @llm-dev-ops/incident-manager-client v1.0.1 (npmjs)

npm Packages Ecosystem

1. Main Server Package: `@llm-dev-ops/llm-incident-manager`

The complete incident management server with npm CLI tooling for easy installation and operation.

# Install globally
npm install -g @llm-dev-ops/llm-incident-manager

# Available commands
llm-im                    # CLI tool
llm-incident-manager      # Start the server
npm run build             # Build Rust binaries
npm run health            # Check health status
npm run metrics           # View Prometheus metrics
npm run graphql           # Open GraphQL Playground

Features:

Rust-based high-performance server
npm wrapper for easy installation
Automated build scripts
Health check and metrics endpoints
Full REST, gRPC, and GraphQL APIs

2. Type Definitions: `@llm-dev-ops/incident-manager-types`

Comprehensive TypeScript type definitions (2,400+ lines) for the entire incident management system.

npm install @llm-dev-ops/incident-manager-types

import type {
  // Core incident types
  Incident,
  RawEvent,
  IncidentEvent,
  Severity,
  IncidentStatus,

  // LLM integration types
  LLMRequest,
  LLMResponse,
  SentinelLLMConfig,
  ShieldLLMConfig,
  EdgeAgentLLMConfig,
  GovernanceLLMConfig,

  // Policy & workflow types
  EscalationPolicy,
  NotificationTemplate,
  RoutingRule,
  Playbook,

  // Analytics types
  IncidentAnalytics,
  TeamMetrics,
  PostMortem
} from '@llm-dev-ops/incident-manager-types';

Includes:

Complete incident management data models
LLM client integration types (Sentinel, Shield, Edge-Agent, Governance)
Escalation, notification, and routing types
API request/response types
Analytics and metrics types
Zero dependencies, pure TypeScript

3. Client SDK: `@llm-dev-ops/incident-manager-client`

WebSocket/GraphQL client SDK for real-time incident streaming.

npm install @llm-dev-ops/incident-manager-client

# Node.js also requires ws
npm install ws

import { IncidentManagerClient } from '@llm-dev-ops/incident-manager-client';

const client = new IncidentManagerClient({
  wsUrl: 'ws://localhost:8080/graphql/ws',
  authToken: 'your-jwt-token',
  retryAttempts: 10
});

// Subscribe to critical incidents
client.subscribeToCriticalIncidents((incident) => {
  console.log('Critical incident:', incident);
});

// Subscribe to updates
client.subscribeToIncidentUpdates(['P0', 'P1'], (update) => {
  console.log('Update:', update);
});

// Subscribe to new incidents
client.subscribeToNewIncidents((incident) => {
  console.log('New incident:', incident);
});

// Subscribe to state changes
client.subscribeToStateChanges((change) => {
  console.log('State change:', change);
});

// Subscribe to all alerts
client.subscribeToAlerts((alert) => {
  console.log('Alert:', alert);
});

Features:

Real-time WebSocket streaming
Auto-reconnection with exponential backoff
Full TypeScript support
GraphQL subscriptions
Works in browser and Node.js
Multiple subscription helpers

Example: Building a React Dashboard

npm install @llm-dev-ops/incident-manager-client @llm-dev-ops/incident-manager-types

import { useEffect, useState } from 'react';
import { IncidentManagerClient } from '@llm-dev-ops/incident-manager-client';
import type { Incident } from '@llm-dev-ops/incident-manager-types';

function IncidentDashboard() {
  const [criticalIncidents, setCriticalIncidents] = useState<Incident[]>([]);

  useEffect(() => {
    const client = new IncidentManagerClient({
      wsUrl: 'ws://your-server.com/graphql/ws',
      authToken: getAuthToken()
    });

    client.subscribeToCriticalIncidents((incident) => {
      setCriticalIncidents(prev => [...prev, incident]);
      showNotification(incident);
    });

    return () => client.close();
  }, []);

  return (
    <div>
      <h1>Critical Incidents</h1>
      {criticalIncidents.map(incident => (
        <IncidentCard key={incident.id} incident={incident} />
      ))}
    </div>
  );
}

Recent Updates

2025-11-14: Multi-Platform Package Distribution ✅

Published to crates.io: Rust library and binaries available via cargo install llm-incident-manager
Published to npm (3 packages):
- @llm-dev-ops/llm-incident-manager - Server with npm CLI tooling
- @llm-dev-ops/incident-manager-types - TypeScript type definitions (2,400+ lines)
- @llm-dev-ops/incident-manager-client - WebSocket/GraphQL client SDK
All warnings resolved: Fixed 83 compiler warnings for clean crates.io publication
Version sync: All packages aligned at v1.0.1
Complete documentation: Updated README with installation options, examples, and ecosystem guide

2025-11-12: LLM Integrations Module ✅

Implemented enterprise-grade LLM client integrations for Sentinel, Shield, Edge-Agent, and Governance
5,913 lines of production Rust code with comprehensive error handling
1,578 lines of integration tests (78 test cases)
Multi-framework compliance support (GDPR, HIPAA, SOC2, PCI, ISO27001)
gRPC bidirectional streaming for Edge-Agent
Exponential backoff retry logic with jitter
Complete documentation suite in /docs

Name		Name	Last commit message	Last commit date
Latest commit History 29 Commits
.github/workflows		.github/workflows
benches		benches
benchmarks/output		benchmarks/output
bin		bin
config		config
deploy		deploy
docs		docs
examples		examples
functions		functions
kubernetes		kubernetes
packages		packages
plans		plans
proto		proto
scripts		scripts
src		src
tests		tests
.gcloudignore		.gcloudignore
.gitignore		.gitignore
CONTRIBUTING.md		CONTRIBUTING.md
Cargo.toml		Cargo.toml
Dockerfile		Dockerfile
GRAPHQL_SCHEMA.graphql		GRAPHQL_SCHEMA.graphql
LICENSE-APACHE		LICENSE-APACHE
LICENSE-MIT		LICENSE-MIT
Makefile		Makefile
README.md		README.md
build.rs		build.rs
docker-compose.yml		docker-compose.yml
index.js		index.js
package-lock.json		package-lock.json
package.json		package.json

Folders and files

Latest commit

History

Repository files navigation

LLM Incident Manager

Overview

Key Features

Core Capabilities

Implemented Subsystems

1. Escalation Engine ✅

2. Persistent Storage ✅

3. Correlation Engine ✅

4. ML Classification ✅

5. Context Enrichment ✅

6. Deduplication Engine ✅

7. Notification Service ✅

8. Playbook Automation ✅

9. Routing Engine ✅

10. LLM Integrations ✅

11. GraphQL API with WebSocket Streaming ✅

12. Metrics & Observability ✅

13. Circuit Breaker Pattern ✅

Architecture

System Architecture

Data Flow

Quick Start

Prerequisites

Installation Options

Option 1: Install via Cargo (Rust)

Option 2: Install via npm (Global CLI)

Option 3: Install Client SDK (TypeScript/JavaScript)

Option 4: Install from Source

Quick Start Examples

Running the Server

Using the Client SDK (JavaScript/TypeScript)

Using Type Definitions (TypeScript)

Basic Usage

Configuration

Environment Variables

Configuration File (config.yaml)

API Examples

WebSocket Streaming (Real-Time Updates)

REST API

gRPC API

GraphQL API

Feature Guides

1. Escalation Engine

2. Context Enrichment

3. Correlation Engine

4. ML Classification

5. Circuit Breakers

Key Features

Testing

Run All Tests

Test Coverage

Performance

Benchmarks

Resource Requirements

Documentation

Implementation Guides

API Documentation

Project Structure

Development

Contributing

Code Style

Running Locally

Deployment

Docker

Kubernetes

Monitoring

Metrics (Prometheus)

Health Checks

Security

Authentication

Data Protection

Vulnerability Reporting

License

Built With

Acknowledgments

npm Packages Ecosystem

1. Main Server Package: `@llm-dev-ops/llm-incident-manager`

2. Type Definitions: `@llm-dev-ops/incident-manager-types`

3. Client SDK: `@llm-dev-ops/incident-manager-client`

Packages