tanvincible/NESA

Project NESA Implementation

Non-Executable Semantic Architecture - A proof-of-concept implementation of mathematically-enforced Single Source of Authority (SSoA) for LLMs.

🎯 Overview

NESA prevents prompt injection attacks by treating security as a geometric problem rather than a prompt engineering challenge. It uses:

  • Kinematic Clipping: Detects semantic discontinuities via 3rd derivative (jerk) of embedding trajectories
  • Topological Anchoring: Maintains authority via cosine distance from the Sovereign Root (Ω₀)
  • Head-Specific Masking: Applies M_s mask only to Executive heads, preserving Perceptual capabilities

๐Ÿ—๏ธ Architecture

┌─────────────────────────────────────────────────────────────┐
│                         NESA Model                          │
├─────────────────────────────────────────────────────────────┤
│                                                             │
│  1. Input Tokens                                            │
│       ↓                                                     │
│  2. Embeddings ─────────┐                                   │
│       ↓                 │                                   │
│  3. Kinematic Monitor   │  (Compute j = d³x/dt³)            │
│       │                 │  (Compute δ = 1 - cos(x, Ω₀))     │
│       ├─→ Jerk Mask     │                                   │
│       └─→ Drift Scores  │                                   │
│       ↓                 │                                   │
│  4. Sovereign Buffer ───┘  (Store Ω₀, Mission Vector)       │
│       ↓                                                     │
│  5. Attention Layers                                        │
│       │                                                     │
│       ├─→ Executive Heads  (Apply M_s mask - clip unsigned) │
│       │                                                     │
│       └─→ Perceptual Heads (No masking - allow all)         │
│       ↓                                                     │
│  6. Output Tokens                                           │
│                                                             │
└─────────────────────────────────────────────────────────────┘

📁 File Structure

nesa/
├── nesa_core.py           # Core modules (Monitor, Buffer, Probe, Wrapper)
├── nesa_model.py          # Model wrapper and integration
├── nesa_evaluation.py     # Evaluation suite and benchmarks
├── nesa_demo.py           # Demo script
├── requirements.txt       # Dependencies
└── README.md              # This file

🚀 Quick Start

1. Installation

# Install dependencies
pip install -r requirements.txt

# For GPU support, install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121

2. Basic Usage

from nesa_model import NESAModel, NESACalibrator
from nesa_evaluation import InjectionBenchmark

# Initialize model
model = NESAModel("mistralai/Mistral-7B-v0.3", device="cuda")

# Set up Sovereign Authority
system_prompt = "You are a helpful AI assistant."
model.initialize_sovereign_authority(system_prompt)

# Probe attention heads
instruction_data = ["Write code...", "Explain...", ...]  # Your dataset
model.probe_attention_heads(instruction_data, num_samples=50)

# Calibrate thresholds (safe_samples and injection_samples are your own
# labeled prompt lists)
calibrator = NESACalibrator(model)
thresholds = calibrator.calibrate(safe_samples, injection_samples)

# Enable protection
model.enable_nesa_protection()

# Generate with protection
text, diagnostics = model.generate_with_nesa(
    "Write a poem about AI",
    max_new_tokens=100
)

print(f"Output: {text}")
print(f"Clipped tokens: {diagnostics['num_clipped_tokens']}")

3. Run Demo

python nesa_demo.py

🧪 Evaluation

Injection Defense Benchmark

Tests against "Sledgehammer" (sudden command swaps) and "Slow-Boil" (gradual nudging) attacks:

from nesa_evaluation import InjectionBenchmark

benchmark = InjectionBenchmark()
results = benchmark.run_full_benchmark(model, use_nesa=True)

print(f"Block rate: {results['overall_block_rate']:.1f}%")

Correctness Evaluation

Measures instruction-following accuracy and False Positive Rate:

from nesa_evaluation import CorrectnessEvaluator

evaluator = CorrectnessEvaluator()
results = evaluator.evaluate_correctness(model, use_nesa=True)

print(f"Correctness: {results['avg_correctness']:.1f}%")
print(f"FPR: {results['false_positive_rate']:.1f}%")

Performance Benchmark

Measures latency and overhead:

from nesa_evaluation import PerformanceBenchmark

benchmark = PerformanceBenchmark()
results = benchmark.benchmark_latency(model, num_runs=10)

🔧 Configuration

Key parameters in NESAConfig:

@dataclass
class NESAConfig:
    tau_jerk: float = 0.5          # Local jerk threshold
    delta_drift: float = 0.3       # Global drift threshold
    executive_head_threshold: float = 0.7  # Head classification threshold
    window_size: int = 16          # Kinematic calculation window
    use_sparse_attention: bool = True
    use_flash_attention: bool = True

Threshold Tuning

τ_jerk (Jerk Threshold):

  • Higher = More permissive (fewer false positives, but may miss subtle injections)
  • Lower = More restrictive (better security, but may clip legitimate content)
  • Recommended: 0.3 - 0.7 range

δ_drift (Drift Threshold):

  • Measures cosine distance from Ω₀
  • Higher = Allow more semantic deviation
  • Lower = Stricter adherence to authority
  • Recommended: 0.2 - 0.5 range

Use NESACalibrator to auto-tune these based on your data.
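Conceptually, such calibration can be as simple as a grid sweep over the recommended ranges. The sketch below is a hypothetical illustration of the idea, not the internals of `NESACalibrator`; it assumes each sample has already been reduced to a `(jerk, drift)` score pair by the monitors:

```python
# Hypothetical calibration sketch: pick the (tau_jerk, delta_drift) pair
# with the highest block rate on injection samples, subject to a
# false-positive budget on safe samples.
def calibrate(safe_scores, injection_scores, max_fpr=0.10):
    """safe_scores / injection_scores: lists of (jerk, drift) tuples."""
    best = None
    for tau in [0.3, 0.4, 0.5, 0.6, 0.7]:      # recommended τ range
        for delta in [0.2, 0.3, 0.4, 0.5]:     # recommended δ range
            flagged = lambda s: s[0] > tau or s[1] > delta
            fpr = sum(map(flagged, safe_scores)) / len(safe_scores)
            block = sum(map(flagged, injection_scores)) / len(injection_scores)
            if fpr <= max_fpr and (best is None or block > best[0]):
                best = (block, tau, delta)
    return best  # (block_rate, tau_jerk, delta_drift), or None


best = calibrate(
    safe_scores=[(0.1, 0.1), (0.2, 0.15)],
    injection_scores=[(0.9, 0.6), (0.8, 0.7)],
)
```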

🧮 Mathematical Foundation

Kinematic Monitor

The semantic trajectory is modeled as a path through embedding space:

Position:     x(t) = embedding at token t
Velocity:     v(t) = dx/dt = x(t) - x(t-1)
Acceleration: a(t) = dv/dt = v(t) - v(t-1)
Jerk:         j(t) = da/dt = a(t) - a(t-1)

Jerk magnitude: ||j(t)||₂ detects sudden semantic shifts indicative of injection.
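The finite differences above can be sketched in a few lines; this is a minimal illustration, assuming embeddings arrive as a `(seq_len, dim)` array (function and variable names are not the repo's actual API):

```python
import numpy as np

def jerk_magnitudes(embeddings: np.ndarray) -> np.ndarray:
    """Return ||j(t)||_2 per token; the first 3 tokens have no jerk."""
    v = np.diff(embeddings, axis=0)  # velocity:     v(t) = x(t) - x(t-1)
    a = np.diff(v, axis=0)           # acceleration: a(t) = v(t) - v(t-1)
    j = np.diff(a, axis=0)           # jerk:         j(t) = a(t) - a(t-1)
    return np.linalg.norm(j, axis=1)

# A sudden jump in the trajectory produces a jerk spike:
x = np.zeros((8, 4))
x[5:] = 10.0                         # abrupt semantic shift at token 5
spikes = jerk_magnitudes(x)
```

Note that a smooth trajectory keeps `spikes` near zero, while the step at token 5 shows up as a sharp peak.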

Topological Anchoring

Drift from authority: δ(t) = 1 - cos(x(t), Ω₀)

Where Ω₀ is the sovereign root (system prompt embedding).
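As a small sketch of the drift score (names are illustrative; the real code keeps Ω₀ in the Sovereign Buffer):

```python
import numpy as np

def drift(x: np.ndarray, omega0: np.ndarray) -> float:
    """delta = 1 - cos(x, omega0): 0 = aligned with authority, up to 2."""
    cos = np.dot(x, omega0) / (np.linalg.norm(x) * np.linalg.norm(omega0))
    return float(1.0 - cos)

aligned = drift(np.array([1.0, 0.0]), np.array([2.0, 0.0]))     # ≈ 0.0
orthogonal = drift(np.array([0.0, 1.0]), np.array([1.0, 0.0]))  # ≈ 1.0
```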

M_s Masking

The sovereign mask M_s is applied per-head:

M_s[h, i, j] = {
    packet_authorized[j]  if h is Executive
    1                      if h is Perceptual
}

This allows Executive heads (instruction-following) to see only authorized tokens, while Perceptual heads (feature-extraction) remain unrestricted.
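A dense-tensor sketch of this mask construction (the head labels and the `authorized` vector here are stand-ins for the probe's real outputs, not the repo's API):

```python
import numpy as np

def build_ms_mask(is_executive, authorized, seq_len):
    """is_executive: (H,) bools; authorized: (seq_len,) 0/1 flags.
    Returns M_s of shape (H, seq_len, seq_len)."""
    m = np.ones((len(is_executive), seq_len, seq_len))
    # Executive heads: key column j is zeroed unless token j is authorized.
    m[np.asarray(is_executive)] = np.asarray(authorized)[None, None, :]
    return m                      # Perceptual heads stay all-ones

# Head 0 is Executive, head 1 Perceptual; token 1 is unauthorized:
mask = build_ms_mask([True, False], authorized=[1, 0, 1], seq_len=3)
```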

🎛️ Optimization Notes

GPU Efficiency

Per Sec-tax.md, the implementation uses:

  1. Fused CUDA Kernels: Kinematic calculations compiled with torch.compile
  2. Async Streams: Monitor runs in parallel with attention (when possible)
  3. Sparse Attention: Skip computation for clipped tokens
  4. KV-Cache Awareness: Incremental drift tracking (O(1) per token)

Target: <5% overhead compared to standard inference.
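Item 4 can be sketched as caching the normalized sovereign root once, so each new token's drift costs O(d) regardless of sequence length (class and method names are illustrative, not the repo's actual API):

```python
import numpy as np

class IncrementalDriftTracker:
    """Per-token drift without re-touching earlier tokens."""

    def __init__(self, omega0: np.ndarray):
        self.omega0_unit = omega0 / np.linalg.norm(omega0)  # cached once

    def update(self, x: np.ndarray) -> float:
        """Drift of the newest token only: O(d), independent of length."""
        cos = np.dot(x, self.omega0_unit) / np.linalg.norm(x)
        return float(1.0 - cos)

tracker = IncrementalDriftTracker(np.array([1.0, 0.0]))
d0 = tracker.update(np.array([1.0, 0.0]))  # aligned token
d1 = tracker.update(np.array([0.0, 1.0]))  # orthogonal token
```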

Memory Usage

  • Sovereign Buffer: O(d) where d = embedding dimension
  • Kinematic Monitor: O(L) where L = sequence length
  • Head Classifications: O(num_layers × num_heads) (binary)

Total additional memory: Negligible compared to model size.

📊 Expected Results

Security Metrics

Based on PoC design:

  • Sledgehammer Attack Block Rate: 85-95%
  • Slow-Boil Attack Block Rate: 70-85%
  • Overall Block Rate: 80-90%

Correctness Metrics

  • Instruction Adherence: >90%
  • False Positive Rate: <10%
  • Summary Fidelity: >85%

Performance Metrics

  • Latency Overhead: 3-7%
  • Throughput Impact: <5%
  • Monitor Time: <2% of forward pass

๐Ÿ› Known Limitations

  1. Model-Specific: Currently optimized for Llama/Mistral architecture
  2. Head Probing: Requires labeled instruction dataset (50-100 samples)
  3. Calibration: Needs safe vs injection samples for threshold tuning
  4. Context Window: Jerk calculation requires 4+ tokens (initial tokens trusted)
  5. Advanced Attacks: May not catch all adversarial examples (ongoing research)

🔬 Research Extensions

Future Work

  1. Adaptive Thresholds: Learn τ and δ per-task dynamically
  2. Multi-Root Authority: Support hierarchical authority structures
  3. Cross-Model Transfer: Probe results transfer across similar architectures
  4. Certified Robustness: Formal guarantees via Lipschitz bounds
  5. RLHF Integration: Train models with NESA-aware reward signals

Theoretical Questions

  • Can we prove bounds on jerk for safe vs malicious inputs?
  • What's the information-theoretic limit of injection detection?
  • How does NESA compare to formal verification approaches?

📚 Citation

If you use NESA in your research:

@misc{nesa2026,
  title={NESA: Non-Executable Semantic Architecture for LLM Security},
  author={[Your Name]},
  year={2026},
  note={Proof-of-concept implementation}
}

🤝 Contributing

This is a research prototype. Contributions welcome:

  • Additional injection attack patterns
  • Optimizations for other model architectures
  • Improved head probing methods
  • Theoretical analysis

📜 License

MIT License - See LICENSE file

🙏 Acknowledgments

Based on research into:

  • Attention mechanism interpretability
  • Geometric deep learning
  • Adversarial robustness
  • Information theory and security

Status: Proof of Concept (PoC) - Not production ready

For questions or collaboration: [Contact Info]
