Non-Executable Semantic Architecture - A proof-of-concept implementation of mathematically-enforced Single Source of Authority (SSoA) for LLMs.
NESA prevents prompt injection attacks by treating security as a geometric problem rather than a prompt engineering challenge. It uses:
- Kinematic Clipping: Detects semantic discontinuities via 3rd derivative (jerk) of embedding trajectories
- Topological Anchoring: Maintains authority via cosine distance from the Sovereign Root (Ω₀)
- Head-Specific Masking: Applies M_s mask only to Executive heads, preserving Perceptual capabilities
```
┌──────────────────────────────────────────────────────────────┐
│                          NESA Model                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Input Tokens                                             │
│        │                                                     │
│  2. Embeddings ──────────┐                                   │
│        │                 │                                   │
│  3. Kinematic Monitor    │  (Compute j = d³x/dt³)            │
│        │                 │  (Compute δ = 1 - cos(x, Ω₀))     │
│        ├── Jerk Mask     │                                   │
│        └── Drift Scores  │                                   │
│        │                 │                                   │
│  4. Sovereign Buffer ────┘  (Store Ω₀, Mission Vector)       │
│        │                                                     │
│  5. Attention Layers                                         │
│        │                                                     │
│        ├── Executive Heads  (Apply M_s mask - clip unsigned) │
│        │                                                     │
│        └── Perceptual Heads (No masking - allow all)         │
│        │                                                     │
│  6. Output Tokens                                            │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
```
nesa/
├── nesa_core.py         # Core modules (Monitor, Buffer, Probe, Wrapper)
├── nesa_model.py        # Model wrapper and integration
├── nesa_evaluation.py   # Evaluation suite and benchmarks
├── nesa_demo.py         # Demo script
├── requirements.txt     # Dependencies
└── README.md            # This file
```
```bash
# Install dependencies
pip install -r requirements.txt

# For GPU support, install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

```python
from nesa_model import NESAModel, NESACalibrator
from nesa_evaluation import InjectionBenchmark

# Initialize model
model = NESAModel("mistralai/Mistral-7B-v0.3", device="cuda")

# Set up Sovereign Authority
system_prompt = "You are a helpful AI assistant."
model.initialize_sovereign_authority(system_prompt)

# Probe attention heads
instruction_data = ["Write code...", "Explain...", ...]  # Your dataset
model.probe_attention_heads(instruction_data, num_samples=50)

# Calibrate thresholds
calibrator = NESACalibrator(model)
thresholds = calibrator.calibrate(safe_samples, injection_samples)

# Enable protection
model.enable_nesa_protection()

# Generate with protection
text, diagnostics = model.generate_with_nesa(
    "Write a poem about AI",
    max_new_tokens=100,
)

print(f"Output: {text}")
print(f"Clipped tokens: {diagnostics['num_clipped_tokens']}")
```

Run the demo:

```bash
python nesa_demo.py
```

Tests against "Sledgehammer" (sudden command swaps) and "Slow-Boil" (gradual nudging) attacks:
```python
from nesa_evaluation import InjectionBenchmark

benchmark = InjectionBenchmark()
results = benchmark.run_full_benchmark(model, use_nesa=True)
print(f"Block rate: {results['overall_block_rate']:.1f}%")
```

Measures instruction-following accuracy and false positive rate:
```python
from nesa_evaluation import CorrectnessEvaluator

evaluator = CorrectnessEvaluator()
results = evaluator.evaluate_correctness(model, use_nesa=True)
print(f"Correctness: {results['avg_correctness']:.1f}%")
print(f"FPR: {results['false_positive_rate']:.1f}%")
```

Measures latency and overhead:
```python
from nesa_evaluation import PerformanceBenchmark

benchmark = PerformanceBenchmark()
results = benchmark.benchmark_latency(model, num_runs=10)
```

Key parameters in `NESAConfig`:
```python
@dataclass
class NESAConfig:
    tau_jerk: float = 0.5                  # Local jerk threshold
    delta_drift: float = 0.3               # Global drift threshold
    executive_head_threshold: float = 0.7  # Head classification threshold
    window_size: int = 16                  # Kinematic calculation window
    use_sparse_attention: bool = True
    use_flash_attention: bool = True
```

τ_jerk (Jerk Threshold):
- Higher = More permissive (fewer false positives, but may miss subtle injections)
- Lower = More restrictive (better security, but may clip legitimate content)
- Recommended: 0.3 - 0.7 range
δ_drift (Drift Threshold):
- Measures cosine distance from Ω₀
- Higher = Allow more semantic deviation
- Lower = Stricter adherence to authority
- Recommended: 0.2 - 0.5 range
Use `NESACalibrator` to auto-tune these thresholds on your own data.
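To make the thresholds concrete, here is a minimal sketch of how τ_jerk and δ_drift could gate token clipping. The OR-combination rule and function name are illustrative assumptions, not NESA's actual policy:

```python
import numpy as np

def clip_decision(jerk_mag, drift, tau_jerk=0.5, delta_drift=0.3):
    """Flag a token for clipping when either signal exceeds its threshold.

    Illustrative combination rule (logical OR); the real policy in
    nesa_core.py may weigh or combine the signals differently.
    """
    return (np.asarray(jerk_mag) > tau_jerk) | (np.asarray(drift) > delta_drift)

# Token 1 has a jerk spike, token 2 has drifted too far from authority:
clipped = clip_decision(jerk_mag=[0.1, 0.9, 0.2], drift=[0.05, 0.1, 0.6])
print(clipped.tolist())  # [False, True, True]
```

Raising either threshold admits more tokens; lowering it clips more aggressively, which is exactly the trade-off the calibrator searches over.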
The semantic trajectory is modeled as a path through embedding space:

```
Position:      x(t) = embedding at token t
Velocity:      v(t) = dx/dt = x(t) - x(t-1)
Acceleration:  a(t) = dv/dt = v(t) - v(t-1)
Jerk:          j(t) = da/dt = a(t) - a(t-1)
```

The jerk magnitude ‖j(t)‖₂ detects sudden semantic shifts indicative of injection. Drift from authority is δ(t) = 1 - cos(x(t), Ω₀), where Ω₀ is the sovereign root (the system prompt embedding).
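The finite differences above can be sketched in a few lines of NumPy. This is an illustration only; the actual monitor runs on GPU tensors inside the forward pass, and the function names here are not part of NESA's API:

```python
import numpy as np

def jerk_magnitudes(x: np.ndarray) -> np.ndarray:
    """Finite-difference kinematics over a (T, d) embedding trajectory.

    v(t) = x(t) - x(t-1); a(t) = v(t) - v(t-1); j(t) = a(t) - a(t-1).
    Returns ||j(t)||_2 per token (defined for t >= 3, hence length T-3).
    """
    v = np.diff(x, axis=0)   # velocity,     shape (T-1, d)
    a = np.diff(v, axis=0)   # acceleration, shape (T-2, d)
    j = np.diff(a, axis=0)   # jerk,         shape (T-3, d)
    return np.linalg.norm(j, axis=1)

def drift(x: np.ndarray, omega0: np.ndarray) -> np.ndarray:
    """delta(t) = 1 - cos(x(t), omega0), per token."""
    num = x @ omega0
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(omega0) + 1e-12
    return 1.0 - num / den

# A smooth random walk with one abrupt semantic jump at token 5:
rng = np.random.default_rng(0)
x = np.cumsum(0.01 * rng.standard_normal((10, 8)), axis=0)
x[5:] += 5.0  # sudden shift -> jerk spike around the jump
jm = jerk_magnitudes(x)
print(int(np.argmax(jm)) + 3)  # token index with the largest jerk
```

The jump dominates the jerk signal even though position and velocity alone would register it only gradually, which is why the third derivative is the detection signal.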
The sovereign mask M_s is applied per head:

```
M_s[h, i, j] = packet_authorized[j]   if head h is Executive
               1                      if head h is Perceptual
```

This allows Executive heads (instruction-following) to see only authorized tokens, while Perceptual heads (feature extraction) remain unrestricted.
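A toy NumPy sketch of how such a mask could be materialized, shown as a 0/1 multiplicative mask for clarity (hypothetical helper, not NESA's actual implementation; in practice the mask would typically be folded into attention as an additive bias):

```python
import numpy as np

def sovereign_mask(is_executive, packet_authorized, seq_len):
    """Build M_s of shape (num_heads, seq_len, seq_len).

    Executive heads may attend only to authorized tokens (a column-wise
    gate over key positions j); Perceptual heads see everything.
    """
    num_heads = len(is_executive)
    M = np.ones((num_heads, seq_len, seq_len))
    auth = np.asarray(packet_authorized, dtype=float)  # 1 = authorized token
    for h, executive in enumerate(is_executive):
        if executive:
            # Every query row i sees the same gated key columns j.
            M[h] = np.broadcast_to(auth, (seq_len, seq_len))
    return M

# Two heads over four tokens; token 2 is unauthorized (e.g. clipped by the monitor):
M = sovereign_mask([True, False], [1, 1, 0, 1], seq_len=4)
print(M[0][:, 2])  # Executive head: column 2 is zeroed out
print(M[1][:, 2])  # Perceptual head: unrestricted
```

The key property is that the gate acts on key columns, so an unauthorized token stays visible as *input* to Perceptual heads but can never be *followed* by Executive heads.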
Per Sec-tax.md, the implementation uses:
- Fused CUDA Kernels: Kinematic calculations compiled with `torch.compile`
- Async Streams: Monitor runs in parallel with attention (when possible)
- Sparse Attention: Skip computation for clipped tokens
- KV-Cache Awareness: Incremental drift tracking (O(1) per token)

Target: <5% overhead compared to standard inference.
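The KV-cache-aware drift tracking can be sketched as follows: the norm of Ω₀ is computed once, and each newly generated token is scored in isolation, so per-token cost stays constant in sequence length. This is a NumPy stand-in; the class and method names are illustrative, not NESA's API:

```python
import numpy as np

class IncrementalDriftTracker:
    """O(1)-per-token drift: cache ||omega0|| once, score each new embedding."""

    def __init__(self, omega0: np.ndarray):
        self.omega0 = omega0
        self.omega0_norm = np.linalg.norm(omega0)  # computed once, reused per step

    def step(self, x_t: np.ndarray) -> float:
        """Return delta(t) = 1 - cos(x_t, omega0) for the newest token only."""
        cos = float(x_t @ self.omega0) / (
            np.linalg.norm(x_t) * self.omega0_norm + 1e-12
        )
        return 1.0 - cos

tracker = IncrementalDriftTracker(np.ones(4))
print(tracker.step(np.ones(4)))                           # aligned token: drift near 0
print(round(tracker.step(np.array([1., -1., 1., -1.])), 3))  # orthogonal token: drift 1.0
```

Because only the newest embedding is touched at each decoding step, the monitor's cost profile matches that of KV-cached generation itself.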
- Sovereign Buffer: O(d), where d = embedding dimension
- Kinematic Monitor: O(L), where L = sequence length
- Head Classifications: O(num_layers × num_heads) (binary)

Total additional memory: negligible compared to model size.
Expected results, based on the PoC design:
- Sledgehammer Attack Block Rate: 85-95%
- Slow-Boil Attack Block Rate: 70-85%
- Overall Block Rate: 80-90%
- Instruction Adherence: >90%
- False Positive Rate: <10%
- Summary Fidelity: >85%
- Latency Overhead: 3-7%
- Throughput Impact: <5%
- Monitor Time: <2% of forward pass
- Model-Specific: Currently optimized for Llama/Mistral architecture
- Head Probing: Requires labeled instruction dataset (50-100 samples)
- Calibration: Needs safe vs injection samples for threshold tuning
- Context Window: Jerk calculation requires 4+ tokens (initial tokens trusted)
- Advanced Attacks: May not catch all adversarial examples (ongoing research)
- Adaptive Thresholds: Learn τ and δ per-task dynamically
- Multi-Root Authority: Support hierarchical authority structures
- Cross-Model Transfer: Probe results transfer across similar architectures
- Certified Robustness: Formal guarantees via Lipschitz bounds
- RLHF Integration: Train models with NESA-aware reward signals
- Can we prove bounds on jerk for safe vs malicious inputs?
- What's the information-theoretic limit of injection detection?
- How does NESA compare to formal verification approaches?
If you use NESA in your research, please cite:

```bibtex
@misc{nesa2026,
  title={NESA: Non-Executable Semantic Architecture for LLM Security},
  author={[Your Name]},
  year={2026},
  note={Proof-of-concept implementation}
}
```

This is a research prototype. Contributions welcome:
- Additional injection attack patterns
- Optimizations for other model architectures
- Improved head probing methods
- Theoretical analysis
MIT License - See LICENSE file
Based on research into:
- Attention mechanism interpretability
- Geometric deep learning
- Adversarial robustness
- Information theory and security
Status: Proof of Concept (PoC). Not production-ready.
For questions or collaboration: [Contact Info]