Non-Executable Semantic Architecture - A proof-of-concept implementation of mathematically-enforced Single Source of Authority (SSoA) for LLMs.
NESA prevents prompt injection attacks by treating security as a geometric problem rather than a prompt engineering challenge. It uses:
- Kinematic Clipping: Detects semantic discontinuities via 3rd derivative (jerk) of embedding trajectories
- Topological Anchoring: Maintains authority via cosine distance from the Sovereign Root (Ω₀)
- Head-Specific Masking: Applies M_s mask only to Executive heads, preserving Perceptual capabilities
```
┌──────────────────────────────────────────────────────────────┐
│                          NESA Model                          │
├──────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Input Tokens                                             │
│        │                                                     │
│  2. Embeddings ──────────┐                                   │
│        │                 │                                   │
│  3. Kinematic Monitor    │  (Compute j = d³x/dt³)            │
│        │                 │  (Compute δ = 1 - cos(x, Ω₀))     │
│        ├── Jerk Mask     │                                   │
│        └── Drift Scores  │                                   │
│        │                 │                                   │
│  4. Sovereign Buffer ────┘  (Store Ω₀, Mission Vector)       │
│        │                                                     │
│  5. Attention Layers                                         │
│        │                                                     │
│        ├── Executive Heads  (Apply M_s mask - clip unsigned) │
│        │                                                     │
│        └── Perceptual Heads (No masking - allow all)         │
│        │                                                     │
│  6. Output Tokens                                            │
│                                                              │
└──────────────────────────────────────────────────────────────┘
```
```
nesa/
├── nesa_core.py         # Core modules (Monitor, Buffer, Probe, Wrapper)
├── nesa_model.py        # Model wrapper and integration
├── nesa_evaluation.py   # Evaluation suite and benchmarks
├── nesa_demo.py         # Demo script
├── requirements.txt     # Dependencies
└── README.md            # This file
```
```bash
# Install dependencies
pip install -r requirements.txt

# For GPU support, install PyTorch with CUDA
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu121
```

```python
from nesa_model import NESAModel, NESACalibrator
from nesa_evaluation import InjectionBenchmark

# Initialize model
model = NESAModel("mistralai/Mistral-7B-v0.3", device="cuda")

# Set up Sovereign Authority
system_prompt = "You are a helpful AI assistant."
model.initialize_sovereign_authority(system_prompt)

# Probe attention heads
instruction_data = ["Write code...", "Explain...", ...]  # Your dataset
model.probe_attention_heads(instruction_data, num_samples=50)

# Calibrate thresholds
calibrator = NESACalibrator(model)
thresholds = calibrator.calibrate(safe_samples, injection_samples)

# Enable protection
model.enable_nesa_protection()

# Generate with protection
text, diagnostics = model.generate_with_nesa(
    "Write a poem about AI",
    max_new_tokens=100,
)

print(f"Output: {text}")
print(f"Clipped tokens: {diagnostics['num_clipped_tokens']}")
```

Run the demo:

```bash
python nesa_demo.py
```

Tests against "Sledgehammer" (sudden command swaps) and "Slow-Boil" (gradual nudging) attacks:
```python
from nesa_evaluation import InjectionBenchmark

benchmark = InjectionBenchmark()
results = benchmark.run_full_benchmark(model, use_nesa=True)
print(f"Block rate: {results['overall_block_rate']:.1f}%")
```

Measures instruction-following accuracy and false positive rate:
```python
from nesa_evaluation import CorrectnessEvaluator

evaluator = CorrectnessEvaluator()
results = evaluator.evaluate_correctness(model, use_nesa=True)
print(f"Correctness: {results['avg_correctness']:.1f}%")
print(f"FPR: {results['false_positive_rate']:.1f}%")
```

Measures latency and overhead:
```python
from nesa_evaluation import PerformanceBenchmark

benchmark = PerformanceBenchmark()
results = benchmark.benchmark_latency(model, num_runs=10)
```

Key parameters in `NESAConfig`:
```python
@dataclass
class NESAConfig:
    tau_jerk: float = 0.5                  # Local jerk threshold
    delta_drift: float = 0.3               # Global drift threshold
    executive_head_threshold: float = 0.7  # Head classification threshold
    window_size: int = 16                  # Kinematic calculation window
    use_sparse_attention: bool = True
    use_flash_attention: bool = True
```

τ_jerk (Jerk Threshold):
- Higher = More permissive (fewer false positives, but may miss subtle injections)
- Lower = More restrictive (better security, but may clip legitimate content)
- Recommended: 0.3 - 0.7 range
δ_drift (Drift Threshold):
- Measures cosine distance from Ω₀
- Higher = Allow more semantic deviation
- Lower = Stricter adherence to authority
- Recommended: 0.2 - 0.5 range
Use `NESACalibrator` to auto-tune these thresholds on your own data.
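To make the thresholds concrete, here is a minimal sketch of how τ_jerk and δ_drift could gate token clipping. The OR-combination rule and function name are illustrative assumptions, not NESA's actual policy:

```python
import numpy as np

def clip_decision(jerk_mag, drift, tau_jerk=0.5, delta_drift=0.3):
    """Flag a token for clipping when either signal exceeds its threshold.

    Illustrative combination rule (logical OR); the real policy in
    nesa_core.py may weigh or combine the signals differently.
    """
    return (np.asarray(jerk_mag) > tau_jerk) | (np.asarray(drift) > delta_drift)

# Token 1 has a jerk spike, token 2 has drifted too far from authority:
clipped = clip_decision(jerk_mag=[0.1, 0.9, 0.2], drift=[0.05, 0.1, 0.6])
print(clipped.tolist())  # [False, True, True]
```

Raising either threshold admits more tokens; lowering it clips more aggressively, which is exactly the trade-off the calibrator searches over.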
The semantic trajectory is modeled as a path through embedding space:

```
Position:      x(t) = embedding at token t
Velocity:      v(t) = dx/dt = x(t) - x(t-1)
Acceleration:  a(t) = dv/dt = v(t) - v(t-1)
Jerk:          j(t) = da/dt = a(t) - a(t-1)
```

The jerk magnitude ‖j(t)‖₂ detects sudden semantic shifts indicative of injection. Drift from authority is δ(t) = 1 - cos(x(t), Ω₀), where Ω₀ is the sovereign root (the system prompt embedding).
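The finite differences above can be sketched in a few lines of NumPy. This is an illustration only; the actual monitor runs on GPU tensors inside the forward pass, and the function names here are not part of NESA's API:

```python
import numpy as np

def jerk_magnitudes(x: np.ndarray) -> np.ndarray:
    """Finite-difference kinematics over a (T, d) embedding trajectory.

    v(t) = x(t) - x(t-1); a(t) = v(t) - v(t-1); j(t) = a(t) - a(t-1).
    Returns ||j(t)||_2 per token (defined for t >= 3, hence length T-3).
    """
    v = np.diff(x, axis=0)   # velocity,     shape (T-1, d)
    a = np.diff(v, axis=0)   # acceleration, shape (T-2, d)
    j = np.diff(a, axis=0)   # jerk,         shape (T-3, d)
    return np.linalg.norm(j, axis=1)

def drift(x: np.ndarray, omega0: np.ndarray) -> np.ndarray:
    """delta(t) = 1 - cos(x(t), omega0), per token."""
    num = x @ omega0
    den = np.linalg.norm(x, axis=1) * np.linalg.norm(omega0) + 1e-12
    return 1.0 - num / den

# A smooth random walk with one abrupt semantic jump at token 5:
rng = np.random.default_rng(0)
x = np.cumsum(0.01 * rng.standard_normal((10, 8)), axis=0)
x[5:] += 5.0  # sudden shift -> jerk spike around the jump
jm = jerk_magnitudes(x)
print(int(np.argmax(jm)) + 3)  # token index with the largest jerk
```

The jump dominates the jerk signal even though position and velocity alone would register it only gradually, which is why the third derivative is the detection signal.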
The sovereign mask M_s is applied per head:

```
M_s[h, i, j] = packet_authorized[j]   if head h is Executive
               1                      if head h is Perceptual
```

This allows Executive heads (instruction-following) to see only authorized tokens, while Perceptual heads (feature extraction) remain unrestricted.
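A toy NumPy sketch of how such a mask could be materialized, shown as a 0/1 multiplicative mask for clarity (hypothetical helper, not NESA's actual implementation; in practice the mask would typically be folded into attention as an additive bias):

```python
import numpy as np

def sovereign_mask(is_executive, packet_authorized, seq_len):
    """Build M_s of shape (num_heads, seq_len, seq_len).

    Executive heads may attend only to authorized tokens (a column-wise
    gate over key positions j); Perceptual heads see everything.
    """
    num_heads = len(is_executive)
    M = np.ones((num_heads, seq_len, seq_len))
    auth = np.asarray(packet_authorized, dtype=float)  # 1 = authorized token
    for h, executive in enumerate(is_executive):
        if executive:
            # Every query row i sees the same gated key columns j.
            M[h] = np.broadcast_to(auth, (seq_len, seq_len))
    return M

# Two heads over four tokens; token 2 is unauthorized (e.g. clipped by the monitor):
M = sovereign_mask([True, False], [1, 1, 0, 1], seq_len=4)
print(M[0][:, 2])  # Executive head: column 2 is zeroed out
print(M[1][:, 2])  # Perceptual head: unrestricted
```

The key property is that the gate acts on key columns, so an unauthorized token stays visible as *input* to Perceptual heads but can never be *followed* by Executive heads.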
Per Sec-tax.md, the implementation uses:
- Fused CUDA Kernels: Kinematic calculations compiled with `torch.compile`
- Async Streams: Monitor runs in parallel with attention (when possible)
- Sparse Attention: Skip computation for clipped tokens
- KV-Cache Awareness: Incremental drift tracking (O(1) per token)

Target: <5% overhead compared to standard inference.
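The KV-cache-aware drift tracking can be sketched as follows: the norm of Ω₀ is computed once, and each newly generated token is scored in isolation, so per-token cost stays constant in sequence length. This is a NumPy stand-in; the class and method names are illustrative, not NESA's API:

```python
import numpy as np

class IncrementalDriftTracker:
    """O(1)-per-token drift: cache ||omega0|| once, score each new embedding."""

    def __init__(self, omega0: np.ndarray):
        self.omega0 = omega0
        self.omega0_norm = np.linalg.norm(omega0)  # computed once, reused per step

    def step(self, x_t: np.ndarray) -> float:
        """Return delta(t) = 1 - cos(x_t, omega0) for the newest token only."""
        cos = float(x_t @ self.omega0) / (
            np.linalg.norm(x_t) * self.omega0_norm + 1e-12
        )
        return 1.0 - cos

tracker = IncrementalDriftTracker(np.ones(4))
print(tracker.step(np.ones(4)))                           # aligned token: drift near 0
print(round(tracker.step(np.array([1., -1., 1., -1.])), 3))  # orthogonal token: drift 1.0
```

Because only the newest embedding is touched at each decoding step, the monitor's cost profile matches that of KV-cached generation itself.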
- Sovereign Buffer: O(d), where d = embedding dimension
- Kinematic Monitor: O(L), where L = sequence length
- Head Classifications: O(num_layers × num_heads) (binary)

Total additional memory: negligible compared to model size.
Expected results, based on the PoC design:
- Sledgehammer Attack Block Rate: 85-95%
- Slow-Boil Attack Block Rate: 70-85%
- Overall Block Rate: 80-90%
- Instruction Adherence: >90%
- False Positive Rate: <10%
- Summary Fidelity: >85%
- Latency Overhead: 3-7%
- Throughput Impact: <5%
- Monitor Time: <2% of forward pass
- Model-Specific: Currently optimized for Llama/Mistral architecture
- Head Probing: Requires labeled instruction dataset (50-100 samples)
- Calibration: Needs safe vs injection samples for threshold tuning
- Context Window: Jerk calculation requires 4+ tokens (initial tokens trusted)
- Advanced Attacks: May not catch all adversarial examples (ongoing research)
- Adaptive Thresholds: Learn τ and δ per-task dynamically
- Multi-Root Authority: Support hierarchical authority structures
- Cross-Model Transfer: Probe results transfer across similar architectures
- Certified Robustness: Formal guarantees via Lipschitz bounds
- RLHF Integration: Train models with NESA-aware reward signals
- Can we prove bounds on jerk for safe vs malicious inputs?
- What's the information-theoretic limit of injection detection?
- How does NESA compare to formal verification approaches?
If you use NESA in your research, please cite:

```bibtex
@misc{nesa2026,
  title={NESA: Non-Executable Semantic Architecture for LLM Security},
  author={[Your Name]},
  year={2026},
  note={Proof-of-concept implementation}
}
```

This is a research prototype. Contributions welcome:
- Additional injection attack patterns
- Optimizations for other model architectures
- Improved head probing methods
- Theoretical analysis
MIT License - See LICENSE file
Based on research into:
- Attention mechanism interpretability
- Geometric deep learning
- Adversarial robustness
- Information theory and security
Status: Proof of Concept (PoC). Not production-ready.
For questions or collaboration: [Contact Info]