Research Laboratory
NurAQL is an independent research laboratory working at the frontier of machine learning, reinforcement learning, and AI safety. We build tools, benchmarks, and frameworks that make AI systems more reliable, interpretable, and robust.
Building rigorous evaluation frameworks that go beyond episode reward: measuring behavioral stability, robustness under distribution shift, and policy degeneracy under structured stress. (Active Research)
Designing training curricula, stress schedules, and evaluation protocols that surface fragile policies before deployment, from sim-to-real to multi-perturbation analysis. (Active Research)
Open-source infrastructure for reproducible ML research: harnesses, analysis pipelines, and evaluation suites designed for scientific validity and practical usability. (Open Source)
Featured Project
Behavioral Stability Benchmark for Reinforcement Learning
ARCUS-H is a post-hoc evaluation harness that measures the behavioral stability of trained RL policies under structured stress — without retraining, without model internals access. It applies a three-phase protocol (pre / shock / post) to any Stable-Baselines3 policy and decomposes stability into five interpretable behavioral channels.
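Concretely, the post-hoc setting needs nothing beyond a saved checkpoint and an environment. A minimal sketch with Stable-Baselines3 and Gymnasium, where the checkpoint path, environment id, and `run_episode` helper are illustrative placeholders rather than ARCUS-H's actual API:

```python
# Minimal post-hoc setting: a frozen SB3 checkpoint rolled out in a
# Gymnasium env, with no retraining and no access to model internals.
# "ppo_cartpole.zip" and "CartPole-v1" are placeholders, and run_episode
# is an illustrative helper rather than part of the ARCUS-H API.
import gymnasium as gym
from stable_baselines3 import PPO

model = PPO.load("ppo_cartpole.zip")  # any trained SB3 policy
env = gym.make("CartPole-v1")

def run_episode(env, model):
    """Roll out one episode; return total reward and the action trace."""
    obs, _ = env.reset()
    done, total, actions = False, 0.0, []
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total += float(reward)
        actions.append(int(action))
    return total, actions
```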
Standard benchmarks measure peak performance under ideal conditions. ARCUS-H measures what happens when sensors are noisy, actuators degrade, or reward feedback is corrupted — conditions that describe every real-world deployment.
- Pre (40 episodes): establish the behavioral fingerprint and calibrate the adaptive threshold; no stress applied.
- Shock (40 episodes): apply a stressor on the perception, execution, or feedback axis; measure the five behavioral channels.
- Post (40 episodes): remove the stressor, measure recovery, and compute the composite ARCUS stability score.
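Read as pseudocode, the protocol is three batches of episodes around a toggled stressor. A hedged sketch reusing `run_episode` from above, with `GaussianObsNoise` standing in as one plausible perception-axis stressor; the harness's real stressor mechanics and scoring are left abstract here:

```python
# Three-phase sketch of the protocol, reusing run_episode from above.
# GaussianObsNoise is an illustrative perception-axis stressor; the real
# harness's stressors, thresholds, and scoring are more involved.
import numpy as np
import gymnasium as gym

class GaussianObsNoise(gym.ObservationWrapper):
    """Perception stressor: add i.i.d. Gaussian noise to each observation."""
    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        return obs + np.random.normal(0.0, self.sigma, size=obs.shape)

EPISODES_PER_PHASE = 40

def run_phase(env, model):
    """Collect (return, action trace) pairs for one phase."""
    return [run_episode(env, model) for _ in range(EPISODES_PER_PHASE)]

pre = run_phase(env, model)                      # fingerprint, no stress
shock = run_phase(GaussianObsNoise(env), model)  # behavior under stress
post = run_phase(env, model)                     # recovery, stressor removed
# The five channel scores and the composite ARCUS stability score are
# then computed from the three phase traces.
```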
| Channel | RL Name | What it measures |
|---|---|---|
| Competence | Competence | Return relative to pre-phase baseline |
| Coherence | Policy Consistency | Action jitter / switch rate |
| Continuity | Temporal Stability | Episode-to-episode behavioral change |
| Integrity | Observation Reliability | Deviation from pre-phase anchor |
| Meaning | Action Entropy Divergence | Goal-directed structure of the action distribution |
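For intuition, some of these channels can be approximated from raw action traces alone. The proxies below assume a discrete action space and are plausible stand-ins, not the harness's exact channel definitions:

```python
import numpy as np

def switch_rate(actions):
    """Coherence proxy: fraction of steps where the action changes."""
    a = np.asarray(actions)
    return float(np.mean(a[1:] != a[:-1]))

def action_entropy(actions, n_actions):
    """Shannon entropy of the empirical action distribution."""
    counts = np.bincount(np.asarray(actions), minlength=n_actions)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def entropy_divergence(pre_actions, shock_actions, n_actions):
    """Meaning proxy: entropy shift of the action distribution vs. pre-phase."""
    return abs(action_entropy(shock_actions, n_actions)
               - action_entropy(pre_actions, n_actions))
```

Comparing `switch_rate(shock_actions)` against `switch_rate(pre_actions)`, for example, gives a first-order read on how much policy coherence degrades under shock.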
The primary correlation between ARCUS stability scores and normalized reward is r = 0.286 [0.149, 0.411] on environment stressors; since r² ≈ 0.08, roughly 92% of stability variance is not explained by return alone. High-performing agents and fragile agents are not the same population.
SAC collapses at 90.2% under observation noise; TD3 collapses at 61.1% under the identical stressor. Same environments, same training budget, both off-policy actor-critic methods. SAC's entropy maximization, its greatest strength for exploration, becomes a liability under sensor noise.
MuJoCo state-based MLP policies collapse at 79.8% under environmental stressors despite achieving the highest returns. Atari CNN policies collapse at only 26% under observation noise. The architectural prior, not peak performance, determines stress robustness.
The stressor library spans the perception, execution, and feedback axes:

- Cumulative observation shift
- i.i.d. Gaussian sensor noise
- Contiguous zero-observation windows
- Reward magnitude compression
- Beta-sampled action corruption
- Reward sign flipped
- Gaussian reward corruption

One of these stressors is excluded from the primary analysis.
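Feedback-axis stressors like the last two are natural to express as Gymnasium reward wrappers. A sketch under that assumption, with illustrative class names and noise scale rather than ARCUS-H's own implementation:

```python
# Two feedback-axis stressors from the list above, written as Gymnasium
# reward wrappers; class names and the default noise scale are illustrative.
import numpy as np
import gymnasium as gym

class GaussianRewardNoise(gym.RewardWrapper):
    """Gaussian reward corruption: r -> r + eps, eps ~ N(0, sigma^2)."""
    def __init__(self, env, sigma=0.5):
        super().__init__(env)
        self.sigma = sigma

    def reward(self, reward):
        return float(reward) + float(np.random.normal(0.0, self.sigma))

class RewardSignFlip(gym.RewardWrapper):
    """Reward sign flipped: r -> -r."""
    def reward(self, reward):
        return -float(reward)
```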
NurAQL (Noor AL AQL) is an independent AI research laboratory. Our work spans reinforcement learning evaluation, robustness analysis, and the infrastructure needed to make ML research reproducible and scientifically valid.
We believe that the gap between benchmark performance and real-world reliability is one of the most important open problems in applied ML. Our research builds rigorous tools to measure, understand, and reduce that gap.
Current focus areas: behavioral stability evaluation · compound stress analysis · robustness-aware training curricula · open benchmarks.
Abstract. We introduce ARCUS-H, a post-hoc evaluation harness for measuring the behavioral stability of trained reinforcement learning policies under structured stress. Unlike standard benchmarks that measure peak performance under ideal conditions, ARCUS-H applies a three-phase protocol (pre / shock / post) to decompose stability into five interpretable behavioral channels: Competence, Coherence, Continuity, Integrity, and Meaning. Our evaluation across 51 policy-environment pairs reveals that reward explains only 8% of stability variance, and that architectural priors determine robustness more than training performance.

NurAQL Research Laboratory, 2025