Research Laboratory
NurAQL is an independent research laboratory working at the frontier of machine learning, reinforcement learning, and AI safety. We build tools, benchmarks, and frameworks that make AI systems more reliable, interpretable, and robust.
Building rigorous evaluation frameworks that go beyond episode reward: measuring behavioral stability, robustness under distribution shift, and policy degeneracy under structured stress. (Active Research)
Designing training curricula, stress schedules, and evaluation protocols that surface fragile policies before deployment, from sim-to-real to multi-perturbation analysis. (Active Research)
Open-source infrastructure for reproducible ML research: harnesses, analysis pipelines, and evaluation suites designed for scientific validity and practical usability. (Open Source)
Featured Project
Behavioral Stability Benchmark for Reinforcement Learning
ARCUS-H is a post-hoc evaluation harness that measures the behavioral stability of trained RL policies under structured stress — without retraining, without model internals access. It applies a three-phase protocol (pre / shock / post) to any Stable-Baselines3 policy and decomposes stability into five interpretable behavioral channels.
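Concretely, the post-hoc setting needs nothing beyond a saved checkpoint and an environment. A minimal sketch with Stable-Baselines3 and Gymnasium, where the checkpoint path, environment id, and `run_episode` helper are illustrative placeholders rather than ARCUS-H's actual API:

```python
# Minimal post-hoc setting: a frozen SB3 checkpoint rolled out in a
# Gymnasium env, with no retraining and no access to model internals.
# "ppo_cartpole.zip" and "CartPole-v1" are placeholders, and run_episode
# is an illustrative helper rather than part of the ARCUS-H API.
import gymnasium as gym
from stable_baselines3 import PPO

model = PPO.load("ppo_cartpole.zip")  # any trained SB3 policy
env = gym.make("CartPole-v1")

def run_episode(env, model):
    """Roll out one episode; return total reward and the action trace."""
    obs, _ = env.reset()
    done, total, actions = False, 0.0, []
    while not done:
        action, _ = model.predict(obs, deterministic=True)
        obs, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        total += float(reward)
        actions.append(int(action))
    return total, actions
```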
Standard benchmarks measure peak performance under ideal conditions. ARCUS-H measures what happens when sensors are noisy, actuators degrade, or reward feedback is corrupted — conditions that describe every real-world deployment.
- Pre (40 episodes): establish the behavioral fingerprint and calibrate the adaptive threshold; no stress applied.
- Shock (40 episodes): apply a stressor on the perception, execution, or feedback axis; measure the five behavioral channels.
- Post (40 episodes): remove the stressor, measure recovery, and compute the composite ARCUS stability score.
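Read as pseudocode, the protocol is three batches of episodes around a toggled stressor. A hedged sketch reusing `run_episode` from above, with `GaussianObsNoise` standing in as one plausible perception-axis stressor; the harness's real stressor mechanics and scoring are left abstract here:

```python
# Three-phase sketch of the protocol, reusing run_episode from above.
# GaussianObsNoise is an illustrative perception-axis stressor; the real
# harness's stressors, thresholds, and scoring are more involved.
import numpy as np
import gymnasium as gym

class GaussianObsNoise(gym.ObservationWrapper):
    """Perception stressor: add i.i.d. Gaussian noise to each observation."""
    def __init__(self, env, sigma=0.1):
        super().__init__(env)
        self.sigma = sigma

    def observation(self, obs):
        return obs + np.random.normal(0.0, self.sigma, size=obs.shape)

EPISODES_PER_PHASE = 40

def run_phase(env, model):
    """Collect (return, action trace) pairs for one phase."""
    return [run_episode(env, model) for _ in range(EPISODES_PER_PHASE)]

pre = run_phase(env, model)                      # fingerprint, no stress
shock = run_phase(GaussianObsNoise(env), model)  # behavior under stress
post = run_phase(env, model)                     # recovery, stressor removed
# The five channel scores and the composite ARCUS stability score are
# then computed from the three phase traces.
```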
| Channel | RL Name | What it measures |
|---|---|---|
| Competence | Competence | Return relative to pre-phase baseline |
| Coherence | Policy Consistency | Action jitter / switch rate |
| Continuity | Temporal Stability | Episode-to-episode behavioral change |
| Integrity | Observation Reliability | Deviation from pre-phase anchor |
| Meaning | Action Entropy Divergence | Goal-directed structure of the action distribution |
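For intuition, some of these channels can be approximated from raw action traces alone. The proxies below assume a discrete action space and are plausible stand-ins, not the harness's exact channel definitions:

```python
import numpy as np

def switch_rate(actions):
    """Coherence proxy: fraction of steps where the action changes."""
    a = np.asarray(actions)
    return float(np.mean(a[1:] != a[:-1]))

def action_entropy(actions, n_actions):
    """Shannon entropy of the empirical action distribution."""
    counts = np.bincount(np.asarray(actions), minlength=n_actions)
    p = counts / counts.sum()
    p = p[p > 0]
    return float(-(p * np.log(p)).sum())

def entropy_divergence(pre_actions, shock_actions, n_actions):
    """Meaning proxy: entropy shift of the action distribution vs. pre-phase."""
    return abs(action_entropy(shock_actions, n_actions)
               - action_entropy(pre_actions, n_actions))
```

Comparing `switch_rate(shock_actions)` against `switch_rate(pre_actions)`, for example, gives a first-order read on how much policy coherence degrades under shock.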
The primary correlation between ARCUS stability scores and normalized reward is r = 0.286 [0.149, 0.411] on environment stressors; since r² ≈ 0.08, roughly 92% of stability variance is not explained by return alone. High-performing agents and fragile agents are not the same population.
SAC collapses at 90.2% under observation noise; TD3 collapses at 61.1% under the identical stressor. Same environments, same training budget, both off-policy actor-critic methods. SAC's entropy maximization, its greatest strength for exploration, becomes a liability under sensor noise.
MuJoCo state-based MLP policies collapse at 79.8% under environmental stressors despite achieving the highest returns. Atari CNN policies collapse at only 26% under observation noise. The architectural prior, not peak performance, determines stress robustness.
The stressor library spans the perception, execution, and feedback axes:

- Cumulative observation shift
- i.i.d. Gaussian sensor noise
- Contiguous zero-observation windows
- Reward magnitude compression
- Beta-sampled action corruption
- Reward sign flipped
- Gaussian reward corruption

One of these stressors is excluded from the primary analysis.
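Feedback-axis stressors like the last two are natural to express as Gymnasium reward wrappers. A sketch under that assumption, with illustrative class names and noise scale rather than ARCUS-H's own implementation:

```python
# Two feedback-axis stressors from the list above, written as Gymnasium
# reward wrappers; class names and the default noise scale are illustrative.
import numpy as np
import gymnasium as gym

class GaussianRewardNoise(gym.RewardWrapper):
    """Gaussian reward corruption: r -> r + eps, eps ~ N(0, sigma^2)."""
    def __init__(self, env, sigma=0.5):
        super().__init__(env)
        self.sigma = sigma

    def reward(self, reward):
        return float(reward) + float(np.random.normal(0.0, self.sigma))

class RewardSignFlip(gym.RewardWrapper):
    """Reward sign flipped: r -> -r."""
    def reward(self, reward):
        return -float(reward)
```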
NurAQL (Noor AL AQL) is an independent AI research laboratory. Our work spans reinforcement learning evaluation, robustness analysis, and the infrastructure needed to make ML research reproducible and scientifically valid.
We believe that the gap between benchmark performance and real-world reliability is one of the most important open problems in applied ML. Our research builds rigorous tools to measure, understand, and reduce that gap.
Current focus areas: behavioral stability evaluation · compound stress analysis · robustness-aware training curricula · open benchmarks.
Abstract. We introduce ARCUS-H, a post-hoc evaluation harness for measuring the behavioral stability of trained reinforcement learning policies under structured stress. Unlike standard benchmarks that measure peak performance under ideal conditions, ARCUS-H applies a three-phase protocol (pre / shock / post) to decompose stability into five interpretable behavioral channels: Competence, Coherence, Continuity, Integrity, and Meaning. Our evaluation across 51 policy-environment pairs reveals that reward explains only 8% of stability variance, and that architectural priors determine robustness more than training performance.

NurAQL Research Laboratory, 2025