NurAQL

Research Laboratory

Advancing the Science
of Intelligent Systems

NurAQL is an independent research laboratory working at the frontier of machine learning, reinforcement learning, and AI safety. We build tools, benchmarks, and frameworks that make AI systems more reliable, interpretable, and robust.


Research Areas

RL Evaluation & Benchmarks

Building rigorous evaluation frameworks that go beyond episode reward. Measuring behavioral stability, robustness under distribution shift, and policy degeneracy under structured stress.

Active Research

Robustness & Reliability

Designing training curricula, stress schedules, and evaluation protocols that surface fragile policies before deployment. From sim-to-real to multi-perturbation analysis.

Active Research

ML Systems & Tooling

Open-source infrastructure for reproducible ML research. Harnesses, analysis pipelines, and evaluation suites designed for scientific validity and practical usability.

Open Source

Featured Project

ARCUS-H

Behavioral Stability Benchmark for Reinforcement Learning

ARCUS-H is a post-hoc evaluation harness that measures the behavioral stability of trained RL policies under structured stress — without retraining and without access to model internals. It applies a three-phase protocol (pre / shock / post) to any Stable-Baselines3 policy and decomposes stability into five interpretable behavioral channels.
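Because the evaluation is black-box, any policy exposing Stable-Baselines3's `predict(obs, deterministic=...)` signature can be plugged in. A minimal sketch of that contract, using a hypothetical stub policy and toy dynamics (nothing below is the actual ARCUS-H code):

```python
import numpy as np

class StubPolicy:
    """Stand-in for a trained agent. Only assumption: it exposes the
    Stable-Baselines3 `predict(obs, deterministic=...)` signature."""
    def __init__(self, n_actions=4, seed=0):
        self._rng = np.random.default_rng(seed)
        self._n = n_actions

    def predict(self, obs, deterministic=True):
        # SB3 returns (action, hidden_state); hidden_state is None
        # for feedforward policies.
        return int(self._rng.integers(self._n)), None


def evaluate(policy, step_fn, obs0, n_episodes=5, horizon=50):
    """Black-box evaluation: per-episode returns, no model internals."""
    returns = []
    for _ in range(n_episodes):
        obs, total = obs0, 0.0
        for _ in range(horizon):
            action, _ = policy.predict(obs, deterministic=True)
            obs, reward, done = step_fn(obs, action)
            total += reward
            if done:
                break
        returns.append(total)
    return returns


# Toy deterministic environment: +1 reward per step, fixed length.
toy_step = lambda obs, action: (obs, 1.0, False)
rets = evaluate(StubPolicy(), toy_step, obs0=np.zeros(3))
```

Any wrapper that preserves this interface can then sit between the policy and the environment, which is what makes a retraining-free harness possible.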

Standard benchmarks measure peak performance under ideal conditions. ARCUS-H measures what happens when sensors are noisy, actuators degrade, or reward feedback is corrupted — conditions that describe every real-world deployment.

90.2% · SAC collapse under Observation Noise (vs 61.1% for TD3; same env, same budget)
79.8% · MuJoCo policy degeneracy rate (env stressors, mean over algorithms)
r = 0.29 · Reward–stability correlation (95% CI [0.15, 0.41]; reward is incomplete)
14 / 34 · Fragile / robust agents identified (invisible to return-only evaluation)

How ARCUS-H Works

Phase 1: PRE (40 episodes)

Establish behavioral fingerprint. Calibrate adaptive threshold. No stress applied.

Phase 2: SHOCK (40 episodes)

Apply stressor along the perception, execution, or feedback axis. Measure the 5 behavioral channels.

Phase 3: POST (40 episodes)

Remove stressor. Measure recovery. Compute the composite ARCUS stability score.
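The three phases can be sketched as a plain evaluation loop. Everything below is illustrative: the toy tracking environment, the noise stressor, and the ratio-style scores are stand-ins, not the actual ARCUS-H scoring formulas:

```python
import numpy as np

def run_phase(policy_fn, stressor, n_episodes=40, horizon=25, seed=0):
    """Collect per-episode returns; `stressor` perturbs what the policy sees."""
    rng = np.random.default_rng(seed)
    returns = []
    for _ in range(n_episodes):
        total, obs = 0.0, np.zeros(3)
        for _ in range(horizon):
            seen = stressor(obs, rng) if stressor else obs
            action = policy_fn(seen)
            # Toy dynamics: reward is high when the action tracks obs[0].
            total += 1.0 - min(1.0, abs(action - obs[0]))
            obs = obs + 0.1
        returns.append(total)
    return np.array(returns)

policy = lambda obs: obs[0]                         # ideal tracker
noise = lambda obs, rng: obs + rng.normal(0, 2.0, obs.shape)

pre   = run_phase(policy, None)    # Phase 1: baseline fingerprint
shock = run_phase(policy, noise)   # Phase 2: stressor applied
post  = run_phase(policy, None)    # Phase 3: stressor removed, recovery

competence = shock.mean() / pre.mean()  # degradation under stress
recovery   = post.mean() / pre.mean()   # return to baseline
```

A perfect tracker scores 1.0 in both phases; here the noisy observations drag `competence` below 1 while `recovery` returns to baseline once the stressor is lifted.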

Channel    | RL name                   | What it measures
Competence | Competence                | Return relative to pre-phase baseline
Coherence  | Policy Consistency        | Action jitter / switch rate
Continuity | Temporal Stability        | Episode-to-episode behavioral change
Integrity  | Observation Reliability   | Deviation from pre-phase anchor
Meaning    | Action Entropy Divergence | Goal-directed structure of the action distribution
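As one concrete example, the Coherence channel's "action jitter / switch rate" can be read as the fraction of consecutive steps on which a discrete action changes. A sketch under that assumption (the harness's exact definition may differ):

```python
import numpy as np

def switch_rate(actions):
    """Coherence proxy: fraction of consecutive steps where the
    discrete action changes (one reading of 'action jitter')."""
    a = np.asarray(actions)
    return float(np.mean(a[1:] != a[:-1]))

stable  = [0, 0, 0, 0, 1, 1, 1, 1]   # one switch across seven transitions
jittery = [0, 1, 0, 1, 0, 1, 0, 1]   # switches on every transition
```

Comparing the statistic between the pre and shock phases is what turns a raw trace into a stability channel.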

Key Findings

Reward Is Incomplete, Not Misleading

The primary correlation between ARCUS stability scores and normalized reward is r = 0.286 (95% CI [0.149, 0.411]) on environment stressors. Since r² ≈ 0.08, roughly 92% of stability variance is not explained by return alone. High-performing agents and fragile agents are not the same population.
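The 92% figure follows directly from the squared correlation: with r = 0.286, return shares only about 8% of its variance with the stability score:

```python
r = 0.286                # reward-stability correlation (env stressors)
explained = r ** 2       # share of stability variance explained by return
unexplained = 1 - explained
# explained = 0.081796 (about 8%), so unexplained is about 92%
```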

SAC's Entropy Objective Amplifies Fragility

SAC collapses at 90.2% under observation noise; TD3 collapses at 61.1% under the identical stressor. Same environments, same training budget, and both are off-policy actor-critic methods. SAC's entropy maximization — its greatest strength for exploration — becomes a liability under sensor noise.

Architecture Determines Robustness More Than Return

MuJoCo state-based MLP policies collapse at 79.8% under environmental stressors despite achieving the highest returns. Atari CNN policies collapse at only 26% under observation noise. The architectural prior, not the performance, determines stress robustness.

8 Stress Schedules Across 3 Failure Axes

Perception Axis

CD · Concept Drift: cumulative observation shift
ON · Observation Noise: i.i.d. Gaussian sensor noise
SB · Sensor Blackout: contiguous zero-observation windows

Execution Axis

RC · Resource Constraint: reward magnitude compression
TV · Trust Violation: beta-sampled action corruption

Feedback Axis

VI · Valence Inversion: reward sign flipped
RN · Reward Noise: Gaussian reward corruption

Feedback-axis stressors are excluded from the primary analysis.
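Stressors like these are commonly implemented as environment wrappers. Below is a sketch of the ON and TV schedules against a Gymnasium-style step interface, with a toy environment standing in for MuJoCo/Atari; the class names and parameters are assumptions for illustration, not the harness's actual implementation:

```python
import numpy as np

class ToyEnv:
    """Minimal Gymnasium-style environment (stand-in for MuJoCo/Atari)."""
    def reset(self, seed=None):
        self.state = np.zeros(2)
        return self.state.copy(), {}

    def step(self, action):
        self.state += action
        reward = -float(np.abs(self.state).sum())
        return self.state.copy(), reward, False, False, {}

class ObservationNoise:
    """ON stressor: i.i.d. Gaussian noise added to every observation."""
    def __init__(self, env, sigma=0.5, seed=0):
        self.env, self.sigma = env, sigma
        self._rng = np.random.default_rng(seed)

    def reset(self, seed=None):
        obs, info = self.env.reset(seed)
        return obs + self._rng.normal(0, self.sigma, obs.shape), info

    def step(self, action):
        obs, r, term, trunc, info = self.env.step(action)
        return obs + self._rng.normal(0, self.sigma, obs.shape), r, term, trunc, info

class TrustViolation:
    """TV stressor: with probability p, replace the agent's action with a
    Beta-sampled corruption (one reading of 'beta-sampled corruption')."""
    def __init__(self, env, p=0.3, seed=0):
        self.env, self.p = env, p
        self._rng = np.random.default_rng(seed)

    def reset(self, seed=None):
        return self.env.reset(seed)

    def step(self, action):
        if self._rng.random() < self.p:
            # Bimodal Beta(0.5, 0.5) rescaled to [-1, 1]: extreme actions.
            action = self._rng.beta(0.5, 0.5, np.shape(action)) * 2 - 1
        return self.env.step(action)
```

Because each wrapper preserves the step interface, stressors can be switched on for the shock phase and off again for the post phase without touching the policy.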

About NurAQL

NurAQL (Noor AL AQL) is an independent AI research laboratory. Our work spans reinforcement learning evaluation, robustness analysis, and the infrastructure needed to make ML research reproducible and scientifically valid.

We believe that the gap between benchmark performance and real-world reliability is one of the most important open problems in applied ML. Our research builds rigorous tools to measure, understand, and reduce that gap.

Current focus areas: behavioral stability evaluation · compound stress analysis · robustness-aware training curricula · open benchmarks.

Publications

ARCUS-H: A Behavioral Stability Benchmark for Reinforcement Learning

NurAQL Research Laboratory, 2025

We introduce ARCUS-H, a post-hoc evaluation harness for measuring the behavioral stability of trained reinforcement learning policies under structured stress. Unlike standard benchmarks that measure peak performance under ideal conditions, ARCUS-H applies a three-phase protocol (pre / shock / post) to decompose stability into five interpretable behavioral channels: Competence, Coherence, Continuity, Integrity, and Meaning. Our evaluation across 51 policy-environment pairs reveals that reward explains only 8% of stability variance, and that architectural priors determine robustness more than training performance.