Methodology
A model-agnostic framework for measuring the quality of AI-generated images and video. Nine dimensions, three gates, automated scorecards.
In video QA, “artifact” often names a defect. HarteFact scores outputs anyway — assets, streams, pixels, facts.
Local-first. No cloud dependencies. Designed to run on Apple Silicon using open-source components. The framework is incremental — each phase produces infrastructure consumed by later phases.
Core principles
Model-agnostic by design
Most metrics measure properties of the output file — resolution, texture, temporal stability, color accuracy, identity consistency — regardless of which model produced it. Scoring does not require recalibration when models change.
Algorithmic vs. AI-evaluated
Every score is labeled algorithmic or ai_evaluated. VLM scores are reported with mean and variance and are never presented as equivalent to deterministic metrics.
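A minimal sketch of what this labeling could look like in a score record; the field names are illustrative, not the framework's actual schema:

```python
from dataclasses import dataclass
from statistics import mean, variance

@dataclass
class Score:
    dimension: str
    metric: str
    kind: str                 # "algorithmic" or "ai_evaluated"
    value: float
    var: float | None = None  # populated only for ai_evaluated scores

# A deterministic metric carries a single value and no variance.
vmaf = Score("technical_delivery", "vmaf", "algorithmic", 94.2)

# A VLM judgment is sampled repeatedly and reported as mean + variance.
samples = [0.78, 0.81, 0.74, 0.80, 0.77]
adherence = Score("prompt_adherence", "framing", "ai_evaluated",
                  mean(samples), variance(samples))
```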
Tiered gating
Three gates avoid wasting compute on content that has already failed. A clip with the wrong codec never consumes GPU cycles on identity-drift analysis.
Versioned, reproducible
Every run logs framework version, calibration version, and model versions. Re-evaluations are new runs, not silent replacements. Score history is queryable per asset.
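A minimal sketch of that run log in SQLite; the table layout and version strings are placeholders, not the framework's actual schema:

```python
import datetime, json, sqlite3

conn = sqlite3.connect("scores.db")
conn.execute("""
CREATE TABLE IF NOT EXISTS runs (
    asset_id        TEXT,
    run_at          TEXT,
    framework_ver   TEXT,
    calibration_ver TEXT,
    model_vers      TEXT,  -- JSON: per-model versions
    scorecard       TEXT   -- JSON: per-dimension scores
)""")

# A re-evaluation appends a new row; nothing is silently replaced.
conn.execute(
    "INSERT INTO runs VALUES (?, ?, ?, ?, ?, ?)",
    ("clip_0042",
     datetime.datetime.now(datetime.timezone.utc).isoformat(),
     "0.3.1", "cal-2025-11",
     json.dumps({"insightface": "0.7.3"}),
     json.dumps({"gate1": "pass"})))
conn.commit()

# Score history per asset is a plain query.
history = conn.execute(
    "SELECT run_at, framework_ver, scorecard FROM runs"
    " WHERE asset_id = ? ORDER BY run_at", ("clip_0042",)).fetchall()
```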
Pipeline architecture
Three gates separate fast, cheap checks from expensive deep analysis. Failed content gets immediate, specific feedback identifying the failure dimension — without the cost of downstream scoring.
- Gate 1: Technical specs (Dimension 1). Pass/fail on file specs, codec, resolution, audio packaging.
- Gate 2: Spatial quality (Dimension 2). Pass/fail on catastrophic spatial failures (severe artifacts, banding).
- Gate 3: Temporal & audio basics (Dimensions 3 + 4, run in parallel). Pass/fail on flicker, scene-cut sanity, audio levels, sync offset.
- Deep analysis: Identity, lighting, brand, prompt adherence (Dimensions 5-9). Per-character analysis, scene integrity, client-compliance scoring.
- Output: Versioned scorecard. Pass/fail summary, per-dimension detail, annotated frame thumbnails, timeline visualization, per-frame metric trends, client threshold reference.
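A minimal sketch of this sequencing, with stubbed checks standing in for the real metrics; names and thresholds are illustrative:

```python
from dataclasses import dataclass

@dataclass
class GateResult:
    passed: bool
    detail: str = ""

# Stub checks; in practice these wrap ffprobe, BRISQUE, optical flow, LUFS, etc.
def check_specs(clip):          return GateResult(clip["codec"] == "h264", "codec")
def check_spatial(clip):        return GateResult(clip["brisque"] < 60, "brisque")
def check_temporal_audio(clip): return GateResult(abs(clip["sync_ms"]) < 45, "sync")

def evaluate(clip):
    """Run gates in order; the first failure short-circuits before any GPU work."""
    for name, gate in [("gate1_specs", check_specs),
                       ("gate2_spatial", check_spatial),
                       ("gate3_temporal_audio", check_temporal_audio)]:
        result = gate(clip)
        if not result.passed:
            return {"asset": clip["id"], "failed_gate": name, "detail": result.detail}
    # Only survivors reach the expensive dimensions 5-9.
    return {"asset": clip["id"], "failed_gate": None, "deep": "dimensions 5-9 here"}

print(evaluate({"id": "clip_0042", "codec": "prores", "brisque": 30, "sync_ms": 10}))
# -> fails gate1_specs; spatial and temporal checks never run
```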
The nine dimensions
Each dimension owns a distinct axis of output quality. Build phases follow the dependency map: each phase produces infrastructure later phases reuse, so no work is thrown away.
Dimension 1: Technical Delivery Compliance
File specs, codecs, container, color space, VMAF, audio packaging. The non-negotiable foundation.
- Resolution / frame rate
- Codec & container
- VMAF score
- Color space
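A sketch of the Gate 1 check built on ffprobe (which ships with FFmpeg); the spec values here are examples, and VMAF would run separately, e.g. via FFmpeg's libvmaf filter:

```python
import json
import subprocess

def probe_video(path):
    """Read basic facts about the first video stream with ffprobe."""
    out = subprocess.run(
        ["ffprobe", "-v", "error", "-select_streams", "v:0",
         "-show_entries", "stream=codec_name,width,height,pix_fmt,color_space",
         "-of", "json", path],
        capture_output=True, text=True, check=True).stdout
    return json.loads(out)["streams"][0]

def gate1(path, spec):
    stream = probe_video(path)
    failures = {k: stream.get(k) for k, v in spec.items() if stream.get(k) != v}
    return len(failures) == 0, failures

ok, failures = gate1("render.mp4",  # hypothetical deliverable
                     {"codec_name": "h264", "width": 1920, "height": 1080})
```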
Dimension 2: Spatial & Texture Integrity
Per-frame visual quality. Compression artifacts, texture noise, banding, VAE seam detection.
- BRISQUE / NIQE
- Laplacian sharpness
- Color banding
- Wavelet noise analysis
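As one example of a cheap per-frame signal, a sketch of Laplacian sharpness scoring with OpenCV; the threshold is illustrative and would come from calibration:

```python
import cv2

cap = cv2.VideoCapture("render.mp4")  # hypothetical deliverable
sharpness = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    # Variance of the Laplacian: low values indicate soft or blurred frames.
    sharpness.append(cv2.Laplacian(gray, cv2.CV_64F).var())
cap.release()

soft_frames = [i for i, s in enumerate(sharpness) if s < 50.0]  # example cutoff
```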
Dimension 3: Temporal Consistency & Motion
Stability across frames. Background flicker, optical flow consistency, scene-cut detection.
- Background SSIM
- Optical flow
- Flicker detection
- Scene cuts
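A sketch of frame-to-frame SSIM used as a combined flicker and scene-cut signal; thresholds are illustrative:

```python
import cv2
from skimage.metrics import structural_similarity as ssim

cap = cv2.VideoCapture("render.mp4")  # hypothetical deliverable
prev, sims = None, []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if prev is not None:
        sims.append(ssim(prev, gray))  # similarity to the previous frame
    prev = gray
cap.release()

# A deep single-frame dip suggests a cut; sustained oscillation suggests flicker.
cuts = [i + 1 for i, s in enumerate(sims) if s < 0.4]
```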
Dimension 4: Audio Quality
Loudness, clipping, sync offset. Runs in parallel with the temporal pipeline.
- LUFS measurement
- Clipping detection
- Sync offset
- Spectral integrity
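A sketch of the loudness and clipping checks using soundfile and pyloudnorm; the target window is an example, not a client spec:

```python
import numpy as np
import pyloudnorm as pyln
import soundfile as sf

data, rate = sf.read("render.wav")  # hypothetical extracted audio track

meter = pyln.Meter(rate)            # ITU-R BS.1770 loudness meter
lufs = meter.integrated_loudness(data)

# Naive clipping check: fraction of samples pinned at full scale.
clipped = float(np.mean(np.abs(data) >= 0.999))

passed = (-16.0 <= lufs <= -12.0) and clipped < 1e-4
```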
Dimension 5: Lip Sync Precision
Combines mouth aspect ratio (MAR) with audio phoneme timing via DTW alignment.
- MAR extraction
- DTW alignment
- WhisperX phonemes
- Sync drift over time
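A sketch of the MAR half of the check, assuming dlib-style 68-point landmarks (inner mouth = points 60-67); the landmark detector itself is omitted:

```python
import numpy as np

def mouth_aspect_ratio(inner_mouth):
    """inner_mouth: (8, 2) array of landmark points 60-67, left corner first."""
    p = np.asarray(inner_mouth, dtype=float)
    vertical = (np.linalg.norm(p[1] - p[7]) +   # 61-67
                np.linalg.norm(p[2] - p[6]) +   # 62-66
                np.linalg.norm(p[3] - p[5]))    # 63-65
    horizontal = np.linalg.norm(p[0] - p[4])    # 60-64
    return vertical / (3.0 * horizontal)

# Per-frame MAR yields a time series; DTW aligns it against phoneme timings
# (e.g. from WhisperX) to estimate sync drift over the clip.
```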
Dimension 6: Character & Identity Integrity
Face identity drift, hand failures, body proportions, teeth, clothing consistency.
- InsightFace cosine similarity
- Hand failure logging
- Body proportions
- Skin tone stability
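A sketch of the identity-drift measurement with InsightFace; the 0.35 similarity floor is illustrative, and real thresholds come from calibration:

```python
import cv2
import numpy as np
from insightface.app import FaceAnalysis

app = FaceAnalysis()
app.prepare(ctx_id=0)

def face_embedding(image_bgr):
    faces = app.get(image_bgr)
    return faces[0].normed_embedding if faces else None

ref = face_embedding(cv2.imread("reference.png"))   # approved character still
cur = face_embedding(cv2.imread("frame_0120.png"))  # sampled video frame

if ref is not None and cur is not None:
    similarity = float(np.dot(ref, cur))  # embeddings are normed: dot == cosine
    drifted = similarity < 0.35
```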
Dimension 7: Lighting & Scene Integrity
Shadow coherence, luminance tracking, color temperature stability, reflection plausibility.
- Shadow masking
- Luminance per region
- Color temperature drift
- Reflection flagging
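A sketch of a cheap color-temperature-drift proxy: tracking the frame-mean red/blue balance over time. A fuller pipeline would estimate correlated color temperature properly; this only shows the per-frame trend being monitored:

```python
import cv2

cap = cv2.VideoCapture("render.mp4")  # hypothetical deliverable
rb_ratio = []
while True:
    ok, frame = cap.read()
    if not ok:
        break
    r = frame[..., 2].mean()  # OpenCV frames are BGR
    b = frame[..., 0].mean()
    rb_ratio.append(r / max(b, 1e-6))
cap.release()

# A large warm/cool swing across the clip is flagged for review.
drift = max(rb_ratio) - min(rb_ratio)
```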
Dimension 8: Brand & Client Compliance
Per-client palette, talent reference, logo placement, LUT comparison, typography.
- Brand HEX Delta-E
- Talent face match
- LUT comparison
- Logo / wordmark presence
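A sketch of the brand-palette check as Delta-E 2000 via scikit-image; the hex values are hypothetical, and the 3.0 cutoff is a common just-noticeable-difference rule of thumb, not a client spec:

```python
import numpy as np
from skimage.color import deltaE_ciede2000, rgb2lab

def hex_to_lab(hex_code):
    rgb = np.array([int(hex_code[i:i + 2], 16) / 255.0 for i in (1, 3, 5)])
    return rgb2lab(rgb.reshape(1, 1, 3))[0, 0]

brand   = hex_to_lab("#E4002B")  # client's brand red (hypothetical)
sampled = hex_to_lab("#D8072F")  # mean color of the detected logo region

delta_e = float(deltaE_ciede2000(brand, sampled))
on_brand = delta_e < 3.0
```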
Dimension 9: Prompt & Action Adherence
VLM-evaluated framing, composition, physics plausibility, object/spatial flagging.
- VLM scene description
- Framing & composition
- Physics flags
- Slideshow detection
Includes ai_evaluated scores; reported with mean + variance.
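Slideshow detection is the algorithmic outlier in this otherwise VLM-heavy dimension; a sketch using mean absolute frame difference, with illustrative thresholds:

```python
import cv2
import numpy as np

cap = cv2.VideoCapture("render.mp4")  # hypothetical deliverable
prev, run, longest = None, 0, 0
while True:
    ok, frame = cap.read()
    if not ok:
        break
    gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    if prev is not None and float(np.mean(cv2.absdiff(prev, gray))) < 0.5:
        run += 1              # frame is near-identical to the previous one
    else:
        run = 0
    longest = max(longest, run)
    prev = gray
cap.release()

slideshow = longest > 24  # e.g. more than ~1 s of static frames at 24 fps
```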
What this framework is not
- Not a scoring rubric for taste, creativity, or commercial appeal. Aesthetic judgment remains human.
- Not a model leaderboard. The framework benchmarks output properties; model comparisons are a separate activity built on top of the same infrastructure.
- Not a SaaS dashboard. Phase 1 ships a local pipeline and a versioned scorecard format, not a hosted product.
- Not a substitute for human QC on edge cases. The system is designed to scale review, not to replace the final sign-off on high-stakes deliverables.
Print-on-demand extension
A separate addendum extends the framework with print-specific quality metrics: CMYK gamut warnings, ink coverage limits, transparency edge fringing, design placement safety, and pre-generation input validation.
Read the POD addendum

Pilot engagements
Phase 1 (Technical Delivery) and Phase 1b (Identity Consistency) are in active build. We're scoping a small number of pilot engagements with production studios, agencies, and POD operators for the second half of 2026.
Get in touch