Labels · Datasets · Quality
Dec 20, 2025 • 4 min • In review
Deterministic Label IDs for Synthetic Datasets
A taxonomy-first label registry that assigns stable IDs (and palettes) so RGB, segmentation, and LiDAR agree across builds, scenes, and clients.

Deterministic IDs, across every sensor
If label IDs change between builds, datasets become hard to merge: metrics drift, training labels break, and downstream tooling needs constant re-mapping. We treat label IDs as an API.
At SiRLab, every class is defined once in a hierarchical tag registry (for example environment.weather.fog, static.infrastructure.automotive.road.asphalt, dynamic.vehicle.automotive.car). A build step compiles that registry into deterministic artifacts that every sensor and client uses.
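To make the hierarchy concrete, here is a minimal sketch of how a dotted tag can be split into its category and its chain of ancestors. The helper names `category_of` and `tag_ancestors` are hypothetical and not part of the SiRLab tooling; only the example tags come from the text above.

```python
def category_of(tag: str) -> str:
    """Return the top-level category, i.e. the first dotted segment."""
    return tag.split(".", 1)[0]


def tag_ancestors(tag: str) -> list[str]:
    """Return every prefix of the tag, from the category down to the tag itself."""
    parts = tag.split(".")
    return [".".join(parts[:i]) for i in range(1, len(parts) + 1)]


if __name__ == "__main__":
    tag = "dynamic.vehicle.automotive.car"
    print(category_of(tag))
    for ancestor in tag_ancestors(tag):
        print(ancestor)
```

Because each tag carries its full path, a consumer can roll a fine-grained class up to any coarser level without a separate parent table.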
Design goals
- Stable IDs across exports and versions
- Hierarchical taxonomy, not a flat list
- Deterministic ordering + hashing for change detection
- Safe evolution (ignored/deprecated) without breaking mappings
- Palette QA (duplicate / too-close colors)
- Portable mapping shipped as JSON for external consumers
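The palette QA goal can be illustrated with a small check for duplicate and perceptually close colors. This is a sketch under stated assumptions: it uses the CIE76 Delta E formula (Euclidean distance in CIELAB, D65 white point), while the source only says "DeltaE" and may use a different variant. The threshold `min_delta_e = 2.0`, the function names, and the example colors are all hypothetical.

```python
from itertools import combinations
from math import sqrt

RGB = tuple[int, int, int]


def _srgb_to_linear(channel: int) -> float:
    """Undo the sRGB transfer curve for one 8-bit channel."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t: float) -> float:
    """Piecewise cube-root function used by the CIELAB definition."""
    delta = 6 / 29
    return t ** (1 / 3) if t > delta ** 3 else t / (3 * delta ** 2) + 4 / 29


def rgb_to_lab(rgb: RGB) -> tuple[float, float, float]:
    """Convert 8-bit sRGB to CIELAB (D65 reference white)."""
    r, g, b = (_srgb_to_linear(v) for v in rgb)
    x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.95047
    y = (0.2126729 * r + 0.7151522 * g + 0.0721750 * b) / 1.00000
    z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.08883
    fx, fy, fz = _lab_f(x), _lab_f(y), _lab_f(z)
    return 116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz)


def delta_e76(a: RGB, b: RGB) -> float:
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return sqrt(sum((p - q) ** 2 for p, q in zip(rgb_to_lab(a), rgb_to_lab(b))))


def find_palette_issues(palette: dict[str, RGB], min_delta_e: float = 2.0):
    """Return (tag_a, tag_b, kind) for duplicate or too-close color pairs.

    Pairs are visited in sorted tag order so the report is deterministic.
    """
    issues = []
    for (tag_a, rgb_a), (tag_b, rgb_b) in combinations(sorted(palette.items()), 2):
        if rgb_a == rgb_b:
            issues.append((tag_a, tag_b, "duplicate"))
        elif delta_e76(rgb_a, rgb_b) < min_delta_e:
            issues.append((tag_a, tag_b, "too_close"))
    return issues


if __name__ == "__main__":
    # Hypothetical colors, chosen only to trigger both kinds of issue.
    palette = {
        "environment.weather.fog": (128, 128, 128),
        "static.infrastructure.automotive.road.asphalt": (129, 129, 129),
        "dynamic.vehicle.automotive.car": (255, 0, 0),
    }
    for issue in find_palette_issues(palette):
        print(issue)
```

The pairwise scan is quadratic in the number of classes, which is usually acceptable for a palette of a few hundred entries checked once per build.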
Pipeline summary (high level)
- Define the taxonomy as dotted tags grouped by categories like environment, vegetation, static, dynamic, fx, meta.
- Compile to a canonical ordered list:
  - id = 0 reserved for a root tag (for example root)
  - IDs assigned deterministically per category (optionally sorted within-category)
  - per-category reserved tails for future growth without renumbering everything
- Validate:
  - unique tags
  - exact and perceptual “too close” color collisions (DeltaE) so the palette stays usable
  - overrides for ignored/deprecated tags so the taxonomy can evolve without churn
- Emit artifacts:
  - registry JSON (schema, generator version, hashes) + a fast labels_index map
  - optional palette LUT texture (for stencil/palette QA and decoding)
  - optional Unreal gameplay-tag codegen so runtime and exports share the same taxonomy
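The compile step can be sketched as follows. This is one possible scheme, not the actual SiRLab generator: it assumes each category owns a fixed-size block of IDs, so the unused end of each block acts as the reserved tail. `CATEGORY_ORDER` follows the categories named above, while `BLOCK_SIZE`, `compile_registry`, `registry_hash`, and the output field names are hypothetical.

```python
import hashlib
import json

CATEGORY_ORDER = ["environment", "vegetation", "static", "dynamic", "fx", "meta"]
BLOCK_SIZE = 32  # hypothetical capacity per category; unused slots form the tail


def compile_registry(tags, block_size=BLOCK_SIZE, sort_within_category=False):
    """Assign deterministic IDs: 0 for root, then one fixed block per category."""
    if len(set(tags)) != len(tags):
        raise ValueError("duplicate tags in taxonomy")
    for tag in tags:
        if tag.split(".", 1)[0] not in CATEGORY_ORDER:
            raise ValueError(f"unknown category for tag: {tag}")

    labels = [{"id": 0, "tag": "root"}]
    for index, category in enumerate(CATEGORY_ORDER):
        members = [t for t in tags if t.split(".", 1)[0] == category]
        if sort_within_category:
            members = sorted(members)
        if len(members) > block_size:
            raise ValueError(f"category {category} exceeds its reserved block")
        base = 1 + index * block_size  # first ID of this category's block
        for offset, tag in enumerate(members):
            labels.append({"id": base + offset, "tag": tag})
    return labels


def registry_hash(labels) -> str:
    """Hash a canonical JSON encoding so any change to the mapping is detectable."""
    canonical = json.dumps(labels, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    tags = [
        "environment.weather.fog",
        "static.infrastructure.automotive.road.asphalt",
        "dynamic.vehicle.automotive.car",
    ]
    labels = compile_registry(tags)
    for entry in labels:
        print(entry["id"], entry["tag"])
    print(registry_hash(labels))
```

In this sketch, appending a tag to a category takes the next free slot in that category's block and leaves every existing ID untouched, while the hash still changes. Sorting within a category gives a reproducible order from an unordered input, but inserting a tag that sorts before existing ones would renumber its category, which may be why the source treats sorting as optional.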
How it reaches clients (runtime)
- Registry JSON can be published to shared memory (JLBLJSON) so external clients can map IDs → taxonomy without shipping code updates.
- Static ground-truth (STATICGT) provides world-anchored labeled objects (id, instance, OBB, radius) for QA and evaluation.
- Sensors publish payloads (LiDAR, cameras) with label IDs derived in the render path; clients consume IDs + the registry together.
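On the client side, consuming IDs together with the registry could look like the sketch below. The JSON layout is an assumption made for illustration: the source names a labels_index map but does not publish its schema, so the `labels` list, the tag → id direction of `labels_index`, and the helper names are hypothetical. Reading the JLBLJSON shared-memory region is left out because its layout is not described; the sketch starts from the JSON text.

```python
import json

# Hypothetical registry document; only the tag names come from the article.
EXAMPLE_REGISTRY_JSON = """
{
  "schema": 1,
  "labels": [
    {"id": 0, "tag": "root"},
    {"id": 1, "tag": "environment.weather.fog"},
    {"id": 97, "tag": "dynamic.vehicle.automotive.car"}
  ],
  "labels_index": {
    "root": 0,
    "environment.weather.fog": 1,
    "dynamic.vehicle.automotive.car": 97
  }
}
"""


def load_registry(text: str):
    """Parse the registry and return (id_to_tag, tag_to_id) lookup tables."""
    doc = json.loads(text)
    id_to_tag = {entry["id"]: entry["tag"] for entry in doc["labels"]}
    tag_to_id = dict(doc["labels_index"])
    return id_to_tag, tag_to_id


def decode_ids(label_ids, id_to_tag, unknown="<unknown>"):
    """Map raw label IDs from a sensor payload to taxonomy tags."""
    return [id_to_tag.get(label_id, unknown) for label_id in label_ids]


if __name__ == "__main__":
    id_to_tag, tag_to_id = load_registry(EXAMPLE_REGISTRY_JSON)
    print(decode_ids([97, 1, 0, 500], id_to_tag))
    print(tag_to_id["dynamic.vehicle.automotive.car"])
```

Returning a placeholder for unknown IDs, rather than raising, lets a client keep running against a newer registry than it was tested with and report the gap explicitly.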
What we already ship
- Stencil IDs captured in the render pass and carried into LiDAR outputs.
- Per-point label IDs packed with XYZ so every return has a class.
- Label registry JSON published to shared memory (JLBLJSON) for clients to map IDs → taxonomy.
- Static ground-truth region (STATICGT) for world-anchored objects (id, instance, OBB + radius) and quick QA.
- Deterministic sim-step seeds so label noise stays stable across runs.
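As an illustration of the last item, one common way to get per-step determinism is to derive each step's seed from a base seed and the step index, so noise for a given step does not depend on how many random numbers earlier steps consumed. This is a generic sketch, not the SiRLab implementation; `step_seed`, `step_noise`, and the SHA-256 derivation are assumptions.

```python
import hashlib
import random


def step_seed(base_seed: int, step: int) -> int:
    """Derive a 64-bit seed for one simulation step from (base_seed, step)."""
    digest = hashlib.sha256(f"{base_seed}:{step}".encode("ascii")).digest()
    return int.from_bytes(digest[:8], "big")


def step_noise(base_seed: int, step: int, count: int) -> list[float]:
    """Draw `count` noise samples in [0, 1) that depend only on the seed and step."""
    rng = random.Random(step_seed(base_seed, step))
    return [rng.random() for _ in range(count)]


if __name__ == "__main__":
    first_run = step_noise(base_seed=7, step=3, count=4)
    second_run = step_noise(base_seed=7, step=3, count=4)
    print(first_run == second_run)
```

Using a cryptographic hash here is a convenience rather than a security measure: it is stable across platforms and Python versions, unlike the built-in `hash()` for strings.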
What we’re adding next
- Real JSON ingestion for label assets (currently a stub in the import factory) so the engine can ingest registry updates directly.
- Expanded validation rules (asset + export-time) so breaking taxonomy changes are caught before dataset export.
- Coverage reporting for rare classes at dataset scale.
Full technical breakdown is on the way.