Deterministic Label IDs for Synthetic Datasets

A taxonomy-first label registry that assigns stable IDs (and palettes) so RGB, segmentation, and LiDAR agree across builds, scenes, and clients.

Full article will drop soon

Deterministic IDs, across every sensor

If label IDs change between builds, datasets become hard to merge: metrics drift, training labels break, and downstream tooling needs constant re-mapping. So we treat label IDs as an API: once an ID is published, its meaning should not change.

At SiRLab, every class is defined once in a hierarchical tag registry (for example environment.weather.fog, static.infrastructure.automotive.road.asphalt, dynamic.vehicle.automotive.car). A build step compiles that registry into deterministic artifacts that every sensor and client uses.
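As a rough illustration of why dotted tags are convenient, the sketch below derives a tag's ancestor chain and tests membership in a subtree using plain string operations. The helper names `ancestors` and `is_descendant` are hypothetical and not part of the registry; only the example tag comes from the text above.

```python
def ancestors(tag: str) -> list[str]:
    """Return every dotted prefix of a tag, from its category down to the tag itself."""
    parts = tag.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def is_descendant(tag: str, ancestor: str) -> bool:
    """True if `tag` equals `ancestor` or lives underneath it in the hierarchy."""
    # The trailing dot avoids false matches such as "dynamic.vehicles" under "dynamic.vehicle".
    return tag == ancestor or tag.startswith(ancestor + ".")
```

With helpers like these, a consumer could ask whether dynamic.vehicle.automotive.car falls under dynamic.vehicle without keeping a separate parent table.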

Design goals

Stable IDs across exports and versions
Hierarchical taxonomy, not a flat list
Deterministic ordering + hashing for change detection
Safe evolution (ignored/deprecated) without breaking mappings
Palette QA (duplicate / too-close colors)
Portable mapping shipped as JSON for external consumers
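To make the palette QA goal concrete, here is a minimal sketch of a duplicate / too-close color check. It assumes sRGB palette entries, the CIE76 form of DeltaE (plain Euclidean distance in CIELAB, D65 white point), and an arbitrary threshold; the article does not say which DeltaE formula or threshold the real pipeline uses, and the function names are hypothetical.

```python
import math
from itertools import combinations

RGB = tuple[int, int, int]


def _srgb_to_linear(channel: int) -> float:
    """Undo the sRGB transfer curve for one 8-bit channel."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def _lab_f(t: float) -> float:
    """Piecewise cube-root function used by the CIELAB definition."""
    delta = 6.0 / 29.0
    return t ** (1.0 / 3.0) if t > delta ** 3 else t / (3.0 * delta ** 2) + 4.0 / 29.0


def rgb_to_lab(rgb: RGB) -> tuple[float, float, float]:
    """Convert 8-bit sRGB to CIELAB relative to the D65 white point."""
    r, g, b = (_srgb_to_linear(v) for v in rgb)
    x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.95047
    y = (0.2126729 * r + 0.7151522 * g + 0.0721750 * b) / 1.00000
    z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.08883
    fx, fy, fz = _lab_f(x), _lab_f(y), _lab_f(z)
    return 116.0 * fy - 16.0, 500.0 * (fx - fy), 200.0 * (fy - fz)


def delta_e76(a: RGB, b: RGB) -> float:
    """CIE76 color difference (an assumption; other DeltaE variants exist)."""
    return math.dist(rgb_to_lab(a), rgb_to_lab(b))


def palette_issues(palette: dict[str, RGB], min_delta_e: float = 10.0):
    """List tag pairs whose colors are identical or closer than the threshold."""
    issues = []
    for tag_a, tag_b in combinations(sorted(palette), 2):
        distance = delta_e76(palette[tag_a], palette[tag_b])
        if distance < min_delta_e:
            kind = "duplicate" if palette[tag_a] == palette[tag_b] else "too_close"
            issues.append((tag_a, tag_b, kind, round(distance, 2)))
    return issues
```

The pairwise loop is quadratic in palette size, which is usually acceptable for a build-time check over a few hundred classes. Iterating over sorted tags keeps the report order deterministic.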

Pipeline summary (high level)

Define the taxonomy as dotted tags grouped by categories like environment, vegetation, static, dynamic, fx, meta.
Compile to a canonical ordered list:
- id = 0 reserved for a root tag (for example root)
- IDs assigned deterministically per category (optionally sorted within-category)
- per-category reserved tails for future growth without renumbering everything
Validate:
- unique tags
- palette colors with no exact duplicates and no perceptually “too close” pairs (measured with DeltaE), so the palette stays usable
- overrides for ignored / deprecated tags so the taxonomy can evolve without churn
Emit artifacts:
- registry JSON (schema, generator version, hashes) + a fast labels_index map
- optional palette LUT texture (for stencil/palette QA and decoding)
- optional Unreal gameplay-tag codegen so runtime and exports share the same taxonomy
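The compile and emit steps could look roughly like the sketch below. It is not the actual generator: the fixed block size per category, the JSON field names other than labels_index, the direction of labels_index (tag to ID), and the use of SHA-256 over canonical JSON are all assumptions made for illustration.

```python
import hashlib
import json

# Category order taken from the taxonomy description; treating it as fixed is an assumption.
CATEGORY_ORDER = ["environment", "vegetation", "static", "dynamic", "fx", "meta"]
BLOCK_SIZE = 32  # hypothetical number of IDs reserved per category


def compile_registry(tags, block_size=BLOCK_SIZE, sort_within_category=True):
    """Assign IDs per category block and return a registry dict with a content hash."""
    if len(set(tags)) != len(tags):
        raise ValueError("duplicate tags in taxonomy")
    unknown = sorted({t.split(".", 1)[0] for t in tags} - set(CATEGORY_ORDER))
    if unknown:
        raise ValueError(f"unknown categories: {unknown}")

    labels = [{"id": 0, "tag": "root"}]  # id 0 is reserved for the root tag
    for index, category in enumerate(CATEGORY_ORDER):
        members = [t for t in tags if t.split(".", 1)[0] == category]
        if sort_within_category:
            members.sort()
        if len(members) > block_size:
            raise ValueError(f"category {category!r} overflows its ID block")
        # Unused IDs at the end of each block act as the reserved tail.
        base = 1 + index * block_size
        labels.extend({"id": base + offset, "tag": tag} for offset, tag in enumerate(members))

    # Canonical JSON (sorted keys, fixed separators) so the hash depends only on content.
    canonical = json.dumps(labels, sort_keys=True, separators=(",", ":"))
    return {
        "schema": 1,  # hypothetical schema marker
        "labels": labels,
        "labels_index": {label["tag"]: label["id"] for label in labels},
        "hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }
```

In this sketch, adding a tag to one category never moves IDs in another category, because each category owns its own block. Sorting within a category makes the result independent of input order, but a new tag that sorts before existing ones would shift IDs inside that category; appending in declaration order (sort_within_category=False) trades order independence for stability within the category.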

How it reaches clients (runtime)

Registry JSON can be published to shared memory (JLBLJSON) so external clients can map IDs → taxonomy without shipping code updates.
Static ground-truth (STATICGT) provides world-anchored labeled objects (id, instance, OBB, radius) for QA and evaluation.
Sensors publish payloads (LiDAR, cameras) with label IDs derived in the render path; clients consume IDs + the registry together.
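On the client side, consuming IDs together with the registry might look like the following sketch. The JSON layout shown here is a guess: it assumes labels_index maps tag to ID, the example IDs are invented, and reading the JSON out of shared memory is left out entirely.

```python
import json
from collections import Counter

# Hypothetical registry excerpt; real field layout and ID values may differ.
EXAMPLE_REGISTRY_JSON = """
{
  "labels_index": {
    "root": 0,
    "environment.weather.fog": 1,
    "dynamic.vehicle.automotive.car": 97
  }
}
"""


def load_id_to_tag(registry_json: str) -> dict[int, str]:
    """Invert the tag -> ID index into an ID -> tag lookup for decoding payloads."""
    registry = json.loads(registry_json)
    id_to_tag = {}
    for tag, label_id in registry["labels_index"].items():
        if label_id in id_to_tag:
            raise ValueError(f"ID {label_id} maps to more than one tag")
        id_to_tag[label_id] = tag
    return id_to_tag


def count_by_category(label_ids, id_to_tag) -> Counter:
    """Count label IDs per top-level category; IDs missing from the registry count as 'unknown'."""
    counts = Counter()
    for label_id in label_ids:
        tag = id_to_tag.get(label_id, "unknown")
        counts[tag.split(".", 1)[0]] += 1
    return counts
```

Because the client only depends on the published mapping, a registry update changes what the lookup returns without any change to this code.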

What we already ship

Stencil IDs captured in the render pass and carried into LiDAR outputs.
Per-point label IDs packed with XYZ so every return has a class.
Label registry JSON published to shared memory (JLBLJSON) for clients to map IDs → taxonomy.
Static ground-truth region (STATICGT) for world-anchored objects (id, instance, OBB + radius) and quick QA.
Deterministic sim-step seeds so label noise stays stable across runs.
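To illustrate what "label IDs packed with XYZ" can mean in practice, here is a small sketch of a per-point record. The layout (little-endian, three float32 coordinates followed by a uint16 label ID) is purely an assumption for the example; the actual LiDAR payload format is not described in this article.

```python
import struct

# Hypothetical point record: x, y, z as float32 plus a uint16 label ID, no padding.
POINT = struct.Struct("<fffH")


def pack_points(points) -> bytes:
    """Serialize (x, y, z, label_id) tuples into one contiguous byte string."""
    return b"".join(POINT.pack(x, y, z, label_id) for x, y, z, label_id in points)


def unpack_points(payload: bytes):
    """Decode a payload back into (x, y, z, label_id) tuples."""
    if len(payload) % POINT.size:
        raise ValueError("payload length is not a whole number of points")
    return list(POINT.iter_unpack(payload))
```

Keeping the label in the same record as the coordinates means a client never has to join two arrays to find the class of a return.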

What we’re adding next

Real JSON import for label assets (the import factory currently contains only a stub) so the engine can ingest registry updates directly.
Expanded validation rules (asset + export-time) so breaking taxonomy changes are caught before dataset export.
Coverage reporting for rare classes at dataset scale.

Full technical breakdown is on the way.
