Labels, Datasets, Quality
Dec 20, 2025 | 4 min | In review

Deterministic Label IDs for Synthetic Datasets

A taxonomy-first label registry that assigns stable IDs (and palettes) so RGB, segmentation, and LiDAR agree across builds, scenes, and clients.

[Figure: semantic labels in an urban scene]

Deterministic IDs, across every sensor

If label IDs change between builds, datasets become hard to merge: metrics drift, training labels break, and downstream tooling needs constant re-mapping. We therefore treat label IDs as an API that stays stable across exports and versions.

At SiRLab, every class is defined once in a hierarchical tag registry (for example environment.weather.fog, static.infrastructure.automotive.road.asphalt, dynamic.vehicle.automotive.car). A build step compiles that registry into deterministic artifacts that every sensor and client uses.
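
As a rough illustration of what a dotted tag gives you over a flat list, the sketch below derives the category and the ancestor chain from a tag string. The helper names (`ancestors`, `category`, `is_a`) are hypothetical and not part of the SiRLab tooling; only the example tags come from the text above.

```python
def ancestors(tag: str) -> list[str]:
    """Return every prefix of a dotted tag, from the category down to the tag itself."""
    parts = tag.split(".")
    return [".".join(parts[: i + 1]) for i in range(len(parts))]


def category(tag: str) -> str:
    """Treat the first segment as the top-level category (an assumption of this sketch)."""
    return tag.split(".", 1)[0]


def is_a(tag: str, parent: str) -> bool:
    """True if `tag` equals `parent` or sits below it in the hierarchy."""
    return tag == parent or tag.startswith(parent + ".")


tags = [
    "environment.weather.fog",
    "static.infrastructure.automotive.road.asphalt",
    "dynamic.vehicle.automotive.car",
]

print(ancestors("dynamic.vehicle.automotive.car"))
print([t for t in tags if is_a(t, "dynamic.vehicle")])
```

Because the hierarchy lives in the tag itself, a consumer can group or filter classes at any depth (for example, everything under dynamic.vehicle) without a separate parent table.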

Design goals

  • Stable IDs across exports and versions
  • Hierarchical taxonomy, not a flat list
  • Deterministic ordering + hashing for change detection
  • Safe evolution (tags can be marked ignored or deprecated) without breaking existing mappings
  • Palette QA (detect duplicate and too-close colors)
  • Portable mapping shipped as JSON for external consumers
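
The palette QA goal can be sketched as a pairwise color-distance check. This is a minimal example under stated assumptions: it uses the CIE76 form of DeltaE (plain Euclidean distance in CIELAB, D65 white point), and the threshold of 10.0 is an arbitrary illustrative value. The article does not say which DeltaE variant or threshold the real validator uses, and the function names are hypothetical.

```python
import math
from itertools import combinations


def _srgb_to_linear(c: int) -> float:
    """Undo the sRGB transfer curve for one 8-bit channel."""
    v = c / 255.0
    return v / 12.92 if v <= 0.04045 else ((v + 0.055) / 1.055) ** 2.4


def rgb_to_lab(rgb):
    """Convert an 8-bit sRGB triple to CIELAB (D65 reference white)."""
    r, g, b = (_srgb_to_linear(c) for c in rgb)
    # Linear sRGB -> XYZ, each axis normalised by the D65 white point.
    x = (0.4124564 * r + 0.3575761 * g + 0.1804375 * b) / 0.95047
    y = (0.2126729 * r + 0.7151522 * g + 0.0721750 * b) / 1.00000
    z = (0.0193339 * r + 0.1191920 * g + 0.9503041 * b) / 1.08883

    def f(t):
        d = 6 / 29
        return t ** (1 / 3) if t > d ** 3 else t / (3 * d * d) + 4 / 29

    fx, fy, fz = f(x), f(y), f(z)
    return (116 * fy - 16, 500 * (fx - fy), 200 * (fy - fz))


def delta_e76(rgb_a, rgb_b) -> float:
    """CIE76 color difference: Euclidean distance in CIELAB."""
    return math.dist(rgb_to_lab(rgb_a), rgb_to_lab(rgb_b))


def too_close(palette, threshold=10.0):
    """Return (tag_a, tag_b, distance) for every pair below the threshold.

    Exact duplicates show up with a distance of 0. The threshold is a
    hypothetical value chosen for this sketch.
    """
    pairs = []
    for (tag_a, rgb_a), (tag_b, rgb_b) in combinations(sorted(palette.items()), 2):
        d = delta_e76(rgb_a, rgb_b)
        if d < threshold:
            pairs.append((tag_a, tag_b, round(d, 2)))
    return pairs


palette = {"a_red": (255, 0, 0), "b_red": (254, 0, 0), "c_blue": (0, 0, 255)}
print(too_close(palette))
```

A perceptual metric matters here because two colors can differ in RGB yet be indistinguishable to a reviewer inspecting a segmentation overlay.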

Pipeline summary (high level)

  1. Define the taxonomy as dotted tags grouped by categories like environment, vegetation, static, dynamic, fx, meta.
  2. Compile to a canonical ordered list:
    • id = 0 reserved for a root tag (for example root)
    • IDs assigned deterministically per category (optionally sorted within-category)
    • per-category reserved tails for future growth without renumbering everything
  3. Validate:
    • unique tags
    • no exact color collisions and no perceptually “too close” colors (checked with DeltaE), so the palette stays usable
    • overrides that mark tags as ignored or deprecated, so the taxonomy can evolve without churn
  4. Emit artifacts:
    • registry JSON (schema, generator version, hashes) + a fast labels_index map
    • optional palette LUT texture (for stencil/palette QA and decoding)
    • optional Unreal gameplay-tag codegen so runtime and exports share the same taxonomy
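
The compile and emit steps might look roughly like the sketch below. It assumes one simple way to realise "reserved tails": each category gets a fixed-size block of IDs, so appending a tag to one category never shifts the IDs of later categories. The block size, the output field names other than labels_index, and the SHA-256 hash over canonical JSON are all assumptions of this example, not the actual generator.

```python
import hashlib
import json

ROOT_TAG = "root"  # id 0 is reserved for the root tag


def compile_registry(categories: dict[str, list[str]], block_size: int = 8) -> dict:
    """Assign IDs per category in declared order and hash the result.

    `categories` maps category name -> tags; dict insertion order is the
    category order. `block_size` is a hypothetical per-category capacity
    whose unused slots act as the reserved tail.
    """
    labels = [{"id": 0, "tag": ROOT_TAG}]
    seen = {ROOT_TAG}
    for index, (name, tags) in enumerate(categories.items()):
        if len(tags) > block_size:
            raise ValueError(f"category {name!r} exceeds its reserved block")
        start = 1 + index * block_size
        for offset, tag in enumerate(tags):
            if tag in seen:
                raise ValueError(f"duplicate tag: {tag}")
            seen.add(tag)
            labels.append({"id": start + offset, "tag": tag})

    # Canonical serialisation: sorted keys, no whitespace, so equal
    # registries always produce the same hash.
    canonical = json.dumps(labels, sort_keys=True, separators=(",", ":"))
    return {
        "labels": labels,
        "labels_index": {entry["tag"]: entry["id"] for entry in labels},
        "hash": hashlib.sha256(canonical.encode("utf-8")).hexdigest(),
    }


v1 = compile_registry({
    "environment": ["environment.weather.fog"],
    "dynamic": ["dynamic.vehicle.automotive.car"],
})
v2 = compile_registry({
    "environment": ["environment.weather.fog", "environment.weather.rain"],
    "dynamic": ["dynamic.vehicle.automotive.car"],
})

print(v1["labels_index"])
print(v2["labels_index"])
print(v1["hash"] != v2["hash"])
```

In this sketch, new tags are appended within a category rather than re-sorted, since re-sorting an existing category would renumber its members. The hash changes whenever any tag or ID changes, which is what makes it useful for change detection.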

How it reaches clients (runtime)

  • Registry JSON can be published to shared memory (JLBLJSON) so external clients can map IDs → taxonomy without shipping code updates.
  • Static ground-truth (STATICGT) provides world-anchored labeled objects (id, instance, OBB, radius) for QA and evaluation.
  • Sensors publish payloads (LiDAR, cameras) with label IDs derived in the render path; clients consume IDs + the registry together.
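
On the client side, consuming IDs together with the registry could be as simple as the sketch below. It assumes the registry JSON exposes labels_index as a tag → ID map and that unknown IDs fall back to the root tag. Both are assumptions, and how the bytes are read from the JLBLJSON region is deliberately left out because the article does not describe it.

```python
import json


def load_id_to_tag(registry_bytes: bytes) -> dict[int, str]:
    """Invert the (assumed) tag -> ID labels_index into an ID -> tag lookup."""
    registry = json.loads(registry_bytes.decode("utf-8"))
    return {int(label_id): tag for tag, label_id in registry["labels_index"].items()}


def decode(label_ids, id_to_tag, unknown="root"):
    """Map raw label IDs from a sensor payload to taxonomy tags."""
    return [id_to_tag.get(label_id, unknown) for label_id in label_ids]


# Stand-in for bytes a client would obtain from the published registry.
registry_bytes = json.dumps({
    "labels_index": {
        "root": 0,
        "environment.weather.fog": 1,
        "dynamic.vehicle.automotive.car": 2,
    }
}).encode("utf-8")

id_to_tag = load_id_to_tag(registry_bytes)
print(decode([2, 1, 0, 99], id_to_tag))
```

Because the mapping arrives as data, a client built against an older taxonomy can still decode a newer build's labels without a code update.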

What we already ship

  • Stencil IDs captured in the render pass and carried into LiDAR outputs.
  • Per-point label IDs packed with XYZ so every return has a class.
  • Label registry JSON published to shared memory (JLBLJSON) so clients can map IDs → taxonomy.
  • Static ground-truth region (STATICGT) for world-anchored objects (id, instance, OBB + radius) and quick QA.
  • Deterministic sim-step seeds so label noise stays stable across runs.
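
To make "per-point label IDs packed with XYZ" concrete, here is one possible record layout: three little-endian float32 coordinates followed by a uint32 label ID, 16 bytes per return. This layout is purely an assumption for illustration, and the actual SiRLab LiDAR payload format may differ.

```python
import struct

# Hypothetical record: x, y, z as float32 plus a uint32 label ID (16 bytes).
POINT = struct.Struct("<fffI")


def pack_points(points) -> bytes:
    """Serialise (x, y, z, label_id) tuples into one contiguous buffer."""
    return b"".join(POINT.pack(x, y, z, label_id) for x, y, z, label_id in points)


def unpack_points(buffer: bytes):
    """Decode a buffer produced by pack_points back into tuples."""
    return list(POINT.iter_unpack(buffer))


points = [(1.0, 2.5, -0.25, 9), (0.5, 0.0, 4.0, 1)]
buffer = pack_points(points)

print(len(buffer))
print(unpack_points(buffer))
```

Keeping the label in the same record as the coordinates means a client never has to join a separate label array against the point cloud.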

What we’re adding next

  • Real JSON ingestion for label assets (currently a stub in the import factory) so the engine can ingest registry updates directly.
  • Expanded validation rules (asset + export-time) so breaking taxonomy changes are caught before dataset export.
  • Coverage reporting for rare classes at dataset scale.
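
Coverage reporting is still planned, so the following is only one possible shape for it: count label IDs across all frames and flag classes whose share of samples falls below a cutoff. The function name, the report fields, and the 1% default cutoff are hypothetical.

```python
from collections import Counter


def coverage_report(frames, id_to_tag, min_share=0.01):
    """Summarise how often each registered class appears across a dataset.

    `frames` is an iterable of label-ID sequences (per-pixel or per-point).
    Classes that never appear are reported with a count of 0.
    """
    counts = Counter()
    for frame in frames:
        counts.update(frame)
    total = sum(counts.values())

    report = []
    for label_id, tag in sorted(id_to_tag.items()):
        n = counts.get(label_id, 0)
        share = n / total if total else 0.0
        report.append({
            "id": label_id,
            "tag": tag,
            "count": n,
            "share": share,
            "rare": share < min_share,
        })
    return report


frames = [[1, 1, 1, 2], [1, 1, 1, 1]]
id_to_tag = {1: "static.infrastructure.automotive.road.asphalt",
             2: "dynamic.vehicle.automotive.car",
             3: "environment.weather.fog"}

for row in coverage_report(frames, id_to_tag, min_share=0.2):
    print(row)
```

Iterating over the registry rather than over the observed IDs is what surfaces classes that are defined in the taxonomy but absent from the dataset.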

Full technical breakdown is on the way.

© 2025–2026 SiRLab. All rights reserved.