Guidance

How to use HFA datasets

Harmonic Frontier Audio datasets are designed for teams building, evaluating, and deploying modern audio and multimodal AI systems. This page explains how HFA data is structured, how it fits into common workflows, and how to choose the right starting point.

Not a sound library. HFA datasets are engineered for model training and evaluation.
Structure matters. File organization, metadata, and labeling are part of the product.
Designed for defensibility. Rights clearance and provenance are built in from capture.

What you receive

HFA deliveries are designed as layered infrastructure. Foundations provides the recordings and baseline structure. Proteus Standard™ adds defensibility and audit-ready integrity artifacts. Orpheus Suite adds modeling-optimized enrichment for teams who need instruction-ready and multimodal-aligned structure.

Always included
Foundations

Rights-cleared performance audio plus consistent organization and core metadata—built for repeatable ingestion and evaluation.

  • Clean exports and predictable foldering
  • Baseline labeling and dataset descriptors
  • Recording notes and QC context (dataset-dependent)
Integrity spine
Proteus Standard™

Provenance and verification artifacts that help teams defend dataset origin, integrity, and delivery history under review.

  • Hashing and signed delivery manifests (as applicable; a verification sketch follows this list)
  • Versioning support for iterative drops
  • Fingerprint references for audit/dispute workflows
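
As a concrete illustration, a minimal verification pass might look like the sketch below. It re-hashes delivered files against a manifest before ingestion. The manifest filename (delivery_manifest.json) and its JSON layout are assumptions for illustration only, not a documented HFA format; check your delivery documentation for the actual artifact names.

```python
import hashlib
import json
from pathlib import Path

def sha256_of(path: Path) -> str:
    """Stream the file through SHA-256 so large audio files never load whole."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            digest.update(chunk)
    return digest.hexdigest()

def verify_delivery(root: Path) -> list[str]:
    """Return paths whose on-disk hash disagrees with the manifest entry.

    The manifest name and {"files": [{"path": ..., "sha256": ...}]} layout
    are hypothetical placeholders, not a documented HFA schema.
    """
    manifest = json.loads((root / "delivery_manifest.json").read_text())
    return [
        entry["path"]
        for entry in manifest["files"]
        if sha256_of(root / entry["path"]) != entry["sha256"]
    ]

mismatches = verify_delivery(Path("hfa_delivery"))
print("delivery verified" if not mismatches else f"hash mismatches: {mismatches}")
```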
Enterprise add-on
Orpheus Suite

Modeling-optimized enrichment: instruction-style examples, aligned metadata fields, and export formats designed for modern pipelines.

  • Instruction pairs (prompt → expected behavior)
  • Multimodal alignment fields (audio ↔ segments)
  • Optional JSONL/Parquet exports for training (a reading sketch follows this list)
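
To make the export shape concrete, here is a minimal reading sketch. The field names (prompt, expected, audio, segments) and the file name are assumptions for illustration; actual schemas are dataset-defined and documented per delivery.

```python
import json
from pathlib import Path

# Hypothetical record shape for a JSONL export (field names are placeholders):
# {"prompt": "...", "expected": "...", "audio": "clips/0001.wav",
#  "segments": [{"start_s": 0.0, "end_s": 2.4, "label": "pizzicato"}]}

def iter_instruction_pairs(path: Path):
    """Yield (prompt, expected) pairs from a JSONL file, one record per line."""
    with path.open() as f:
        for line in f:
            record = json.loads(line)
            yield record["prompt"], record["expected"]

for prompt, expected in iter_instruction_pairs(Path("orpheus_export.jsonl")):
    print(f"{prompt!r} -> {expected!r}")
```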
Important note

These layers describe delivery packaging. Your licensing tier (Research, Startup, Enterprise) describes usage scope and governance needs. If you're unsure, start with your intended use case and the smallest relevant set of datasets; HFA can recommend a path.

Data structure & ingestion

HFA datasets are organized to be predictable for engineering teams: clean audio exports, consistent foldering, and structured metadata designed to support filtering, batching, and repeatable evaluation. Exact fields can vary by dataset, but the structure is designed to be pipeline-friendly.

What you’ll typically see
  • Audio exported in clean, model-friendly formats (dataset-defined).
  • Metadata describing instrument, technique, articulation, and capture context.
  • Documentation covering recording setup, QC notes, and intended use guidance.
  • Integrity artifacts (Proteus Standard™) for verification and audit workflows.
How teams integrate HFA data
  • Ingest metadata into a table store (Parquet/CSV) for filtering and sampling (see the sketch after this list).
  • Build train/eval slices by technique, articulation, and recording context.
  • Use evaluation subsets to track controllability and adherence over time.
  • Keep manifests and hashes alongside data for governance review.
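
A minimal version of that workflow, assuming a Parquet metadata export and placeholder column names (technique, room, audio_path) that will vary by dataset, might look like this:

```python
import pandas as pd

# File and column names below are placeholders; the real metadata fields
# vary by dataset and are listed in the delivery documentation.
meta = pd.read_parquet("hfa_metadata.parquet")

# Select a narrow, well-scoped slice before touching any audio.
slice_df = meta[(meta["technique"] == "tremolo") & (meta["room"] == "studio_a")]

# Deterministic split so the eval slice stays stable across runs.
eval_df = slice_df.sample(frac=0.1, random_state=7)
train_df = slice_df.drop(eval_df.index)

audio_paths = train_df["audio_path"].tolist()  # load audio only for this slice
print(f"{len(train_df)} training rows, {len(eval_df)} eval rows")
```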
Practical recommendation

Start with a small, well-scoped slice. Validate ingestion, labeling fit, and model behavior first—then expand coverage once you know what “success” looks like in your system.

Common pitfalls & best practices

HFA datasets are designed for model development and evaluation. The practices below help teams get value quickly while avoiding common mistakes that slow progress or create unnecessary friction later.

Treating HFA as a sample library

HFA datasets are not intended for browsing or ad-hoc sound selection. They are structured collections meant for systematic ingestion and evaluation.

Best practice: Ingest metadata first, then select audio programmatically.
Skipping evaluation subsets

Training without stable evaluation slices makes it difficult to measure controllability, regressions, or improvements over time.

Best practice: Hold out technique- and articulation-balanced evaluation sets, as in the sketch below.
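
One way to build such a set, sketched here with the same hypothetical metadata table and placeholder column names as above, is to cap each (technique, articulation) cell at a fixed size so rare styles are not drowned out by common ones:

```python
import pandas as pd

meta = pd.read_parquet("hfa_metadata.parquet")  # placeholder file/column names

# Sample up to N rows per (technique, articulation) cell so the eval set stays
# balanced even when raw coverage is skewed toward common playing styles.
PER_CELL = 20
eval_df = (
    meta.groupby(["technique", "articulation"], group_keys=False)
        .apply(lambda g: g.sample(n=min(PER_CELL, len(g)), random_state=7))
)
train_df = meta.drop(eval_df.index)
```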
Mixing licensing scopes

Using data beyond the scope of your licensing tier can create confusion during audits or future diligence.

Best practice: Align dataset usage with your declared use case and tier.
Discarding manifests and hashes

Integrity artifacts are easy to ignore early—but become critical if questions arise later.

Best practice: Store manifests and hashes alongside your data in version control.
Overbuying too early

Acquiring more coverage than you can meaningfully evaluate can slow iteration and increase noise.

Best practice: Start narrow, validate behavior, then expand deliberately.