TAD: Terrarum Advanced Audio to use with video compression

minjaesong
2025-10-23 18:56:57 +09:00
parent 6f669f4fd9
commit a9319fd812
10 changed files with 1887 additions and 22 deletions

@@ -965,6 +965,7 @@ transmission capability, and region-of-interest coding.
0x21: Zstd-compressed 8-bit PCM (32 KHz, audio hardware's native format)
0x22: Zstd-compressed 16-bit PCM (32 KHz, little endian)
0x23: Zstd-compressed ADPCM
0x24: Zstd-compressed TAD
<subtitles>
0x30: Subtitle in "Simple" format
0x31: Subtitle in "Karaoke" format
@@ -1065,6 +1066,13 @@ transmission capability, and region-of-interest coding.
uint32 Compressed Size
* Zstd-compressed Block Data
## TAD Packet Structure
uint8 Packet type (0x24)
uint32 Compressed Size + 2
uint16 Sample Count
uint32 Compressed Size
* Zstd-compressed TAD
## GOP Unified Packet Structure (0x12)
Implemented on 2025-10-15 for temporal 3D DWT with unified preprocessing.
@@ -1507,6 +1515,241 @@ Number|Index
4096|255
--------------------------------------------------------------------------------
TSVM Advanced Audio (TAD) Format
Created by CuriousTorvald and Claude on 2025-10-23
TAD is a perceptual audio codec for TSVM. It uses a Discrete Wavelet Transform (DWT)
with 4-tap interpolating Deslauriers-Dubuc wavelets and compresses audio through
M/S stereo decorrelation, frequency-dependent quantization, and significance map
encoding. It is designed as an includable API for integration with the TAV video encoder.
When used inside a video codec, only the Zstd-compressed payload is stored; the chunk
length is stored separately, and the quality index is shared with that of the video.
# Suggested File Structure
\x1F T S V M T A D
[HEADER]
[CHUNK 0]
[CHUNK 1]
[CHUNK 2]
...
## Header (18 bytes)
uint8 Magic[8]: "\x1FTSVMTAD" (0x1F followed by ASCII "TSVMTAD")
uint8 Version: 1
uint8 Quality Level: 0-5 (0=lowest quality/smallest, 5=highest quality/largest)
uint8 Flags:
- bit 0: Zstd compression enabled (1=compressed, 0=uncompressed)
- bits 1-7: Reserved (must be 0)
uint32 Sample Rate: audio sample rate in Hz (always 32000 for TSVM)
uint8 Channels: number of audio channels (always 2 for stereo)
uint8 Reserved[2]: fill with zeros
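The header layout can be sketched with Python's `struct` module. This is only an illustration: the helper names are hypothetical, and little-endian byte order is an assumption, since the text does not state it. Packed without padding, the fields as listed occupy 8+1+1+1+4+1+2 = 18 bytes.

```python
import struct

# Assumption: multi-byte fields are little-endian; the text does not state byte order.
TAD_HEADER = struct.Struct("<8sBBBIB2s")
TAD_MAGIC = b"\x1fTSVMTAD"  # 0x1F followed by ASCII "TSVMTAD" (8 bytes)


def pack_tad_header(quality, zstd=True, sample_rate=32000, channels=2):
    """Build a header from the listed fields (hypothetical helper)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality must be 0-5")
    flags = 0x01 if zstd else 0x00  # bit 0: Zstd enabled; bits 1-7 reserved
    return TAD_HEADER.pack(TAD_MAGIC, 1, quality, flags, sample_rate, channels, b"\x00\x00")


def unpack_tad_header(data):
    """Parse a header and return its fields as a dict (hypothetical helper)."""
    magic, version, quality, flags, rate, channels, _reserved = TAD_HEADER.unpack_from(data)
    if magic != TAD_MAGIC:
        raise ValueError("not a TAD file")
    return {"version": version, "quality": quality, "zstd": bool(flags & 1),
            "sample_rate": rate, "channels": channels}


header = pack_tad_header(quality=3)
fields = unpack_tad_header(header)
```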
## Audio Properties
- **Sample Rate**: 32000 Hz (TSVM audio hardware native format)
- **Channels**: 2 (stereo)
- **Input Format**: PCM16LE (16-bit signed little-endian PCM)
- **Preprocessing**: 16 Hz highpass filter applied during extraction
- **Internal Representation**: Signed PCM8 with error-diffusion dithering
- **Chunk Size**: Variable (1024-32768+ samples per channel, must be power of 2)
- Default: 32768 samples (1.024 seconds at 32 kHz)
- Minimum: 1024 samples (32 ms at 32 kHz)
- DWT levels calculated dynamically: log2(chunk_size) - 1
- **Target Compression**: 2:1 against PCMu8 baseline
## Chunk Structure
Each chunk encodes a variable number of stereo samples (a power of 2, minimum 1024).
The default is 32768 samples per channel (65536 samples across both channels,
1.024 seconds at 32 kHz). If the audio duration does not align to chunk boundaries,
the final chunk can use a smaller power-of-2 size or be zero-padded.
uint8 Significance Map Method: always 1 (2-bit twobitmap)
uint8 Compression Flag: 1=Zstd compressed, 0=uncompressed
uint16 Sample Count: number of samples per channel (must be power of 2, min 1024)
uint32 Chunk Payload Size: size of following payload in bytes
* Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set)
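A reader for the 8-byte chunk header might look like the following sketch. The function name is hypothetical and little-endian byte order is assumed. Zstd decompression is left to the caller, so the payload is returned as stored.

```python
import struct

# Assumption: little-endian byte order, which the text does not state explicitly.
# Fields: significance map method, compression flag, sample count, payload size.
CHUNK_HEADER = struct.Struct("<BBHI")


def read_chunk(buf, offset=0):
    """Return (header_dict, payload_bytes, next_offset) for the chunk at offset."""
    method, compressed, sample_count, payload_size = CHUNK_HEADER.unpack_from(buf, offset)
    if method != 1:
        raise ValueError("unsupported significance map method")
    if sample_count < 1024 or sample_count & (sample_count - 1):
        raise ValueError("sample count must be a power of 2, at least 1024")
    start = offset + CHUNK_HEADER.size
    end = start + payload_size
    if end > len(buf):
        raise ValueError("truncated chunk payload")
    header = {"compressed": bool(compressed), "sample_count": sample_count}
    return header, bytes(buf[start:end]), end


demo = CHUNK_HEADER.pack(1, 0, 1024, 4) + b"\xaa\xbb\xcc\xdd"
info, payload, next_offset = read_chunk(demo)
```

Returning `next_offset` lets a caller walk consecutive chunks in a file by feeding it back in as `offset`.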
### Chunk Payload Structure (before optional Zstd compression)
* Mid Channel Encoded Data
* Side Channel Encoded Data
### Encoded Channel Data (2-bit Twobitmap Significance Map)
uint8 Significance Map[(num_samples * 2 + 7) / 8] // 2 bits per coefficient
int16 Other Values[variable length] // Non-{-1,0,+1} values
#### 2-bit Twobitmap Encoding
Each DWT coefficient is encoded using 2 bits in the significance map:
- 00: coefficient is 0
- 01: coefficient is +1
- 10: coefficient is -1
- 11: coefficient is "other" (value stored in Other Values array)
This encoding exploits the sparsity of quantized DWT coefficients, where most
values are 0 or ±1 after quantization. "Other" values are stored sequentially
as int16 in the order they appear.
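The twobitmap scheme can be illustrated with a small encoder/decoder pair. Two details are assumptions that the text does not specify: codes are packed starting from the least significant bits of each byte, and "other" values are stored as little-endian int16. The function names are hypothetical.

```python
import struct

# Assumptions (not stated in the text): codes are packed starting at the least
# significant bits of each byte, and "other" values are little-endian int16.
CODE_ZERO, CODE_PLUS, CODE_MINUS, CODE_OTHER = 0b00, 0b01, 0b10, 0b11


def encode_twobitmap(coeffs):
    """Encode quantized coefficients as significance map followed by other values."""
    sigmap = bytearray((len(coeffs) * 2 + 7) // 8)
    others = []
    for i, c in enumerate(coeffs):
        if c == 0:
            code = CODE_ZERO
        elif c == 1:
            code = CODE_PLUS
        elif c == -1:
            code = CODE_MINUS
        else:
            code = CODE_OTHER
            others.append(c)
        sigmap[i // 4] |= code << ((i % 4) * 2)
    return bytes(sigmap) + struct.pack(f"<{len(others)}h", *others)


def decode_twobitmap(data, num_samples):
    """Inverse of encode_twobitmap; returns (coefficients, bytes_consumed)."""
    pos = (num_samples * 2 + 7) // 8  # other values start right after the map
    coeffs = []
    for i in range(num_samples):
        code = (data[i // 4] >> ((i % 4) * 2)) & 0b11
        if code == CODE_OTHER:
            coeffs.append(struct.unpack_from("<h", data, pos)[0])
            pos += 2
        else:
            coeffs.append((0, 1, -1)[code])
    return coeffs, pos


sample = [0, 1, -1, 7, 0, 0, -300, 1]
encoded = encode_twobitmap(sample)
decoded, used = decode_twobitmap(encoded, len(sample))
```

Because the Other Values array has variable length, the decoder reports how many bytes it consumed. In a chunk payload, that is where the Side channel data would begin after the Mid channel data.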
## Encoding Pipeline
### Step 1: PCM16 to PCM8 Conversion with Error-Diffusion Dithering
Input stereo PCM16LE is converted to signed PCM8 using error-diffusion dithering
to minimize quantization noise:
dithered_value = pcm16_value / 256 + error
pcm8_value = clamp(round(dithered_value), -128, 127)
error = dithered_value - pcm8_value
Error is propagated to the next sample (alternating between left/right channels).
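The three formulas above can be sketched as a loop. This is one possible reading rather than the reference implementation. It assumes the input is the interleaved L/R stream with a single error term carried from each sample to the next, and it assumes round-half-up rounding, since the text does not say which rounding mode is used.

```python
import math


def pcm16_to_pcm8_dithered(samples):
    """Convert PCM16 samples to signed PCM8 with first-order error diffusion.

    Assumption: `samples` is the interleaved L/R stream and one error term
    carries from each sample to the next one in that order.
    """
    out = []
    error = 0.0
    for s in samples:
        dithered = s / 256 + error
        # Assumed rounding: half up. Python's built-in round() rounds half to even.
        q = max(-128, min(127, math.floor(dithered + 0.5)))
        error = dithered - q
        out.append(q)
    return out


pcm8 = pcm16_to_pcm8_dithered([128, 128, 128, 128, -32768, 32767])
```

A constant input of 128 (exactly 0.5 in PCM8 units) alternates between 1 and 0, so the average level is preserved even though no single output sample can represent it.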
### Step 2: M/S Stereo Decorrelation
Mid-Side transformation exploits stereo correlation:
Mid = (Left + Right) / 2
Side = (Left - Right) / 2
This typically concentrates energy in the Mid channel while the Side channel
contains mostly small values, improving compression efficiency.
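A minimal sketch of the forward transform and its inverse (Left = Mid + Side, Right = Mid - Side) follows. It uses floating-point division so that the round trip is exact; whether the actual encoder uses integer or floating-point arithmetic here is not stated, so treat that as an assumption.

```python
def ms_encode(left, right):
    """Forward Mid/Side transform on paired sample lists."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Inverse transform: Left = Mid + Side, Right = Mid - Side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right


left, right = [10, -3, 100], [4, -2, 98]
mid, side = ms_encode(left, right)
restored = ms_decode(mid, side)
```

For strongly correlated channels such as the last pair (100, 98), Mid stays near the original level (99) while Side shrinks to 1.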
### Step 3: Variable-Level DD-4 DWT
Each channel (Mid and Side) undergoes Deslauriers-Dubuc 4-tap interpolating wavelet
decomposition. The number of DWT levels is calculated dynamically based on chunk size:
DWT Levels = log2(chunk_size) - 1
For the default 32768-sample chunks, this produces 14 levels with frequency subbands:
Level 0-13: High to low frequency coefficients
DC band: Low-frequency approximation coefficients
Subband boundaries are calculated dynamically:
first_band_size = chunk_size >> dwt_levels
sideband[0] = 0
sideband[1] = first_band_size
sideband[i+1] = sideband[i] + (first_band_size << (i-1)) // for i = 1 .. dwt_levels
For 32768 samples with 14 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768
For 1024 samples with 9 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
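The level count and boundary recurrence can be checked with a short script. The function names are hypothetical; the arithmetic follows the formulas above.

```python
def dwt_levels(chunk_size):
    """log2(chunk_size) - 1 for a power-of-2 chunk size."""
    if chunk_size < 1024 or chunk_size & (chunk_size - 1):
        raise ValueError("chunk size must be a power of 2, at least 1024")
    return chunk_size.bit_length() - 2


def subband_boundaries(chunk_size):
    """Boundary offsets from the recurrence above, lowest band first."""
    levels = dwt_levels(chunk_size)
    first_band_size = chunk_size >> levels
    bounds = [0, first_band_size]
    for i in range(1, levels + 1):
        bounds.append(bounds[i] + (first_band_size << (i - 1)))
    return bounds


small = subband_boundaries(1024)
default = subband_boundaries(32768)
```

The result always has `dwt_levels + 2` entries: one DC band of `first_band_size` coefficients followed by one band per level, each twice the size of the previous one.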
### Step 4: Frequency-Dependent Quantization
DWT coefficients are quantized using perceptually-tuned frequency-dependent weights:
Base Weights by Level:
Level 0 (16-8 kHz): 3.0
Level 1 (8-4 kHz): 2.0
Level 2 (4-2 kHz): 1.5
Level 3 (2-1 kHz): 1.0
Level 4 (1-0.5 kHz): 0.75
Level 5 (0.5-0.25 kHz): 0.5
Level 6-7 (DC-0.25 kHz): 0.25
Quality scaling factor: 1.0 + (5 - quality) * 0.3
Final quantization step: base_weight * quality_scale
#### Dead Zone Quantization
High-frequency coefficients (Level 0: 8-16 kHz) use dead zone quantization,
where coefficients whose magnitude is smaller than half the quantization step are zeroed:
if (abs(coefficient) < quantization_step / 2)
coefficient = 0
This aggressively removes high-frequency noise while preserving important
mid-frequency content (2-4 kHz, critical for speech intelligibility).
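The step calculation and dead zone can be sketched as follows. The function names are hypothetical. Only levels 0-7 are covered, because the weight table lists no others. Dividing by the step and rounding to the nearest integer is an assumption: the text gives the step size but not the rounding rule.

```python
# Base weights from the table above, indexed by level (0 = highest band).
BASE_WEIGHTS = [3.0, 2.0, 1.5, 1.0, 0.75, 0.5, 0.25, 0.25]


def quantization_step(level, quality):
    """base_weight * quality_scale for levels 0-7 and quality 0-5."""
    quality_scale = 1.0 + (5 - quality) * 0.3
    return BASE_WEIGHTS[level] * quality_scale


def quantize(coefficient, level, quality):
    """Quantize one coefficient, applying the dead zone at level 0.

    Assumption: quantization divides by the step and rounds to the nearest
    integer; the text does not specify the rounding rule.
    """
    step = quantization_step(level, quality)
    if level == 0 and abs(coefficient) < step / 2:
        return 0
    return round(coefficient / step)


def dequantize(index, level, quality):
    """Decoder side: multiply by the same step the encoder used."""
    return index * quantization_step(level, quality)


step_q3_level0 = quantization_step(0, 3)  # 3.0 * 1.6
```

At quality 5 the scale is 1.0, so the steps equal the base weights. At quality 0 the scale is 2.5, so every step is 2.5 times coarser.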
### Step 5: 2-bit Significance Map Encoding
Quantized coefficients are encoded using the 2-bit twobitmap method (see above).
### Step 6: Optional Zstd Compression
If enabled (default), the concatenated Mid+Side encoded data is compressed
using Zstd level 3 for additional compression without significant CPU overhead.
## Decoding Pipeline
### Step 1: Chunk Extraction
Read chunk header to determine significance map method and compression status.
If compressed, decompress payload using Zstd.
### Step 2: Decode Significance Maps
Decode Mid and Side channel data using 2-bit twobitmap decoder:
- Read 2-bit codes from significance map
- Reconstruct coefficients: 0, +1, -1, or read from Other Values array
### Step 3: Dequantization
Multiply quantized coefficients by frequency-dependent quantization steps
(same weights as encoder).
### Step 4: Variable-Level Inverse DD-4 DWT
Reconstruct PCM8 audio from DWT coefficients using inverse DD-4 transform,
progressively doubling length from the deepest level to chunk_size samples.
The number of inverse DWT levels matches the forward transform (log2(chunk_size) - 1).
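To illustrate how the inverse transform doubles the length at each level, here is a lifting-based round trip. It is a sketch under several assumptions that the text does not confirm: the predict step uses the 4-point interpolating weights (-1, 9, 9, -1)/16, the update step averages neighbouring details with weight 1/4, edges are handled by clamping indices, and coefficients are laid out lowest band first.

```python
def _clamp(i, n):
    """Clamp an index into [0, n-1] (assumed edge handling)."""
    return max(0, min(n - 1, i))


def dd4_forward_level(x):
    """One lifting level: returns (approx, detail), each half the input length."""
    even, odd = x[0::2], x[1::2]
    n = len(even)
    # Predict: estimate each odd sample from four neighbouring even samples.
    detail = [
        odd[i] - (-even[_clamp(i - 1, n)] + 9 * even[i]
                  + 9 * even[_clamp(i + 1, n)] - even[_clamp(i + 2, n)]) / 16
        for i in range(n)
    ]
    # Update (assumed form): smooth even samples with the neighbouring details.
    approx = [even[i] + (detail[_clamp(i - 1, n)] + detail[i]) / 4 for i in range(n)]
    return approx, detail


def dd4_inverse_level(approx, detail):
    """Undo dd4_forward_level by running the lifting steps in reverse order."""
    n = len(approx)
    even = [approx[i] - (detail[_clamp(i - 1, n)] + detail[i]) / 4 for i in range(n)]
    odd = [
        detail[i] + (-even[_clamp(i - 1, n)] + 9 * even[i]
                     + 9 * even[_clamp(i + 1, n)] - even[_clamp(i + 2, n)]) / 16
        for i in range(n)
    ]
    out = [0.0] * (2 * n)
    out[0::2], out[1::2] = even, odd
    return out


def dd4_forward(x, levels):
    """Multi-level transform; layout is [approx | coarsest detail | ... | finest detail]."""
    coeffs = list(x)
    length = len(coeffs)
    for _ in range(levels):
        approx, detail = dd4_forward_level(coeffs[:length])
        coeffs[:length] = approx + detail
        length //= 2
    return coeffs


def dd4_inverse(coeffs, levels):
    """Progressively double the reconstructed length up to len(coeffs)."""
    out = list(coeffs)
    length = len(out) >> levels
    for _ in range(levels):
        out[: 2 * length] = dd4_inverse_level(out[:length], out[length: 2 * length])
        length *= 2
    return out


signal = [float((i * 37) % 23 - 11) for i in range(16)]
coeffs = dd4_forward(signal, 3)  # 16 samples -> log2(16) - 1 = 3 levels
restored = dd4_inverse(coeffs, 3)
```

Lifting steps are invertible by construction, so the round trip holds whatever edge rule is chosen, as long as the encoder and decoder use the same one.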
### Step 5: M/S to L/R Conversion
Convert Mid/Side back to Left/Right stereo:
Left = Mid + Side
Right = Mid - Side
### Step 6: PCM8 to PCM16 Conversion
Convert signed PCM8 back to PCM16LE by multiplying by 256 (the sample rate is unchanged):
pcm16_value = pcm8_value * 256
## Compression Performance
- **Target Ratio**: 2:1 against PCMu8 (4:1 against PCM16LE input)
- **Achieved Ratio**: 2.51:1 against PCMu8 at quality level 3
- **Quality**: Perceptually transparent at Q3+, preserves full 0-16 kHz bandwidth
- **Sparsity**: 86.9% zeros in Mid channel, 97.8% in Side channel (typical)
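As a worked example of what these ratios mean in bytes, the following sketch converts them to average data rates. The function name is hypothetical; the figures are derived only from the 32000 Hz stereo format and the ratios quoted above.

```python
SAMPLE_RATE = 32000
CHANNELS = 2


def bytes_per_second(ratio_vs_pcmu8):
    """Average payload rate for a given compression ratio against PCMu8."""
    pcmu8_rate = SAMPLE_RATE * CHANNELS  # 1 byte per sample, 64000 bytes/s
    return pcmu8_rate / ratio_vs_pcmu8


target = bytes_per_second(2.0)     # 2:1 target
achieved = bytes_per_second(2.51)  # quoted ratio at quality level 3
ratio_vs_pcm16 = 2.51 * 2          # PCM16LE is twice the size of PCMu8
```

The 2:1 target corresponds to 32000 bytes per second, and the quoted 2.51:1 to roughly 25.5 kB per second, or about 5:1 against the PCM16LE input.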
## Integration with TAV Encoder
TAD is designed as an includable API for TAV video encoder integration.
The encoder can be invoked programmatically to compress audio tracks:
#include "tad_encoder.h"
size_t encoded_size = tad_encode_from_file(
input_audio_path,
output_tad_path,
quality_level,
use_zstd,
verbose
);
This allows TAV video files to embed TAD-compressed audio using packet type 0x24.
## Audio Extraction Command
The TAD encoder uses a two-pass FFmpeg extraction for optimal quality:
# Pass 1: Extract at original sample rate
ffmpeg -i input.mp4 -f s16le -ac 2 temp.pcm
# Pass 2: High-quality resample with SoXR and highpass filter
ffmpeg -f s16le -ar {original_rate} -ac 2 -i temp.pcm \
-ar 32000 -af "aresample=resampler=soxr:precision=28:cutoff=0.99,highpass=f=16" \
output.pcm
This ensures resampling happens after extraction with optimal quality parameters.
## Hardware Acceleration API
The TAD decoder may be accelerated using hardware functions in GraphicsJSR223Delegate:
- tadDecode(): Main decoding function (chunk-based)
- tadHaarIDWT(): Fast inverse Haar DWT
- tadDequantize(): Frequency-dependent dequantization
## Usage Examples
# Encode with default quality (Q3)
tad_encoder -i input.mp4 -o output.tad
# Encode with highest quality
tad_encoder -i input.mp4 -o output.tad -q 5
# Encode without Zstd compression
tad_encoder -i input.mp4 -o output.tad --no-zstd
# Verbose output with statistics
tad_encoder -i input.mp4 -o output.tad -v
--------------------------------------------------------------------------------
TSVM Universal Cue format