TAD: Terrarum Advanced Audio to use with video compression

2026-06-13 08:04:03 +09:00 · 2025-10-23 18:56:57 +09:00
parent 6f669f4fd9
commit a9319fd812
10 changed files with 1887 additions and 22 deletions
--- a/terranmon.txt
+++ b/terranmon.txt
@@ -965,6 +965,7 @@ transmission capability, and region-of-interest coding.
    0x21: Zstd-compressed 8-bit PCM (32 KHz, audio hardware's native format)
    0x22: Zstd-compressed 16-bit PCM (32 KHz, little endian)
    0x23: Zstd-compressed ADPCM
+    0x24: Zstd-compressed TAD
    <subtitles>
    0x30: Subtitle in "Simple" format
    0x31: Subtitle in "Karaoke" format
@@ -1065,6 +1066,13 @@ transmission capability, and region-of-interest coding.
    uint32 Compressed Size
    *      Zstd-compressed Block Data

+## TAD Packet Structure
+    uint8  Packet type (0x24)
+    uint32 Compressed Size + 2
+    uint16 Sample Count
+    uint32 Compressed Size
+    *      Zstd-compressed TAD
+
 ## GOP Unified Packet Structure (0x12)
 Implemented on 2025-10-15 for temporal 3D DWT with unified preprocessing.

@@ -1507,6 +1515,241 @@ Number|Index
 4096|255


+--------------------------------------------------------------------------------
+
+TSVM Advanced Audio (TAD) Format
+Created by CuriousTorvald and Claude on 2025-10-23
+
+TAD is a perceptual audio codec for TSVM built on a Discrete Wavelet Transform (DWT)
+with 4-tap interpolating Deslauriers-Dubuc wavelets. It compresses audio through
+M/S stereo decorrelation, frequency-dependent quantization, and significance map
+encoding, and is designed as an includable API for integration with the TAV video encoder.
+
+When used inside a video codec, only the Zstd-compressed payload is stored; the chunk
+length is stored separately, and the quality index is shared with that of the video.
+
+# Suggested File Structure
+\x1F T S V M T A D
+[HEADER]
+[CHUNK 0]
+[CHUNK 1]
+[CHUNK 2]
+...
+
+## Header (18 bytes)
+    uint8  Magic[8]: "\x1FTSVMTAD" (no spaces between the bytes)
+    uint8  Version: 1
+    uint8  Quality Level: 0-5 (0=lowest quality/smallest, 5=highest quality/largest)
+    uint8  Flags:
+            - bit 0: Zstd compression enabled (1=compressed, 0=uncompressed)
+            - bits 1-7: Reserved (must be 0)
+    uint32 Sample Rate: audio sample rate in Hz (always 32000 for TSVM)
+    uint8  Channels: number of audio channels (always 2 for stereo)
+    uint8  Reserved[2]: fill with zeros
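A minimal sketch of reading this header in Python. It assumes the fields are packed in the listed order, little-endian, with no padding; the byte order is not stated here and is inferred from the PCM16LE convention. The names `TAD_HEADER`, `TAD_MAGIC`, and `parse_tad_header`, and the keys of the returned dictionary, are illustrative only. Packed this way, the listed fields occupy 18 bytes.

```python
import struct

# Assumed layout: listed field order, little-endian, no padding.
# 8s magic, B version, B quality, B flags, I sample rate, B channels, 2s reserved
TAD_HEADER = struct.Struct("<8sBBBIB2s")
TAD_MAGIC = b"\x1fTSVMTAD"  # assumed to be 8 bytes with no separators


def parse_tad_header(data: bytes) -> dict:
    """Unpack the file header fields and check the ones with fixed ranges."""
    magic, version, quality, flags, sample_rate, channels, _reserved = (
        TAD_HEADER.unpack_from(data, 0)
    )
    if magic != TAD_MAGIC:
        raise ValueError("not a TAD file")
    if not 0 <= quality <= 5:
        raise ValueError("quality level out of range")
    return {
        "version": version,
        "quality": quality,
        "zstd": bool(flags & 0x01),  # bit 0: Zstd compression enabled
        "sample_rate": sample_rate,
        "channels": channels,
    }
```

The first chunk would then start at offset `TAD_HEADER.size` under the same packing assumption.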
+
+## Audio Properties
+- **Sample Rate**: 32000 Hz (TSVM audio hardware native format)
+- **Channels**: 2 (stereo)
+- **Input Format**: PCM16LE (16-bit signed little-endian PCM)
+- **Preprocessing**: 16 Hz highpass filter applied during extraction
+- **Internal Representation**: Signed PCM8 with error-diffusion dithering
+- **Chunk Size**: Variable (1024-32768 samples per channel, must be a power of 2)
+  - Default: 32768 samples (1.024 seconds at 32 kHz)
+  - Minimum: 1024 samples (32 ms at 32 kHz)
+  - DWT levels calculated dynamically: log2(chunk_size) - 1
+- **Target Compression**: 2:1 against PCMu8 baseline
+
+## Chunk Structure
+Each chunk encodes a variable number of samples per channel (a power of 2, minimum 1024).
+The default is 32768 samples per channel (65536 samples across both channels, 1.024 seconds).
+If the audio duration doesn't align to chunk boundaries, the final chunk can use
+a smaller power-of-2 size or be zero-padded.
+
+    uint8  Significance Map Method: always 1 (2-bit twobitmap)
+    uint8  Compression Flag: 1=Zstd compressed, 0=uncompressed
+    uint16 Sample Count: number of samples per channel (must be power of 2, min 1024)
+    uint32 Chunk Payload Size: size of following payload in bytes
+    *      Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set)
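The chunk layout above can be walked with a small reader. This is a sketch under the assumption that the chunk header is packed little-endian without padding (8 bytes in total); `read_chunk` and the dictionary keys are hypothetical names. Zstd decompression is left to the caller, since it needs a third-party library.

```python
import struct

# Assumed layout: uint8 method, uint8 compression flag,
# uint16 sample count, uint32 payload size (little-endian, no padding).
CHUNK_HEADER = struct.Struct("<BBHI")


def read_chunk(data: bytes, offset: int = 0):
    """Return (chunk_info, offset_of_next_chunk) for the chunk at `offset`."""
    method, compressed, sample_count, payload_size = CHUNK_HEADER.unpack_from(
        data, offset
    )
    if method != 1:
        raise ValueError("unsupported significance map method")
    # Sample count must be a power of 2 and at least 1024.
    if sample_count < 1024 or sample_count & (sample_count - 1):
        raise ValueError("invalid sample count")
    start = offset + CHUNK_HEADER.size
    payload = data[start:start + payload_size]
    if len(payload) != payload_size:
        raise ValueError("truncated chunk payload")
    info = {
        "compressed": bool(compressed),
        "sample_count": sample_count,
        "payload": payload,  # still Zstd-compressed if the flag is set
    }
    return info, start + payload_size
```

Calling `read_chunk` repeatedly with the returned offset iterates over all chunks in a file.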
+
+### Chunk Payload Structure (before optional Zstd compression)
+    *      Mid Channel Encoded Data
+    *      Side Channel Encoded Data
+
+### Encoded Channel Data (2-bit Twobitmap Significance Map)
+    uint8  Significance Map[(num_samples * 2 + 7) / 8]  // 2 bits per coefficient
+    int16  Other Values[variable length]                // Non-{-1,0,+1} values
+
+#### 2-bit Twobitmap Encoding
+Each DWT coefficient is encoded using 2 bits in the significance map:
+    - 00: coefficient is 0
+    - 01: coefficient is +1
+    - 10: coefficient is -1
+    - 11: coefficient is "other" (value stored in Other Values array)
+
+This encoding exploits the sparsity of quantized DWT coefficients, where most
+values are 0 or ±1 after quantization. "Other" values are stored sequentially
+as int16 in the order they appear.
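The twobitmap scheme can be sketched as a pair of Python functions. Two details are not specified above and are assumptions here: codes are packed into each byte starting from the least significant bits, and the "other" values are little-endian int16. The function names are illustrative.

```python
import struct


def encode_twobitmap(coeffs):
    """Pack quantized coefficients into a 2-bit map followed by int16 'other' values."""
    sigmap = bytearray((len(coeffs) * 2 + 7) // 8)
    others = []
    for i, c in enumerate(coeffs):
        if c == 0:
            code = 0b00
        elif c == 1:
            code = 0b01
        elif c == -1:
            code = 0b10
        else:
            code = 0b11
            others.append(c)
        # Assumption: four codes per byte, first code in the lowest two bits.
        sigmap[i // 4] |= code << ((i % 4) * 2)
    # Assumption: 'other' values are little-endian int16.
    return bytes(sigmap) + struct.pack(f"<{len(others)}h", *others)


def decode_twobitmap(data, num_samples):
    """Return (coefficients, number_of_bytes_consumed)."""
    pos = (num_samples * 2 + 7) // 8  # 'other' values start after the map
    coeffs = []
    for i in range(num_samples):
        code = (data[i // 4] >> ((i % 4) * 2)) & 0b11
        if code == 0b11:
            (value,) = struct.unpack_from("<h", data, pos)
            pos += 2
        else:
            value = (0, 1, -1)[code]
        coeffs.append(value)
    return coeffs, pos
```

Because the Mid and Side blocks are concatenated in the payload, the consumed byte count returned by `decode_twobitmap` marks where the Side block begins.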
+
+## Encoding Pipeline
+
+### Step 1: PCM16 to PCM8 Conversion with Error-Diffusion Dithering
+Input stereo PCM16LE is converted to signed PCM8 using error-diffusion dithering
+to minimize quantization noise:
+
+    dithered_value = pcm16_value / 256 + error
+    pcm8_value = clamp(round(dithered_value), -128, 127)
+    error = dithered_value - pcm8_value
+
+Error is propagated to the next sample (alternating between left/right channels).
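A minimal Python sketch of the three formulas above. It assumes the input is an interleaved sample sequence (L, R, L, R, ...), so that the carried error alternates between channels, and it uses floating-point division. The function name is illustrative.

```python
def pcm16_to_pcm8_dithered(samples):
    """Convert PCM16 samples to signed PCM8, carrying the rounding error
    forward to the next sample (first-order error diffusion)."""
    out = []
    error = 0.0
    for s in samples:
        dithered = s / 256 + error
        # Note: Python's round() rounds exact halves to the nearest even integer.
        q = max(-128, min(127, round(dithered)))
        error = dithered - q
        out.append(q)
    return out
```

For a constant input of 128 (exactly half a PCM8 step), the output alternates between 0 and 1, so the average level is preserved even though no single PCM8 value can represent it.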
+
+### Step 2: M/S Stereo Decorrelation
+Mid-Side transformation exploits stereo correlation:
+
+    Mid = (Left + Right) / 2
+    Side = (Left - Right) / 2
+
+This typically concentrates energy in the Mid channel while the Side channel
+contains mostly small values, improving compression efficiency.
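As a small illustration, the transform and the inverse used by the decoder (Left = Mid + Side, Right = Mid - Side) round-trip exactly when the halving is done in floating point. That choice is an assumption of this sketch; integer division would drop the lowest bit of odd sums. The function names are hypothetical.

```python
def ms_encode(left, right):
    """Mid/Side transform using the formulas above (floating-point halves)."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_decode(mid, side):
    """Inverse transform: Left = Mid + Side, Right = Mid - Side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

For strongly correlated channels such as `left = [10, -3]` and `right = [8, -5]`, the Side channel is the constant `[1.0, 1.0]`, which is the kind of low-energy signal the paragraph above describes.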
+
+### Step 3: Variable-Level DD-4 DWT
+Each channel (Mid and Side) undergoes Deslauriers-Dubuc 4-tap interpolating wavelet
+decomposition. The number of DWT levels is calculated dynamically based on chunk size:
+
+    DWT Levels = log2(chunk_size) - 1
+
+For the default 32768-sample chunks, this produces 14 levels with frequency subbands:
+
+    Level 0-13: High to low frequency coefficients
+    DC band: Low-frequency approximation coefficients
+
+Sideband boundaries are calculated dynamically (the recurrence applies for i >= 1):
+    first_band_size = chunk_size >> dwt_levels
+    sideband[0] = 0
+    sideband[1] = first_band_size
+    sideband[i+1] = sideband[i] + (first_band_size << (i-1))
+
+For 32768 samples with 14 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768
+For 1024 samples with 9 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
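The level count and the boundary recurrence can be checked with a few lines of Python. The function names are illustrative; the arithmetic follows the formulas above directly.

```python
def dwt_levels(chunk_size):
    """Number of DWT levels: log2(chunk_size) - 1."""
    if chunk_size < 1024 or chunk_size & (chunk_size - 1):
        raise ValueError("chunk size must be a power of 2, at least 1024")
    # For a power of two, bit_length() - 1 equals log2.
    return chunk_size.bit_length() - 2


def sideband_boundaries(chunk_size):
    """Boundaries of the DC band followed by each detail band."""
    levels = dwt_levels(chunk_size)
    first_band_size = chunk_size >> levels
    bounds = [0, first_band_size]
    for i in range(1, levels + 1):
        bounds.append(bounds[i] + (first_band_size << (i - 1)))
    return bounds
```

The resulting list has `dwt_levels + 2` entries: the DC band spans the first pair of boundaries, and each later pair delimits one detail band, ending at `chunk_size`.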
+
+### Step 4: Frequency-Dependent Quantization
+DWT coefficients are quantized using perceptually-tuned frequency-dependent weights:
+
+    Base Weights by Level:
+    Level 0 (16-8 kHz):       3.0
+    Level 1 (8-4 kHz):        2.0
+    Level 2 (4-2 kHz):        1.5
+    Level 3 (2-1 kHz):        1.0
+    Level 4 (1-0.5 kHz):      0.75
+    Level 5 (0.5-0.25 kHz):   0.5
+    Level 6-7 (DC-0.25 kHz):  0.25
+
+Quality scaling factor: 1.0 + (5 - quality) * 0.3
+
+Final quantization step: base_weight * quality_scale
+
+#### Dead Zone Quantization
+High-frequency coefficients (Level 0: 8-16 kHz) use dead zone quantization,
+where coefficients smaller than half the quantization step are zeroed:
+
+    if (abs(coefficient) < quantization_step / 2)
+        coefficient = 0
+
+This aggressively removes high-frequency noise while preserving important
+mid-frequency content (2-4 kHz, critical for speech intelligibility).
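The step computation and the dead zone can be sketched as follows. The rounding rule is not stated in this section, so `round(coefficient / step)` is an assumption of the sketch, as are the function names. Only levels 0 to 7 are tabulated above, so the sketch covers only those.

```python
# Base weights for levels 0-7 as tabulated above.
BASE_WEIGHTS = [3.0, 2.0, 1.5, 1.0, 0.75, 0.5, 0.25, 0.25]


def quant_step(level, quality):
    """Final quantization step: base_weight * quality_scale."""
    quality_scale = 1.0 + (5 - quality) * 0.3
    return BASE_WEIGHTS[level] * quality_scale


def quantize(coefficient, level, quality):
    step = quant_step(level, quality)
    # Dead zone for the highest-frequency band (level 0).
    # With round-to-nearest this test is already implied; it is kept
    # to mirror the rule as written.
    if level == 0 and abs(coefficient) < step / 2:
        return 0
    # Assumption: plain round-to-nearest division by the step.
    return round(coefficient / step)


def dequantize(q, level, quality):
    """Decoder side: multiply by the same frequency-dependent step."""
    return q * quant_step(level, quality)
```

At quality 5 the scale is 1.0, so the steps equal the base weights; at quality 0 every step is 2.5 times larger.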
+
+### Step 5: 2-bit Significance Map Encoding
+Quantized coefficients are encoded using the 2-bit twobitmap method (see above).
+
+### Step 6: Optional Zstd Compression
+If enabled (default), the concatenated Mid+Side encoded data is compressed
+using Zstd level 3 for additional compression without significant CPU overhead.
+
+## Decoding Pipeline
+
+### Step 1: Chunk Extraction
+Read the chunk header to determine the significance map method and compression status.
+If compressed, decompress the payload using Zstd.
+
+### Step 2: Decode Significance Maps
+Decode Mid and Side channel data using 2-bit twobitmap decoder:
+    - Read 2-bit codes from significance map
+    - Reconstruct coefficients: 0, +1, -1, or read from Other Values array
+
+### Step 3: Dequantization
+Multiply quantized coefficients by frequency-dependent quantization steps
+(same weights as encoder).
+
+### Step 4: Variable-Level Inverse DD-4 DWT
+Reconstruct PCM8 audio from DWT coefficients using inverse DD-4 transform,
+progressively doubling length from the deepest level to chunk_size samples.
+The number of inverse DWT levels matches the forward transform (log2(chunk_size) - 1).
+
+### Step 5: M/S to L/R Conversion
+Convert Mid/Side back to Left/Right stereo:
+
+    Left = Mid + Side
+    Right = Mid - Side
+
+### Step 6: PCM8 to PCM16 Conversion
+Convert signed PCM8 back to PCM16LE by multiplying each sample by 256:
+
+    pcm16_value = pcm8_value * 256
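A short Python sketch of this last step, producing PCM16LE bytes. The rounding and clamping to the PCM8 range before scaling are assumptions added for safety, since reconstructed values may fall slightly outside -128..127. The function name is illustrative.

```python
import struct


def pcm8_to_pcm16le(samples):
    """Scale signed PCM8 samples by 256 and serialize as little-endian int16."""
    # Assumption: round and clamp to the PCM8 range before scaling.
    clamped = [max(-128, min(127, round(s))) for s in samples]
    return struct.pack(f"<{len(clamped)}h", *(s * 256 for s in clamped))
```

The extremes map to -32768 and 32512, so the scaled values always fit in int16.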
+
+## Compression Performance
+- **Target Ratio**: 2:1 against PCMu8 (4:1 against PCM16LE input)
+- **Achieved Ratio**: 2.51:1 against PCMu8 at quality level 3
+- **Quality**: Perceptually transparent at Q3 and above, preserves full 0-16 kHz bandwidth
+- **Sparsity**: 86.9% zeros in Mid channel, 97.8% in Side channel (typical)
+
+## Integration with TAV Encoder
+TAD is designed as an includable API for integration with the TAV video encoder.
+The encoder can be invoked programmatically to compress audio tracks:
+
+    #include "tad_encoder.h"
+
+    size_t encoded_size = tad_encode_from_file(
+        input_audio_path,
+        output_tad_path,
+        quality_level,
+        use_zstd,
+        verbose
+    );
+
+This allows TAV video files to embed TAD-compressed audio using packet type 0x24.
+
+## Audio Extraction Command
+The TAD encoder uses a two-pass FFmpeg extraction for optimal quality:
+
+    # Pass 1: Extract at original sample rate
+    ffmpeg -i input.mp4 -f s16le -ac 2 temp.pcm
+
+    # Pass 2: High-quality resample with SoXR and highpass filter
+    # (-f s16le is repeated for the output because .pcm has no default format)
+    ffmpeg -f s16le -ar {original_rate} -ac 2 -i temp.pcm \
+           -ar 32000 -af "aresample=resampler=soxr:precision=28:cutoff=0.99,highpass=f=16" \
+           -f s16le output.pcm
+
+This ensures resampling happens after extraction with optimal quality parameters.
+
+## Hardware Acceleration API
+The TAD decoder may be accelerated using hardware functions in GraphicsJSR223Delegate:
+- tadDecode(): Main decoding function (chunk-based)
+- tadHaarIDWT(): Fast inverse Haar DWT
+- tadDequantize(): Frequency-dependent dequantization
+
+## Usage Examples
+    # Encode with default quality (Q3)
+    tad_encoder -i input.mp4 -o output.tad
+
+    # Encode with highest quality
+    tad_encoder -i input.mp4 -o output.tad -q 5
+
+    # Encode without Zstd compression
+    tad_encoder -i input.mp4 -o output.tad --no-zstd
+
+    # Verbose output with statistics
+    tad_encoder -i input.mp4 -o output.tad -v
+
+
 --------------------------------------------------------------------------------

 TSVM Universal Cue format