mirror of
https://github.com/curioustorvald/tsvm.git
synced 2026-03-07 11:51:49 +09:00
TAD: Terrarum Advanced Audio to use with video compression
243
terranmon.txt
@@ -965,6 +965,7 @@ transmission capability, and region-of-interest coding.
0x21: Zstd-compressed 8-bit PCM (32 KHz, audio hardware's native format)
0x22: Zstd-compressed 16-bit PCM (32 KHz, little endian)
0x23: Zstd-compressed ADPCM
0x24: Zstd-compressed TAD
<subtitles>
0x30: Subtitle in "Simple" format
0x31: Subtitle in "Karaoke" format
@@ -1065,6 +1066,13 @@ transmission capability, and region-of-interest coding.
uint32 Compressed Size
* Zstd-compressed Block Data

## TAD Packet Structure
uint8 Packet type (0x24)
uint32 Compressed Size + 2
uint16 Sample Count
uint32 Compressed Size
* Zstd-compressed TAD

## GOP Unified Packet Structure (0x12)
Implemented on 2025-10-15 for temporal 3D DWT with unified preprocessing.

@@ -1507,6 +1515,241 @@ Number|Index
4096|255


--------------------------------------------------------------------------------

TSVM Advanced Audio (TAD) Format
Created by CuriousTorvald and Claude on 2025-10-23

TAD is a perceptual audio codec for TSVM that uses a Discrete Wavelet Transform (DWT)
with 4-tap interpolating Deslauriers-Dubuc wavelets. It compresses audio
through M/S stereo decorrelation, frequency-dependent quantization, and significance
map encoding. It is designed as an includable API for integration with the TAV video encoder.

When used inside a video codec, only the Zstd-compressed payload is stored; the chunk length
is stored separately, and the quality index is shared with that of the video.

# Suggested File Structure
\x1F T S V M T A D
[HEADER]
[CHUNK 0]
[CHUNK 1]
[CHUNK 2]
...

## Header (16 bytes)
uint8 Magic[8]: "\x1F TSVM TAD"
uint8 Version: 1
uint8 Quality Level: 0-5 (0=lowest quality/smallest, 5=highest quality/largest)
uint8 Flags:
  - bit 0: Zstd compression enabled (1=compressed, 0=uncompressed)
  - bits 1-7: Reserved (must be 0)
uint32 Sample Rate: audio sample rate in Hz (always 32000 for TSVM)
uint8 Channels: number of audio channels (always 2 for stereo)
uint8 Reserved[2]: fill with zeros
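
The header fields can be serialized with a fixed-layout struct. The following is a minimal sketch rather than the reference implementation: the function names are hypothetical, and little-endian byte order for the uint32 field is an assumption (the text does not state it). Note that the fields as listed add up to 18 bytes, not the 16 given in the heading; the sketch follows the field list and takes no position on which figure is authoritative.

```python
import struct

# Field order follows the header listing above.
# ASSUMPTION: multi-byte fields are little-endian (not stated in the text).
TAD_HEADER = struct.Struct("<8sBBBIB2s")
TAD_MAGIC = b"\x1fTSVMTAD"  # "\x1F T S V M T A D", 8 bytes


def pack_tad_header(quality, zstd=True, sample_rate=32000, channels=2):
    """Serialize a version-1 TAD header (hypothetical helper)."""
    if not 0 <= quality <= 5:
        raise ValueError("quality level must be in 0-5")
    flags = 0x01 if zstd else 0x00  # bit 0: Zstd enabled; bits 1-7 reserved
    return TAD_HEADER.pack(
        TAD_MAGIC, 1, quality, flags, sample_rate, channels, b"\x00\x00"
    )


def parse_tad_header(data):
    """Parse a TAD header and return its fields as a dict."""
    magic, version, quality, flags, rate, channels, _reserved = (
        TAD_HEADER.unpack_from(data)
    )
    if magic != TAD_MAGIC:
        raise ValueError("not a TAD stream")
    return {
        "version": version,
        "quality": quality,
        "zstd": bool(flags & 0x01),
        "sample_rate": rate,
        "channels": channels,
    }
```

For example, `parse_tad_header(pack_tad_header(3))` returns quality 3, Zstd enabled, 32000 Hz and 2 channels.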

## Audio Properties
- **Sample Rate**: 32000 Hz (TSVM audio hardware native format)
- **Channels**: 2 (stereo)
- **Input Format**: PCM16LE (16-bit signed little-endian PCM)
- **Preprocessing**: 16 Hz highpass filter applied during extraction
- **Internal Representation**: Signed PCM8 with error-diffusion dithering
- **Chunk Size**: Variable (1024-32768+ samples per channel, must be power of 2)
  - Default: 32768 samples (1.024 seconds at 32 kHz)
  - Minimum: 1024 samples (32 ms at 32 kHz)
  - DWT levels calculated dynamically: log2(chunk_size) - 1
- **Target Compression**: 2:1 against PCMu8 baseline

## Chunk Structure
Each chunk encodes a variable number of stereo samples (power of 2, minimum 1024).
The default is 32768 samples per channel (65536 samples in total, 1.024 seconds).
If the audio duration doesn't align to chunk boundaries, the final chunk can use
a smaller power-of-2 size or be zero-padded.

uint8 Significance Map Method: always 1 (2-bit twobitmap)
uint8 Compression Flag: 1=Zstd compressed, 0=uncompressed
uint16 Sample Count: number of samples per channel (must be power of 2, min 1024)
uint32 Chunk Payload Size: size of following payload in bytes
* Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set)
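
A reader for this chunk layout might look like the sketch below. It is illustrative only: `read_chunk` is a hypothetical name, little-endian field order is an assumption, and the returned payload is left untouched (Zstd decompression is not shown).

```python
import struct

# ASSUMPTION: little-endian fields; layout follows the chunk listing above.
CHUNK_HEADER = struct.Struct("<BBHI")  # method, flag, sample count, payload size


def read_chunk(buf, offset=0):
    """Return (sample_count, is_compressed, payload, next_offset)."""
    method, compressed, sample_count, payload_size = CHUNK_HEADER.unpack_from(
        buf, offset
    )
    if method != 1:
        raise ValueError("unsupported significance map method")
    # Sample count must be a power of two and at least 1024.
    if sample_count < 1024 or sample_count & (sample_count - 1):
        raise ValueError("invalid sample count")
    start = offset + CHUNK_HEADER.size
    end = start + payload_size
    if end > len(buf):
        raise ValueError("truncated chunk payload")
    return sample_count, bool(compressed), buf[start:end], end
```

Calling `read_chunk` repeatedly with the returned `next_offset` walks the chunks of a file in order.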

### Chunk Payload Structure (before optional Zstd compression)
* Mid Channel Encoded Data
* Side Channel Encoded Data

### Encoded Channel Data (2-bit Twobitmap Significance Map)
uint8 Significance Map[(num_samples * 2 + 7) / 8] // 2 bits per coefficient
int16 Other Values[variable length] // Non-{-1,0,+1} values

#### 2-bit Twobitmap Encoding
Each DWT coefficient is encoded using 2 bits in the significance map:
- 00: coefficient is 0
- 01: coefficient is +1
- 10: coefficient is -1
- 11: coefficient is "other" (value stored in Other Values array)

This encoding exploits the sparsity of quantized DWT coefficients, most of
which are 0 or ±1 after quantization. "Other" values are stored sequentially
as int16 in the order they appear.
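
The twobitmap layout can be sketched as an encoder/decoder pair. Two details are assumptions, since the text above does not specify them: the 2-bit codes are packed starting from the least significant bits of each byte, and the int16 values are little-endian. The function names are hypothetical.

```python
import struct

CODE_ZERO, CODE_PLUS, CODE_MINUS, CODE_OTHER = 0b00, 0b01, 0b10, 0b11


def encode_channel(coeffs):
    """Pack quantized coefficients into significance map + Other Values."""
    sigmap = bytearray((len(coeffs) * 2 + 7) // 8)
    others = []
    for i, c in enumerate(coeffs):
        if c == 0:
            code = CODE_ZERO
        elif c == 1:
            code = CODE_PLUS
        elif c == -1:
            code = CODE_MINUS
        else:
            code = CODE_OTHER
            others.append(c)
        # ASSUMPTION: four codes per byte, first code in the lowest two bits.
        sigmap[i // 4] |= code << ((i % 4) * 2)
    # ASSUMPTION: Other Values are little-endian int16.
    return bytes(sigmap) + struct.pack(f"<{len(others)}h", *others)


def decode_channel(data, num_samples):
    """Inverse of encode_channel; returns (coefficients, bytes_consumed)."""
    pos = (num_samples * 2 + 7) // 8  # Other Values start after the map
    coeffs = []
    for i in range(num_samples):
        code = (data[i // 4] >> ((i % 4) * 2)) & 0b11
        if code == CODE_OTHER:
            (value,) = struct.unpack_from("<h", data, pos)
            pos += 2
        else:
            value = (0, 1, -1)[code]
        coeffs.append(value)
    return coeffs, pos
```

The returned byte count lets a decoder find where the Side channel data begins after the Mid channel data in the same payload.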

## Encoding Pipeline

### Step 1: PCM16 to PCM8 Conversion with Error-Diffusion Dithering
Input stereo PCM16LE is converted to signed PCM8 using error-diffusion dithering
to minimize quantization noise:

    dithered_value = pcm16_value / 256 + error
    pcm8_value = clamp(round(dithered_value), -128, 127)
    error = dithered_value - pcm8_value

The error is carried into the next sample of the interleaved stream, so it
alternates between the left and right channels.
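
A direct transcription of the three formulas, as a sketch: the function name is hypothetical, the input is taken to be an interleaved L/R sequence of PCM16 integers, and Python's `round()` (round-half-to-even) stands in for the unspecified rounding mode.

```python
def pcm16_to_pcm8_dithered(interleaved_pcm16):
    """Convert interleaved PCM16 samples to signed PCM8 with error diffusion."""
    out = []
    error = 0.0
    for sample in interleaved_pcm16:
        dithered = sample / 256 + error
        # ASSUMPTION: round() here is Python's round-half-to-even.
        pcm8 = max(-128, min(127, round(dithered)))
        error = dithered - pcm8  # carried into the next interleaved sample
        out.append(pcm8)
    return out
```

A constant input of 128 (exactly half a PCM8 step) comes out as alternating 0 and 1, so the average level is preserved even though no single PCM8 value can represent it.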

### Step 2: M/S Stereo Decorrelation
Mid-Side transformation exploits stereo correlation:

    Mid = (Left + Right) / 2
    Side = (Left - Right) / 2

This typically concentrates energy in the Mid channel while the Side channel
contains mostly small values, improving compression efficiency.
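
The transform and its inverse (Left = Mid + Side, Right = Mid - Side) can be sketched as follows. Floating-point division is an assumption here; with integer division the halving would drop a bit and the round trip would no longer be exact. The function names are hypothetical.

```python
def lr_to_ms(left, right):
    """Mid/Side transform. ASSUMPTION: floating-point arithmetic."""
    mid = [(l + r) / 2 for l, r in zip(left, right)]
    side = [(l - r) / 2 for l, r in zip(left, right)]
    return mid, side


def ms_to_lr(mid, side):
    """Inverse transform: Left = Mid + Side, Right = Mid - Side."""
    left = [m + s for m, s in zip(mid, side)]
    right = [m - s for m, s in zip(mid, side)]
    return left, right
```

For identical channels the Side signal is all zeros, which is the case where the decorrelation pays off most.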

### Step 3: Variable-Level DD-4 DWT
Each channel (Mid and Side) undergoes Deslauriers-Dubuc 4-tap interpolating wavelet
decomposition. The number of DWT levels is calculated dynamically based on chunk size:

    DWT Levels = log2(chunk_size) - 1

For the default 32768-sample chunks, this produces 14 levels with frequency subbands:

    Level 0-13: High to low frequency coefficients
    DC band: Low-frequency approximation coefficients

Sideband boundaries are calculated dynamically:
    first_band_size = chunk_size >> dwt_levels
    sideband[0] = 0
    sideband[1] = first_band_size
    sideband[i+1] = sideband[i] + (first_band_size << (i-1))    (for i >= 1)

For 32768 samples with 14 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768
For 1024 samples with 9 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
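
The level count and boundary recurrence can be checked with a few lines of code. This sketch uses hypothetical function names and reproduces both boundary lists given above.

```python
def dwt_levels(chunk_size):
    """log2(chunk_size) - 1 for a power-of-two chunk size >= 1024."""
    if chunk_size < 1024 or chunk_size & (chunk_size - 1):
        raise ValueError("chunk size must be a power of 2, minimum 1024")
    return chunk_size.bit_length() - 2  # bit_length() is log2 + 1


def sideband_boundaries(chunk_size):
    """Boundary indices of the DC band and each detail band."""
    levels = dwt_levels(chunk_size)
    first_band_size = chunk_size >> levels
    bounds = [0, first_band_size]
    for i in range(1, levels + 1):
        bounds.append(bounds[i] + (first_band_size << (i - 1)))
    return bounds
```

Each call returns `levels + 2` boundaries, and the last one always equals the chunk size.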

### Step 4: Frequency-Dependent Quantization
DWT coefficients are quantized using perceptually-tuned frequency-dependent weights:

Base Weights by Level:
    Level 0 (16-8 kHz): 3.0
    Level 1 (8-4 kHz): 2.0
    Level 2 (4-2 kHz): 1.5
    Level 3 (2-1 kHz): 1.0
    Level 4 (1-0.5 kHz): 0.75
    Level 5 (0.5-0.25 kHz): 0.5
    Level 6-7 (DC-0.25 kHz): 0.25

Quality scaling factor: 1.0 + (5 - quality) * 0.3

Final quantization step: base_weight * quality_scale

#### Dead Zone Quantization
High-frequency coefficients (Level 0: 8-16 kHz) use dead zone quantization
where coefficients smaller than half the quantization step are zeroed:

    if (abs(coefficient) < quantization_step / 2)
        coefficient = 0

This aggressively removes high-frequency noise while preserving important
mid-frequency content (2-4 kHz critical for speech intelligibility).
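
The step-size formula and the dead-zone test can be sketched as below. The names are hypothetical, and one point is an assumption: the table lists weights only up to Level 7, so the sketch reuses 0.25 for any deeper level. The rounding applied after the dead zone is not specified above and is left out.

```python
# Base weights for Levels 0-7, taken from the table above.
BASE_WEIGHTS = [3.0, 2.0, 1.5, 1.0, 0.75, 0.5, 0.25, 0.25]


def quantization_step(level, quality):
    """base_weight * quality_scale for a DWT level and quality 0-5."""
    # ASSUMPTION: levels deeper than 7 reuse the last listed weight.
    base = BASE_WEIGHTS[min(level, len(BASE_WEIGHTS) - 1)]
    quality_scale = 1.0 + (5 - quality) * 0.3
    return base * quality_scale


def apply_dead_zone(coefficient, level, quality):
    """Zero Level 0 coefficients smaller than half the quantization step."""
    step = quantization_step(level, quality)
    if level == 0 and abs(coefficient) < step / 2:
        return 0.0
    return coefficient


def dequantize(quantized, level, quality):
    """Decoder side: multiply by the same step the encoder used."""
    return quantized * quantization_step(level, quality)
```

At quality 3 the Level 0 step is 3.0 * 1.6 = 4.8, so Level 0 coefficients with magnitude below 2.4 are zeroed; at quality 5 the scale is 1.0 and the base weights apply unchanged.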

### Step 5: 2-bit Significance Map Encoding
Quantized coefficients are encoded using the 2-bit twobitmap method (see above).

### Step 6: Optional Zstd Compression
If enabled (default), the concatenated Mid+Side encoded data is compressed
using Zstd level 3 for additional compression without significant CPU overhead.

## Decoding Pipeline

### Step 1: Chunk Extraction
Read chunk header to determine significance map method and compression status.
If compressed, decompress payload using Zstd.

### Step 2: Decode Significance Maps
Decode Mid and Side channel data using 2-bit twobitmap decoder:
- Read 2-bit codes from significance map
- Reconstruct coefficients: 0, +1, -1, or read from Other Values array

### Step 3: Dequantization
Multiply quantized coefficients by frequency-dependent quantization steps
(same weights as encoder).

### Step 4: Variable-Level Inverse DD-4 DWT
Reconstruct PCM8 audio from DWT coefficients using inverse DD-4 transform,
progressively doubling length from the deepest level to chunk_size samples.
The number of inverse DWT levels matches the forward transform (log2(chunk_size) - 1).

### Step 5: M/S to L/R Conversion
Convert Mid/Side back to Left/Right stereo:

    Left = Mid + Side
    Right = Mid - Side

### Step 6: PCM8 to PCM16 Conversion
Convert signed PCM8 back to PCM16LE by multiplying by 256:

    pcm16_value = pcm8_value * 256
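
The final conversion can be sketched as a function that takes reconstructed Left/Right values and emits interleaved PCM16LE bytes. The function name is hypothetical, and the rounding and clamping to the PCM8 range before scaling are assumptions: lossy reconstruction can overshoot, and the text above does not say how that is handled.

```python
import struct


def pcm8_to_pcm16le(left, right):
    """Scale reconstructed PCM8 values by 256 and interleave as PCM16LE."""
    out = bytearray()
    for l, r in zip(left, right):
        for value in (l, r):  # Left first, then Right
            # ASSUMPTION: round and clamp to the signed PCM8 range first.
            pcm8 = max(-128, min(127, round(value)))
            out += struct.pack("<h", pcm8 * 256)
    return bytes(out)
```

Because the scaled values are multiples of 256, the low byte of every output sample is zero, and the largest positive output is 127 * 256 = 32512.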

## Compression Performance
- **Target Ratio**: 2:1 against PCMu8 (4:1 against PCM16LE input)
- **Achieved Ratio**: 2.51:1 against PCMu8 at quality level 3
- **Quality**: Perceptually transparent at Q3+, preserves full 0-16 kHz bandwidth
- **Sparsity**: 86.9% zeros in Mid channel, 97.8% in Side channel (typical)
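
To put the ratios in terms of data rate, the arithmetic can be worked through as below. Only the sample rate, channel count and ratios come from the text; the helper name is hypothetical.

```python
SAMPLE_RATE = 32000  # Hz
CHANNELS = 2

# Uncompressed baselines, in bytes per second.
PCMU8_BYTES_PER_SECOND = SAMPLE_RATE * CHANNELS * 1   # 64000
PCM16_BYTES_PER_SECOND = SAMPLE_RATE * CHANNELS * 2   # 128000


def tad_bytes_per_second(ratio_vs_pcmu8):
    """Average TAD data rate for a given compression ratio against PCMu8."""
    return PCMU8_BYTES_PER_SECOND / ratio_vs_pcmu8
```

The 2:1 target corresponds to 32000 bytes per second (256 kbit/s), and the reported 2.51:1 to roughly 25498 bytes per second (about 204 kbit/s), which is 5.02:1 against the PCM16LE input.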

## Integration with TAV Encoder
TAD is designed as an includable API for TAV video encoder integration.
The encoder can be invoked programmatically to compress audio tracks:

    #include "tad_encoder.h"

    size_t encoded_size = tad_encode_from_file(
        input_audio_path,
        output_tad_path,
        quality_level,
        use_zstd,
        verbose
    );

This allows TAV video files to embed TAD-compressed audio using packet type 0x24.

## Audio Extraction Command
TAD encoder uses two-pass FFmpeg extraction for optimal quality:

    # Pass 1: Extract at original sample rate
    ffmpeg -i input.mp4 -f s16le -ac 2 temp.pcm

    # Pass 2: High-quality resample with SoXR and highpass filter
    ffmpeg -f s16le -ar {original_rate} -ac 2 -i temp.pcm \
        -ar 32000 -af "aresample=resampler=soxr:precision=28:cutoff=0.99,highpass=f=16" \
        output.pcm

This ensures resampling happens after extraction with optimal quality parameters.

## Hardware Acceleration API
TAD decoder may be accelerated using hardware functions in GraphicsJSR223Delegate:
- tadDecode(): Main decoding function (chunk-based)
- tadHaarIDWT(): Fast inverse Haar DWT
- tadDequantize(): Frequency-dependent dequantization

## Usage Examples
    # Encode with default quality (Q3)
    tad_encoder -i input.mp4 -o output.tad

    # Encode with highest quality
    tad_encoder -i input.mp4 -o output.tad -q 5

    # Encode without Zstd compression
    tad_encoder -i input.mp4 -o output.tad --no-zstd

    # Verbose output with statistics
    tad_encoder -i input.mp4 -o output.tad -v


--------------------------------------------------------------------------------

TSVM Universal Cue format
