TAV/TAD doc update

minjaesong
2025-11-10 17:01:44 +09:00
parent edb951fb1a
commit c1d6a959f5
18 changed files with 512 additions and 423 deletions


@@ -866,8 +866,8 @@ When KSF is interleaved with MP2 audio, the payload must be inserted in-between
0x30 = reveal text normally (arguments: UTF-8 text. The reveal text must contain spaces when required)
0x31 = reveal text slowly (arguments: UTF-8 text. The effect is implementation-dependent)
0x40 = reveal text normally with emphasize (arguments: UTF-8 text. On TEV/TAV player, the text will be white; otherwise, implementation-dependent)
0x41 = reveal text slowly with emphasize (arguments: UTF-8 text)
0x40 = reveal text normally with emphasis (arguments: UTF-8 text. On a TEV/TAV player, the text will be white; otherwise, implementation-dependent)
0x41 = reveal text slowly with emphasis (arguments: UTF-8 text)
0x50 = reveal text normally with target colour (arguments: uint8 target colour; UTF-8 text)
0x51 = reveal text slowly with target colour (arguments: uint8 target colour; UTF-8 text)
@@ -887,7 +887,7 @@ When KSF is interleaved with MP2 audio, the payload must be inserted in-between
TSVM Advanced Video (TAV) Format
Created by CuriousTorvald and Claude on 2025-09-13
TAV is a next-generation video codec for TSVM utilizing Discrete Wavelet Transform (DWT)
TAV is a next-generation video codec for TSVM utilising Discrete Wavelet Transform (DWT)
similar to JPEG2000, providing superior compression efficiency and scalability compared
to DCT-based codecs like TEV. Features include multi-resolution encoding, progressive
transmission capability, and region-of-interest coding.
@@ -1134,7 +1134,7 @@ resulting in superior compression compared to per-frame encoding.
2. Determine GOP slicing from the scene detection
3. Apply 1D DWT across temporal axis (GOP frames)
4. Apply 2D DWT on each spatial slice of temporal subbands
5. Perceptual quantization with temporal-spatial awareness
5. Perceptual quantisation with temporal-spatial awareness
6. Unified significance map preprocessing across all frames/channels
7. Single Zstd compression of entire GOP block
@@ -1246,7 +1246,7 @@ The encoder expects linear alpha.
## Compression Features
- Single DWT tiles vs 16x16 DCT blocks in TEV
- Multi-resolution representation enables scalable decoding
- Better frequency localization than DCT
- Better frequency localisation than DCT
- Reduced blocking artifacts due to overlapping basis functions
## Hardware Acceleration Functions
@@ -1533,9 +1533,9 @@ TSVM Advanced Audio (TAD) Format
Created by CuriousTorvald and Claude on 2025-10-23
Updated: 2025-10-30 (fixed non-power-of-2 sample count support)
TAD is a perceptual audio codec for TSVM utilizing Discrete Wavelet Transform (DWT)
TAD is a perceptual audio codec for TSVM utilising Discrete Wavelet Transform (DWT)
with CDF 9/7 biorthogonal wavelets, providing efficient compression through M/S stereo
decorrelation, frequency-dependent quantization, and raw int8 coefficient storage.
decorrelation, frequency-dependent quantisation, and raw int8 coefficient storage.
Designed as an includable API for integration with the TAV video encoder.
When used inside of a video codec, only zstd-compressed payload is stored, chunk length
@@ -1584,20 +1584,34 @@ TAV integration uses exact GOP sample counts (e.g., 32016 samples for 1 second a
uint32 Chunk Payload Size: size of following payload in bytes
* Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set)
### Chunk Payload Structure (before optional Zstd compression)
* Mid Channel Encoded Data (raw int8 values)
* Side Channel Encoded Data (raw int8 values)
### Chunk Payload Structure (before Zstd compression)
* Mid Channel EZBC Data (embedded zero block coded bitstream)
* Side Channel EZBC Data (embedded zero block coded bitstream)
Each EZBC channel structure:
uint8 MSB Bitplane: highest bitplane with significant coefficient
uint16 Coefficient Count: number of coefficients in this channel
* Binary Tree EZBC Bitstream: significance map + refinement bits
## Encoding Pipeline
### Step 1: Dynamic Range Compression (Gamma Compression)
Input stereo PCM32fLE undergoes gamma compression for perceptual uniformity:
### Step 1: Pre-emphasis Filter
Input stereo PCM32fLE undergoes first-order FIR pre-emphasis filtering (α=0.5):
encode(x) = sign(x) * |x|^γ where γ=0.707 (1/√2)
H(z) = 1 - α·z⁻¹
This compresses dynamic range before quantization, improving perceptual quality.
This shifts quantisation noise toward lower frequencies, where it is more readily
masked according to the psychoacoustic model. The filter keeps persistent state across
chunks to prevent discontinuities at chunk boundaries.
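The pre-emphasis step above can be sketched as follows. This is a minimal illustration, not the reference encoder: the class name `PreEmphasis`, the per-channel layout, and the zero initial state are assumptions. It implements H(z) = 1 - α·z⁻¹ as y[n] = x[n] - α·x[n-1] and carries the last input sample from one chunk to the next.

```python
class PreEmphasis:
    """Hypothetical helper: y[n] = x[n] - alpha * x[n-1] for one channel."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.prev_in = 0.0  # assumed initial state; persists across chunks

    def process(self, chunk):
        out = []
        for x in chunk:
            out.append(x - self.alpha * self.prev_in)
            self.prev_in = x  # remember the input for the next sample or chunk
        return out


signal = [1.0, 0.5, -0.25, 0.75, 0.0, -1.0]

# Filtering in one pass and filtering in two chunks give the same samples,
# because the filter state survives the chunk boundary.
whole = PreEmphasis().process(signal)
f = PreEmphasis()
chunked = f.process(signal[:3]) + f.process(signal[3:])
```

Resetting the state at every chunk would instead treat the first sample of each chunk as if silence preceded it, which is the boundary discontinuity the text warns about.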
### Step 2: M/S Stereo Decorrelation
### Step 2: Dynamic Range Compression (Gamma Compression)
Pre-emphasised audio undergoes gamma compression for perceptual uniformity:
encode(x) = sign(x) * |x|^γ where γ=0.5
This compresses dynamic range before quantisation, improving perceptual quality.
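As a quick illustration of the gamma curve with γ=0.5, the sketch below applies the stated formula to single samples and inverts it. The function names are hypothetical, and the input is assumed to be normalised float audio.

```python
import math

GAMMA = 0.5  # value stated in the text


def gamma_encode(x, gamma=GAMMA):
    # sign(x) * |x|^gamma
    return math.copysign(abs(x) ** gamma, x)


def gamma_decode(y, gamma=GAMMA):
    # sign(y) * |y|^(1/gamma), the inverse of gamma_encode
    return math.copysign(abs(y) ** (1.0 / gamma), y)


quiet = gamma_encode(0.01)   # small magnitudes are boosted
loud = gamma_encode(-0.81)   # sign is preserved
restored = gamma_decode(gamma_encode(0.3))
```

With γ=0.5 a sample of magnitude 0.01 maps to about 0.1, so quiet passages occupy more of the quantiser's range than they would on a linear scale.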
### Step 3: M/S Stereo Decorrelation
Mid-Side transformation exploits stereo correlation:
Mid = (Left + Right) / 2
@@ -1606,7 +1620,7 @@ Mid-Side transformation exploits stereo correlation:
This typically concentrates energy in the Mid channel while the Side channel
contains mostly small values, improving compression efficiency.
### Step 3: 9-Level CDF 9/7 DWT
### Step 4: 9-Level CDF 9/7 DWT
Each channel (Mid and Side) undergoes CDF 9/7 biorthogonal wavelet decomposition. The codec uses a fixed 9 decomposition levels for all chunk sizes:
DWT Levels = 9 (fixed)
@@ -1632,32 +1646,53 @@ CDF 9/7 lifting coefficients:
δ = 0.443506852
K = 1.230174105
### Step 4: Frequency-Dependent Quantization
DWT coefficients are quantized using perceptually-tuned frequency-dependent weights.
### Step 5: Frequency-Dependent Quantisation with Lambda Companding
DWT coefficients are quantised using:
1. **Lambda companding**: Maps normalised coefficients through a Laplacian CDF with λ=6.0
2. **Perceptually-tuned weights**: Channel-specific (Mid/Side) frequency-dependent scaling
3. **Final quantisation step**: base_weight[channel][subband] * quality_scale
Final quantization step: base_weight * quality_scale
The lambda companding provides perceptually uniform quantisation, allocating more bits
to perceptually important coefficient magnitudes.
#### Dead Zone Quantization
High-frequency coefficients (Level 0: 8-16 KHz) use dead zone quantization
where coefficients smaller than half the quantization step are zeroed:
Channel-specific base quantisation weights:
Mid (0): [4.0, 2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 1.0, 1.3, 2.0]
Side (1): [6.0, 5.0, 2.6, 2.4, 1.8, 1.3, 1.0, 1.0, 1.6, 3.2]
if (abs(coefficient) < quantization_step / 2)
coefficient = 0
Output: Quantised int8 coefficients in range [-max_index, +max_index]
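The step-size formula and the output range can be illustrated with a small sketch. It covers only the weight lookup and a plain uniform quantiser with clamping; lambda companding is left out because its exact mapping is not given in this excerpt. The function names, the subband index order, and the example `quality_scale` and `max_index` values are assumptions.

```python
# Base weights copied from the table above; index = subband (order assumed).
BASE_WEIGHTS = {
    "mid":  [4.0, 2.0, 1.8, 1.6, 1.4, 1.2, 1.0, 1.0, 1.3, 2.0],
    "side": [6.0, 5.0, 2.6, 2.4, 1.8, 1.3, 1.0, 1.0, 1.6, 3.2],
}


def quant_step(channel, subband, quality_scale):
    # base_weight[channel][subband] * quality_scale
    return BASE_WEIGHTS[channel][subband] * quality_scale


def quantise(coeff, step, max_index):
    # Uniform rounding (an assumption), then clamp to [-max_index, +max_index].
    q = round(coeff / step)
    return max(-max_index, min(max_index, q))


step = quant_step("side", 0, quality_scale=2.0)  # hypothetical quality_scale
q_small = quantise(31.0, step, max_index=7)      # hypothetical max_index
q_clamped = quantise(-1000.0, step, max_index=7)
```

Larger weights mean coarser steps, so the Side channel and the outermost subbands receive fewer distinct levels than the middle subbands of the Mid channel.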
This aggressively removes high-frequency noise while preserving important
mid-frequency content (2-4 KHz critical for speech intelligibility).
### Step 6: EZBC Encoding (Embedded Zero Block Coding)
Quantised int8 coefficients are compressed using binary tree EZBC, a 1D variant of
embedded zero-block coding.
### Step 5: Raw Int8 Coefficient Storage
Quantized coefficients are stored directly as signed int8 values (no significance map, better Zstd compression).
Concatenated format: [Mid_channel_data][Side_channel_data]
**EZBC Algorithm**:
1. Find MSB bitplane (highest bit position with significant coefficient)
2. Initialise root block covering all coefficients as insignificant
3. For each bitplane from MSB to LSB:
- **Insignificant Pass**: Test each insignificant block for significance
- If still zero at this bitplane: emit 0 bit, keep in insignificant queue
- If becomes significant: emit 1 bit, recursively subdivide using binary tree
- **Refinement Pass**: For already-significant coefficients, emit next bit
4. Binary tree subdivision continues until blocks of size 1 (single coefficients)
5. When coefficient becomes significant: emit sign bit and reconstruct value
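The algorithm above can be sketched as a bit-list encoder with a mirrored decoder. This is a simplified illustration under stated assumptions, not the TAD bitstream: bits are kept in a Python list rather than packed, the order of bits within a bitplane is a choice made here, and the function names are hypothetical.

```python
def msb_bitplane(coeffs):
    # Highest bit position set in any coefficient magnitude (0 if all zero).
    m = max((abs(c) for c in coeffs), default=0)
    return max(m.bit_length() - 1, 0)


def ezbc_encode(coeffs):
    n, msb, bits = len(coeffs), msb_bitplane(coeffs), []
    insig, sig = [(0, n)], []  # insignificant blocks; significant indices
    for plane in range(msb, -1, -1):
        next_insig, newly = [], []

        def visit(start, length):
            if not any(abs(coeffs[i]) >> plane for i in range(start, start + length)):
                bits.append(0)                      # block still zero at this plane
                next_insig.append((start, length))
                return
            bits.append(1)                          # block became significant
            if length == 1:
                bits.append(1 if coeffs[start] < 0 else 0)  # sign bit
                newly.append(start)
            else:                                   # binary tree subdivision
                half = length // 2
                visit(start, half)
                visit(start + half, length - half)

        for block in insig:                         # insignificant pass
            visit(*block)
        for i in sig:                               # refinement pass
            bits.append((abs(coeffs[i]) >> plane) & 1)
        sig += newly
        insig = next_insig
    return msb, n, bits


def ezbc_decode(msb, n, bits):
    it = iter(bits)
    mags, neg = [0] * n, [False] * n
    insig, sig = [(0, n)], []
    for plane in range(msb, -1, -1):
        next_insig, newly = [], []

        def visit(start, length):
            if next(it) == 0:
                next_insig.append((start, length))
                return
            if length == 1:
                neg[start] = next(it) == 1
                mags[start] = 1 << plane            # first set bit of the magnitude
                newly.append(start)
            else:
                half = length // 2
                visit(start, half)
                visit(start + half, length - half)

        for block in insig:
            visit(*block)
        for i in sig:
            mags[i] |= next(it) << plane
        sig += newly
        insig = next_insig
    return [-m if s else m for m, s in zip(mags, neg)]


coeffs = [0, 0, 3, -1, 0, 0, 0, -128, 5, 0, 0, 0]
msb, count, bits = ezbc_encode(coeffs)
decoded = ezbc_decode(msb, count, bits)
```

Because long runs of zeros are covered by a single insignificant block, each such run costs one bit per bitplane instead of one bit per coefficient.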
### Step 6: Coefficient-Domain Dithering (Encoder)
Light triangular dithering (±0.5 quantization steps) added to coefficients before
quantization to reduce banding artifacts.
**EZBC Output Structure** (per channel):
uint8 MSB Bitplane (8 bits)
uint16 Coefficient Count (16 bits)
* Bitstream: [significance_bits][sign_bits][refinement_bits]
### Step 7: Zstd Compression
The concatenated Mid+Side encoded data is compressed
using Zstd level 7 for additional compression without significant CPU overhead.
**Compression Benefits**:
- Exploits coefficient sparsity through significance testing
- Progressive refinement enables quality scalability
- Binary tree exploits spatial clustering of significant coefficients
- Typical sparsity: 86.9% zeros (Mid), 97.8% zeros (Side)
### Step 7: Concatenation and Zstd Compression
The Mid and Side EZBC bitstreams are concatenated:
Payload = [Mid_EZBC_data][Side_EZBC_data]
Then compressed using Zstd level 7 for additional compression without significant
CPU overhead. Zstd exploits redundancy in the concatenated bitstreams.
## Decoding Pipeline
@@ -1665,16 +1700,25 @@ using Zstd level 7 for additional compression without significant CPU overhead.
Read chunk header (sample_count, max_index, payload_size).
If compressed (default), decompress payload using Zstd.
### Step 2: Coefficient Extraction
Extract Mid and Side channel int8 data from concatenated payload:
- Mid channel: bytes [0..sample_count-1]
- Side channel: bytes [sample_count..2*sample_count-1]
### Step 2: EZBC Decoding
Decode Mid and Side channels from concatenated EZBC bitstreams using binary tree
embedded zero block decoder:
### Step 3: Dequantization with Lambda Decompanding
For each channel:
1. Read EZBC header: MSB bitplane (8 bits), coefficient count (16 bits)
2. Initialise root block as insignificant, track coefficient states
3. Process bitplanes from MSB to LSB:
- **Insignificant Pass**: Read significance bits, recursively decode significant blocks
- **Refinement Pass**: Read refinement bits for already-significant coefficients
4. Reconstruct quantised int8 coefficients from bitplane representation
Output: Quantised int8 coefficients for Mid and Side channels
### Step 3: Dequantisation with Lambda Decompanding
Convert quantised int8 values back to float coefficients using:
1. Lambda decompanding (inverse of Laplacian CDF compression)
2. Multiply by frequency-dependent quantization steps
3. Apply coefficient-domain dithering (TPDF, ~-60 dBFS)
2. Multiply by frequency-dependent quantisation steps
3. [Optional] Apply coefficient-domain dithering (TPDF, ~-60 dBFS)
### Step 4: 9-Level Inverse CDF 9/7 DWT
Reconstruct Float32 audio from DWT coefficients using inverse CDF 9/7 transform.
@@ -1704,9 +1748,18 @@ Convert Mid/Side back to Left/Right stereo:
### Step 6: Gamma Expansion
Expand dynamic range (inverse of encoder's gamma compression):
decode(y) = sign(y) * |y|^(1/γ) where γ=0.707, so 1/γ=√2≈1.414
decode(y) = sign(y) * |y|^(1/γ) where γ=0.5, so 1/γ=2.0
### Step 7: PCM32f to PCM8 Conversion with Noise-Shaped Dithering
### Step 7: De-emphasis Filter
Apply de-emphasis filter to reverse the pre-emphasis (α=0.5):
H(z) = 1 / (1 - α·z⁻¹)
This is a first-order IIR filter with persistent state across chunks to prevent
discontinuities at chunk boundaries. The de-emphasis must be applied AFTER gamma
expansion but BEFORE PCM8 conversion to correctly reconstruct the original audio.
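A minimal sketch of the de-emphasis filter, assuming one filter instance per channel and a zero initial state (the class name `DeEmphasis` is hypothetical). It realises H(z) = 1 / (1 - α·z⁻¹) as the recursion y[n] = x[n] + α·y[n-1] and keeps the last output sample between chunks.

```python
class DeEmphasis:
    """Hypothetical helper: y[n] = x[n] + alpha * y[n-1] for one channel."""

    def __init__(self, alpha=0.5):
        self.alpha = alpha
        self.prev_out = 0.0  # assumed initial state; persists across chunks

    def process(self, chunk):
        out = []
        for x in chunk:
            self.prev_out = x + self.alpha * self.prev_out
            out.append(self.prev_out)
        return out


# These samples are [1.0, 0.5, -0.25, 0.75] after pre-emphasis with alpha=0.5,
# i.e. x[n] - 0.5 * x[n-1] with zero initial state.
pre_emphasised = [1.0, 0.0, -0.5, 0.875]

restored = DeEmphasis().process(pre_emphasised)

# Processing in two chunks gives the same result because state is retained.
f = DeEmphasis()
restored_chunked = f.process(pre_emphasised[:2]) + f.process(pre_emphasised[2:])
```

The recursion depends on the previous output, so a decoder that reset `prev_out` at each chunk would reconstruct the first samples of every chunk incorrectly.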
### Step 8: PCM32f to PCM8 Conversion with Noise-Shaped Dithering
Convert Float32 samples to unsigned PCM8 (PCMu8) using second-order error-diffusion
dithering with reduced amplitude (0.2× TPDF) to coordinate with coefficient-domain
dithering.