TAD documentation update

This commit is contained in:
minjaesong
2025-10-30 22:13:29 +09:00
parent c61bf7750f
commit 46ad919407
2 changed files with 131 additions and 66 deletions

View File

@@ -316,20 +316,23 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic
- **Adaptive GOPs**: Scene change detection ensures optimal GOP boundaries
#### TAD Format (TSVM Advanced Audio)
- **Perceptual audio codec** for TSVM using DWT with 4-tap interpolating Deslauriers-Dubuc wavelets
- **Perceptual audio codec** for TSVM using CDF 9/7 biorthogonal wavelets
- **C Encoder**: `video_encoder/encoder_tad.c` - Core Encoder library; `video_encoder/encoder_tad_standalone.c` - Standalone encoder with FFmpeg integration
- How to build: `make tad`
- **Quality Levels**: 0-5 (0=lowest quality/smallest, 5=highest quality/largest; designed to be in sync with TAV encoder)
- **C Decoder**: `video_encoder/decoder_tad.c` - Standalone decoder for TAD format
- **Kotlin Decoder**: `AudioAdapter.kt` - Hardware-accelerated TAD decoder for TSVM runtime
- **Features**:
- **32 KHz stereo**: TSVM audio hardware native format
- **Variable chunk sizes**: 1024-32768+ samples, enables flexible TAV integration
- **Variable chunk sizes**: Any size ≥1024 samples, including non-power-of-2 (e.g., 32016 for TAV 1-second GOPs)
- **M/S stereo decorrelation**: Exploits stereo correlation for better compression
- **PCM16→PCM8 conversion**: Error-diffusion dithering to minimize quantization noise
- **Variable-level DD-4 DWT**: Dynamic levels (log2(chunk_size) - 2) for frequency domain analysis
- **Perceptual quantization**: Frequency-dependent weights preserving critical 2-4 KHz range
- **2-bit twobitmap significance map**: Efficient encoding of sparse coefficients
- **Optional Zstd compression**: Level 7 for additional compression
- **Gamma compression**: Dynamic range compression (γ=0.707) before quantization
- **9-level CDF 9/7 DWT**: Fixed 9 decomposition levels for all chunk sizes
- **Perceptual quantization**: Frequency-dependent weights with lambda companding
- **Raw int8 storage**: Direct coefficient storage (no significance map, better Zstd compression)
- **Coefficient-domain dithering**: Light TPDF dithering to reduce banding
- **Zstd compression**: Level 7 for additional compression
- **Non-power-of-2 support**: Fixed 2025-10-30 to handle arbitrary chunk sizes correctly
- **Usage Examples**:
```bash
# Encode with default quality (Q3)
@@ -348,7 +351,7 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic
decoder_tad -i input.tad -o output.pcm
```
- **Format documentation**: `terranmon.txt` (search for "TSVM Advanced Audio (TAD) Format")
- **Version**: 1 (2-bit twobitmap significance map)
- **Version**: 1.1 (raw int8 storage with non-power-of-2 support, updated 2025-10-30)
**TAD Compression Performance**:
- **Target Compression**: 2:1 against PCMu8 baseline (4:1 against PCM16LE input)
@@ -358,15 +361,16 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic
**TAD Encoding Pipeline**:
1. **FFmpeg Two-Pass Extraction**: High-quality SoXR resampling to 32 KHz with 16 Hz highpass filter
2. **PCM16→PCM8 with Dithering**: Error-diffusion dithering minimizes quantization noise
2. **Gamma Compression**: Dynamic range compression (γ=0.707) for perceptual uniformity
3. **M/S Stereo Decorrelation**: Transforms Left/Right to Mid/Side for better compression
4. **Variable-Level DD-4 DWT**: Deslauriers-Dubuc 4-tap interpolating wavelets with dynamic levels
- Default 32768 samples → 13 DWT levels
- Minimum 1024 samples → 8 DWT levels
5. **Frequency-Dependent Quantization**: Perceptual weights favor 2-4 KHz (speech intelligibility)
4. **9-Level CDF 9/7 DWT**: biorthogonal wavelets with fixed 9 levels
- All chunk sizes use 9 levels (sufficient for ≥512 samples after 9 halvings)
- Supports non-power-of-2 sizes through proper length tracking
5. **Frequency-Dependent Quantization**: Perceptual weights with lambda companding
6. **Dead Zone Quantization**: Zeros high-frequency noise (highest band)
7. **2-bit Twobitmap Encoding**: Maps coefficients to 00=0, 01=+1, 10=-1, 11=other
8. **Optional Zstd Compression**: Level 7 compression on concatenated Mid+Side data
7. **Coefficient-Domain Dithering**: Light TPDF dithering (±0.5 quantization steps)
8. **Raw Int8 Storage**: Direct coefficient storage as signed int8 values
9. **Optional Zstd Compression**: Level 7 compression on concatenated Mid+Side data
**TAD Integration with TAV**:
TAD is designed as an includable API for TAV video encoder integration. The variable chunk size
@@ -375,7 +379,20 @@ TAV embeds TAD-compressed audio using packet type 0x24 with Zstd compression.
**TAD Hardware Acceleration**:
TSVM accelerates TAD decoding with AudioAdapter.kt (backend) and AudioJSR223Delegate.kt (API):
- Backend decoder in AudioAdapter.kt with variable chunk size support
- Backend decoder in AudioAdapter.kt with non-power-of-2 chunk size support (fixed 2025-10-30)
- API functions in AudioJSR223Delegate.kt for JavaScript access
- Supports chunk sizes from 1024 to 32768+ samples
- Dynamic DWT level calculation for optimal performance
- Supports chunk sizes from 1024 to 32768+ samples (any size ≥1024)
- Fixed 9-level CDF 9/7 inverse DWT with correct length tracking for non-power-of-2 sizes
**Critical Implementation Note (Fixed 2025-10-30)**:
Multi-level inverse DWT must pre-calculate the exact sequence of lengths from forward transform:
```kotlin
val lengths = IntArray(levels + 1)
lengths[0] = chunk_size
for (i in 1..levels) {
lengths[i] = (lengths[i - 1] + 1) / 2
}
// Apply inverse DWT using lengths[level] for each level
```
Using simple doubling (`length *= 2`) is incorrect for non-power-of-2 sizes and causes
mirrored subband artifacts.

View File

@@ -1523,11 +1523,12 @@ Number|Index
TSVM Advanced Audio (TAD) Format
Created by CuriousTorvald and Claude on 2025-10-23
Updated: 2025-10-30 (fixed non-power-of-2 sample count support)
TAD is a perceptual audio codec for TSVM utilizing Discrete Wavelet Transform (DWT)
with 4-tap interpolating Deslauriers-Dubuc wavelets, providing efficient compression
through M/S stereo decorrelation, frequency-dependent quantization, and significance
map encoding. Designed as an includable API for integration with TAV video encoder.
with CDF 9/7 biorthogonal wavelets, providing efficient compression through M/S stereo
decorrelation, frequency-dependent quantization, and raw int8 coefficient storage.
Designed as an includable API for integration with TAV video encoder.
When used inside of a video codec, only zstd-compressed payload is stored, chunk length
is stored separately and quality index is shared with that of the video.
@@ -1556,20 +1557,21 @@ is stored separately and quality index is shared with that of the video.
- **Channels**: 2 (stereo)
- **Input Format**: PCM32fLE (32-bit float little-endian PCM)
- **Preprocessing**: 16 Hz highpass filter applied during extraction
- **Internal Representation**: Signed PCM8 with error-diffusion dithering
- **Chunk Size**: Variable (1024-32768+ samples per channel, must be power of 2)
- Default: 32768 samples (1.024 seconds at 32 kHz)
- **Internal Representation**: Float32 throughout encoding, PCM8 conversion only at decoder
- **Chunk Size**: Variable (1024-32768+ samples per channel, any size ≥1024 supported)
- Default: 32768 samples (1.024 seconds at 32 kHz) for standalone files
- TAV integration: Uses exact GOP sample count (e.g., 32016 for 1 second at 32 kHz)
- Minimum: 1024 samples (32 ms at 32 kHz)
- DWT levels calculated dynamically: log2(chunk_size) - 1
- DWT levels: Fixed at 9 levels for all chunk sizes
- **Target Compression**: 2:1 against PCMu8 baseline
- **Wavelet**: CDF 9/7 biorthogonal
## Chunk Structure
Each chunk encodes a variable number of stereo samples (power of 2, minimum 1024).
Default is 32768 samples (65536 total samples, 1.024 seconds).
If the audio duration doesn't align to chunk boundaries, the final chunk can use
a smaller power-of-2 size or be zero-padded.
Each chunk encodes a variable number of stereo samples (minimum 1024, any size supported).
Default is 32768 samples (65536 total samples, 1.024 seconds) for standalone files.
TAV integration uses exact GOP sample counts (e.g., 32016 samples for 1 second at 32 kHz).
uint16 Sample Count: number of samples per channel (must be power of 2, min 1024)
uint16 Sample Count: number of samples per channel (min 1024, any size ≥1024)
uint8 Max quantisation index: this number * 2 + 1 is the total steps of quantisation
uint32 Chunk Payload Size: size of following payload in bytes
* Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set)
@@ -1580,11 +1582,12 @@ a smaller power-of-2 size or be zero-padded.
## Encoding Pipeline
### Step 1: PCM32f to PCM8 Conversion with Error-Diffusion Dithering
Input stereo PCM32fLE is converted to signed PCM8 using second-order noise-shaped
error-diffusion dithering to minimize quantization noise.
### Step 1: Dynamic Range Compression (Gamma Compression)
Input stereo PCM32fLE undergoes gamma compression for perceptual uniformity:
Error is propagated to the next sample (alternating between left/right channels).
encode(x) = sign(x) * |x|^γ where γ=0.707 (1/√2)
This compresses dynamic range before quantization, improving perceptual quality.
### Step 2: M/S Stereo Decorrelation
Mid-Side transformation exploits stereo correlation:
@@ -1595,16 +1598,18 @@ Mid-Side transformation exploits stereo correlation:
This typically concentrates energy in the Mid channel while the Side channel
contains mostly small values, improving compression efficiency.
### Step 3: Variable-Level DD-4 DWT
Each channel (Mid and Side) undergoes Deslauriers-Dubuc 4-tap interpolating wavelet
decomposition. The number of DWT levels is calculated dynamically based on chunk size:
### Step 3: 9-Level CDF 9/7 DWT
Each channel (Mid and Side) undergoes CDF 9/7 biorthogonal wavelet decomposition. The codec uses a fixed 9 decomposition levels for all chunk sizes:
DWT Levels = log2(chunk_size) - 1
DWT Levels = 9 (fixed)
For the default 32768-sample chunks, this produces 14 levels with frequency subbands:
For 32768-sample chunks:
- After 9 levels: 64 LL coefficients
- Frequency subbands: LL + 9 H bands (L9 to L1)
Level 0-13: High to low frequency coefficients
DC band: Low-frequency approximation coefficients
For 32016-sample chunks (TAV 1-second GOP):
- After 9 levels: 63 LL coefficients
- Supports non-power-of-2 sizes through proper length tracking (fixed 2025-10-30)
Sideband boundaries are calculated dynamically:
first_band_size = chunk_size >> dwt_levels
@@ -1612,8 +1617,12 @@ Sideband boundaries are calculated dynamically:
sideband[1] = first_band_size
sideband[i+1] = sideband[i] + (first_band_size << (i-1))
For 32768 samples with 14 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768
For 1024 samples with 9 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
CDF 9/7 lifting coefficients:
α = -1.586134342
β = -0.052980118
γ = 0.882911076
δ = 0.443506852
K = 1.230174105
### Step 4: Frequency-Dependent Quantization
DWT coefficients are quantized using perceptually-tuned frequency-dependent weights.
@@ -1630,32 +1639,53 @@ where coefficients smaller than half the quantization step are zeroed:
This aggressively removes high-frequency noise while preserving important
mid-frequency content (2-4 KHz critical for speech intelligibility).
### Step 5: 2-bit Significance Map Encoding
Quantized coefficients are encoded using the 2-bit twobitmap method (see above).
### Step 5: Raw Int8 Coefficient Storage
Quantized coefficients are stored directly as signed int8 values (no significance map, better Zstd compression).
Concatenated format: [Mid_channel_data][Side_channel_data]
### Step 6: Optional Zstd Compression
If enabled (default), the concatenated Mid+Side encoded data is compressed
using Zstd level 3 for additional compression without significant CPU overhead.
### Step 6: Coefficient-Domain Dithering (Encoder)
Light triangular dithering (±0.5 quantization steps) added to coefficients before
quantization to reduce banding artifacts.
### Step 7: Zstd Compression
The concatenated Mid+Side encoded data is compressed
using Zstd level 7 for additional compression without significant CPU overhead.
## Decoding Pipeline
### Step 1: Chunk Extraction
Read chunk header to determine significance map method and compression status.
If compressed, decompress payload using Zstd.
### Step 1: Chunk Extraction and Decompression
Read chunk header (sample_count, max_index, payload_size).
If compressed (default), decompress payload using Zstd.
### Step 2: Decode Significance Maps
Decode Mid and Side channel data using 2-bit twobitmap decoder:
- Read 2-bit codes from significance map
- Reconstruct coefficients: 0, +1, -1, or read from Other Values array
### Step 2: Coefficient Extraction
Extract Mid and Side channel int8 data from concatenated payload:
- Mid channel: bytes [0..sample_count-1]
- Side channel: bytes [sample_count..2*sample_count-1]
### Step 3: Dequantization
Multiply quantized coefficients by frequency-dependent quantization steps
(same weights as encoder).
### Step 3: Dequantization with Lambda Decompanding
Convert quantized int8 values back to float coefficients using:
1. Lambda decompanding (inverse of Laplacian CDF compression)
2. Multiply by frequency-dependent quantization steps
3. Apply coefficient-domain dithering (TPDF, ~-60 dBFS)
### Step 4: Variable-Level Inverse DD-4 DWT
Reconstruct PCM8 audio from DWT coefficients using inverse DD-4 transform,
progressively doubling length from the deepest level to chunk_size samples.
The number of inverse DWT levels matches the forward transform (log2(chunk_size) - 1).
### Step 4: 9-Level Inverse CDF 9/7 DWT
Reconstruct Float32 audio from DWT coefficients using inverse CDF 9/7 transform.
**Critical Implementation (Fixed 2025-10-30)**:
The multi-level inverse DWT must use the EXACT sequence of lengths from forward
transform, in reverse order. Using simple doubling (length *= 2) is INCORRECT
for non-power-of-2 sizes.
Correct approach:
1. Pre-calculate all forward transform lengths:
lengths[0] = chunk_size
lengths[i] = (lengths[i-1] + 1) / 2 for i=1..9
2. Apply inverse DWT in reverse order:
for level from 8 down to 0:
apply inverse_dwt(data, lengths[level])
This ensures correct reconstruction for all chunk sizes including non-power-of-2
values (e.g., 32016 samples for TAV 1-second GOPs).
### Step 5: M/S to L/R Conversion
Convert Mid/Side back to Left/Right stereo:
@@ -1663,6 +1693,16 @@ Convert Mid/Side back to Left/Right stereo:
Left = Mid + Side
Right = Mid - Side
### Step 6: Gamma Expansion
Expand dynamic range (inverse of encoder's gamma compression):
decode(y) = sign(y) * |y|^(1/γ) where γ=0.707, so 1/γ=√2≈1.414
### Step 7: PCM32f to PCM8 Conversion with Noise-Shaped Dithering
Convert Float32 samples to unsigned PCM8 (PCMu8) using second-order error-diffusion
dithering with reduced amplitude (0.2× TPDF) to coordinate with coefficient-domain
dithering.
## Compression Performance
- **Target Ratio**: 2:1 against PCMu8
- **Achieved Ratio**: 2.51:1 against PCMu8 at quality level 3
@@ -1699,10 +1739,18 @@ TAD encoder uses two-pass FFmpeg extraction for optimal quality:
This ensures resampling happens after extraction with optimal quality parameters.
## Hardware Acceleration API
TAD decoder may be accelerated using hardware functions in GraphicsJSR223Delegate:
- tadDecode(): Main decoding function (chunk-based)
- tadHaarIDWT(): Fast inverse Haar DWT
- tadDequantize(): Frequency-dependent dequantization
TAD decoder is accelerated through AudioAdapter.kt peripheral (backend) and
AudioJSR223Delegate.kt (JavaScript API):
Backend (AudioAdapter.kt):
- decodeTad(): Main decoding function (chunk-based, reads from tadInputBin)
- dwt97Inverse1d(): Single-level inverse CDF 9/7 DWT
- dwt97InverseMultilevel(): 9-level inverse DWT with non-power-of-2 support
JavaScript API (audio.* functions):
- audio.tadDecode(): Trigger TAD decoding from peripheral input buffer
- audio.tadUploadDecoded(offset, count): Upload decoded PCMu8 to playback buffer
- audio.getMemAddr(): Get peripheral memory base address for buffer access
## Usage Examples
# Encode with default quality (Q3)