TAD documentation update

This commit is contained in:
minjaesong
2025-10-30 22:13:29 +09:00
parent c61bf7750f
commit 46ad919407
2 changed files with 131 additions and 66 deletions

View File

@@ -316,20 +316,23 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic
- **Adaptive GOPs**: Scene change detection ensures optimal GOP boundaries - **Adaptive GOPs**: Scene change detection ensures optimal GOP boundaries
#### TAD Format (TSVM Advanced Audio) #### TAD Format (TSVM Advanced Audio)
- **Perceptual audio codec** for TSVM using DWT with 4-tap interpolating Deslauriers-Dubuc wavelets - **Perceptual audio codec** for TSVM using CDF 9/7 biorthogonal wavelets
- **C Encoder**: `video_encoder/encoder_tad.c` - Core Encoder library; `video_encoder/encoder_tad_standalone.c` - Standalone encoder with FFmpeg integration - **C Encoder**: `video_encoder/encoder_tad.c` - Core Encoder library; `video_encoder/encoder_tad_standalone.c` - Standalone encoder with FFmpeg integration
- How to build: `make tad` - How to build: `make tad`
- **Quality Levels**: 0-5 (0=lowest quality/smallest, 5=highest quality/largest; designed to be in sync with TAV encoder) - **Quality Levels**: 0-5 (0=lowest quality/smallest, 5=highest quality/largest; designed to be in sync with TAV encoder)
- **C Decoder**: `video_encoder/decoder_tad.c` - Standalone decoder for TAD format - **C Decoder**: `video_encoder/decoder_tad.c` - Standalone decoder for TAD format
- **Kotlin Decoder**: `AudioAdapter.kt` - Hardware-accelerated TAD decoder for TSVM runtime
- **Features**: - **Features**:
- **32 KHz stereo**: TSVM audio hardware native format - **32 KHz stereo**: TSVM audio hardware native format
- **Variable chunk sizes**: 1024-32768+ samples, enables flexible TAV integration - **Variable chunk sizes**: Any size ≥1024 samples, including non-power-of-2 (e.g., 32016 for TAV 1-second GOPs)
- **M/S stereo decorrelation**: Exploits stereo correlation for better compression - **M/S stereo decorrelation**: Exploits stereo correlation for better compression
- **PCM16→PCM8 conversion**: Error-diffusion dithering to minimize quantization noise - **Gamma compression**: Dynamic range compression (γ=0.707) before quantization
- **Variable-level DD-4 DWT**: Dynamic levels (log2(chunk_size) - 2) for frequency domain analysis - **9-level CDF 9/7 DWT**: Fixed 9 decomposition levels for all chunk sizes
- **Perceptual quantization**: Frequency-dependent weights preserving critical 2-4 KHz range - **Perceptual quantization**: Frequency-dependent weights with lambda companding
- **2-bit twobitmap significance map**: Efficient encoding of sparse coefficients - **Raw int8 storage**: Direct coefficient storage (no significance map, better Zstd compression)
- **Optional Zstd compression**: Level 7 for additional compression - **Coefficient-domain dithering**: Light TPDF dithering to reduce banding
- **Zstd compression**: Level 7 for additional compression
- **Non-power-of-2 support**: Fixed 2025-10-30 to handle arbitrary chunk sizes correctly
- **Usage Examples**: - **Usage Examples**:
```bash ```bash
# Encode with default quality (Q3) # Encode with default quality (Q3)
@@ -348,7 +351,7 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic
decoder_tad -i input.tad -o output.pcm decoder_tad -i input.tad -o output.pcm
``` ```
- **Format documentation**: `terranmon.txt` (search for "TSVM Advanced Audio (TAD) Format") - **Format documentation**: `terranmon.txt` (search for "TSVM Advanced Audio (TAD) Format")
- **Version**: 1 (2-bit twobitmap significance map) - **Version**: 1.1 (raw int8 storage with non-power-of-2 support, updated 2025-10-30)
**TAD Compression Performance**: **TAD Compression Performance**:
- **Target Compression**: 2:1 against PCMu8 baseline (4:1 against PCM16LE input) - **Target Compression**: 2:1 against PCMu8 baseline (4:1 against PCM16LE input)
@@ -358,15 +361,16 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic
**TAD Encoding Pipeline**: **TAD Encoding Pipeline**:
1. **FFmpeg Two-Pass Extraction**: High-quality SoXR resampling to 32 KHz with 16 Hz highpass filter 1. **FFmpeg Two-Pass Extraction**: High-quality SoXR resampling to 32 KHz with 16 Hz highpass filter
2. **PCM16→PCM8 with Dithering**: Error-diffusion dithering minimizes quantization noise 2. **Gamma Compression**: Dynamic range compression (γ=0.707) for perceptual uniformity
3. **M/S Stereo Decorrelation**: Transforms Left/Right to Mid/Side for better compression 3. **M/S Stereo Decorrelation**: Transforms Left/Right to Mid/Side for better compression
4. **Variable-Level DD-4 DWT**: Deslauriers-Dubuc 4-tap interpolating wavelets with dynamic levels 4. **9-Level CDF 9/7 DWT**: biorthogonal wavelets with fixed 9 levels
- Default 32768 samples → 13 DWT levels - All chunk sizes use 9 levels (sufficient for ≥512 samples after 9 halvings)
- Minimum 1024 samples → 8 DWT levels - Supports non-power-of-2 sizes through proper length tracking
5. **Frequency-Dependent Quantization**: Perceptual weights favor 2-4 KHz (speech intelligibility) 5. **Frequency-Dependent Quantization**: Perceptual weights with lambda companding
6. **Dead Zone Quantization**: Zeros high-frequency noise (highest band) 6. **Dead Zone Quantization**: Zeros high-frequency noise (highest band)
7. **2-bit Twobitmap Encoding**: Maps coefficients to 00=0, 01=+1, 10=-1, 11=other 7. **Coefficient-Domain Dithering**: Light TPDF dithering (±0.5 quantization steps)
8. **Optional Zstd Compression**: Level 7 compression on concatenated Mid+Side data 8. **Raw Int8 Storage**: Direct coefficient storage as signed int8 values
9. **Optional Zstd Compression**: Level 7 compression on concatenated Mid+Side data
**TAD Integration with TAV**: **TAD Integration with TAV**:
TAD is designed as an includable API for TAV video encoder integration. The variable chunk size TAD is designed as an includable API for TAV video encoder integration. The variable chunk size
@@ -375,7 +379,20 @@ TAV embeds TAD-compressed audio using packet type 0x24 with Zstd compression.
**TAD Hardware Acceleration**: **TAD Hardware Acceleration**:
TSVM accelerates TAD decoding with AudioAdapter.kt (backend) and AudioJSR223Delegate.kt (API): TSVM accelerates TAD decoding with AudioAdapter.kt (backend) and AudioJSR223Delegate.kt (API):
- Backend decoder in AudioAdapter.kt with variable chunk size support - Backend decoder in AudioAdapter.kt with non-power-of-2 chunk size support (fixed 2025-10-30)
- API functions in AudioJSR223Delegate.kt for JavaScript access - API functions in AudioJSR223Delegate.kt for JavaScript access
- Supports chunk sizes from 1024 to 32768+ samples - Supports chunk sizes from 1024 to 32768+ samples (any size ≥1024)
- Dynamic DWT level calculation for optimal performance - Fixed 9-level CDF 9/7 inverse DWT with correct length tracking for non-power-of-2 sizes
**Critical Implementation Note (Fixed 2025-10-30)**:
Multi-level inverse DWT must pre-calculate the exact sequence of lengths from forward transform:
```kotlin
val lengths = IntArray(levels + 1)
lengths[0] = chunk_size
for (i in 1..levels) {
lengths[i] = (lengths[i - 1] + 1) / 2
}
// Apply inverse DWT using lengths[level] for each level
```
Using simple doubling (`length *= 2`) is incorrect for non-power-of-2 sizes and causes
mirrored subband artifacts.

View File

@@ -1523,11 +1523,12 @@ Number|Index
TSVM Advanced Audio (TAD) Format TSVM Advanced Audio (TAD) Format
Created by CuriousTorvald and Claude on 2025-10-23 Created by CuriousTorvald and Claude on 2025-10-23
Updated: 2025-10-30 (fixed non-power-of-2 sample count support)
TAD is a perceptual audio codec for TSVM utilizing Discrete Wavelet Transform (DWT) TAD is a perceptual audio codec for TSVM utilizing Discrete Wavelet Transform (DWT)
with 4-tap interpolating Deslauriers-Dubuc wavelets, providing efficient compression with CDF 9/7 biorthogonal wavelets, providing efficient compression through M/S stereo
through M/S stereo decorrelation, frequency-dependent quantization, and significance decorrelation, frequency-dependent quantization, and raw int8 coefficient storage.
map encoding. Designed as an includable API for integration with TAV video encoder. Designed as an includable API for integration with TAV video encoder.
When used inside of a video codec, only zstd-compressed payload is stored, chunk length When used inside of a video codec, only zstd-compressed payload is stored, chunk length
is stored separately and quality index is shared with that of the video. is stored separately and quality index is shared with that of the video.
@@ -1556,20 +1557,21 @@ is stored separately and quality index is shared with that of the video.
- **Channels**: 2 (stereo) - **Channels**: 2 (stereo)
- **Input Format**: PCM32fLE (32-bit float little-endian PCM) - **Input Format**: PCM32fLE (32-bit float little-endian PCM)
- **Preprocessing**: 16 Hz highpass filter applied during extraction - **Preprocessing**: 16 Hz highpass filter applied during extraction
- **Internal Representation**: Signed PCM8 with error-diffusion dithering - **Internal Representation**: Float32 throughout encoding, PCM8 conversion only at decoder
- **Chunk Size**: Variable (1024-32768+ samples per channel, must be power of 2) - **Chunk Size**: Variable (1024-32768+ samples per channel, any size ≥1024 supported)
- Default: 32768 samples (1.024 seconds at 32 kHz) - Default: 32768 samples (1.024 seconds at 32 kHz) for standalone files
- TAV integration: Uses exact GOP sample count (e.g., 32016 for 1 second at 32 kHz)
- Minimum: 1024 samples (32 ms at 32 kHz) - Minimum: 1024 samples (32 ms at 32 kHz)
- DWT levels calculated dynamically: log2(chunk_size) - 1 - DWT levels: Fixed at 9 levels for all chunk sizes
- **Target Compression**: 2:1 against PCMu8 baseline - **Target Compression**: 2:1 against PCMu8 baseline
- **Wavelet**: CDF 9/7 biorthogonal
## Chunk Structure ## Chunk Structure
Each chunk encodes a variable number of stereo samples (power of 2, minimum 1024). Each chunk encodes a variable number of stereo samples (minimum 1024, any size supported).
Default is 32768 samples (65536 total samples, 1.024 seconds). Default is 32768 samples (65536 total samples, 1.024 seconds) for standalone files.
If the audio duration doesn't align to chunk boundaries, the final chunk can use TAV integration uses exact GOP sample counts (e.g., 32016 samples for 1 second at 32 kHz).
a smaller power-of-2 size or be zero-padded.
uint16 Sample Count: number of samples per channel (must be power of 2, min 1024) uint16 Sample Count: number of samples per channel (min 1024, any size ≥1024)
uint8 Max quantisation index: this number * 2 + 1 is the total steps of quantisation uint8 Max quantisation index: this number * 2 + 1 is the total steps of quantisation
uint32 Chunk Payload Size: size of following payload in bytes uint32 Chunk Payload Size: size of following payload in bytes
* Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set) * Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set)
@@ -1580,11 +1582,12 @@ a smaller power-of-2 size or be zero-padded.
## Encoding Pipeline ## Encoding Pipeline
### Step 1: PCM32f to PCM8 Conversion with Error-Diffusion Dithering ### Step 1: Dynamic Range Compression (Gamma Compression)
Input stereo PCM32fLE is converted to signed PCM8 using second-order noise-shaped Input stereo PCM32fLE undergoes gamma compression for perceptual uniformity:
error-diffusion dithering to minimize quantization noise.
Error is propagated to the next sample (alternating between left/right channels). encode(x) = sign(x) * |x|^γ where γ=0.707 (1/√2)
This compresses dynamic range before quantization, improving perceptual quality.
### Step 2: M/S Stereo Decorrelation ### Step 2: M/S Stereo Decorrelation
Mid-Side transformation exploits stereo correlation: Mid-Side transformation exploits stereo correlation:
@@ -1595,16 +1598,18 @@ Mid-Side transformation exploits stereo correlation:
This typically concentrates energy in the Mid channel while the Side channel This typically concentrates energy in the Mid channel while the Side channel
contains mostly small values, improving compression efficiency. contains mostly small values, improving compression efficiency.
### Step 3: Variable-Level DD-4 DWT ### Step 3: 9-Level CDF 9/7 DWT
Each channel (Mid and Side) undergoes Deslauriers-Dubuc 4-tap interpolating wavelet Each channel (Mid and Side) undergoes CDF 9/7 biorthogonal wavelet decomposition. The codec uses a fixed 9 decomposition levels for all chunk sizes:
decomposition. The number of DWT levels is calculated dynamically based on chunk size:
DWT Levels = log2(chunk_size) - 1 DWT Levels = 9 (fixed)
For the default 32768-sample chunks, this produces 14 levels with frequency subbands: For 32768-sample chunks:
- After 9 levels: 64 LL coefficients
- Frequency subbands: LL + 9 H bands (L9 to L1)
Level 0-13: High to low frequency coefficients For 32016-sample chunks (TAV 1-second GOP):
DC band: Low-frequency approximation coefficients - After 9 levels: 63 LL coefficients
- Supports non-power-of-2 sizes through proper length tracking (fixed 2025-10-30)
Sideband boundaries are calculated dynamically: Sideband boundaries are calculated dynamically:
first_band_size = chunk_size >> dwt_levels first_band_size = chunk_size >> dwt_levels
@@ -1612,8 +1617,12 @@ Sideband boundaries are calculated dynamically:
sideband[1] = first_band_size sideband[1] = first_band_size
sideband[i+1] = sideband[i] + (first_band_size << (i-1)) sideband[i+1] = sideband[i] + (first_band_size << (i-1))
For 32768 samples with 14 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768 CDF 9/7 lifting coefficients:
For 1024 samples with 9 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 α = -1.586134342
β = -0.052980118
γ = 0.882911076
δ = 0.443506852
K = 1.230174105
### Step 4: Frequency-Dependent Quantization ### Step 4: Frequency-Dependent Quantization
DWT coefficients are quantized using perceptually-tuned frequency-dependent weights. DWT coefficients are quantized using perceptually-tuned frequency-dependent weights.
@@ -1630,32 +1639,53 @@ where coefficients smaller than half the quantization step are zeroed:
This aggressively removes high-frequency noise while preserving important This aggressively removes high-frequency noise while preserving important
mid-frequency content (2-4 KHz critical for speech intelligibility). mid-frequency content (2-4 KHz critical for speech intelligibility).
### Step 5: 2-bit Significance Map Encoding ### Step 5: Raw Int8 Coefficient Storage
Quantized coefficients are encoded using the 2-bit twobitmap method (see above). Quantized coefficients are stored directly as signed int8 values (no significance map, better Zstd compression).
Concatenated format: [Mid_channel_data][Side_channel_data]
### Step 6: Optional Zstd Compression ### Step 6: Coefficient-Domain Dithering (Encoder)
If enabled (default), the concatenated Mid+Side encoded data is compressed Light triangular dithering (±0.5 quantization steps) added to coefficients before
using Zstd level 3 for additional compression without significant CPU overhead. quantization to reduce banding artifacts.
### Step 7: Zstd Compression
The concatenated Mid+Side encoded data is compressed
using Zstd level 7 for additional compression without significant CPU overhead.
## Decoding Pipeline ## Decoding Pipeline
### Step 1: Chunk Extraction ### Step 1: Chunk Extraction and Decompression
Read chunk header to determine significance map method and compression status. Read chunk header (sample_count, max_index, payload_size).
If compressed, decompress payload using Zstd. If compressed (default), decompress payload using Zstd.
### Step 2: Decode Significance Maps ### Step 2: Coefficient Extraction
Decode Mid and Side channel data using 2-bit twobitmap decoder: Extract Mid and Side channel int8 data from concatenated payload:
- Read 2-bit codes from significance map - Mid channel: bytes [0..sample_count-1]
- Reconstruct coefficients: 0, +1, -1, or read from Other Values array - Side channel: bytes [sample_count..2*sample_count-1]
### Step 3: Dequantization ### Step 3: Dequantization with Lambda Decompanding
Multiply quantized coefficients by frequency-dependent quantization steps Convert quantized int8 values back to float coefficients using:
(same weights as encoder). 1. Lambda decompanding (inverse of Laplacian CDF compression)
2. Multiply by frequency-dependent quantization steps
3. Apply coefficient-domain dithering (TPDF, ~-60 dBFS)
### Step 4: Variable-Level Inverse DD-4 DWT ### Step 4: 9-Level Inverse CDF 9/7 DWT
Reconstruct PCM8 audio from DWT coefficients using inverse DD-4 transform, Reconstruct Float32 audio from DWT coefficients using inverse CDF 9/7 transform.
progressively doubling length from the deepest level to chunk_size samples.
The number of inverse DWT levels matches the forward transform (log2(chunk_size) - 1). **Critical Implementation (Fixed 2025-10-30)**:
The multi-level inverse DWT must use the EXACT sequence of lengths from forward
transform, in reverse order. Using simple doubling (length *= 2) is INCORRECT
for non-power-of-2 sizes.
Correct approach:
1. Pre-calculate all forward transform lengths:
lengths[0] = chunk_size
lengths[i] = (lengths[i-1] + 1) / 2 for i=1..9
2. Apply inverse DWT in reverse order:
for level from 8 down to 0:
apply inverse_dwt(data, lengths[level])
This ensures correct reconstruction for all chunk sizes including non-power-of-2
values (e.g., 32016 samples for TAV 1-second GOPs).
### Step 5: M/S to L/R Conversion ### Step 5: M/S to L/R Conversion
Convert Mid/Side back to Left/Right stereo: Convert Mid/Side back to Left/Right stereo:
@@ -1663,6 +1693,16 @@ Convert Mid/Side back to Left/Right stereo:
Left = Mid + Side Left = Mid + Side
Right = Mid - Side Right = Mid - Side
### Step 6: Gamma Expansion
Expand dynamic range (inverse of encoder's gamma compression):
decode(y) = sign(y) * |y|^(1/γ) where γ=0.707, so 1/γ=√2≈1.414
### Step 7: PCM32f to PCM8 Conversion with Noise-Shaped Dithering
Convert Float32 samples to unsigned PCM8 (PCMu8) using second-order error-diffusion
dithering with reduced amplitude (0.2× TPDF) to coordinate with coefficient-domain
dithering.
## Compression Performance ## Compression Performance
- **Target Ratio**: 2:1 against PCMu8 - **Target Ratio**: 2:1 against PCMu8
- **Achieved Ratio**: 2.51:1 against PCMu8 at quality level 3 - **Achieved Ratio**: 2.51:1 against PCMu8 at quality level 3
@@ -1699,10 +1739,18 @@ TAD encoder uses two-pass FFmpeg extraction for optimal quality:
This ensures resampling happens after extraction with optimal quality parameters. This ensures resampling happens after extraction with optimal quality parameters.
## Hardware Acceleration API ## Hardware Acceleration API
TAD decoder may be accelerated using hardware functions in GraphicsJSR223Delegate: TAD decoder is accelerated through AudioAdapter.kt peripheral (backend) and
- tadDecode(): Main decoding function (chunk-based) AudioJSR223Delegate.kt (JavaScript API):
- tadHaarIDWT(): Fast inverse Haar DWT
- tadDequantize(): Frequency-dependent dequantization Backend (AudioAdapter.kt):
- decodeTad(): Main decoding function (chunk-based, reads from tadInputBin)
- dwt97Inverse1d(): Single-level inverse CDF 9/7 DWT
- dwt97InverseMultilevel(): 9-level inverse DWT with non-power-of-2 support
JavaScript API (audio.* functions):
- audio.tadDecode(): Trigger TAD decoding from peripheral input buffer
- audio.tadUploadDecoded(offset, count): Upload decoded PCMu8 to playback buffer
- audio.getMemAddr(): Get peripheral memory base address for buffer access
## Usage Examples ## Usage Examples
# Encode with default quality (Q3) # Encode with default quality (Q3)