TAD documentation update

2026-06-14 08:24:04 +09:00 · 2025-10-30 22:13:29 +09:00
parent c61bf7750f
commit 46ad919407
2 changed files with 131 additions and 66 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -316,20 +316,23 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic
 - **Adaptive GOPs**: Scene change detection ensures optimal GOP boundaries
 #### TAD Format (TSVM Advanced Audio)
- **Perceptual audio codec** for TSVM using DWT with 4-tap interpolating Deslauriers-Dubuc wavelets
+- **Perceptual audio codec** for TSVM using CDF 9/7 biorthogonal wavelets
 - **C Encoder**: `video_encoder/encoder_tad.c` - Core Encoder library; `video_encoder/encoder_tad_standalone.c` - Standalone encoder with FFmpeg integration
  - How to build: `make tad`
  - **Quality Levels**: 0-5 (0=lowest quality/smallest, 5=highest quality/largest; designed to be in sync with TAV encoder)
 - **C Decoder**: `video_encoder/decoder_tad.c` - Standalone decoder for TAD format
 - **Kotlin Decoder**: `AudioAdapter.kt` - Hardware-accelerated TAD decoder for TSVM runtime
 - **Features**:
  - **32 KHz stereo**: TSVM audio hardware native format
-  - **Variable chunk sizes**: 1024-32768+ samples, enables flexible TAV integration
+  - **Variable chunk sizes**: Any size ≥1024 samples, including non-power-of-2 (e.g., 32016 for TAV 1-second GOPs)
  - **M/S stereo decorrelation**: Exploits stereo correlation for better compression
-  - **PCM16→PCM8 conversion**: Error-diffusion dithering to minimize quantization noise
+  - **Gamma compression**: Dynamic range compression (γ=0.707) before quantization
-  - **Variable-level DD-4 DWT**: Dynamic levels (log2(chunk_size) - 2) for frequency domain analysis
+  - **9-level CDF 9/7 DWT**: Fixed 9 decomposition levels for all chunk sizes
-  - **Perceptual quantization**: Frequency-dependent weights preserving critical 2-4 KHz range
+  - **Perceptual quantization**: Frequency-dependent weights with lambda companding
-  - **2-bit twobitmap significance map**: Efficient encoding of sparse coefficients
+  - **Raw int8 storage**: Direct coefficient storage (no significance map, better Zstd compression)
-  - **Optional Zstd compression**: Level 7 for additional compression
+  - **Coefficient-domain dithering**: Light TPDF dithering to reduce banding
  - **Zstd compression**: Level 7 for additional compression
  - **Non-power-of-2 support**: Fixed 2025-10-30 to handle arbitrary chunk sizes correctly
 - **Usage Examples**:
  ```bash
  # Encode with default quality (Q3)
@@ -348,7 +351,7 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic
  decoder_tad -i input.tad -o output.pcm
  ```
 - **Format documentation**: `terranmon.txt` (search for "TSVM Advanced Audio (TAD) Format")
- **Version**: 1 (2-bit twobitmap significance map)
+- **Version**: 1.1 (raw int8 storage with non-power-of-2 support, updated 2025-10-30)
 **TAD Compression Performance**:
 - **Target Compression**: 2:1 against PCMu8 baseline (4:1 against PCM16LE input)
@@ -358,15 +361,16 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic
 **TAD Encoding Pipeline**:
 1. **FFmpeg Two-Pass Extraction**: High-quality SoXR resampling to 32 KHz with 16 Hz highpass filter
-2. **PCM16→PCM8 with Dithering**: Error-diffusion dithering minimizes quantization noise
+2. **Gamma Compression**: Dynamic range compression (γ=0.707) for perceptual uniformity
 3. **M/S Stereo Decorrelation**: Transforms Left/Right to Mid/Side for better compression
-4. **Variable-Level DD-4 DWT**: Deslauriers-Dubuc 4-tap interpolating wavelets with dynamic levels
+4. **9-Level CDF 9/7 DWT**: biorthogonal wavelets with fixed 9 levels
-   - Default 32768 samples → 13 DWT levels
+   - All chunk sizes use 9 levels (sufficient for ≥512 samples after 9 halvings)
-   - Minimum 1024 samples → 8 DWT levels
+   - Supports non-power-of-2 sizes through proper length tracking
-5. **Frequency-Dependent Quantization**: Perceptual weights favor 2-4 KHz (speech intelligibility)
+5. **Frequency-Dependent Quantization**: Perceptual weights with lambda companding
 6. **Dead Zone Quantization**: Zeros high-frequency noise (highest band)
-7. **2-bit Twobitmap Encoding**: Maps coefficients to 00=0, 01=+1, 10=-1, 11=other
+7. **Coefficient-Domain Dithering**: Light TPDF dithering (±0.5 quantization steps)
-8. **Optional Zstd Compression**: Level 7 compression on concatenated Mid+Side data
+8. **Raw Int8 Storage**: Direct coefficient storage as signed int8 values
 9. **Optional Zstd Compression**: Level 7 compression on concatenated Mid+Side data
 **TAD Integration with TAV**:
 TAD is designed as an includable API for TAV video encoder integration. The variable chunk size
@@ -375,7 +379,20 @@ TAV embeds TAD-compressed audio using packet type 0x24 with Zstd compression.
 **TAD Hardware Acceleration**:
 TSVM accelerates TAD decoding with AudioAdapter.kt (backend) and AudioJSR223Delegate.kt (API):
- Backend decoder in AudioAdapter.kt with variable chunk size support
+- Backend decoder in AudioAdapter.kt with non-power-of-2 chunk size support (fixed 2025-10-30)
 - API functions in AudioJSR223Delegate.kt for JavaScript access
- Supports chunk sizes from 1024 to 32768+ samples
+- Supports chunk sizes from 1024 to 32768+ samples (any size ≥1024)
- Dynamic DWT level calculation for optimal performance
+- Fixed 9-level CDF 9/7 inverse DWT with correct length tracking for non-power-of-2 sizes
 **Critical Implementation Note (Fixed 2025-10-30)**:
 Multi-level inverse DWT must pre-calculate the exact sequence of lengths from forward transform:
 ```kotlin
 val lengths = IntArray(levels + 1)
 lengths[0] = chunk_size
 for (i in 1..levels) {
    lengths[i] = (lengths[i - 1] + 1) / 2
 }
 // Apply inverse DWT using lengths[level] for each level
 ```
 Using simple doubling (`length *= 2`) is incorrect for non-power-of-2 sizes and causes
 mirrored subband artifacts.
--- a/terranmon.txt
+++ b/terranmon.txt
@@ -1523,11 +1523,12 @@ Number|Index
 TSVM Advanced Audio (TAD) Format
 Created by CuriousTorvald and Claude on 2025-10-23
 Updated: 2025-10-30 (fixed non-power-of-2 sample count support)
 TAD is a perceptual audio codec for TSVM utilizing Discrete Wavelet Transform (DWT)
-with 4-tap interpolating Deslauriers-Dubuc wavelets, providing efficient compression
+with CDF 9/7 biorthogonal wavelets, providing efficient compression through M/S stereo
-through M/S stereo decorrelation, frequency-dependent quantization, and significance
+decorrelation, frequency-dependent quantization, and raw int8 coefficient storage.
-map encoding. Designed as an includable API for integration with TAV video encoder.
+Designed as an includable API for integration with TAV video encoder.
 When used inside of a video codec, only zstd-compressed payload is stored, chunk length
 is stored separately and quality index is shared with that of the video.
@@ -1556,20 +1557,21 @@ is stored separately and quality index is shared with that of the video.
 - **Channels**: 2 (stereo)
 - **Input Format**: PCM32fLE (32-bit float little-endian PCM)
 - **Preprocessing**: 16 Hz highpass filter applied during extraction
- **Internal Representation**: Signed PCM8 with error-diffusion dithering
+- **Internal Representation**: Float32 throughout encoding, PCM8 conversion only at decoder
- **Chunk Size**: Variable (1024-32768+ samples per channel, must be power of 2)
+- **Chunk Size**: Variable (1024-32768+ samples per channel, any size ≥1024 supported)
-  - Default: 32768 samples (1.024 seconds at 32 kHz)
+  - Default: 32768 samples (1.024 seconds at 32 kHz) for standalone files
  - TAV integration: Uses exact GOP sample count (e.g., 32016 for 1 second at 32 kHz)
  - Minimum: 1024 samples (32 ms at 32 kHz)
-  - DWT levels calculated dynamically: log2(chunk_size) - 1
+  - DWT levels: Fixed at 9 levels for all chunk sizes
 - **Target Compression**: 2:1 against PCMu8 baseline
 - **Wavelet**: CDF 9/7 biorthogonal
 ## Chunk Structure
-Each chunk encodes a variable number of stereo samples (power of 2, minimum 1024).
+Each chunk encodes a variable number of stereo samples (minimum 1024, any size supported).
-Default is 32768 samples (65536 total samples, 1.024 seconds).
+Default is 32768 samples (65536 total samples, 1.024 seconds) for standalone files.
-If the audio duration doesn't align to chunk boundaries, the final chunk can use
+TAV integration uses exact GOP sample counts (e.g., 32016 samples for 1 second at 32 kHz).
 a smaller power-of-2 size or be zero-padded.
-    uint16 Sample Count: number of samples per channel (must be power of 2, min 1024)
+    uint16 Sample Count: number of samples per channel (min 1024, any size ≥1024)
    uint8  Max quantisation index: this number * 2 + 1 is the total steps of quantisation
    uint32 Chunk Payload Size: size of following payload in bytes
    *      Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set)
@@ -1580,11 +1582,12 @@ a smaller power-of-2 size or be zero-padded.
 ## Encoding Pipeline
-### Step 1: PCM32f to PCM8 Conversion with Error-Diffusion Dithering
+### Step 1: Dynamic Range Compression (Gamma Compression)
-Input stereo PCM32fLE is converted to signed PCM8 using second-order noise-shaped
+Input stereo PCM32fLE undergoes gamma compression for perceptual uniformity:
 error-diffusion dithering to minimize quantization noise.
-Error is propagated to the next sample (alternating between left/right channels).
+    encode(x) = sign(x) * |x|^γ  where γ=0.707 (1/√2)
 This compresses dynamic range before quantization, improving perceptual quality.
 ### Step 2: M/S Stereo Decorrelation
 Mid-Side transformation exploits stereo correlation:
@@ -1595,16 +1598,18 @@ Mid-Side transformation exploits stereo correlation:
 This typically concentrates energy in the Mid channel while the Side channel
 contains mostly small values, improving compression efficiency.
-### Step 3: Variable-Level DD-4 DWT
+### Step 3: 9-Level CDF 9/7 DWT
-Each channel (Mid and Side) undergoes Deslauriers-Dubuc 4-tap interpolating wavelet
+Each channel (Mid and Side) undergoes CDF 9/7 biorthogonal wavelet decomposition. The codec uses a fixed 9 decomposition levels for all chunk sizes:
 decomposition. The number of DWT levels is calculated dynamically based on chunk size:
-    DWT Levels = log2(chunk_size) - 1
+    DWT Levels = 9 (fixed)
-For the default 32768-sample chunks, this produces 14 levels with frequency subbands:
+For 32768-sample chunks:
    - After 9 levels: 64 LL coefficients
    - Frequency subbands: LL + 9 H bands (L9 to L1)
-    Level 0-13: High to low frequency coefficients
+For 32016-sample chunks (TAV 1-second GOP):
-    DC band: Low-frequency approximation coefficients
+    - After 9 levels: 63 LL coefficients
    - Supports non-power-of-2 sizes through proper length tracking (fixed 2025-10-30)
 Sideband boundaries are calculated dynamically:
    first_band_size = chunk_size >> dwt_levels
@@ -1612,8 +1617,12 @@ Sideband boundaries are calculated dynamically:
    sideband[1] = first_band_size
    sideband[i+1] = sideband[i] + (first_band_size << (i-1))
-For 32768 samples with 14 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768
+CDF 9/7 lifting coefficients:
-For 1024 samples with 9 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
+    α = -1.586134342
    β = -0.052980118
    γ = 0.882911076
    δ = 0.443506852
    K = 1.230174105
 ### Step 4: Frequency-Dependent Quantization
 DWT coefficients are quantized using perceptually-tuned frequency-dependent weights.
@@ -1630,32 +1639,53 @@ where coefficients smaller than half the quantization step are zeroed:
 This aggressively removes high-frequency noise while preserving important
 mid-frequency content (2-4 KHz critical for speech intelligibility).
-### Step 5: 2-bit Significance Map Encoding
+### Step 5: Raw Int8 Coefficient Storage
-Quantized coefficients are encoded using the 2-bit twobitmap method (see above).
+Quantized coefficients are stored directly as signed int8 values (no significance map, better Zstd compression).
 Concatenated format: [Mid_channel_data][Side_channel_data]
-### Step 6: Optional Zstd Compression
+### Step 6: Coefficient-Domain Dithering (Encoder)
-If enabled (default), the concatenated Mid+Side encoded data is compressed
+Light triangular dithering (±0.5 quantization steps) added to coefficients before
-using Zstd level 3 for additional compression without significant CPU overhead.
+quantization to reduce banding artifacts.
 ### Step 7: Zstd Compression
 The concatenated Mid+Side encoded data is compressed
 using Zstd level 7 for additional compression without significant CPU overhead.
 ## Decoding Pipeline
-### Step 1: Chunk Extraction
+### Step 1: Chunk Extraction and Decompression
-Read chunk header to determine significance map method and compression status.
+Read chunk header (sample_count, max_index, payload_size).
-If compressed, decompress payload using Zstd.
+If compressed (default), decompress payload using Zstd.
-### Step 2: Decode Significance Maps
+### Step 2: Coefficient Extraction
-Decode Mid and Side channel data using 2-bit twobitmap decoder:
+Extract Mid and Side channel int8 data from concatenated payload:
-    - Read 2-bit codes from significance map
+    - Mid channel: bytes [0..sample_count-1]
-    - Reconstruct coefficients: 0, +1, -1, or read from Other Values array
+    - Side channel: bytes [sample_count..2*sample_count-1]
-### Step 3: Dequantization
+### Step 3: Dequantization with Lambda Decompanding
-Multiply quantized coefficients by frequency-dependent quantization steps
+Convert quantized int8 values back to float coefficients using:
-(same weights as encoder).
+    1. Lambda decompanding (inverse of Laplacian CDF compression)
    2. Multiply by frequency-dependent quantization steps
    3. Apply coefficient-domain dithering (TPDF, ~-60 dBFS)
-### Step 4: Variable-Level Inverse DD-4 DWT
+### Step 4: 9-Level Inverse CDF 9/7 DWT
-Reconstruct PCM8 audio from DWT coefficients using inverse DD-4 transform,
+Reconstruct Float32 audio from DWT coefficients using inverse CDF 9/7 transform.
-progressively doubling length from the deepest level to chunk_size samples.
+
-The number of inverse DWT levels matches the forward transform (log2(chunk_size) - 1).
+**Critical Implementation (Fixed 2025-10-30)**:
 The multi-level inverse DWT must use the EXACT sequence of lengths from forward
 transform, in reverse order. Using simple doubling (length *= 2) is INCORRECT
 for non-power-of-2 sizes.
 Correct approach:
    1. Pre-calculate all forward transform lengths:
       lengths[0] = chunk_size
       lengths[i] = (lengths[i-1] + 1) / 2  for i=1..9
    2. Apply inverse DWT in reverse order:
       for level from 8 down to 0:
           apply inverse_dwt(data, lengths[level])
 This ensures correct reconstruction for all chunk sizes including non-power-of-2
 values (e.g., 32016 samples for TAV 1-second GOPs).
 ### Step 5: M/S to L/R Conversion
 Convert Mid/Side back to Left/Right stereo:
@@ -1663,6 +1693,16 @@ Convert Mid/Side back to Left/Right stereo:
    Left = Mid + Side
    Right = Mid - Side
 ### Step 6: Gamma Expansion
 Expand dynamic range (inverse of encoder's gamma compression):
    decode(y) = sign(y) * |y|^(1/γ)  where γ=0.707, so 1/γ=√2≈1.414
 ### Step 7: PCM32f to PCM8 Conversion with Noise-Shaped Dithering
 Convert Float32 samples to unsigned PCM8 (PCMu8) using second-order error-diffusion
 dithering with reduced amplitude (0.2× TPDF) to coordinate with coefficient-domain
 dithering.
 ## Compression Performance
 - **Target Ratio**: 2:1 against PCMu8
 - **Achieved Ratio**: 2.51:1 against PCMu8 at quality level 3
@@ -1699,10 +1739,18 @@ TAD encoder uses two-pass FFmpeg extraction for optimal quality:
 This ensures resampling happens after extraction with optimal quality parameters.
 ## Hardware Acceleration API
-TAD decoder may be accelerated using hardware functions in GraphicsJSR223Delegate:
+TAD decoder is accelerated through AudioAdapter.kt peripheral (backend) and
- tadDecode(): Main decoding function (chunk-based)
+AudioJSR223Delegate.kt (JavaScript API):
- tadHaarIDWT(): Fast inverse Haar DWT
+
- tadDequantize(): Frequency-dependent dequantization
+Backend (AudioAdapter.kt):
 - decodeTad(): Main decoding function (chunk-based, reads from tadInputBin)
 - dwt97Inverse1d(): Single-level inverse CDF 9/7 DWT
 - dwt97InverseMultilevel(): 9-level inverse DWT with non-power-of-2 support
 JavaScript API (audio.* functions):
 - audio.tadDecode(): Trigger TAD decoding from peripheral input buffer
 - audio.tadUploadDecoded(offset, count): Upload decoded PCMu8 to playback buffer
 - audio.getMemAddr(): Get peripheral memory base address for buffer access
 ## Usage Examples
    # Encode with default quality (Q3)