diff --git a/CLAUDE.md b/CLAUDE.md index 6840625..6aeee31 100644 --- a/CLAUDE.md +++ b/CLAUDE.md @@ -316,20 +316,23 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic - **Adaptive GOPs**: Scene change detection ensures optimal GOP boundaries #### TAD Format (TSVM Advanced Audio) -- **Perceptual audio codec** for TSVM using DWT with 4-tap interpolating Deslauriers-Dubuc wavelets +- **Perceptual audio codec** for TSVM using CDF 9/7 biorthogonal wavelets - **C Encoder**: `video_encoder/encoder_tad.c` - Core Encoder library; `video_encoder/encoder_tad_standalone.c` - Standalone encoder with FFmpeg integration - How to build: `make tad` - **Quality Levels**: 0-5 (0=lowest quality/smallest, 5=highest quality/largest; designed to be in sync with TAV encoder) - **C Decoder**: `video_encoder/decoder_tad.c` - Standalone decoder for TAD format +- **Kotlin Decoder**: `AudioAdapter.kt` - Hardware-accelerated TAD decoder for TSVM runtime - **Features**: - **32 KHz stereo**: TSVM audio hardware native format - - **Variable chunk sizes**: 1024-32768+ samples, enables flexible TAV integration + - **Variable chunk sizes**: Any size ≥1024 samples, including non-power-of-2 (e.g., 32016 for TAV 1-second GOPs) - **M/S stereo decorrelation**: Exploits stereo correlation for better compression - - **PCM16→PCM8 conversion**: Error-diffusion dithering to minimize quantization noise - - **Variable-level DD-4 DWT**: Dynamic levels (log2(chunk_size) - 2) for frequency domain analysis - - **Perceptual quantization**: Frequency-dependent weights preserving critical 2-4 KHz range - - **2-bit twobitmap significance map**: Efficient encoding of sparse coefficients - - **Optional Zstd compression**: Level 7 for additional compression + - **Gamma compression**: Dynamic range compression (γ=0.707) before quantization + - **9-level CDF 9/7 DWT**: Fixed 9 decomposition levels for all chunk sizes + - **Perceptual quantization**: Frequency-dependent weights with lambda companding + - **Raw int8 storage**: Direct coefficient storage (no significance map, better Zstd compression) + - **Coefficient-domain dithering**: Light TPDF dithering to reduce banding + - **Zstd compression**: Level 7 for additional compression + - **Non-power-of-2 support**: Fixed 2025-10-30 to handle arbitrary chunk sizes correctly - **Usage Examples**: ```bash # Encode with default quality (Q3) @@ -348,7 +351,7 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic decoder_tad -i input.tad -o output.pcm ``` - **Format documentation**: `terranmon.txt` (search for "TSVM Advanced Audio (TAD) Format") -- **Version**: 1 (2-bit twobitmap significance map) +- **Version**: 1.1 (raw int8 storage with non-power-of-2 support, updated 2025-10-30) **TAD Compression Performance**: - **Target Compression**: 2:1 against PCMu8 baseline (4:1 against PCM16LE input) @@ -358,15 +361,16 @@ Implemented on 2025-10-15 for improved temporal compression through group-of-pic **TAD Encoding Pipeline**: 1. **FFmpeg Two-Pass Extraction**: High-quality SoXR resampling to 32 KHz with 16 Hz highpass filter -2. **PCM16→PCM8 with Dithering**: Error-diffusion dithering minimizes quantization noise +2. **Gamma Compression**: Dynamic range compression (γ=0.707) for perceptual uniformity 3. **M/S Stereo Decorrelation**: Transforms Left/Right to Mid/Side for better compression -4. **Variable-Level DD-4 DWT**: Deslauriers-Dubuc 4-tap interpolating wavelets with dynamic levels - - Default 32768 samples → 13 DWT levels - - Minimum 1024 samples → 8 DWT levels -5. **Frequency-Dependent Quantization**: Perceptual weights favor 2-4 KHz (speech intelligibility) +4. **9-Level CDF 9/7 DWT**: biorthogonal wavelets with fixed 9 levels + - All chunk sizes use 9 levels (sufficient for ≥512 samples after 9 halvings) + - Supports non-power-of-2 sizes through proper length tracking +5. **Frequency-Dependent Quantization**: Perceptual weights with lambda companding 6. **Dead Zone Quantization**: Zeros high-frequency noise (highest band) -7. **2-bit Twobitmap Encoding**: Maps coefficients to 00=0, 01=+1, 10=-1, 11=other -8. **Optional Zstd Compression**: Level 7 compression on concatenated Mid+Side data +7. **Coefficient-Domain Dithering**: Light TPDF dithering (±0.5 quantization steps) +8. **Raw Int8 Storage**: Direct coefficient storage as signed int8 values +9. **Optional Zstd Compression**: Level 7 compression on concatenated Mid+Side data **TAD Integration with TAV**: TAD is designed as an includable API for TAV video encoder integration. The variable chunk size @@ -375,7 +379,20 @@ TAV embeds TAD-compressed audio using packet type 0x24 with Zstd compression. **TAD Hardware Acceleration**: TSVM accelerates TAD decoding with AudioAdapter.kt (backend) and AudioJSR223Delegate.kt (API): -- Backend decoder in AudioAdapter.kt with variable chunk size support +- Backend decoder in AudioAdapter.kt with non-power-of-2 chunk size support (fixed 2025-10-30) - API functions in AudioJSR223Delegate.kt for JavaScript access -- Supports chunk sizes from 1024 to 32768+ samples -- Dynamic DWT level calculation for optimal performance +- Supports chunk sizes from 1024 to 32768+ samples (any size ≥1024) +- Fixed 9-level CDF 9/7 inverse DWT with correct length tracking for non-power-of-2 sizes + +**Critical Implementation Note (Fixed 2025-10-30)**: +Multi-level inverse DWT must pre-calculate the exact sequence of lengths from forward transform: +```kotlin +val lengths = IntArray(levels + 1) +lengths[0] = chunk_size +for (i in 1..levels) { + lengths[i] = (lengths[i - 1] + 1) / 2 +} +// Apply inverse DWT using lengths[level] for each level +``` +Using simple doubling (`length *= 2`) is incorrect for non-power-of-2 sizes and causes +mirrored subband artifacts. diff --git a/terranmon.txt b/terranmon.txt index 9b90be1..2aa3e39 100644 --- a/terranmon.txt +++ b/terranmon.txt @@ -1523,11 +1523,12 @@ Number|Index TSVM Advanced Audio (TAD) Format Created by CuriousTorvald and Claude on 2025-10-23 +Updated: 2025-10-30 (fixed non-power-of-2 sample count support) TAD is a perceptual audio codec for TSVM utilizing Discrete Wavelet Transform (DWT) -with 4-tap interpolating Deslauriers-Dubuc wavelets, providing efficient compression -through M/S stereo decorrelation, frequency-dependent quantization, and significance -map encoding. Designed as an includable API for integration with TAV video encoder. +with CDF 9/7 biorthogonal wavelets, providing efficient compression through M/S stereo +decorrelation, frequency-dependent quantization, and raw int8 coefficient storage. +Designed as an includable API for integration with TAV video encoder. When used inside of a video codec, only zstd-compressed payload is stored, chunk length is stored separately and quality index is shared with that of the video. @@ -1556,20 +1557,21 @@ is stored separately and quality index is shared with that of the video. - **Channels**: 2 (stereo) - **Input Format**: PCM32fLE (32-bit float little-endian PCM) - **Preprocessing**: 16 Hz highpass filter applied during extraction -- **Internal Representation**: Signed PCM8 with error-diffusion dithering -- **Chunk Size**: Variable (1024-32768+ samples per channel, must be power of 2) - - Default: 32768 samples (1.024 seconds at 32 kHz) +- **Internal Representation**: Float32 throughout encoding, PCM8 conversion only at decoder +- **Chunk Size**: Variable (1024-32768+ samples per channel, any size ≥1024 supported) + - Default: 32768 samples (1.024 seconds at 32 kHz) for standalone files + - TAV integration: Uses exact GOP sample count (e.g., 32016 for 1 second at 32 kHz) - Minimum: 1024 samples (32 ms at 32 kHz) - - DWT levels calculated dynamically: log2(chunk_size) - 1 + - DWT levels: Fixed at 9 levels for all chunk sizes - **Target Compression**: 2:1 against PCMu8 baseline +- **Wavelet**: CDF 9/7 biorthogonal ## Chunk Structure -Each chunk encodes a variable number of stereo samples (power of 2, minimum 1024). -Default is 32768 samples (65536 total samples, 1.024 seconds). -If the audio duration doesn't align to chunk boundaries, the final chunk can use -a smaller power-of-2 size or be zero-padded. +Each chunk encodes a variable number of stereo samples (minimum 1024, any size supported). +Default is 32768 samples (65536 total samples, 1.024 seconds) for standalone files. +TAV integration uses exact GOP sample counts (e.g., 32016 samples for 1 second at 32 kHz). - uint16 Sample Count: number of samples per channel (must be power of 2, min 1024) + uint16 Sample Count: number of samples per channel (min 1024, any size ≥1024) uint8 Max quantisation index: this number * 2 + 1 is the total steps of quantisation uint32 Chunk Payload Size: size of following payload in bytes * Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set) @@ -1580,11 +1582,12 @@ a smaller power-of-2 size or be zero-padded. ## Encoding Pipeline -### Step 1: PCM32f to PCM8 Conversion with Error-Diffusion Dithering -Input stereo PCM32fLE is converted to signed PCM8 using second-order noise-shaped -error-diffusion dithering to minimize quantization noise. +### Step 1: Dynamic Range Compression (Gamma Compression) +Input stereo PCM32fLE undergoes gamma compression for perceptual uniformity: -Error is propagated to the next sample (alternating between left/right channels). + encode(x) = sign(x) * |x|^γ where γ=0.707 (1/√2) + +This compresses dynamic range before quantization, improving perceptual quality. ### Step 2: M/S Stereo Decorrelation Mid-Side transformation exploits stereo correlation: @@ -1595,16 +1598,18 @@ Mid-Side transformation exploits stereo correlation: This typically concentrates energy in the Mid channel while the Side channel contains mostly small values, improving compression efficiency. -### Step 3: Variable-Level DD-4 DWT -Each channel (Mid and Side) undergoes Deslauriers-Dubuc 4-tap interpolating wavelet -decomposition. The number of DWT levels is calculated dynamically based on chunk size: +### Step 3: 9-Level CDF 9/7 DWT +Each channel (Mid and Side) undergoes CDF 9/7 biorthogonal wavelet decomposition. The codec uses a fixed 9 decomposition levels for all chunk sizes: - DWT Levels = log2(chunk_size) - 1 + DWT Levels = 9 (fixed) -For the default 32768-sample chunks, this produces 14 levels with frequency subbands: +For 32768-sample chunks: + - After 9 levels: 64 LL coefficients + - Frequency subbands: LL + 9 H bands (L9 to L1) - Level 0-13: High to low frequency coefficients - DC band: Low-frequency approximation coefficients +For 32016-sample chunks (TAV 1-second GOP): + - After 9 levels: 63 LL coefficients + - Supports non-power-of-2 sizes through proper length tracking (fixed 2025-10-30) Sideband boundaries are calculated dynamically: first_band_size = chunk_size >> dwt_levels @@ -1612,8 +1617,12 @@ Sideband boundaries are calculated dynamically: sideband[1] = first_band_size sideband[i+1] = sideband[i] + (first_band_size << (i-1)) -For 32768 samples with 14 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024, 2048, 4096, 8192, 16384, 32768 -For 1024 samples with 9 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024 +CDF 9/7 lifting coefficients: + α = -1.586134342 + β = -0.052980118 + γ = 0.882911076 + δ = 0.443506852 + K = 1.230174105 ### Step 4: Frequency-Dependent Quantization DWT coefficients are quantized using perceptually-tuned frequency-dependent weights. @@ -1630,32 +1639,53 @@ where coefficients smaller than half the quantization step are zeroed: This aggressively removes high-frequency noise while preserving important mid-frequency content (2-4 KHz critical for speech intelligibility). -### Step 5: 2-bit Significance Map Encoding -Quantized coefficients are encoded using the 2-bit twobitmap method (see above). +### Step 5: Raw Int8 Coefficient Storage +Quantized coefficients are stored directly as signed int8 values (no significance map, better Zstd compression). +Concatenated format: [Mid_channel_data][Side_channel_data] -### Step 6: Optional Zstd Compression -If enabled (default), the concatenated Mid+Side encoded data is compressed -using Zstd level 3 for additional compression without significant CPU overhead. +### Step 6: Coefficient-Domain Dithering (Encoder) +Light triangular dithering (±0.5 quantization steps) added to coefficients before +quantization to reduce banding artifacts. + +### Step 7: Zstd Compression +The concatenated Mid+Side encoded data is compressed +using Zstd level 7 for additional compression without significant CPU overhead. ## Decoding Pipeline -### Step 1: Chunk Extraction -Read chunk header to determine significance map method and compression status. -If compressed, decompress payload using Zstd. +### Step 1: Chunk Extraction and Decompression +Read chunk header (sample_count, max_index, payload_size). +If compressed (default), decompress payload using Zstd. -### Step 2: Decode Significance Maps -Decode Mid and Side channel data using 2-bit twobitmap decoder: - - Read 2-bit codes from significance map - - Reconstruct coefficients: 0, +1, -1, or read from Other Values array +### Step 2: Coefficient Extraction +Extract Mid and Side channel int8 data from concatenated payload: + - Mid channel: bytes [0..sample_count-1] + - Side channel: bytes [sample_count..2*sample_count-1] -### Step 3: Dequantization -Multiply quantized coefficients by frequency-dependent quantization steps -(same weights as encoder). +### Step 3: Dequantization with Lambda Decompanding +Convert quantized int8 values back to float coefficients using: + 1. Lambda decompanding (inverse of Laplacian CDF compression) + 2. Multiply by frequency-dependent quantization steps + 3. Apply coefficient-domain dithering (TPDF, ~-60 dBFS) -### Step 4: Variable-Level Inverse DD-4 DWT -Reconstruct PCM8 audio from DWT coefficients using inverse DD-4 transform, -progressively doubling length from the deepest level to chunk_size samples. -The number of inverse DWT levels matches the forward transform (log2(chunk_size) - 1). +### Step 4: 9-Level Inverse CDF 9/7 DWT +Reconstruct Float32 audio from DWT coefficients using inverse CDF 9/7 transform. + +**Critical Implementation (Fixed 2025-10-30)**: +The multi-level inverse DWT must use the EXACT sequence of lengths from forward +transform, in reverse order. Using simple doubling (length *= 2) is INCORRECT +for non-power-of-2 sizes. + +Correct approach: + 1. Pre-calculate all forward transform lengths: + lengths[0] = chunk_size + lengths[i] = (lengths[i-1] + 1) / 2 for i=1..9 + 2. Apply inverse DWT in reverse order: + for level from 8 down to 0: + apply inverse_dwt(data, lengths[level]) + +This ensures correct reconstruction for all chunk sizes including non-power-of-2 +values (e.g., 32016 samples for TAV 1-second GOPs). ### Step 5: M/S to L/R Conversion Convert Mid/Side back to Left/Right stereo: @@ -1663,6 +1693,16 @@ Convert Mid/Side back to Left/Right stereo: Left = Mid + Side Right = Mid - Side +### Step 6: Gamma Expansion +Expand dynamic range (inverse of encoder's gamma compression): + + decode(y) = sign(y) * |y|^(1/γ) where γ=0.707, so 1/γ=√2≈1.414 + +### Step 7: PCM32f to PCM8 Conversion with Noise-Shaped Dithering +Convert Float32 samples to unsigned PCM8 (PCMu8) using second-order error-diffusion +dithering with reduced amplitude (0.2× TPDF) to coordinate with coefficient-domain +dithering. + ## Compression Performance - **Target Ratio**: 2:1 against PCMu8 - **Achieved Ratio**: 2.51:1 against PCMu8 at quality level 3 @@ -1699,10 +1739,18 @@ TAD encoder uses two-pass FFmpeg extraction for optimal quality: This ensures resampling happens after extraction with optimal quality parameters. ## Hardware Acceleration API -TAD decoder may be accelerated using hardware functions in GraphicsJSR223Delegate: -- tadDecode(): Main decoding function (chunk-based) -- tadHaarIDWT(): Fast inverse Haar DWT -- tadDequantize(): Frequency-dependent dequantization +TAD decoder is accelerated through AudioAdapter.kt peripheral (backend) and +AudioJSR223Delegate.kt (JavaScript API): + +Backend (AudioAdapter.kt): +- decodeTad(): Main decoding function (chunk-based, reads from tadInputBin) +- dwt97Inverse1d(): Single-level inverse CDF 9/7 DWT +- dwt97InverseMultilevel(): 9-level inverse DWT with non-power-of-2 support + +JavaScript API (audio.* functions): +- audio.tadDecode(): Trigger TAD decoding from peripheral input buffer +- audio.tadUploadDecoded(offset, count): Upload decoded PCMu8 to playback buffer +- audio.getMemAddr(): Get peripheral memory base address for buffer access ## Usage Examples # Encode with default quality (Q3)