TAD: now processing entirely in float

minjaesong
2025-10-24 05:31:38 +09:00
parent a9319fd812
commit 9dc71095a0
5 changed files with 139 additions and 319 deletions


@@ -1550,7 +1550,7 @@ is stored separately and quality index is shared with that of the video.
## Audio Properties
- **Sample Rate**: 32000 Hz (TSVM audio hardware native format)
- **Channels**: 2 (stereo)
-- **Input Format**: PCM16LE (16-bit signed little-endian PCM)
+- **Input Format**: PCM32fLE (32-bit float little-endian PCM)
- **Preprocessing**: 16 Hz highpass filter applied during extraction
- **Internal Representation**: Signed PCM8 with error-diffusion dithering
- **Chunk Size**: Variable (1024-32768+ samples per channel, must be power of 2)
@@ -1565,8 +1565,6 @@ Default is 32768 samples (65536 total samples, 1.024 seconds).
If the audio duration doesn't align to chunk boundaries, the final chunk can use
a smaller power-of-2 size or be zero-padded.
uint8 Significance Map Method: always 1 (2-bit twobitmap)
uint8 Compression Flag: 1=Zstd compressed, 0=uncompressed
uint16 Sample Count: number of samples per channel (must be power of 2, min 1024)
uint32 Chunk Payload Size: size of following payload in bytes
* Chunk Payload: encoded M/S stereo data (Zstd compressed if flag set)
@@ -1592,13 +1590,9 @@ as int16 in the order they appear.
## Encoding Pipeline
-### Step 1: PCM16 to PCM8 Conversion with Error-Diffusion Dithering
-Input stereo PCM16LE is converted to signed PCM8 using error-diffusion dithering
-to minimize quantization noise:
-dithered_value = pcm16_value / 256 + error
-pcm8_value = clamp(round(dithered_value), -128, 127)
-error = dithered_value - pcm8_value
+### Step 1: PCM32f to PCM8 Conversion with Error-Diffusion Dithering
+Input stereo PCM32fLE is converted to signed PCM8 using second-order noise-shaped
+error-diffusion dithering to minimize quantization noise.
Error is propagated to the next sample (alternating between left/right channels).
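As a rough illustration of the error-diffusion idea in Step 1, the sketch below adapts the first-order pseudo-code from the previous revision to float input. It is not the second-order noise-shaped filter the new text refers to, since its coefficients are not shown in this diff. The function name `float_to_pcm8`, the `scale` of 127, and the choice to carry a single error term along the interleaved L/R stream are all assumptions made for the example.

```python
def float_to_pcm8(samples, scale=127.0):
    """First-order error diffusion over an interleaved float sample stream.

    `samples` is a flat sequence (L, R, L, R, ...) of floats, nominally
    in [-1.0, 1.0]. `scale` maps full-scale float to PCM8 units and is
    an assumed value, not taken from the encoder.
    """
    out = []
    error = 0.0  # quantization error carried over from the previous sample
    for x in samples:
        dithered = x * scale + error
        # round to the nearest integer, then clamp to the signed 8-bit range
        q = max(-128, min(127, round(dithered)))
        error = dithered - q
        out.append(q)
    return out
```

With `scale=1.0`, a constant input of 0.25 produces one `1` in every four output samples, so the average level is preserved even though each individual sample is quantized to an integer.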
@@ -1632,18 +1626,7 @@ For 32768 samples with 14 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256
For 1024 samples with 9 levels: boundaries at 0, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024
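The boundary lists above follow a dyadic pattern: the first non-zero boundary is the chunk size shifted right by the number of levels, and each later boundary doubles. A minimal sketch that reproduces them is shown here; the function name `subband_boundaries` is hypothetical, and treating the first band as the coarsest (approximation) band is the usual DWT layout rather than something this excerpt states.

```python
def subband_boundaries(num_samples, levels):
    """Return coefficient index boundaries for a dyadic DWT layout.

    Band i covers indices [boundaries[i], boundaries[i + 1]).
    """
    if num_samples <= 0 or num_samples & (num_samples - 1):
        raise ValueError("num_samples must be a power of 2")
    if (num_samples >> levels) < 1:
        raise ValueError("too many levels for this chunk size")
    # num_samples >> levels is the size of the coarsest band;
    # each following boundary is twice the previous one.
    return [0] + [num_samples >> k for k in range(levels, -1, -1)]


print(subband_boundaries(1024, 9))
```

For 1024 samples and 9 levels this yields the same eleven boundaries listed above, from 0 up to 1024.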
### Step 4: Frequency-Dependent Quantization
-DWT coefficients are quantized using perceptually-tuned frequency-dependent weights:
-Base Weights by Level:
-Level 0 (16-8 KHz): 3.0
-Level 1 (8-4 KHz): 2.0
-Level 2 (4-2 KHz): 1.5
-Level 3 (2-1 KHz): 1.0
-Level 4 (1-0.5 KHz): 0.75
-Level 5 (0.5-0.25 KHz): 0.5
-Level 6-7 (DC-0.25 KHz): 0.25
-Quality scaling factor: 1.0 + (5 - quality) * 0.3
+DWT coefficients are quantized using perceptually-tuned frequency-dependent weights.
Final quantization step: base_weight * quality_scale
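To make the step formula concrete, the sketch below combines it with the base weights and quality scaling factor listed in this hunk. Judging by the hunk line counts, that table appears to be removed from the document by this commit, so the numbers may no longer match the encoder. The names `BASE_WEIGHTS`, `quality_scale`, and `quant_step` are hypothetical, and reusing the last weight for levels above 7 is an assumption.

```python
# Base weights for levels 0..7, copied from the table in this hunk
# (levels 6 and 7 share the same weight).
BASE_WEIGHTS = [3.0, 2.0, 1.5, 1.0, 0.75, 0.5, 0.25, 0.25]


def quality_scale(quality):
    """Quality scaling factor: 1.0 + (5 - quality) * 0.3."""
    return 1.0 + (5 - quality) * 0.3


def quant_step(level, quality):
    """Final quantization step: base_weight * quality_scale."""
    # Assumption: levels beyond the table reuse the last listed weight.
    weight = BASE_WEIGHTS[min(level, len(BASE_WEIGHTS) - 1)]
    return weight * quality_scale(quality)


for level in range(4):
    print(level, quant_step(level, quality=3))
```

At quality 5 the scale is 1.0, so each step equals its base weight; lower quality values enlarge every step by the same factor.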
@@ -1690,13 +1673,8 @@ Convert Mid/Side back to Left/Right stereo:
Left = Mid + Side
Right = Mid - Side
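The Mid/Side reconstruction above can be sketched as a pair of element-wise operations. The function name `ms_to_lr` is hypothetical, and no clamping or rounding is applied because none is described in this excerpt.

```python
def ms_to_lr(mid, side):
    """Convert Mid/Side channel sequences back to Left/Right."""
    if len(mid) != len(side):
        raise ValueError("mid and side must have the same length")
    left = [m + s for m, s in zip(mid, side)]   # Left = Mid + Side
    right = [m - s for m, s in zip(mid, side)]  # Right = Mid - Side
    return left, right
```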
-### Step 6: PCM8 to PCM16 Upsampling
-Convert signed PCM8 back to PCM16LE by multiplying by 256:
-pcm16_value = pcm8_value * 256
## Compression Performance
-- **Target Ratio**: 2:1 against PCMu8 (4:1 against PCM16LE input)
+- **Target Ratio**: 2:1 against PCMu8
- **Achieved Ratio**: 2.51:1 against PCMu8 at quality level 3
- **Quality**: Perceptually transparent at Q3+, preserves full 0-16 KHz bandwidth
- **Sparsity**: 86.9% zeros in Mid channel, 97.8% in Side channel (typical)
@@ -1721,10 +1699,10 @@ This allows TAV video files to embed TAD-compressed audio using packet type 0x24
TAD encoder uses two-pass FFmpeg extraction for optimal quality:
# Pass 1: Extract at original sample rate
-ffmpeg -i input.mp4 -f s16le -ac 2 temp.pcm
+ffmpeg -i input.mp4 -f f32le -ac 2 temp.pcm
# Pass 2: High-quality resample with SoXR and highpass filter
-ffmpeg -f s16le -ar {original_rate} -ac 2 -i temp.pcm \
+ffmpeg -f f32le -ar {original_rate} -ac 2 -i temp.pcm \
-ar 32000 -af "aresample=resampler=soxr:precision=28:cutoff=0.99,highpass=f=16" \
output.pcm
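For readers who want to inspect the extracted audio, the sketch below splits raw PCM32fLE stereo data into left and right channels. It assumes the file holds interleaved 32-bit little-endian floats (L, R, L, R, ...), as the Input Format line describes. The function name `read_f32le_stereo` is hypothetical and is not part of the TAD encoder.

```python
import struct


def read_f32le_stereo(data):
    """Split raw interleaved f32le stereo bytes into (left, right) lists."""
    frame_size = 8  # 2 channels * 4 bytes per float32 sample
    frame_count = len(data) // frame_size  # ignore a trailing partial frame
    flat = struct.unpack("<%df" % (frame_count * 2),
                         data[:frame_count * frame_size])
    return list(flat[0::2]), list(flat[1::2])


# Usage sketch: two stereo frames packed by hand instead of read from disk.
raw = struct.pack("<4f", 0.5, -0.5, 1.0, 0.0)
left, right = read_f32le_stereo(raw)
print(left, right)
```

In practice `data` would come from something like `open("output.pcm", "rb").read()`. At 32000 Hz stereo, one second of audio is 256000 bytes.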