tsvm/video_encoder/TAD_README.md

# TAD - TSVM Advanced Audio Codec

A perceptually-optimised wavelet-based audio codec designed for resource-constrained systems, featuring CDF 9/7 wavelets, EZBC sparse coding, and sophisticated perceptual quantisation.

## Overview

TAD (TSVM Advanced Audio) is a modern audio codec built on discrete wavelet transform (DWT) using Cohen-Daubechies-Feauveau (CDF) 9/7 biorthogonal wavelets. It combines perceptual quantisation, advanced entropy coding, and careful optimisation for resource-constrained systems.

### Key Advantages

- **Perceptual optimisation**: HVS-aware quantisation preserves audio quality where it matters
- **Efficient sparse coding**: EZBC encoding exploits coefficient sparsity (86.9% zeros in typical content)
- **Variable chunk sizes**: Supports any chunk size ≥1024 samples, including non-power-of-2
- **Stereo decorrelation**: Mid/Side encoding exploits stereo correlation for better compression
- **Hardware-friendly**: Designed for efficient decoding on resource-constrained platforms

## Features

### Compression Technology

- **CDF 9/7 Biorthogonal Wavelets**
  - 9-level fixed decomposition for all chunk sizes
  - Lifting scheme implementation for efficient computation
  - Optimal frequency discrimination for audio signals

- **Pre-processing**
  - First-order IIR pre-emphasis filter (α=0.5) shifts quantisation noise to lower frequencies, where they are less objectionable to listeners
  - Gamma companding (γ=0.5) for dynamic range compression before quantisation
  - Mid/Side stereo transformation exploits stereo correlation
  - Lambda companding (λ=6.0) with Laplacian CDF mapping for full bit utilisation

- **Perceptual Quantisation**
  - Channel-specific (Mid/Side) frequency-dependent weights
  - Subband-aware quantisation preserves perceptually important frequencies

- **EZBC Encoding**
  - Binary tree embedded zero block coding
  - Exploits coefficient sparsity (86.9% Mid, 97.8% Side typical)
  - Progressive refinement structure
  - Spatial clustering of non-zero coefficients

- **Entropy Coding**
  - Zstandard compression (level 7) on concatenated EZBC bitstreams
  - Cross-channel compression optimisation
  - Optional Zstd bypass for debugging

### Audio Format

- **Sample Rate**: 32 KHz (TSVM audio hardware native format)
- **Channels**: Stereo (L/R input, Mid/Side internal representation)
- **Chunk Sizes**: Variable, any size ≥1024 samples (including non-power-of-2)
- **Bit Depth**: 32-bit float internal, 8-bit unsigned PCM output with noise-shaped dithering
- **Bandwidth**: Full 0-16 KHz frequency range preserved

### Quality Levels

Six quality levels (0-5) provide a wide range of compression/quality trade-offs:
- **Level 0**: Lowest quality, smallest file size
- **Level 3**: Default, balanced quality/compression (2.51:1 vs PCMu8)
- **Level 5**: Highest quality, largest file size

Quality levels are designed to be synchronised with TAV video codec for unified encoding.

## Building

### Prerequisites

- C compiler (GCC/Clang)
- Zstandard library (libzstd)
- Math library (libm)

### Compilation

```bash
# Build TAD encoder/decoder
make tad

# Build all tools
make all

# Clean build artifacts
make clean
```

### Build Targets

- `encoder_tad` - Standalone audio encoder with FFmpeg calls
- `decoder_tad` - Standalone audio decoder

## Usage

### Basic Encoding

Encoding requires FFmpeg executable installed in your system.

```bash
# Default encoding (quality level 3)
./encoder_tad -i input.mp3 -o output.tad

# Specify quality level (0-5)
./encoder_tad -i input.m4a -o output.tad -q 0    # Lowest quality
./encoder_tad -i input.ogg -o output.tad -q 5    # Highest quality

# Disable Zstd compression (for debugging)
./encoder_tad -i input.opus -o output.tad --no-zstd

# Verbose output with statistics
./encoder_tad -i input.flac -o output.tad -v
```

### Decoding

```bash
# Decode to PCMu8
./decoder_tad -i input.tad -o output.pcm --raw-pcm

# Decode to WAV
./decoder_tad -i input.tad -o output.wav
```

### Input Formats

TAD encoder accepts any audio format supported by FFmpeg:
- Audio files: WAV, MP3, FLAC, OGG, AAC, etc.
- Video files with audio streams: MP4, MKV, AVI, etc.
- Raw PCM formats

Audio is automatically resampled to 32 KHz stereo if necessary.

## Technical Architecture

### Encoder Pipeline

1. **Input Processing**
   - FFmpeg demuxing and audio stream extraction
   - Resampling to 32 KHz stereo
   - Conversion to PCM32f

2. **Pre-emphasis Filter**
   - First-order IIR filter with α=0.5
   - Shifts quantisation noise toward lower frequencies
   - Improves perceptual quality

3. **Gamma Companding**
   - Dynamic range compression with γ=0.5
   - Applied independently to each sample
   - Reduces quantisation error for low-amplitude signals

4. **Stereo Decorrelation**
   - Left/Right to Mid/Side transformation
   - Mid = (L + R) / 2
   - Side = (L - R) / 2
   - Exploits stereo correlation for better compression

5. **9-Level CDF 9/7 DWT**
   - Fixed 9 decomposition levels for all chunk sizes
   - Forward lifting scheme implementation
   - Correct length tracking for non-power-of-2 sizes

6. **Perceptual Quantisation**
   - Channel-specific (Mid/Side) subband weights
   - Lambda companding with λ=6.0
   - Laplacian CDF mapping: `sign(x) * floor(λ * log(1 + |x|/λ))`
   - Quantised to int8 coefficients

7. **EZBC Encoding**
   - Binary tree structure per channel
   - Progressive refinement by bitplanes
   - Zero block coding exploits sparsity
   - Independent bitstreams for Mid and Side

8. **Zstd Compression**
   - Level 7 compression on concatenated `[Mid_bitstream][Side_bitstream]`
   - Cross-channel optimisation opportunities
   - Adaptive compression based on content

### Decoder Pipeline

1. **Container Parsing**
   - TAD packet identification (type 0x24)
   - Chunk size extraction
   - Compressed data boundaries

2. **Zstd Decompression**
   - Decompress concatenated bitstreams
   - Split into Mid and Side EZBC streams

3. **EZBC Decoding**
   - Binary tree decoder per channel
   - Reconstruct quantised int8 coefficients
   - Progressive refinement reconstruction

4. **Lambda Decompanding**
   - Inverse Laplacian CDF with channel-specific weights
   - Reconstruct float32 DWT coefficients
   - Apply subband-specific perceptual weights

5. **9-Level Inverse CDF 9/7 DWT**
   - Inverse lifting scheme implementation
   - Correct length tracking for non-power-of-2 chunk sizes
   - Pre-calculated length sequence from forward transform

6. **Mid/Side to Left/Right**
   - L = Mid + Side
   - R = Mid - Side
   - Reconstruct stereo channels

7. **Gamma Decompanding**
   - Inverse gamma with γ⁻¹=2.0
   - Restore original dynamic range

8. **De-emphasis Filter**
   - Reverse pre-emphasis with α=0.5
   - Remove frequency shaping
   - Restore flat frequency response

9. **PCM32f to PCM8u Conversion**
   - Noise-shaped dithering for 8-bit output
   - Clamping to valid range
   - Final output format

### Wavelet Implementation

CDF 9/7 wavelet follows a **two-stage lifting scheme**:

```c
// Forward Transform: Predict → Update
// Predict step (generate high-pass)
temp[half + i] = data[odd] - α * (data[even_left] + data[even_right]);

// Update step (generate low-pass)
temp[i] = data[even] + β * (temp[half + i - 1] + temp[half + i]);

// Normalization (K factor)
temp[i] *= K;
temp[half + i] /= K;

// Inverse Transform: Denormalize → Undo Update → Undo Predict (reversed order)
temp[i] /= K;
temp[half + i] *= K;

temp[i] -= β * (temp[half + i - 1] + temp[half + i]);
data[odd] = temp[half + i] + α * (temp[i] + temp[i + 1]);
data[even] = temp[i];
```

**CDF 9/7 Coefficients**:
- α = -1.586134342
- β = -0.052980118
- γ = +0.882911075
- δ = +0.443506852
- K = 1.230174105

### Non-Power-of-2 Chunk Size Handling

Critical implementation detail for variable chunk sizes:

```c
// Pre-calculate exact length sequence from forward transform
int lengths[MAX_LEVELS + 1];
lengths[0] = chunk_size;
for (int i = 1; i <= levels; i++) {
    lengths[i] = (lengths[i - 1] + 1) / 2;
}

// Apply inverse DWT using lengths[level] for each level
// NEVER use simple doubling (length *= 2) - incorrect for non-power-of-2!
```

Incorrect length tracking causes mirrored subband artefacts in decoded audio.

### Perceptual Quantisation Weights

Channel-specific weights for Mid (channel 0) and Side (channel 1):

```c
// Base quantiser weights per subband (9 levels + approximation)
float BASE_QUANTISER_WEIGHTS[2][10] = {
    // Mid channel (0)
    {4.0f, 2.0f, 1.8f, 1.6f, 1.4f, 1.2f, 1.0f, 1.0f, 1.3f, 2.0f},

    // Side channel (1)
    {6.0f, 5.0f, 2.6f, 2.4f, 1.8f, 1.3f, 1.0f, 1.0f, 1.6f, 3.2f}
};

// During dequantisation:
float weight = BASE_QUANTISER_WEIGHTS[channel][subband] * quantiser_scale;
coeffs[i] = normalised_val * TAD32_COEFF_SCALARS[subband] * weight;
```

Different weights for Mid and Side channels reflect perceptual importance of frequency bands in each channel. DC frequency has highest weight (4.0 Mid, 6.0 Side) due to energy concentration.

## Performance Characteristics

### Compression Efficiency

- **Target Compression**: 2:1 against PCMu8 baseline (4:1 against PCM16LE input)
- **Achieved Compression**: 2.51:1 against PCMu8 at quality level 3
- **Audio Quality**: Preserves full 0-16 KHz bandwidth
- **Coefficient Sparsity**: 86.9% zeros in Mid channel, 97.8% in Side channel (typical)
- **EZBC Benefits**: Exploits sparsity, progressive refinement, spatial clustering

### Computational Complexity

- **Encoding**: O(n log n) per chunk for DWT, O(n) for EZBC encoding
- **Decoding**: O(n log n) per chunk for inverse DWT, O(n) for EZBC decoding
- **Memory**: O(n) working memory for chunk processing

### Quality Characteristics

- **Frequency Response**: Flat 0-16 KHz within perceptual limits
- **Dynamic Range**: Preserved through gamma companding
- **Stereo Imaging**: Maintained through Mid/Side decorrelation
- **Perceptual Quality**: Optimised for human auditory system characteristics

## Integration with TAV

TAD is designed as an includable API for TAV video encoder integration:

- **Variable Chunk Sizes**: Audio chunks can match video GOP boundaries (e.g., 32016 samples for 1-second TAV GOP)
- **Unified Quality Levels**: TAD quality 0-5 synchronised with TAV quality 0-5
- **Embedded Packets**: TAV embeds TAD-compressed audio using packet type 0x24
- **Shared Container**: Single .tav file contains both video and audio streams

### TAV Integration Example

```c
// TAD handles non-power-of-2 chunk size correctly
tad_encode_chunk(audio_buffer, audio_samples_per_gop, output_buffer, &output_size);

// TAV embeds TAD packet
tav_write_packet(TAV_PACKET_AUDIO, output_buffer, output_size);
```

## Format Specification

For complete packet structure and bitstream format details, refer to `format documentation.txt`.

### Key Packet Types

- `0x24`: TAD audio packet (used in standalone .tad files and embedded in .tav files)

## Related Projects

- **TAV** (TSVM Advanced Video): Wavelet-based video codec with integrated TAD audio
- **TSVM**: Target virtual machine platform for TAD playback

## Licence

MIT.