TAV: experimental 3D DWT encoder

2026-06-07 22:14:03 +09:00 · 2025-10-15 16:04:27 +09:00
parent b40b2ff0a1
commit 7e248bc83d
5 changed files with 1398 additions and 16 deletions
--- a/CLAUDE.md
+++ b/CLAUDE.md
@@ -189,6 +189,9 @@ Peripheral memories can be accessed using `vm.peek()` and `vm.poke()` functions,
  ./encoder_tav -i input.mp4 -q 0 -o output.tav         # Lowest quality, smallest file
  ./encoder_tav -i input.mp4 -q 5 -o output.tav         # Highest quality, largest file
  # Temporal 3D DWT (GOP-based encoding)
  ./encoder_tav -i input.mp4 --temporal-dwt -q 2 -o output.tav
  # Playback
  playtav output.tav
  ```
@@ -264,3 +267,49 @@ Concatenated Maps Layout:
 - **Compression improvement**: 16.4% from significance maps + 1.6% from concatenated layout
 - **Real-world impact**: 559 bytes saved per frame (5.59 MB per 10k frames)
 - **Cross-channel benefit**: Concatenated maps allow Zstd to exploit similarity between significance patterns
 #### TAV Temporal 3D DWT (GOP Unified Encoding)
 Implemented on 2025-10-15 for improved temporal compression through group-of-pictures (GOP) encoding:
 **Key Features**:
 - **3D DWT**: Applies DWT in both spatial (2D) and temporal (1D) dimensions for optimal spacetime compression
 - **Unified GOP Preprocessing**: Single significance map for all frames and channels in a GOP (width×height×N_frames×3_channels)
 - **FFT-based Phase Correlation**: Uses FFTW3 library for accurate global motion estimation with quarter-pixel precision
 - **GOP Size**: Typically 16 frames (configurable), with scene change detection for adaptive GOPs
 - **Single-frame Fallback**: GOP size of 1 automatically uses traditional I-frame encoding
 **Packet Format**:
 - **0x12 (GOP_UNIFIED)**: `[gop_size][motion_vectors...][compressed_size][compressed_data]`
  - Motion vectors stored as int16_t in quarter-pixel units for all frames in GOP
  - Unified significance map for entire GOP block enables cross-frame compression
 - **0xFC (GOP_SYNC)**: `[frame_count]` - Indicates N frames were decoded from GOP block
 - **Timecode Emission**: One timecode packet per GOP (not per frame)
 **Technical Implementation**:
 ```c
 // Unified preprocessing structure (encoder_tav.c:2371-2509)
 [All_Y_maps][All_Co_maps][All_Cg_maps][All_Y_values][All_Co_values][All_Cg_values]
 // Where maps are grouped by channel across all GOP frames for optimal Zstd compression
 // Phase correlation using FFT (encoder_tav.c:1246-1383)
 // - FFTW3 forward FFT on grayscale frames
 // - Cross-power spectrum computation
 // - Inverse FFT gives correlation peak at (dx, dy)
 // - Parabolic interpolation for quarter-pixel refinement
 ```
 **Usage**:
 ```bash
 # Enable temporal 3D DWT
 ./encoder_tav -i input.mp4 --temporal-dwt -q 2 -o output.tav
 # Inspect GOP structure
 ./tav_inspector output.tav -v
 ```
 **Compression Benefits**:
 - **Temporal Coherence**: Exploits similarity across consecutive frames
 - **Unified Compression**: Zstd compresses entire GOP as single block, finding patterns across time
 - **Motion Compensation**: FFT-based phase correlation provides accurate global motion estimation
 - **Adaptive GOPs**: Scene change detection ensures optimal GOP boundaries
--- a/terranmon.txt
+++ b/terranmon.txt
@@ -955,6 +955,7 @@ transmission capability, and region-of-interest coding.
    <video packets>
    0x10: I-frame (intra-coded frame)
    0x11: P-frame (delta/skip frame)
    0x12: GOP Unified (temporal 3D DWT with unified preprocessing)
    0x1F: (prohibited)
    0x20: MP2 audio packet
    0x30: Subtitle in "Simple" format
@@ -980,6 +981,7 @@ transmission capability, and region-of-interest coding.
    0xEF: TAV Extended Header
    0xF0: Loop point start (insert right AFTER the TC packet; no payload)
    0xF1: Loop point end (insert right AFTER the TC packet; no payload)
    0xFC: GOP Sync packet (indicates N frames decoded from GOP block)
    0xFD: Timecode (TC) Packet [for frame 0, insert at the beginning; otherwise, insert right AFTER the sync]
    0xFE: NTSC sync packet (used by player to calculate exact framerate-wise performance; no payload)
    0xFF: Sync packet (no payload)
@@ -991,7 +993,7 @@ transmission capability, and region-of-interest coding.
        2. Standard metadata payloads (if any)
        Frame group:
-        1. TC Packet (0xFD) or File packet (0x1F) [mutually exclusive!]
+        1. TC Packet (0xFD) or Next TAV File (0x1F) [mutually exclusive!]
        2. Loop point packets
        3. Audio packets
        4. Subtitle packets
@@ -1045,11 +1047,58 @@ transmission capability, and region-of-interest coding.
    uint8  Packet Type (0xFE)
    uint64 Time since stream start in nanoseconds (this may NOT start from zero if the video is coming from a livestream)
-## Video Packet Structure
+## Video Packet Structure (0x10, 0x11)
    uint8  Packet Type
    uint32 Compressed Size
    *      Zstd-compressed Block Data
 ## GOP Unified Packet Structure (0x12)
 Implemented on 2025-10-15 for temporal 3D DWT with unified preprocessing.
 This packet contains multiple frames encoded as a single spacetime block for optimal
 temporal compression.
    uint8  Packet Type (0x12)
    uint8  GOP Size (number of frames in this GOP, typically 16)
    int16  Motion Vectors X[GOP Size] (quarter-pixel precision for global motion compensation)
    int16  Motion Vectors Y[GOP Size] (quarter-pixel precision for global motion compensation)
    uint32 Compressed Size
    *      Zstd-compressed Unified Block Data
 ### Unified Block Data Format
 The entire GOP (width×height×N_frames×3_channels) is preprocessed as a single block:
    uint8  Y Significance Maps[(width*height + 7) / 8 * GOP Size]     // All Y frames concatenated
    uint8  Co Significance Maps[(width*height + 7) / 8 * GOP Size]    // All Co frames concatenated
    uint8  Cg Significance Maps[(width*height + 7) / 8 * GOP Size]    // All Cg frames concatenated
    int16  Y Non-zero Values[variable length]                          // All Y non-zero coefficients
    int16  Co Non-zero Values[variable length]                         // All Co non-zero coefficients
    int16  Cg Non-zero Values[variable length]                         // All Cg non-zero coefficients
 This layout enables Zstd to find patterns across both spatial and temporal dimensions,
 resulting in superior compression compared to per-frame encoding.
 ### Motion Vectors
 - Stored in quarter-pixel units (divide by 4.0 for pixel displacement)
 - Used for global motion compensation (camera movement, scene translation)
 - Computed using FFT-based phase correlation for accurate frame alignment
 - First frame (frame 0) typically has motion vector (0, 0)
 ### Temporal 3D DWT Process
 1. Apply 1D DWT across temporal axis (GOP frames)
 2. Apply 2D DWT on each spatial slice of temporal subbands
 3. Perceptual quantization with temporal-spatial awareness
 4. Unified significance map preprocessing across all frames/channels
 5. Single Zstd compression of entire GOP block
 ## GOP Sync Packet Structure (0xFC)
 Indicates that N frames were decoded from a GOP Unified block.
 Decoders must track this to maintain proper frame count and synchronization.
    uint8  Packet Type (0xFC)
    uint8  Frame Count (number of frames that were decoded from preceding GOP block)
 Note: GOP Sync packets have no payload size field (fixed 2-byte packet).
 ## Block Data (per frame)
    uint8  Mode: encoding mode
            0x00 = SKIP (just use frame data from previous frame)
--- a/video_encoder/Makefile
+++ b/video_encoder/Makefile
@@ -18,7 +18,7 @@ tev: encoder_tev.c
 tav: encoder_tav.c
 	rm -f encoder_tav
-	$(CC) $(CFLAGS) -o encoder_tav $< $(LIBS)
+	$(CC) $(CFLAGS) -o encoder_tav $< $(LIBS) -lfftw3f
 tav_decoder: decoder_tav.c
 	rm -f decoder_tav
--- a/video_encoder/encoder_tav.c
+++ b/video_encoder/encoder_tav.c
--- a/video_encoder/tav_inspector.c
+++ b/video_encoder/tav_inspector.c
@@ -16,6 +16,7 @@
 // Packet type constants
 #define TAV_PACKET_IFRAME         0x10
 #define TAV_PACKET_PFRAME         0x11
 #define TAV_PACKET_GOP_UNIFIED    0x12  // Unified 3D DWT GOP (all frames in single block)
 #define TAV_PACKET_AUDIO_MP2      0x20
 #define TAV_PACKET_SUBTITLE       0x30
 #define TAV_PACKET_SUBTITLE_KAR   0x31
@@ -43,6 +44,7 @@
 #define TAV_PACKET_EXTENDED_HDR   0xEF
 #define TAV_PACKET_LOOP_START     0xF0
 #define TAV_PACKET_LOOP_END       0xF1
 #define TAV_PACKET_GOP_SYNC       0xFC  // GOP sync packet (N frames decoded)
 #define TAV_PACKET_TIMECODE       0xFD
 #define TAV_PACKET_SYNC_NTSC      0xFE
 #define TAV_PACKET_SYNC           0xFF
@@ -55,6 +57,9 @@ typedef struct {
    int pframe_intra_count;
    int pframe_delta_count;
    int pframe_skip_count;
    int gop_unified_count;
    int gop_sync_count;
    int total_gop_frames;
    int audio_count;
    int subtitle_count;
    int timecode_count;
@@ -87,6 +92,7 @@ const char* get_packet_type_name(uint8_t type) {
    switch (type) {
        case TAV_PACKET_IFRAME: return "I-FRAME";
        case TAV_PACKET_PFRAME: return "P-FRAME";
        case TAV_PACKET_GOP_UNIFIED: return "GOP (3D DWT Unified)";
        case TAV_PACKET_AUDIO_MP2: return "AUDIO MP2";
        case TAV_PACKET_SUBTITLE: return "SUBTITLE (Simple)";
        case TAV_PACKET_SUBTITLE_KAR: return "SUBTITLE (Karaoke)";
@@ -98,6 +104,7 @@ const char* get_packet_type_name(uint8_t type) {
        case TAV_PACKET_EXTENDED_HDR: return "EXTENDED HEADER";
        case TAV_PACKET_LOOP_START: return "LOOP START";
        case TAV_PACKET_LOOP_END: return "LOOP END";
        case TAV_PACKET_GOP_SYNC: return "GOP SYNC";
        case TAV_PACKET_TIMECODE: return "TIMECODE";
        case TAV_PACKET_SYNC_NTSC: return "SYNC (NTSC)";
        case TAV_PACKET_SYNC: return "SYNC";
@@ -114,6 +121,7 @@ int should_display_packet(uint8_t type, display_options_t *opts) {
    if (opts->show_all) return 1;
    if (opts->show_video && (type == TAV_PACKET_IFRAME || type == TAV_PACKET_PFRAME ||
        type == TAV_PACKET_GOP_UNIFIED || type == TAV_PACKET_GOP_SYNC ||
        (type >= 0x70 && type <= 0x7F))) return 1;
    if (opts->show_audio && type == TAV_PACKET_AUDIO_MP2) return 1;
    if (opts->show_subtitles && (type == TAV_PACKET_SUBTITLE || type == TAV_PACKET_SUBTITLE_KAR)) return 1;
@@ -449,6 +457,64 @@ int main(int argc, char *argv[]) {
                break;
            }
            case TAV_PACKET_GOP_UNIFIED: {
                // Unified GOP packet: [gop_size][motion_vectors...][compressed_size][data]
                uint8_t gop_size;
                if (fread(&gop_size, 1, 1, fp) != 1) break;
                // Read all motion vectors
                int16_t *motion_x = malloc(gop_size * sizeof(int16_t));
                int16_t *motion_y = malloc(gop_size * sizeof(int16_t));
                for (int i = 0; i < gop_size; i++) {
                    if (fread(&motion_x[i], sizeof(int16_t), 1, fp) != 1) break;
                    if (fread(&motion_y[i], sizeof(int16_t), 1, fp) != 1) break;
                }
                // Read compressed data size
                uint32_t size;
                if (fread(&size, sizeof(uint32_t), 1, fp) != 1) {
                    free(motion_x);
                    free(motion_y);
                    break;
                }
                stats.total_video_bytes += size;
                stats.gop_unified_count++;
                stats.total_gop_frames += gop_size;
                if (!opts.summary_only && display) {
                    printf(" - GOP size=%u, data size=%u bytes (%.2f bytes/frame)",
                           gop_size, size, (double)size / gop_size);
                    // Always show motion vectors for GOP packets
                    if (gop_size > 0) {
                        printf("\n    Motion vectors (quarter-pixel):");
                        for (int i = 0; i < gop_size; i++) {
                            printf("\n      Frame %d: (%.2f, %.2f) px",
                                   i, motion_x[i] / 4.0, motion_y[i] / 4.0);
                        }
                    }
                }
                free(motion_x);
                free(motion_y);
                fseek(fp, size, SEEK_CUR);
                break;
            }
            case TAV_PACKET_GOP_SYNC: {
                // GOP sync packet: [frame_count]
                uint8_t frame_count;
                if (fread(&frame_count, 1, 1, fp) != 1) break;
                stats.gop_sync_count++;
                if (!opts.summary_only && display) {
                    printf(" - %u frames decoded from GOP block", frame_count);
                }
                break;
            }
            case TAV_PACKET_IFRAME:
            case TAV_PACKET_PFRAME:
            case TAV_PACKET_VIDEO_CH2_I:
@@ -610,6 +676,12 @@ int main(int argc, char *argv[]) {
        printf(")");
    }
    printf("\n");
    if (stats.gop_unified_count > 0) {
        printf("  3D GOP packets:     %d (total frames: %d, avg %.1f frames/GOP)\n",
               stats.gop_unified_count, stats.total_gop_frames,
               (double)stats.total_gop_frames / stats.gop_unified_count);
        printf("  GOP sync packets:   %d\n", stats.gop_sync_count);
    }
    printf("  Mux video:          %d\n", stats.mux_video_count);
    printf("  Total video bytes:  %llu (%.2f MB)\n",
           (unsigned long long)stats.total_video_bytes,