momentry_core/docs/VIDEO_PROCESSING_SPEC.md


Video Parsing Behavior Specification

| Item | Value |
|---|---|
| Author | Warren |
| Created | 2026-03-16 |
| Document version | V1.0 |

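Version History

| Version | Date | Purpose | Operator | Tool/Model |
|---|---|---|---|---|
| V1.0 | 2026-03-16 | Document created | Warren | OpenCode / MiniMax M2.5 |

This document defines the complete behavior specification for video parsing in the Momentry Core system. It covers triggering, status reporting, output, checkpoint and resume, multilingual handling, and the labeling conventions for each recognizer.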


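1. Video File Trigger Specification

1.1 Supported Video Formats

| Format | Extension | Notes |
|---|---|---|
| MP4 | .mp4 | Most widely supported |
| MOV | .mov | QuickTime format |
| AVI | .avi | Legacy format |
| MKV | .mkv | Matroska format |
| WebM | .webm | Web format |
| WMV | .wmv | Windows Media |
| FLV | .flv | Flash format |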
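
The extension check implied by the table above could be sketched as follows. This is a minimal illustration, not the project's implementation: the names `SUPPORTED_EXTENSIONS` and `has_supported_extension` are hypothetical, and the case-insensitive comparison is an assumption the specification does not state.

```rust
use std::path::Path;

/// Extensions from the table above, lowercase and without the dot.
/// (Hypothetical constant, introduced for illustration only.)
const SUPPORTED_EXTENSIONS: [&str; 7] = ["mp4", "mov", "avi", "mkv", "webm", "wmv", "flv"];

/// Returns true when the path ends in one of the supported extensions.
/// Assumption: the comparison ignores ASCII case.
pub fn has_supported_extension(path: &str) -> bool {
    Path::new(path)
        .extension()
        .and_then(|ext| ext.to_str())
        .map(|ext| SUPPORTED_EXTENSIONS.contains(&ext.to_ascii_lowercase().as_str()))
        .unwrap_or(false)
}
```

A path without an extension, or with an extension outside the table, is rejected before any more expensive check runs.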

1.2 Trigger Methods

1.2.1 Command-Line Registration

cargo run -- register /path/to/video.mp4

1.2.2 Automatic Trigger from a Watched Directory

# monitor_config.yaml
watch:
  directories:
    - /path/to/watch
  recursive: true
  extensions: [".mp4", ".mov", ".avi", ".mkv", ".webm"]

1.2.3 API Trigger

# POST /api/v1/register
curl -X POST http://localhost:3002/api/v1/register \
  -H "Content-Type: application/json" \
  -d '{"path": "/path/to/video.mp4", "auto_process": true}'

1.3 Pre-Trigger Validation

pub fn validate_video_file(path: &str) -> Result<VideoValidation> {
    // 1. Check that the file exists
    // 2. Check the file extension
    // 3. Check that the file size is > 0
    // 4. Check that it is a valid video file (magic number)

    // Returned structure
    Ok(VideoValidation {
        path: path.to_string(),
        valid: true,
        codec: "h264".to_string(),
        has_video: true,
        has_audio: true,
    })
}
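
Step 4 above (the magic-number check) might look like the sketch below. The function name `sniff_container` and its return labels are hypothetical. The byte signatures are the well-known container signatures, but the list is not exhaustive: some older QuickTime files, for example, do not carry an `ftyp` box and would not match.

```rust
/// Guesses the container family from the first bytes of a file.
/// Returns None when no known signature matches.
/// (Hypothetical helper; the labels are illustrative.)
pub fn sniff_container(header: &[u8]) -> Option<&'static str> {
    // ISO base media files (MP4, most MOV) carry "ftyp" at byte offset 4.
    if header.len() >= 8 && &header[4..8] == b"ftyp" {
        return Some("mp4/mov");
    }
    // Matroska and WebM both start with the EBML magic 1A 45 DF A3.
    if header.starts_with(&[0x1A, 0x45, 0xDF, 0xA3]) {
        return Some("mkv/webm");
    }
    // AVI is a RIFF file whose form type at offset 8 is "AVI ".
    if header.len() >= 12 && header.starts_with(b"RIFF") && &header[8..12] == b"AVI " {
        return Some("avi");
    }
    // FLV files start with the ASCII bytes "FLV".
    if header.starts_with(b"FLV") {
        return Some("flv");
    }
    // ASF (the container used by WMV) starts with its header object GUID;
    // only the first four bytes are compared here.
    if header.starts_with(&[0x30, 0x26, 0xB2, 0x75]) {
        return Some("wmv/asf");
    }
    None
}
```

Reading the first 12 bytes of the file is enough for every signature checked here.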

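1.4 Video UUID Generation

UUID = MD5(file path)[0:16], that is, the first 16 hexadecimal characters of the MD5 digest of the file path.

Example: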
"/media/videos/clip.mp4" → "3a2f1b9c4d5e6f0a"

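The truncation step of the UUID rule in 1.4 can be sketched as below. The sketch deliberately leaves the MD5 digest itself to whatever hashing library the project uses (the Rust standard library has no MD5), so `uuid_from_md5_hex` is a hypothetical helper that takes the 32-character hex digest as input. Lowercasing the result is an assumption.

```rust
/// Derives the 16-character video UUID from a full MD5 hex digest.
/// Returns None unless the input is exactly 32 hexadecimal characters.
/// (Hypothetical helper; digest computation is assumed to happen elsewhere.)
pub fn uuid_from_md5_hex(digest_hex: &str) -> Option<String> {
    let is_hex = digest_hex.chars().all(|c| c.is_ascii_hexdigit());
    if digest_hex.len() != 32 || !is_hex {
        return None;
    }
    // Keep the first 16 characters; lowercase is an assumption made here
    // so that file names derived from the UUID stay stable.
    Some(digest_hex[..16].to_ascii_lowercase())
}
```

For instance, the MD5 digest of the string "abc" is 900150983cd24fb0d6963f7d28e17f72, which this helper shortens to 900150983cd24fb0.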
2. Video Processing Status Display Specification

2.1 Processing Status Definitions

#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ProcessStatus {
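    Pending,        // Waiting to be processed
    Registered,     // Registered
    Probing,        // Probing
    AsrProcessing,  // ASR in progress
    AsrxProcessing, // Speaker diarization in progress
    OcrProcessing,  // OCR in progress
    YoloProcessing, // YOLO in progress
    FaceProcessing, // Face detection in progress
    PoseProcessing, // Pose estimation in progress
    Chunking,       // Chunking in progress
    Completed,      // Completed
    Failed,         // Failed
    Paused,         // Paused
    Resuming,       // Resuming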
}

2.2 Status Output Format

2.2.1 Standard Output (stdout)

[REGISTER] Video registered: 1636719dc31f78ac
[PROBE] Starting probe for video.mp4
[PROBE] Duration: 120.5s, FPS: 24/1, Resolution: 1920x1080
[ASR_START] Loading Whisper model...
[ASR_LANGUAGE:en] Language detected: English (99.45%)
[ASR_PROGRESS:50] Processed 50 segments...
[ASR_PROGRESS:100] Processed 100 segments...
[ASR_COMPLETE:150] Completed! Total: 150 segments
[ASRX_START] Loading pyannote model...
[ASRX_PROGRESS] Speaker diarization: 3 speakers identified
[ASRX_COMPLETE] Speaker diarization complete
[OCR_START] Starting OCR processing...
[OCR_PROGRESS:30/60] Frame 30/60 processed
[OCR_COMPLETE] OCR complete: 25 text regions found
[YOLO_START] Starting YOLO processing...
[YOLO_PROGRESS:60/120] Frame 60/120 processed
[YOLO_COMPLETE] YOLO complete: 189 objects detected
[FACE_START] Starting face detection...
[FACE_PROGRESS:60/120] Frame 60/120 processed
[FACE_COMPLETE] Face detection complete: 5 unique faces
[POSE_START] Starting pose estimation...
[POSE_PROGRESS:60/120] Frame 60/120 processed
[POSE_COMPLETE] Pose estimation complete: 12 persons detected
[CHUNK_START] Creating chunks...
[CHUNK_COMPLETE] 450 chunks created
[COMPLETE] Video processing complete!

2.2.2 Status Message Prefixes

| Stage | Prefix | Example |
|---|---|---|
| Register | [REGISTER] | [REGISTER] Video registered: 1636719dc31f78ac |
| Probe | [PROBE] | [PROBE] Duration: 120.5s |
| ASR | [ASR_*] | [ASR_START], [ASR_PROGRESS:50] |
| ASRx | [ASRX_*] | [ASRX_START], [ASRX_COMPLETE] |
| OCR | [OCR_*] | [OCR_START], [OCR_PROGRESS:30/60] |
| YOLO | [YOLO_*] | [YOLO_START], [YOLO_COMPLETE] |
| Face | [FACE_*] | [FACE_START], [FACE_PROGRESS:60/120] |
| Pose | [POSE_*] | [POSE_START], [POSE_COMPLETE] |
| Chunk | [CHUNK_*] | [CHUNK_START], [CHUNK_COMPLETE] |
| Complete | [COMPLETE] | [COMPLETE] Video processing complete! |
| Error | [ERROR] | [ERROR] ASR processing failed |
| Warning | [WARN] | [WARN] No audio track detected |
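
A consumer of these stdout lines could split each line into its tag, optional argument, and message as in the sketch below. `StatusLine` and `parse_status_line` are hypothetical names; the only format assumed is the `[TAG]` or `[TAG:ARG]` prefix shown in the table and in 2.2.1.

```rust
/// A parsed status line such as "[ASR_PROGRESS:50] Processed 50 segments...".
/// (Hypothetical type, introduced for illustration.)
#[derive(Debug, PartialEq)]
pub struct StatusLine {
    pub tag: String,         // e.g. "ASR_PROGRESS"
    pub arg: Option<String>, // e.g. "50" or "30/60"; None when absent
    pub message: String,     // text after the closing bracket
}

/// Parses one line; returns None when the line has no "[...]" prefix.
pub fn parse_status_line(line: &str) -> Option<StatusLine> {
    let rest = line.strip_prefix('[')?;
    let (inside, message) = rest.split_once(']')?;
    // The part after the first ':' (if any) is the progress argument.
    let (tag, arg) = match inside.split_once(':') {
        Some((tag, arg)) => (tag, Some(arg.to_string())),
        None => (inside, None),
    };
    if tag.is_empty() {
        return None;
    }
    Some(StatusLine {
        tag: tag.to_string(),
        arg,
        message: message.trim_start().to_string(),
    })
}
```

Lines that do not start with a bracketed prefix are ignored rather than treated as errors, which keeps the parser tolerant of unrelated log output.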

2.3 Real-Time Progress Reporting

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ProcessProgress {
    pub uuid: String,
    pub status: ProcessStatus,
    pub current_processor: String,
    pub total_frames: i64,
    pub processed_frames: i64,
    pub progress_percentage: f64,
    pub elapsed_seconds: f64,
    pub estimated_remaining_seconds: f64,
    pub last_checkpoint: Option<Checkpoint>,
}

Example output

{
  "uuid": "1636719dc31f78ac",
  "status": "asr_processing",
  "current_processor": "asr",
  "total_frames": 3000,
  "processed_frames": 1500,
  "progress_percentage": 50.0,
  "elapsed_seconds": 120.5,
  "estimated_remaining_seconds": 120.5,
  "last_checkpoint": {
    "timestamp": 60.0,
    "segments_processed": 50,
    "output_file": "1636719dc31f78ac.asr.partial.json"
  }
}
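
The `progress_percentage` and `estimated_remaining_seconds` fields in the example above are consistent with a simple constant-rate estimate, sketched below. The specification does not say how the estimate is computed, so both the function name `estimate_progress` and the constant-rate assumption are illustrative.

```rust
/// Returns (progress_percentage, estimated_remaining_seconds).
/// Assumption: frames are processed at a constant rate.
/// Returns None when the inputs cannot yield a meaningful estimate.
pub fn estimate_progress(
    total_frames: i64,
    processed_frames: i64,
    elapsed_seconds: f64,
) -> Option<(f64, f64)> {
    if total_frames <= 0 || processed_frames <= 0 || processed_frames > total_frames {
        return None;
    }
    let fraction = processed_frames as f64 / total_frames as f64;
    let seconds_per_frame = elapsed_seconds / processed_frames as f64;
    let remaining_frames = (total_frames - processed_frames) as f64;
    Some((fraction * 100.0, remaining_frames * seconds_per_frame))
}
```

With the example values (3000 total frames, 1500 processed, 120.5 seconds elapsed) this gives 50% and roughly 120.5 seconds remaining.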

3. Video Processing Output Specification

3.1 Output File Naming

{UUID}.{processor type}.json

Example:
1636719dc31f78ac.probe.json    # Probe result
1636719dc31f78ac.asr.json      # ASR result
1636719dc31f78ac.asrx.json     # Speaker diarization result
1636719dc31f78ac.ocr.json      # OCR result
1636719dc31f78ac.yolo.json     # YOLO result
1636719dc31f78ac.face.json     # Face detection result
1636719dc31f78ac.pose.json     # Pose estimation result
1636719dc31f78ac.chunks.json   # Chunking result

3.2 Output Directory Structure

momentry_core/
├── output/
│   ├── {uuid}/
│   │   ├── {uuid}.probe.json
│   │   ├── {uuid}.asr.json
│   │   ├── {uuid}.asrx.json
│   │   ├── {uuid}.ocr.json
│   │   ├── {uuid}.yolo.json
│   │   ├── {uuid}.face.json
│   │   ├── {uuid}.pose.json
│   │   └── thumbnails/
│   │       ├── thumb_000.jpg
│   │       ├── thumb_001.jpg
│   │       └── ...
│   └── checkpoints/
│       └── {uuid}/
│           ├── {uuid}.asr.partial.001.json
│           ├── {uuid}.asr.partial.002.json
│           └── ...

3.3 Complete Processing Result JSON Structure

{
  "uuid": "1636719dc31f78ac",
  "video_path": "/path/to/video.mp4",
  "video_info": {
    "duration": 120.5,
    "fps": "24/1",
    "fps_value": 24.0,
    "width": 1920,
    "height": 1080,
    "has_video": true,
    "has_audio": true,
    "has_music": false,
    "audio_codec": "aac",
    "video_codec": "h264"
  },
  "processing": {
    "status": "completed",
    "started_at": "2026-03-16T10:00:00Z",
    "completed_at": "2026-03-16T10:05:00Z",
    "elapsed_seconds": 300.0,
    "processors": {
      "asr": {
        "status": "completed",
        "language": "en",
        "language_probability": 0.9945,
        "segments_count": 150,
        "duration_seconds": 120.0
      },
      "asrx": {
        "status": "completed",
        "speakers_count": 3,
        "segments_count": 150,
        "duration_seconds": 60.0
      },
      "ocr": {
        "status": "completed",
        "text_regions_count": 25,
        "duration_seconds": 45.0
      },
      "yolo": {
        "status": "completed",
        "objects_count": 189,
        "unique_classes": ["person", "car", "dog"],
        "duration_seconds": 30.0
      },
      "face": {
        "status": "completed",
        "unique_faces_count": 5,
        "duration_seconds": 30.0
      },
      "pose": {
        "status": "completed",
        "persons_count": 12,
        "duration_seconds": 15.0
      }
    }
  },
  "asr": {
    "language": "en",
    "language_probability": 0.9945855736732483,
    "segments": [...]
  },
  "asrx": {
    "language": "en",
    "segments": [...]
  },
  "ocr": {
    "segments": [...]
  },
  "yolo": {
    "segments": [...]
  },
  "face": {
    "segments": [...]
  },
  "pose": {
    "segments": [...]
  }
}
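
In `video_info` above, `fps` is a rational string ("24/1") and `fps_value` is its decimal form (24.0). One way to derive the second from the first is sketched here; `parse_fps` is a hypothetical helper name.

```rust
/// Parses a rational frame-rate string such as "24/1" or "30000/1001".
/// Returns None for malformed input or a zero denominator.
/// (Hypothetical helper, introduced for illustration.)
pub fn parse_fps(fps: &str) -> Option<f64> {
    let (num, den) = fps.split_once('/')?;
    let num: f64 = num.trim().parse().ok()?;
    let den: f64 = den.trim().parse().ok()?;
    if den == 0.0 {
        return None;
    }
    Some(num / den)
}
```

Keeping the rational form alongside the decimal avoids rounding drift for rates such as 30000/1001.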

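4. In-Process Periodic Output Specification (Checkpoint)

4.1 Purpose of Periodic Output

  • Prevent an abnormal interruption from losing all processed data
  • Provide a resume point so processing can continue later
  • Allow a configurable output frequency (every N seconds or every N frames)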

4.2 Configuration Parameters

processing:
  checkpoint:
    enabled: true
    interval_seconds: 60      # Write a checkpoint every 60 seconds
    interval_frames: 1500    # Or every 1500 frames (set one of the two)
    output_dir: "checkpoints"
    keep_partial: true       # Keep partially completed files
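
Because `interval_seconds` and `interval_frames` are alternatives, the trigger decision can be modeled as an enum with two variants, as in this sketch. `CheckpointInterval` and `checkpoint_due` are hypothetical names, and the sketch does not decide whether the seconds are wall-clock or media time, since the specification leaves that open.

```rust
/// Checkpoint trigger: either a time interval or a frame interval.
/// (Hypothetical type mirroring the two alternative config keys.)
#[derive(Debug, Clone, Copy)]
pub enum CheckpointInterval {
    Seconds(f64),
    Frames(i64),
}

/// Decides whether a new checkpoint is due.
/// Both counters are measured since the previous checkpoint.
pub fn checkpoint_due(
    interval: CheckpointInterval,
    since_last_seconds: f64,
    since_last_frames: i64,
) -> bool {
    match interval {
        // Non-positive intervals never trigger, which avoids writing
        // a checkpoint on every iteration after a bad config value.
        CheckpointInterval::Seconds(s) => s > 0.0 && since_last_seconds >= s,
        CheckpointInterval::Frames(n) => n > 0 && since_last_frames >= n,
    }
}
```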

4.3 Checkpoint Structure

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Checkpoint {
    pub uuid: String,
    pub processor: String,
    pub checkpoint_id: String,
    pub timestamp: f64,
    pub frame_number: i64,
    pub total_frames: i64,
    pub progress_percentage: f64,
    pub partial_data: serde_json::Value,
    pub created_at: DateTime<Utc>,
}

4.4 Checkpoint File Naming

{UUID}.{processor type}.partial.{sequence number}.json

Example:
1636719dc31f78ac.asr.partial.001.json   # 1st checkpoint
1636719dc31f78ac.asr.partial.002.json   # 2nd checkpoint
1636719dc31f78ac.asr.partial.003.json   # 3rd checkpoint
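
The naming rule above can be produced and parsed with a few string helpers, sketched below. All three function names are hypothetical. The three-digit zero padding is taken from the examples, and the sequence number is compared numerically so that ordering stays correct even if it ever exceeds three digits.

```rust
/// Builds "{uuid}.{processor}.partial.{seq}.json" with a 3-digit sequence.
pub fn checkpoint_file_name(uuid: &str, processor: &str, seq: u32) -> String {
    format!("{uuid}.{processor}.partial.{seq:03}.json")
}

/// Extracts the sequence number from a checkpoint file name, if it matches.
pub fn checkpoint_seq(file_name: &str, uuid: &str, processor: &str) -> Option<u32> {
    let prefix = format!("{uuid}.{processor}.partial.");
    file_name
        .strip_prefix(prefix.as_str())?
        .strip_suffix(".json")?
        .parse()
        .ok()
}

/// Picks the checkpoint file with the highest sequence number.
/// Names that do not follow the pattern are skipped.
pub fn latest_checkpoint<'a>(names: &[&'a str], uuid: &str, processor: &str) -> Option<&'a str> {
    names
        .iter()
        .filter_map(|name| checkpoint_seq(name, uuid, processor).map(|seq| (seq, *name)))
        .max_by_key(|(seq, _)| *seq)
        .map(|(_, name)| name)
}
```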

4.5 Checkpoint Output Example (ASR)

{
  "uuid": "1636719dc31f78ac",
  "processor": "asr",
  "checkpoint_id": "partial_001",
  "timestamp": 60.0,
  "frame_number": 1440,
  "total_frames": 3000,
  "progress_percentage": 48.0,
  "partial_data": {
    "language": "en",
    "language_probability": 0.9945,
    "segments": [
      {"start": 0.0, "end": 5.0, "text": "Hello world"},
      {"start": 5.0, "end": 10.0, "text": "This is a test"},
      ...
    ]
  },
  "created_at": "2026-03-16T10:01:00Z"
}

4.6 Checkpoint Merge Logic

pub fn merge_checkpoints(checkpoints: Vec<Checkpoint>) -> serde_json::Value {
    // Sort by checkpoint_id (string order, so it relies on zero-padded sequence numbers)
    let mut sorted = checkpoints;
    sorted.sort_by(|a, b| a.checkpoint_id.cmp(&b.checkpoint_id));

    // Merge segments
    let mut merged_segments: Vec<serde_json::Value> = vec![];
    for checkpoint in sorted {
        if let Some(segments) = checkpoint.partial_data.get("segments") {
            if let Some(seg_array) = segments.as_array() {
                merged_segments.extend(seg_array.clone());
            }
        }
    }
    
    serde_json::json!({
        "segments": merged_segments
    })
}

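5. Interruption and Resume Specification

5.1 Supported Interruption Types

| Interruption type | Description | Handling |
|---|---|---|
| Process crash | The processing program exits abnormally | Resume from the last checkpoint |
| System shutdown | The system shuts down unexpectedly | Resume from the last checkpoint |
| Insufficient resources | OOM or insufficient disk space | Retry after resources are freed |
| User pause | The user pauses processing | Show the Paused status |
| Network interruption | A remote resource is unavailable | Retry the connection, then continue |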

5.2 Resume Flow

1. Detect the interruption
   │
   ▼
2. Find the latest checkpoint
   │
   ▼
3. Load the partial data
   │
   ▼
4. Verify data integrity
   │
   ▼
5. Continue processing from the checkpoint
   │
   ▼
6. Write the complete result

5.3 Resume Status Detection

use std::path::Path;

pub async fn check_resume_status(uuid: &str) -> Result<ResumeStatus> {
    // 1. Find all checkpoint files
    let checkpoints = find_checkpoints(uuid)?;

    // 2. Find the most recent progress
    let last_checkpoint = checkpoints.last();

    // 3. Check whether the main output file exists
    let main_output_exists = Path::new(&format!("{}.asr.json", uuid)).exists();

    // 4. Determine which processors can be resumed
    let can_resume = !checkpoints.is_empty() && !main_output_exists;
    let resumable = ResumeStatus {
        can_resume,
        last_checkpoint: last_checkpoint.cloned(),
        processed_processors: detect_processed_processors(uuid),
        remaining_processors: detect_remaining_processors(uuid),
        // Required by the struct in 5.4; the two values are illustrative
        suggested_action: (if can_resume { "resume" } else { "process" }).to_string(),
    };

    Ok(resumable)
}

5.4 ResumeStatus Structure

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ResumeStatus {
    pub can_resume: bool,
    pub last_checkpoint: Option<Checkpoint>,
    pub processed_processors: Vec<String>,
    pub remaining_processors: Vec<String>,
    pub suggested_action: String,
}

5.5 Resume Commands

# Detect and resume automatically
cargo run -- resume /path/to/video.mp4

# Force a restart from the beginning
cargo run -- process /path/to/video.mp4 --force

# Show the processing status
cargo run -- status 1636719dc31f78ac

# List the resumable checkpoints
cargo run -- checkpoints 1636719dc31f78ac

5.6 Conflict Handling

pub fn resolve_conflict(
    partial: &serde_json::Value,
    main: &serde_json::Value,
    strategy: ConflictStrategy,
) -> serde_json::Value {
    match strategy {
        ConflictStrategy::KeepMain => main.clone(),
        ConflictStrategy::KeepPartial => partial.clone(),
        ConflictStrategy::Merge => {
            // Merge the segments and remove duplicates
            merge_segments(partial, main)
        }
    }
}
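
The `Merge` strategy calls `merge_segments`, which this document does not define. The sketch below shows one possible shape of that step on plain structs rather than `serde_json::Value`, to stay free of external crates. `Segment` and `merge_segment_lists` are hypothetical stand-ins, and the duplicate rule (start and end equal within 1 ms) is an assumption, not part of the specification.

```rust
/// Minimal stand-in for an ASR segment (hypothetical type).
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Merges two segment lists, sorts by start time, and drops duplicates.
/// Assumption: two segments are duplicates when both start and end
/// match within 1 ms. On a duplicate, the copy from `main` is kept.
pub fn merge_segment_lists(partial: &[Segment], main: &[Segment]) -> Vec<Segment> {
    // `main` is chained first; the stable sort keeps it ahead of an
    // equal-keyed segment from `partial`, so dedup retains the main copy.
    let mut all: Vec<Segment> = main.iter().chain(partial.iter()).cloned().collect();
    all.sort_by(|a, b| a.start.total_cmp(&b.start).then(a.end.total_cmp(&b.end)));
    all.dedup_by(|later, earlier| {
        (later.start - earlier.start).abs() < 1e-3 && (later.end - earlier.end).abs() < 1e-3
    });
    all
}
```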

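6. Multilingual Labeling Specification

6.1 Language Code Standard

Use ISO 639-1 two-letter language codes where one exists. Cantonese (yue) and Min Nan (nan) have no ISO 639-1 code, so their three-letter ISO 639-3 codes are used instead.

| Language | Code | Name (English) |
|---|---|---|
| English | en | English |
| Mandarin Chinese | zh | Chinese (Mandarin) |
| Cantonese | yue | Cantonese |
| Min Nan | nan | Min Nan |
| Japanese | ja | Japanese |
| Korean | ko | Korean |
| Spanish | es | Spanish |
| French | fr | French |
| German | de | German |
| Italian | it | Italian |
| Russian | ru | Russian |
| Arabic | ar | Arabic |
| Hindi | hi | Hindi |
| Portuguese | pt | Portuguese |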

6.2 Multilingual Detection Result

{
  "language": "multi",
  "languages_detected": [
    {
      "code": "en",
      "name": "English",
      "probability": 0.75,
      "segments_count": 100
    },
    {
      "code": "zh",
      "name": "Chinese",
      "probability": 0.20,
      "segments_count": 30
    },
    {
      "code": "ja",
      "name": "Japanese",
      "probability": 0.05,
      "segments_count": 5
    }
  ],
  "primary_language": "en",
  "segments": [
    {
      "start": 0.0,
      "end": 5.0,
      "text": "Hello world",
      "language": "en",
      "language_probability": 0.99
    },
    {
      "start": 5.0,
      "end": 10.0,
      "text": "你好世界",
      "language": "zh",
      "language_probability": 0.98
    }
  ]
}
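
The `segments_count` values and `primary_language` in the result above could be derived from the per-segment labels as in this sketch. The specification does not state how the primary language is chosen; picking the language with the most segments is an assumption, and `summarize_languages` is a hypothetical name.

```rust
use std::collections::BTreeMap;

/// Counts segments per language code and returns (counts, primary language).
/// Assumption: the primary language is the one with the most segments;
/// ties resolve to the code that sorts first.
pub fn summarize_languages(
    segment_languages: &[&str],
) -> (BTreeMap<String, usize>, Option<String>) {
    let mut counts: BTreeMap<String, usize> = BTreeMap::new();
    for code in segment_languages {
        *counts.entry(code.to_string()).or_insert(0) += 1;
    }
    // BTreeMap iterates in key order, so a strict ">" keeps the first
    // code among equally frequent ones.
    let mut best: Option<(&String, usize)> = None;
    for (code, &n) in &counts {
        if best.map_or(true, |(_, top)| n > top) {
            best = Some((code, n));
        }
    }
    let primary = best.map(|(code, _)| code.clone());
    (counts, primary)
}
```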

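6.3 Segment-Level Language Labeling

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AsrSegment {
    pub start: f64,
    pub end: f64,
    pub text: String,
    pub language: Option<String>,           // Language of this segment
    pub language_probability: Option<f64>,  // Language probability of this segment
    pub has_audio: Option<bool>,            // Used by the results in 7.2 and 7.3
}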

6.4 Database Storage

CREATE TABLE asr_segments (
    id BIGSERIAL PRIMARY KEY,
    uuid VARCHAR(16) NOT NULL,
    segment_index INTEGER NOT NULL,
    start_time DOUBLE PRECISION NOT NULL,
    end_time DOUBLE PRECISION NOT NULL,
    text TEXT NOT NULL,
    language VARCHAR(10),
    language_probability DOUBLE PRECISION,
    created_at TIMESTAMP DEFAULT NOW(),
    UNIQUE(uuid, segment_index)
);

CREATE INDEX idx_asr_language ON asr_segments(language);

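7. Labeling Specification for Undetected Languages

7.1 Undetected States

| State | Code | Description |
|---|---|---|
| Unknown | unknown | The language cannot be determined |
| Uncertain | uncertain | Language detection confidence is too low |
| NoSpeech | no_speech | No speech content |
| Silent | silent | Entirely silent |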

7.2 Undetected Result Structure

{
  "language": "unknown",
  "language_probability": null,
  "language_detection_failed": true,
  "failure_reason": "low_confidence",
  "min_confidence_threshold": 0.7,
  "detected_language": "en",
  "detected_probability": 0.45,
  "segments": [
    {
      "start": 0.0,
      "end": 5.0,
      "text": "",
      "language": "no_speech",
      "language_probability": 0.99,
      "has_audio": false
    }
  ]
}

7.3 Handling Strategy

pub fn handle_undetected_language(
    audio_path: &str,
    result: &AsrResult,
) -> AsrResult {
    // 1. Check whether the audio is silent (threshold of -40 dB, see 7.4)
    let silence = detect_silence(audio_path, -40.0);

    // 2. If silent, label the result and every segment as "silent"
    if silence.is_silent {
        return AsrResult {
            language: Some("silent".to_string()),
            language_probability: Some(1.0),
            segments: result.segments.iter().map(|s| AsrSegment {
                language: Some("silent".to_string()),
                has_audio: Some(false),
                ..s.clone()
            }).collect(),
        };
    }

    // 3. If the language confidence is below 0.7, label as "uncertain"
    if result.language_probability.unwrap_or(0.0) < 0.7 {
        return AsrResult {
            language: Some("uncertain".to_string()),
            language_probability: result.language_probability,
            segments: result.segments.iter().map(|s| AsrSegment {
                language: Some("uncertain".to_string()),
                language_probability: Some(result.language_probability.unwrap_or(0.0)),
                ..s.clone()
            }).collect(),
        };
    }

    result.clone()
}

7.4 Silence Detection

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SilenceDetection {
    pub is_silent: bool,
    pub silence_ratio: f64,      // Silence ratio (0.0 - 1.0)
    pub audio_level_db: f64,     // Average audio level (dB)
    pub threshold_db: f64,       // Threshold (-40 dB)
}

pub fn detect_silence(audio_path: &str, threshold_db: f64) -> SilenceDetection {
    // Analyze the audio level with ffmpeg
    todo!("ffmpeg-based level analysis is not specified in this document")
}
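
The document delegates level analysis to ffmpeg. For readers who want to see what a comparison against `threshold_db` amounts to, the sketch below computes the RMS level of already-decoded samples in dBFS. It is a stand-in for illustration, not the ffmpeg-based method: `rms_level_db` and `is_below_threshold` are hypothetical names, and the input is assumed to be normalized to the range -1.0 to 1.0.

```rust
/// Computes the RMS level of normalized samples (-1.0..=1.0) in dBFS.
/// Returns negative infinity for an empty or all-zero buffer.
pub fn rms_level_db(samples: &[f32]) -> f64 {
    if samples.is_empty() {
        return f64::NEG_INFINITY;
    }
    let sum_squares: f64 = samples.iter().map(|&s| (s as f64) * (s as f64)).sum();
    let mean_square = sum_squares / samples.len() as f64;
    if mean_square == 0.0 {
        return f64::NEG_INFINITY;
    }
    // 10 * log10(mean square) equals 20 * log10(RMS).
    10.0 * mean_square.log10()
}

/// True when the buffer's RMS level is below the threshold (e.g. -40.0 dB).
pub fn is_below_threshold(samples: &[f32], threshold_db: f64) -> bool {
    rms_level_db(samples) < threshold_db
}
```

A constant amplitude of 0.1 sits near -20 dBFS and 0.001 near -60 dBFS, so only the latter falls below a -40 dB threshold.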

8. Music Labeling Specification

8.1 Music Detection Result

{
  "has_music": true,
  "music_segments": [
    {
      "start": 0.0,
      "end": 30.0,
      "type": "background_music",
      "confidence": 0.95,
      "genre": "classical",
      "tempo": 120,
      "has_lyrics": false
    },
    {
      "start": 60.0,
      "end": 90.0,
      "type": "song_with_vocals",
      "confidence": 0.88,
      "artist": "Unknown",
      "title": "Unknown",
      "has_lyrics": true
    }
  ],
  "audio_classification": {
    "speech": 0.30,
    "music": 0.60,
    "ambient": 0.10
  }
}

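8.2 Music Type Classification

| Type | Description |
|---|---|
| background_music | Background music |
| song_with_vocals | Song with lyrics |
| instrumental | Instrumental music |
| sound_effect | Sound effect |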
8.3 Structure Definition

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct MusicSegment {
    pub start: f64,
    pub end: f64,
    #[serde(rename = "type")]
    pub music_type: String,       // Music type (serialized as "type", see 8.1)
    pub confidence: f64,          // Detection confidence
    pub genre: Option<String>,    // Genre (optional)
    pub tempo: Option<f64>,       // BPM (optional)
    pub has_lyrics: bool,         // Whether lyrics are present
    pub artist: Option<String>,   // Artist (optional)
    pub title: Option<String>,    // Title (optional)
}

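9. No-Audio Labeling Specification

9.1 No-Audio Definitions

| State | Description |
|---|---|
| no_audio_track | The video has no audio track |
| all_silent | An audio track exists but is entirely silent |
| audio_error | The audio track could not be read |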

9.2 No-Audio Result

{
  "has_audio": false,
  "audio_status": "no_audio_track",
  "audio_info": {
    "has_audio_track": false,
    "error_message": null,
    "audio_codec": null,
    "sample_rate": null,
    "channels": null,
    "duration": null
  },
  "asr": {
    "language": "no_speech",
    "language_probability": 1.0,
    "segments": [],
    "segments_count": 0,
    "total_speech_duration": 0.0,
    "speech_ratio": 0.0
  }
}

9.3 Processing Flow

pub async fn process_video_no_audio(uuid: &str, video_path: &str) -> Result<ProcessingResult> {
    // 1. Probe the video
    let probe = probe_video(video_path).await?;

    // 2. Determine why there is no audio
    let audio_status = if !probe.has_audio_stream {
        "no_audio_track"
    } else if probe.audio_is_silent {
        "all_silent"
    } else {
        "audio_error"
    };

    // 3. Build the result
    Ok(ProcessingResult {
        has_audio: false,
        audio_status: audio_status.to_string(),
        asr: AsrResult {
            language: Some("no_speech".to_string()),
            language_probability: Some(1.0),
            segments: vec![],
        },
        ..Default::default()
    })
}

10. Frame Object Detection Labeling Specification (YOLO)

10.1 YOLO Detection Result Structure

{
  "model": "yolov8x",
  "model_version": "8.0",
  "segments": [
    {
      "start": 0.0,
      "end": 1.0,
      "frame_number": 0,
      "objects": [
        {
          "class": "person",
          "class_id": 0,
          "confidence": 0.92,
          "box": {
            "x1": 150,
            "y1": 200,
            "x2": 400,
            "y2": 800
          },
          "tracking_id": "person_001"
        },
        {
          "class": "car",
          "class_id": 2,
          "confidence": 0.87,
          "box": {
            "x1": 800,
            "y1": 400,
            "x2": 1200,
            "y2": 700
          },
          "tracking_id": "car_001"
        }
      ]
    }
  ],
  "statistics": {
    "total_objects": 189,
    "unique_classes": ["person", "car", "dog", "bicycle"],
    "class_counts": {
      "person": 120,
      "car": 45,
      "dog": 15,
      "bicycle": 9
    }
  }
}

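10.2 Supported Classes (COCO)

| Class ID | Class name |
|---|---|
| 0 | person |
| 1 | bicycle |
| 2 | car |
| 3 | motorcycle |
| 4 | airplane |
| 5 | bus |
| 6 | train |
| 7 | truck |
| 8 | boat |
| 9 | traffic light |
| ... | ... |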

10.3 Structure Definitions

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct YoloResult {
    pub model: String,
    pub model_version: String,
    pub segments: Vec<YoloSegment>,
    pub statistics: YoloStatistics,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct YoloSegment {
    pub start: f64,
    pub end: f64,
    pub frame_number: i64,
    pub objects: Vec<YoloObject>,
}

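#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct YoloObject {
    pub class: String,
    pub class_id: i32,
    pub confidence: f64,
    // `box` is a reserved keyword in Rust, so the field is renamed for serde
    #[serde(rename = "box")]
    pub bbox: BoundingBox,
    pub tracking_id: Option<String>,
}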

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BoundingBox {
    pub x1: i32,
    pub y1: i32,
    pub x2: i32,
    pub y2: i32,
}

10.4 Processing Configuration

processing:
  yolo:
    model: "yolov8x"
    confidence_threshold: 0.25
    iou_threshold: 0.45
    max_det: 300
    device: "cuda"
    batch_size: 8
    skip_frames: 1  # Process one frame every N frames
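
The `iou_threshold` setting refers to intersection over union between two boxes, which detectors commonly use to suppress overlapping detections. A minimal sketch of the measure on the corner-format boxes from 10.3 follows. The local `BoundingBox` copy omits the serde derives so the block stands alone, and how the pipeline applies the threshold is not specified here.

```rust
/// Corner-format box, mirroring the fields of BoundingBox in 10.3.
#[derive(Debug, Clone, Copy)]
pub struct BoundingBox {
    pub x1: i32,
    pub y1: i32,
    pub x2: i32,
    pub y2: i32,
}

/// Intersection over union of two boxes, in the range 0.0..=1.0.
/// Degenerate boxes (zero or negative size) contribute zero area.
pub fn iou(a: &BoundingBox, b: &BoundingBox) -> f64 {
    let area = |r: &BoundingBox| {
        ((r.x2 - r.x1).max(0) as f64) * ((r.y2 - r.y1).max(0) as f64)
    };
    // Width and height of the overlap rectangle, clamped at zero.
    let inter_w = (a.x2.min(b.x2) - a.x1.max(b.x1)).max(0) as f64;
    let inter_h = (a.y2.min(b.y2) - a.y1.max(b.y1)).max(0) as f64;
    let intersection = inter_w * inter_h;
    let union = area(a) + area(b) - intersection;
    if union <= 0.0 {
        0.0
    } else {
        intersection / union
    }
}
```

The person and car boxes from the example in 10.1 do not overlap horizontally, so their IoU is 0 and neither would suppress the other.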

11. Frame Text Recognition Labeling Specification (OCR)

11.1 OCR Detection Result Structure

{
  "model": "easyocr",
  "model_version": "1.7",
  "language": ["en"],
  "segments": [
    {
      "start": 0.0,
      "end": 1.0,
      "frame_number": 0,
      "texts": [
        {
          "text": "EXAMPLE TEXT",
          "text_normalized": "example text",
          "boxes": [
            {
              "x1": 100,
              "y1": 50,
              "x2": 400,
              "y2": 100
            }
          ],
          "confidence": 0.95,
          "language": "en"
        },
        {
          "text": "SUBTITLE HERE",
          "text_normalized": "subtitle here",
          "boxes": [
            {
              "x1": 200,
              "y1": 900,
              "x2": 1720,
              "y2": 1000
            }
          ],
          "confidence": 0.88,
          "language": "en"
        }
      ]
    }
  ],
  "statistics": {
    "total_text_regions": 25,
    "unique_texts": 18,
    "languages_detected": ["en"]
  }
}

11.2 Structure Definitions

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrResult {
    pub model: String,
    pub model_version: String,
    pub language: Vec<String>,
    pub segments: Vec<OcrSegment>,
    pub statistics: OcrStatistics,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrSegment {
    pub start: f64,
    pub end: f64,
    pub frame_number: i64,
    pub texts: Vec<OcrText>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrText {
    pub text: String,
    pub text_normalized: String,
    pub boxes: Vec<BoundingBox>,
    pub confidence: f64,
    pub language: Option<String>,
}

11.3 Text Type Classification

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum TextType {
    Subtitle,     // Subtitles
    Title,        // Title
    Sign,         // Sign or signboard
    Caption,      // Explanatory caption
    Watermark,    // Watermark
    SceneText,    // Scene text
    Unknown,      // Unknown
}

11.4 Processing Configuration

processing:
  ocr:
    model: "easyocr"
    languages: ["en", "zh", "ja"]
    confidence_threshold: 0.5
    text_detection: true
    text_recognition: true
    batch_size: 16
    skip_frames: 30  # Process one frame every 30 frames (subtitles usually stay on screen longer)
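
Read as "process one frame every N frames", `skip_frames` selects frames 0, N, 2N, and so on. The sketch below lists the selected frame numbers and converts a frame number to a timestamp. Both function names are hypothetical, and starting the sampling at frame 0 is an assumption.

```rust
/// Frame numbers that are processed when one frame in every
/// `skip_frames` is sampled. Assumption: sampling starts at frame 0.
pub fn sampled_frames(total_frames: i64, skip_frames: i64) -> Vec<i64> {
    if total_frames <= 0 || skip_frames <= 0 {
        return Vec::new();
    }
    (0..total_frames).step_by(skip_frames as usize).collect()
}

/// Converts a frame number to seconds; None for a non-positive fps.
pub fn frame_timestamp(frame_number: i64, fps: f64) -> Option<f64> {
    if fps > 0.0 {
        Some(frame_number as f64 / fps)
    } else {
        None
    }
}
```

At 24 fps, `skip_frames: 30` samples one frame every 1.25 seconds.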

12. Frame Face Recognition Labeling Specification (Face)

12.1 Face Detection Result Structure

{
  "model": "retinaface",
  "model_version": "1.0",
  "segments": [
    {
      "start": 0.0,
      "end": 1.0,
      "frame_number": 0,
      "faces": [
        {
          "face_id": "face_001",
          "box": {
            "x1": 100,
            "y1": 50,
            "x2": 300,
            "y2": 350
          },
          "embedding": [0.123, -0.456, ...],
          "embedding_dim": 512,
          "emotion": {
            "dominant": "happy",
            "scores": {
              "happy": 0.75,
              "neutral": 0.20,
              "sad": 0.03,
              "angry": 0.02
            }
          },
          "age": 35,
          "gender": "female",
          "confidence": 0.95
        }
      ]
    }
  ],
  "statistics": {
    "total_faces": 50,
    "unique_faces": 5,
    "face_tracks": [
      {
        "face_id": "face_001",
        "duration": 120.5,
        "appearances": 3000,
        "first_seen": 0.0,
        "last_seen": 120.5
      }
    ]
  }
}

12.2 Structure Definitions

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FaceResult {
    pub model: String,
    pub model_version: String,
    pub segments: Vec<FaceSegment>,
    pub statistics: FaceStatistics,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FaceSegment {
    pub start: f64,
    pub end: f64,
    pub frame_number: i64,
    pub faces: Vec<Face>,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Face {
    pub face_id: String,
    // `box` is a reserved keyword in Rust, so the field is renamed for serde
    #[serde(rename = "box")]
    pub bbox: BoundingBox,
    pub embedding: Option<Vec<f32>>,
    pub embedding_dim: Option<i32>,
    pub emotion: Option<Emotion>,
    pub age: Option<i32>,
    pub gender: Option<String>,
    pub confidence: f64,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Emotion {
    pub dominant: String,
    pub scores: std::collections::HashMap<String, f64>,
}

12.3 Face Tracking

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FaceTrack {
    pub face_id: String,
    pub duration: f64,
    pub appearances: i32,
    pub first_seen: f64,
    pub last_seen: f64,
    pub frames: Vec<FaceFrame>,
    pub embedding_average: Vec<f32>,
}

12.4 Processing Configuration

processing:
  face:
    model: "retinaface"
    recognition_model: "arcface"
    detection_threshold: 0.5
    recognition_threshold: 0.6
    track_faces: true
    detect_emotion: true
    detect_age: true
    detect_gender: true
    skip_frames: 1
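
Face embeddings such as the 512-dimensional vectors in 12.1 are typically compared with cosine similarity. The sketch below assumes that `recognition_threshold` is a lower bound on that similarity. The document does not confirm the metric, so treat both the interpretation and the names `cosine_similarity` and `same_identity` as hypothetical.

```rust
/// Cosine similarity of two embeddings.
/// Returns None when the lengths differ, the input is empty,
/// or either vector has zero norm.
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f64> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let (mut dot, mut norm_a, mut norm_b) = (0.0f64, 0.0f64, 0.0f64);
    for (&x, &y) in a.iter().zip(b) {
        let (x, y) = (x as f64, y as f64);
        dot += x * y;
        norm_a += x * x;
        norm_b += y * y;
    }
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a.sqrt() * norm_b.sqrt()))
}

/// Assumption: two faces share an identity when their similarity
/// reaches the configured threshold (0.6 in the config above).
pub fn same_identity(a: &[f32], b: &[f32], recognition_threshold: f64) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= recognition_threshold)
}
```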

13. Complete Processing Flow Diagram

┌─────────────────────────────────────────────────────────────────┐
│                        VIDEO PROCESSING                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐  │
│  │ REGISTER │───▶│  PROBE   │───▶│   ASR    │───▶│   ASRx   │  │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘  │
│       │              │               │               │         │
│       ▼              ▼               ▼               ▼         │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐     │
│  │ UUID    │    │ Video   │    │ Language│    │Speaker  │     │
│  │ Generate│    │ Info    │    │ Detect  │    │Diariz.  │     │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘     │
│                                                     │         │
│                                                     ▼         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐ │
│  │   OCR    │───▶│  YOLO    │───▶│   FACE   │───▶│   POSE   │ │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘ │
│       │              │               │               │         │
│       ▼              ▼               ▼               ▼         │
│  ┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐    │
│  │ Text    │    │ Object  │    │ Face    │    │Pose     │    │
│  │ Detect  │    │ Detect  │    │ Track   │    │Estimate │    │
│  └─────────┘    └─────────┘    └─────────┘    └─────────┘    │
│                                                     │         │
│                                                     ▼         │
│                                            ┌──────────────┐   │
│                                            │    CHUNK     │   │
│                                            └──────────────┘   │
│                                                     │         │
│                                                     ▼         │
│                                            ┌──────────────┐   │
│                                            │   COMPLETE   │   │
│                                            └──────────────┘   │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐ │
│  │              CHECKPOINT (written every minute)             │ │
│  │  {uuid}.asr.partial.001.json                              │ │
│  │  {uuid}.asrx.partial.001.json                             │ │
│  │  ...                                                      │ │
│  └──────────────────────────────────────────────────────────┘ │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘

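14. Processor Status Matrix

| Processor | Input | Output | Required | Optional parameters |
|---|---|---|---|---|
| Probe | Video file | probe.json | | - |
| ASR | Video / audio track | asr.json | | model, language |
| ASRx | asr.json | asrx.json | | min_speakers, max_speakers |
| OCR | Video frames | ocr.json | | languages, threshold |
| YOLO | Video frames | yolo.json | | model, confidence |
| Face | Video frames | face.json | | recognition, track |
| Pose | Video frames | pose.json | | model, tracking |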

15. Error Handling

15.1 Error Types

#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ProcessError {
    FileNotFound,
    InvalidFormat,
    NoAudioTrack,
    NoVideoTrack,
    ProcessingFailed { processor: String, message: String },
    OutOfMemory,
    DiskFull,
    Timeout,
    Cancelled,
}

15.2 Error Response

{
  "error": "processing_failed",
  "processor": "asr",
  "message": "Failed to load Whisper model",
  "timestamp": "2026-03-16T10:00:00Z",
  "retryable": false,
  "suggestion": "Check GPU availability and model files"
}

16. Version History

| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-16 | Initial version |

17. Related Documents