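Video Parsing Behavior Specification
| Item | Value |
|---|---|
| Author | Warren |
| Created | 2026-03-16 |
| Document version | V1.0 |
Version History
| Version | Date | Purpose | Author | Tool / Model |
|---|---|---|---|---|
| V1.0 | 2026-03-16 | Document created | Warren | OpenCode / MiniMax M2.5 |
This document defines the complete behavior specification for video parsing in the Momentry Core system. It covers triggering, status reporting, output, checkpoint-based resumption, multilingual handling, and the labels produced by each recognizer.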
1. Video File Trigger Specification
1.1 Supported Video Formats
| Format | Extension | Notes |
|---|---|---|
| MP4 | .mp4 | Most widely supported |
| MOV | .mov | QuickTime format |
| AVI | .avi | Legacy format |
| MKV | .mkv | Matroska format |
| WebM | .webm | Web format |
| WMV | .wmv | Windows Media |
| FLV | .flv | Flash format |
1.2 Trigger Methods
1.2.1 Command-Line Registration
cargo run -- register /path/to/video.mp4
1.2.2 Automatic Trigger via Watched Directory
# monitor_config.yaml
watch:
  directories:
    - /path/to/watch
  recursive: true
  extensions: [".mp4", ".mov", ".avi", ".mkv", ".webm"]
1.2.3 API Trigger
# POST /api/v1/register
curl -X POST http://localhost:3002/api/v1/register \
-H "Content-Type: application/json" \
-d '{"path": "/path/to/video.mp4", "auto_process": true}'
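1.3 Pre-Trigger Validation
pub fn validate_video_file(path: &str) -> Result<VideoValidation> {
    // 1. Check that the file exists
    // 2. Check the file extension
    // 3. Check that the file size is > 0
    // 4. Check that the file is a valid video (magic number)
    // Return the validation result (placeholder values shown)
    Ok(VideoValidation {
        path: path.to_string(),
        valid: true,
        codec: "h264".to_string(),
        has_video: true,
        has_audio: true,
    })
}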
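Step 4 above (the magic-number check) can be sketched as follows. This is a minimal illustration, not the project's implementation: the `Container` enum and `sniff_container` function are hypothetical names, and the check only covers the well-known leading signatures of the formats in 1.1. Older QuickTime files that do not start with an `ftyp` box would need extra handling.

```rust
/// Container families this sketch can recognise from a file's leading bytes.
/// (Hypothetical type, introduced only for illustration.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Container {
    Mp4OrMov, // ISO base media file: "ftyp" box type at offset 4
    Matroska, // MKV and WebM share the EBML header
    Avi,      // RIFF container with the "AVI " form type
    Flv,      // "FLV" signature
    Asf,      // WMV is stored in an ASF container
}

/// Inspect the first bytes of a file and guess the container, or return None.
pub fn sniff_container(head: &[u8]) -> Option<Container> {
    const EBML: [u8; 4] = [0x1A, 0x45, 0xDF, 0xA3];
    const ASF: [u8; 4] = [0x30, 0x26, 0xB2, 0x75];

    if head.len() >= 12 && &head[0..4] == b"RIFF" && &head[8..12] == b"AVI " {
        return Some(Container::Avi);
    }
    if head.len() >= 8 && &head[4..8] == b"ftyp" {
        return Some(Container::Mp4OrMov);
    }
    if head.starts_with(&EBML) {
        return Some(Container::Matroska);
    }
    if head.starts_with(&ASF) {
        return Some(Container::Asf);
    }
    if head.starts_with(b"FLV") {
        return Some(Container::Flv);
    }
    None
}
```

A caller would read the first dozen bytes of the file and reject it when the function returns `None` or when the result disagrees with the file extension.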
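1.4 Video UUID Generation
UUID = MD5(file path)[0:16], that is, the first 16 hexadecimal characters of the MD5 digest of the file path
Example: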
"/media/videos/clip.mp4" → "3a2f1b9c4d5e6f0a"
2. Video Processing Status Display Specification
2.1 Processing Status Definitions
#[derive(Debug, Clone, Copy, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ProcessStatus {
    Pending,        // Waiting to be processed
    Registered,     // Registered
    Probing,        // Probing
    AsrProcessing,  // ASR in progress
    AsrxProcessing, // Speaker diarization in progress
    OcrProcessing,  // OCR in progress
    YoloProcessing, // YOLO in progress
    FaceProcessing, // Face detection in progress
    PoseProcessing, // Pose estimation in progress
    Chunking,       // Chunking in progress
    Completed,      // Completed
    Failed,         // Failed
    Paused,         // Paused
    Resuming,       // Resuming
}
2.2 Status Output Format
2.2.1 Standard Output (stdout)
[REGISTER] Video registered: 1636719dc31f78ac
[PROBE] Starting probe for video.mp4
[PROBE] Duration: 120.5s, FPS: 24/1, Resolution: 1920x1080
[ASR_START] Loading Whisper model...
[ASR_LANGUAGE:en] Language detected: English (99.45%)
[ASR_PROGRESS:50] Processed 50 segments...
[ASR_PROGRESS:100] Processed 100 segments...
[ASR_COMPLETE:150] Completed! Total: 150 segments
[ASRX_START] Loading pyannote model...
[ASRX_PROGRESS] Speaker diarization: 3 speakers identified
[ASRX_COMPLETE] Speaker diarization complete
[OCR_START] Starting OCR processing...
[OCR_PROGRESS:30/60] Frame 30/60 processed
[OCR_COMPLETE] OCR complete: 25 text regions found
[YOLO_START] Starting YOLO processing...
[YOLO_PROGRESS:60/120] Frame 60/120 processed
[YOLO_COMPLETE] YOLO complete: 189 objects detected
[FACE_START] Starting face detection...
[FACE_PROGRESS:60/120] Frame 60/120 processed
[FACE_COMPLETE] Face detection complete: 5 unique faces
[POSE_START] Starting pose estimation...
[POSE_PROGRESS:60/120] Frame 60/120 processed
[POSE_COMPLETE] Pose estimation complete: 12 persons detected
[CHUNK_START] Creating chunks...
[CHUNK_COMPLETE] 450 chunks created
[COMPLETE] Video processing complete!
2.2.2 Status Message Prefixes
| Stage | Prefix | Example |
|---|---|---|
| Register | [REGISTER] | [REGISTER] Video registered: 1636719dc31f78ac |
| Probe | [PROBE] | [PROBE] Duration: 120.5s |
| ASR | [ASR_*] | [ASR_START], [ASR_PROGRESS:50] |
| ASRx | [ASRX_*] | [ASRX_START], [ASRX_COMPLETE] |
| OCR | [OCR_*] | [OCR_START], [OCR_PROGRESS:30/60] |
| YOLO | [YOLO_*] | [YOLO_START], [YOLO_COMPLETE] |
| Face | [FACE_*] | [FACE_START], [FACE_PROGRESS:60/120] |
| Pose | [POSE_*] | [POSE_START], [POSE_COMPLETE] |
| Chunk | [CHUNK_*] | [CHUNK_START], [CHUNK_COMPLETE] |
| Complete | [COMPLETE] | [COMPLETE] Video processing complete! |
| Error | [ERROR] | [ERROR] ASR processing failed |
| Warning | [WARN] | [WARN] No audio track detected |
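A consumer of these log lines can split each one into its tag, optional argument, and message. The sketch below assumes the `[TAG:ARG] message` shape seen in 2.2.1; `StatusLine`, `parse_status_line`, and `parse_fraction` are hypothetical names introduced only for illustration.

```rust
/// One parsed status line, e.g. "[ASR_PROGRESS:50] Processed 50 segments...".
#[derive(Debug, PartialEq, Eq)]
pub struct StatusLine<'a> {
    pub tag: &'a str,         // e.g. "ASR_PROGRESS"
    pub arg: Option<&'a str>, // e.g. "50", "30/60", "en"
    pub message: &'a str,     // text after the closing bracket
}

/// Split a line of the form "[TAG] message" or "[TAG:ARG] message".
pub fn parse_status_line(line: &str) -> Option<StatusLine<'_>> {
    let rest = line.strip_prefix('[')?;
    let (inside, message) = rest.split_once(']')?;
    let (tag, arg) = match inside.split_once(':') {
        Some((t, a)) => (t, Some(a)),
        None => (inside, None),
    };
    if tag.is_empty() {
        return None;
    }
    Some(StatusLine { tag, arg, message: message.trim_start() })
}

/// Turn a progress argument such as "30/60" into (done, total).
pub fn parse_fraction(arg: &str) -> Option<(u64, u64)> {
    let (done, total) = arg.split_once('/')?;
    Some((done.parse().ok()?, total.parse().ok()?))
}
```

Lines that do not start with a bracketed prefix return `None`, so free-form output from the underlying tools can be skipped rather than misparsed.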
2.3 Real-Time Progress Reporting
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ProcessProgress {
pub uuid: String,
pub status: ProcessStatus,
pub current_processor: String,
pub total_frames: i64,
pub processed_frames: i64,
pub progress_percentage: f64,
pub elapsed_seconds: f64,
pub estimated_remaining_seconds: f64,
pub last_checkpoint: Option<Checkpoint>,
}
Example output
{
"uuid": "1636719dc31f78ac",
"status": "asr_processing",
"current_processor": "asr",
"total_frames": 3000,
"processed_frames": 1500,
"progress_percentage": 50.0,
"elapsed_seconds": 120.5,
"estimated_remaining_seconds": 120.5,
"last_checkpoint": {
"timestamp": 60.0,
"segments_processed": 50,
"output_file": "1636719dc31f78ac.asr.partial.json"
}
}
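The `progress_percentage` and `estimated_remaining_seconds` fields can be derived from the frame counters and the elapsed time. The following sketch assumes a simple linear extrapolation (constant processing speed); the `Estimate` type and `estimate` function are hypothetical and not part of the specified API.

```rust
/// Progress figures derived from frame counts and elapsed time.
#[derive(Debug, PartialEq)]
pub struct Estimate {
    pub progress_percentage: f64,
    pub estimated_remaining_seconds: f64,
}

/// Linear extrapolation: assumes the remaining frames take as long per frame
/// as the frames processed so far. Returns None when there is no basis yet.
pub fn estimate(processed_frames: i64, total_frames: i64, elapsed_seconds: f64) -> Option<Estimate> {
    if total_frames <= 0 || processed_frames <= 0 || processed_frames > total_frames {
        return None;
    }
    let fraction = processed_frames as f64 / total_frames as f64;
    let remaining = elapsed_seconds * (1.0 - fraction) / fraction;
    Some(Estimate {
        progress_percentage: fraction * 100.0,
        estimated_remaining_seconds: remaining,
    })
}
```

With the values from the example output (1500 of 3000 frames after 120.5 s) this gives 50% progress and about 120.5 s remaining, which matches the JSON above.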
3. Video Processing Output Specification
3.1 Output File Naming
{UUID}.{processor type}.json
Example:
1636719dc31f78ac.probe.json # Probe result
1636719dc31f78ac.asr.json # ASR result
1636719dc31f78ac.asrx.json # Speaker diarization result
1636719dc31f78ac.ocr.json # OCR result
1636719dc31f78ac.yolo.json # YOLO result
1636719dc31f78ac.face.json # Face detection result
1636719dc31f78ac.pose.json # Pose estimation result
1636719dc31f78ac.chunks.json # Chunking result
3.2 Output Directory Structure
momentry_core/
├── output/
│   ├── {uuid}/
│   │   ├── {uuid}.probe.json
│   │   ├── {uuid}.asr.json
│   │   ├── {uuid}.asrx.json
│   │   ├── {uuid}.ocr.json
│   │   ├── {uuid}.yolo.json
│   │   ├── {uuid}.face.json
│   │   ├── {uuid}.pose.json
│   │   └── thumbnails/
│   │       ├── thumb_000.jpg
│   │       ├── thumb_001.jpg
│   │       └── ...
│   └── checkpoints/
│       └── {uuid}/
│           ├── {uuid}.asr.partial.001.json
│           ├── {uuid}.asr.partial.002.json
│           └── ...
3.3 Complete Processing Result JSON Structure
{
"uuid": "1636719dc31f78ac",
"video_path": "/path/to/video.mp4",
"video_info": {
"duration": 120.5,
"fps": "24/1",
"fps_value": 24.0,
"width": 1920,
"height": 1080,
"has_video": true,
"has_audio": true,
"has_music": false,
"audio_codec": "aac",
"video_codec": "h264"
},
"processing": {
"status": "completed",
"started_at": "2026-03-16T10:00:00Z",
"completed_at": "2026-03-16T10:05:00Z",
"elapsed_seconds": 300.0,
"processors": {
"asr": {
"status": "completed",
"language": "en",
"language_probability": 0.9945,
"segments_count": 150,
"duration_seconds": 120.0
},
"asrx": {
"status": "completed",
"speakers_count": 3,
"segments_count": 150,
"duration_seconds": 60.0
},
"ocr": {
"status": "completed",
"text_regions_count": 25,
"duration_seconds": 45.0
},
"yolo": {
"status": "completed",
"objects_count": 189,
"unique_classes": ["person", "car", "dog"],
"duration_seconds": 30.0
},
"face": {
"status": "completed",
"unique_faces_count": 5,
"duration_seconds": 30.0
},
"pose": {
"status": "completed",
"persons_count": 12,
"duration_seconds": 15.0
}
}
},
"asr": {
"language": "en",
"language_probability": 0.9945855736732483,
"segments": [...]
},
"asrx": {
"language": "en",
"segments": [...]
},
"ocr": {
"segments": [...]
},
"yolo": {
"segments": [...]
},
"face": {
"segments": [...]
},
"pose": {
"segments": [...]
}
}
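4. Incremental Output During Processing (Checkpoints)
4.1 Purpose
- Avoid losing all data when processing is interrupted abnormally
- Provide a resume point so that processing can continue later
- Allow a configurable output frequency (every N seconds or every N frames)
4.2 Configuration Parameters
processing:
  checkpoint:
    enabled: true
    interval_seconds: 60 # Write a checkpoint every 60 seconds
    interval_frames: 1500 # Or every 1500 frames (choose one of the two)
    output_dir: "checkpoints"
    keep_partial: true # Keep partially completed files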
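The "every N seconds or every N frames" choice can be modelled as a small state machine that remembers where the last checkpoint was written. This is a sketch under that reading of the configuration; `Interval` and `CheckpointClock` are hypothetical names, not part of Momentry Core.

```rust
/// When to write a checkpoint: by elapsed time or by frame count, not both.
#[derive(Debug, Clone, Copy)]
pub enum Interval {
    Seconds(f64),
    Frames(i64),
}

/// Tracks the position of the last checkpoint and decides when the next is due.
pub struct CheckpointClock {
    interval: Interval,
    last_seconds: f64,
    last_frame: i64,
}

impl CheckpointClock {
    pub fn new(interval: Interval) -> Self {
        Self { interval, last_seconds: 0.0, last_frame: 0 }
    }

    /// Returns true (and records the position) when a checkpoint is due.
    pub fn due(&mut self, now_seconds: f64, frame: i64) -> bool {
        let due = match self.interval {
            Interval::Seconds(s) => now_seconds - self.last_seconds >= s,
            Interval::Frames(n) => frame - self.last_frame >= n,
        };
        if due {
            self.last_seconds = now_seconds;
            self.last_frame = frame;
        }
        due
    }
}
```

A processing loop would call `due` once per frame (or per segment) and write a partial file whenever it returns `true`.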
4.3 Checkpoint Structure
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Checkpoint {
pub uuid: String,
pub processor: String,
pub checkpoint_id: String,
pub timestamp: f64,
pub frame_number: i64,
pub total_frames: i64,
pub progress_percentage: f64,
pub partial_data: serde_json::Value,
pub created_at: DateTime<Utc>,
}
4.4 Checkpoint File Naming
{UUID}.{processor type}.partial.{sequence number}.json
Example:
1636719dc31f78ac.asr.partial.001.json # 1st checkpoint
1636719dc31f78ac.asr.partial.002.json # 2nd checkpoint
1636719dc31f78ac.asr.partial.003.json # 3rd checkpoint
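The naming pattern can be produced and parsed with a pair of small helpers. This sketch assumes the sequence number is zero-padded to three digits, as in the examples above; both function names are hypothetical.

```rust
/// Build "{uuid}.{processor}.partial.{seq}.json" with a 3-digit sequence number.
pub fn checkpoint_file_name(uuid: &str, processor: &str, seq: u32) -> String {
    format!("{uuid}.{processor}.partial.{seq:03}.json")
}

/// Parse a checkpoint file name back into (uuid, processor, sequence number).
/// Returns None for names that do not follow the pattern.
pub fn parse_checkpoint_file_name(name: &str) -> Option<(String, String, u32)> {
    let mut parts = name.split('.');
    let uuid = parts.next()?;
    let processor = parts.next()?;
    if parts.next()? != "partial" {
        return None;
    }
    let seq: u32 = parts.next()?.parse().ok()?;
    if parts.next()? != "json" || parts.next().is_some() {
        return None;
    }
    Some((uuid.to_string(), processor.to_string(), seq))
}
```

Parsing the sequence number into an integer lets callers order checkpoints numerically, which stays correct even if the counter ever grows past three digits.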
4.5 Checkpoint Output Example (ASR)
{
"uuid": "1636719dc31f78ac",
"processor": "asr",
"checkpoint_id": "partial_001",
"timestamp": 60.0,
"frame_number": 1440,
"total_frames": 3000,
"progress_percentage": 48.0,
"partial_data": {
"language": "en",
"language_probability": 0.9945,
"segments": [
{"start": 0.0, "end": 5.0, "text": "Hello world"},
{"start": 5.0, "end": 10.0, "text": "This is a test"},
...
]
},
"created_at": "2026-03-16T10:01:00Z"
}
4.6 Checkpoint Merge Logic
pub fn merge_checkpoints(checkpoints: Vec<Checkpoint>) -> serde_json::Value {
    // Sort by checkpoint_id. String order matches numeric order only while
    // the IDs stay zero-padded (partial_001, partial_002, ...).
    let mut sorted = checkpoints;
    sorted.sort_by(|a, b| a.checkpoint_id.cmp(&b.checkpoint_id));
    // Concatenate the segments of every checkpoint, in order
    let mut merged_segments: Vec<serde_json::Value> = vec![];
    for checkpoint in sorted {
        if let Some(segments) = checkpoint.partial_data.get("segments") {
            if let Some(seg_array) = segments.as_array() {
                merged_segments.extend(seg_array.clone());
            }
        }
    }
    serde_json::json!({
        "segments": merged_segments
    })
}
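5. Resuming Interrupted Video Processing
5.1 Supported Interruption Types
| Interruption | Description | Handling |
|---|---|---|
| Process crash | The processing program exits abnormally | Resume from the last checkpoint |
| System shutdown | The system shuts down unexpectedly | Resume from the last checkpoint |
| Resource exhaustion | OOM or insufficient disk space | Retry after freeing resources |
| User pause | The user pauses processing | Show the Paused status |
| Network interruption | A remote resource is unavailable | Reconnect, then continue |
5.2 Resume Flow
1. Detect the interruption
│
▼
2. Find the latest checkpoint
│
▼
3. Load the partial data
│
▼
4. Verify data integrity
│
▼
5. Continue processing from the checkpoint
│
▼
6. Write the complete result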
5.3 Resume Status Detection
pub async fn check_resume_status(uuid: &str) -> Result<ResumeStatus> {
    // 1. Find all checkpoint files
    let checkpoints = find_checkpoints(uuid)?;
    // 2. Find the most recent progress
    let last_checkpoint = checkpoints.last();
    // 3. Check whether the main output file exists
    let main_output_exists = Path::new(&format!("{}.asr.json", uuid)).exists();
    // 4. Determine what can be resumed
    let can_resume = !checkpoints.is_empty() && !main_output_exists;
    let resumable = ResumeStatus {
        can_resume,
        last_checkpoint: last_checkpoint.cloned(),
        processed_processors: detect_processed_processors(uuid),
        remaining_processors: detect_remaining_processors(uuid),
        // Required by the ResumeStatus struct in 5.4; the values are illustrative
        suggested_action: if can_resume { "resume" } else { "process" }.to_string(),
    };
    Ok(resumable)
}
5.4 ResumeStatus Structure
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct ResumeStatus {
pub can_resume: bool,
pub last_checkpoint: Option<Checkpoint>,
pub processed_processors: Vec<String>,
pub remaining_processors: Vec<String>,
pub suggested_action: String,
}
5.5 Resume Commands
# Detect automatically and resume
cargo run -- resume /path/to/video.mp4
# Force a restart from the beginning
cargo run -- process /path/to/video.mp4 --force
# Show processing status
cargo run -- status 1636719dc31f78ac
# List resumable checkpoints
cargo run -- checkpoints 1636719dc31f78ac
5.6 Conflict Resolution
pub fn resolve_conflict(
partial: &serde_json::Value,
main: &serde_json::Value,
strategy: ConflictStrategy,
) -> serde_json::Value {
match strategy {
ConflictStrategy::KeepMain => main.clone(),
ConflictStrategy::KeepPartial => partial.clone(),
ConflictStrategy::Merge => {
// Merge the segments and remove duplicates
merge_segments(partial, main)
}
}
}
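The `Merge` strategy relies on a `merge_segments` helper that the specification does not define. One possible shape is sketched below on a plain typed `Segment` struct rather than `serde_json::Value`; treating two segments with identical start and end times as duplicates, and preferring the copy from the main file, are assumptions of this sketch.

```rust
/// Simplified segment used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct Segment {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Merge two segment lists, preferring `main` when both cover the same span.
pub fn merge_segments(partial: &[Segment], main: &[Segment]) -> Vec<Segment> {
    // Put `main` first so that the stable sort keeps its entries ahead of
    // `partial` entries with the same (start, end) key.
    let mut all: Vec<Segment> = main.iter().chain(partial.iter()).cloned().collect();
    all.sort_by(|a, b| a.start.total_cmp(&b.start).then(a.end.total_cmp(&b.end)));
    // dedup_by drops the later of two neighbours when the closure returns true.
    all.dedup_by(|later, earlier| later.start == earlier.start && later.end == earlier.end);
    all
}
```

Overlapping but non-identical spans are kept as they are here; a real implementation would need a policy for those as well.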
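6. Multilingual Labeling Specification
6.1 Language Code Standard
Language codes follow the ISO 639-1 two-letter standard where a code exists. Cantonese (yue) and Min Nan (nan) have no two-letter code and use their three-letter ISO 639-3 codes:
| Language | Code | Name |
|---|---|---|
| English | en | English |
| Mandarin (Guoyu / Putonghua) | zh | Chinese (Mandarin) |
| Cantonese | yue | Cantonese |
| Min Nan (Hokkien) | nan | Min Nan |
| Japanese | ja | Japanese |
| Korean | ko | Korean |
| Spanish | es | Spanish |
| French | fr | French |
| German | de | German |
| Italian | it | Italian |
| Russian | ru | Russian |
| Arabic | ar | Arabic |
| Hindi | hi | Hindi |
| Portuguese | pt | Portuguese |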
6.2 Multilingual Detection Result
{
"language": "multi",
"languages_detected": [
{
"code": "en",
"name": "English",
"probability": 0.75,
"segments_count": 100
},
{
"code": "zh",
"name": "Chinese",
"probability": 0.20,
"segments_count": 30
},
{
"code": "ja",
"name": "Japanese",
"probability": 0.05,
"segments_count": 5
}
],
"primary_language": "en",
"segments": [
{
"start": 0.0,
"end": 5.0,
"text": "Hello world",
"language": "en",
"language_probability": 0.99
},
{
"start": 5.0,
"end": 10.0,
"text": "你好世界",
"language": "zh",
"language_probability": 0.98
}
]
}
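The `segments_count` values, `primary_language`, and the top-level `"multi"` marker can all be derived from the per-segment language labels. The sketch below assumes the primary language is the one with the most segments (ties broken alphabetically); the function names are hypothetical, and the `probability` field of `languages_detected` is not computed here because the specification does not say how it is obtained.

```rust
use std::collections::BTreeMap;

/// Count segments per language. Returns the counts ranked by segment count
/// (descending, ties broken by code) and the primary language, if any.
pub fn summarize_languages(segment_languages: &[&str]) -> (Vec<(String, usize)>, Option<String>) {
    let mut counts: BTreeMap<&str, usize> = BTreeMap::new();
    for lang in segment_languages {
        *counts.entry(*lang).or_insert(0) += 1;
    }
    let mut ranked: Vec<(String, usize)> =
        counts.into_iter().map(|(l, n)| (l.to_string(), n)).collect();
    ranked.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(&b.0)));
    let primary = ranked.first().map(|(l, _)| l.clone());
    (ranked, primary)
}

/// Top-level "language" value: "multi" when several languages are present,
/// the single code when there is exactly one, None when there are no segments.
pub fn top_level_language(ranked: &[(String, usize)]) -> Option<String> {
    match ranked.len() {
        0 => None,
        1 => Some(ranked[0].0.clone()),
        _ => Some("multi".to_string()),
    }
}
```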
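6.3 Segment-Level Language Labeling
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct AsrSegment {
    pub start: f64,
    pub end: f64,
    pub text: String,
    pub language: Option<String>,          // Segment language
    pub language_probability: Option<f64>, // Segment language probability
    pub has_audio: Option<bool>,           // Appears in the 7.2 output and is set in 7.3
}
6.4 Database Storage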
CREATE TABLE asr_segments (
id BIGSERIAL PRIMARY KEY,
uuid VARCHAR(16) NOT NULL,
segment_index INTEGER NOT NULL,
start_time DOUBLE PRECISION NOT NULL,
end_time DOUBLE PRECISION NOT NULL,
text TEXT NOT NULL,
language VARCHAR(10),
language_probability DOUBLE PRECISION,
created_at TIMESTAMP DEFAULT NOW(),
UNIQUE(uuid, segment_index)
);
CREATE INDEX idx_asr_language ON asr_segments(language);
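7. Labeling When Language Identification Fails
7.1 Unidentified States
| State | Code | Description |
|---|---|---|
| Unknown | unknown | The language cannot be determined |
| Uncertain | uncertain | Language identification confidence is too low |
| NoSpeech | no_speech | No speech content |
| Silent | silent | Entirely silent |
7.2 Unidentified Result Structure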
{
"language": "unknown",
"language_probability": null,
"language_detection_failed": true,
"failure_reason": "low_confidence",
"min_confidence_threshold": 0.7,
"detected_language": "en",
"detected_probability": 0.45,
"segments": [
{
"start": 0.0,
"end": 5.0,
"text": "",
"language": "no_speech",
"language_probability": 0.99,
"has_audio": false
}
]
}
7.3 Handling Strategy
pub fn handle_undetected_language(
    audio_path: &str,
    result: &AsrResult,
) -> AsrResult {
    // 1. Check whether the audio is silent (-40 dB threshold, see 7.4)
    let silence = detect_silence(audio_path, -40.0);
    // 2. If it is silent, label everything as "silent"
    if silence.is_silent {
        return AsrResult {
            language: Some("silent".to_string()),
            language_probability: Some(1.0),
            segments: result.segments.iter().map(|s| AsrSegment {
                language: Some("silent".to_string()),
                has_audio: Some(false),
                ..s.clone()
            }).collect(),
        };
    }
    // 3. If the language confidence is below 0.7, label as "uncertain"
    if result.language_probability.unwrap_or(0.0) < 0.7 {
        return AsrResult {
            language: Some("uncertain".to_string()),
            language_probability: result.language_probability,
            segments: result.segments.iter().map(|s| AsrSegment {
                language: Some("uncertain".to_string()),
                language_probability: Some(result.language_probability.unwrap_or(0.0)),
                ..s.clone()
            }).collect(),
        };
    }
    result.clone()
}
7.4 Silence Detection
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct SilenceDetection {
    pub is_silent: bool,
    pub silence_ratio: f64,  // Proportion of silence (0.0 - 1.0)
    pub audio_level_db: f64, // Average volume (dB)
    pub threshold_db: f64,   // Threshold (-40 dB)
}
pub fn detect_silence(audio_path: &str, threshold_db: f64) -> SilenceDetection {
    // Analyze the volume with ffmpeg
    unimplemented!("volume analysis via ffmpeg")
}
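The specification delegates volume analysis to ffmpeg. To make the `silence_ratio` and `audio_level_db` fields concrete, the sketch below computes comparable figures directly from decoded samples instead. It assumes mono `f32` samples normalised to [-1.0, 1.0] and fixed-size analysis windows; `rms_db` and `silence_ratio` are hypothetical helpers, not the project's implementation.

```rust
/// Level of a block of samples in dBFS (0 dB = full scale), from its RMS.
/// Returns negative infinity for an empty or all-zero block.
pub fn rms_db(samples: &[f32]) -> f64 {
    if samples.is_empty() {
        return f64::NEG_INFINITY;
    }
    let mean_sq: f64 = samples
        .iter()
        .map(|&s| (s as f64) * (s as f64))
        .sum::<f64>()
        / samples.len() as f64;
    // 10 * log10(mean square) equals 20 * log10(rms).
    10.0 * mean_sq.log10()
}

/// Fraction of fixed-size windows whose level is below `threshold_db`.
pub fn silence_ratio(samples: &[f32], window: usize, threshold_db: f64) -> f64 {
    if samples.is_empty() || window == 0 {
        return 1.0;
    }
    let mut total = 0usize;
    let mut quiet = 0usize;
    for w in samples.chunks(window) {
        total += 1;
        if rms_db(w) < threshold_db {
            quiet += 1;
        }
    }
    quiet as f64 / total as f64
}
```

With the -40 dB threshold from the struct above, a clip could be reported as `all_silent` when the ratio is 1.0; any softer cut-off is a policy decision the specification leaves open.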
8. Music Labeling Specification
8.1 Music Detection Result
{
"has_music": true,
"music_segments": [
{
"start": 0.0,
"end": 30.0,
"type": "background_music",
"confidence": 0.95,
"genre": "classical",
"tempo": 120,
"has_lyrics": false
},
{
"start": 60.0,
"end": 90.0,
"type": "song_with_vocals",
"confidence": 0.88,
"artist": "Unknown",
"title": "Unknown",
"has_lyrics": true
}
],
"audio_classification": {
"speech": 0.30,
"music": 0.60,
"ambient": 0.10
}
}
8.2 Music Type Classification
| Type | Description |
|---|---|
| background_music | Background music |
| song_with_vocals | Song with lyrics |
| instrumental | Instrumental music |
| sound_effect | Sound effect |
8.3 Structure Definition
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct MusicSegment {
    pub start: f64,
    pub end: f64,
    #[serde(rename = "type")]
    pub music_type: String,     // Music type, serialized as "type" (see 8.1)
    pub confidence: f64,        // Detection confidence
    pub genre: Option<String>,  // Genre (optional)
    pub tempo: Option<f64>,     // BPM (optional)
    pub has_lyrics: bool,       // Whether the segment has lyrics
    pub artist: Option<String>, // Artist (optional)
    pub title: Option<String>,  // Title (optional)
}
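9. No-Audio Labeling Specification
9.1 No-Audio Definitions
| State | Description |
|---|---|
| no_audio_track | The video has no audio track |
| all_silent | An audio track exists but is entirely silent |
| audio_error | The audio track could not be read |
9.2 No-Audio Result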
{
"has_audio": false,
"audio_status": "no_audio_track",
"audio_info": {
"has_audio_track": false,
"error_message": null,
"audio_codec": null,
"sample_rate": null,
"channels": null,
"duration": null
},
"asr": {
"language": "no_speech",
"language_probability": 1.0,
"segments": [],
"segments_count": 0,
"total_speech_duration": 0.0,
"speech_ratio": 0.0
}
}
9.3 Processing Flow
pub async fn process_video_no_audio(uuid: &str, video_path: &str) -> Result<ProcessingResult> {
    // 1. Probe the video
    let probe = probe_video(video_path).await?;
    // 2. Determine why there is no audio
    let audio_status = if !probe.has_audio_stream {
        "no_audio_track"
    } else if probe.audio_is_silent {
        "all_silent"
    } else {
        "audio_error"
    };
    // 3. Build the result
    Ok(ProcessingResult {
        has_audio: false,
        audio_status: audio_status.to_string(),
        asr: AsrResult {
            language: Some("no_speech".to_string()),
            language_probability: Some(1.0),
            segments: vec![],
        },
        ..Default::default()
    })
}
10. Frame Object Detection Labeling Specification (YOLO)
10.1 YOLO Detection Result Structure
{
"model": "yolov8x",
"model_version": "8.0",
"segments": [
{
"start": 0.0,
"end": 1.0,
"frame_number": 0,
"objects": [
{
"class": "person",
"class_id": 0,
"confidence": 0.92,
"box": {
"x1": 150,
"y1": 200,
"x2": 400,
"y2": 800
},
"tracking_id": "person_001"
},
{
"class": "car",
"class_id": 2,
"confidence": 0.87,
"box": {
"x1": 800,
"y1": 400,
"x2": 1200,
"y2": 700
},
"tracking_id": "car_001"
}
]
}
],
"statistics": {
"total_objects": 189,
"unique_classes": ["person", "car", "dog", "bicycle"],
"class_counts": {
"person": 120,
"car": 45,
"dog": 15,
"bicycle": 9
}
}
}
10.2 Supported Classes (COCO)
| Class ID | Class name |
|---|---|
| 0 | person |
| 1 | bicycle |
| 2 | car |
| 3 | motorcycle |
| 4 | airplane |
| 5 | bus |
| 6 | train |
| 7 | truck |
| 8 | boat |
| 9 | traffic light |
| ... | ... |
10.3 Structure Definitions
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct YoloResult {
pub model: String,
pub model_version: String,
pub segments: Vec<YoloSegment>,
pub statistics: YoloStatistics,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct YoloSegment {
pub start: f64,
pub end: f64,
pub frame_number: i64,
pub objects: Vec<YoloObject>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct YoloObject {
    pub class: String,
    pub class_id: i32,
    pub confidence: f64,
    // `box` is a reserved word in Rust, so the field is renamed for serde
    #[serde(rename = "box")]
    pub bbox: BoundingBox,
    pub tracking_id: Option<String>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct BoundingBox {
pub x1: i32,
pub y1: i32,
pub x2: i32,
pub y2: i32,
}
10.4 Processing Configuration
processing:
  yolo:
    model: "yolov8x"
    confidence_threshold: 0.25
    iou_threshold: 0.45
    max_det: 300
    device: "cuda"
    batch_size: 8
    skip_frames: 1 # Process every Nth frame
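The `iou_threshold` setting refers to intersection over union between two bounding boxes, the overlap measure used when suppressing duplicate detections. A minimal sketch over the `BoundingBox` shape from 10.3 follows; the struct is redeclared so the example stands alone, and the `area` and `iou` helpers are illustrative additions.

```rust
/// Same field layout as the BoundingBox in 10.3, redeclared for a standalone example.
#[derive(Debug, Clone, Copy)]
pub struct BoundingBox {
    pub x1: i32,
    pub y1: i32,
    pub x2: i32,
    pub y2: i32,
}

impl BoundingBox {
    /// Area in pixels; degenerate or inverted boxes count as zero.
    pub fn area(&self) -> i64 {
        let w = (self.x2 - self.x1).max(0) as i64;
        let h = (self.y2 - self.y1).max(0) as i64;
        w * h
    }
}

/// Intersection over union of two boxes, in the range 0.0 to 1.0.
pub fn iou(a: &BoundingBox, b: &BoundingBox) -> f64 {
    let inter = BoundingBox {
        x1: a.x1.max(b.x1),
        y1: a.y1.max(b.y1),
        x2: a.x2.min(b.x2),
        y2: a.y2.min(b.y2),
    };
    let intersection = inter.area();
    let union = a.area() + b.area() - intersection;
    if union == 0 {
        0.0
    } else {
        intersection as f64 / union as f64
    }
}
```

With `iou_threshold: 0.45`, two detections of the same class whose IoU exceeds 0.45 would typically be treated as the same object and the lower-confidence one dropped.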
11. Frame Text Recognition Labeling Specification (OCR)
11.1 OCR Detection Result Structure
{
"model": "easyocr",
"model_version": "1.7",
"language": ["en"],
"segments": [
{
"start": 0.0,
"end": 1.0,
"frame_number": 0,
"texts": [
{
"text": "EXAMPLE TEXT",
"text_normalized": "example text",
"boxes": [
{
"x1": 100,
"y1": 50,
"x2": 400,
"y2": 100
}
],
"confidence": 0.95,
"language": "en"
},
{
"text": "SUBTITLE HERE",
"text_normalized": "subtitle here",
"boxes": [
{
"x1": 200,
"y1": 900,
"x2": 1720,
"y2": 1000
}
],
"confidence": 0.88,
"language": "en"
}
]
}
],
"statistics": {
"total_text_regions": 25,
"unique_texts": 18,
"languages_detected": ["en"]
}
}
11.2 Structure Definitions
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrResult {
pub model: String,
pub model_version: String,
pub language: Vec<String>,
pub segments: Vec<OcrSegment>,
pub statistics: OcrStatistics,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrSegment {
pub start: f64,
pub end: f64,
pub frame_number: i64,
pub texts: Vec<OcrText>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct OcrText {
pub text: String,
pub text_normalized: String,
pub boxes: Vec<BoundingBox>,
pub confidence: f64,
pub language: Option<String>,
}
11.3 Text Type Classification
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum TextType {
    Subtitle,  // Subtitle
    Title,     // Title
    Sign,      // Sign / signboard
    Caption,   // Caption
    Watermark, // Watermark
    SceneText, // Scene text
    Unknown,   // Unknown
}
11.4 Processing Configuration
processing:
  ocr:
    model: "easyocr"
    languages: ["en", "zh", "ja"]
    confidence_threshold: 0.5
    text_detection: true
    text_recognition: true
    batch_size: 16
    skip_frames: 30 # Process every 30th frame (subtitles usually stay on screen longer)
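The effect of `skip_frames` can be illustrated by listing which frames get processed and at what timestamps. This sketch assumes that `skip_frames: N` means frames 0, N, 2N, and so on are processed, and that the frame rate is given as a fraction like the `"24/1"` value in the probe output; both helper names are hypothetical.

```rust
/// Frame indices visited when every `skip_frames`-th frame is processed.
/// A value of 0 is treated as 1 so that the iterator always advances.
pub fn sampled_frames(total_frames: u64, skip_frames: u64) -> impl Iterator<Item = u64> {
    (0..total_frames).step_by(skip_frames.max(1) as usize)
}

/// Timestamp in seconds of a frame, given the frame rate as a fraction.
pub fn frame_time_seconds(frame: u64, fps_num: u64, fps_den: u64) -> f64 {
    frame as f64 * fps_den as f64 / fps_num as f64
}
```

For a 24 fps video, `skip_frames: 30` under this reading samples one frame every 1.25 seconds.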
12. Frame Face Recognition Labeling Specification (Face)
12.1 Face Detection Result Structure
{
"model": "retinaface",
"model_version": "1.0",
"segments": [
{
"start": 0.0,
"end": 1.0,
"frame_number": 0,
"faces": [
{
"face_id": "face_001",
"box": {
"x1": 100,
"y1": 50,
"x2": 300,
"y2": 350
},
"embedding": [0.123, -0.456, ...],
"embedding_dim": 512,
"emotion": {
"dominant": "happy",
"scores": {
"happy": 0.75,
"neutral": 0.20,
"sad": 0.03,
"angry": 0.02
}
},
"age": 35,
"gender": "female",
"confidence": 0.95
}
]
}
],
"statistics": {
"total_faces": 50,
"unique_faces": 5,
"face_tracks": [
{
"face_id": "face_001",
"duration": 120.5,
"appearances": 3000,
"first_seen": 0.0,
"last_seen": 120.5
}
]
}
}
12.2 Structure Definitions
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FaceResult {
pub model: String,
pub model_version: String,
pub segments: Vec<FaceSegment>,
pub statistics: FaceStatistics,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FaceSegment {
pub start: f64,
pub end: f64,
pub frame_number: i64,
pub faces: Vec<Face>,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Face {
    pub face_id: String,
    // `box` is a reserved word in Rust, so the field is renamed for serde
    #[serde(rename = "box")]
    pub bbox: BoundingBox,
    pub embedding: Option<Vec<f32>>,
    pub embedding_dim: Option<i32>,
    pub emotion: Option<Emotion>,
    pub age: Option<i32>,
    pub gender: Option<String>,
    pub confidence: f64,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Emotion {
pub dominant: String,
pub scores: std::collections::HashMap<String, f64>,
}
12.3 Face Tracking
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct FaceTrack {
pub face_id: String,
pub duration: f64,
pub appearances: i32,
pub first_seen: f64,
pub last_seen: f64,
pub frames: Vec<FaceFrame>,
pub embedding_average: Vec<f32>,
}
12.4 Processing Configuration
processing:
  face:
    model: "retinaface"
    recognition_model: "arcface"
    detection_threshold: 0.5
    recognition_threshold: 0.6
    track_faces: true
    detect_emotion: true
    detect_age: true
    detect_gender: true
    skip_frames: 1
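Grouping detections into the same `face_id` usually comes down to comparing embeddings. The sketch below assumes that `recognition_threshold` is a minimum cosine similarity between two embedding vectors; the specification does not state which metric is used, so treat this as one plausible reading. Both function names are hypothetical.

```rust
/// Cosine similarity of two embeddings; None if the lengths differ,
/// the vectors are empty, or either vector has zero length (norm).
pub fn cosine_similarity(a: &[f32], b: &[f32]) -> Option<f32> {
    if a.len() != b.len() || a.is_empty() {
        return None;
    }
    let dot: f32 = a.iter().zip(b).map(|(x, y)| x * y).sum();
    let norm_a = a.iter().map(|x| x * x).sum::<f32>().sqrt();
    let norm_b = b.iter().map(|x| x * x).sum::<f32>().sqrt();
    if norm_a == 0.0 || norm_b == 0.0 {
        return None;
    }
    Some(dot / (norm_a * norm_b))
}

/// Treat two embeddings as the same person when similarity reaches the threshold.
pub fn same_identity(a: &[f32], b: &[f32], threshold: f32) -> bool {
    cosine_similarity(a, b).map_or(false, |s| s >= threshold)
}
```

A tracker could compare each new face against the `embedding_average` of existing tracks and open a new `face_id` when no track reaches the threshold.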
13. Complete Processing Flow Diagram
┌─────────────────────────────────────────────────────────────────┐
│                        VIDEO PROCESSING                         │
├─────────────────────────────────────────────────────────────────┤
│                                                                 │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │ REGISTER │───▶│  PROBE   │───▶│   ASR    │───▶│   ASRx   │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘   │
│       │               │               │               │         │
│       ▼               ▼               ▼               ▼         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │   UUID   │    │  Video   │    │ Language │    │ Speaker  │   │
│  │ Generate │    │   Info   │    │  Detect  │    │ Diariz.  │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘   │
│       │                                                         │
│       ▼                                                         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │   OCR    │───▶│   YOLO   │───▶│   FACE   │───▶│   POSE   │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘   │
│       │               │               │               │         │
│       ▼               ▼               ▼               ▼         │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐    ┌──────────┐   │
│  │   Text   │    │  Object  │    │   Face   │    │   Pose   │   │
│  │  Detect  │    │  Detect  │    │  Track   │    │ Estimate │   │
│  └──────────┘    └──────────┘    └──────────┘    └──────────┘   │
│                               │                                 │
│                               ▼                                 │
│                        ┌──────────────┐                         │
│                        │    CHUNK     │                         │
│                        └──────────────┘                         │
│                               │                                 │
│                               ▼                                 │
│                        ┌──────────────┐                         │
│                        │   COMPLETE   │                         │
│                        └──────────────┘                         │
│                                                                 │
│  ┌──────────────────────────────────────────────────────────┐   │
│  │            CHECKPOINT (written every minute)             │   │
│  │  {uuid}.asr.partial.001.json                             │   │
│  │  {uuid}.asrx.partial.001.json                            │   │
│  │  ...                                                     │   │
│  └──────────────────────────────────────────────────────────┘   │
│                                                                 │
└─────────────────────────────────────────────────────────────────┘
14. Processor Status Matrix
| Processor | Input | Output | Required | Optional parameters |
|---|---|---|---|---|
| Probe | Video file | probe.json | ✅ | - |
| ASR | Video / audio track | asr.json | ✅ | model, language |
| ASRx | asr.json | asrx.json | ❌ | min_speakers, max_speakers |
| OCR | Video frames | ocr.json | ❌ | languages, threshold |
| YOLO | Video frames | yolo.json | ❌ | model, confidence |
| Face | Video frames | face.json | ❌ | recognition, track |
| Pose | Video frames | pose.json | ❌ | model, tracking |
15. Error Handling
15.1 Error Types
#[derive(Debug, Clone, Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
pub enum ProcessError {
FileNotFound,
InvalidFormat,
NoAudioTrack,
NoVideoTrack,
ProcessingFailed { processor: String, message: String },
OutOfMemory,
DiskFull,
Timeout,
Cancelled,
}
15.2 Error Response
{
"error": "processing_failed",
"processor": "asr",
"message": "Failed to load Whisper model",
"timestamp": "2026-03-16T10:00:00Z",
"retryable": false,
"suggestion": "Check GPU availability and model files"
}
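The `"error"` field in the response carries the snake_case name of the error variant. The sketch below mirrors that mapping with the standard library only, so it can be read without serde; the enum is redeclared from 15.1 and the `code` method is a hypothetical helper.

```rust
/// Same variants as ProcessError in 15.1, redeclared without serde derives.
#[derive(Debug, Clone, PartialEq)]
pub enum ProcessError {
    FileNotFound,
    InvalidFormat,
    NoAudioTrack,
    NoVideoTrack,
    ProcessingFailed { processor: String, message: String },
    OutOfMemory,
    DiskFull,
    Timeout,
    Cancelled,
}

impl ProcessError {
    /// snake_case identifier, as used in the "error" field of the response.
    pub fn code(&self) -> &'static str {
        match self {
            ProcessError::FileNotFound => "file_not_found",
            ProcessError::InvalidFormat => "invalid_format",
            ProcessError::NoAudioTrack => "no_audio_track",
            ProcessError::NoVideoTrack => "no_video_track",
            ProcessError::ProcessingFailed { .. } => "processing_failed",
            ProcessError::OutOfMemory => "out_of_memory",
            ProcessError::DiskFull => "disk_full",
            ProcessError::Timeout => "timeout",
            ProcessError::Cancelled => "cancelled",
        }
    }
}
```

The other response fields (`retryable`, `suggestion`) depend on policy that the specification does not spell out, so they are left out of this sketch.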
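16. Version History
| Version | Date | Changes |
|---|---|---|
| 1.0.0 | 2026-03-16 | Initial release |
17. Related Documents
- CHUNK_SPEC.md - Video chunking specification
- JSON_OUTPUT_SPEC.md - JSON output specification
- RUST_DEVELOPMENT.md - Rust development guidelines
- AGENTS.md - Development guidelines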