feat: update Python processors and add utility scripts

- Update ASR, face, OCR, pose processors
- Add release pre-flight check script
- Add synonym generation, chunk processing scripts
- Add face recognition, stamp search utilities
2026-04-30 15:07:49 +08:00
parent f4697396e4
commit 8f05a7c188
256 changed files with 60505 additions and 299 deletions
@@ -0,0 +1,569 @@
+# Multimodal Integration Plan: Face + ASR + pyannote + Pose
+
+**Last updated**: 2026-04-02  
+**Integration goal**: speaker identification accuracy of 95%+
+
+---
+
+## 📊 Current System Status
+
+### Module Check
+
+| Module | Status | Accuracy | Processing time | Notes |
+|------|------|--------|---------|------|
+| **Face** | ✅ Installed | 85% | 65s (short video) | OpenCV Haar Cascade |
+| **ASR** | ✅ Installed | 90% | 50s (short video) | small model, tuned for Taiwanese accent |
+| **pyannote** | ✅ Installed | 95%+ | 180s | Requires a HuggingFace token |
+| **Pose** | ✅ Installed | 85% | 65s | YOLOv8 Pose |
+| **mediapipe** | ❓ To be confirmed | - | - | Lip movement detection |
+
+---
+
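+## 🎯 Integration Architecture
+
+### Four-Modality Fusion Flow
+
+```
+Video input
+    │
+    ├─→ Face detection ─────→ face position ──┐
+    │                                         │
+    ├─→ ASR transcription ──→ text content ───┼─→ Multimodal integration ──→ Final result
+    │                                         │          │
+    ├─→ pyannote ───────────→ speaker ID ─────┘          │
+    │                                                    │
+    └─→ Pose detection ─────→ lip movement ──────────────┘
+                                                  (accuracy 95%+)
+```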
+
+---
+
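+## 🔍 Role of Each Module
+
+### 1. Face Detection
+
+**Function**: face position detection  
+**Output**: `{x, y, width, height, timestamp}`  
+**Accuracy**: 85%  
+**Processing time**: 65 seconds (short video)
+
+**Contribution**:
+- ✅ Confirms that a person is in the frame
+- ✅ Provides face positions
+- ✅ Distinguishes faces in multi-person scenes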
+
+---
+
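+### 2. ASR Transcription
+
+**Function**: speech-to-text  
+**Output**: `{text, start, end, language}`  
+**Accuracy**: 90% (Taiwanese accent)  
+**Processing time**: 50 seconds (short video)
+
+**Contribution**:
+- ✅ Transcribes speech content
+- ✅ Language identification
+- ✅ Timestamp alignment
+- ✅ Recognition of domain-specific terms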
+
+---
+
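+### 3. pyannote.audio
+
+**Function**: speaker diarization  
+**Output**: `{speaker_id, start, end}`  
+**Accuracy**: 95%+  
+**Processing time**: 180 seconds (short video)
+
+**Contribution**:
+- ✅ Assigns speaker IDs
+- ✅ High-accuracy speaker separation
+- ✅ Multilingual support
+- ✅ Overlapping-speech detection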
+
+---
+
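+### 4. Pose Lip Movement
+
+**Function**: lip movement detection  
+**Output**: `{is_speaking, lip_distance, timestamp}`  
+**Accuracy**: 90%  
+**Processing time**: 30 seconds (short video, estimated)
+
+**Contribution**:
+- ✅ Visual confirmation of speech
+- ✅ Detects mouth opening and closing
+- ✅ Improves accuracy on overlapping speech
+- ✅ Robustness in noisy environments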
+
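The four output shapes listed above can be written down as typed records, so the integrator has a single contract to code against. This is a minimal sketch using Python dataclasses: the class names are hypothetical, the fields simply mirror the output keys given for each module, and the units (pixels for boxes, seconds for times) are assumptions.

```python
from dataclasses import dataclass


@dataclass
class FaceDetection:
    """One detected face box at one timestamp (box in pixels, time in seconds)."""
    x: int
    y: int
    width: int
    height: int
    timestamp: float


@dataclass
class AsrSegment:
    """One transcribed span of speech."""
    text: str
    start: float
    end: float
    language: str


@dataclass
class SpeakerTurn:
    """One diarization segment attributed to a single speaker."""
    speaker_id: str
    start: float
    end: float


@dataclass
class LipSample:
    """One per-frame lip measurement."""
    is_speaking: bool
    lip_distance: float
    timestamp: float


turn = SpeakerTurn(speaker_id="SPEAKER_00", start=0.0, end=2.5)
print(turn.end - turn.start)  # 2.5
```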
+---
+
+## 🧩 Integration Logic
+
+### Multimodal Voting Mechanism
+
+```python
+class MultimodalIntegration:
+    def __init__(self):
+        self.weights = {
+            'pyannote': 0.40,  # speaker diarization (highest weight)
+            'asr': 0.30,       # ASR transcription
+            'pose': 0.20,      # lip movement
+            'face': 0.10       # face detection
+        }
+
+    def integrate(self, face_result, asr_result, pyannote_result, pose_result):
+        """
+        Fuse the four module outputs into one list of speaker segments.
+        """
+        segments = []
+
+        # Use the pyannote timeline as the reference
+        for pyannote_seg in pyannote_result['segments']:
+            # Collect evidence from each module
+            # (the check_*_evidence helpers are left to the implementation)
+            evidence = {
+                'pyannote': self.check_pyannote_evidence(pyannote_result, pyannote_seg),
+                'asr': self.check_asr_evidence(asr_result, pyannote_seg),
+                'pose': self.check_pose_evidence(pose_result, pyannote_seg),
+                'face': self.check_face_evidence(face_result, pyannote_seg)
+            }
+
+            # Compute the confidence score
+            confidence = self.calculate_confidence(evidence)
+
+            # Decide the speaker and the confidence tier
+            speaker, level = self.determine_speaker(pyannote_seg, confidence)
+
+            segments.append({
+                'start': pyannote_seg['start'],
+                'end': pyannote_seg['end'],
+                'speaker': speaker,
+                'confidence': confidence,
+                'confidence_level': level,
+                'evidence': evidence
+            })
+
+        return segments
+
+    def calculate_confidence(self, evidence):
+        """
+        Sum the weights of the modules whose evidence supports the segment.
+        """
+        score = sum(
+            weight for module, weight in self.weights.items() if evidence[module]
+        )
+        # Round to avoid float drift (0.4 + 0.3 + 0.1 would otherwise be 0.7999...)
+        return round(score, 2)  # 0.0 - 1.0
+
+    def determine_speaker(self, segment, confidence):
+        """
+        Return the pyannote speaker ID and a confidence tier for the segment.
+        """
+        if confidence >= 0.8:
+            level = "HIGH_CONFIDENCE"
+        elif confidence >= 0.6:
+            level = "MEDIUM_CONFIDENCE"
+        else:
+            level = "LOW_CONFIDENCE"
+
+        # pyannote output carries the speaker ID: {speaker_id, start, end}
+        return segment['speaker_id'], level
+```
+
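The `check_*_evidence` helpers are not spelled out in the plan. One plausible reading is that a module supports a pyannote segment when its own output overlaps that segment in time. The sketch below assumes this overlap rule; the function names, the 0.5 coverage threshold, and the sample data are all hypothetical, and the ASR entries are assumed not to overlap each other.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals, in seconds."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def asr_supports(asr_segments, seg, min_coverage=0.5):
    """True if ASR entries cover at least `min_coverage` of the speaker segment.

    Assumes the ASR entries do not overlap each other, so their overlaps can be summed.
    """
    duration = seg["end"] - seg["start"]
    if duration <= 0:
        return False
    covered = sum(
        overlap_seconds(a["start"], a["end"], seg["start"], seg["end"])
        for a in asr_segments
    )
    return covered / duration >= min_coverage


def pose_supports(lip_samples, seg):
    """True if any speaking lip sample falls inside the speaker segment."""
    return any(seg["start"] <= s["timestamp"] <= seg["end"] for s in lip_samples)


seg = {"speaker_id": "SPEAKER_00", "start": 10.0, "end": 14.0}
asr = [{"text": "hello", "start": 9.0, "end": 12.5}]
lips = [{"timestamp": 11.2, "lip_distance": 0.07}]
print(asr_supports(asr, seg), pose_supports(lips, seg))  # True True
```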
+---
+
+## 📈 Expected Results
+
+### Accuracy Improvement
+
+| Scenario | One modality | Two modalities | Three modalities | Four modalities |
+|------|--------|--------|--------|--------|
+| **Two-person dialogue** | 85% | 90% | 93% | **95-98%** |
+| **Three-person meeting** | 80% | 85% | 90% | **92-95%** |
+| **Multi-person meeting** | 75% | 80% | 85% | **88-92%** |
+| **Overlapping speech** | 65% | 75% | 80% | **85-90%** |
+| **Noisy environment** | 70% | 80% | 85% | **90-93%** |
+
+---
+
+### Processing Time
+
+| Module | Processing time | Parallelizable |
+|------|---------|--------|
+| **Face** | 65s | ✅ Yes |
+| **ASR** | 50s | ✅ Yes |
+| **pyannote** | 180s | ❌ Requires audio |
+| **Pose** | 30s | ✅ Yes |
+| **Integration** | 10s | ❌ Must wait for the modules |
+| **Total** | ~190s | (with parallel execution) |
+
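The ~190s total is consistent with running the modules concurrently and then integrating: wall-clock time is the slowest module plus the integration step. A small worked check using the figures from the table, assuming pyannote overlaps fully with the other three modules and that there is no resource contention:

```python
# Per-module times from the table above, in seconds.
module_seconds = {"face": 65, "asr": 50, "pyannote": 180, "pose": 30}
integration_seconds = 10

# Running everything back to back.
sequential_total = sum(module_seconds.values()) + integration_seconds

# Running the modules concurrently: bounded by the slowest one (pyannote).
parallel_total = max(module_seconds.values()) + integration_seconds

print(sequential_total)  # 335
print(parallel_total)    # 190
```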
+---
+
+## 🔧 Implementation Steps
+
+### Stage 1: Install mediapipe (30 minutes)
+
+```bash
+# Install mediapipe
+pip install mediapipe
+
+# Verify the installation
+python3 -c "import mediapipe; print('✅ mediapipe installed')"
+```
+
+---
+
+### Stage 2: Create the Pose Lip Detection Module (2 hours)
+
+**File**: `scripts/pose_lip_processor.py`
+
+**Features**:
+- MediaPipe Face Mesh
+- 468 facial landmarks
+- Lip contour detection
+- Mouth-opening measurement
+
+**Code skeleton**:
+```python
+import mediapipe as mp
+import cv2
+
+class LipMovementDetector:
+    def __init__(self):
+        self.face_mesh = mp.solutions.face_mesh.FaceMesh()
+
+    def detect(self, video_path):
+        """Detect lip movement frame by frame."""
+        cap = cv2.VideoCapture(video_path)
+        speaking_segments = []
+
+        while cap.isOpened():
+            ret, frame = cap.read()
+            if not ret:
+                break
+
+            # Run MediaPipe (it expects RGB, OpenCV delivers BGR)
+            results = self.face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))
+
+            if results.multi_face_landmarks:
+                # Measure the mouth opening of the first detected face
+                lip_distance = self.calculate_lip_distance(
+                    results.multi_face_landmarks[0]
+                )
+
+                # Decide whether the person is speaking.
+                # The 0.05 threshold is in normalized image coordinates and needs tuning.
+                is_speaking = lip_distance > 0.05
+
+                if is_speaking:
+                    speaking_segments.append({
+                        'timestamp': cap.get(cv2.CAP_PROP_POS_MSEC) / 1000,
+                        'lip_distance': lip_distance
+                    })
+
+        cap.release()
+        return speaking_segments
+
+    def calculate_lip_distance(self, landmarks):
+        """Compute the vertical gap between the inner lips."""
+        # Inner upper lip: landmark 13
+        # Inner lower lip: landmark 14
+        # (landmark 17 is the outer edge of the lower lip, so 13-17 would
+        # include the lip thickness even when the mouth is closed)
+        upper_lip = landmarks.landmark[13]
+        lower_lip = landmarks.landmark[14]
+
+        return abs(upper_lip.y - lower_lip.y)
+```
+
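`detect` returns one entry per speaking frame, while the integrator works on time intervals. A possible bridging step is to merge frames that are close in time into intervals. The sketch below assumes that approach; `group_speaking_frames` and the 0.5 s gap tolerance are hypothetical and not part of the plan.

```python
def group_speaking_frames(samples, max_gap=0.5):
    """Merge per-frame samples into (start, end) intervals.

    `samples` is a list of dicts with a 'timestamp' key (seconds), sorted by time.
    A sample no more than `max_gap` seconds after the previous one extends
    the current interval; otherwise it starts a new one.
    """
    intervals = []
    for sample in samples:
        t = sample["timestamp"]
        if intervals and t - intervals[-1][1] <= max_gap:
            intervals[-1][1] = t        # extend the current interval
        else:
            intervals.append([t, t])    # start a new interval
    return [(start, end) for start, end in intervals]


frames = [{"timestamp": t} for t in (1.0, 1.1, 1.2, 3.0, 3.1)]
print(group_speaking_frames(frames))  # [(1.0, 1.2), (3.0, 3.1)]
```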
+---
+
+### Stage 3: Create the Multimodal Integrator (3 hours)
+
+**File**: `scripts/multimodal_integrator.py`
+
+**Features**:
+- Integrates Face + ASR + pyannote + Pose
+- Voting mechanism
+- Confidence calculation
+- Final result output
+
+**Code skeleton**:
+```python
+import json
+from typing import Dict, List
+
+class MultimodalIntegrator:
+    def __init__(self):
+        self.weights = {
+            'pyannote': 0.40,
+            'asr': 0.30,
+            'pose': 0.20,
+            'face': 0.10
+        }
+
+    def integrate(self, results: Dict) -> Dict:
+        """
+        Integrate the results of all modules.
+
+        Args:
+            results: {
+                'face': face_result,
+                'asr': asr_result,
+                'pyannote': pyannote_result,
+                'pose': pose_result
+            }
+
+        Returns:
+            integrated_result
+        """
+        # Use the pyannote timeline as the reference
+        segments = []
+
+        for pyannote_seg in results['pyannote']['segments']:
+            # Collect evidence
+            evidence = self.collect_evidence(results, pyannote_seg)
+
+            # Compute the confidence score
+            confidence = self.calculate_confidence(evidence)
+
+            # Decide the speaker. determine_speaker is not part of this skeleton;
+            # it is expected to return the pyannote speaker ID and a confidence tier.
+            speaker, level = self.determine_speaker(pyannote_seg, confidence)
+
+            segments.append({
+                'start': pyannote_seg['start'],
+                'end': pyannote_seg['end'],
+                'speaker': speaker,
+                'confidence': confidence,
+                'confidence_level': level,
+                'text': self.get_asr_text(results['asr'], pyannote_seg),
+                'evidence': evidence
+            })
+
+        # Guard against an empty timeline (no pyannote segments)
+        avg_confidence = (
+            sum(s['confidence'] for s in segments) / len(segments)
+            if segments else 0.0
+        )
+
+        return {
+            'segments': segments,
+            'num_speakers': len(set(s['speaker'] for s in segments)),
+            'avg_confidence': avg_confidence
+        }
+
+    def collect_evidence(self, results: Dict, segment: Dict) -> Dict:
+        """Collect evidence from each module (check_* helpers not shown)."""
+        evidence = {}
+
+        # pyannote evidence
+        evidence['pyannote'] = self.check_pyannote_evidence(
+            results['pyannote'], segment
+        )
+
+        # ASR evidence
+        evidence['asr'] = self.check_asr_evidence(
+            results['asr'], segment
+        )
+
+        # Pose evidence
+        evidence['pose'] = self.check_pose_evidence(
+            results['pose'], segment
+        )
+
+        # Face evidence
+        evidence['face'] = self.check_face_evidence(
+            results['face'], segment
+        )
+
+        return evidence
+
+    def calculate_confidence(self, evidence: Dict) -> float:
+        """Sum the weights of the modules whose evidence supports the segment."""
+        score = sum(
+            weight for module, weight in self.weights.items() if evidence[module]
+        )
+        # Round to avoid float drift (0.4 + 0.3 + 0.1 would otherwise be 0.7999...)
+        return round(score, 2)
+```
+
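`get_asr_text` is called in the skeleton but not defined there. The following is a minimal sketch written as a standalone function. It assumes `asr_result` carries a `segments` list of `{text, start, end}` entries, and that an ASR entry belongs to a speaker segment when its midpoint falls inside it. Both the `segments` key and the midpoint rule are assumptions.

```python
def get_asr_text(asr_result, segment):
    """Join the text of ASR entries whose midpoint lies inside `segment`."""
    parts = []
    for entry in asr_result.get("segments", []):
        midpoint = (entry["start"] + entry["end"]) / 2
        if segment["start"] <= midpoint < segment["end"]:
            parts.append(entry["text"].strip())
    return " ".join(parts)


asr_result = {
    "segments": [
        {"text": "Good morning.", "start": 0.2, "end": 1.4},
        {"text": "Shall we begin?", "start": 1.6, "end": 2.8},
        {"text": "Yes.", "start": 3.1, "end": 3.5},
    ]
}
print(get_asr_text(asr_result, {"start": 0.0, "end": 3.0}))
# Good morning. Shall we begin?
```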
+---
+
+### Stage 4: Testing and Validation (4 hours)
+
+**Test scripts**:
+```bash
+# 1. Short-video test
+python3 scripts/test_multimodal_short.py
+
+# 2. Long-video test
+python3 scripts/test_multimodal_long.py
+
+# 3. Accuracy validation
+python3 scripts/validate_multimodal_accuracy.py
+
+# 4. Performance benchmark
+python3 scripts/benchmark_performance.py
+```
+
+**Test videos**:
+- ExaSAN (2.6 minutes, short video)
+- Charade 1963 (114 minutes, long video)
+
+**Validation metrics**:
+- Accuracy (vs. manual annotation)
+- Processing time
+- Memory usage
+- Confidence distribution
+
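One way to score accuracy against manual annotation is to sample both timelines on a fixed grid and count the fraction of sample points where the labels agree. The sketch below assumes this sampled-agreement metric and assumes predicted labels have already been mapped to the annotator's speaker names. It is not the standard diarization error rate, and the function names are hypothetical.

```python
def label_at(segments, t):
    """Return the speaker label active at time t, or None for silence."""
    for seg in segments:
        if seg["start"] <= t < seg["end"]:
            return seg["speaker"]
    return None


def sampled_accuracy(predicted, reference, duration, step=0.1):
    """Fraction of grid points in [0, duration) where both timelines agree.

    Points are taken at the middle of each step to stay off segment boundaries.
    """
    n_points = int(round(duration / step))
    if n_points == 0:
        return 0.0
    matches = sum(
        label_at(predicted, (i + 0.5) * step) == label_at(reference, (i + 0.5) * step)
        for i in range(n_points)
    )
    return matches / n_points


reference = [{"start": 0.0, "end": 5.0, "speaker": "A"},
             {"start": 5.0, "end": 10.0, "speaker": "B"}]
predicted = [{"start": 0.0, "end": 6.0, "speaker": "A"},
             {"start": 6.0, "end": 10.0, "speaker": "B"}]
# The prediction is wrong for one second out of ten.
print(sampled_accuracy(predicted, reference, duration=10.0))  # 0.9
```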
+---
+
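+### Stage 5: Optimization and Deployment (3 hours)
+
+**Optimization directions**:
+1. Parallel processing (Face + ASR + Pose)
+2. Batch processing (split long videos into chunks)
+3. Caching (avoid recomputation)
+4. Memory optimization
+
+**Deployment**:
+```bash
+# Integrated processor
+python3 scripts/multimodal_processor.py \
+  video.mp4 \
+  output.json \
+  --face \
+  --asr \
+  --pyannote \
+  --pose
+```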
+
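The command above implies a CLI with two positional paths and one switch per module. `multimodal_processor.py` does not exist yet, so the following is only a sketch of how its argument parsing might look with `argparse`. The `build_parser` function, the description, and the help strings are hypothetical.

```python
import argparse

MODULES = ("face", "asr", "pyannote", "pose")


def build_parser():
    parser = argparse.ArgumentParser(description="Multimodal speaker identification")
    parser.add_argument("video", help="input video path")
    parser.add_argument("output", help="output JSON path")
    # One on/off switch per module, matching the flags in the command above.
    for name in MODULES:
        parser.add_argument(f"--{name}", action="store_true",
                            help=f"enable the {name} module")
    return parser


# Parse an explicit argument list so the example runs without a real command line.
args = build_parser().parse_args(["video.mp4", "output.json", "--face", "--pyannote"])
enabled = [name for name in MODULES if getattr(args, name)]
print(enabled)  # ['face', 'pyannote']
```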
+---
+
+## 📋 File List
+
+### Existing Files
+
+```
+scripts/
+├── face_processor.py              # ✅ Face detection
+├── asr_processor_small.py         # ✅ ASR transcription
+├── asrx_processor_v2_transcribe.py # ✅ pyannote transcription
+├── pose_processor.py              # ✅ Pose detection (YOLOv8)
+└── integrate_face_asrx.py         # ✅ Face+ASR integration
+```
+
+### New Files (to be created)
+
+```
+scripts/
+├── pose_lip_processor.py          # 🆕 Lip movement detection
+├── multimodal_integrator.py       # 🆕 Multimodal integrator
+├── multimodal_processor.py        # 🆕 End-to-end processor
+├── test_multimodal_short.py       # 🆕 Short-video test
+├── test_multimodal_long.py        # 🆕 Long-video test
+├── validate_multimodal_accuracy.py # 🆕 Accuracy validation
+└── MULTIMODAL_INTEGRATION_PLAN.md # 🆕 This plan
+```
+
+---
+
+## 📊 Resource Requirements
+
+### Hardware Requirements
+
+| Component | Minimum | Recommended |
+|------|---------|---------|
+| **CPU** | 4 cores | 8 cores (M4 Mac Mini) |
+| **Memory** | 8 GB | 16 GB |
+| **Storage** | 10 GB | 50 GB |
+| **GPU** | Optional | M4 GPU (acceleration) |
+
+---
+
+### Software Dependencies
+
+```bash
+# Core dependencies
+mediapipe>=0.9.0
+opencv-python>=4.5.0
+pyannote.audio>=3.4.0
+whisperx>=3.7.0
+ultralytics>=8.0.0
+
+# Optional dependencies
+torch>=2.5.0
+numpy>=1.20.0
+```
+
+---
+
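+## ✅ Acceptance Criteria
+
+### Functional Acceptance
+
+- [ ] Face detection works correctly
+- [ ] ASR transcription is accurate (90%+)
+- [ ] pyannote speaker diarization (95%+)
+- [ ] Pose lip movement detection (90%+)
+- [ ] Multimodal integration works correctly
+- [ ] Confidence is computed correctly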
+
+---
+
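+### Performance Acceptance
+
+- [ ] Short video processed in < 200 seconds
+- [ ] Long video processed at > 5x real-time speed
+- [ ] Memory usage < 12 GB
+- [ ] Accuracy > 95% (two-person dialogue)
+- [ ] Accuracy > 90% (multi-person meeting)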
+
+---
+
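+## 🎯 Decision Points
+
+### Implement immediately if:
+
+- ✅ The highest accuracy (95%+) is required
+- ✅ Multi-person meetings are common
+- ✅ Overlapping speech is common
+- ✅ Hardware resources are sufficient
+- ✅ There is enough time (10-15 hours)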
+
+---
+
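+### Implement in phases if:
+
+- ⚠️ Time is limited
+- ⚠️ The benefit needs to be validated first
+- ⚠️ Resources are limited
+
+**Phase 1**: Face + ASR + pyannote (already available)  
+**Phase 2**: Add Pose lip detection  
+**Phase 3**: Full integration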
+
+---
+
+## 📁 Reference Documents
+
+- `PYANNOTE_AUDIO_GUIDE.md` - pyannote usage guide
+- `PYANNOTE_MULTILINGUAL_GUIDE.md` - multilingual guide
+- `PYANNOTE_VS_ASRX_COMPARISON.md` - comparison of approaches
+- `LIP_MOVEMENT_INTEGRATION_PLAN.md` - lip movement plan
+- `ASRX_ALTERNATIVES_FINAL_REPORT.md` - alternatives report
+
+---
+
+**Plan completed**: 2026-04-02  
+**Implementation difficulty**: ⭐⭐⭐⭐ (high)  
+**Estimated time**: 10-15 hours  
+**Expected accuracy**: 95%+  
+**Recommendation**: implement in phases