feat: update Python processors and add utility scripts
- Update ASR, face, OCR, pose processors
- Add release pre-flight check script
- Add synonym generation, chunk processing scripts
- Add face recognition, stamp search utilities
@@ -0,0 +1,569 @@
# Multimodal Integration Plan: Face + ASR + pyannote + Pose

**Last updated**: 2026-04-02
**Integration goal**: Speaker identification accuracy of 95%+

---

## 📊 Current System Status

### Module Check

| Module | Status | Accuracy | Processing time | Notes |
|------|------|--------|---------|------|
| **Face** | ✅ Installed | 85% | 65s (short video) | OpenCV Haar Cascade |
| **ASR** | ✅ Installed | 90% | 50s (short video) | `small` model, tuned for Taiwanese accent |
| **pyannote** | ✅ Installed | 95%+ | 180s | Requires a HuggingFace token |
| **Pose** | ✅ Installed | 85% | 65s | YOLOv8 Pose |
| **mediapipe** | ❓ To be confirmed | - | - | Lip movement detection |

---

## 🎯 Integration Architecture

### Four-Modality Fusion Flow

```
Video input
    │
    ├─→ Face detection ──→ Face positions ──┐
    │                                       │
    ├─→ ASR transcription ──→ Text ─────────┼─→ Multimodal integration ──→ Final result
    │                                       │             │
    ├─→ pyannote ──→ Speaker IDs ───────────┘             │
    │                                                     │
    └─→ Pose detection ──→ Lip movement ──────────────────┘
                                                                           (accuracy 95%+)
```

---

## 🔍 Role of Each Module

### 1. Face Detection

**Function**: Face position detection
**Output**: `{x, y, width, height, timestamp}`
**Accuracy**: 85%
**Processing time**: 65 seconds (short video)

**Contribution**:
- ✅ Confirms that a person is in the frame
- ✅ Provides face positions
- ✅ Distinguishes people in multi-person scenes

---

### 2. ASR Transcription

**Function**: Speech-to-text
**Output**: `{text, start, end, language}`
**Accuracy**: 90% (Taiwanese accent)
**Processing time**: 50 seconds (short video)

**Contribution**:
- ✅ Transcribes speech content
- ✅ Language identification
- ✅ Timestamp alignment
- ✅ Recognition of domain-specific vocabulary

---

### 3. pyannote.audio

**Function**: Speaker diarization
**Output**: `{speaker_id, start, end}`
**Accuracy**: 95%+
**Processing time**: 180 seconds (short video)

**Contribution**:
- ✅ Assigns speaker IDs
- ✅ High-accuracy separation
- ✅ Multilingual support
- ✅ Overlapping-speech detection

---

### 4. Pose Lip Movement

**Function**: Lip movement detection
**Output**: `{is_speaking, lip_distance, timestamp}`
**Accuracy**: 90%
**Processing time**: 30 seconds (short video, estimated)

**Contribution**:
- ✅ Visual confirmation of speech
- ✅ Mouth open/close detection
- ✅ Improves accuracy on overlapping speech
- ✅ Robustness in noisy environments

---

## 🧩 Integration Logic

### Multimodal Voting Mechanism

```python
class MultimodalIntegration:
    def __init__(self):
        # Per-module voting weights (they sum to 1.0)
        self.weights = {
            'pyannote': 0.40,  # speaker diarization (highest weight)
            'asr': 0.30,       # ASR transcription
            'pose': 0.20,      # lip movement
            'face': 0.10       # face detection
        }

    def integrate(self, face_result, asr_result, pyannote_result, pose_result):
        """
        Multimodal integration.

        The check_*_evidence helpers are not defined in this plan; each is
        expected to return a truthy value when its module supports the segment.
        """
        segments = []

        # Use the pyannote timeline as the reference
        for pyannote_seg in pyannote_result['segments']:
            # Collect evidence from each module
            evidence = {
                'pyannote': self.check_pyannote_evidence(pyannote_seg),
                'asr': self.check_asr_evidence(asr_result, pyannote_seg),
                'pose': self.check_pose_evidence(pose_result, pyannote_seg),
                'face': self.check_face_evidence(face_result, pyannote_seg)
            }

            # Compute the confidence score
            confidence = self.calculate_confidence(evidence)

            # Decide the speaker
            speaker = self.determine_speaker(evidence, confidence)

            segments.append({
                'start': pyannote_seg['start'],
                'end': pyannote_seg['end'],
                'speaker': speaker,
                'confidence': confidence,
                'evidence': evidence
            })

        return segments

    def calculate_confidence(self, evidence):
        """
        Compute the confidence score as the sum of the agreeing modules' weights.
        """
        score = 0.0

        if evidence['pyannote']:
            score += self.weights['pyannote']

        if evidence['asr']:
            score += self.weights['asr']

        if evidence['pose']:
            score += self.weights['pose']

        if evidence['face']:
            score += self.weights['face']

        # Round to absorb float error: 0.4 + 0.3 + 0.1 evaluates to
        # 0.7999999999999999 and would otherwise miss the 0.8 threshold.
        return round(score, 2)  # 0.0 - 1.0

    def determine_speaker(self, evidence, confidence):
        """
        Decide the speaker ID.

        Note: this sketch only returns a confidence tier; mapping a segment
        to an actual speaker ID is not implemented here.
        """
        if confidence >= 0.8:
            return "HIGH_CONFIDENCE"    # high confidence
        elif confidence >= 0.6:
            return "MEDIUM_CONFIDENCE"  # medium confidence
        else:
            return "LOW_CONFIDENCE"     # low confidence
```
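
To see what the 0.40 / 0.30 / 0.20 / 0.10 weights and the 0.8 / 0.6 thresholds imply in practice, the following sketch enumerates every combination of agreeing modules and the tier it lands in. It is an illustration only: the names `WEIGHTS_PCT`, `score_pct`, `tier` and `tier_table` are hypothetical, and the weights are written as integer percentages (an assumption made here so the sums are exact).

```python
from itertools import product

# Weights copied from the plan, expressed as integer percentages so that
# sums are exact (an illustrative choice, not part of the planned code).
WEIGHTS_PCT = {"pyannote": 40, "asr": 30, "pose": 20, "face": 10}


def score_pct(evidence):
    """Sum the weights of every module whose evidence flag is truthy."""
    return sum(w for name, w in WEIGHTS_PCT.items() if evidence.get(name))


def tier(pct):
    """Map a percentage score to the plan's three confidence tiers."""
    if pct >= 80:
        return "HIGH_CONFIDENCE"
    if pct >= 60:
        return "MEDIUM_CONFIDENCE"
    return "LOW_CONFIDENCE"


def tier_table():
    """Enumerate all 16 evidence combinations with their score and tier."""
    rows = []
    for flags in product([True, False], repeat=len(WEIGHTS_PCT)):
        evidence = dict(zip(WEIGHTS_PCT, flags))
        pct = score_pct(evidence)
        agreeing = "+".join(n for n, ok in evidence.items() if ok) or "(none)"
        rows.append((agreeing, pct, tier(pct)))
    # Highest scores first
    return sorted(rows, key=lambda row: -row[1])


if __name__ == "__main__":
    for agreeing, pct, label in tier_table():
        print(f"{agreeing:<26} {pct:>3}%  {label}")
```

Under these weights only three combinations reach the high tier, and all of them need pyannote and ASR to agree plus at least one visual module (Pose or Face). Neither visual module can lift a segment above the low tier without audio support.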

---

## 📈 Expected Results

### Accuracy Improvement

| Scenario | Single-modal | Dual-modal | Tri-modal | Quad-modal |
|------|--------|--------|--------|--------|
| **Two-person conversation** | 85% | 90% | 93% | **95-98%** |
| **Three-person meeting** | 80% | 85% | 90% | **92-95%** |
| **Multi-person meeting** | 75% | 80% | 85% | **88-92%** |
| **Overlapping speech** | 65% | 75% | 80% | **85-90%** |
| **Noisy environment** | 70% | 80% | 85% | **90-93%** |

---

### Processing Time

| Module | Processing time | Parallelizable |
|------|---------|--------|
| **Face** | 65s | ✅ Yes |
| **ASR** | 50s | ✅ Yes |
| **pyannote** | 180s | ❌ Requires audio |
| **Pose** | 30s | ✅ Yes |
| **Integration** | 10s | ❌ Must wait for all modules |
| **Total** | ~190s | (with parallel execution) |
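
The ~190s total appears to come from the slowest module (pyannote, 180s) plus the 10s integration step. A minimal sketch of that arithmetic, assuming all four modules start at the same time and integration starts once the slowest one finishes (the table does not spell out the exact schedule, and the function names are hypothetical):

```python
# Per-module times in seconds, taken from the processing-time table.
MODULE_SECONDS = {"face": 65, "asr": 50, "pyannote": 180, "pose": 30}
INTEGRATION_SECONDS = 10


def sequential_total(modules, integration):
    """Total time when every module runs one after another."""
    return sum(modules.values()) + integration


def parallel_total(modules, integration):
    """Total time when modules run concurrently: slowest module + integration."""
    return max(modules.values()) + integration


if __name__ == "__main__":
    print(sequential_total(MODULE_SECONDS, INTEGRATION_SECONDS))  # 335
    print(parallel_total(MODULE_SECONDS, INTEGRATION_SECONDS))    # 190
```

Because pyannote dominates the critical path, speeding up Face, ASR or Pose does not shorten the parallel total under this assumption.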

---

## 🔧 Implementation Steps

### Stage 1: Install mediapipe (30 minutes)

```bash
# Install mediapipe
pip install mediapipe

# Verify the installation
python3 -c "import mediapipe; print('✅ mediapipe installed')"
```

---

### Stage 2: Create the Pose Lip Detection Module (2 hours)

**File**: `scripts/pose_lip_processor.py`

**Features**:
- MediaPipe Face Mesh
- 468 facial landmarks
- Lip contour detection
- Mouth-opening calculation

**Code skeleton**:
```python
import mediapipe as mp
import cv2

class LipMovementDetector:
    def __init__(self):
        self.face_mesh = mp.solutions.face_mesh.FaceMesh()

    def detect(self, video_path):
        """Detect lip movement and return the frames classified as speaking."""
        cap = cv2.VideoCapture(video_path)
        speaking_segments = []

        while cap.isOpened():
            ret, frame = cap.read()
            if not ret:
                break

            # Run MediaPipe on the frame (it expects RGB; OpenCV yields BGR)
            results = self.face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

            if results.multi_face_landmarks:
                # Compute the mouth opening for the first detected face
                lip_distance = self.calculate_lip_distance(
                    results.multi_face_landmarks[0]
                )

                # Decide whether the person is speaking
                # (0.05 is a fixed threshold in normalized image coordinates,
                # so it depends on how large the face appears in the frame)
                is_speaking = lip_distance > 0.05

                if is_speaking:
                    speaking_segments.append({
                        'timestamp': cap.get(cv2.CAP_PROP_POS_MSEC) / 1000,
                        'lip_distance': lip_distance
                    })

        cap.release()
        return speaking_segments

    def calculate_lip_distance(self, landmarks):
        """Compute the mouth opening in normalized image coordinates."""
        # Face Mesh inner-lip midpoints: 13 (upper lip), 14 (lower lip).
        # The outer-lip midpoints are 0 (upper) and 17 (lower); measuring to
        # 17 would include the lower lip's thickness even with a closed mouth.
        upper_lip = landmarks.landmark[13]
        lower_lip = landmarks.landmark[14]

        return abs(upper_lip.y - lower_lip.y)
```
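
`detect()` returns one entry per speaking frame, while the integrator works on time intervals. One possible way to bridge the two is to merge consecutive timestamps into `{start, end}` intervals. The sketch below is an assumption about how that could be done, not part of the planned module; `merge_speaking_frames`, `max_gap` and `min_duration` are hypothetical names, and their default values are untuned placeholders.

```python
def merge_speaking_frames(frames, max_gap=0.3, min_duration=0.2):
    """Group per-frame detections into {start, end} speaking intervals.

    frames: list of {'timestamp': seconds, ...} dicts, as returned by detect().
    max_gap: largest pause (seconds) that still counts as the same interval.
    min_duration: intervals shorter than this (seconds) are discarded.
    """
    segments = []
    for ts in sorted(f["timestamp"] for f in frames):
        if segments and ts - segments[-1]["end"] <= max_gap:
            # Close enough to the previous frame: extend the current interval
            segments[-1]["end"] = ts
        else:
            # Gap too large (or first frame): open a new interval
            segments.append({"start": ts, "end": ts})
    return [s for s in segments if s["end"] - s["start"] >= min_duration]


if __name__ == "__main__":
    demo = [{"timestamp": t} for t in (0.0, 0.1, 0.2, 0.3, 2.0, 5.0, 5.1, 5.2, 5.3)]
    print(merge_speaking_frames(demo))
```

In the demo, the isolated detection at 2.0s is dropped as too short, which is one simple way to suppress single-frame false positives such as a yawn or a brief smile.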

---

### Stage 3: Create the Multimodal Integrator (3 hours)

**File**: `scripts/multimodal_integrator.py`

**Features**:
- Integrates Face + ASR + pyannote + Pose
- Voting mechanism
- Confidence calculation
- Final result output

**Code skeleton**:
```python
import json
from typing import Dict, List

class MultimodalIntegrator:
    # The check_*_evidence, determine_speaker and get_asr_text helpers
    # are still to be implemented.

    def __init__(self):
        self.weights = {
            'pyannote': 0.40,
            'asr': 0.30,
            'pose': 0.20,
            'face': 0.10
        }

    def integrate(self, results: Dict) -> Dict:
        """
        Integrate the results of all modules.

        Args:
            results: {
                'face': face_result,
                'asr': asr_result,
                'pyannote': pyannote_result,
                'pose': pose_result
            }

        Returns:
            integrated_result
        """
        # Use the pyannote timeline as the reference
        segments = []

        for pyannote_seg in results['pyannote']['segments']:
            # Collect evidence
            evidence = self.collect_evidence(results, pyannote_seg)

            # Compute the confidence score
            confidence = self.calculate_confidence(evidence)

            # Decide the speaker
            speaker = self.determine_speaker(evidence, confidence)

            segments.append({
                'start': pyannote_seg['start'],
                'end': pyannote_seg['end'],
                'speaker': speaker,
                'confidence': confidence,
                'text': self.get_asr_text(results['asr'], pyannote_seg),
                'evidence': evidence
            })

        return {
            'segments': segments,
            'num_speakers': len(set(s['speaker'] for s in segments)),
            # Guard against division by zero when pyannote returns no segments
            'avg_confidence': (
                sum(s['confidence'] for s in segments) / len(segments)
                if segments else 0.0
            )
        }

    def collect_evidence(self, results: Dict, segment: Dict) -> Dict:
        """Collect evidence from each module."""
        evidence = {}

        # pyannote evidence
        evidence['pyannote'] = self.check_pyannote_evidence(
            results['pyannote'], segment
        )

        # ASR evidence
        evidence['asr'] = self.check_asr_evidence(
            results['asr'], segment
        )

        # Pose evidence
        evidence['pose'] = self.check_pose_evidence(
            results['pose'], segment
        )

        # Face evidence
        evidence['face'] = self.check_face_evidence(
            results['face'], segment
        )

        return evidence

    def calculate_confidence(self, evidence: Dict) -> float:
        """Compute the confidence score."""
        score = 0.0

        if evidence['pyannote']:
            score += self.weights['pyannote']

        if evidence['asr']:
            score += self.weights['asr']

        if evidence['pose']:
            score += self.weights['pose']

        if evidence['face']:
            score += self.weights['face']

        # Round to absorb float error (0.4 + 0.3 + 0.1 gives 0.7999999999999999)
        return round(score, 2)
```
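
The skeleton leaves the `check_*_evidence` helpers undefined. One plausible reading is that a module "supports" a pyannote segment when its own output overlaps that segment in time. The sketch below shows this for ASR as a standalone function. It is a hypothetical illustration: the `{'segments': [...]}` layout of `asr_result`, the `overlap_seconds` helper and the `min_ratio` value are all assumptions, and ASR segments are assumed not to overlap each other.

```python
def overlap_seconds(a_start, a_end, b_start, b_end):
    """Length of the intersection of two time intervals (0.0 if disjoint)."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))


def check_asr_evidence(asr_result, segment, min_ratio=0.5):
    """Return True when non-empty ASR text covers at least min_ratio of the segment.

    asr_result: assumed to be {'segments': [{'text', 'start', 'end'}, ...]}.
    segment: a pyannote-style {'start', 'end'} dict, times in seconds.
    """
    duration = segment["end"] - segment["start"]
    if duration <= 0:
        return False
    covered = sum(
        overlap_seconds(segment["start"], segment["end"], s["start"], s["end"])
        for s in asr_result.get("segments", [])
        if s.get("text", "").strip()  # ignore empty transcriptions
    )
    return covered / duration >= min_ratio


if __name__ == "__main__":
    asr = {"segments": [{"text": "hello", "start": 9.0, "end": 12.0}]}
    print(check_asr_evidence(asr, {"start": 10.0, "end": 14.0}))
```

The same overlap test could be reused for the Pose and Face helpers by swapping in speaking intervals or face-detection timestamps, though the right coverage ratio for each module would need to be determined during Stage 4 testing.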

---

### Stage 4: Testing and Validation (4 hours)

**Test scripts**:
```bash
# 1. Short-video test
python3 scripts/test_multimodal_short.py

# 2. Long-video test
python3 scripts/test_multimodal_long.py

# 3. Accuracy validation
python3 scripts/validate_multimodal_accuracy.py

# 4. Performance benchmark
python3 scripts/benchmark_performance.py
```

**Test videos**:
- ExaSAN (2.6 minutes, short video)
- Charade 1963 (114 minutes, long video)

**Validation metrics**:
- Accuracy (vs. manual annotation)
- Processing time
- Memory usage
- Confidence distribution
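
The plan does not say how accuracy against manual annotation is measured. One simple option is a time-sampled comparison: sample the timeline at a fixed step and count how often the predicted speaker matches the annotated one. The sketch below assumes both inputs are lists of `{'start', 'end', 'speaker'}` dicts that already share the same speaker labels; `speaker_at`, `frame_accuracy` and the 0.1s step are hypothetical choices.

```python
def speaker_at(segments, t):
    """Return the speaker label active at time t, or None during silence."""
    for seg in segments:
        if seg["start"] <= t < seg["end"]:
            return seg["speaker"]
    return None


def frame_accuracy(predicted, reference, step=0.1):
    """Fraction of sampled instants where predicted and reference agree.

    Samples are taken at the midpoint of each step-sized window up to the
    end of the last reference segment. Silence counts as a label (None).
    """
    if not reference:
        return 0.0
    end = max(seg["end"] for seg in reference)
    n = int(round(end / step))
    if n == 0:
        return 0.0
    hits = sum(
        speaker_at(predicted, (i + 0.5) * step) == speaker_at(reference, (i + 0.5) * step)
        for i in range(n)
    )
    return hits / n


if __name__ == "__main__":
    reference = [{"start": 0, "end": 5, "speaker": "A"}, {"start": 5, "end": 10, "speaker": "B"}]
    predicted = [{"start": 0, "end": 6, "speaker": "A"}, {"start": 6, "end": 10, "speaker": "B"}]
    print(frame_accuracy(predicted, reference))
```

In the demo, the prediction hands over to speaker B one second late, so 10 of the 100 sampled instants disagree. A standard diarization error rate would be a more rigorous metric; this sketch only illustrates the idea of scoring by time rather than by segment count.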

---

### Stage 5: Optimization and Deployment (3 hours)

**Optimization directions**:
1. Parallel processing (Face + ASR + Pose)
2. Batch processing (split long videos into segments)
3. Caching (avoid repeated computation)
4. Memory optimization
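
For the first item, a minimal sketch of running independent processors concurrently with the standard library's `concurrent.futures` might look like the following. The `run_modules_in_parallel` function and the stand-in processors are hypothetical; the real processors would wrap the existing scripts, whose call signatures are not shown in this plan.

```python
from concurrent.futures import ThreadPoolExecutor


def run_modules_in_parallel(video_path, processors, max_workers=4):
    """Run each processor on the same video concurrently.

    processors: mapping of module name -> callable(video_path) -> result.
    Returns {name: result}; an exception raised by any processor propagates.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {
            name: pool.submit(fn, video_path) for name, fn in processors.items()
        }
        # result() blocks until the corresponding processor has finished
        return {name: fut.result() for name, fut in futures.items()}


if __name__ == "__main__":
    # Stand-in processors for illustration only
    stubs = {
        "face": lambda path: {"module": "face", "video": path},
        "asr": lambda path: {"module": "asr", "video": path},
        "pose": lambda path: {"module": "pose", "video": path},
    }
    print(run_modules_in_parallel("video.mp4", stubs))
```

Threads only help when the heavy work releases the GIL (as native OpenCV or PyTorch calls typically do) or runs in a subprocess. If the processors turn out to be CPU-bound in pure Python, `ProcessPoolExecutor` offers the same interface, provided the callables are picklable.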

**Deployment**:
```bash
# Integrated processor
python3 scripts/multimodal_processor.py \
  video.mp4 \
  output.json \
  --face \
  --asr \
  --pyannote \
  --pose
```

---

## 📋 File List

### Existing Files

```
scripts/
├── face_processor.py                  # ✅ Face detection
├── asr_processor_small.py             # ✅ ASR transcription
├── asrx_processor_v2_transcribe.py    # ✅ pyannote transcription
├── pose_processor.py                  # ✅ Pose detection (YOLOv8)
└── integrate_face_asrx.py             # ✅ Face+ASR integration
```

### New Files (to be created)

```
scripts/
├── pose_lip_processor.py              # 🆕 Lip movement detection
├── multimodal_integrator.py           # 🆕 Multimodal integrator
├── multimodal_processor.py            # 🆕 Full processor
├── test_multimodal_short.py           # 🆕 Short-video test
├── test_multimodal_long.py            # 🆕 Long-video test
├── validate_multimodal_accuracy.py    # 🆕 Accuracy validation
└── MULTIMODAL_INTEGRATION_PLAN.md     # 🆕 This plan
```

---

## 📊 Resource Requirements

### Hardware Requirements

| Component | Minimum | Recommended |
|------|---------|---------|
| **CPU** | 4 cores | 8 cores (M4 Mac Mini) |
| **Memory** | 8 GB | 16 GB |
| **Storage** | 10 GB | 50 GB |
| **GPU** | Optional | M4 GPU (acceleration) |

---

### Software Dependencies

```bash
# Core dependencies
mediapipe>=0.9.0
opencv-python>=4.5.0
pyannote.audio>=3.4.0
whisperx>=3.7.0
ultralytics>=8.0.0

# Optional dependencies
torch>=2.5.0
numpy>=1.20.0
```

---

## ✅ Acceptance Criteria

### Functional Acceptance

- [ ] Face detection works correctly
- [ ] ASR transcription is accurate (90%+)
- [ ] pyannote speaker diarization (95%+)
- [ ] Pose lip movement detection (90%+)
- [ ] Multimodal integration works correctly
- [ ] Confidence is calculated correctly

---

### Performance Acceptance

- [ ] Short video processed in < 200 seconds
- [ ] Long video processed at > 5x real time
- [ ] Memory usage < 12 GB
- [ ] Accuracy > 95% (two-person conversation)
- [ ] Accuracy > 90% (multi-person meeting)
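
To make the two timing criteria concrete, the sketch below converts them into budgets for the two test videos. It assumes the real-time criterion means video duration divided by processing time (so higher is faster); that reading, and the function names, are assumptions rather than definitions from the plan.

```python
def speed_factor(video_seconds, processing_seconds):
    """Video duration divided by processing time (higher is faster)."""
    return video_seconds / processing_seconds


def max_processing_seconds(video_seconds, min_speed_factor):
    """Longest processing time that still meets the required speed factor."""
    return video_seconds / min_speed_factor


if __name__ == "__main__":
    short_video = 2.6 * 60  # ExaSAN, 2.6 minutes
    long_video = 114 * 60   # Charade 1963, 114 minutes

    # Speed factor implied by the 200-second budget for the short video
    print(speed_factor(short_video, 200))

    # Time budget, in minutes, for the long video at 5x real time
    print(max_processing_seconds(long_video, 5) / 60)
```

Under this reading, the 200-second limit for the 156-second short video corresponds to roughly 0.78x real time, while the long video must finish in under about 22.8 minutes. The two criteria therefore imply very different throughput, which suggests fixed startup costs such as model loading are expected to dominate on short inputs.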

---

## 🎯 Decision Points

### Implement immediately if:

- ✅ The highest accuracy (95%+) is required
- ✅ Multi-person meetings are common
- ✅ Overlapping speech is common
- ✅ Hardware resources are sufficient
- ✅ Time is available (10-15 hours)

---

### Implement in stages if:

- ⚠️ Time is limited
- ⚠️ Results need to be validated first
- ⚠️ Resources are limited

**Stage 1**: Face + ASR + pyannote (already available)
**Stage 2**: Add Pose lip detection
**Stage 3**: Full integration

---

## 📁 Reference Documents

- `PYANNOTE_AUDIO_GUIDE.md` - pyannote usage guide
- `PYANNOTE_MULTILINGUAL_GUIDE.md` - Multilingual guide
- `PYANNOTE_VS_ASRX_COMPARISON.md` - Comparison of approaches
- `LIP_MOVEMENT_INTEGRATION_PLAN.md` - Lip movement plan
- `ASRX_ALTERNATIVES_FINAL_REPORT.md` - Alternatives report

---

**Plan completed**: 2026-04-02
**Implementation difficulty**: ⭐⭐⭐⭐ (high)
**Estimated time**: 10-15 hours
**Expected accuracy**: 95%+
**Recommendation**: Implement in stages