# Video Chunk Splitting Specification

| Item | Value |
|------|-------|
| Author | Warren |
| Created | 2026-03-16 |
| Document version | V1.0 |

---

## Version History

| Version | Date | Purpose | Operator | Tool/Model |
|---------|------|---------|----------|------------|
| V1.0 | 2026-03-16 | Initial document | Warren | OpenCode / MiniMax M2.5 |

---

This document defines the splitting rules and data structures for video chunks in the Momentry Core system.

---

## 1. Chunk Overview

### 1.1 Design Principles

1. **Overlap allowed**: Chunks of different types may overlap (for example, a sentence chunk and a time-based chunk).
2. **Frame precision**: Time coordinates are precise to the video frame.
3. **Multiple segmentations**: Three splitting methods are supported: sentence, scene, and time.

### 1.2 Chunk Types

| Type | Description | Overlap allowed |
|------|-------------|-----------------|
| **Sentence** | Split by spoken sentence | ✅ May overlap other types |
| **Cut** | Split by scene cut | ✅ May overlap other types |
| **TimeBased** | Split by fixed time length | ✅ May overlap other types |

---

## 2. Time Coordinate System

### 2.1 Time Format

All times are expressed in **seconds** as floating-point numbers, with **microsecond** precision:

```json
{
  "start_time": 10.5,
  "end_time": 15.75
}
```

### 2.2 Frame Calculation

```
frame_number = floor(time_in_seconds * fps)
time_at_frame = frame_number / fps
```

**Example**:

- Video FPS: 24/1 (24 fps)
- Time: 10.5 seconds
- Frame: floor(10.5 * 24) = 252
- Check: 252 / 24 = 10.5 seconds ✅

### 2.3 Frame Information Structure

```json
{
  "start_time": 10.5,
  "start_frame": 252,
  "end_time": 15.75,
  "end_frame": 378,
  "fps": "24/1",
  "fps_value": 24.0
}
```

---

## 3. Three Splitting Methods

### 3.1 Sentence (Sentence Splitting)

**Rules**:

- Based on ASR (speech recognition) results
- Each recognized sentence becomes one chunk
- The text content comes from the ASR output

**Example** (chunk IDs follow the naming rules in Section 5):

```
ASR output:
[
  {"start": 10.0, "end": 15.0, "text": "Hello world"},
  {"start": 15.0, "end": 20.0, "text": "This is a test"},
  {"start": 20.0, "end": 25.5, "text": "Processing video"}
]

Converted to chunks:
  sentence_0000: 10.0s - 15.0s  "Hello world"
  sentence_0001: 15.0s - 20.0s  "This is a test"
  sentence_0002: 20.0s - 25.5s  "Processing video"
```

### 3.2 Cut (Scene Splitting)

**Rules**:

- Based on shot changes in the video (scene change / cut detection)
- Detected with ffmpeg or Python (scenedetect)
- Each scene becomes one chunk

**Detection method**:

```bash
# Detect scene changes with ffmpeg (scene score threshold 0.3)
ffmpeg -i input.mp4 -filter:v "select='gt(scene,0.3)',showinfo" -f null -
```

**Example**:

```
Scene detection result:
[
  {"start": 0.0, "end": 45.2, "scene_id": 1},
  {"start": 45.2, "end": 120.5, "scene_id": 2},
  {"start": 120.5, "end": 180.0, "scene_id": 3}
]

Converted to chunks:
  cut_0000: 0.0s - 45.2s     (Scene 1)
  cut_0001: 45.2s - 120.5s   (Scene 2)
  cut_0002: 120.5s - 180.0s  (Scene 3)
```

### 3.3 TimeBased (Fixed-Length Splitting)

**Rules**:

- Split into fixed-length segments
- The default length is **10 seconds** per chunk
- The last chunk may be shorter than 10 seconds
- **Overlap is supported** (configurable in seconds)

**Parameters**:

| Parameter | Default | Description |
|-----------|---------|-------------|
| duration | 10.0 | Length of each chunk (seconds) |
| overlap | 0.0 | Overlap between consecutive chunks (seconds) |

**Example** (no overlap):

```
Video length: 35 seconds, duration=10

Chunks:
  time_based_0000: 0.0s - 10.0s
  time_based_0001: 10.0s - 20.0s
  time_based_0002: 20.0s - 30.0s
  time_based_0003: 30.0s - 35.0s  (shorter than 10 s)
```

**Example** (with overlap, overlap=2):

```
Video length: 35 seconds, duration=10, overlap=2

Chunks:
  time_based_0000: 0.0s - 10.0s
  time_based_0001: 8.0s - 18.0s   (2 s overlap)
  time_based_0002: 16.0s - 26.0s  (2 s overlap)
  time_based_0003: 24.0s - 34.0s  (2 s overlap)
  time_based_0004: 32.0s - 35.0s  (overlap + shorter than 10 s)
```

---

## 4. Chunk Data Structure

### 4.1 Basic Structure

```json
{
  "uuid": "1636719dc31f78ac",
  "chunk_id": "sentence_0001",
  "chunk_index": 1,
  "chunk_type": "sentence",
  "start_time": 10.5,
  "start_frame": 252,
  "end_time": 15.75,
  "end_frame": 378,
  "fps": "24/1",
  "fps_value": 24.0,
  "content": {
    "text": "Hello world, this is a test"
  },
  "metadata": {
    "source": "asr",
    "confidence": 0.95,
    "language": "en"
  }
}
```

### 4.2 Field Reference

| Field | Type | Required | Description |
|-------|------|----------|-------------|
| `uuid` | String | ✅ | Video UUID (16 characters) |
| `chunk_id` | String | ✅ | Chunk ID, unique within a video |
| `chunk_index` | Integer | ✅ | Chunk index (starts at 0) |
| `chunk_type` | String | ✅ | Type: sentence/cut/time_based |
| `start_time` | Float | ✅ | Start time (seconds) |
| `start_frame` | Integer | ✅ | Start frame number |
| `end_time` | Float | ✅ | End time (seconds) |
| `end_frame` | Integer | ✅ | End frame number |
| `fps` | String | ✅ | FPS as a rational string (e.g. "24/1") |
| `fps_value` | Float | ✅ | FPS as a number (e.g. 24.0) |
| `content` | Object | ✅ | Content (see Section 4.3) |
| `metadata` | Object | ❌ | Additional information (see Section 4.4) |

### 4.3 Content Structure

The structure of `content` depends on `chunk_type`.

#### Sentence Content

```json
{
  "content": {
    "text": "Hello world, this is a test message",
    "text_normalized": "hello world this is a test message",
    "word_count": 7,
    "char_count": 34
  }
}
```

| Field | Type | Description |
|-------|------|-------------|
| `text` | String | Original recognized text |
| `text_normalized` | String | Normalized text (lowercase, punctuation removed) |
| `word_count` | Integer | Number of words |
| `char_count` | Integer | Number of characters (34 in the example, which matches `text_normalized`) |

#### Cut Content

```json
{
  "content": {
    "scene_id": 2,
    "scene_number": 2,
    "transition_type": "cut",
    "scene_change_score": 0.95
  }
}
```

| Field | Type | Description |
|-------|------|-------------|
| `scene_id` | Integer | Scene ID |
| `scene_number` | Integer | Scene number |
| `transition_type` | String | Transition type: cut/dissolve/fade |
| `scene_change_score` | Float | Scene change score (0-1) |

#### TimeBased Content

```json
{
  "content": {
    "duration": 10.0,
    "is_last": false,
    "segment_number": 3,
    "total_segments": 10
  }
}
```

| Field | Type | Description |
|-------|------|-------------|
| `duration` | Float | Length (seconds) |
| `is_last` | Boolean | Whether this is the last chunk |
| `segment_number` | Integer | Segment number |
| `total_segments` | Integer | Total number of segments |

### 4.4 Metadata Structure

```json
{
  "metadata": {
    "source": "asr",
    "confidence": 0.95,
    "language": "en",
    "model": "tiny",
    "created_at": "2026-03-16T10:00:00Z"
  }
}
```

| Field | Type | Description |
|-------|------|-------------|
| `source` | String | Source: asr/scene_detect/time_based |
| `confidence` | Float | Confidence (0-1) |
| `language` | String | Language code |
| `model` | String | Model used |
| `created_at` | String | Creation time (ISO 8601) |

---

## 5. Chunk ID Naming Rules

### 5.1 Format

```
{chunk_type}_{chunk_index:04}
```

| Type | Prefix | Example |
|------|--------|---------|
| Sentence | `sentence_` | `sentence_0001` |
| Cut | `cut_` | `cut_0001` |
| TimeBased | `time_based_` | `time_based_0001` |

### 5.2 Numbering Rules

- Numbering starts at **0**
- Indexes are zero-padded to **4 digits**
- Indexes increase in time order

---

## 6. Database Schema

### 6.1 PostgreSQL Table

```sql
CREATE TABLE chunks (
    id BIGSERIAL PRIMARY KEY,
    uuid VARCHAR(16) NOT NULL,
    chunk_id VARCHAR(64) NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_type VARCHAR(32) NOT NULL,
    start_time DOUBLE PRECISION NOT NULL,
    start_frame BIGINT NOT NULL,
    end_time DOUBLE PRECISION NOT NULL,
    end_frame BIGINT NOT NULL,
    fps VARCHAR(16) NOT NULL,
    fps_value DOUBLE PRECISION NOT NULL,
    content JSONB NOT NULL,
    metadata JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(uuid, chunk_id)
);

-- Indexes
CREATE INDEX idx_chunks_uuid ON chunks(uuid);
CREATE INDEX idx_chunks_type ON chunks(chunk_type);
CREATE INDEX idx_chunks_time ON chunks(start_time, end_time);
CREATE INDEX idx_chunks_uuid_type ON chunks(uuid, chunk_type);
```

### 6.2 Query Examples

```sql
-- All chunks of a video
SELECT * FROM chunks WHERE uuid = '1636719dc31f78ac';

-- Chunks of a specific type
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac' AND chunk_type = 'sentence';

-- Chunks overlapping the 20-30 s time range
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
  AND start_time <= 30.0 AND end_time >= 20.0;

-- All chunks in the time range (mixed types), grouped by type
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
  AND start_time <= 30.0 AND end_time >= 20.0
ORDER BY chunk_type, chunk_index;
```

---

## 7. Rust Data Structures

### 7.1 Chunk Definition

```rust
use serde::{Deserialize, Serialize};

#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum ChunkType {
    Sentence,
    Cut,
    TimeBased,
}

impl ChunkType {
    /// String form used in `chunk_id` and in the `chunk_type` column.
    pub fn as_str(&self) -> &'static str {
        match self {
            ChunkType::Sentence => "sentence",
            ChunkType::Cut => "cut",
            ChunkType::TimeBased => "time_based",
        }
    }
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk {
    pub uuid: String,
    pub chunk_id: String,
    pub chunk_index: u32,
    pub chunk_type: ChunkType,
    pub start_time: f64,
    pub start_frame: i64,
    pub end_time: f64,
    pub end_frame: i64,
    pub fps: String,
    pub fps_value: f64,
    pub content: serde_json::Value,
    pub metadata: Option<serde_json::Value>,
    /// Vector ID; read by the storage examples in Section 10.
    pub vector_id: Option<String>,
}
```

### 7.2 Creating a Chunk

```rust
impl Chunk {
    pub fn new(
        uuid: String,
        chunk_index: u32,
        chunk_type: ChunkType,
        start_time: f64,
        end_time: f64,
        fps: &str,
        content: serde_json::Value,
    ) -> Self {
        // `parse_fps` is a helper that is not defined in this document;
        // it is expected to turn a string such as "24/1" into 24.0.
        let fps_value = parse_fps(fps);
        // frame_number = floor(time_in_seconds * fps), see Section 2.2
        let start_frame = (start_time * fps_value).floor() as i64;
        let end_frame = (end_time * fps_value).floor() as i64;
        let chunk_id = format!("{}_{:04}", chunk_type.as_str(), chunk_index);

        Self {
            uuid,
            chunk_id,
            chunk_index,
            chunk_type,
            start_time,
            start_frame,
            end_time,
            end_frame,
            fps: fps.to_string(),
            fps_value,
            content,
            metadata: None,
            vector_id: None,
        }
    }
}
```

---

## 8. Time Splitter Implementation

### 8.1 TimeBasedSplitter

```rust
pub struct TimeBasedSplitter {
    pub duration: f64, // length of each chunk (seconds)
    pub overlap: f64,  // overlap between consecutive chunks (seconds)
}

impl TimeBasedSplitter {
    pub fn new(duration: f64, overlap: f64) -> Self {
        // A step of zero or less would make `split` loop forever.
        assert!(
            duration > 0.0 && overlap >= 0.0 && overlap < duration,
            "require 0 <= overlap < duration"
        );
        Self { duration, overlap }
    }

    pub fn split(&self, uuid: &str, video_duration: f64, fps: f64) -> Vec<Chunk> {
        let mut chunks = Vec::new();
        let step = self.duration - self.overlap;
        let mut current_time = 0.0;
        let mut index: u32 = 0;

        while current_time < video_duration {
            // Clamp the last chunk to the end of the video.
            let end_time = (current_time + self.duration).min(video_duration);

            let chunk = Chunk::new(
                uuid.to_string(),
                index,
                ChunkType::TimeBased,
                current_time,
                end_time,
                // Note: this truncates fractional rates (29.97 becomes "29/1").
                &format!("{}/1", fps as u32),
                serde_json::json!({
                    "duration": end_time - current_time,
                    "is_last": end_time >= video_duration,
                    "segment_number": index + 1,
                }),
            );

            chunks.push(chunk);
            current_time += step;
            index += 1;
        }

        chunks
    }
}
```

### 8.2 Usage Example

```rust
// Time splitter: 10 s chunks, no overlap
let splitter = TimeBasedSplitter::new(10.0, 0.0);
let chunks = splitter.split(&uuid, video_duration, 24.0);

// Time splitter: 10 s chunks, 2 s overlap
let splitter = TimeBasedSplitter::new(10.0, 2.0);
let chunks = splitter.split(&uuid, video_duration, 24.0);
```

---

## 9. Processing Flow

### 9.1 Full Flow

```
1. Register (register the video)
   └── Obtain UUID, video_duration, fps

2. Probe (probe the video)
   └── Obtain streams, format, fps

3. Generate Sentence Chunks
   └── Read the ASR output
   └── Create one chunk per segment

4. Generate Cut Chunks
   └── Run scene detection
   └── Create one chunk per scene

5. Generate TimeBased Chunks
   └── Use TimeBasedSplitter
   └── Create one chunk per time segment

6. Store in the database
   └── Batch write to PostgreSQL
```

### 9.2 Output Example

Frame counts are the chunk length multiplied by the FPS.

```
Video: 35 seconds, FPS: 24

Sentence Chunks (3):
  sentence_0000: 0.0s - 10.0s   (240 frames)
  sentence_0001: 10.0s - 20.0s  (240 frames)
  sentence_0002: 20.0s - 35.0s  (360 frames)

Cut Chunks (3):
  cut_0000: 0.0s - 15.0s   (360 frames)
  cut_0001: 15.0s - 28.0s  (312 frames)
  cut_0002: 28.0s - 35.0s  (168 frames)

TimeBased Chunks (5, 2 s overlap):
  time_based_0000: 0.0s - 10.0s   (240 frames)
  time_based_0001: 8.0s - 18.0s   (240 frames)
  time_based_0002: 16.0s - 26.0s  (240 frames)
  time_based_0003: 24.0s - 34.0s  (240 frames)
  time_based_0004: 32.0s - 35.0s  (72 frames)
```

---

## 10. Database Storage

### 10.1 PostgreSQL Storage

#### Table Schema

This extends the schema in Section 6.1 with `vector_id` and `updated_at`.

```sql
CREATE TABLE chunks (
    id BIGSERIAL PRIMARY KEY,
    uuid VARCHAR(16) NOT NULL,
    chunk_id VARCHAR(64) NOT NULL,
    chunk_index INTEGER NOT NULL,
    chunk_type VARCHAR(32) NOT NULL,
    start_time DOUBLE PRECISION NOT NULL,
    start_frame BIGINT NOT NULL,
    end_time DOUBLE PRECISION NOT NULL,
    end_frame BIGINT NOT NULL,
    fps VARCHAR(16) NOT NULL,
    fps_value DOUBLE PRECISION NOT NULL,
    content JSONB NOT NULL,
    metadata JSONB,
    vector_id VARCHAR(64),
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    UNIQUE(uuid, chunk_id)
);

-- Indexes
CREATE INDEX idx_chunks_uuid ON chunks(uuid);
CREATE INDEX idx_chunks_type ON chunks(chunk_type);
CREATE INDEX idx_chunks_time ON chunks(start_time, end_time);
CREATE INDEX idx_chunks_uuid_type ON chunks(uuid, chunk_type);
CREATE INDEX idx_chunks_vector_id ON chunks(vector_id);
```

#### Storage Example

```rust
pub async fn store_chunk_to_postgres(db: &PostgresDb, chunk: &Chunk) -> Result<()> {
    // PostgreSQL bind parameters are numbered ($1, $2, ...), not `?`.
    sqlx::query!(
        r#"
        INSERT INTO chunks (
            uuid, chunk_id, chunk_index, chunk_type,
            start_time, start_frame, end_time, end_frame,
            fps, fps_value, content, metadata, vector_id
        ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13)
        ON CONFLICT (uuid, chunk_id) DO UPDATE SET
            content = EXCLUDED.content,
            metadata = EXCLUDED.metadata,
            vector_id = EXCLUDED.vector_id,
            updated_at = NOW()
        "#,
        chunk.uuid,
        chunk.chunk_id,
        chunk.chunk_index as i32,
        chunk.chunk_type.as_str(),
        chunk.start_time,
        chunk.start_frame,
        chunk.end_time,
        chunk.end_frame,
        chunk.fps,
        chunk.fps_value,
        serde_json::to_value(&chunk.content)?,
        serde_json::to_value(&chunk.metadata)?,
        chunk.vector_id,
    )
    .execute(&db.pool)
    .await?;

    Ok(())
}
```

---

### 10.2 MongoDB Storage

#### Collection Schema

```javascript
// chunks collection
{
  _id: ObjectId,
  uuid: "1636719dc31f78ac",
  chunk_id: "sentence_0001",
  chunk_index: 1,
  chunk_type: "sentence",
  start_time: 10.5,
  start_frame: 252,
  end_time: 15.75,
  end_frame: 378,
  fps: "24/1",
  fps_value: 24.0,
  content: {
    text: "Hello world, this is a test",
    text_normalized: "hello world this is a test",
    word_count: 6,
    char_count: 26
  },
  metadata: {
    source: "asr",
    confidence: 0.95,
    language: "en"
  },
  vector_id: "vec_sentence_0001",
  created_at: ISODate("2026-03-16T10:00:00Z"),
  updated_at: ISODate("2026-03-16T10:00:00Z")
}

// Indexes
db.chunks.createIndex({ uuid: 1 })
db.chunks.createIndex({ chunk_type: 1 })
db.chunks.createIndex({ start_time: 1, end_time: 1 })
db.chunks.createIndex({ vector_id: 1 })
db.chunks.createIndex({ uuid: 1, chunk_type: 1 })
```

#### Storage Example

```rust
use mongodb::{bson, options::UpdateOptions};

pub async fn store_chunk_to_mongodb(db: &MongoDb, chunk: &Chunk) -> Result<()> {
    let now = chrono::Utc::now(); // needs bson's chrono support enabled
    let doc = bson::doc! {
        // Clone the strings: `chunk` is borrowed, so fields cannot be moved out.
        "uuid": chunk.uuid.clone(),
        "chunk_id": chunk.chunk_id.clone(),
        "chunk_index": chunk.chunk_index as i64,
        "chunk_type": chunk.chunk_type.as_str(),
        "start_time": chunk.start_time,
        "start_frame": chunk.start_frame,
        "end_time": chunk.end_time,
        "end_frame": chunk.end_frame,
        "fps": chunk.fps.clone(),
        "fps_value": chunk.fps_value,
        "content": bson::to_bson(&chunk.content)?,
        "metadata": bson::to_bson(&chunk.metadata)?,
        "vector_id": chunk.vector_id.clone(),
        "created_at": now,
        "updated_at": now
    };

    let collection = db.database("momentry").collection::<bson::Document>("chunks");
    // Upsert keyed on (uuid, chunk_id), matching the PostgreSQL unique constraint.
    collection.update_one(
        bson::doc! { "uuid": &chunk.uuid, "chunk_id": &chunk.chunk_id },
        bson::doc! { "$set": doc },
        UpdateOptions::builder().upsert(true).build(),
    ).await?;

    Ok(())
}
```

---

## 11. Vector Storage Design

### 11.1 Design Principles

**A single vector ID format** keeps Qdrant and PostgreSQL compatible:

```
{chunk_type}_{chunk_index:04}

Examples:
  sentence_0001
  cut_0002
  time_based_0015
```

**Note**: Qdrant itself accepts only unsigned 64-bit integers or UUIDs as point IDs, so when writing to Qdrant this string has to be mapped to one of those forms and kept in the payload as `chunk_id`. The string is also unique only within one video, so it must be combined with `uuid` to identify a chunk across videos.

### 11.2 Qdrant Collection

#### Creating the Collection

```bash
# Create the collection through the Qdrant REST API
# (the API key is read from the environment instead of being hard-coded)
curl -X PUT http://localhost:6333/collections/chunks \
  -H "Content-Type: application/json" \
  -H "api-key: $QDRANT_API_KEY" \
  -d '{
    "vectors": {
      "size": 768,
      "distance": "Cosine"
    }
  }'
```

#### Point Structure

```json
{
  "id": "sentence_0001",
  "vector": [0.123, -0.456, ...],
  "payload": {
    "uuid": "1636719dc31f78ac",
    "chunk_id": "sentence_0001",
    "chunk_type": "sentence",
    "chunk_index": 1,
    "start_time": 10.5,
    "end_time": 15.75,
    "text": "Hello world, this is a test",
    "metadata": {
      "confidence": 0.95,
      "language": "en"
    }
  }
}
```

#### Rust Structure

```rust
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VectorPoint {
    pub id: String,
    pub vector: Vec<f32>,
    pub payload: VectorPayload,
}

#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VectorPayload {
    pub uuid: String,
    pub chunk_id: String,
    pub chunk_type: String,
    pub chunk_index: u32,
    pub start_time: f64,
    pub end_time: f64,
    // Type-specific fields: only the one matching `chunk_type` is set.
    #[serde(skip_serializing_if = "Option::is_none")]
    pub text: Option<String>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub scene_id: Option<i32>,
    #[serde(skip_serializing_if = "Option::is_none")]
    pub segment_number: Option<u32>,
    pub metadata: Option<serde_json::Value>,
}
```

### 11.3 PostgreSQL Vector Storage

#### Table Schema

```sql
-- Uses the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;

CREATE TABLE chunk_vectors (
    id BIGSERIAL PRIMARY KEY,
    vector_id VARCHAR(64) NOT NULL UNIQUE,
    uuid VARCHAR(16) NOT NULL,
    chunk_id VARCHAR(64) NOT NULL,
    chunk_type VARCHAR(32) NOT NULL,
    chunk_index INTEGER NOT NULL,
    start_time DOUBLE PRECISION NOT NULL,
    end_time DOUBLE PRECISION NOT NULL,
    embedding vector(768) NOT NULL,
    metadata JSONB,
    created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
    FOREIGN KEY (uuid, chunk_id) REFERENCES chunks(uuid, chunk_id)
);

-- Vector search index (IVFFlat)
CREATE INDEX idx_chunk_vectors_embedding ON chunk_vectors
    USING ivfflat (embedding vector_cosine_ops) WITH (lists = 100);

-- Lookup indexes
CREATE INDEX idx_chunk_vectors_uuid ON chunk_vectors(uuid);
CREATE INDEX idx_chunk_vectors_type ON chunk_vectors(chunk_type);
```

#### Storage Example

```rust
pub async fn store_vector_to_postgres(db: &PostgresDb, point: &VectorPoint) -> Result<()> {
    // PostgreSQL bind parameters are numbered ($1, $2, ...), not `?`.
    sqlx::query!(
        r#"
        INSERT INTO chunk_vectors (
            vector_id, uuid, chunk_id, chunk_type, chunk_index,
            start_time, end_time, embedding, metadata
        ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
        ON CONFLICT (vector_id) DO UPDATE SET
            embedding = EXCLUDED.embedding,
            metadata = EXCLUDED.metadata
        "#,
        point.id,
        point.payload.uuid,
        point.payload.chunk_id,
        point.payload.chunk_type,
        point.payload.chunk_index as i32,
        point.payload.start_time,
        point.payload.end_time,
        // Shown schematically: binding a pgvector column usually needs a
        // wrapper type rather than a bare Vec<f32>.
        point.vector,
        serde_json::to_value(&point.payload.metadata)?,
    )
    .execute(&db.pool)
    .await?;

    Ok(())
}
```

---

## 12. Query Examples

### 12.1 Semantic Search

#### Query Type 1: Similar-Text Search

```rust
// Search for chunks similar to the query.
// `QdrantDb`, `Filter`, `Condition`, and `embed_text` are project-level helpers
// that are not defined in this document. `SearchHit` is a placeholder name for
// the result type (lost from the source); it exposes `score` and `payload`.
pub async fn semantic_search(
    qdrant: &QdrantDb,
    query: &str,
    limit: usize,
) -> Result<Vec<SearchHit>> {
    // 1. Embed the query
    let query_vector = embed_text(query).await?;

    // 2. Search Qdrant, restricted to sentence chunks
    let results = qdrant.search(
        "chunks",
        &query_vector,
        limit,
        Some(&Filter::must([
            Condition::Match("chunk_type", "sentence"),
        ])),
    ).await?;

    Ok(results)
}

// Usage example
let results = semantic_search(&qdrant, "find segments where someone is speaking", 10).await?;
for r in results {
    println!("{}: {:.3}", r.payload.chunk_id, r.score);
    println!("  Time: {}s - {}s", r.payload.start_time, r.payload.end_time);
    println!("  Text: {:?}", r.payload.text);
}
```

#### Query Type 2: Combined Speech/Text Search

```sql
-- PostgreSQL: sentence chunks containing a keyword, ranked by vector distance.
-- `query_embedding()` stands for a function returning the query's embedding;
-- it is not defined in this document. `<=>` is cosine distance (lower = closer).
SELECT
    c.chunk_id,
    c.chunk_type,
    c.start_time,
    c.end_time,
    c.content->>'text' AS text,
    v.embedding <=> query_embedding('find driving scenes') AS cosine_distance
FROM chunks c
LEFT JOIN chunk_vectors v
    ON c.uuid = v.uuid AND c.chunk_id = v.chunk_id
WHERE c.chunk_type = 'sentence'
  AND c.content->>'text' ILIKE '%car%'
ORDER BY cosine_distance
LIMIT 10;
```

### 12.2 Time Range Search

#### Query Type 3: Specific Time Range

```rust
// Find all chunks that overlap the [start, end] range, e.g. 30-60 seconds
pub async fn search_by_time_range(
    db: &PostgresDb,
    uuid: &str,
    start: f64,
    end: f64,
) -> Result<Vec<Chunk>> {
    let chunks = sqlx::query_as!(
        Chunk,
        r#"
        SELECT * FROM chunks
        WHERE uuid = $1
          AND start_time < $3
          AND end_time > $2
        ORDER BY chunk_type, chunk_index
        "#,
        uuid, start, end
    )
    .fetch_all(&db.pool)
    .await?;

    Ok(chunks)
}

// Usage example
let chunks = search_by_time_range(&db, "1636719dc31f78ac", 30.0, 60.0).await?;
```

```javascript
// MongoDB: time range query
db.chunks.find({
  uuid: "1636719dc31f78ac",
  start_time: { $lt: 60 },
  end_time: { $gt: 30 }
}).sort({ chunk_type: 1, chunk_index: 1 })
```

### 12.3 Hybrid Search

#### Query Type 4: Keywords + Vector Similarity

```rust
// Combine keyword matching with vector similarity.
// `SearchHit` is a placeholder name for the result type (lost from the source).
pub async fn hybrid_search(
    db: &PostgresDb,
    qdrant: &QdrantDb,
    query: &str,
    keywords: &[&str],
    limit: usize,
) -> Result<Vec<SearchHit>> {
    // 1. Vector search (over-fetch so the keyword filter has candidates)
    let query_vector = embed_text(query).await?;
    let vector_results = qdrant.search("chunks", &query_vector, limit * 2, None).await?;

    // 2. Keyword filter: keep hits whose text contains any keyword.
    //    Plain substring match, so no SQL-style `%` wildcards are added.
    let filtered: Vec<_> = vector_results
        .into_iter()
        .filter(|r| match &r.payload.text {
            Some(text) => keywords.iter().any(|k| text.contains(*k)),
            None => false,
        })
        .take(limit)
        .collect();

    Ok(filtered)
}
```

### 12.4 Scene Search

#### Query Type 5: Find a Specific Scene

```sql
-- PostgreSQL: chunks with a specific scene ID
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
  AND chunk_type = 'cut'
  AND (content->>'scene_id')::int = 5;

-- Chunks with a dissolve transition
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
  AND chunk_type = 'cut'
  AND content->>'transition_type' = 'dissolve';
```

### 12.5 Video Summary

#### Query Type 6: Generate a Video Summary

```sql
-- Concatenate all sentences of the video
SELECT string_agg(content->>'text', ' ' ORDER BY start_time) AS full_transcript
FROM chunks
WHERE uuid = '1636719dc31f78ac'
  AND chunk_type = 'sentence'
  AND content->>'text' IS NOT NULL;

-- Aggregate text per scene. Cut chunks carry no text of their own, so each
-- sentence chunk is assigned to the scene that contains its start time.
SELECT
    cut.content->>'scene_id' AS scene,
    string_agg(s.content->>'text', ' ' ORDER BY s.start_time) AS scene_text
FROM chunks cut
JOIN chunks s
    ON s.uuid = cut.uuid
   AND s.chunk_type = 'sentence'
   AND s.start_time >= cut.start_time
   AND s.start_time < cut.end_time
WHERE cut.uuid = '1636719dc31f78ac'
  AND cut.chunk_type = 'cut'
GROUP BY cut.content->>'scene_id'
ORDER BY MIN(cut.start_time);
```

### 12.6 Common Query Patterns

| Query type | Description | Database | SQL/Code |
|------------|-------------|----------|----------|
| Semantic search | Find similar content | Qdrant | `search(vector, limit)` |
| Keyword search | Text matching | PostgreSQL | `ILIKE '%keyword%'` |
| Time range | Specific time span | PostgreSQL / MongoDB | `start_time < end AND end_time > start` |
| Scene search | Specific shot | PostgreSQL | `scene_id = N` |
| Hybrid search | Vector + keywords | Qdrant + PostgreSQL | Semantic search combined with keyword search |
| Summary generation | Concatenate text | PostgreSQL | `string_agg()` |

---

## 13. Database Selection Guidance

### 13.1 Storage Strategy

| Data type | Primary store | Backup/Query | Notes |
|-----------|---------------|--------------|-------|
| **Chunk metadata** | PostgreSQL | MongoDB | Mainly structured queries |
| **Vector data** | Qdrant | PostgreSQL | Mainly vector search |
| **Full-text search** | PostgreSQL | - | Keyword search |
| **Logs/History** | MongoDB | - | Flexibility first |

### 13.2 Read/Write Patterns

| Scenario | Write | Read |
|----------|-------|------|
| **Video processing** | PostgreSQL + Qdrant | - |
| **Semantic search** | - | Qdrant |
| **Timeline browsing** | - | PostgreSQL |
| **System analytics** | MongoDB | MongoDB |

---

## 14. Related Documents

- [JSON_OUTPUT_SPEC.md](./JSON_OUTPUT_SPEC.md) - JSON output specification
- [RUST_DEVELOPMENT.md](./RUST_DEVELOPMENT.md) - Rust development guidelines
- [AGENTS.md](../AGENTS.md) - Development guidelines
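
---

## 15. Appendix: FPS Parsing and Frame Conversion Sketch

Section 7.2 calls a `parse_fps` helper that this document never defines, and Section 2.2 gives the frame formulas only as pseudocode. The following is a minimal sketch of both, not the project's actual implementation. It assumes the `fps` string is either a rational such as `"24/1"` or `"30000/1001"`, or a plain number such as `"25"`. Returning `Option<f64>` is a choice made here so that malformed input is visible to the caller; the call in Section 7.2 treats the result as a plain `f64`, so it would need to handle the `None` case. The names `time_to_frame` and `frame_to_time` are hypothetical.

```rust
/// Parse a frame-rate string such as "24/1", "30000/1001" or "25".
/// Returns `None` for malformed input or a rate that is not a positive,
/// finite number (this also covers a zero denominator).
pub fn parse_fps(fps: &str) -> Option<f64> {
    let fps = fps.trim();
    let value: f64 = match fps.split_once('/') {
        // Rational form: numerator / denominator
        Some((num, den)) => {
            let num = num.trim().parse::<f64>().ok()?;
            let den = den.trim().parse::<f64>().ok()?;
            num / den
        }
        // Plain number form
        None => fps.parse::<f64>().ok()?,
    };
    if value.is_finite() && value > 0.0 {
        Some(value)
    } else {
        None
    }
}

/// frame_number = floor(time_in_seconds * fps)   (Section 2.2)
pub fn time_to_frame(time_in_seconds: f64, fps: f64) -> i64 {
    (time_in_seconds * fps).floor() as i64
}

/// time_at_frame = frame_number / fps   (Section 2.2)
pub fn frame_to_time(frame_number: i64, fps: f64) -> f64 {
    frame_number as f64 / fps
}
```

With the Section 2.2 example values, `parse_fps("24/1")` yields 24.0 and `time_to_frame(10.5, 24.0)` yields frame 252, which `frame_to_time` maps back to 10.5 seconds. Keeping the rational string and parsing it, rather than storing only a rounded number, lets fractional rates such as 30000/1001 be represented without truncation.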