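Video Chunk Splitting Specification
| Item | Value |
|---|---|
| Author | Warren |
| Created | 2026-03-16 |
| Document version | V1.0 |
Version History
| Version | Date | Purpose | Editor | Tool/Model |
|---|---|---|---|---|
| V1.0 | 2026-03-16 | Initial document | Warren | OpenCode / MiniMax M2.5 |
This document defines the splitting rules and data structures for video chunks in the Momentry Core system.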
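1. Chunk Overview
1.1 Design Principles
- Overlap allowed: chunks of different types may overlap (for example, a sentence chunk and a time-based chunk)
- Frame accuracy: time coordinates are accurate to the video frame
- Multiple classifications: three splitting methods are supported: sentence, scene, and time
1.2 Chunk Types
| Type | Description | Overlap |
|---|---|---|
| Sentence | Split by spoken sentence | ✅ May overlap with other types |
| Cut | Split by scene cut | ✅ May overlap with other types |
| TimeBased | Split by fixed duration | ✅ May overlap with other types |
2. Time Coordinate System
2.1 Time Format
All times are expressed in seconds as floating-point values with microsecond precision: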
{
"start_time": 10.5,
"end_time": 15.75
}
2.2 Frame Calculation
frame_number = floor(time_in_seconds * fps)
time_at_frame = frame_number / fps
Example:
- Video FPS: 24/1 (24 fps)
- Time: 10.5 s
- Frame: floor(10.5 * 24) = 252
- Check: 252 / 24 = 10.5 s ✅
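The two formulas above can be sketched as a pair of Rust functions. The function names are illustrative and not part of the Momentry Core code; the sketch assumes non-negative timestamps and a positive frame rate.

```rust
/// Convert a timestamp in seconds to a frame number: floor(time * fps).
pub fn time_to_frame(time_in_seconds: f64, fps: f64) -> i64 {
    (time_in_seconds * fps).floor() as i64
}

/// Convert a frame number back to the time at which that frame starts.
pub fn frame_to_time(frame_number: i64, fps: f64) -> f64 {
    frame_number as f64 / fps
}
```

A timestamp that does not fall on a frame boundary is rounded down to the frame that contains it, so converting back with frame_to_time returns the start of that frame rather than the original timestamp.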
2.3 Frame Information Structure
{
"start_time": 10.5,
"start_frame": 252,
"end_time": 15.75,
"end_frame": 378,
"fps": "24/1",
"fps_value": 24.0
}
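3. Three Splitting Methods
3.1 Sentence (Sentence Splitting)
Rules:
- Based on ASR (automatic speech recognition) results
- Each recognized sentence becomes one chunk
- The text content comes from the ASR output
Example:
ASR output: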
[
{"start": 10.0, "end": 15.0, "text": "Hello world"},
{"start": 15.0, "end": 20.0, "text": "This is a test"},
{"start": 20.0, "end": 25.5, "text": "Processing video"}
]
Converted to chunks:
┌────────────────────────────────────────┐
│ chunk_0001: 10.0s - 15.0s "Hello world" │
├────────────────────────────────────────┤
│ chunk_0002: 15.0s - 20.0s "This is a test" │
├────────────────────────────────────────┤
│ chunk_0003: 20.0s - 25.5s "Processing video" │
└────────────────────────────────────────┘
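The conversion above can be sketched as a one-to-one mapping from ASR segments to chunks. The types and the function name below are hypothetical and kept free of external crates; the sketch assumes the ID format of section 5.1 with a zero-based index, so the first chunk is sentence_0000.

```rust
/// One ASR segment, mirroring the "start" / "end" / "text" fields of the ASR output.
#[derive(Debug, Clone)]
pub struct AsrSegment {
    pub start: f64,
    pub end: f64,
    pub text: String,
}

/// Minimal sentence chunk used only for this illustration.
#[derive(Debug, Clone, PartialEq)]
pub struct SentenceChunk {
    pub chunk_id: String,
    pub chunk_index: u32,
    pub start_time: f64,
    pub end_time: f64,
    pub text: String,
}

/// Map each ASR segment to one chunk, numbering them in input order.
pub fn segments_to_chunks(segments: &[AsrSegment]) -> Vec<SentenceChunk> {
    segments
        .iter()
        .enumerate()
        .map(|(i, seg)| SentenceChunk {
            chunk_id: format!("sentence_{:04}", i),
            chunk_index: i as u32,
            start_time: seg.start,
            end_time: seg.end,
            text: seg.text.clone(),
        })
        .collect()
}
```

The sketch trusts the ASR output to be sorted by start time; a production version would probably sort or validate the segments first.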
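3.2 Cut (Scene Splitting)
Rules:
- Based on shot changes in the video (scene change / cut detection)
- Detected with ffmpeg or Python (scenedetect)
- Each scene becomes one chunk
Detection method:
# Detect scene changes with ffmpeg
ffmpeg -i input.mp4 -filter:v "select='gt(scene,0.3)',showinfo" -f null -
Example:
Scene detection result: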
[
{"start": 0.0, "end": 45.2, "scene_id": 1},
{"start": 45.2, "end": 120.5, "scene_id": 2},
{"start": 120.5, "end": 180.0, "scene_id": 3}
]
Converted to chunks:
┌────────────────────────────────────────┐
│ chunk_0001: 0.0s - 45.2s (Scene 1) │
├────────────────────────────────────────┤
│ chunk_0002: 45.2s - 120.5s (Scene 2) │
├────────────────────────────────────────┤
│ chunk_0003: 120.5s - 180.0s (Scene 3) │
└────────────────────────────────────────┘
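A scene detector typically reports the timestamps at which cuts occur rather than ready-made ranges. A minimal sketch of turning such timestamps into the scene list shown above follows; it assumes the cut times have already been parsed from the detector output, and the Scene type and cuts_to_scenes name are illustrative only.

```rust
/// One detected scene, mirroring the "start" / "end" / "scene_id" fields above.
#[derive(Debug, Clone, PartialEq)]
pub struct Scene {
    pub scene_id: u32,
    pub start: f64,
    pub end: f64,
}

/// Turn cut timestamps into consecutive scenes that cover [0, video_duration].
pub fn cuts_to_scenes(cut_times: &[f64], video_duration: f64) -> Vec<Scene> {
    // Keep only cuts strictly inside the video (this also drops NaN), then sort.
    let mut cuts: Vec<f64> = cut_times
        .iter()
        .copied()
        .filter(|t| *t > 0.0 && *t < video_duration)
        .collect();
    cuts.sort_by(|a, b| a.total_cmp(b));
    cuts.dedup();

    let mut scenes = Vec::with_capacity(cuts.len() + 1);
    let mut start = 0.0;
    for (i, &cut) in cuts.iter().enumerate() {
        scenes.push(Scene { scene_id: i as u32 + 1, start, end: cut });
        start = cut;
    }
    // The final scene runs from the last cut to the end of the video.
    scenes.push(Scene { scene_id: cuts.len() as u32 + 1, start, end: video_duration });
    scenes
}
```

With cuts at 45.2 s and 120.5 s in a 180 s video, this yields the three scenes of the example; a video without any cut yields a single scene spanning the whole duration.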
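3.3 TimeBased (Fixed-Duration Splitting)
Rules:
- Split at a fixed duration
- The default is 10 seconds per chunk
- The last chunk may be shorter than 10 seconds
- Overlap is supported (configurable in seconds)
Parameters:
| Parameter | Default | Description |
|---|---|---|
| duration | 10.0 | Duration of each chunk (seconds) |
| overlap | 0.0 | Overlap between consecutive chunks (seconds); must be smaller than duration |
Example (no overlap):
Video duration: 35 seconds, duration=10
Chunks: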
┌────────────────────────────────────────┐
│ chunk_0001: 0.0s - 10.0s │
├────────────────────────────────────────┤
│ chunk_0002: 10.0s - 20.0s │
├────────────────────────────────────────┤
│ chunk_0003: 20.0s - 30.0s │
├────────────────────────────────────────┤
│ chunk_0004: 30.0s - 35.0s (under 10 s) │
└────────────────────────────────────────┘
Example (with overlap, overlap=2):
Video duration: 35 seconds, duration=10, overlap=2
Chunks:
┌────────────────────────────────────────┐
│ chunk_0001: 0.0s - 10.0s │
├────────────────────────────────────────┤
│ chunk_0002: 8.0s - 18.0s (2 s overlap) │
├────────────────────────────────────────┤
│ chunk_0003: 16.0s - 26.0s (2 s overlap) │
├────────────────────────────────────────┤
│ chunk_0004: 24.0s - 34.0s (2 s overlap) │
├────────────────────────────────────────┤
│ chunk_0005: 32.0s - 35.0s (overlap + short) │
└────────────────────────────────────────┘
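Both examples follow from one piece of arithmetic: consecutive chunks start `duration - overlap` seconds apart, so the number of chunks is the video duration divided by that step, rounded up. A minimal sketch, with hypothetical function names:

```rust
/// Number of time-based chunks, or None when the parameters cannot produce any
/// (non-positive video duration, or overlap >= duration so the step is not positive).
pub fn segment_count(video_duration: f64, duration: f64, overlap: f64) -> Option<u32> {
    let step = duration - overlap;
    if !(step > 0.0 && video_duration > 0.0) {
        return None;
    }
    Some((video_duration / step).ceil() as u32)
}

/// Start and end time of the chunk at a zero-based index.
pub fn segment_bounds(index: u32, video_duration: f64, duration: f64, overlap: f64) -> (f64, f64) {
    let start = f64::from(index) * (duration - overlap);
    // The last chunk is clipped to the end of the video.
    (start, (start + duration).min(video_duration))
}
```

For the 35-second video this gives 4 chunks without overlap (step 10) and 5 chunks with a 2-second overlap (step 8), matching the two diagrams.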
4. Chunk Data Structure
4.1 Basic Structure
{
"uuid": "1636719dc31f78ac",
"chunk_id": "sentence_0001",
"chunk_index": 1,
"chunk_type": "sentence",
"start_time": 10.5,
"start_frame": 252,
"end_time": 15.75,
"end_frame": 378,
"fps": "24/1",
"fps_value": 24.0,
"content": {
"text": "Hello world, this is a test"
},
"metadata": {
"source": "asr",
"confidence": 0.95,
"language": "en"
}
}
4.2 Field Descriptions
| Field | Type | Required | Description |
|---|---|---|---|
| uuid | String | ✅ | Video UUID (16 characters) |
| chunk_id | String | ✅ | Chunk ID, unique within a video |
| chunk_index | Integer | ✅ | Chunk index (starts at 0) |
| chunk_type | String | ✅ | Type: sentence/cut/time_based |
| start_time | Float | ✅ | Start time (seconds) |
| start_frame | Integer | ✅ | Start frame number |
| end_time | Float | ✅ | End time (seconds) |
| end_frame | Integer | ✅ | End frame number |
| fps | String | ✅ | FPS as a rational string (e.g. "24/1") |
| fps_value | Float | ✅ | FPS as a number (e.g. 24.0) |
| content | Object | ✅ | Content (see below) |
| metadata | Object | ❌ | Additional information (see below) |
4.3 Content Structure
The structure of content depends on chunk_type:
Sentence Content
{
"content": {
"text": "Hello world, this is a test message",
"text_normalized": "hello world this is a test message",
"word_count": 7,
"char_count": 34
}
}
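| Field | Type | Description |
|---|---|---|
| text | String | Original recognized text |
| text_normalized | String | Normalized text (lowercase, punctuation removed) |
| word_count | Integer | Number of words |
| char_count | Integer | Number of characters (the sample value 34 matches text_normalized) |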
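The derived fields can be computed from the raw ASR text in one pass. The sketch below is an assumption about how they relate, not the project's implementation: words are taken to be whitespace-delimited (which does not suit CJK text), and char_count is taken over the normalized text because that reading reproduces the sample value of 34.

```rust
/// Sentence content fields as described in the table above.
#[derive(Debug, Clone, PartialEq)]
pub struct SentenceContent {
    pub text: String,
    pub text_normalized: String,
    pub word_count: usize,
    pub char_count: usize,
}

/// Build the content object for one recognized sentence.
pub fn build_sentence_content(text: &str) -> SentenceContent {
    // Lowercase, then drop everything that is neither alphanumeric nor whitespace.
    let stripped: String = text
        .to_lowercase()
        .chars()
        .filter(|c| c.is_alphanumeric() || c.is_whitespace())
        .collect();
    // Splitting and re-joining collapses whitespace left behind by removed punctuation.
    let words: Vec<&str> = stripped.split_whitespace().collect();
    let text_normalized = words.join(" ");
    SentenceContent {
        text: text.to_string(),
        word_count: words.len(),
        char_count: text_normalized.chars().count(),
        text_normalized,
    }
}
```

If char_count is meant to cover the original text instead, only the char_count line changes.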
Cut Content
{
"content": {
"scene_id": 2,
"scene_number": 2,
"transition_type": "cut",
"scene_change_score": 0.95
}
}
| Field | Type | Description |
|---|---|---|
| scene_id | Integer | Scene ID |
| scene_number | Integer | Scene number |
| transition_type | String | Transition type: cut/dissolve/fade |
| scene_change_score | Float | Scene change score (0-1) |
TimeBased Content
{
"content": {
"duration": 10.0,
"is_last": false,
"segment_number": 3,
"total_segments": 10
}
}
| Field | Type | Description |
|---|---|---|
| duration | Float | Duration (seconds) |
| is_last | Boolean | Whether this is the last chunk |
| segment_number | Integer | Segment number |
| total_segments | Integer | Total number of segments |
4.4 Metadata Structure
{
"metadata": {
"source": "asr",
"confidence": 0.95,
"language": "en",
"model": "tiny",
"created_at": "2026-03-16T10:00:00Z"
}
}
| Field | Type | Description |
|---|---|---|
| source | String | Source: asr/scene_detect/time_based |
| confidence | Float | Confidence (0-1) |
| language | String | Language code |
| model | String | Model used |
| created_at | String | Creation time (ISO 8601) |
5. Chunk ID Naming Convention
5.1 Format
{chunk_type}_{chunk_index:04}
| Type | Prefix | Example |
|---|---|---|
| Sentence | sentence_ | sentence_0001 |
| Cut | cut_ | cut_0001 |
| TimeBased | time_based_ | time_based_0001 |
5.2 Numbering Rules
- Indexes start at 0
- Zero-padded to 4 digits
- Increase in chronological order
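The naming rule can be sketched as a formatter with a matching parser. Both function names are illustrative. Note that `{:04}` sets a minimum width, so an index above 9999 simply produces a longer ID.

```rust
/// Build a chunk ID such as "time_based_0015" from a type prefix and an index.
pub fn format_chunk_id(chunk_type: &str, chunk_index: u32) -> String {
    format!("{}_{:04}", chunk_type, chunk_index)
}

/// Split a chunk ID back into (chunk_type, chunk_index); None if it is malformed.
pub fn parse_chunk_id(chunk_id: &str) -> Option<(&str, u32)> {
    // Split at the LAST underscore so that "time_based" keeps its inner underscore.
    let (prefix, digits) = chunk_id.rsplit_once('_')?;
    if !matches!(prefix, "sentence" | "cut" | "time_based") {
        return None;
    }
    // Require at least four characters, all ASCII digits.
    if digits.len() < 4 || !digits.bytes().all(|b| b.is_ascii_digit()) {
        return None;
    }
    digits.parse::<u32>().ok().map(|index| (prefix, index))
}
```

Splitting at the last underscore is the design choice that matters here: splitting at the first one would break the time_based prefix.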
6. Database Schema
6.1 PostgreSQL Table
CREATE TABLE chunks (
id BIGSERIAL PRIMARY KEY,
uuid VARCHAR(16) NOT NULL,
chunk_id VARCHAR(64) NOT NULL,
chunk_index INTEGER NOT NULL,
chunk_type VARCHAR(32) NOT NULL,
start_time DOUBLE PRECISION NOT NULL,
start_frame BIGINT NOT NULL,
end_time DOUBLE PRECISION NOT NULL,
end_frame BIGINT NOT NULL,
fps VARCHAR(16) NOT NULL,
fps_value DOUBLE PRECISION NOT NULL,
content JSONB NOT NULL,
metadata JSONB,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(uuid, chunk_id)
);
-- Indexes
CREATE INDEX idx_chunks_uuid ON chunks(uuid);
CREATE INDEX idx_chunks_type ON chunks(chunk_type);
CREATE INDEX idx_chunks_time ON chunks(start_time, end_time);
CREATE INDEX idx_chunks_uuid_type ON chunks(uuid, chunk_type);
6.2 Query Examples
-- All chunks of a video
SELECT * FROM chunks WHERE uuid = '1636719dc31f78ac';
-- Chunks of a specific type
SELECT * FROM chunks WHERE uuid = '1636719dc31f78ac' AND chunk_type = 'sentence';
-- Chunks that overlap a time range (20 s to 30 s)
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND start_time <= 30.0 AND end_time >= 20.0;
-- All chunks that overlap a time range (mixed types), grouped by type
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND start_time <= 30.0 AND end_time >= 20.0
ORDER BY chunk_type, chunk_index;
7. Rust Data Structures
7.1 Chunk Definition
use serde::{Deserialize, Serialize};
#[derive(Debug, Clone, Copy, Serialize, Deserialize, PartialEq)]
#[serde(rename_all = "snake_case")]
pub enum ChunkType {
Sentence,
Cut,
TimeBased,
}
impl ChunkType {
pub fn as_str(&self) -> &'static str {
match self {
ChunkType::Sentence => "sentence",
ChunkType::Cut => "cut",
ChunkType::TimeBased => "time_based",
}
}
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct Chunk {
pub uuid: String,
pub chunk_id: String,
pub chunk_index: u32,
pub chunk_type: ChunkType,
pub start_time: f64,
pub start_frame: i64,
pub end_time: f64,
pub end_frame: i64,
pub fps: String,
pub fps_value: f64,
pub content: serde_json::Value,
pub metadata: Option<serde_json::Value>,
}
7.2 Creating a Chunk
impl Chunk {
    pub fn new(
        uuid: String,
        chunk_index: u32,
        chunk_type: ChunkType,
        start_time: f64,
        end_time: f64,
        fps: &str,
        content: serde_json::Value,
    ) -> Self {
        // parse_fps is a project helper (not shown in this document) that
        // turns a rational string such as "24/1" into 24.0.
        let fps_value = parse_fps(fps);
        // Frame numbers follow section 2.2: floor(time * fps).
        let start_frame = (start_time * fps_value).floor() as i64;
        let end_frame = (end_time * fps_value).floor() as i64;
        // Chunk ID follows section 5.1: {chunk_type}_{chunk_index:04}.
        let chunk_id = format!("{}_{:04}", chunk_type.as_str(), chunk_index);
        Self {
            uuid,
            chunk_id,
            chunk_index,
            chunk_type,
            start_time,
            start_frame,
            end_time,
            end_frame,
            fps: fps.to_string(),
            fps_value,
            content,
            metadata: None,
        }
    }
}
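Chunk::new relies on a parse_fps helper that this document does not define. One possible shape is sketched below as an assumption: try_parse_fps is a hypothetical fallible variant, and parse_fps wraps it to match the call site above by panicking on malformed input, which a production version would probably replace with proper error handling.

```rust
/// Parse an FPS string such as "24/1", "30000/1001", or "25" into a number.
/// Returns None for malformed input, a zero denominator, or a non-positive rate.
pub fn try_parse_fps(fps: &str) -> Option<f64> {
    let value: f64 = match fps.split_once('/') {
        Some((num, den)) => {
            let num: f64 = num.trim().parse().ok()?;
            let den: f64 = den.trim().parse().ok()?;
            if den == 0.0 {
                return None;
            }
            num / den
        }
        // No slash: accept a plain number.
        None => fps.trim().parse().ok()?,
    };
    if value.is_finite() && value > 0.0 {
        Some(value)
    } else {
        None
    }
}

/// Infallible wrapper with the signature used by Chunk::new; panics on bad input.
pub fn parse_fps(fps: &str) -> f64 {
    try_parse_fps(fps).expect("fps must look like \"24/1\" or \"24\"")
}
```

Keeping the rational string alongside the parsed value, as the chunk structure does, preserves rates such as 30000/1001 exactly.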
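8. Time-Based Splitter Implementation
8.1 TimeBasedSplitter
pub struct TimeBasedSplitter {
    pub duration: f64, // duration of each chunk (seconds)
    pub overlap: f64,  // overlap between consecutive chunks (seconds)
}
impl TimeBasedSplitter {
    pub fn new(duration: f64, overlap: f64) -> Self {
        Self { duration, overlap }
    }
    pub fn split(&self, uuid: &str, video_duration: f64, fps: f64) -> Vec<Chunk> {
        let mut chunks = Vec::new();
        let step = self.duration - self.overlap;
        // overlap >= duration gives a non-positive step, which would never advance.
        if !(step > 0.0) {
            return chunks;
        }
        // Chunk k starts at k * step, so the count is ceil(video_duration / step).
        let total_segments = (video_duration / step).ceil() as u32;
        // Only integer frame rates survive the "{n}/1" form; for rates such as
        // 30000/1001 the probed rational string should be passed through instead.
        let fps_label = format!("{}/1", fps.round() as u32);
        for index in 0..total_segments {
            let current_time = f64::from(index) * step;
            let end_time = (current_time + self.duration).min(video_duration);
            let chunk = Chunk::new(
                uuid.to_string(),
                index,
                ChunkType::TimeBased,
                current_time,
                end_time,
                &fps_label,
                serde_json::json!({
                    "duration": end_time - current_time,
                    "is_last": index + 1 == total_segments,
                    "segment_number": index + 1,
                    "total_segments": total_segments,
                }),
            );
            chunks.push(chunk);
        }
        chunks
    }
}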
8.2 Usage Example
// Create a time-based splitter (10 seconds, no overlap)
let splitter = TimeBasedSplitter::new(10.0, 0.0);
let chunks = splitter.split(&uuid, video_duration, 24.0);
// Create a time-based splitter (10 seconds, 2-second overlap)
let splitter = TimeBasedSplitter::new(10.0, 2.0);
let chunks = splitter.split(&uuid, video_duration, 24.0);
9. Processing Flow
9.1 Full Pipeline
1. Register (register the video)
└── Obtain UUID, video_duration, fps
2. Probe (probe the video)
└── Obtain streams, format, fps
3. Generate Sentence Chunks
└── Read the ASR output
└── Create one chunk per segment
4. Generate Cut Chunks
└── Run scene detection
└── Create one chunk per scene
5. Generate TimeBased Chunks
└── Use TimeBasedSplitter
└── Create one chunk per time window
6. Store in the database
└── Batch write to PostgreSQL
9.2 Sample Output
Video: 35 seconds, FPS: 24
Sentence Chunks (3):
sentence_0000: 0.0s - 10.0s (240 frames)
sentence_0001: 10.0s - 20.0s (240 frames)
sentence_0002: 20.0s - 35.0s (360 frames)
Cut Chunks (3):
cut_0000: 0.0s - 15.0s (360 frames)
cut_0001: 15.0s - 28.0s (312 frames)
cut_0002: 28.0s - 35.0s (168 frames)
TimeBased Chunks (5, 2 s overlap):
time_based_0000: 0.0s - 10.0s (240 frames)
time_based_0001: 8.0s - 18.0s (240 frames)
time_based_0002: 16.0s - 26.0s (240 frames)
time_based_0003: 24.0s - 34.0s (240 frames)
time_based_0004: 32.0s - 35.0s (72 frames)
10. Database Storage
10.1 PostgreSQL Storage
Table Schema (the section 6.1 table extended with vector_id and updated_at)
CREATE TABLE chunks (
id BIGSERIAL PRIMARY KEY,
uuid VARCHAR(16) NOT NULL,
chunk_id VARCHAR(64) NOT NULL,
chunk_index INTEGER NOT NULL,
chunk_type VARCHAR(32) NOT NULL,
start_time DOUBLE PRECISION NOT NULL,
start_frame BIGINT NOT NULL,
end_time DOUBLE PRECISION NOT NULL,
end_frame BIGINT NOT NULL,
fps VARCHAR(16) NOT NULL,
fps_value DOUBLE PRECISION NOT NULL,
content JSONB NOT NULL,
metadata JSONB,
vector_id VARCHAR(64),
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
updated_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
UNIQUE(uuid, chunk_id)
);
-- Indexes
CREATE INDEX idx_chunks_uuid ON chunks(uuid);
CREATE INDEX idx_chunks_type ON chunks(chunk_type);
CREATE INDEX idx_chunks_time ON chunks(start_time, end_time);
CREATE INDEX idx_chunks_uuid_type ON chunks(uuid, chunk_type);
CREATE INDEX idx_chunks_vector_id ON chunks(vector_id);
Storage Example
pub async fn store_chunk_to_postgres(db: &PostgresDb, chunk: &Chunk) -> Result<()> {
    // PostgreSQL uses numbered placeholders ($1, $2, ...), not `?`.
    // chunk.vector_id assumes Chunk has been extended with a vector_id field;
    // the definition in section 7.1 does not include it.
    sqlx::query!(
        r#"
        INSERT INTO chunks (
            uuid, chunk_id, chunk_index, chunk_type,
            start_time, start_frame, end_time, end_frame,
            fps, fps_value, content, metadata, vector_id
        ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9, $10, $11, $12, $13)
        ON CONFLICT (uuid, chunk_id) DO UPDATE SET
            content = EXCLUDED.content,
            metadata = EXCLUDED.metadata,
            vector_id = EXCLUDED.vector_id,
            updated_at = NOW()
        "#,
        chunk.uuid,
        chunk.chunk_id,
        chunk.chunk_index as i32,
        chunk.chunk_type.as_str(),
        chunk.start_time,
        chunk.start_frame,
        chunk.end_time,
        chunk.end_frame,
        chunk.fps,
        chunk.fps_value,
        serde_json::to_value(&chunk.content)?,
        serde_json::to_value(&chunk.metadata)?,
        chunk.vector_id,
    )
    .execute(&db.pool)
    .await?;
    Ok(())
}
10.2 MongoDB Storage
Collection Schema
// chunks collection
{
_id: ObjectId,
uuid: "1636719dc31f78ac",
chunk_id: "sentence_0001",
chunk_index: 1,
chunk_type: "sentence",
start_time: 10.5,
start_frame: 252,
end_time: 15.75,
end_frame: 378,
fps: "24/1",
fps_value: 24.0,
content: {
text: "Hello world, this is a test",
text_normalized: "hello world this is a test",
word_count: 6,
char_count: 26
},
metadata: {
source: "asr",
confidence: 0.95,
language: "en"
},
vector_id: "vec_sentence_0001",
created_at: ISODate("2026-03-16T10:00:00Z"),
updated_at: ISODate("2026-03-16T10:00:00Z")
}
// Indexes
db.chunks.createIndex({ uuid: 1 })
db.chunks.createIndex({ chunk_type: 1 })
db.chunks.createIndex({ start_time: 1, end_time: 1 })
db.chunks.createIndex({ vector_id: 1 })
db.chunks.createIndex({ uuid: 1, chunk_type: 1 })
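Storage Example
pub async fn store_chunk_to_mongodb(db: &MongoDb, chunk: &Chunk) -> Result<()> {
    let now = bson::DateTime::now();
    // Fields rewritten on every upsert. As in the PostgreSQL example, this assumes
    // Chunk carries a vector_id field, which the section 7.1 definition omits.
    let fields = bson::doc! {
        "uuid": chunk.uuid.as_str(),
        "chunk_id": chunk.chunk_id.as_str(),
        "chunk_index": i64::from(chunk.chunk_index),
        "chunk_type": chunk.chunk_type.as_str(),
        "start_time": chunk.start_time,
        "start_frame": chunk.start_frame,
        "end_time": chunk.end_time,
        "end_frame": chunk.end_frame,
        "fps": chunk.fps.as_str(),
        "fps_value": chunk.fps_value,
        // serde_json values are not BSON values; convert them explicitly.
        "content": bson::to_bson(&chunk.content)?,
        "metadata": bson::to_bson(&chunk.metadata)?,
        "vector_id": chunk.vector_id.clone(),
        "updated_at": now,
    };
    let collection = db.database("momentry").collection::<bson::Document>("chunks");
    // Upsert keyed on (uuid, chunk_id); created_at is written only on first insert.
    // The three-argument update_one shown here is the mongodb 2.x driver form.
    collection.update_one(
        bson::doc! { "uuid": chunk.uuid.as_str(), "chunk_id": chunk.chunk_id.as_str() },
        bson::doc! { "$set": fields, "$setOnInsert": { "created_at": now } },
        UpdateOptions::builder().upsert(true).build(),
    ).await?;
    Ok(())
}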
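11. Vector Storage Design
11.1 Design Principles
Use a single vector ID format so that Qdrant and PostgreSQL stay compatible. The ID is unique only within one video, so it must be combined with uuid to identify a vector across videos:
{chunk_type}_{chunk_index:04}
Examples: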
sentence_0001
cut_0002
time_based_0015
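Qdrant accepts only unsigned 64-bit integers or UUIDs as point IDs, so a string such as sentence_0001 cannot be used as the point ID directly and has to be mapped. One option, sketched here as an assumption rather than the project's approach, is to hash the video UUID together with the chunk ID into a u64 and keep the readable ID in the payload. The helper names are hypothetical; FNV-1a is used only because it is short enough to show inline, and any 64-bit hash carries a small collision risk.

```rust
/// 64-bit FNV-1a hash over a byte slice.
pub fn fnv1a_64(bytes: &[u8]) -> u64 {
    const OFFSET_BASIS: u64 = 0xcbf2_9ce4_8422_2325;
    const PRIME: u64 = 0x0000_0100_0000_01b3;
    let mut hash = OFFSET_BASIS;
    for &byte in bytes {
        hash ^= u64::from(byte);
        hash = hash.wrapping_mul(PRIME);
    }
    hash
}

/// Hypothetical helper: derive a numeric point ID from the video UUID and chunk ID.
pub fn point_id(video_uuid: &str, chunk_id: &str) -> u64 {
    // The '/' separator keeps ("ab", "c") and ("a", "bc") from hashing identically.
    let key = format!("{}/{}", video_uuid, chunk_id);
    fnv1a_64(key.as_bytes())
}
```

Because the video UUID is part of the hashed key, the same chunk ID in two different videos maps to two different points.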
11.2 Qdrant Collection
Creating the Collection
# Create the collection through the Qdrant REST API
curl -X PUT http://localhost:6333/collections/chunks \
-H "Content-Type: application/json" \
-H "api-key: $QDRANT_API_KEY" \
-d '{
"vectors": {
"size": 768,
"distance": "Cosine"
}
}'
Point Structure
{
"id": "sentence_0001",
"vector": [0.123, -0.456, ...],
"payload": {
"uuid": "1636719dc31f78ac",
"chunk_id": "sentence_0001",
"chunk_type": "sentence",
"chunk_index": 1,
"start_time": 10.5,
"end_time": 15.75,
"text": "Hello world, this is a test",
"metadata": {
"confidence": 0.95,
"language": "en"
}
}
}
Rust Structures
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VectorPoint {
pub id: String,
pub vector: Vec<f32>,
pub payload: VectorPayload,
}
#[derive(Debug, Clone, Serialize, Deserialize)]
pub struct VectorPayload {
pub uuid: String,
pub chunk_id: String,
pub chunk_type: String,
pub chunk_index: u32,
pub start_time: f64,
pub end_time: f64,
#[serde(skip_serializing_if = "Option::is_none")]
pub text: Option<String>,
#[serde(skip_serializing_if = "Option::is_none")]
pub scene_id: Option<i32>,
#[serde(skip_serializing_if = "Option::is_none")]
pub segment_number: Option<i32>,
pub metadata: Option<serde_json::Value>,
}
11.3 PostgreSQL Vector Storage
Table Schema
-- Uses the pgvector extension
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE chunk_vectors (
id BIGSERIAL PRIMARY KEY,
vector_id VARCHAR(64) NOT NULL UNIQUE,
uuid VARCHAR(16) NOT NULL,
chunk_id VARCHAR(64) NOT NULL,
chunk_type VARCHAR(32) NOT NULL,
chunk_index INTEGER NOT NULL,
start_time DOUBLE PRECISION NOT NULL,
end_time DOUBLE PRECISION NOT NULL,
embedding vector(768) NOT NULL,
metadata JSONB,
created_at TIMESTAMP WITH TIME ZONE DEFAULT NOW(),
FOREIGN KEY (uuid, chunk_id) REFERENCES chunks(uuid, chunk_id)
);
-- Vector search index (IVFFlat)
CREATE INDEX idx_chunk_vectors_embedding
ON chunk_vectors
USING ivfflat (embedding vector_cosine_ops)
WITH (lists = 100);
-- Lookup indexes
CREATE INDEX idx_chunk_vectors_uuid ON chunk_vectors(uuid);
CREATE INDEX idx_chunk_vectors_type ON chunk_vectors(chunk_type);
Storage Example
pub async fn store_vector_to_postgres(db: &PostgresDb, point: &VectorPoint) -> Result<()> {
    // PostgreSQL placeholders are numbered ($1..$9), not `?`. Binding point.vector
    // to the vector(768) column also needs a pgvector-aware type on the Rust side;
    // passing Vec<f32> directly is kept here only as shorthand.
    sqlx::query!(
        r#"
        INSERT INTO chunk_vectors (
            vector_id, uuid, chunk_id, chunk_type, chunk_index,
            start_time, end_time, embedding, metadata
        ) VALUES ($1, $2, $3, $4, $5, $6, $7, $8, $9)
        ON CONFLICT (vector_id) DO UPDATE SET
            embedding = EXCLUDED.embedding,
            metadata = EXCLUDED.metadata
        "#,
        point.id,
        point.payload.uuid,
        point.payload.chunk_id,
        point.payload.chunk_type,
        point.payload.chunk_index as i32,
        point.payload.start_time,
        point.payload.end_time,
        point.vector,
        serde_json::to_value(&point.payload.metadata)?,
    )
    .execute(&db.pool)
    .await?;
    Ok(())
}
12. Query Examples
12.1 Semantic Search
Query type 1: Similar-text search
// Search for chunks similar to a question.
// QdrantDb, embed_text, and SearchResult are assumed to be project-level
// wrappers; they are not defined in this document.
pub async fn semantic_search(
    qdrant: &QdrantDb,
    query: &str,
    limit: usize,
) -> Result<Vec<SearchResult>> {
    // 1. Embed the question
    let query_vector = embed_text(query).await?;
    // 2. Search Qdrant, restricted to sentence chunks
    let results = qdrant.search(
        "chunks",
        &query_vector,
        limit,
        Some(&Filter::must([
            Condition::Match("chunk_type", "sentence"),
        ])),
    ).await?;
    Ok(results)
}
// Usage example
let results = semantic_search(&qdrant, "find segments where someone is speaking", 10).await?;
for r in results {
    println!("{}: {:.3}", r.payload.chunk_id, r.score);
    println!(" Time: {}s - {}s", r.payload.start_time, r.payload.end_time);
    println!(" Text: {:?}", r.payload.text);
}
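Query type 2: Mixed speech/text search
-- PostgreSQL: chunks whose text contains a keyword, ordered by vector distance.
-- query_embedding() stands for a function that returns the embedding of the
-- query text; it is not defined in this document.
-- <=> is pgvector's cosine distance operator, so smaller values are closer.
SELECT
    c.chunk_id,
    c.chunk_type,
    c.start_time,
    c.end_time,
    c.content->>'text' AS text,
    v.embedding <=> query_embedding('find the driving scenes') AS distance
FROM chunks c
LEFT JOIN chunk_vectors v ON c.uuid = v.uuid AND c.chunk_id = v.chunk_id
WHERE c.chunk_type = 'sentence'
  AND c.content->>'text' ILIKE '%car%'
ORDER BY v.embedding <=> query_embedding('find the driving scenes')
LIMIT 10;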
12.2 Time-Range Search
Query type 3: Specific time range
// Find all chunks that overlap the 30-60 second range
pub async fn search_by_time_range(
db: &PostgresDb,
uuid: &str,
start: f64,
end: f64,
) -> Result<Vec<Chunk>> {
let chunks = sqlx::query_as!(
Chunk,
r#"
SELECT * FROM chunks
WHERE uuid = $1
AND start_time < $3
AND end_time > $2
ORDER BY chunk_type, chunk_index
"#,
uuid, start, end
)
.fetch_all(&db.pool)
.await?;
Ok(chunks)
}
// Usage example
let chunks = search_by_time_range(&db, "1636719dc31f78ac", 30.0, 60.0).await?;
// MongoDB: time-range query
db.chunks.find({
uuid: "1636719dc31f78ac",
start_time: { $lt: 60 },
end_time: { $gt: 30 }
}).sort({ chunk_type: 1, chunk_index: 1 })
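Both queries rely on the same interval-overlap test: a chunk matches when it starts before the range ends and ends after the range starts. A minimal in-memory sketch of that predicate follows; the function name is illustrative.

```rust
/// True when the chunk [chunk_start, chunk_end] and the query range
/// [range_start, range_end] share more than a single boundary instant.
pub fn overlaps(chunk_start: f64, chunk_end: f64, range_start: f64, range_end: f64) -> bool {
    chunk_start < range_end && chunk_end > range_start
}
```

With strict comparisons, a chunk that merely touches the range boundary (for example one ending exactly at 30.0 s when the range starts at 30.0 s) is excluded; using `<=` and `>=` instead would include it.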
12.3 Hybrid Search
Query type 4: Text keywords + vector similarity
// Combine keyword matching with vector similarity.
// db is accepted for symmetry with the other queries but is unused here.
pub async fn hybrid_search(
    db: &PostgresDb,
    qdrant: &QdrantDb,
    query: &str,
    keywords: &[&str],
    limit: usize,
) -> Result<Vec<HybridResult>> {
    // 1. Vector search; over-fetch so keyword filtering still leaves enough hits
    let query_vector = embed_text(query).await?;
    let vector_results = qdrant.search("chunks", &query_vector, limit * 2, None).await?;
    // 2. Keyword filter. Keywords are matched as plain substrings, so they must
    //    not be wrapped in SQL LIKE wildcards (%...%), which would never match.
    let filtered: Vec<_> = vector_results.into_iter()
        .filter(|r| match &r.payload.text {
            Some(text) => keywords.iter().any(|k| text.contains(*k)),
            None => false,
        })
        .take(limit)
        .collect();
    Ok(filtered)
}
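The keyword step can be isolated into a small, dependency-free predicate. The sketch below is one possible variant and an assumption, not the project's code: it matches case-insensitively, mirroring the ILIKE behaviour of the SQL examples, and ignores empty keywords, which would otherwise match every text.

```rust
/// True when `text` contains at least one of `keywords`, ignoring case.
pub fn matches_any_keyword(text: &str, keywords: &[&str]) -> bool {
    let haystack = text.to_lowercase();
    keywords
        .iter()
        .map(|k| k.trim().to_lowercase())
        // An empty needle is contained in every string, so skip it.
        .filter(|k| !k.is_empty())
        .any(|k| haystack.contains(&k))
}
```

Used inside the filter closure, this would replace the direct `text.contains` call with `matches_any_keyword(text, keywords)`.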
12.4 Scene Search
Query type 5: Finding a specific scene
-- PostgreSQL: chunks with a specific scene ID
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND chunk_type = 'cut'
AND (content->>'scene_id')::int = 5;
-- Chunks whose transition is a dissolve
SELECT * FROM chunks
WHERE uuid = '1636719dc31f78ac'
AND chunk_type = 'cut'
AND content->>'transition_type' = 'dissolve';
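12.5 Video Summary
Query type 6: Generating a video summary
-- Concatenate all sentences of a video
SELECT
    string_agg(content->>'text', ' ' ORDER BY start_time) AS full_transcript
FROM chunks
WHERE uuid = '1636719dc31f78ac'
  AND chunk_type = 'sentence'
  AND content->>'text' IS NOT NULL;
-- Aggregate text per scene. Cut chunks carry no text of their own, so each
-- sentence chunk is assigned to the scene in which it starts.
SELECT
    s.content->>'scene_id' AS scene,
    string_agg(t.content->>'text', ' ' ORDER BY t.start_time) AS scene_text
FROM chunks s
JOIN chunks t
  ON t.uuid = s.uuid
 AND t.chunk_type = 'sentence'
 AND t.start_time >= s.start_time
 AND t.start_time < s.end_time
WHERE s.uuid = '1636719dc31f78ac'
  AND s.chunk_type = 'cut'
GROUP BY s.content->>'scene_id'
ORDER BY MIN(s.start_time);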
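12.6 Common Query Patterns
| Query type | Description | Database | SQL/Code |
|---|---|---|---|
| Semantic search | Find similar content | Qdrant | search(vector, limit) |
| Keyword search | Literal text match (case-insensitive substring) | PostgreSQL | ILIKE '%keyword%' |
| Time range | Specific period | Both | start_time < end AND end_time > start |
| Scene search | Specific shot | PostgreSQL | scene_id = N |
| Hybrid search | Vector + keyword | Both | Combination of semantic and keyword search |
| Summary generation | Merge text | PostgreSQL | string_agg() |
13. Database Selection Recommendations
13.1 Storage Strategy
| Data type | Primary store | Backup/Query | Notes |
|---|---|---|---|
| Chunk metadata | PostgreSQL | MongoDB | Mainly structured queries |
| Vector data | Qdrant | PostgreSQL | Mainly vector search |
| Full-text search | PostgreSQL | - | Keyword search |
| Logs/History | MongoDB | - | Flexibility first |
13.2 Read/Write Patterns
| Scenario | Write | Read |
|---|---|---|
| Video processing | PostgreSQL + Qdrant | - |
| Semantic search | - | Qdrant |
| Timeline browsing | - | PostgreSQL |
| System analytics | MongoDB | MongoDB |
14. Related Documents
- JSON_OUTPUT_SPEC.md - JSON output specification
- RUST_DEVELOPMENT.md - Rust development guidelines
- AGENTS.md - Development guidelines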