Document Embedding Strategy - Parent-Child Chunks
Overview
Momentry uses a parent-child chunk hierarchy for improved RAG retrieval. This document describes the embedding strategy for this hierarchy.
Chunk Structure
Parent Chunk
- Purpose: Summarize multiple child chunks with a narrative description
- Content: High-level description of multiple scenes/segments
- Example:
{
  "chunk_id": "story_asr_0000",
  "chunk_type": "story",
  "text_content": "[0s-125s] A man enters a building. He walks down a hallway.",
  "child_chunk_ids": ["asr_0001", "asr_0002", "asr_0003", "asr_0004", "asr_0005"]
}
Child Chunk
- Purpose: Individual segments from ASR, scenes from CUT, etc.
- Content: Raw transcription or detection results
- Example:
{
  "chunk_id": "asr_0001",
  "chunk_type": "sentence",
  "text_content": "Hello world",
  "parent_chunk_id": "story_asr_0000"
}
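The two shapes above can be modelled as one record type whose links point both ways. The following is a minimal Python sketch rather than Momentry's actual types: the `Chunk` dataclass and the `broken_links` helper are hypothetical names introduced for illustration, and only the field names are taken from the JSON examples.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class Chunk:
    chunk_id: str
    chunk_type: str
    text_content: str
    parent_chunk_id: Optional[str] = None                      # set on child chunks
    child_chunk_ids: List[str] = field(default_factory=list)   # set on parent chunks


def broken_links(chunks: List[Chunk]) -> List[Tuple[str, str]]:
    """Return (parent_id, child_id) pairs whose two-way references disagree."""
    by_id = {c.chunk_id: c for c in chunks}
    problems = []
    for chunk in chunks:
        for child_id in chunk.child_chunk_ids:
            child = by_id.get(child_id)
            # A link is broken if the child is missing or points at another parent.
            if child is None or child.parent_chunk_id != chunk.chunk_id:
                problems.append((chunk.chunk_id, child_id))
    return problems


parent = Chunk("story_asr_0000", "story", "[0s-125s] A man enters a building.",
               child_chunk_ids=["asr_0001", "asr_0002"])
child = Chunk("asr_0001", "sentence", "Hello world",
              parent_chunk_id="story_asr_0000")

print(broken_links([parent, child]))  # [('story_asr_0000', 'asr_0002')]
```

A check like this is cheap to run after story processing, before anything is embedded or stored.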
Embedding Strategy
For Vector Search
When embedding chunks for vector search, we combine parent description + child content to provide both context and detail.
Parent Chunk Embedding
embedding_text = f"Summary: {parent.text_content}\nChildren: {child_text_1}. {child_text_2}. {child_text_3}..."
Prefix: search_document: (for documents in Qdrant)
Example:
search_document: Summary: A man enters a building. He walks down a hallway.
Children: Hello, how are you? I'm fine thank you. The weather is nice today.
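A small Python sketch of how the parent text could be assembled. `DOC_PREFIX`, `join_children`, and `parent_embedding_text` are illustrative names, and the rule of adding a period only when a child text has no end punctuation is an assumption, chosen so the output matches the example above.

```python
DOC_PREFIX = "search_document: "


def join_children(child_texts):
    """Join child texts into one line, adding a period where one is missing."""
    parts = []
    for text in child_texts:
        text = text.strip()
        if not text:
            continue  # skip empty segments
        parts.append(text if text[-1] in ".?!" else text + ".")
    return " ".join(parts)


def parent_embedding_text(parent_text, child_texts):
    """Build the prefixed 'Summary / Children' text for a parent chunk."""
    return f"{DOC_PREFIX}Summary: {parent_text}\nChildren: {join_children(child_texts)}"


text = parent_embedding_text(
    "A man enters a building. He walks down a hallway.",
    ["Hello, how are you?", "I'm fine thank you", "The weather is nice today"],
)
print(text)
```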
Child Chunk Embedding
embedding_text = f"[{child.chunk_type}] {child.text_content}\nParent: {parent.text_content}"
Prefix: search_document:
Example:
search_document: [sentence] Hello, how are you?
Parent: A man enters a building. He walks down a hallway.
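The child format can be sketched the same way in Python. `child_embedding_text` is a hypothetical helper; the `"N/A"` fallback for a missing parent mirrors the `unwrap_or("N/A")` call in the Implementation section.

```python
DOC_PREFIX = "search_document: "


def child_embedding_text(chunk_type, child_text, parent_text=None):
    """Build the prefixed '[type] content / Parent' text for a child chunk."""
    parent = parent_text if parent_text is not None else "N/A"
    return f"{DOC_PREFIX}[{chunk_type}] {child_text}\nParent: {parent}"


with_parent = child_embedding_text(
    "sentence",
    "Hello, how are you?",
    "A man enters a building. He walks down a hallway.",
)
orphan = child_embedding_text("sentence", "Hello, how are you?")
print(with_parent)
print(orphan)
```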
For BM25 Text Search
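The keyword side of hybrid search runs on raw text using PostgreSQL full-text search. Note that ts_rank_cd() is PostgreSQL's cover-density ranking, not a true BM25 implementation; this document calls the resulting score "BM25" for short.
- Index: search_vector (TSVECTOR) on chunks.text_content
- Search: Uses ts_rank_cd() for ranking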
Hybrid Search Ranking
Combined score = (vector_score * 0.7) + (bm25_score * 0.3)
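The formula can be written out as a short Python sketch. It assumes both scores have already been normalized to a comparable range (for example 0 to 1), which this document does not specify; `combined_score` and `rank` are illustrative names.

```python
def combined_score(vector_score, bm25_score, vector_weight=0.7, bm25_weight=0.3):
    """Weighted sum of the two scores, defaulting to the 0.7/0.3 split."""
    return vector_score * vector_weight + bm25_score * bm25_weight


def rank(candidates, vector_weight=0.7, bm25_weight=0.3):
    """Sort chunk ids by combined score, highest first.

    `candidates` maps chunk_id -> (vector_score, bm25_score).
    """
    return sorted(
        candidates,
        key=lambda cid: combined_score(*candidates[cid], vector_weight, bm25_weight),
        reverse=True,
    )


# Made-up scores, for illustration only.
candidates = {
    "asr_0001": (0.9, 0.1),
    "asr_0002": (0.5, 1.0),
    "story_asr_0000": (0.8, 0.6),
}
print(rank(candidates))  # ['story_asr_0000', 'asr_0001', 'asr_0002']
```

Without normalization the larger-scaled score would dominate regardless of the weights, so the split is only meaningful once both inputs share a scale.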
Why 0.7/0.3?
| Aspect | Vector | BM25 |
|---|---|---|
| Pros | Semantic similarity | Exact keyword match |
| Cons | May miss specific terms | No semantic understanding |
| Best for | Thematic queries | Fact lookup |
Query Patterns
Thematic Query ("What are the main themes?")
- Use higher vector_weight (0.8-0.9)
- Vector search finds semantically similar content
Fact Lookup ("Who said X?")
- Use higher bm25_weight (0.5-0.7)
- BM25 finds exact matches
Balanced ("Tell me about scene 5")
- Use default 0.7/0.3
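One way to encode the three patterns is a lookup table, sketched below in Python. The exact values picked inside each range (0.85 for thematic queries, 0.6 BM25 for fact lookup) and the choice to make each pair sum to 1 are assumptions; `WEIGHTS` and `weights_for` are hypothetical names, and classifying a query into a pattern is out of scope here.

```python
# (vector_weight, bm25_weight) per query pattern.
WEIGHTS = {
    "thematic": (0.85, 0.15),   # vector_weight within 0.8-0.9
    "fact_lookup": (0.4, 0.6),  # bm25_weight within 0.5-0.7
    "balanced": (0.7, 0.3),     # default split
}


def weights_for(pattern):
    """Return (vector_weight, bm25_weight), falling back to the default split."""
    return WEIGHTS.get(pattern, WEIGHTS["balanced"])


print(weights_for("thematic"))     # (0.85, 0.15)
print(weights_for("unrecognized")) # (0.7, 0.3)
```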
Implementation
Embedding Generation
fn build_embedding_text(chunk: &Chunk, parent_text: Option<&str>) -> String {
    match chunk.chunk_type {
        // Parent (story) chunks: summary followed by the joined child texts.
        ChunkType::Story => {
            format!(
                "Summary: {}\nChildren: {}",
                chunk.text_content,
                get_children_text(chunk)
            )
        }
        // All other chunks: type tag and content, then the parent text (or "N/A").
        _ => {
            format!(
                "[{}] {}\nParent: {}",
                chunk.chunk_type.as_str(),
                chunk.text_content,
                parent_text.unwrap_or("N/A")
            )
        }
    }
}
Storage
- Parent chunks stored with their child_chunk_ids
- Child chunks reference parent_chunk_id
parent_chunk_id - Both stored in PostgreSQL with full-text index
- Vectors stored in Qdrant
Example Flow
- Story Processing generates parent-child hierarchy
- Embedding creates vector for each chunk
- Storage saves to PostgreSQL + Qdrant
- Search retrieves using hybrid search
- Results include both parent context and child details
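The last step of the flow, returning child hits together with their parent context, might look like the following Python sketch. It assumes chunks are available as plain dictionaries shaped like the JSON examples earlier and that the hybrid search has already produced a list of chunk ids; `attach_parent_context` is a hypothetical helper.

```python
def attach_parent_context(hit_ids, chunks_by_id):
    """Pair each hit with its parent's text, or None when it has no parent."""
    results = []
    for chunk_id in hit_ids:
        chunk = chunks_by_id[chunk_id]
        # Parent chunks have no parent_chunk_id, so this lookup yields None for them.
        parent = chunks_by_id.get(chunk.get("parent_chunk_id"))
        results.append({
            "chunk_id": chunk_id,
            "text_content": chunk["text_content"],
            "parent_text": parent["text_content"] if parent else None,
        })
    return results


chunks_by_id = {
    "story_asr_0000": {
        "chunk_type": "story",
        "text_content": "[0s-125s] A man enters a building. He walks down a hallway.",
        "child_chunk_ids": ["asr_0001"],
    },
    "asr_0001": {
        "chunk_type": "sentence",
        "text_content": "Hello world",
        "parent_chunk_id": "story_asr_0000",
    },
}
results = attach_parent_context(["asr_0001", "story_asr_0000"], chunks_by_id)
print(results[0]["parent_text"])
```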
Best Practices
- Chunk Size: 5 child chunks per parent (configurable)
- Text Length: Keep embedding text under 512 tokens
- Parent Description: Include temporal markers (timestamps)
- Child Content: Preserve original transcription
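The first two practices can be sketched in Python as follows. `group_children` and `within_budget` are hypothetical helpers, and counting whitespace-separated words is only a crude stand-in for a real tokenizer: actual token counts depend on the embedding model and are usually higher than the word count.

```python
def group_children(child_ids, size=5):
    """Split child ids into consecutive groups of at most `size`, one per parent."""
    return [child_ids[i:i + size] for i in range(0, len(child_ids), size)]


def within_budget(text, max_tokens=512):
    """Rough length check: counts whitespace-separated words, not model tokens."""
    return len(text.split()) <= max_tokens


child_ids = [f"asr_{i:04d}" for i in range(1, 13)]  # 12 children
groups = group_children(child_ids)
print([len(g) for g in groups])  # [5, 5, 2]
```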
Future Enhancements
- GraphRAG integration for relationship traversal
- Cross-chunk entity linking
- Temporal graph building