# Document Embedding Strategy - Parent-Child Chunks

| Item | Content |
|------|---------|
| Author | Warren |
| Created | 2026-03-23 |
| Document Version | V1.0 |

---

## Version History

| Version | Date | Purpose | Operator | Tool/Model |
|---------|------|---------|----------|------------|
| V1.0 | 2026-03-23 | Create document embedding strategy | Warren | OpenCode |

---

## Overview

Momentry uses a **parent-child chunk hierarchy** to improve RAG retrieval. This document describes how chunks in that hierarchy are embedded and searched.

## Chunk Structure

### Parent Chunk

- **Purpose**: Summarize multiple child chunks with a narrative description
- **Content**: High-level description of multiple scenes/segments
- **Example**:

```json
{
  "chunk_id": "story_asr_0000",
  "chunk_type": "story",
  "text_content": "[0s-125s] A man enters a building. He walks down a hallway.",
  "child_chunk_ids": ["asr_0001", "asr_0002", "asr_0003", "asr_0004", "asr_0005"]
}
```

### Child Chunk

- **Purpose**: Individual segments from ASR, scenes from CUT, etc.
- **Content**: Raw transcription or detection results
- **Example**:

```json
{
  "chunk_id": "asr_0001",
  "chunk_type": "sentence",
  "text_content": "Hello world",
  "parent_chunk_id": "story_asr_0000"
}
```

## Embedding Strategy

### For Vector Search

When embedding chunks for vector search, we combine **parent description + child content** so that each vector carries both context and detail.

#### Parent Chunk Embedding

```
embedding_text = f"Summary: {parent.text_content}\nChildren: {child_text_1}. {child_text_2}. {child_text_3}..."
```

**Prefix**: `search_document:` (for documents in Qdrant)

**Example**:

```
search_document: Summary: A man enters a building. He walks down a hallway.
Children: Hello, how are you? I'm fine thank you. The weather is nice today.
```

#### Child Chunk Embedding

```
embedding_text = f"[{child.chunk_type}] {child.text_content}\nParent: {parent.text_content}"
```

**Prefix**: `search_document:`

**Example**:

```
search_document: [sentence] Hello, how are you?
Parent: A man enters a building. He walks down a hallway.
```

### For BM25 Text Search

The keyword side of the search, called BM25 in this document, runs on raw text through PostgreSQL full-text search.

- **Index**: `search_vector` (TSVECTOR) on `chunks.text_content`
- **Search**: Uses `ts_rank_cd()` for ranking. Note that `ts_rank_cd()` computes PostgreSQL's cover-density rank, which is a different formula from classic BM25.

## Hybrid Search Ranking

Combined score = `(vector_score * 0.7) + (bm25_score * 0.3)`

### Why 0.7/0.3?

The default favors semantic matching while still rewarding exact keyword hits. The two sources trade off as follows:

| Aspect | Vector (weight 0.7) | BM25 (weight 0.3) |
|--------|---------------------|-------------------|
| Pros | Semantic similarity | Exact keyword match |
| Cons | May miss specific terms | No semantic understanding |
| Best for | Thematic queries | Fact lookup |

## Query Patterns

### Thematic Query ("What are the main themes?")

- Use a higher `vector_weight` (0.8-0.9)
- Vector search finds semantically similar content

### Fact Lookup ("Who said X?")

- Use a higher `bm25_weight` (0.5-0.7)
- BM25 finds exact matches

### Balanced ("Tell me about scene 5")

- Use the default 0.7/0.3

## Implementation

### Embedding Generation

```rust
/// Builds the text that is embedded for a chunk.
/// The `search_document:` prefix is not added by this function.
fn build_embedding_text(chunk: &Chunk, parent_text: Option<&str>) -> String {
    match chunk.chunk_type {
        // Parent (story) chunk: summary followed by the text of its children.
        ChunkType::Story => {
            format!(
                "Summary: {}\nChildren: {}",
                chunk.text_content,
                get_children_text(chunk) // helper not shown in this document
            )
        }
        // Any child chunk: tagged content followed by the parent description.
        _ => {
            format!(
                "[{}] {}\nParent: {}",
                chunk.chunk_type.as_str(),
                chunk.text_content,
                parent_text.unwrap_or("N/A")
            )
        }
    }
}
```

### Storage

- Parent chunks are stored with their `child_chunk_ids`
- Child chunks reference their `parent_chunk_id`
- Both are stored in PostgreSQL with a full-text index
- Vectors are stored in Qdrant

## Example Flow

1. **Story Processing** generates the parent-child hierarchy
2. **Embedding** creates a vector for each chunk
3. **Storage** saves to PostgreSQL + Qdrant
4. **Search** retrieves using hybrid search
5. **Results** include both parent context and child details

## Best Practices

1. **Chunk Size**: 5 child chunks per parent (configurable)
2. **Text Length**: Keep the text passed to the embedding model under 512 tokens
3. **Parent Description**: Include temporal markers (timestamps)
4. **Child Content**: Preserve the original transcription

## Future Enhancements

- [ ] GraphRAG integration for relationship traversal
- [ ] Cross-chunk entity linking
- [ ] Temporal graph building
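
## Appendix: Hybrid Scoring Sketch

The weighting described under Hybrid Search Ranking and Query Patterns could be sketched as below. This is a minimal illustration, not the project's actual code. The names `QueryPattern`, `HybridWeights`, `for_pattern`, `min_max_normalize`, `hybrid_score`, and `rank` are hypothetical. Three things are assumptions that the document does not state: the two weights always sum to 1.0, one specific value is picked from each documented weight range, and keyword scores are min-max normalized across the candidate set before mixing (vector scores are assumed to already lie in [0, 1]).

```rust
/// Query patterns named in the "Query Patterns" section.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum QueryPattern {
    Thematic,
    FactLookup,
    Balanced,
}

/// Weights for the two score sources (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct HybridWeights {
    pub vector_weight: f64,
    pub bm25_weight: f64,
}

impl HybridWeights {
    /// Picks one value inside each documented range (assumption) and
    /// derives the vector weight so both weights sum to 1.0 (assumption).
    pub fn for_pattern(pattern: QueryPattern) -> Self {
        let bm25_weight = match pattern {
            QueryPattern::Thematic => 0.2,   // vector_weight 0.8, documented range 0.8-0.9
            QueryPattern::FactLookup => 0.6, // documented bm25_weight range 0.5-0.7
            QueryPattern::Balanced => 0.3,   // documented default 0.7/0.3
        };
        Self { vector_weight: 1.0 - bm25_weight, bm25_weight }
    }
}

/// Rescales scores into [0, 1]; if all scores are equal, every output is 0.0.
pub fn min_max_normalize(scores: &[f64]) -> Vec<f64> {
    let min = scores.iter().copied().fold(f64::INFINITY, f64::min);
    let max = scores.iter().copied().fold(f64::NEG_INFINITY, f64::max);
    let range = max - min;
    scores
        .iter()
        .map(|&s| if range > 0.0 { (s - min) / range } else { 0.0 })
        .collect()
}

/// Combined score = vector_score * vector_weight + bm25_score * bm25_weight.
pub fn hybrid_score(vector_score: f64, bm25_score: f64, weights: HybridWeights) -> f64 {
    vector_score * weights.vector_weight + bm25_score * weights.bm25_weight
}

/// Ranks `(chunk_id, vector_score, raw_keyword_score)` candidates, best first.
pub fn rank(candidates: &[(String, f64, f64)], weights: HybridWeights) -> Vec<(String, f64)> {
    let keyword_scores: Vec<f64> = candidates.iter().map(|c| c.2).collect();
    let normalized = min_max_normalize(&keyword_scores);
    let mut ranked: Vec<(String, f64)> = candidates
        .iter()
        .zip(normalized)
        .map(|((id, vector_score, _), keyword)| {
            (id.clone(), hybrid_score(*vector_score, keyword, weights))
        })
        .collect();
    // Sort by combined score, highest first.
    ranked.sort_by(|a, b| b.1.total_cmp(&a.1));
    ranked
}
```

As a usage example, take two candidates: `asr_0001` with vector score 0.9 and raw keyword score 0.0, and `asr_0002` with vector score 0.5 and raw keyword score 2.0. Under the balanced weights `asr_0002` ranks first (roughly 0.65 versus 0.63), while under the thematic weights `asr_0001` ranks first, which illustrates how the choice of weights can flip the ordering.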