# Ceph RADOS Integration Analysis for MarkBase
**Date**: 2026-06-25
**Status**: Shelved (does not fit MarkBase's macOS cross-platform positioning)
**Library**: ceph-async (4.0.5)
**Constraint**: Linux-only (requires librados.so symlink)
---
## Executive Summary
### Goal
Add Ceph RADOS as a VfsBackend option for distributed, highly scalable storage.
### Key Findings
| Aspect | Finding |
|--------|---------|
| **Platform** | ❌ Linux-only (librados.so FFI, macOS needs Docker/VM) |
| **Deployment** | ⚠️ Requires full cluster (Monitor + OSD + MGR) |
| **Complexity** | ⚠️⚠️⚠️⚠️⚠️ High (beyond MarkBase's lightweight positioning) |
| **Positioning** | ❌ Does not fit MarkBase's macOS cross-platform positioning |
### Recommendation
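**Shelved for now.** Options to consider first:
1. **MinIO**: S3-compatible, already supported through S3Vfs, cross-platform
2. **Built-in distribution**: a DedupFs + S3Vfs combination, lightweight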
---
## Architecture Overview
```
┌────────────────────────────────────────────────────────────────┐
│ MarkBase Application Layer                                     │
│ ├── SMB Server (Port 4445)                                     │
│ ├── SFTP Server (Port 2024)                                    │
│ └── WebDAV Server (Port 11438)                                 │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ VFS Abstraction Layer (VfsBackend trait)                       │
│ ├── LocalFs     - POSIX local filesystem                       │
│ ├── S3Vfs       - S3-compatible storage (HTTP API)             │
│ ├── SmbVfs      - SMB client backend                           │
│ ├── CephVfs     - Ceph RADOS backend (shelved)                 │
│ ├── EncryptedFs - Encryption layer                             │
│ ├── Compression - ZSTD/LZ4 compression layer                   │
│ ├── DedupFs     - Block deduplication layer                    │
│ └── RaidFs      - RAID-Z emulation layer                       │
└────────────────────────────────────────────────────────────────┘
                                ↓
┌────────────────────────────────────────────────────────────────┐
│ Ceph Storage Cluster (RADOS)                                   │
│ ├── Monitor (MON)  - Cluster map, authentication               │
│ ├── OSD Daemons    - Object storage (data replication)         │
│ ├── Manager (MGR)  - Dashboard, telemetry                      │
│ ├── MDS (optional) - CephFS metadata server                    │
│ └── RGW (optional) - S3/Swift gateway                          │
└────────────────────────────────────────────────────────────────┘
```
---
## Library Analysis
### Rust Ceph Crates
| Crate | Version | Description | Platform |
|-------|---------|-------------|----------|
| `ceph` | 3.2.5 | Official librados FFI (sync) | Linux-only |
| `ceph-async` | 4.0.5 | Async librados FFI (futures 0.3) | Linux-only |
| `ceph-rbd` | 0.3.2 | RADOS Block Device bindings | Linux-only |
### ceph-async Module Structure
```
ceph_async::
├── CephClient — Admin operations (OSD/Pool/Mon commands)
├── rados:: — Low-level FFI bindings (100+ functions)
│ ├── rados_read/write/stat/remove — Object I/O
│ ├── rados_pool_create/delete/lookup — Pool management
│ ├── rados_ioctx_* — I/O context (pool handle)
│ ├── rados_snap_* — Snapshot management
│ ├── rados_lock_* — Distributed locking
│ ├── rados_aio_* — Async I/O
│ ├── rados_omap_* — Key-value store per object
│ └── rados_write_op_* / rados_read_op_* — Compound operations
├── completion:: — Async completion handling
├── read_stream:: — Async read stream
├── write_sink:: — Async write sink
└── list_stream:: — Async object listing
```
### CephClient API
```rust
let client = CephClient::new("admin", "/etc/ceph/ceph.conf")?;
// OSD operations
client.osd_tree()?; // Get OSD tree (CRUSH map)
client.osd_out(osd_id)?; // Mark OSD out
client.osd_crush_remove(osd_id)?; // Remove from CRUSH map
// Pool operations
client.osd_pool_get(pool, option)?; // Get pool config
client.osd_pool_set(pool, key, val)?; // Set pool config
client.osd_pool_quota_get(pool)?; // Get pool quota
// Cluster status
client.status()?; // Cluster health
client.mon_dump()?; // Monitor list
client.version()?; // Ceph version
```
---
## Implementation Phases
| Phase | Task | Code Lines | Priority | Risk | Dependencies |
|-------|------|------------|----------|------|--------------|
| **Phase 1** | CephVfs struct + basic I/O | ~400 | P0 | Medium ⚠️⚠️⚠️ | ceph-async crate |
| **Phase 2** | Pool management CLI | ~150 | P1 | Low ⚠️ | Phase 1 |
| **Phase 3** | Snapshot support | ~200 | P2 | Medium ⚠️⚠️⚠️ | librados snap API |
| **Phase 4** | Distributed locking | ~100 | P2 | Medium ⚠️⚠️⚠️ | librados lock API |
| **Phase 5** | OMAP key-value | ~150 | P3 | Low ⚠️ | librados omap API |
| **Phase 6** | Async integration | ~300 | P1 | High ⚠️⚠️⚠️⚠️ | async-vfs feature |
| **Phase 7** | Docker test environment | ~50 | P0 | Low ⚠️ | Docker compose |
| **Phase 8** | Performance benchmark | ~100 | P2 | Low ⚠️ | Benchmark scripts |
| **Total** | | **~1450** | | | |
---
## Phase 1: CephVfs Core Implementation
### Key Design Decisions
**1. Object vs File mapping**:
- RADOS is object storage (no directories)
- Path `/foo/bar.txt` → Object `foo/bar.txt` in pool
- Directories simulated via zero-byte objects with `/` suffix (like S3)
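
A minimal sketch of this mapping, assuming paths arrive as `/`-separated strings. The helper names `object_name`, `dir_marker`, and `is_dir_marker` are hypothetical and not part of MarkBase or ceph-async.

```rust
/// Maps a VFS path to an object name (hypothetical helper).
/// Leading and repeated slashes are dropped, as are "." segments.
pub fn object_name(path: &str) -> String {
    path.split('/')
        .filter(|seg| !seg.is_empty() && *seg != ".")
        .collect::<Vec<_>>()
        .join("/")
}

/// Name of the zero-byte marker object that stands in for a directory.
/// The pool root has no marker, so it maps to an empty string.
pub fn dir_marker(path: &str) -> String {
    let name = object_name(path);
    if name.is_empty() {
        String::new()
    } else {
        format!("{name}/")
    }
}

/// True when an object name denotes a directory marker.
pub fn is_dir_marker(name: &str) -> bool {
    name.ends_with('/')
}
```

The sketch does not resolve `..` segments; a real backend would need to reject or normalize them before building an object name.
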
**2. Pool-per-share vs single pool**:
- Option A: Single pool + path prefix (simpler, less isolation)
- Option B: Pool-per-share (better isolation, quota per pool)
- **Recommend**: Option B (pool-per-share) for enterprise use
**3. I/O context caching**:
- Each pool requires separate `rados_ioctx_t`
- Cache ioctx per share to avoid recreation overhead
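
The caching idea can be sketched with a generic map, where `H` stands in for `rados_ioctx_t` and the `open` closure stands in for a call such as `rados_ioctx_create`. `IoctxCache` and `get_or_open` are hypothetical names used only for illustration.

```rust
use std::collections::HashMap;

/// Caches one handle per pool; `H` is a placeholder for `rados_ioctx_t`.
pub struct IoctxCache<H> {
    handles: HashMap<String, H>,
}

impl<H> IoctxCache<H> {
    pub fn new() -> Self {
        Self { handles: HashMap::new() }
    }

    /// Returns the cached handle for `pool`, calling `open` only on a miss.
    /// A failed `open` leaves the cache unchanged.
    pub fn get_or_open<E>(
        &mut self,
        pool: &str,
        open: impl FnOnce(&str) -> Result<H, E>,
    ) -> Result<&H, E> {
        if !self.handles.contains_key(pool) {
            let handle = open(pool)?;
            self.handles.insert(pool.to_string(), handle);
        }
        Ok(&self.handles[pool])
    }

    /// Number of pools with an open handle.
    pub fn len(&self) -> usize {
        self.handles.len()
    }
}
```

A real cache would also have to release each handle when it is evicted or dropped (librados provides `rados_ioctx_destroy` for this) and decide how handles are shared across threads; neither is modeled here.
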
### CephVfs Struct (Draft)
```rust
pub struct CephVfs {
    cluster: rados_t,       // RADOS cluster handle
    pool_name: String,      // Pool name for this share
    ioctx: rados_ioctx_t,   // I/O context (cached)
    root_prefix: String,    // Path prefix within pool
}

pub struct CephVfsFile {
    ioctx: rados_ioctx_t,
    object_id: String,      // Object name in pool
    position: u64,
    write_buffer: Vec<u8>,  // Buffer for writes (flush on close)
    size: u64,
}
```
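
The `write_buffer` comment implies whole-object buffering with a single flush on close. A rough sketch of that behaviour follows, with a closure standing in for `rados_write_full`. `BufferedObject` is a hypothetical type that models only a newly created object, not one with existing content.

```rust
/// Buffers writes in memory and emits the whole object once on close.
pub struct BufferedObject {
    position: u64,
    write_buffer: Vec<u8>,
}

impl BufferedObject {
    pub fn new() -> Self {
        Self { position: 0, write_buffer: Vec::new() }
    }

    /// Moves the write position; gaps are zero-filled on the next write.
    pub fn seek(&mut self, pos: u64) {
        self.position = pos;
    }

    /// Copies `data` into the buffer at the current position.
    pub fn write(&mut self, data: &[u8]) {
        let start = self.position as usize;
        let end = start + data.len();
        if self.write_buffer.len() < end {
            self.write_buffer.resize(end, 0);
        }
        self.write_buffer[start..end].copy_from_slice(data);
        self.position = end as u64;
    }

    /// Hands the full buffer to `write_full`, the stand-in for the RADOS call.
    pub fn close<E>(
        self,
        write_full: impl FnOnce(&[u8]) -> Result<(), E>,
    ) -> Result<(), E> {
        write_full(&self.write_buffer)
    }
}
```

Buffering the whole object keeps the write path simple but holds the entire file in memory, so large files would likely need offset writes instead.
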
### VfsBackend Method Mapping
| Method | RADOS equivalent | Complexity |
|--------|-----------------|------------|
| `read_dir()` | `rados_nobjects_list_*` | High (pagination) |
| `open_file()` | Custom (object ops) | Medium |
| `stat()` | `rados_stat()` | Low |
| `create_dir()` | `rados_write_full(0-byte)` | Low |
| `remove_dir()` | `rados_remove()` | Low |
| `remove_file()` | `rados_remove()` | Low |
| `rename()` | Custom (copy + delete) | Medium |
| `exists()` | `rados_stat()` | Low |
| `copy()` | `rados_clone_range()` | Low |
| `hard_link()` | `rados_clone_range()` | Low |
| `read_link()` | Unsupported | N/A |
| `create_symlink()` | Unsupported | N/A |
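
To make the "Custom" and "High" entries in the table concrete, here is a sketch against an in-memory stand-in for a pool. `MockPool` is hypothetical and does not call librados. It only illustrates directory markers, listing by name prefix, and rename as copy plus delete.

```rust
use std::collections::{BTreeMap, BTreeSet};

/// In-memory stand-in for one pool: object name -> bytes.
#[derive(Default)]
pub struct MockPool {
    objects: BTreeMap<String, Vec<u8>>,
}

impl MockPool {
    pub fn write_full(&mut self, name: &str, data: &[u8]) {
        self.objects.insert(name.to_string(), data.to_vec());
    }

    pub fn exists(&self, name: &str) -> bool {
        self.objects.contains_key(name)
    }

    /// create_dir(): a zero-byte marker object whose name ends in '/'.
    pub fn create_dir(&mut self, dir: &str) {
        self.write_full(&format!("{}/", dir.trim_matches('/')), &[]);
    }

    /// read_dir(): scan every name and keep the direct children of `dir`.
    pub fn read_dir(&self, dir: &str) -> Vec<String> {
        let dir = dir.trim_matches('/');
        let prefix = if dir.is_empty() { String::new() } else { format!("{dir}/") };
        let mut children = BTreeSet::new();
        for name in self.objects.keys() {
            if let Some(rest) = name.strip_prefix(&prefix) {
                // Keep only the first path segment below `dir`.
                let child = rest.split('/').next().unwrap_or("");
                if !child.is_empty() {
                    children.insert(child.to_string());
                }
            }
        }
        children.into_iter().collect()
    }

    /// rename(): copy then delete. The two steps are not atomic.
    pub fn rename(&mut self, from: &str, to: &str) -> bool {
        match self.objects.get(from).cloned() {
            Some(data) => {
                self.write_full(to, &data);
                self.objects.remove(from);
                true
            }
            None => false,
        }
    }
}
```

The listing loop visits every object name in the pool, which is why `read_dir()` needs pagination and is the most expensive entry in the table. Because `rename()` is two steps, a crash between them can leave both names present.
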
---
## Risk Assessment
| Risk | Level | Mitigation |
|------|-------|------------|
| **Linux-only** | ⚠️⚠️⚠️⚠️⚠️ Critical | Docker/VM for macOS; does not fit the cross-platform positioning |
| **librados.so symlink** | ⚠️⚠️⚠️ Medium | Document setup; CI check |
| **Pool-level snapshots** | ⚠️⚠️ Low | Document limitation; consider RGW |
| **Async overhead** | ⚠️⚠️⚠️ Medium | Benchmark; spawn_blocking wrapper |
| **Cluster complexity** | ⚠️⚠️⚠️⚠️⚠️ Critical | Beyond the lightweight positioning; Docker Compose |
| **SMB Oplocks integration** | ⚠️⚠️⚠️ Medium | RADOS locking API; careful design |
---
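## Alternatives (Recommended Options)
### Option Comparison
| Option | Cross-platform | Deployment complexity | Fit with positioning | Status |
|------|--------|-----------|---------|------|
| **Ceph RADOS** | ❌ Linux-only | ⚠️⚠️⚠️⚠️⚠️ Very high | ❌ No fit | Shelved |
| **Ceph RGW (S3)** | ✅ HTTP API | ⚠️⚠️⚠️⚠️ High | ⭐⭐⭐ Moderate | Covered by S3Vfs |
| **MinIO** | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full fit | Covered by S3Vfs |
| **GlusterFS** | ✅ POSIX | ⚠️⚠️⚠️ Medium | ⭐⭐⭐⭐ High | To be investigated |
| **Built-in distribution** | ✅ All platforms | ⚠️⚠️ Low | ⭐⭐⭐⭐⭐ Full fit | Foundation in place |
### Option 1: MinIO (Recommended)
**Advantages**:
- ✅ S3-compatible API; the existing S3Vfs works without new code
- ✅ Single-node deployment (lightweight)
- ✅ Cross-platform: macOS/Linux/Windows
- ✅ High performance (erasure coding)
- ✅ Open source, with an enterprise edition
**Deployment**:
```bash
# Single node on macOS
minio server /data --console-address ":9001"
# MarkBase configuration
MB_S3_ENDPOINT=http://localhost:9000
MB_S3_BUCKET=markbase
```
**Integration**: No code changes needed; S3Vfs already supports it.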
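
As an illustration of how the two variables above could be validated at startup, here is a small sketch. How S3Vfs actually reads its configuration is not described in this document, so `S3Settings` and `s3_settings` are hypothetical names. The lookup is passed in as a closure so the logic can be exercised without touching the process environment.

```rust
/// S3 connection settings (hypothetical struct, not MarkBase's real type).
#[derive(Debug, PartialEq)]
pub struct S3Settings {
    pub endpoint: String,
    pub bucket: String,
}

/// Builds settings from a variable lookup such as `|k| std::env::var(k).ok()`.
pub fn s3_settings(
    get: impl Fn(&str) -> Option<String>,
) -> Result<S3Settings, String> {
    // Treat unset and blank variables the same way.
    let require = |key: &str| {
        get(key)
            .filter(|v| !v.trim().is_empty())
            .ok_or_else(|| format!("missing environment variable {key}"))
    };
    let endpoint = require("MB_S3_ENDPOINT")?;
    if !(endpoint.starts_with("http://") || endpoint.starts_with("https://")) {
        return Err(format!("MB_S3_ENDPOINT must be an http(s) URL, got {endpoint}"));
    }
    let bucket = require("MB_S3_BUCKET")?;
    Ok(S3Settings { endpoint, bucket })
}
```

In an application this would be called as `s3_settings(|k| std::env::var(k).ok())`.
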
---
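### Option 2: Built-in Distributed Storage
**Existing foundation**:
| Feature | File | Distribution potential |
|------|------|-----------|
| DedupFs | dedup.rs | ✅ SHA-256 block store can be shared across nodes |
| RaidFs | raid.rs | ⚠️ Single-node RAID-Z |
| Send-Receive | send_receive.rs | ⚠️ Similar to ZFS send/receive |
| Checksum | checksum.rs | ✅ Data integrity verification |
| Compression | compression.rs | ✅ ZSTD compression |
**Directions for extension**:
1. DedupFs + S3Vfs: store dedup blocks in MinIO/S3 and share them across nodes
2. Checksum + Replication: add cross-node replication
3. Send-Receive + Remote: add remote replication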
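
The first direction relies on content-addressed blocks: identical blocks get the same key, so nodes that share a bucket upload each block only once. A sketch under stated assumptions follows. `BlockStore` is an in-memory stand-in for the bucket, the `blocks/` key layout is invented, and `DefaultHasher` replaces SHA-256 only so the example needs no external crate.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::hash::{Hash, Hasher};

/// Stand-in for a SHA-256 digest. DefaultHasher is NOT collision
/// resistant; it is used here only to keep the sketch dependency-free.
fn block_key(block: &[u8]) -> String {
    let mut h = DefaultHasher::new();
    block.hash(&mut h);
    format!("blocks/{:016x}", h.finish())
}

/// In-memory stand-in for a bucket shared by several nodes.
#[derive(Default)]
pub struct BlockStore {
    objects: HashMap<String, Vec<u8>>,
}

impl BlockStore {
    /// Splits `data` into fixed-size blocks, stores unseen blocks, and
    /// returns the ordered key list. `block_size` must be non-zero.
    pub fn put_file(&mut self, data: &[u8], block_size: usize) -> Vec<String> {
        data.chunks(block_size)
            .map(|block| {
                let key = block_key(block);
                self.objects.entry(key.clone()).or_insert_with(|| block.to_vec());
                key
            })
            .collect()
    }

    /// Reassembles a file from its key list; None if any block is missing.
    pub fn get_file(&self, keys: &[String]) -> Option<Vec<u8>> {
        let mut out = Vec::new();
        for key in keys {
            out.extend_from_slice(self.objects.get(key)?);
        }
        Some(out)
    }

    /// Number of distinct blocks stored.
    pub fn block_count(&self) -> usize {
        self.objects.len()
    }
}
```

Two files that share a block store it once, which is the property that makes a shared bucket useful across nodes. Reference counting and garbage collection of unused blocks are left out.
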
---
## Technical Details
### librados API Functions
**Object I/O**:
- `rados_read(ioctx, oid, buf, len, offset)` — Read at offset
- `rados_write(ioctx, oid, buf, len, offset)` — Write at offset
- `rados_write_full(ioctx, oid, buf, len)` — Write entire object
- `rados_append(ioctx, oid, buf, len)` — Append to object
- `rados_stat(ioctx, oid, psize, pmtime)` — Get object size/mtime
- `rados_remove(ioctx, oid)` — Delete object
**Pool Operations**:
- `rados_pool_create(cluster, pool_name)` — Create pool
- `rados_pool_delete(cluster, pool_name)` — Delete pool
- `rados_pool_lookup(cluster, pool_name)` — Find pool ID
- `rados_ioctx_create(cluster, pool_name, ioctx)` — Create I/O context
**Snapshots**:
- `rados_ioctx_snap_create(ioctx, snap_name)`: Create pool snapshot
- `rados_ioctx_snap_list(ioctx, snaps, maxlen)`: List snapshot IDs
- `rados_ioctx_snap_remove(ioctx, snap_name)`: Delete snapshot by name
- `rados_ioctx_snap_rollback(ioctx, oid, snap_name)`: Roll an object back to a snapshot
**Locking**:
- `rados_lock_exclusive(ioctx, oid, name, cookie, desc, duration, flags)` — Exclusive lock
- `rados_lock_shared(ioctx, oid, name, cookie, tag, desc, duration, flags)` — Shared lock
- `rados_unlock(ioctx, oid, name, cookie)` — Release lock
- `rados_list_lockers(ioctx, oid, name, ...)` — List lock holders
**OMAP (Key-Value)** (in the librados C API these are steps of a compound read or write operation):
- `rados_write_op_omap_set(write_op, ...)`: Set key-value pairs
- `rados_read_op_omap_get_vals_by_keys(read_op, ...)`: Get values by keys
- `rados_read_op_omap_get_keys(read_op, ...)`: List keys
- `rados_write_op_omap_rm_keys(write_op, ...)`: Delete keys
**Async I/O**:
- `rados_aio_read(ioctx, oid, completion, buf, len, offset)` — Async read
- `rados_aio_write(ioctx, oid, completion, buf, len, offset)` — Async write
- `rados_aio_flush(ioctx)` — Flush pending async ops
- `rados_aio_wait_for_complete(completion)` — Wait for completion
---
## Open Questions
1. **Deployment target**: Linux-only production vs macOS development?
2. **Backend choice**: RADOS (librados) vs RGW (S3 API)?
3. **Pool strategy**: Pool-per-share vs single pool + path prefix?
4. **SMB Oplocks**: Should CephVfs support SMB Oplocks via RADOS locking?
5. **Priority**: Start with basic I/O or full async integration first?
---
## Conclusion
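**Ceph RADOS integration is shelved for now**, for these reasons:
1. ❌ The Linux-only constraint does not fit the macOS cross-platform positioning
2. ⚠️ Deployment complexity goes beyond the lightweight positioning
3. ⚠️ A full Ceph cluster is required (Monitor + OSD + MGR)
**Recommended alternatives**:
1. ⭐⭐⭐⭐⭐ **MinIO**: S3-compatible, already covered by S3Vfs, lightweight
2. ⭐⭐⭐⭐⭐ **Built-in distribution**: DedupFs + S3Vfs combination
**Next steps**:
- MinIO integration documentation (0 lines of code)
- DedupFs + S3Vfs combination study (~100 lines)
- Built-in replication feature (~400 lines)
---
**Created**: 2026-06-25
**Last updated**: 2026-06-25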