Document Processor Port
Overview
The Document Processor Port defines the contract for processing documents into searchable chunks for RAG (Retrieval-Augmented Generation) systems.
Purpose: Provides interfaces and domain models for splitting documents into semantic chunks, estimating token counts, and managing document processing pipelines.
Domain: Document processing, text chunking, content analysis, RAG preparation
Key Capabilities:
- Document chunking with configurable strategies
- Multiple content type support (text, markdown, HTML, PDF, JSON, CSV)
- Flexible chunking configurations (size, overlap, boundaries)
- Token estimation for LLM context management
- Chunk metadata and position tracking
- Quality metrics (average size, variance)
- Configurable processing limits and timeouts
Port Type: Processor
When to Use:
- Building RAG systems
- Processing documents for vector storage and semantic search
- Splitting long documents into manageable chunks
- Preparing content for LLM consumption
- Implementing document analysis pipelines
- Creating knowledge bases from unstructured content
Domain Models
DocumentContent
Raw document content with metadata. Immutable snapshot of the original document.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| id | str | Yes | uuid4() | Unique document identifier |
| content | str | Yes | - | Raw document text content |
| content_type | ContentType | Yes | ContentType.TEXT | MIME type of content |
| title | Optional[str] | No | None | Document title |
| source_url | Optional[str] | No | None | Original source URL |
| metadata | Dict[str, Any] | Yes | {} | Custom metadata |
| created_at | datetime | Yes | now(UTC) | Document creation timestamp |
Example:
from portico.ports.document_processor import DocumentContent, ContentType
# Plain text document
doc = DocumentContent(
    content="This is a sample document for processing.",
    content_type=ContentType.TEXT,
    title="Sample Document",
    metadata={"author": "John Doe", "category": "tutorial"}
)

# Markdown document from web
markdown_doc = DocumentContent(
    content="# Introduction\n\nThis is markdown content...",
    content_type=ContentType.MARKDOWN,
    source_url="https://example.com/docs/intro.md",
    metadata={"version": "1.0"}
)
ChunkingConfig
Configuration for document chunking strategies.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| chunk_size | int | Yes | 1000 | Target chunk size in characters |
| chunk_overlap | int | Yes | 200 | Overlap between chunks in characters |
| respect_sentence_boundaries | bool | Yes | True | Avoid splitting sentences |
| respect_paragraph_boundaries | bool | Yes | True | Avoid splitting paragraphs |
| min_chunk_size | int | Yes | 100 | Minimum chunk size to avoid tiny chunks |
| max_chunk_size | int | Yes | 2000 | Maximum chunk size as hard limit |
| preserve_code_blocks | bool | Yes | True | Keep code blocks intact (markdown/HTML) |
| preserve_headers | bool | Yes | True | Include headers in chunk metadata |
Example:
from portico.ports.document_processor import ChunkingConfig
# Standard configuration
config = ChunkingConfig(
    chunk_size=1000,
    chunk_overlap=200,
    respect_sentence_boundaries=True
)

# Large chunks for detailed context
large_config = ChunkingConfig(
    chunk_size=2000,
    chunk_overlap=300,
    max_chunk_size=3000
)

# Precise sentence-based chunking
sentence_config = ChunkingConfig(
    chunk_size=500,
    chunk_overlap=50,
    respect_sentence_boundaries=True,
    respect_paragraph_boundaries=False,
    min_chunk_size=200
)

# Code-focused configuration
code_config = ChunkingConfig(
    chunk_size=800,
    preserve_code_blocks=True,
    preserve_headers=True,
    respect_sentence_boundaries=False
)
ProcessedChunk
Individual chunk from document processing with position tracking and metadata.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| id | str | Yes | uuid4() | Unique chunk identifier |
| content | str | Yes | - | Chunk text content |
| metadata | Dict[str, Any] | Yes | {} | Chunk-specific metadata |
| document_id | str | Yes | - | Parent document ID |
| chunk_index | int | Yes | - | Sequential position in document (0-indexed) |
| start_char | int | Yes | - | Starting character position in original |
| end_char | int | Yes | - | Ending character position in original |
| token_count | Optional[int] | No | None | Estimated token count |
| language | Optional[str] | No | None | Detected language code |
| content_type | ContentType | Yes | ContentType.TEXT | Content type |
| created_at | datetime | Yes | now(UTC) | Chunk creation timestamp |
Example:
from portico.ports.document_processor import ProcessedChunk, ContentType
chunk = ProcessedChunk(
    content="This is the first paragraph of the document.",
    document_id="doc-123",
    chunk_index=0,
    start_char=0,
    end_char=45,
    token_count=12,
    language="en",
    content_type=ContentType.TEXT,
    metadata={"section": "introduction"}
)
# Chunks maintain order and position
print(f"Chunk {chunk.chunk_index}: chars {chunk.start_char}-{chunk.end_char}")
print(f"Estimated tokens: {chunk.token_count}")
ProcessedDocument
Document split into searchable chunks with processing metadata and quality metrics.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| id | str | Yes | uuid4() | Unique processed document ID |
| original_document | DocumentContent | Yes | - | Original document reference |
| chunks | List[ProcessedChunk] | Yes | - | List of processed chunks |
| chunking_strategy | str | Yes | - | Strategy name used for chunking |
| chunking_config | ChunkingConfig | Yes | - | Configuration used |
| total_chunks | int | Yes | - | Number of chunks created |
| total_characters | int | Yes | - | Total characters in original |
| total_tokens | Optional[int] | No | None | Total estimated tokens |
| average_chunk_size | float | Yes | - | Mean chunk size in characters |
| chunk_size_variance | float | Yes | - | Statistical variance of chunk sizes |
| processed_at | datetime | Yes | now(UTC) | Processing timestamp |
Example:
from portico.ports.document_processor import ProcessedDocument
# After processing a document
processed = ProcessedDocument(
    original_document=doc,
    chunks=[chunk1, chunk2, chunk3],
    chunking_strategy="paragraph",
    chunking_config=config,
    total_chunks=3,
    total_characters=1500,
    total_tokens=350,
    average_chunk_size=500.0,
    chunk_size_variance=25.5
)
print(f"Split into {processed.total_chunks} chunks")
print(f"Average chunk: {processed.average_chunk_size:.0f} chars")
print(f"Total tokens: {processed.total_tokens}")
# Access chunks in order
for chunk in processed.chunks:
    print(f"Chunk {chunk.chunk_index}: {chunk.content[:50]}...")
DocumentProcessorConfig
Configuration for document processing operations.
| Field | Type | Required | Default | Description |
|---|---|---|---|---|
| default_chunking_config | ChunkingConfig | Yes | ChunkingConfig() | Default chunking settings |
| enable_token_counting | bool | Yes | True | Estimate token counts |
| enable_language_detection | bool | Yes | True | Detect chunk language |
| enable_content_analysis | bool | Yes | True | Analyze content structure |
| auto_detect_content_type | bool | Yes | True | Auto-detect MIME type |
| fallback_content_type | ContentType | Yes | ContentType.TEXT | Fallback if detection fails |
| max_document_size | int | Yes | 10_000_000 | Maximum document size in bytes (10MB) |
| max_chunks_per_document | int | Yes | 1000 | Maximum chunks to generate |
| processing_timeout_seconds | float | Yes | 30.0 | Processing timeout in seconds |
| default_tokenizer_model | str | Yes | "gpt-3.5-turbo" | Model for token estimation |
| tokens_per_chunk_target | int | Yes | 300 | Target tokens per chunk |
Example:
from portico.ports.document_processor import DocumentProcessorConfig, ChunkingConfig
# Production configuration
config = DocumentProcessorConfig(
    default_chunking_config=ChunkingConfig(
        chunk_size=1200,
        chunk_overlap=200
    ),
    enable_token_counting=True,
    max_document_size=20_000_000,  # 20MB
    max_chunks_per_document=2000,
    processing_timeout_seconds=60.0,
    default_tokenizer_model="gpt-4"
)

# Fast processing (minimal analysis)
fast_config = DocumentProcessorConfig(
    enable_token_counting=False,
    enable_language_detection=False,
    enable_content_analysis=False,
    processing_timeout_seconds=10.0
)
Enumerations
ContentType
Supported content types for document processing.
| Value | Description |
|---|---|
| TEXT | Plain text (text/plain) |
| MARKDOWN | Markdown formatted text (text/markdown) |
| HTML | HTML content (text/html) |
| PDF | PDF documents (application/pdf) |
| JSON | JSON structured data (application/json) |
| CSV | Comma-separated values (text/csv) |
Example:
from portico.ports.document_processor import ContentType
# Type-safe content type specification
doc = DocumentContent(
    content=markdown_text,
    content_type=ContentType.MARKDOWN
)

# Different processing for different types
if doc.content_type == ContentType.MARKDOWN:
    # Use markdown-aware chunking
    strategy = MarkdownChunker()
elif doc.content_type == ContentType.PDF:
    # Use PDF-specific processing
    strategy = PDFChunker()
Port Interfaces
DocumentProcessor
The DocumentProcessor abstract base class defines the contract for document processing pipelines.
Location: portico.ports.document_processor.DocumentProcessor
Key Methods
process_document
async def process_document(
    document: DocumentContent,
    chunking_config: Optional[ChunkingConfig] = None
) -> ProcessedDocument
Process a document into searchable chunks. Primary method for document processing.
Parameters:
- document: DocumentContent - Raw document content to process
- chunking_config: Optional[ChunkingConfig] - Optional chunking configuration (uses default if None)
Returns: ProcessedDocument with chunks and processing metadata
Raises:
- ValueError - If document size exceeds maximum allowed
Example:
from portico.ports.document_processor import DocumentProcessor, DocumentContent, ChunkingConfig
# Process with default configuration
doc = DocumentContent(content="Long document text...")
processed = await processor.process_document(doc)
print(f"Created {processed.total_chunks} chunks")
for chunk in processed.chunks:
    print(f"  Chunk {chunk.chunk_index}: {len(chunk.content)} chars")
# Process with custom configuration
config = ChunkingConfig(chunk_size=500, chunk_overlap=100)
processed = await processor.process_document(doc, config)
process_text
async def process_text(
    text: str,
    content_type: ContentType = ContentType.TEXT,
    chunking_config: Optional[ChunkingConfig] = None,
    metadata: Optional[Dict[str, Any]] = None
) -> ProcessedDocument
Process raw text into searchable chunks. Convenience method for text-only content.
Parameters:
- text: str - Raw text content
- content_type: ContentType - Type of content for processing hints
- chunking_config: Optional[ChunkingConfig] - Optional chunking configuration
- metadata: Optional[Dict[str, Any]] - Optional metadata for the document
Returns: ProcessedDocument with chunks
Example:
# Simple text processing
text = "This is a long document that needs to be chunked..."
processed = await processor.process_text(text)
# Markdown processing with metadata
markdown_text = "# Title\n\nContent..."
processed = await processor.process_text(
    markdown_text,
    content_type=ContentType.MARKDOWN,
    metadata={"source": "docs", "author": "team"}
)

# Custom chunking for specific use case
config = ChunkingConfig(chunk_size=800, preserve_code_blocks=True)
processed = await processor.process_text(
    code_documentation,
    content_type=ContentType.MARKDOWN,
    chunking_config=config
)
Other Methods
chunk_document
async def chunk_document(
    document: DocumentContent,
    strategy: ChunkingStrategy,
    config: Optional[ChunkingConfig] = None
) -> List[ProcessedChunk]
Chunk a document using a specific strategy. Returns list of processed chunks.
get_supported_content_types
Get list of supported content types for processing. Returns list of ContentType enum values.
estimate_token_count
Estimate token count for text using the specified model. Returns estimated token count.
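A combined sketch of these helpers. It assumes get_supported_content_types and estimate_token_count are synchronous (as the FAQ examples below suggest), while chunk_document is awaited; ParagraphChunker is an illustrative strategy class (it also appears in the next section):
from portico.ports.document_processor import ChunkingConfig, ContentType

# Check support before processing
if ContentType.PDF not in processor.get_supported_content_types():
    print("PDF processing is not available")

# Estimate tokens before committing to a chunking pass
tokens = processor.estimate_token_count(long_text, model="gpt-3.5-turbo")

# Chunk with an explicit strategy
chunks = await processor.chunk_document(
    doc,
    ParagraphChunker(),  # illustrative strategy
    ChunkingConfig(chunk_size=1000)
)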
ChunkingStrategy
The ChunkingStrategy abstract base class defines the contract for chunking algorithms.
Location: portico.ports.document_processor.ChunkingStrategy
Key Methods
chunk_text
Split text into chunks according to the strategy.
Parameters:
- text: str - Text content to chunk
- config: ChunkingConfig - Chunking configuration
Returns: List of text chunks
Example:
from portico.ports.document_processor import ChunkingStrategy, ChunkingConfig
# Use a specific strategy
strategy = ParagraphChunker()
config = ChunkingConfig(chunk_size=1000)
chunks = strategy.chunk_text(long_text, config)
print(f"Split into {len(chunks)} chunks")
get_chunk_boundaries
Get start and end character positions for each chunk.
Parameters:
- text: str - Text content to analyze
- config: ChunkingConfig - Chunking configuration
Returns: List of (start_char, end_char) tuples for each chunk
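A brief sketch, reusing the illustrative ParagraphChunker from above to preview chunk spans without materializing ProcessedChunk objects:
boundaries = strategy.get_chunk_boundaries(long_text, config)
for start, end in boundaries:
    print(f"Chunk spans {start}-{end}: {long_text[start:end][:40]}...")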
strategy_name
Name of the chunking strategy. Returns strategy identifier.
Common Patterns
Basic Document Processing for RAG
from datetime import UTC, datetime

from portico.ports.document_processor import (
    DocumentProcessor,
    DocumentContent,
    ChunkingConfig,
    ContentType,
    ProcessedDocument,
)

async def prepare_document_for_rag(
    processor: DocumentProcessor,
    text: str,
    title: str,
    source_url: str
) -> ProcessedDocument:
    """Prepare a document for a RAG system."""
    # Create document
    doc = DocumentContent(
        content=text,
        content_type=ContentType.TEXT,
        title=title,
        source_url=source_url,
        metadata={"indexed_at": datetime.now(UTC).isoformat()}
    )

    # Configure chunking for optimal RAG performance
    config = ChunkingConfig(
        chunk_size=1000,
        chunk_overlap=200,
        respect_sentence_boundaries=True,
        min_chunk_size=100,
        max_chunk_size=1500
    )

    # Process document
    processed = await processor.process_document(doc, config)

    # Log processing results
    print(f"Processed: {processed.original_document.title}")
    print(f"  Chunks: {processed.total_chunks}")
    print(f"  Avg size: {processed.average_chunk_size:.0f} chars")
    print(f"  Tokens: {processed.total_tokens}")

    return processed

# Usage
processed = await prepare_document_for_rag(
    processor,
    article_text,
    "API Documentation",
    "https://docs.example.com/api"
)

# Store chunks in vector database
for chunk in processed.chunks:
    await vector_store.add(
        text=chunk.content,
        metadata={
            "document_id": processed.original_document.id,
            "chunk_index": chunk.chunk_index,
            "source_url": processed.original_document.source_url
        }
    )
Content-Type Specific Processing
from portico.ports.document_processor import (
    ChunkingConfig,
    ContentType,
    DocumentProcessor,
    ProcessedDocument,
)

async def process_by_content_type(
    processor: DocumentProcessor,
    content: str,
    content_type: ContentType
) -> ProcessedDocument:
    """Process document with type-specific configuration."""
    # Different configs for different content types
    if content_type == ContentType.MARKDOWN:
        config = ChunkingConfig(
            chunk_size=1200,
            preserve_code_blocks=True,
            preserve_headers=True,
            respect_paragraph_boundaries=True
        )
    elif content_type == ContentType.HTML:
        # HTML also supports code-block preservation
        config = ChunkingConfig(
            chunk_size=800,
            preserve_code_blocks=True,
            respect_sentence_boundaries=False
        )
    elif content_type == ContentType.PDF:
        config = ChunkingConfig(
            chunk_size=1500,
            chunk_overlap=300,
            respect_paragraph_boundaries=True
        )
    else:
        config = ChunkingConfig()  # Default

    # Process with appropriate configuration
    processed = await processor.process_text(
        content,
        content_type=content_type,
        chunking_config=config
    )
    return processed

# Usage
markdown_processed = await process_by_content_type(
    processor,
    markdown_content,
    ContentType.MARKDOWN
)
Token-Aware Chunking for LLMs
from portico.ports.document_processor import DocumentProcessor, ChunkingConfig
async def chunk_for_llm_context(
    processor: DocumentProcessor,
    text: str,
    max_tokens_per_chunk: int = 500
):
    """Chunk document to fit LLM context windows."""
    # Estimate characters per token (rough approximation)
    chars_per_token = 4  # Average for English text
    target_chars = max_tokens_per_chunk * chars_per_token

    config = ChunkingConfig(
        chunk_size=target_chars,
        chunk_overlap=max(100, target_chars // 10),
        respect_sentence_boundaries=True,
        max_chunk_size=target_chars + 500
    )

    processed = await processor.process_text(text, chunking_config=config)

    # Verify chunks fit in token budget
    for chunk in processed.chunks:
        if chunk.token_count and chunk.token_count > max_tokens_per_chunk:
            print(f"Warning: Chunk {chunk.chunk_index} exceeds token limit")

    return processed

# Usage for GPT-3.5 context
processed = await chunk_for_llm_context(processor, document, max_tokens_per_chunk=500)

# Use chunks in LLM prompts
for chunk in processed.chunks:
    prompt = f"Analyze this text:\n\n{chunk.content}\n\nProvide summary:"
    response = await llm.complete(prompt)
Batch Document Processing
from typing import List, Optional

from portico.ports.document_processor import (
    ChunkingConfig,
    DocumentContent,
    DocumentProcessor,
    ProcessedDocument,
)

async def batch_process_documents(
    processor: DocumentProcessor,
    documents: List[DocumentContent],
    config: Optional[ChunkingConfig] = None
) -> List[ProcessedDocument]:
    """Process multiple documents in batch."""
    processed_docs = []
    failed_docs = []

    for doc in documents:
        try:
            processed = await processor.process_document(doc, config)
            processed_docs.append(processed)
            print(f"✓ Processed: {doc.title} ({processed.total_chunks} chunks)")
        except ValueError as e:
            print(f"✗ Failed: {doc.title} - {e}")
            failed_docs.append((doc, str(e)))
        except Exception as e:
            print(f"✗ Error: {doc.title} - {e}")
            failed_docs.append((doc, str(e)))

    # Summary
    print(f"\nProcessed: {len(processed_docs)}/{len(documents)} documents")
    print(f"Failed: {len(failed_docs)} documents")

    # Calculate total statistics
    total_chunks = sum(p.total_chunks for p in processed_docs)
    total_chars = sum(p.total_characters for p in processed_docs)
    print(f"Total chunks: {total_chunks}")
    print(f"Total characters: {total_chars:,}")

    return processed_docs

# Usage
docs = [
    DocumentContent(content=text1, title="Doc 1"),
    DocumentContent(content=text2, title="Doc 2"),
    DocumentContent(content=text3, title="Doc 3"),
]

config = ChunkingConfig(chunk_size=1000, chunk_overlap=200)
processed_batch = await batch_process_documents(processor, docs, config)
Integration with Kits
The Document Processor Port is used by the RAG Kit to prepare documents for vector storage and semantic search.
from portico import compose
from portico.ports.document_processor import ChunkingConfig
# Configure RAG kit with document processing
app = compose.webapp(
    database_url="sqlite+aiosqlite:///./app.db",
    kits=[
        compose.rag(
            llm_provider="openai",
            llm_api_key="sk-...",
            embedding_api_key="sk-...",
            # Document processing configuration
            chunking_config=ChunkingConfig(
                chunk_size=1000,
                chunk_overlap=200,
                respect_sentence_boundaries=True
            )
        )
    ]
)

await app.initialize()

# Access RAG service (uses document processor internally)
rag_service = app.kits["rag"].service

# Ingest document (automatically chunks it)
from portico.ports.document_processor import DocumentContent

doc = DocumentContent(
    content=long_article_text,
    title="Machine Learning Guide",
    source_url="https://example.com/ml-guide"
)
await rag_service.ingest_document(doc)
# Document is automatically chunked, embedded, and stored
# Query (searches across chunks)
results = await rag_service.query("What is supervised learning?")
The RAG Kit provides:
- Automatic document processing and chunking
- Vector embedding of chunks
- Semantic search across document chunks
- Context retrieval for LLM prompts
- Document and chunk management
See the Kits Overview for more information about using kits.
Best Practices
- Choose Appropriate Chunk Sizes: Match chunk size to your use case
# ✅ GOOD: Size matched to LLM context
ChunkingConfig(
    chunk_size=1000,   # ~250 tokens
    chunk_overlap=200  # 20% overlap for context
)

# ❌ BAD: Chunks too large for embedding models
ChunkingConfig(
    chunk_size=10000,  # Most embedding models max at ~512 tokens
    chunk_overlap=0
)
- Use Overlap for Context Preservation: Include overlap to avoid breaking semantic meaning
# ✅ GOOD: Overlap preserves context across chunks
ChunkingConfig(
    chunk_size=1000,
    chunk_overlap=200  # 20% overlap
)

# ❌ BAD: No overlap loses context at boundaries
ChunkingConfig(
    chunk_size=1000,
    chunk_overlap=0  # Sentences may be split
)
- Respect Content Boundaries: Keep semantic units together
# ✅ GOOD: Respect natural boundaries
ChunkingConfig(
    respect_sentence_boundaries=True,
    respect_paragraph_boundaries=True,
    preserve_code_blocks=True
)

# ❌ BAD: May split mid-sentence
ChunkingConfig(
    respect_sentence_boundaries=False,
    respect_paragraph_boundaries=False
)
- Set Reasonable Limits: Protect against resource exhaustion
# ✅ GOOD: Reasonable limits
DocumentProcessorConfig(
    max_document_size=10_000_000,  # 10MB
    max_chunks_per_document=1000,
    processing_timeout_seconds=30.0
)

# ❌ BAD: No limits
DocumentProcessorConfig(
    max_document_size=999_999_999,  # Too large
    max_chunks_per_document=99999
)
- Handle Content Types Appropriately: Use type-specific processing
# ✅ GOOD: Different configs for different types
if content_type == ContentType.MARKDOWN:
    config = ChunkingConfig(preserve_code_blocks=True, preserve_headers=True)
elif content_type == ContentType.TEXT:
    config = ChunkingConfig(respect_paragraph_boundaries=True)

# ❌ BAD: One config for all types
config = ChunkingConfig()  # Same for markdown, code, text, etc.
FAQs
What chunk size should I use?
Depends on your use case:
- RAG with embedding models: 500-1500 characters (~125-375 tokens) - fits most embedding model limits
- LLM context augmentation: 1000-2000 characters (~250-500 tokens) - provides good context without overwhelming
- Detailed analysis: 2000-4000 characters (~500-1000 tokens) - preserves more context per chunk
# For RAG/embeddings (most common)
ChunkingConfig(chunk_size=1000, chunk_overlap=200)
# For LLM context windows
ChunkingConfig(chunk_size=2000, chunk_overlap=400)
How much overlap should I use?
Typically 10-20% of chunk size:
# Standard overlap (20%)
ChunkingConfig(chunk_size=1000, chunk_overlap=200)
# More overlap for critical context preservation (30%)
ChunkingConfig(chunk_size=1000, chunk_overlap=300)
# Minimal overlap for storage efficiency (10%)
ChunkingConfig(chunk_size=1000, chunk_overlap=100)
How do I estimate token counts accurately?
Use the estimate_token_count() method with your target model:
# Estimate for specific model
token_count = processor.estimate_token_count(
    text,
    model="gpt-4"  # Use your target model
)

# Configure processor with your model
config = DocumentProcessorConfig(
    default_tokenizer_model="gpt-4",
    tokens_per_chunk_target=500
)
Can I implement custom chunking strategies?
Yes! Implement the ChunkingStrategy interface:
from typing import List

from portico.ports.document_processor import ChunkingStrategy, ChunkingConfig

class CustomChunker(ChunkingStrategy):
    @property
    def strategy_name(self) -> str:
        return "custom"

    def chunk_text(self, text: str, config: ChunkingConfig) -> List[str]:
        # Your custom chunking logic
        chunks = []
        # ... implement your strategy
        return chunks

    def get_chunk_boundaries(
        self,
        text: str,
        config: ChunkingConfig
    ) -> List[tuple[int, int]]:
        # Return (start, end) positions for each chunk
        boundaries = []
        # ... calculate boundaries
        return boundaries

# Use custom strategy
custom_chunker = CustomChunker()
chunks = await processor.chunk_document(doc, custom_chunker, config)
How do I handle very large documents?
Use the configuration limits:
config = DocumentProcessorConfig(
    max_document_size=50_000_000,  # 50MB limit
    max_chunks_per_document=5000,
    processing_timeout_seconds=120.0
)

try:
    processed = await processor.process_document(huge_doc)
except ValueError as e:
    print(f"Document too large: {e}")
    # Handle: split into smaller docs, summarize first, etc.
How do I preserve code blocks in markdown?
Enable code block preservation:
config = ChunkingConfig(
    chunk_size=1200,
    preserve_code_blocks=True,  # Don't split code blocks
    preserve_headers=True,      # Include headers in metadata
    respect_paragraph_boundaries=True
)

processed = await processor.process_text(
    markdown_with_code,
    content_type=ContentType.MARKDOWN,
    chunking_config=config
)
How do I access chunk metadata?
Each ProcessedChunk includes metadata and position information:
for chunk in processed.chunks:
    print(f"Chunk {chunk.chunk_index}:")
    print(f"  Position: {chunk.start_char}-{chunk.end_char}")
    print(f"  Tokens: {chunk.token_count}")
    print(f"  Language: {chunk.language}")
    print(f"  Metadata: {chunk.metadata}")

    # Use in vector storage
    await vector_store.add(
        text=chunk.content,
        metadata={
            "doc_id": chunk.document_id,
            "chunk_idx": chunk.chunk_index,
            "start": chunk.start_char,
            "end": chunk.end_char,
            **chunk.metadata
        }
    )