Chunk Module
Text chunking utilities for processing large documents.
- class kerb.chunk.Chunker
Bases: ABC
Abstract base class for all chunker implementations.
All chunker classes should inherit from this base class and implement the chunk method.
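The subclassing pattern described above can be sketched as follows. This is a minimal illustration, not kerb's code: the `Chunker` class here is a local stand-in so the example runs on its own (the real `kerb.chunk.Chunker` may declare more than a single `chunk` method), and `LineChunker` is a hypothetical subclass invented for the example.

```python
from abc import ABC, abstractmethod
from typing import List


class Chunker(ABC):
    """Local stand-in for kerb.chunk.Chunker (assumption: one abstract method)."""

    @abstractmethod
    def chunk(self, text: str) -> List[str]:
        """Split text into a list of chunks."""


class LineChunker(Chunker):
    """Hypothetical subclass: groups a fixed number of lines per chunk."""

    def __init__(self, lines_per_chunk: int = 2) -> None:
        if lines_per_chunk < 1:
            raise ValueError("lines_per_chunk must be at least 1")
        self.lines_per_chunk = lines_per_chunk

    def chunk(self, text: str) -> List[str]:
        lines = text.splitlines()
        step = self.lines_per_chunk
        # Join each group of `step` consecutive lines back into one chunk.
        return ["\n".join(lines[i:i + step]) for i in range(0, len(lines), step)]


chunks = LineChunker(lines_per_chunk=2).chunk("a\nb\nc\nd\ne")
```

Because `chunk` is abstract, instantiating the base class directly raises `TypeError`; only subclasses that implement `chunk` can be created.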
- kerb.chunk.chunk_text(text, chunk_size=1000, overlap=0)
Simple utility function to split text into chunks of specified size.
This is a convenience function for basic chunking needs without creating a chunker instance.
- Parameters:
- Returns:
List of text chunks
- Return type:
Examples
>>> text = "Your long document here..."
>>> chunks = chunk_text(text, chunk_size=500, overlap=50)
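To make the interaction between `chunk_size` and `overlap` concrete, here is a minimal sketch of fixed-size chunking. It assumes sizes are measured in characters and that each chunk starts `chunk_size - overlap` characters after the previous one; `chunk_text_sketch` is a hypothetical name, and kerb's own `chunk_text` may handle edge cases differently.

```python
from typing import List


def chunk_text_sketch(text: str, chunk_size: int = 1000, overlap: int = 0) -> List[str]:
    """Fixed-size character chunking with overlap (illustrative, not kerb's code)."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # distance between consecutive chunk starts
    chunks: List[str] = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once a chunk reaches the end, so no tail-only duplicate is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks


example = chunk_text_sketch("abcdefghij", chunk_size=4, overlap=1)
```

With `chunk_size=4` and `overlap=1`, each chunk repeats the last character of the previous chunk.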
- class kerb.chunk.RecursiveChunker(chunk_size=1000, separators=None)
Bases: Chunker
Recursively split text using a hierarchy of separators.
Tries to split on larger semantic boundaries first (paragraphs, sentences) before falling back to character-level splitting. Similar to LangChain’s RecursiveCharacterTextSplitter.
- Parameters:
Examples
>>> chunker = RecursiveChunker(chunk_size=500)
>>> chunks = chunker.chunk("Your long text here...")
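The separator hierarchy can be illustrated with a short sketch. It is not kerb's implementation: the function name `recursive_split` and the default separator list are assumptions chosen for the example, and the real `RecursiveChunker` may use different defaults when `separators=None`.

```python
from typing import List, Optional


def recursive_split(text: str, chunk_size: int = 1000,
                    separators: Optional[List[str]] = None) -> List[str]:
    """Split on the coarsest separator first, recursing into oversized pieces."""
    if separators is None:
        # Assumed hierarchy: paragraphs, lines, sentences, words.
        separators = ["\n\n", "\n", ". ", " "]
    if len(text) <= chunk_size:
        return [text] if text else []
    if not separators:
        # No separators left: fall back to character-level slicing.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    if sep not in text:
        return recursive_split(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for piece in text.split(sep):
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: retry with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


pieces = recursive_split("aaaa bbbb\n\ncccc dddd eeee", chunk_size=10)
```

Here the paragraph break is tried first; the second paragraph is still too long, so it is split again on spaces.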
- class kerb.chunk.SentenceChunker(window_sentences=5, overlap_sentences=1)
Bases: Chunker
Split text into chunks based on sentence boundaries with optional overlap.
- Parameters:
Examples
>>> chunker = SentenceChunker(window_sentences=3, overlap_sentences=1)
>>> chunks = chunker.chunk("First sentence. Second sentence. Third sentence.")
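A rough sketch of sentence windows with overlap is shown below. The regex-based sentence splitter and the name `sentence_windows` are assumptions made for illustration; kerb may detect sentence boundaries differently (abbreviations such as "e.g." would confuse this simple splitter).

```python
import re
from typing import List


def sentence_windows(text: str, window_sentences: int = 5,
                     overlap_sentences: int = 1) -> List[str]:
    """Group sentences into windows that share `overlap_sentences` sentences."""
    if window_sentences < 1:
        raise ValueError("window_sentences must be at least 1")
    if not 0 <= overlap_sentences < window_sentences:
        raise ValueError("overlap_sentences must be smaller than window_sentences")

    # Naive boundary detection: split after '.', '!' or '?' followed by whitespace.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = window_sentences - overlap_sentences
    chunks: List[str] = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window_sentences]))
        if start + window_sentences >= len(sentences):
            break  # the last window already reaches the final sentence
    return chunks


windows = sentence_windows("One. Two. Three. Four. Five.",
                           window_sentences=3, overlap_sentences=1)
```

The sentence "Three." appears in both windows, which is the one-sentence overlap requested.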
- class kerb.chunk.SemanticChunker(sentences_per_chunk=3)
Bases: Chunker
Split text into semantic chunks based on sentences.
This chunker groups sentences together into chunks, attempting to maintain semantic coherence by keeping related sentences together.
- Parameters:
sentences_per_chunk (int) – Number of sentences per chunk. Defaults to 3.
Examples
>>> chunker = SemanticChunker(sentences_per_chunk=5)
>>> chunks = chunker.chunk("Your text here...")
- class kerb.chunk.CodeChunker(max_chunk_size=1000, language='python')
Bases: Chunker
Split code into chunks while respecting code structure.
Attempts to split on function/class boundaries to maintain semantic coherence.
- Parameters:
Examples
>>> chunker = CodeChunker(max_chunk_size=500, language="python")
>>> chunks = chunker.chunk(code_text)
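One way to split Python source on function and class boundaries is to use the standard `ast` module, sketched below. This only illustrates the general idea and is not how `CodeChunker` is known to work: the function name `split_python_top_level` is hypothetical, no `max_chunk_size` limit is applied, and comments or blank lines between top-level statements are dropped. It requires Python 3.8+ for `end_lineno`.

```python
import ast
from typing import List


def split_python_top_level(source: str) -> List[str]:
    """Return one chunk per top-level statement (imports, functions, classes)."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks: List[str] = []
    for node in tree.body:
        start = node.lineno
        # Decorators sit above the def/class line, so include them in the chunk.
        decorators = getattr(node, "decorator_list", [])
        if decorators:
            start = min(d.lineno for d in decorators)
        chunks.append("\n".join(lines[start - 1:node.end_lineno]))
    return chunks


code_text = "import os\n\ndef f():\n    return 1\n\nclass A:\n    x = 2\n"
parts = split_python_top_level(code_text)
```

Each function or class body stays intact in a single chunk, which is the property the class description aims for.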
- class kerb.chunk.MarkdownChunker(max_chunk_size=1000)
Bases: Chunker
Split markdown text based on heading hierarchy.
Respects markdown structure by splitting on headers while trying to keep related content together.
- Parameters:
max_chunk_size (int) – Maximum size per chunk. Defaults to 1000.
Examples
>>> chunker = MarkdownChunker(max_chunk_size=500)
>>> chunks = chunker.chunk(markdown_text)
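Heading-based splitting can be sketched in a few lines. The helper below is hypothetical and deliberately simple: it starts a new section at every ATX heading (`#` through `######`), ignores `max_chunk_size`, and does not special-case fenced code blocks, so a `# comment` line inside a code sample would be mistaken for a heading. The real `MarkdownChunker` may behave differently on all three points.

```python
import re
from typing import List

# One to six '#' characters followed by whitespace marks an ATX heading.
HEADING = re.compile(r"#{1,6}\s")


def split_markdown_by_heading(markdown: str) -> List[str]:
    """Start a new section at every heading line (illustrative only)."""
    sections: List[str] = []
    current: List[str] = []
    for line in markdown.splitlines():
        if HEADING.match(line) and current:
            sections.append("\n".join(current).strip())
            current = []
        current.append(line)
    if current:
        sections.append("\n".join(current).strip())
    # Drop sections that contain only whitespace.
    return [s for s in sections if s]


markdown_text = "# Title\nintro\n## A\ntext a\n## B\ntext b"
sections = split_markdown_by_heading(markdown_text)
```

Each heading stays attached to the body text that follows it, which keeps related content together.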
- kerb.chunk.simple_chunker(text, chunk_size=1000, overlap=0)
Split text into chunks of specified size.
- kerb.chunk.overlap_chunker(text, chunk_size=1000, overlap_ratio=0.1)
Split text with proportional overlap between chunks.
- kerb.chunk.paragraph_chunker(text, max_paragraphs=3)
Split text into chunks based on paragraph boundaries.
- kerb.chunk.sliding_window_chunker(text, window_size=1000, stride=500)
Create chunks using a sliding window approach.
Similar to simple_chunker with overlap, but stride-based for more control. Common in NLP tasks and document processing pipelines.
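The relationship between stride and overlap is simple arithmetic: with a window of `window_size` and a step of `stride`, consecutive windows share `window_size - stride` units when `stride < window_size`. The sketch below computes window start offsets under the assumption that sizes are counted in characters; `window_starts` is a hypothetical helper, not part of kerb.

```python
from typing import List


def window_starts(text_length: int, window_size: int = 1000,
                  stride: int = 500) -> List[int]:
    """Start offsets of each sliding window over a text of `text_length` characters."""
    if window_size < 1 or stride < 1:
        raise ValueError("window_size and stride must be positive")
    starts: List[int] = []
    for start in range(0, text_length, stride):
        starts.append(start)
        if start + window_size >= text_length:
            break  # this window already covers the end of the text
    return starts


starts = window_starts(2500, window_size=1000, stride=500)
shared = 1000 - 500  # characters shared by consecutive windows
```

Setting `stride` equal to `window_size` gives non-overlapping windows, while a smaller stride increases the overlap.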
- kerb.chunk.token_based_chunker(text, max_tokens=512, tokenizer=None)
Split text based on token count.
Uses the specified tokenizer to estimate chunk sizes. For accurate token-based chunking with OpenAI models, ensure tiktoken is installed.
- Parameters:
- Returns:
List of token-based chunks
- Return type:
Examples
>>> from kerb.tokenizer import Tokenizer
>>> chunks = token_based_chunker(text, max_tokens=512, tokenizer=Tokenizer.CL100K_BASE)
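The idea of budgeting chunks by token count rather than character count can be sketched without any tokenizer dependency. The example below substitutes whitespace splitting for a real tokenizer, so its counts will not match tiktoken or `kerb.tokenizer`; `chunk_by_token_count` and its `tokenize`/`detokenize` parameters are hypothetical names used only for this illustration.

```python
from typing import Callable, List, Sequence


def chunk_by_token_count(
    text: str,
    max_tokens: int = 512,
    tokenize: Callable[[str], Sequence[str]] = str.split,
    detokenize: Callable[[Sequence[str]], str] = " ".join,
) -> List[str]:
    """Group tokens into chunks of at most `max_tokens` tokens each."""
    if max_tokens < 1:
        raise ValueError("max_tokens must be at least 1")
    tokens = tokenize(text)
    # Whitespace tokens are rejoined with single spaces, so the original
    # spacing is normalized; a real tokenizer would decode exactly.
    return [
        detokenize(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


token_chunks = chunk_by_token_count("a b c d e", max_tokens=2)
```

Swapping in a model-specific tokenizer only requires passing different `tokenize` and `detokenize` callables.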
- kerb.chunk.recursive_chunker(text, chunk_size=1000, separators=None)
Recursively split text using a hierarchy of separators.
Functional interface for RecursiveChunker.
- kerb.chunk.sentence_window_chunker(text, window_sentences=5, overlap_sentences=1)
Create overlapping chunks based on sentence boundaries.
Functional interface for SentenceChunker.
- kerb.chunk.merge_chunks(chunks, max_size=2000, separator='\\n\\n')
Merge smaller chunks together up to a maximum size.
Useful for optimizing chunk sizes after initial splitting or when dealing with many small chunks that could be combined for better efficiency.
- Parameters:
- Returns:
List of merged chunks
- Return type:
Examples
>>> small_chunks = ["chunk1", "chunk2", "chunk3"]
>>> merged = merge_chunks(small_chunks, max_size=100)
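A greedy merge is one plausible way to combine small chunks up to a size limit, sketched here. It assumes that the separator counts toward `max_size`, that the default separator is a blank line, and that a chunk already larger than `max_size` is passed through unchanged; `merge_small_chunks` is a hypothetical name, and kerb's `merge_chunks` may make different choices.

```python
from typing import List


def merge_small_chunks(chunks: List[str], max_size: int = 2000,
                       separator: str = "\n\n") -> List[str]:
    """Greedily join consecutive chunks while the result fits in `max_size`."""
    merged: List[str] = []
    current = ""
    for chunk in chunks:
        if not current:
            current = chunk
        elif len(current) + len(separator) + len(chunk) <= max_size:
            current = current + separator + chunk  # still fits: keep merging
        else:
            merged.append(current)  # would overflow: flush and start over
            current = chunk
    if current:
        merged.append(current)
    return merged


merged_example = merge_small_chunks(["aaaa", "bbbb", "cccc"], max_size=10)
```

The first two chunks plus the separator are exactly 10 characters, so they merge; adding the third would exceed the limit.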
- kerb.chunk.optimize_chunk_size(text, target_size=1000, tolerance=0.2)
Calculate an optimized chunk size based on text length and target.
Adjusts the chunk size to minimize uneven chunks and ensure better distribution of content across chunks.
- Parameters:
- Returns:
Optimized chunk size
- Return type:
Examples
>>> text = "Your long document..."
>>> optimal_size = optimize_chunk_size(text, target_size=500, tolerance=0.15)
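To show what "minimizing uneven chunks" can mean in practice, the sketch below picks a chunk count close to `length / target_size` and divides the text evenly, then clamps the result to the tolerance band around the target. The exact formula kerb uses is not documented here, so this is one plausible interpretation; `balanced_chunk_size` is a hypothetical name.

```python
import math


def balanced_chunk_size(text_length: int, target_size: int = 1000,
                        tolerance: float = 0.2) -> int:
    """Chunk size that spreads `text_length` evenly, kept near `target_size`."""
    if text_length <= 0:
        return target_size
    # Number of chunks closest to the target, at least one.
    n_chunks = max(1, round(text_length / target_size))
    even_size = math.ceil(text_length / n_chunks)
    # Keep the result within target_size * (1 +/- tolerance).
    lower = math.floor(target_size * (1 - tolerance))
    upper = math.ceil(target_size * (1 + tolerance))
    return min(max(even_size, lower), upper)


size = balanced_chunk_size(2300, target_size=1000, tolerance=0.2)
```

For a 2300-character text, a fixed size of 1000 would leave a 300-character remainder, whereas 1150 yields two equal chunks.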
- kerb.chunk.custom_chunker(text, chunk_size=1000, split_fn=None)
Split text using a custom splitting function.
Provides flexibility for domain-specific chunking strategies.
- Parameters:
- Returns:
List of custom-split chunks
- Return type:
Examples
>>> def my_splitter(text):
...     return text.split('|')  # Split on custom delimiter
>>> chunks = custom_chunker(text, split_fn=my_splitter)
Text chunking utilities for optimal context windows and retrieval.