Chunk Module

Text chunking utilities for processing large documents.

class kerb.chunk.Chunker

Bases: ABC

Abstract base class for all chunker implementations.

All chunker classes should inherit from this base class and implement the chunk method.

abstractmethod chunk(text)

Split text into chunks.

Parameters: text (str) – The text to chunk
Returns: List of text chunks
Return type: List[str]
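
As a minimal sketch of the subclassing contract described above, the block below defines a stand-in Chunker base class (reproduced locally so the snippet runs without the library) and a hypothetical LineChunker subclass. Both the stand-in and LineChunker are illustrative assumptions and are not part of kerb.chunk.

```python
from abc import ABC, abstractmethod
from typing import List


class Chunker(ABC):
    """Local stand-in mirroring the documented abstract base class."""

    @abstractmethod
    def chunk(self, text: str) -> List[str]:
        """Split text into chunks."""


class LineChunker(Chunker):
    """Hypothetical subclass: one chunk per non-empty line."""

    def chunk(self, text: str) -> List[str]:
        # Strip surrounding whitespace and drop blank lines.
        return [line.strip() for line in text.splitlines() if line.strip()]


chunks = LineChunker().chunk("alpha\n\nbeta\ngamma\n")
```

Because chunk is abstract, instantiating the base class directly raises TypeError; only subclasses that implement chunk can be created.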

kerb.chunk.chunk_text(text, chunk_size=1000, overlap=0)

Simple utility function to split text into chunks of specified size.

This is a convenience function for basic chunking needs without creating a chunker instance.

Parameters:

text (str) – The text to chunk
chunk_size (int) – Maximum size of each chunk. Defaults to 1000.
overlap (int) – Number of characters to overlap between chunks. Defaults to 0.

Returns:

List of text chunks

Return type:

List[str]

Examples

>>> text = "Your long document here..."
>>> chunks = chunk_text(text, chunk_size=500, overlap=50)
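
To make the interaction between chunk_size and overlap concrete, here is a hedged, self-contained sketch of fixed-size chunking with character overlap. The name fixed_size_chunks and the edge-case handling (rejecting an overlap that is not smaller than chunk_size, stopping once the end of the text is covered) are assumptions for illustration and may differ from what chunk_text actually does.

```python
from typing import List


def fixed_size_chunks(text: str, chunk_size: int = 1000, overlap: int = 0) -> List[str]:
    """Illustrative fixed-size chunking; not the library implementation."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far each new chunk starts after the last
    chunks: List[str] = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the end of the text is already covered
        start += step
    return chunks


demo = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
```

In this sketch each chunk repeats the last overlap characters of the previous one, so demo contains "abcd", "defg" and "ghij".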

class kerb.chunk.RecursiveChunker(chunk_size=1000, separators=None)

Bases: Chunker

Recursively split text using a hierarchy of separators.

Tries to split on larger semantic boundaries first (paragraphs, sentences) before falling back to character-level splitting. Similar to LangChain’s RecursiveCharacterTextSplitter.

Parameters:

chunk_size (int) – Target size for each chunk. Defaults to 1000.
separators (Optional[List[str]]) – List of separators in priority order. Defaults to ['\n\n', '\n', '. ', ' ', ''].

Examples

>>> chunker = RecursiveChunker(chunk_size=500)
>>> chunks = chunker.chunk("Your long text here...")

__init__(chunk_size=1000, separators=None)

chunk(text)

Split text into chunks recursively.

Parameters: text (str) – The text to chunk
Returns: List of recursively split chunks
Return type: List[str]
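
The separator hierarchy can be sketched roughly as follows. This is an illustrative reimplementation under stated assumptions, not the RecursiveChunker source: the function name recursive_split, the greedy re-joining of pieces, and the fact that a separator is dropped wherever a split happens are all choices made for this example.

```python
from typing import List, Optional

DEFAULT_SEPARATORS = ["\n\n", "\n", ". ", " ", ""]


def recursive_split(text: str, chunk_size: int = 1000,
                    separators: Optional[List[str]] = None) -> List[str]:
    """Illustrative recursive splitting; not the library implementation."""
    seps = DEFAULT_SEPARATORS if separators is None else separators
    if len(text) <= chunk_size:
        return [text] if text else []
    if not seps or seps[0] == "":
        # Last resort: cut at fixed character offsets.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = seps[0], seps[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks: List[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep packing pieces into the current chunk
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current:
        chunks.append(current)
    return chunks


sample = "Para one is here.\n\nPara two is a bit longer. It has two sentences."
parts = recursive_split(sample, chunk_size=30)
```

Here the paragraph break is tried first; only the second paragraph, which exceeds 30 characters, is split again at the sentence separator.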

class kerb.chunk.SentenceChunker(window_sentences=5, overlap_sentences=1)

Bases: Chunker

Split text into chunks based on sentence boundaries with optional overlap.

Parameters:

window_sentences (int) – Number of sentences per chunk. Defaults to 5.
overlap_sentences (int) – Number of sentences to overlap. Defaults to 1.

Examples

>>> chunker = SentenceChunker(window_sentences=3, overlap_sentences=1)
>>> chunks = chunker.chunk("First sentence. Second sentence. Third sentence.")

__init__(window_sentences=5, overlap_sentences=1)

chunk(text)

Split text into sentence-based chunks with overlap.

Parameters: text (str) – The text to chunk
Returns: List of sentence-windowed chunks
Return type: List[str]
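
A sentence window with overlap might look like the sketch below. The regular expression used to find sentence boundaries and the name sentence_windows are assumptions for illustration; the library may detect sentences differently.

```python
import re
from typing import List


def sentence_windows(text: str, window_sentences: int = 5,
                     overlap_sentences: int = 1) -> List[str]:
    """Illustrative sentence windowing; not the library implementation."""
    if window_sentences <= 0:
        raise ValueError("window_sentences must be positive")
    if not 0 <= overlap_sentences < window_sentences:
        raise ValueError("overlap_sentences must be smaller than window_sentences")

    # Naive boundary rule: whitespace that follows '.', '!' or '?'.
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    step = window_sentences - overlap_sentences

    chunks: List[str] = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window_sentences]))
        if start + window_sentences >= len(sentences):
            break  # the last sentence is already included
    return chunks


windows = sentence_windows("One. Two. Three. Four. Five.",
                           window_sentences=3, overlap_sentences=1)
```

With three sentences per window and one sentence of overlap, the sentence "Three." closes the first window and opens the second.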

class kerb.chunk.SemanticChunker(sentences_per_chunk=3)

Bases: Chunker

Split text into semantic chunks based on sentences.

This chunker groups sentences together into chunks, attempting to maintain semantic coherence by keeping related sentences together.

Parameters: sentences_per_chunk (int) – Number of sentences per chunk. Defaults to 3.

Examples

>>> chunker = SemanticChunker(sentences_per_chunk=5)
>>> chunks = chunker.chunk("Your text here...")

__init__(sentences_per_chunk=3)

chunk(text)

Split text into semantic chunks.

Parameters: text (str) – The text to chunk
Returns: List of semantic text chunks
Return type: List[str]

class kerb.chunk.CodeChunker(max_chunk_size=1000, language='python')

Bases: Chunker

Split code into chunks while respecting code structure.

Attempts to split on function/class boundaries to maintain semantic coherence.

Parameters:

max_chunk_size (int) – Maximum size per chunk. Defaults to 1000.
language (str) – Programming language (for language-specific handling). Defaults to "python".

Examples

>>> chunker = CodeChunker(max_chunk_size=500, language="python")
>>> chunks = chunker.chunk(code_text)

__init__(max_chunk_size=1000, language='python')

chunk(text)

Split code into chunks.

Parameters: text (str) – Code text to chunk
Returns: List of code chunks
Return type: List[str]
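
One plausible way to split on function/class boundaries for Python source is sketched below. It is an assumption-laden illustration, not the CodeChunker implementation: split_on_definitions is a hypothetical name, only top-level def, async def and class lines count as boundaries, decorators are not kept with their definitions, and a single definition longer than max_chunk_size is left whole.

```python
import re
from typing import List

# Top-level definitions only: the keyword must start at column 0.
DEFINITION = re.compile(r"^(?:async\s+def|def|class)\s", flags=re.MULTILINE)


def split_on_definitions(code: str, max_chunk_size: int = 1000) -> List[str]:
    """Illustrative structure-aware code splitting; not the library implementation."""
    starts = [m.start() for m in DEFINITION.finditer(code)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # keep any preamble such as imports
    blocks = [code[a:b] for a, b in zip(starts, starts[1:] + [len(code)])]

    chunks: List[str] = []
    current = ""
    for block in blocks:
        # Start a new chunk when adding this block would exceed the limit.
        if current and len(current) + len(block) > max_chunk_size:
            chunks.append(current)
            current = ""
        current += block
    if current:
        chunks.append(current)
    return chunks


source = "import os\n\ndef a():\n    return 1\n\nclass B:\n    pass\n"
code_chunks = split_on_definitions(source, max_chunk_size=25)
```

Because blocks are only ever cut at definition lines, concatenating the chunks reproduces the original source exactly.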

class kerb.chunk.MarkdownChunker(max_chunk_size=1000)

Bases: Chunker

Split markdown text based on heading hierarchy.

Respects markdown structure by splitting on headers while trying to keep related content together.

Parameters: max_chunk_size (int) – Maximum size per chunk. Defaults to 1000.

Examples

>>> chunker = MarkdownChunker(max_chunk_size=500)
>>> chunks = chunker.chunk(markdown_text)

__init__(max_chunk_size=1000)

chunk(text)

Split markdown text into chunks.

Parameters: text (str) – Markdown text to chunk
Returns: List of markdown-aware chunks
Return type: List[str]
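
Heading-based splitting can be approximated as in the following sketch. The name split_markdown, the restriction to ATX headings (lines starting with one to six # characters), and the fallback of splitting an oversized section on blank lines are assumptions for illustration; the sketch also does not recognize # characters inside fenced code blocks, which a real implementation would need to handle.

```python
import re
from typing import List

# ATX headings: one to six '#' characters at the start of a line.
HEADING = re.compile(r"^#{1,6}\s", flags=re.MULTILINE)


def split_markdown(text: str, max_chunk_size: int = 1000) -> List[str]:
    """Illustrative heading-aware splitting; not the library implementation."""
    starts = [m.start() for m in HEADING.finditer(text)]
    if not starts or starts[0] != 0:
        starts.insert(0, 0)  # content before the first heading
    sections = [text[a:b].strip() for a, b in zip(starts, starts[1:] + [len(text)])]

    chunks: List[str] = []
    for section in sections:
        if not section:
            continue
        if len(section) <= max_chunk_size:
            chunks.append(section)  # heading stays with its body
        else:
            # Oversized section: fall back to paragraph boundaries.
            chunks.extend(p.strip() for p in section.split("\n\n") if p.strip())
    return chunks


md = "# Title\n\nIntro text.\n\n## Usage\n\nRun it.\n"
md_chunks = split_markdown(md)
```

Each resulting chunk begins with its own heading, which keeps the heading attached to the content it introduces.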

kerb.chunk.simple_chunker(text, chunk_size=1000, overlap=0)

Split text into chunks of specified size.

Parameters:

text (str) – The text to chunk
chunk_size (int) – Maximum size of each chunk. Defaults to 1000.
overlap (int) – Number of characters to overlap between chunks. Defaults to 0.

Returns:

List of text chunks

Return type:

List[str]

kerb.chunk.overlap_chunker(text, chunk_size=1000, overlap_ratio=0.1)

Split text with proportional overlap between chunks.

Parameters:

text (str) – The text to chunk
chunk_size (int) – Maximum size of each chunk. Defaults to 1000.
overlap_ratio (float) – Proportion of chunk to overlap (0.0-1.0). Defaults to 0.1.

Returns:

List of overlapping text chunks

Return type:

List[str]
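
The arithmetic behind a proportional overlap can be sketched as below. The helper overlap_from_ratio is hypothetical, and both the truncation to an integer and the rejection of a ratio of exactly 1.0 (which would leave no forward progress between chunks) are assumptions of this example rather than documented behavior.

```python
from typing import Tuple


def overlap_from_ratio(chunk_size: int, overlap_ratio: float = 0.1) -> Tuple[int, int]:
    """Return (overlap_chars, stride) for a proportional overlap. Illustrative only."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0.0 <= overlap_ratio < 1.0:
        raise ValueError("overlap_ratio must satisfy 0.0 <= ratio < 1.0 in this sketch")
    overlap = int(chunk_size * overlap_ratio)  # characters shared by neighbours
    stride = chunk_size - overlap              # distance between chunk starts
    return overlap, stride


overlap_chars, stride = overlap_from_ratio(1000, 0.1)
```

With the default arguments, consecutive chunks would share 100 characters and start 900 characters apart.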

kerb.chunk.paragraph_chunker(text, max_paragraphs=3)

Split text into chunks based on paragraph boundaries.

Parameters:

text (str) – The text to chunk
max_paragraphs (int) – Maximum number of paragraphs per chunk. Defaults to 3.

Returns:

List of paragraph-based chunks

Return type:

List[str]
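
A minimal sketch of paragraph grouping, assuming paragraphs are separated by one or more blank lines and are re-joined with a single blank line. The name group_paragraphs and both of those conventions are assumptions for illustration.

```python
import re
from typing import List


def group_paragraphs(text: str, max_paragraphs: int = 3) -> List[str]:
    """Illustrative paragraph grouping; not the library implementation."""
    if max_paragraphs <= 0:
        raise ValueError("max_paragraphs must be positive")
    # A paragraph break is a newline, optional whitespace, then another newline.
    paragraphs = [p.strip() for p in re.split(r"\n\s*\n", text) if p.strip()]
    return [
        "\n\n".join(paragraphs[i:i + max_paragraphs])
        for i in range(0, len(paragraphs), max_paragraphs)
    ]


groups = group_paragraphs("A\n\nB\n\nC\n\nD", max_paragraphs=3)
```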

kerb.chunk.sliding_window_chunker(text, window_size=1000, stride=500)

Create chunks using a sliding window approach.

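Similar to simple_chunker with overlap, but the step between windows is given directly: each window starts stride characters after the previous one, so a stride smaller than window_size produces overlapping windows. Common in NLP tasks and document processing pipelines.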

Parameters:

text (str) – The text to chunk
window_size (int) – Size of each window/chunk. Defaults to 1000.
stride (int) – Number of characters to move forward for next window. Defaults to 500.

Returns:

List of sliding window chunks

Return type:

List[str]
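
The relationship between window_size and stride can be illustrated with the sketch below. The name sliding_windows and the choice to stop once a window reaches the end of the text are assumptions made for this example.

```python
from typing import List


def sliding_windows(text: str, window_size: int = 1000, stride: int = 500) -> List[str]:
    """Illustrative stride-based windowing; not the library implementation."""
    if window_size <= 0 or stride <= 0:
        raise ValueError("window_size and stride must be positive")
    windows: List[str] = []
    for start in range(0, len(text), stride):
        windows.append(text[start:start + window_size])
        if start + window_size >= len(text):
            break  # this window already reaches the end of the text
    return windows


overlapping = sliding_windows("abcdefghij", window_size=4, stride=2)
disjoint = sliding_windows("abcdefghij", window_size=4, stride=4)
```

A stride of 2 with a window of 4 shares two characters between neighbours, while a stride equal to the window size yields non-overlapping chunks.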

kerb.chunk.token_based_chunker(text, max_tokens=512, tokenizer=None)

Split text based on token count.

Uses the specified tokenizer to estimate chunk sizes. For accurate token-based chunking with OpenAI models, ensure tiktoken is installed.

Parameters:

text (str) – The text to chunk
max_tokens (int) – Maximum tokens per chunk. Defaults to 512.
tokenizer – Tokenizer to use for estimation. If None, uses character approximation.

Returns:

List of token-based chunks

Return type:

List[str]

Examples

>>> from kerb.tokenizer import Tokenizer
>>> chunks = token_based_chunker(text, max_tokens=512, tokenizer=Tokenizer.CL100K_BASE)
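
To show the idea of packing text under a token budget without depending on a specific tokenizer, the sketch below accepts any counting function. Everything here is an assumption for illustration: the name chunk_by_token_budget, the count_tokens parameter, the word-by-word packing, and the fallback of roughly four characters per token, which is only a loose rule of thumb for English text and is not necessarily the approximation the library uses.

```python
from typing import Callable, List, Optional


def chunk_by_token_budget(text: str, max_tokens: int = 512,
                          count_tokens: Optional[Callable[[str], int]] = None) -> List[str]:
    """Illustrative token-budget chunking; not the library implementation."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    if count_tokens is None:
        # Assumed fallback: about four characters per token.
        count_tokens = lambda s: max(1, len(s) // 4)

    chunks: List[str] = []
    current: List[str] = []
    for word in text.split():
        candidate = " ".join(current + [word])
        if current and count_tokens(candidate) > max_tokens:
            chunks.append(" ".join(current))  # budget exceeded: close the chunk
            current = [word]
        else:
            current.append(word)
    if current:
        chunks.append(" ".join(current))
    return chunks


# Count whitespace-separated words as a stand-in for a real tokenizer.
by_words = chunk_by_token_budget("one two three four five", max_tokens=2,
                                 count_tokens=lambda s: len(s.split()))
```

Passing a counting function backed by a real tokenizer in place of the word counter would give budgets that match a particular model more closely.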

kerb.chunk.recursive_chunker(text, chunk_size=1000, separators=None)

Recursively split text using a hierarchy of separators.

Functional interface for RecursiveChunker.

Parameters:

text (str) – The text to chunk
chunk_size (int) – Target size for each chunk. Defaults to 1000.
separators (Optional[List[str]]) – List of separators in priority order. Defaults to ['\n\n', '\n', '. ', ' ', ''].

Returns:

List of recursively split chunks

Return type:

List[str]

kerb.chunk.sentence_window_chunker(text, window_sentences=5, overlap_sentences=1)

Create overlapping chunks based on sentence boundaries.

Functional interface for SentenceChunker.

Parameters:

text (str) – The text to chunk
window_sentences (int) – Number of sentences per chunk. Defaults to 5.
overlap_sentences (int) – Number of sentences to overlap. Defaults to 1.

Returns:

List of sentence-windowed chunks

Return type:

List[str]

kerb.chunk.merge_chunks(chunks, max_size=2000, separator='\n\n')

Merge smaller chunks together up to a maximum size.

Useful for optimizing chunk sizes after initial splitting or when dealing with many small chunks that could be combined for better efficiency.

Parameters:

chunks (List[str]) – List of text chunks to merge
max_size (int) – Maximum size of merged chunks. Defaults to 2000.
separator (str) – Separator to use when joining chunks. Defaults to '\n\n'.

Returns:

List of merged chunks

Return type:

List[str]

Examples

>>> small_chunks = ["chunk1", "chunk2", "chunk3"]
>>> merged = merge_chunks(small_chunks, max_size=100)
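
A greedy merge is one straightforward way to combine small chunks up to a size limit. The sketch below is an illustration under assumptions, not the merge_chunks source: merge_small_chunks is a hypothetical name, the separator length counts toward max_size, and a single chunk that already exceeds max_size is passed through unchanged.

```python
from typing import List


def merge_small_chunks(chunks: List[str], max_size: int = 2000,
                       separator: str = "\n\n") -> List[str]:
    """Illustrative greedy merging; not the library implementation."""
    merged: List[str] = []
    current = ""
    for chunk in chunks:
        if not current:
            current = chunk
        elif len(current) + len(separator) + len(chunk) <= max_size:
            current = current + separator + chunk  # still fits: keep merging
        else:
            merged.append(current)
            current = chunk
    if current:
        merged.append(current)
    return merged


combined = merge_small_chunks(["aaaa", "bbbb", "cccc"], max_size=10, separator="--")
```

The first two chunks fit together in exactly ten characters including the separator, so only the third starts a new merged chunk.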

kerb.chunk.optimize_chunk_size(text, target_size=1000, tolerance=0.2)

Calculate an optimized chunk size based on text length and target.

Adjusts the chunk size to minimize uneven chunks and ensure better distribution of content across chunks.

Parameters:

text (str) – The text to analyze
target_size (int) – Target chunk size. Defaults to 1000.
tolerance (float) – Acceptable variance from target (0.0-1.0). Defaults to 0.2.

Returns:

Optimized chunk size

Return type:

int

Examples

>>> text = "Your long document..."
>>> optimal_size = optimize_chunk_size(text, target_size=500, tolerance=0.15)
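
One way to reduce uneven chunks is to pick a size that divides the text into nearly equal parts while staying within the tolerance band around the target. The sketch below shows that idea only; balanced_chunk_size is a hypothetical helper that takes a text length rather than the text itself, and the rounding and clamping rules are assumptions, so its results need not match optimize_chunk_size.

```python
import math


def balanced_chunk_size(text_length: int, target_size: int = 1000,
                        tolerance: float = 0.2) -> int:
    """Illustrative balanced chunk size; not the library implementation."""
    if target_size <= 0:
        raise ValueError("target_size must be positive")
    if text_length <= target_size:
        return target_size  # a single chunk already holds everything

    n_chunks = max(1, round(text_length / target_size))
    size = math.ceil(text_length / n_chunks)  # spread the text evenly

    # Clamp the result to the allowed band around the target.
    low = math.floor(target_size * (1 - tolerance))
    high = math.ceil(target_size * (1 + tolerance))
    return min(max(size, low), high)


size = balanced_chunk_size(2300, target_size=1000, tolerance=0.2)
```

For 2300 characters and a target of 1000, a fixed size of 1000 would leave a trailing chunk of 300 characters, whereas the sketch returns 1150 and yields two equal chunks.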

kerb.chunk.custom_chunker(text, chunk_size=1000, split_fn=None)

Split text using a custom splitting function.

Provides flexibility for domain-specific chunking strategies.

Parameters:

text (str) – The text to chunk
chunk_size (int) – Target chunk size. Defaults to 1000.
split_fn (Optional[Callable[[str], List[str]]]) – Custom function that takes text and returns list of segments. If None, uses simple character-based splitting.

Returns:

List of custom-split chunks

Return type:

List[str]

Examples

>>> def my_splitter(text):
...     return text.split('|')  # Split on custom delimiter
>>> chunks = custom_chunker(text, split_fn=my_splitter)

Text chunking utilities for optimal context windows and retrieval.