Document Module

Document loading and processing utilities for LLM applications.

This module provides tools for loading, cleaning, preprocessing, and extracting metadata from documents in formats such as PDF, DOCX, HTML, Markdown, CSV, JSON, and XML.

Common Usage:

# Load documents (top-level)
from kerb.document import load_document, Document

# Specialized loaders (submodule)
from kerb.document.loaders import load_text, load_markdown

# Utilities (submodule)
from kerb.document.utils import detect_format, load_directory

# Text processing (submodules)
from kerb.document.extractors import extract_text_from_html
from kerb.document.cleaners import clean_text
from kerb.document.preprocessors import preprocess_pdf_text
from kerb.document.metadata import extract_metadata

Document Loading:

load_document() - Load any supported document (auto-detects format)

Submodules:

loaders - Format-specific document loaders (PDF, DOCX, HTML, etc.)
utils - Utilities for format detection, batch loading, and merging
extractors - Text extraction from various formats
cleaners - Text cleaning and normalization
preprocessors - Format-specific preprocessing
metadata - Metadata and entity extraction

Data Classes:

Document - Document with content and metadata (from kerb.core.types)
DocumentFormat - Enum of supported formats (from kerb.core.types)

class kerb.document.Document(content, metadata=<factory>, id=None, source=None, format=DocumentFormat.UNKNOWN, score=0.0, page_content=None)

Bases: object

Universal document representation across the toolkit.

Consolidates the Document classes from the document/ and retrieval/ packages into a single, consistent document representation.

content

The text content of the document

metadata

Additional metadata about the document

id

Optional unique identifier for the document

source

Optional source path or URL where document was loaded from

format

Document format (defaults to UNKNOWN)

score

Relevance score (used in retrieval contexts, defaults to 0.0)

page_content

Optional list of content per page (for multi-page documents)

Examples

>>> # Simple document
>>> doc = Document(content="Hello, world!")
>>> # Document with metadata
>>> doc = Document(
...     content="Important document",
...     metadata={"author": "John", "created": "2025-01-01"},
...     source="doc.txt"
... )
>>> # Retrieval result with score
>>> doc = Document(
...     id="doc_123",
...     content="Relevant content",
...     score=0.95
... )
content: str
metadata: Dict[str, Any]
id: str | None = None
source: str | None = None
format: DocumentFormat = 'unknown'
score: float = 0.0
page_content: List[str] | None = None
__len__()

Return the length of the document content.

Return type:

int

to_dict()

Convert document to dictionary.

Return type:

Dict[str, Any]

Returns:

Dictionary representation of the document

classmethod from_dict(data)

Create document from dictionary.

Parameters:

data (Dict[str, Any]) – Dictionary with document data

Return type:

Document

Returns:

New Document instance
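
The round trip between to_dict() and from_dict() can be pictured with a small stand-in. The sketch below is not the library's implementation: SketchDocument is a hypothetical dataclass that mirrors a few of the documented fields, and it assumes that to_dict() emits one key per field.

```python
from dataclasses import asdict, dataclass, field
from typing import Any, Dict, Optional


@dataclass
class SketchDocument:
    """Hypothetical stand-in mirroring some of the documented fields."""

    content: str
    metadata: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None
    source: Optional[str] = None
    score: float = 0.0

    def __len__(self) -> int:
        # The length of a document is the length of its text content.
        return len(self.content)

    def to_dict(self) -> Dict[str, Any]:
        # Assumption: one dictionary key per field.
        return asdict(self)

    @classmethod
    def from_dict(cls, data: Dict[str, Any]) -> "SketchDocument":
        return cls(**data)


original = SketchDocument(content="Hello, world!", metadata={"author": "John"})
restored = SketchDocument.from_dict(original.to_dict())
```

Under these assumptions the restored object compares equal to the original, which is the property a serialization round trip should preserve.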

__repr__()

String representation of the document.

Return type:

str

__init__(content, metadata=<factory>, id=None, source=None, format=DocumentFormat.UNKNOWN, score=0.0, page_content=None)
class kerb.document.DocumentFormat(*values)

Bases: Enum

Supported document formats.

PDF = 'pdf'
DOCX = 'docx'
DOC = 'doc'
HTML = 'html'
MARKDOWN = 'markdown'
TXT = 'txt'
CSV = 'csv'
JSON = 'json'
XML = 'xml'
RTF = 'rtf'
ODT = 'odt'
EPUB = 'epub'
UNKNOWN = 'unknown'
kerb.document.load_document(file_path, **kwargs)

Load a document from file, automatically detecting format.

This is the main entry point for loading documents. It detects the format and delegates to the appropriate loader.

Parameters:
  • file_path (str) – Path to the document file

  • **kwargs – Additional arguments passed to format-specific loaders

Returns:

Loaded document with content and metadata

Return type:

Document

Examples

>>> doc = load_document("report.pdf")
>>> print(doc.content[:100])
>>> doc = load_document("data.csv", parse_as_dict=True)
>>> print(doc.metadata['rows'])
kerb.document.load_text(file_path, encoding='utf-8')

Load a plain text file.

Parameters:
  • file_path (str) – Path to text file

  • encoding (str) – Text encoding. Defaults to 'utf-8'.

Returns:

Loaded document

Return type:

Document

Examples

>>> doc = load_text("notes.txt")
>>> print(doc.content)
kerb.document.load_markdown(file_path, extract_frontmatter=True)

Load a Markdown file.

Parameters:
  • file_path (str) – Path to markdown file

  • extract_frontmatter (bool) – Extract YAML frontmatter if present

Returns:

Loaded document with frontmatter in metadata

Return type:

Document

Examples

>>> doc = load_markdown("README.md")
>>> if 'frontmatter' in doc.metadata:
...     print(doc.metadata['frontmatter'])
kerb.document.load_json(file_path, as_string=False)

Load a JSON file.

Parameters:
  • file_path (str) – Path to JSON file

  • as_string (bool) – If True, return formatted JSON as string content. If False, store parsed object in metadata.

Returns:

Loaded document

Return type:

Document

Examples

>>> doc = load_json("data.json", as_string=True)
>>> print(doc.content)
>>> doc = load_json("config.json")
>>> print(doc.metadata['json_data'])
kerb.document.load_csv(file_path, parse_as_dict=True, encoding='utf-8')

Load a CSV file.

Parameters:
  • file_path (str) – Path to CSV file

  • parse_as_dict (bool) – Parse CSV and store structured data in metadata

  • encoding (str) – Text encoding

Returns:

Loaded document with CSV data in metadata

Return type:

Document

Examples

>>> doc = load_csv("data.csv")
>>> rows = doc.metadata['rows']
>>> headers = doc.metadata['headers']
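
The example above reads 'rows' and 'headers' from the metadata. The exact shape of those values is not stated here; the sketch below assumes 'headers' is a list of column names and 'rows' is a list with one dictionary per data row, and shows how such a structure can be built with the standard csv module. The function name parse_csv_text is hypothetical.

```python
import csv
import io
from typing import Any, Dict, List


def parse_csv_text(text: str) -> Dict[str, Any]:
    """Parse CSV text into header names and one dict per data row."""
    reader = csv.DictReader(io.StringIO(text))
    rows: List[Dict[str, str]] = list(reader)
    # fieldnames is None for empty input, so fall back to an empty list.
    headers = list(reader.fieldnames or [])
    return {"headers": headers, "rows": rows}


parsed = parse_csv_text("name,age\nAda,36\nAlan,41\n")
```

Note that the csv module returns every cell as a string; any numeric conversion would be a separate step.
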
kerb.document.load_xml(file_path, encoding='utf-8')

Load an XML file.

Parameters:
  • file_path (str) – Path to XML file

  • encoding (str) – Text encoding

Returns:

Loaded document

Return type:

Document

Examples

>>> doc = load_xml("data.xml")
>>> print(doc.content)
kerb.document.load_html(file_path, extract_text=True, encoding='utf-8')

Load an HTML file.

Parameters:
  • file_path (str) – Path to HTML file

  • extract_text (bool) – If True, extract plain text from HTML

  • encoding (str) – Text encoding

Returns:

Loaded document

Return type:

Document

Examples

>>> doc = load_html("page.html", extract_text=True)
>>> print(doc.content)  # Plain text without HTML tags
kerb.document.load_pdf(file_path, extract_images=False)

Load a PDF file.

Requires: pypdf or PyPDF2 package

Parameters:
  • file_path (str) – Path to PDF file

  • extract_images (bool) – Whether to extract image information

Returns:

Loaded document with page-by-page content

Return type:

Document

Examples

>>> doc = load_pdf("report.pdf")
>>> print(f"Pages: {doc.metadata['num_pages']}")
>>> print(doc.content)  # All pages concatenated
kerb.document.load_docx(file_path)

Load a DOCX file.

Requires: python-docx package

Parameters:

file_path (str) – Path to DOCX file

Returns:

Loaded document

Return type:

Document

Examples

>>> doc = load_docx("report.docx")
>>> print(doc.content)
kerb.document.detect_format(file_path)

Detect document format from file extension.

Parameters:

file_path (str) – Path to the file

Returns:

Detected format enum

Return type:

DocumentFormat

Examples

>>> detect_format("document.pdf")
DocumentFormat.PDF
>>> detect_format("notes.md")
DocumentFormat.MARKDOWN
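
Detection by file extension amounts to a lookup on the lowercased suffix. The following is a minimal sketch of that idea, not the library's code: SketchFormat, the extension table, and sketch_detect_format are hypothetical names, and the real mapping may cover more suffixes.

```python
from enum import Enum
from pathlib import Path


class SketchFormat(Enum):
    PDF = "pdf"
    MARKDOWN = "markdown"
    TXT = "txt"
    UNKNOWN = "unknown"


# Hypothetical extension table; an assumption made for illustration.
_EXTENSIONS = {
    ".pdf": SketchFormat.PDF,
    ".md": SketchFormat.MARKDOWN,
    ".markdown": SketchFormat.MARKDOWN,
    ".txt": SketchFormat.TXT,
}


def sketch_detect_format(file_path: str) -> SketchFormat:
    # Path.suffix keeps the leading dot; lowercase it so ".PDF" also matches.
    suffix = Path(file_path).suffix.lower()
    return _EXTENSIONS.get(suffix, SketchFormat.UNKNOWN)
```

Because only the name is inspected, a mislabeled file (for example a PDF saved as notes.txt) would be classified by its extension, not by its contents.
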
kerb.document.is_supported_format(file_path)

Check if file format is supported.

Parameters:

file_path (str) – Path to the file

Returns:

True if format is supported

Return type:

bool

kerb.document.load_directory(directory_path, pattern='*', recursive=False, max_files=None)

Load all supported documents from a directory.

Parameters:
  • directory_path (str) – Path to directory

  • pattern (str) – Glob pattern to match file names (e.g., "*.pdf", "*.txt")

  • recursive (bool) – Search subdirectories

  • max_files (Optional[int]) – Maximum number of files to load

Returns:

List of loaded documents

Return type:

List[Document]

Examples

>>> docs = load_directory("./documents", pattern="*.pdf")
>>> print(f"Loaded {len(docs)} documents")
>>> docs = load_directory("./data", recursive=True, max_files=100)
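
How pattern, recursive, and max_files might interact can be sketched with pathlib alone. This is an illustration under assumptions, not the library's implementation: sketch_list_files is a hypothetical helper that only selects files (it does not load them), and the sorted order is a choice made here for reproducibility.

```python
import tempfile
from pathlib import Path
from typing import List, Optional


def sketch_list_files(directory_path: str, pattern: str = "*",
                      recursive: bool = False,
                      max_files: Optional[int] = None) -> List[Path]:
    root = Path(directory_path)
    # rglob() descends into subdirectories; glob() stays at the top level.
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    files = sorted(p for p in matches if p.is_file())
    return files if max_files is None else files[:max_files]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "a.pdf").write_text("a")
    (root / "b.txt").write_text("b")
    (root / "sub").mkdir()
    (root / "sub" / "c.pdf").write_text("c")

    top_level = [p.name for p in sketch_list_files(tmp, "*.pdf")]
    nested = [p.name for p in sketch_list_files(tmp, "*.pdf", recursive=True)]
    capped = sketch_list_files(tmp, recursive=True, max_files=2)
```

In this sketch the cap is applied after filtering, so max_files counts matching files, not directory entries.
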
kerb.document.load_from_url(url, timeout=30, max_size_mb=100, max_retries=3, **kwargs)

Load document from a URL.

Requires: requests package

Parameters:
  • url (str) – URL to fetch document from

  • timeout (int) – Request timeout in seconds. Defaults to 30.

  • max_size_mb (float) – Maximum file size in MB. Defaults to 100.

  • max_retries (int) – Maximum number of retry attempts. Defaults to 3.

  • **kwargs – Additional arguments for requests.get()

Returns:

Loaded document

Return type:

Document

Raises:
  • ValueError – If content exceeds max_size_mb

  • requests.exceptions.Timeout – If request times out

  • requests.exceptions.HTTPError – If HTTP error occurs

Examples

>>> doc = load_from_url("https://example.com/document.pdf")
>>> print(doc.content)
>>> # Custom timeout and size limit
>>> doc = load_from_url("https://example.com/large.pdf",
...                     timeout=60, max_size_mb=200)
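
The documented ValueError for oversized content, together with max_retries, suggests a retry loop followed by a size check. The sketch below illustrates that pattern without any network access. Everything in it is an assumption: fetch_with_limits and its injected fetch callable are hypothetical, and which errors are retried and how long to wait between attempts are choices made for the example only.

```python
import time
from typing import Callable, Optional


def fetch_with_limits(fetch: Callable[[], bytes], max_size_mb: float = 100,
                      max_retries: int = 3,
                      backoff_seconds: float = 0.0) -> bytes:
    """Call `fetch` up to `max_retries` times, then enforce a size cap."""
    last_error: Optional[Exception] = None
    for attempt in range(max_retries):
        try:
            payload = fetch()
            break
        except ConnectionError as exc:
            # Treated as transient here; wait longer after each failure.
            last_error = exc
            time.sleep(backoff_seconds * (2 ** attempt))
    else:
        # Every attempt failed (or max_retries was 0).
        raise last_error or ConnectionError("no attempts were made")

    if len(payload) > max_size_mb * 1024 * 1024:
        raise ValueError(f"content exceeds {max_size_mb} MB")
    return payload


calls = {"count": 0}


def flaky_fetch() -> bytes:
    # Fails twice, then succeeds, to exercise the retry loop.
    calls["count"] += 1
    if calls["count"] < 3:
        raise ConnectionError("temporary failure")
    return b"document bytes"


payload = fetch_with_limits(flaky_fetch, max_size_mb=1, max_retries=3)
```

Checking the size only after the download completes is the simplest form; a streaming client could instead stop reading as soon as the limit is crossed.
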
async kerb.document.load_from_url_async(url, timeout=30, max_size_mb=100, max_retries=3, **kwargs)

Load document from a URL asynchronously.

Requires: aiohttp package

Parameters:
  • url (str) – URL to fetch document from

  • timeout (int) – Request timeout in seconds. Defaults to 30.

  • max_size_mb (float) – Maximum file size in MB. Defaults to 100.

  • max_retries (int) – Maximum number of retry attempts. Defaults to 3.

  • **kwargs – Additional arguments for aiohttp.ClientSession.get()

Returns:

Loaded document

Return type:

Document

Examples

>>> import asyncio
>>> doc = asyncio.run(load_from_url_async("https://example.com/document.pdf"))
>>> print(doc.content)
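
An async loader is mainly useful when several URLs are fetched concurrently. The sketch below shows the asyncio.gather pattern with a stand-in coroutine so that it runs without network access; fake_load and load_many are hypothetical names, and in real use the stand-in would be replaced by the async loader documented above.

```python
import asyncio
from typing import List


async def fake_load(url: str) -> str:
    # Stand-in for an async loader; it only simulates I/O latency.
    await asyncio.sleep(0.01)
    return f"content of {url}"


async def load_many(urls: List[str]) -> List[str]:
    # gather() runs the coroutines concurrently and preserves input order.
    return await asyncio.gather(*(fake_load(u) for u in urls))


results = asyncio.run(load_many([
    "https://example.com/a.pdf",
    "https://example.com/b.pdf",
]))
```

Results come back in the same order as the input URLs, regardless of which request finishes first.
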
kerb.document.merge_documents(documents, separator='\\n\\n---\\n\\n')

Merge multiple documents into one.

Parameters:
  • documents (List[Document]) – Documents to merge

  • separator (str) – Separator between documents

Returns:

Merged document

Return type:

Document

Examples

>>> doc1 = Document(content="First doc", metadata={"id": 1})
>>> doc2 = Document(content="Second doc", metadata={"id": 2})
>>> merged = merge_documents([doc1, doc2])
>>> print(merged.content)
kerb.document.extract_text_from_html(html, remove_scripts=True)

Extract plain text from HTML content.

Parameters:
  • html (str) – HTML content

  • remove_scripts (bool) – Remove script and style tags

Returns:

Extracted plain text

Return type:

str

Examples

>>> html = '<html><body><p>Hello World</p></body></html>'
>>> extract_text_from_html(html)
'Hello World'
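
One way to picture text extraction with script and style removal is a small parser built on the standard html.parser module. This is a sketch of the general technique, not the library's implementation; _TextCollector and sketch_extract_text are hypothetical names.

```python
from html.parser import HTMLParser
from typing import List


class _TextCollector(HTMLParser):
    """Collect text nodes, optionally skipping <script> and <style> bodies."""

    def __init__(self, remove_scripts: bool = True) -> None:
        super().__init__()
        self.remove_scripts = remove_scripts
        self._skip_depth = 0
        self.parts: List[str] = []

    def handle_starttag(self, tag, attrs):
        if self.remove_scripts and tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if self.remove_scripts and tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if not self._skip_depth:
            self.parts.append(data)


def sketch_extract_text(html: str, remove_scripts: bool = True) -> str:
    parser = _TextCollector(remove_scripts)
    parser.feed(html)
    parser.close()
    # Join the fragments and collapse whitespace runs to single spaces.
    return " ".join(" ".join(parser.parts).split())
```

With remove_scripts=False the same sketch keeps the script source as text, which is rarely what a downstream text pipeline wants.
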
kerb.document.strip_markdown(text)

Remove Markdown formatting from text.

Parameters:

text (str) – Markdown text

Returns:

Plain text without Markdown formatting

Return type:

str

Examples

>>> strip_markdown("# Hello **World**")
'Hello World'
kerb.document.split_into_sentences(text)

Split text into sentences.

Parameters:

text (str) – Text to split

Returns:

List of sentences

Return type:

List[str]

Examples

>>> split_into_sentences("Hello world. This is a test!")
['Hello world.', 'This is a test!']
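
A simple rule that reproduces the example above is to split on whitespace that follows sentence-ending punctuation. The sketch below assumes that rule; the library may use a different one, and sketch_split_sentences is a hypothetical name.

```python
import re
from typing import List


def sketch_split_sentences(text: str) -> List[str]:
    # Split on whitespace that directly follows ".", "!" or "?".
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

A rule this simple also splits after abbreviations such as "Dr." or "e.g.", so it suits rough chunking better than linguistic analysis.
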
kerb.document.split_into_paragraphs(text)

Split text into paragraphs.

Parameters:

text (str) – Text to split

Returns:

List of paragraphs

Return type:

List[str]

Examples

>>> split_into_paragraphs("Para 1\n\nPara 2\n\nPara 3")
['Para 1', 'Para 2', 'Para 3']
kerb.document.clean_text(text, normalize_whitespace=True, remove_urls=False, remove_emails=False, remove_special_chars=False, lowercase=False)

Clean and normalize text.

Parameters:
  • text (str) – Text to clean

  • normalize_whitespace (bool) – Normalize whitespace to single spaces

  • remove_urls (bool) – Remove URLs

  • remove_emails (bool) – Remove email addresses

  • remove_special_chars (bool) – Remove special characters

  • lowercase (bool) – Convert to lowercase

Returns:

Cleaned text

Return type:

str

Examples

>>> text = "Check   out https://example.com  for more info!"
>>> clean_text(text, normalize_whitespace=True, remove_urls=True)
'Check out for more info!'
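
The order of the cleaning steps matters: removing a URL leaves a gap that whitespace normalization then has to close. The sketch below covers three of the documented flags with regular expressions chosen for this example; sketch_clean_text is a hypothetical name and the patterns are assumptions, not the library's.

```python
import re


def sketch_clean_text(text: str, normalize_whitespace: bool = True,
                      remove_urls: bool = False,
                      lowercase: bool = False) -> str:
    if remove_urls:
        # Drop http(s) links and bare "www." links.
        text = re.sub(r"https?://\S+|www\.\S+", "", text)
    if lowercase:
        text = text.lower()
    if normalize_whitespace:
        # Run last so gaps left by earlier removals are collapsed too.
        text = re.sub(r"\s+", " ", text).strip()
    return text
```

Running normalization first would leave a double space where the URL used to be.
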
kerb.document.remove_extra_newlines(text, max_consecutive=2)

Remove excessive newlines from text.

Parameters:
  • text (str) – Text to process

  • max_consecutive (int) – Maximum consecutive newlines to keep

Returns:

Text with limited newlines

Return type:

str

Examples

>>> remove_extra_newlines("Hello\n\n\n\nWorld", max_consecutive=2)
'Hello\n\nWorld'
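
The behavior in the example can be expressed as a single substitution: any run longer than max_consecutive is replaced by exactly max_consecutive newlines. A minimal sketch, with sketch_limit_newlines as a hypothetical name:

```python
import re


def sketch_limit_newlines(text: str, max_consecutive: int = 2) -> str:
    # Match runs of at least max_consecutive + 1 newlines.
    pattern = r"\n{%d,}" % (max_consecutive + 1)
    return re.sub(pattern, "\n" * max_consecutive, text)
```

Runs that are already within the limit are left untouched.
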
kerb.document.preprocess_pdf_text(text)

Preprocess text extracted from PDF.

PDFs often have formatting artifacts like broken lines, extra spaces, etc.

Parameters:

text (str) – Text extracted from PDF

Returns:

Cleaned text

Return type:

str

Examples

>>> pdf_text = "This is a sen-\ntence with line break."
>>> preprocess_pdf_text(pdf_text)
'This is a sentence with line break.'
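
The artifacts mentioned above (words hyphenated across line breaks, hard line wraps, extra spaces) can each be handled by one substitution. The sketch below is an assumption about a reasonable approach, not a description of what the library does; sketch_fix_pdf_breaks is a hypothetical name.

```python
import re


def sketch_fix_pdf_breaks(text: str) -> str:
    # Rejoin words hyphenated across a line break: "sen-\ntence" -> "sentence".
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single line breaks into spaces; blank-line paragraph breaks stay.
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs.
    return re.sub(r"[ \t]+", " ", text).strip()
```

The dehyphenation rule cannot tell a wrapped word from a genuine compound, so "well-\nknown" would become "wellknown"; a dictionary check is one way to reduce such errors.
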
kerb.document.preprocess_html_text(html)

Preprocess HTML to extract clean text.

Parameters:

html (str) – HTML content

Returns:

Cleaned text

Return type:

str

Examples

>>> html = '<div>Hello <span>World</span></div>'
>>> preprocess_html_text(html)
'Hello World'
kerb.document.preprocess_markdown(text, keep_structure=True)

Preprocess Markdown text.

Parameters:
  • text (str) – Markdown text

  • keep_structure (bool) – Keep headings and structure markers

Returns:

Processed text

Return type:

str

Examples

>>> md = "# Title\n\nSome **bold** text"
>>> preprocess_markdown(md, keep_structure=False)
'Title\n\nSome bold text'
kerb.document.extract_metadata(file_path)

Extract metadata from a file.

Parameters:

file_path (str) – Path to file

Returns:

Extracted metadata

Return type:

Dict[str, Any]

Examples

>>> metadata = extract_metadata("document.pdf")
>>> print(metadata['size'], metadata['created'])
kerb.document.extract_document_stats(text)

Extract statistics from document text.

Parameters:

text (str) – Document text

Returns:

Document statistics

Return type:

Dict[str, int]

Examples

>>> stats = extract_document_stats("Hello world. This is a test.")
>>> print(stats['word_count'], stats['sentence_count'])
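
The example reads 'word_count' and 'sentence_count' from the result. How the library counts them is not stated here; the sketch below assumes whitespace-separated words and sentences ending in ".", "!" or "?". The 'char_count' key and the name sketch_stats are hypothetical additions for illustration.

```python
import re
from typing import Dict


def sketch_stats(text: str) -> Dict[str, int]:
    words = text.split()
    # Sentences: split on whitespace that follows ".", "!" or "?".
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {
        "char_count": len(text),  # hypothetical key, not from the docs
        "word_count": len(words),
        "sentence_count": len(sentences),
    }


stats = sketch_stats("Hello world. This is a test.")
```

Under these counting rules the sample text has 6 words and 2 sentences.
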
kerb.document.extract_urls(text)

Extract URLs from text.

Parameters:

text (str) – Text to extract URLs from

Returns:

List of URLs

Return type:

List[str]

Examples

>>> extract_urls("Visit https://example.com and www.test.com")
['https://example.com', 'www.test.com']
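
A pattern that accepts either a scheme or a leading "www." reproduces the example above. The regular expression below is an assumption chosen for illustration and is not taken from the library; sketch_extract_urls is a hypothetical name.

```python
import re
from typing import List

# Match "http://", "https://" or "www." followed by non-space characters.
_URL_PATTERN = re.compile(r"(?:https?://|www\.)[^\s<>\"']+")


def sketch_extract_urls(text: str) -> List[str]:
    # Trim trailing punctuation that usually belongs to the sentence.
    return [m.rstrip(".,;:!?)") for m in _URL_PATTERN.findall(text)]
```

Trimming trailing punctuation keeps "See https://example.com." from yielding a URL that ends in a period.
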
kerb.document.extract_emails(text)

Extract email addresses from text.

Parameters:

text (str) – Text to extract emails from

Returns:

List of email addresses

Return type:

List[str]

Examples

>>> extract_emails("Contact us at info@example.com or sales@test.org")
['info@example.com', 'sales@test.org']
kerb.document.extract_dates(text)

Extract dates from text (simple patterns).

Parameters:

text (str) – Text to extract dates from

Returns:

List of potential date strings

Return type:

List[str]

Examples

>>> extract_dates("Meeting on 2024-01-15 and 01/20/2024")
['2024-01-15', '01/20/2024']
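
"Simple patterns" here plausibly means fixed digit layouts. The sketch below assumes two of them, YYYY-MM-DD and M/D/YYYY, which are the layouts in the example; the patterns the library actually recognizes may differ, and sketch_extract_dates is a hypothetical name.

```python
import re
from typing import List

# Two digit layouts: 2024-01-15 and 01/20/2024 (or 1/5/2024).
_DATE_PATTERN = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{4}\b")


def sketch_extract_dates(text: str) -> List[str]:
    return _DATE_PATTERN.findall(text)
```

Matching by shape alone explains the "potential" in the return description: a string such as 2024-13-45 has the right layout but is not a valid date, so callers that need real dates should validate the matches.
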
kerb.document.extract_phone_numbers(text)

Extract phone numbers from text (US format).

Parameters:

text (str) – Text to extract phone numbers from

Returns:

List of phone numbers

Return type:

List[str]

Examples

>>> extract_phone_numbers("Call (555) 123-4567 or 555-987-6543")
['(555) 123-4567', '555-987-6543']
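
US-format numbers follow a 3-3-4 digit grouping with an optional parenthesized area code. The pattern below is an assumption that reproduces the example above, not the library's own expression; sketch_extract_phone_numbers is a hypothetical name.

```python
import re
from typing import List

# Optional "(", 3 digits, optional ")", then 3 and 4 digits with
# an optional "-", "." or space between the groups.
_PHONE_PATTERN = re.compile(r"\(?\b\d{3}\)?[-.\s]?\d{3}[-.\s]?\d{4}\b")


def sketch_extract_phone_numbers(text: str) -> List[str]:
    return _PHONE_PATTERN.findall(text)
```

The pattern does not handle country codes or extensions, which would need additional optional groups.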
