Document Module
Document loading and processing utilities for LLM applications.
This module provides tools for loading, cleaning, preprocessing, and extracting metadata from documents in a range of formats.
- Common Usage:
# Load documents (top-level)
from kerb.document import load_document, Document
# Specialized loaders (submodule)
from kerb.document.loaders import load_text, load_markdown
# Utilities (submodule)
from kerb.document.utils import detect_format, load_directory
# Text processing (submodules)
from kerb.document.extractors import extract_text_from_html
from kerb.document.cleaners import clean_text
from kerb.document.preprocessors import preprocess_pdf_text
from kerb.document.metadata import extract_metadata
- Document Loading:
load_document() - Load any supported document (auto-detects format)
- Submodules:
loaders - Format-specific document loaders (PDF, DOCX, HTML, etc.)
utils - Utilities for format detection, batch loading, and merging
extractors - Text extraction from various formats
cleaners - Text cleaning and normalization
preprocessors - Format-specific preprocessing
metadata - Metadata and entity extraction
- Data Classes:
Document - Document with content and metadata (from kerb.core.types)
DocumentFormat - Enum of supported formats (from kerb.core.types)
- class kerb.document.Document(content, metadata=<factory>, id=None, source=None, format=DocumentFormat.UNKNOWN, score=0.0, page_content=None)[source]
Bases: object
Universal document representation across the toolkit.
Consolidates the Document classes from the document/ and retrieval/ packages into a single, consistent document representation.
- content
The text content of the document
- metadata
Additional metadata about the document
- id
Optional unique identifier for the document
- source
Optional source path or URL where document was loaded from
- format
Document format (defaults to UNKNOWN)
- score
Relevance score (used in retrieval contexts, defaults to 0.0)
- page_content
Optional list of content per page (for multi-page documents)
Examples
>>> # Simple document
>>> doc = Document(content="Hello, world!")

>>> # Document with metadata
>>> doc = Document(
...     content="Important document",
...     metadata={"author": "John", "created": "2025-01-01"},
...     source="doc.txt"
... )

>>> # Retrieval result with score
>>> doc = Document(
...     id="doc_123",
...     content="Relevant content",
...     score=0.95
... )
- format: DocumentFormat = 'unknown'
- __init__(content, metadata=<factory>, id=None, source=None, format=DocumentFormat.UNKNOWN, score=0.0, page_content=None)
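The `metadata=<factory>` default in the signature suggests a dataclass field built with a default factory. The sketch below mirrors the documented fields under that assumption. `DocumentSketch` and `FormatSketch` are hypothetical stand-ins, and the type annotations are guesses rather than the library's own.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Any, Dict, List, Optional


class FormatSketch(Enum):
    """Hypothetical two-member stand-in for DocumentFormat."""
    TXT = "txt"
    UNKNOWN = "unknown"


@dataclass
class DocumentSketch:
    """Hypothetical stand-in mirroring the documented fields and defaults."""
    content: str
    # default_factory gives every instance its own dict instead of a shared one
    metadata: Dict[str, Any] = field(default_factory=dict)
    id: Optional[str] = None
    source: Optional[str] = None
    format: FormatSketch = FormatSketch.UNKNOWN
    score: float = 0.0
    page_content: Optional[List[str]] = None


a = DocumentSketch(content="Hello, world!")
b = DocumentSketch(content="Other")
a.metadata["author"] = "John"  # only a's metadata changes
```

A factory default matters here because a plain `{}` default would be shared by every instance that did not pass its own metadata.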
- class kerb.document.DocumentFormat(*values)[source]
Bases: Enum
Supported document formats.
- PDF = 'pdf'
- DOCX = 'docx'
- DOC = 'doc'
- HTML = 'html'
- MARKDOWN = 'markdown'
- TXT = 'txt'
- CSV = 'csv'
- JSON = 'json'
- XML = 'xml'
- RTF = 'rtf'
- ODT = 'odt'
- EPUB = 'epub'
- UNKNOWN = 'unknown'
- kerb.document.load_document(file_path, **kwargs)[source]
Load a document from file, automatically detecting format.
This is the main entry point for loading documents. It detects the format and delegates to the appropriate loader.
- Parameters:
file_path (str) – Path to the document file
**kwargs – Additional arguments passed to format-specific loaders
- Returns:
Loaded document with content and metadata
- Return type:
- Raises:
FileNotFoundError – If file doesn’t exist
ValueError – If format is not supported
Examples
>>> doc = load_document("report.pdf")
>>> print(doc.content[:100])

>>> doc = load_document("data.csv", parse_as_dict=True)
>>> print(doc.metadata['rows'])
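The detect-then-delegate behaviour described above can be sketched as a lookup table keyed by file extension. This is an illustration, not the library's code: `load_document_sketch`, `_load_txt`, and `_LOADERS` are hypothetical names, the loader returns a plain dict instead of a `Document`, and checking existence before format is an assumed ordering.

```python
import tempfile
from pathlib import Path


def _load_txt(path, encoding="utf-8"):
    """Hypothetical format-specific loader for plain text."""
    return {"content": path.read_text(encoding=encoding), "format": "txt"}


# Hypothetical dispatch table: extension -> loader function
_LOADERS = {".txt": _load_txt}


def load_document_sketch(file_path, **kwargs):
    path = Path(file_path)
    if not path.exists():
        raise FileNotFoundError(f"No such file: {file_path}")
    loader = _LOADERS.get(path.suffix.lower())
    if loader is None:
        raise ValueError(f"Unsupported format: {path.suffix or '(none)'}")
    # Extra keyword arguments are forwarded to the chosen loader
    return loader(path, **kwargs)


with tempfile.TemporaryDirectory() as tmp:
    note = Path(tmp, "notes.txt")
    note.write_text("hello", encoding="utf-8")
    odd = Path(tmp, "data.xyz")
    odd.write_text("?", encoding="utf-8")

    loaded = load_document_sketch(str(note))
    outcomes = []
    for candidate in (odd, Path(tmp, "missing.txt")):
        try:
            load_document_sketch(str(candidate))
        except (ValueError, FileNotFoundError) as exc:
            outcomes.append(type(exc).__name__)
```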
- kerb.document.load_text(file_path, encoding='utf-8')[source]
Load a plain text file.
- Parameters:
- Returns:
Loaded document
- Return type:
Examples
>>> doc = load_text("notes.txt")
>>> print(doc.content)
- kerb.document.load_markdown(file_path, extract_frontmatter=True)[source]
Load a Markdown file.
- Parameters:
- Returns:
Loaded document with frontmatter in metadata
- Return type:
Examples
>>> doc = load_markdown("README.md")
>>> if 'frontmatter' in doc.metadata:
...     print(doc.metadata['frontmatter'])
- kerb.document.load_json(file_path, as_string=False)[source]
Load a JSON file.
- Parameters:
- Returns:
Loaded document
- Return type:
Examples
>>> doc = load_json("data.json", as_string=True)
>>> print(doc.content)

>>> doc = load_json("config.json")
>>> print(doc.metadata['json_data'])
- kerb.document.load_csv(file_path, parse_as_dict=True, encoding='utf-8')[source]
Load a CSV file.
- Parameters:
- Returns:
Loaded document with CSV data in metadata
- Return type:
Examples
>>> doc = load_csv("data.csv")
>>> rows = doc.metadata['rows']
>>> headers = doc.metadata['headers']
- kerb.document.load_xml(file_path, encoding='utf-8')[source]
Load an XML file.
- Parameters:
- Returns:
Loaded document
- Return type:
Examples
>>> doc = load_xml("data.xml")
>>> print(doc.content)
- kerb.document.load_html(file_path, extract_text=True, encoding='utf-8')[source]
Load an HTML file.
- Parameters:
- Returns:
Loaded document
- Return type:
Examples
>>> doc = load_html("page.html", extract_text=True)
>>> print(doc.content)  # Plain text without HTML tags
- kerb.document.load_pdf(file_path, extract_images=False)[source]
Load a PDF file.
Requires: pypdf or PyPDF2 package
- Parameters:
- Returns:
Loaded document with page-by-page content
- Return type:
Examples
>>> doc = load_pdf("report.pdf")
>>> print(f"Pages: {doc.metadata['num_pages']}")
>>> print(doc.content)  # All pages concatenated
- kerb.document.load_docx(file_path)[source]
Load a DOCX file.
Requires: python-docx package
Examples
>>> doc = load_docx("report.docx")
>>> print(doc.content)
- kerb.document.detect_format(file_path)[source]
Detect document format from file extension.
- Parameters:
file_path (str) – Path to the file
- Returns:
Detected format enum
- Return type:
Examples
>>> detect_format("document.pdf")
DocumentFormat.PDF

>>> detect_format("notes.md")
DocumentFormat.MARKDOWN
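Extension-based detection can be sketched as a case-insensitive suffix lookup that falls back to an unknown value. The extension table below is an assumption made for illustration. `FormatSketch` and `detect_format_sketch` are hypothetical names, and the real function may recognise more extensions.

```python
from enum import Enum
from pathlib import Path


class FormatSketch(Enum):
    """Hypothetical subset of DocumentFormat."""
    PDF = "pdf"
    MARKDOWN = "markdown"
    TXT = "txt"
    UNKNOWN = "unknown"


# Assumed extension table; several extensions may map to one format
_EXTENSIONS = {
    ".pdf": FormatSketch.PDF,
    ".md": FormatSketch.MARKDOWN,
    ".markdown": FormatSketch.MARKDOWN,
    ".txt": FormatSketch.TXT,
}


def detect_format_sketch(file_path: str) -> FormatSketch:
    # Only the final suffix is inspected; file contents are never read
    suffix = Path(file_path).suffix.lower()
    return _EXTENSIONS.get(suffix, FormatSketch.UNKNOWN)
```

Because only the extension is consulted, a mislabelled file (for example, HTML saved as `.txt`) would be classified by its name, not its contents.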
- kerb.document.load_directory(directory_path, pattern='*', recursive=False, max_files=None)[source]
Load all supported documents from a directory.
- Parameters:
- Returns:
List of loaded documents
- Return type:
Examples
>>> docs = load_directory("./documents", pattern="*.pdf")
>>> print(f"Loaded {len(docs)} documents")

>>> docs = load_directory("./data", recursive=True, max_files=100)
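How `pattern`, `recursive`, and `max_files` might interact can be sketched with `pathlib` globbing. This covers only the file-selection step. `list_files_sketch` is a hypothetical helper, and sorting the matches before truncating is an assumption made so the result is deterministic.

```python
import tempfile
from pathlib import Path


def list_files_sketch(directory_path, pattern="*", recursive=False, max_files=None):
    root = Path(directory_path)
    # rglob descends into subdirectories; glob stays at the top level
    matches = root.rglob(pattern) if recursive else root.glob(pattern)
    files = sorted(p for p in matches if p.is_file())
    return files if max_files is None else files[:max_files]


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    for rel in ("a.pdf", "b.txt", "sub/c.pdf"):
        (root / rel).write_text("x", encoding="utf-8")

    flat = [p.name for p in list_files_sketch(tmp, "*.pdf")]
    deep = [p.name for p in list_files_sketch(tmp, "*.pdf", recursive=True)]
    capped = list_files_sketch(tmp, recursive=True, max_files=2)
```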
- kerb.document.load_from_url(url, timeout=30, max_size_mb=100, max_retries=3, **kwargs)[source]
Load document from a URL.
Requires: requests package
- Parameters:
- Returns:
Loaded document
- Return type:
- Raises:
ValueError – If content exceeds max_size_mb
requests.exceptions.Timeout – If request times out
requests.exceptions.HTTPError – If HTTP error occurs
Examples
>>> doc = load_from_url("https://example.com/document.pdf")
>>> print(doc.content)

>>> # Custom timeout and size limit
>>> doc = load_from_url("https://example.com/large.pdf",
...                     timeout=60, max_size_mb=200)
- async kerb.document.load_from_url_async(url, timeout=30, max_size_mb=100, max_retries=3, **kwargs)[source]
Load document from a URL asynchronously.
Requires: aiohttp package
- Parameters:
- Returns:
Loaded document
- Return type:
- Raises:
ValueError – If content exceeds max_size_mb
asyncio.TimeoutError – If request times out
aiohttp.ClientError – If HTTP error occurs
Examples
>>> import asyncio
>>> doc = asyncio.run(load_from_url_async("https://example.com/document.pdf"))
>>> print(doc.content)
- kerb.document.merge_documents(documents, separator='\\n\\n---\\n\\n')[source]
Merge multiple documents into one.
- Parameters:
- Returns:
Merged document
- Return type:
Examples
>>> doc1 = Document(content="First doc", metadata={"id": 1})
>>> doc2 = Document(content="Second doc", metadata={"id": 2})
>>> merged = merge_documents([doc1, doc2])
>>> print(merged.content)
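For the content side of a merge, a minimal sketch is a plain string join. It assumes the default separator is a horizontal rule (`---`) between blank lines. `merge_contents_sketch` is a hypothetical helper, and how the real function combines metadata is not stated here, so the sketch leaves it out.

```python
def merge_contents_sketch(contents, separator="\n\n---\n\n"):
    # The separator appears only between items, never at the start or end
    return separator.join(contents)


merged = merge_contents_sketch(["First doc", "Second doc"])
custom = merge_contents_sketch(["A", "B", "C"], separator=" | ")
```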
- kerb.document.extract_text_from_html(html, remove_scripts=True)[source]
Extract plain text from HTML content.
- Parameters:
- Returns:
Extracted plain text
- Return type:
Examples
>>> html = '<html><body><p>Hello World</p></body></html>'
>>> extract_text_from_html(html)
'Hello World'
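One way to get this behaviour with only the standard library is an `html.parser.HTMLParser` subclass that collects text nodes and skips the contents of `<script>` and `<style>`. The sketch below is an illustration under that assumption. `extract_text_sketch` and `_TextCollector` are hypothetical names, and the real function may rely on a different parser.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collects text nodes, optionally skipping script/style contents."""

    def __init__(self, remove_scripts=True):
        super().__init__()
        self._skip_tags = {"script", "style"} if remove_scripts else set()
        self._skip_depth = 0
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self._skip_tags:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self._skip_tags and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def extract_text_sketch(html, remove_scripts=True):
    parser = _TextCollector(remove_scripts)
    parser.feed(html)
    parser.close()
    # Join text nodes with spaces, then collapse all whitespace runs
    return " ".join(" ".join(parser.chunks).split())
```

Joining nodes with a space keeps adjacent block elements apart, at the cost of inserting a space inside words that are split by inline tags.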
- kerb.document.strip_markdown(text)[source]
Remove Markdown formatting from text.
- Parameters:
text (str) – Markdown text
- Returns:
Plain text without Markdown formatting
- Return type:
Examples
>>> strip_markdown("# Hello **World**")
'Hello World'
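A regex-based approximation of Markdown stripping is sketched below. It handles only ATX headings, asterisk emphasis, inline code, and inline links, and deliberately leaves underscores alone so identifiers such as `snake_case` survive. `strip_markdown_sketch` is a hypothetical name, and the real function may cover more syntax.

```python
import re


def strip_markdown_sketch(text):
    # Leading '#' heading markers at the start of a line
    text = re.sub(r"^#{1,6}\s+", "", text, flags=re.MULTILINE)
    # Bold first, then italic, so '**' is not consumed as two italics
    text = re.sub(r"\*\*(.+?)\*\*", r"\1", text)
    text = re.sub(r"\*(.+?)\*", r"\1", text)
    # Inline code spans
    text = re.sub(r"`([^`]+)`", r"\1", text)
    # Inline links: keep the label, drop the target
    text = re.sub(r"\[([^\]]+)\]\([^)]+\)", r"\1", text)
    return text
```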
- kerb.document.split_into_sentences(text)[source]
Split text into sentences.
Examples
>>> split_into_sentences("Hello world. This is a test!")
['Hello world.', 'This is a test!']
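A common lightweight approach to sentence splitting is to break on whitespace that follows terminal punctuation. The sketch below assumes that rule. `split_sentences_sketch` is a hypothetical name, and the real function may treat abbreviations differently.

```python
import re


def split_sentences_sketch(text):
    # Split on whitespace preceded by '.', '!' or '?'; the punctuation stays attached
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]
```

A rule this simple also splits after abbreviations, so "Dr. Smith arrived." becomes two pieces.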
- kerb.document.split_into_paragraphs(text)[source]
Split text into paragraphs.
Examples
>>> split_into_paragraphs("Para 1\n\nPara 2\n\nPara 3")
['Para 1', 'Para 2', 'Para 3']
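Paragraph splitting is usually a matter of breaking on blank lines. The sketch below assumes a paragraph boundary is two newlines with optional whitespace between them. `split_paragraphs_sketch` is a hypothetical name.

```python
import re


def split_paragraphs_sketch(text):
    # A blank line (possibly containing spaces or tabs) separates paragraphs
    pieces = re.split(r"\n\s*\n", text)
    return [piece.strip() for piece in pieces if piece.strip()]
```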
- kerb.document.clean_text(text, normalize_whitespace=True, remove_urls=False, remove_emails=False, remove_special_chars=False, lowercase=False)[source]
Clean and normalize text.
- Parameters:
- Returns:
Cleaned text
- Return type:
Examples
>>> text = "Check out https://example.com for more info!"
>>> clean_text(text, normalize_whitespace=True, remove_urls=True)
'Check out for more info!'
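The example output above has a single space where the URL used to be, which suggests that whitespace is normalized after the URL is removed. The sketch below assumes that ordering and implements three of the documented flags. `clean_text_sketch` is a hypothetical name and the URL pattern is a simplification.

```python
import re


def clean_text_sketch(text, normalize_whitespace=True, remove_urls=False, lowercase=False):
    if remove_urls:
        # Simplified pattern: scheme-prefixed or www-prefixed runs of non-space characters
        text = re.sub(r"(?:https?://|www\.)\S+", "", text)
    if lowercase:
        text = text.lower()
    if normalize_whitespace:
        # Runs last so gaps left by removed URLs collapse to one space
        text = re.sub(r"\s+", " ", text).strip()
    return text
```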
- kerb.document.remove_extra_newlines(text, max_consecutive=2)[source]
Remove excessive newlines from text.
- Parameters:
- Returns:
Text with limited newlines
- Return type:
Examples
>>> remove_extra_newlines("Hello\n\n\n\nWorld", max_consecutive=2)
'Hello\n\nWorld'
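Capping runs of newlines can be done with one regular expression that matches any run longer than the limit. The sketch below shows that approach. `remove_extra_newlines_sketch` is a hypothetical name, and rejecting limits below 1 is an added assumption.

```python
import re


def remove_extra_newlines_sketch(text, max_consecutive=2):
    if max_consecutive < 1:
        raise ValueError("max_consecutive must be at least 1")
    # Match runs of max_consecutive + 1 or more newlines
    pattern = r"\n{%d,}" % (max_consecutive + 1)
    return re.sub(pattern, "\n" * max_consecutive, text)
```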
- kerb.document.preprocess_pdf_text(text)[source]
Preprocess text extracted from PDF.
PDFs often have formatting artifacts like broken lines, extra spaces, etc.
Examples
>>> pdf_text = "This is a sen-\ntence with line break."
>>> preprocess_pdf_text(pdf_text)
'This is a sentence with line break.'
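The artifacts mentioned above (words hyphenated across lines, hard line breaks, extra spaces) can be handled with three substitutions, sketched below. `dehyphenate_sketch` is a hypothetical name, and the real function may apply further rules.

```python
import re


def dehyphenate_sketch(text):
    # Rejoin words split by a hyphen at the end of a line
    text = re.sub(r"(\w)-\n(\w)", r"\1\2", text)
    # Turn single newlines into spaces but keep blank-line paragraph breaks
    text = re.sub(r"(?<!\n)\n(?!\n)", " ", text)
    # Collapse runs of spaces and tabs
    return re.sub(r"[ \t]{2,}", " ", text).strip()
```

The first rule cannot tell a line-break hyphen from a real one, so a compound such as "well-" followed by "known" on the next line would be joined as "wellknown".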
- kerb.document.preprocess_html_text(html)[source]
Preprocess HTML to extract clean text.
Examples
>>> html = '<div>Hello <span>World</span></div>'
>>> preprocess_html_text(html)
'Hello World'
- kerb.document.preprocess_markdown(text, keep_structure=True)[source]
Preprocess Markdown text.
- Parameters:
- Returns:
Processed text
- Return type:
Examples
>>> md = "# Title\n\nSome **bold** text"
>>> preprocess_markdown(md, keep_structure=False)
'Title\n\nSome bold text'
- kerb.document.extract_metadata(file_path)[source]
Extract metadata from a file.
Examples
>>> metadata = extract_metadata("document.pdf")
>>> print(metadata['size'], metadata['created'])
- kerb.document.extract_document_stats(text)[source]
Extract statistics from document text.
Examples
>>> stats = extract_document_stats("Hello world. This is a test.")
>>> print(stats['word_count'], stats['sentence_count'])
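The two keys used in the example, `word_count` and `sentence_count`, could be computed as sketched below. The counting rules are assumptions (whitespace-separated words, sentences ending in `.`, `!` or `?`). `document_stats_sketch` is a hypothetical name, and the real function may return additional keys.

```python
import re


def document_stats_sketch(text):
    words = text.split()
    # Sentences are pieces separated by whitespace that follows '.', '!' or '?'
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]
    return {"word_count": len(words), "sentence_count": len(sentences)}
```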
- kerb.document.extract_urls(text)[source]
Extract URLs from text.
Examples
>>> extract_urls("Visit https://example.com and www.test.com")
['https://example.com', 'www.test.com']
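URL extraction of this kind is commonly a regular expression over scheme-prefixed or `www.`-prefixed tokens. The sketch below assumes that, and also trims trailing sentence punctuation. `extract_urls_sketch` is a hypothetical name and the pattern is a simplification.

```python
import re

# Simplified pattern: 'http://', 'https://' or 'www.' followed by non-space characters
_URL = re.compile(r"(?:https?://|www\.)\S+")


def extract_urls_sketch(text):
    # Trailing punctuation usually belongs to the sentence, not the URL
    return [match.rstrip(".,;:!?)") for match in _URL.findall(text)]
```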
- kerb.document.extract_emails(text)[source]
Extract email addresses from text.
- Parameters:
text (str) – Text to extract emails from
- Returns:
List of email addresses
- Return type:
Examples
>>> extract_emails("Contact us at info@example.com or sales@test.org")
['info@example.com', 'sales@test.org']
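A pragmatic email pattern (local part, `@`, domain, and a top-level domain of at least two letters) is sketched below. It is a common approximation, not a full address validator. `extract_emails_sketch` is a hypothetical name and the real pattern may differ.

```python
import re

# Approximate pattern: local part, '@', domain labels, and a 2+ letter TLD
_EMAIL = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def extract_emails_sketch(text):
    return _EMAIL.findall(text)
```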
- kerb.document.extract_dates(text)[source]
Extract dates from text (simple patterns).
- Parameters:
text (str) – Text to extract dates from
- Returns:
List of potential date strings
- Return type:
Examples
>>> extract_dates("Meeting on 2024-01-15 and 01/20/2024")
['2024-01-15', '01/20/2024']
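"Simple patterns" and "potential date strings" suggest matching on shape alone. The sketch below covers only the two shapes shown in the example, `YYYY-MM-DD` and `M/D/YYYY`. `extract_dates_sketch` is a hypothetical name and the real function may recognise other layouts.

```python
import re

# Two shapes only: ISO-style YYYY-MM-DD and slash-separated M/D/YYYY
_DATE = re.compile(r"\b\d{4}-\d{2}-\d{2}\b|\b\d{1,2}/\d{1,2}/\d{4}\b")


def extract_dates_sketch(text):
    # Matches are candidates; nothing checks that the month or day is valid
    return _DATE.findall(text)
```

Because only the shape is checked, an impossible date such as "2024-13-45" is still returned, so callers that need real dates would have to validate the results themselves.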
- kerb.document.extract_phone_numbers(text)[source]
Extract phone numbers from text (US format).
- Parameters:
text (str) – Text to extract phone numbers from
- Returns:
List of phone numbers
- Return type:
Examples
>>> extract_phone_numbers("Call (555) 123-4567 or 555-987-6543")
['(555) 123-4567', '555-987-6543']
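The two US layouts in the example, `(NNN) NNN-NNNN` and `NNN-NNN-NNNN`, can be matched with one alternation, sketched below. `extract_phone_numbers_sketch` is a hypothetical name, and the real function may accept more layouts, such as dots or a country code.

```python
import re

# Two layouts only: '(555) 123-4567' and '555-987-6543'
_PHONE = re.compile(r"\(\d{3}\)\s?\d{3}-\d{4}|\b\d{3}-\d{3}-\d{4}\b")


def extract_phone_numbers_sketch(text):
    return _PHONE.findall(text)
```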