Preprocessing Module
Text preprocessing utilities for LLM applications.
This module provides comprehensive text preprocessing tools for cleaning, normalizing, and preparing text data for LLM processing.
- Usage Examples:
>>> # Common usage - normalize text
>>> from kerb.preprocessing import normalize_text
>>> clean = normalize_text(" Hello World! ", lowercase=True)

>>> # Text operations
>>> from kerb.preprocessing.text import (
...     normalize_whitespace,
...     remove_special_chars,
...     truncate_text
... )

>>> # Language detection
>>> from kerb.preprocessing.language import detect_language
>>> result = detect_language("Bonjour le monde")

>>> # Content filtering
>>> from kerb.preprocessing.filtering import filter_by_length
>>> filtered = filter_by_length(["hi", "hello world"], min_length=5)

>>> # Batch processing
>>> from kerb.preprocessing.batch import preprocess_batch
>>> processed = preprocess_batch([" text1 ", " text2 "])
- Organization:
  - Top-level: Core functions and most common operations
  - Submodules: Specialized implementations organized by functionality
    - text: Text normalization, cleaning, case handling
    - language: Language detection and filtering
    - deduplication: Text deduplication operations
    - filtering: Content filtering and quality control
    - analysis: Content analysis and classification
    - transforms: Advanced text transformations
    - batch: Batch processing utilities
    - enums: Enumeration types
    - types: Data classes and type definitions
- class kerb.preprocessing.NormalizationLevel(*values)[source]
Bases: Enum
Text normalization intensity.
- MINIMAL = 'minimal'
- STANDARD = 'standard'
- AGGRESSIVE = 'aggressive'
- class kerb.preprocessing.LanguageDetectionMode(*values)[source]
Bases: Enum
Language detection strategy.
- FAST = 'fast'
- ACCURATE = 'accurate'
- SIMPLE = 'simple'
- class kerb.preprocessing.DeduplicationMode(*values)[source]
Bases: Enum
Deduplication strategy.
- EXACT = 'exact'
- FUZZY = 'fuzzy'
- SEMANTIC = 'semantic'
- class kerb.preprocessing.ContentType(*values)[source]
Bases: Enum
Text content type classification.
- PLAIN_TEXT = 'plain_text'
- CODE = 'code'
- MARKDOWN = 'markdown'
- HTML = 'html'
- JSON = 'json'
- MIXED = 'mixed'
- UNKNOWN = 'unknown'
- class kerb.preprocessing.LanguageResult(language, confidence, alternatives=<factory>)[source]
Bases: object
Language detection result.
- __init__(language, confidence, alternatives=<factory>)
- class kerb.preprocessing.QualityMetrics(length, word_count, avg_word_length, sentence_count, avg_sentence_length, special_char_ratio, digit_ratio, uppercase_ratio, readability_score)[source]
Bases: object
Text quality metrics.
- __init__(length, word_count, avg_word_length, sentence_count, avg_sentence_length, special_char_ratio, digit_ratio, uppercase_ratio, readability_score)
- class kerb.preprocessing.NormalizationConfig(level=NormalizationLevel.STANDARD, lowercase=False, remove_urls=True, remove_emails=True, remove_extra_spaces=True)[source]
Bases: object
Configuration for text normalization operations.
- level
Normalization intensity level
- lowercase
Convert to lowercase
- remove_urls
Remove URLs from text
- remove_emails
Remove email addresses
- remove_extra_spaces
Remove redundant whitespace
- level: NormalizationLevel = 'standard'
- __init__(level=NormalizationLevel.STANDARD, lowercase=False, remove_urls=True, remove_emails=True, remove_extra_spaces=True)
- kerb.preprocessing.normalize_text(text, level=NormalizationLevel.STANDARD, lowercase=False, remove_urls=True, remove_emails=True, remove_extra_spaces=True, config=None)[source]
Comprehensive text normalization with configurable intensity.
- Parameters:
text (str) – Input text to normalize
level (NormalizationLevel) – Normalization intensity level (ignored if config is provided)
lowercase (bool) – Convert to lowercase (ignored if config is provided)
remove_urls (bool) – Remove URLs from text (ignored if config is provided)
remove_emails (bool) – Remove email addresses (ignored if config is provided)
remove_extra_spaces (bool) – Remove redundant whitespace (ignored if config is provided)
config (Optional[NormalizationConfig]) – NormalizationConfig object with all parameters (recommended)
- Return type:
- Returns:
Normalized text
Examples
>>> # Using config object (recommended)
>>> from kerb.preprocessing import NormalizationConfig, NormalizationLevel
>>> config = NormalizationConfig(
...     level=NormalizationLevel.STANDARD,
...     lowercase=True,
...     remove_urls=True
... )
>>> normalized = normalize_text("Check this: https://example.com", config=config)

>>> # Using individual parameters (backward compatible)
>>> normalized = normalize_text("HELLO WORLD", lowercase=True)
- kerb.preprocessing.normalize_whitespace(text)[source]
Normalize whitespace and newlines.
Examples
>>> normalize_whitespace("Hello world\n\n\ntest")
'Hello world\n\ntest'
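A minimal standard-library sketch of the behaviour the example above suggests, assuming that runs of spaces and tabs collapse to one space and that three or more consecutive newlines collapse to two. The helper name `collapse_whitespace` is hypothetical and this is not kerb's implementation.

```python
import re


def collapse_whitespace(text: str) -> str:
    """Hypothetical helper: collapse horizontal whitespace and limit blank lines."""
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Allow at most one blank line (two consecutive newlines).
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()


result = collapse_whitespace("Hello   world\n\n\ntest")
```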
- kerb.preprocessing.normalize_unicode(text, form='NFKC')[source]
Normalize unicode characters.
- Parameters:
- Return type:
- Returns:
Unicode-normalized text
Examples
>>> normalize_unicode("café")  # Normalizes different accent representations
'café'
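To make "different accent representations" concrete, the following sketch uses only the standard library's `unicodedata.normalize` with the same default form, NFKC. It illustrates the underlying Unicode operation and makes no claim about how kerb wraps it.

```python
import unicodedata

decomposed = "cafe\u0301"  # 'e' followed by U+0301 COMBINING ACUTE ACCENT
composed = "caf\u00e9"     # precomposed U+00E9

# The two strings render identically but compare unequal until normalized.
same_before = decomposed == composed
same_after = unicodedata.normalize("NFKC", decomposed) == composed

# NFKC also folds compatibility characters, such as the 'fi' ligature U+FB01.
folded = unicodedata.normalize("NFKC", "\ufb01le")
```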
- kerb.preprocessing.normalize_quotes(text)[source]
Convert smart quotes to standard quotes.
Examples
>>> normalize_quotes("“Hello” and ‘world’")
'"Hello" and \'world\''
- kerb.preprocessing.normalize_dashes(text)[source]
Convert various dashes to standard forms.
Examples
>>> normalize_dashes("em—dash and en–dash")
'em-dash and en-dash'
- kerb.preprocessing.remove_accents(text)[source]
Remove diacritical marks from text.
Examples
>>> remove_accents("café résumé")
'cafe resume'
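One common way to remove diacritical marks is to decompose the text (NFD) and drop the combining marks. The sketch below shows that technique with the standard library; the name `strip_accents` is hypothetical and kerb may implement `remove_accents` differently.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Hypothetical helper: drop combining marks after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    # Category "Mn" (Mark, nonspacing) covers combining accents.
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


plain = strip_accents("café résumé")
```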
- kerb.preprocessing.clean_html(text, keep_newlines=True)[source]
Remove HTML tags and entities.
- Parameters:
- Return type:
- Returns:
Plain text without HTML
Examples
>>> clean_html("<p>Hello <b>world</b></p>")
'Hello world'
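For comparison, tag and entity removal can be sketched with the standard library's `html.parser`. The class and function names below are hypothetical, and the sketch ignores the `keep_newlines` option; it only illustrates the general technique, not kerb's implementation.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and discards tags."""

    def __init__(self):
        # convert_charrefs=True decodes entities such as &amp; into characters.
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(markup: str) -> str:
    """Hypothetical helper: return the text content of an HTML fragment."""
    parser = _TextExtractor()
    parser.feed(markup)
    parser.close()  # flush any buffered text
    return "".join(parser.parts)
```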
- kerb.preprocessing.clean_markdown(text, keep_structure=False)[source]
Remove or normalize markdown formatting.
- Parameters:
- Return type:
- Returns:
Plain or lightly formatted text
Examples
>>> clean_markdown("# Hello **world**")
'Hello world'
- kerb.preprocessing.remove_urls(text, replacement='')[source]
Remove or replace URLs.
- Parameters:
- Return type:
- Returns:
Text without URLs
Examples
>>> remove_urls("Check https://example.com for info")
'Check for info'
- kerb.preprocessing.remove_emails(text, replacement='')[source]
Remove or replace email addresses.
- Parameters:
- Return type:
- Returns:
Text without email addresses
Examples
>>> remove_emails("Contact me@example.com")
'Contact '
- kerb.preprocessing.remove_phone_numbers(text, replacement='')[source]
Remove or replace phone numbers.
- Parameters:
- Return type:
- Returns:
Text without phone numbers
Examples
>>> remove_phone_numbers("Call 555-123-4567")
'Call '
- kerb.preprocessing.remove_special_chars(text, keep_basic=True)[source]
Remove special characters with options.
- Parameters:
- Return type:
- Returns:
Text with special characters removed
Examples
>>> remove_special_chars("Hello@#$world!")
'Hello world!'
- kerb.preprocessing.remove_extra_whitespace(text)[source]
Remove redundant whitespace.
Examples
>>> remove_extra_whitespace("Hello    world")
'Hello world'
- kerb.preprocessing.remove_control_chars(text)[source]
Remove control characters.
Examples
>>> remove_control_chars("Hello\x00world\x01")
'Helloworld'
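A small sketch of control-character removal based on Unicode categories. It assumes newlines and tabs should survive, which the documentation above does not state; the name `drop_control_chars` and the `keep` parameter are hypothetical.

```python
import unicodedata


def drop_control_chars(text: str, keep: str = "\n\t") -> str:
    """Hypothetical helper: remove category Cc characters except those in keep."""
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )


cleaned = drop_control_chars("Hello\x00world\x01")
```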
- kerb.preprocessing.strip_punctuation(text, keep_internal=True)[source]
Remove punctuation with options.
- Parameters:
- Return type:
- Returns:
Text with punctuation removed
Examples
>>> strip_punctuation("Hello, world!")
'Hello world'
- kerb.preprocessing.normalize_case(text, mode='sentence')[source]
Smart case normalization.
- Parameters:
- Return type:
- Returns:
Case-normalized text
Examples
>>> normalize_case("HELLO WORLD", mode=CaseMode.SENTENCE)
'Hello world'
>>> normalize_case("hello world", mode="title")
'Hello World'
- kerb.preprocessing.to_title_case(text)[source]
Convert to title case.
Examples
>>> to_title_case("hello world from python")
'Hello World From Python'
- kerb.preprocessing.to_sentence_case(text)[source]
Convert to sentence case.
Examples
>>> to_sentence_case("hello world. this is a test.")
'Hello world. This is a test.'
- kerb.preprocessing.preserve_acronyms(text, acronyms=None)[source]
Smart case conversion preserving acronyms.
- Parameters:
- Return type:
- Returns:
Text with preserved acronyms
Examples
>>> preserve_acronyms("nasa and fbi are agencies", ["NASA", "FBI"])
'NASA and FBI are agencies'
- kerb.preprocessing.detect_language(text, mode=LanguageDetectionMode.FAST)[source]
Detect text language with multiple strategies.
Uses the langdetect library if available; otherwise falls back to heuristic-based detection supporting 50+ languages.
- Parameters:
text (str) – Input text
mode (LanguageDetectionMode) – Detection mode:
  - FAST: Quick heuristic-based detection
  - ACCURATE: Use langdetect library if available
  - SIMPLE: Basic character range detection
- Return type:
- Returns:
LanguageResult with detected language and confidence
Examples
>>> result = detect_language("Hello world")
>>> result.language
'en'
>>> result = detect_language("Bonjour le monde")
>>> result.language
'fr'
>>> result = detect_language("こんにちは世界")
>>> result.language
'ja'
- kerb.preprocessing.detect_language_batch(texts, mode=LanguageDetectionMode.FAST)[source]
Batch language detection.
- Parameters:
mode (LanguageDetectionMode) – Detection mode
- Return type:
- Returns:
List of LanguageResult objects
Examples
>>> results = detect_language_batch(["Hello", "Bonjour"])
>>> [r.language for r in results]
['en', 'fr']
- kerb.preprocessing.is_language(text, language, threshold=0.5)[source]
Check whether text is in a specific language.
- Parameters:
- Return type:
- Returns:
True if text is detected as the specified language
Examples
>>> is_language("Hello world", "en")
True
- kerb.preprocessing.filter_by_language(texts, language, threshold=0.5)[source]
Filter texts by language.
- Parameters:
- Return type:
- Returns:
List of texts in specified language
Examples
>>> filter_by_language(["Hello", "Bonjour"], "en")
['Hello']
- kerb.preprocessing.get_supported_languages()[source]
Get list of supported languages.
Returns the languages supported by heuristic detection. With the langdetect library installed, 55+ languages are supported; without it, 20+ languages are supported through character-based and pattern detection.
Examples
>>> langs = get_supported_languages()
>>> "en" in langs
True
>>> len(langs) >= 20
True
- kerb.preprocessing.deduplicate_exact(texts, keep_order=True)[source]
Remove exact duplicates.
- Parameters:
- Return type:
- Returns:
List with duplicates removed
Examples
>>> deduplicate_exact(["a", "b", "a", "c"])
['a', 'b', 'c']
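Order-preserving exact deduplication is a one-liner in plain Python, because dictionaries keep insertion order. The helper name below is hypothetical and the sketch is independent of kerb.

```python
def dedupe_preserving_order(texts):
    """Hypothetical helper: keep the first occurrence of each text, in order."""
    # dict.fromkeys keeps the first occurrence of every key in insertion order.
    return list(dict.fromkeys(texts))


unique = dedupe_preserving_order(["a", "b", "a", "c"])
```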
- kerb.preprocessing.deduplicate_fuzzy(texts, similarity_threshold=0.9, keep_order=True)[source]
Remove fuzzy/near duplicates.
- Parameters:
- Return type:
- Returns:
List with fuzzy duplicates removed
Examples
>>> deduplicate_fuzzy(["hello world", "hello world", "goodbye"])
['hello world', 'goodbye']
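The idea of a similarity threshold can be sketched with `difflib.SequenceMatcher` from the standard library. Which similarity measure kerb actually uses is not stated above, so treat the metric, the quadratic comparison loop, and the name `dedupe_fuzzy` as assumptions.

```python
from difflib import SequenceMatcher


def dedupe_fuzzy(texts, similarity_threshold=0.9):
    """Hypothetical helper: drop texts too similar to one already kept."""
    kept = []
    for text in texts:
        # Keep the text only if it is below the threshold against every kept text.
        if all(
            SequenceMatcher(None, text, seen).ratio() < similarity_threshold
            for seen in kept
        ):
            kept.append(text)
    return kept


# The second string differs only by an extra space, so it is treated as a duplicate.
result = dedupe_fuzzy(["hello world", "hello  world", "goodbye"])
```

Each text is compared against every kept text, so this sketch scales quadratically and is only suitable for small batches.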
- kerb.preprocessing.deduplicate_semantic(texts, similarity_threshold=0.85, embed_fn=None)[source]
Remove semantically similar texts.
- Parameters:
- Return type:
- Returns:
List with semantic duplicates removed
Examples
>>> deduplicate_semantic(["hello", "hi", "goodbye"])
['hello', 'goodbye']
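The `embed_fn` parameter suggests that a caller can supply a function mapping text to a vector. The sketch below assumes that contract and uses cosine similarity; both are assumptions, and `toy_embed` is a deliberately crude letter-count stand-in for a real embedding model.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length numeric vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def dedupe_by_embedding(texts, embed_fn, similarity_threshold=0.85):
    """Hypothetical helper: drop texts whose embedding is close to a kept one."""
    kept, kept_vectors = [], []
    for text in texts:
        vector = embed_fn(text)
        if all(cosine(vector, other) < similarity_threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


def toy_embed(text):
    """Toy embedding (assumption): counts of each ASCII letter."""
    return [text.lower().count(ch) for ch in "abcdefghijklmnopqrstuvwxyz"]


# "listen" and "silent" share the same letter counts, so the toy model merges them.
result = dedupe_by_embedding(["listen", "silent", "goodbye"], toy_embed)
```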
- kerb.preprocessing.deduplicate_lines(text, keep_order=True)[source]
Remove duplicate lines.
- Parameters:
- Return type:
- Returns:
Text with duplicate lines removed
Examples
>>> deduplicate_lines("line1\nline2\nline1\nline3")
'line1\nline2\nline3'
- kerb.preprocessing.deduplicate_sentences(text, keep_order=True)[source]
Remove duplicate sentences.
- Parameters:
- Return type:
- Returns:
Text with duplicate sentences removed
Examples
>>> deduplicate_sentences("Hello. World. Hello.")
'Hello. World.'
- kerb.preprocessing.find_duplicates(texts, mode=DeduplicationMode.EXACT)[source]
Find duplicate texts without removing.
- Parameters:
mode (DeduplicationMode) – Deduplication mode
- Return type:
- Returns:
List of index groups representing duplicates
Examples
>>> find_duplicates(["a", "b", "a", "c", "b"])
[[0, 2], [1, 4]]
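The "index groups" return shape can be reproduced for the exact-match case by grouping positions per text. This is a sketch of that grouping only; the helper name is hypothetical and the fuzzy and semantic modes are not covered.

```python
from collections import defaultdict


def duplicate_index_groups(texts):
    """Hypothetical helper: indices of texts that occur more than once."""
    positions = defaultdict(list)
    for index, text in enumerate(texts):
        positions[text].append(index)
    # Only texts seen at least twice form a duplicate group.
    return [group for group in positions.values() if len(group) > 1]


groups = duplicate_index_groups(["a", "b", "a", "c", "b"])
```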
- kerb.preprocessing.compute_text_hash(text, algorithm='md5')[source]
Compute stable text hash for deduplication.
- Parameters:
- Return type:
- Returns:
Hex hash string
Examples
>>> hash1 = compute_text_hash("hello")
>>> hash2 = compute_text_hash("hello")
>>> hash1 == hash2
True
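A stable hex digest of this kind can be produced with `hashlib`. The sketch assumes UTF-8 encoding of the input and that `algorithm` is a name accepted by `hashlib.new`; neither detail is confirmed by the documentation above, and `text_digest` is a hypothetical name.

```python
import hashlib


def text_digest(text: str, algorithm: str = "md5") -> str:
    """Hypothetical helper: hex digest of the UTF-8 encoded text."""
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


digest = text_digest("hello")
sha = text_digest("hello", algorithm="sha256")
```

MD5 is adequate for spotting identical texts but should not be relied on where collision resistance matters.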
- kerb.preprocessing.filter_by_length(texts, min_length=None, max_length=None, unit='chars')[source]
Filter texts by length constraints.
- Parameters:
- Return type:
- Returns:
Filtered list of texts
Examples
>>> filter_by_length(["hi", "hello world", ""], min_length=3)
['hello world']
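A sketch of length filtering with optional bounds. The signature only documents the default `unit='chars'`; the `"words"` value used below is an assumption for illustration, as is the helper name.

```python
def keep_by_length(texts, min_length=None, max_length=None, unit="chars"):
    """Hypothetical helper: keep texts whose length is within the given bounds."""

    def measure(text):
        # "words" is an assumed unit; anything else counts characters.
        return len(text.split()) if unit == "words" else len(text)

    return [
        text for text in texts
        if (min_length is None or measure(text) >= min_length)
        and (max_length is None or measure(text) <= max_length)
    ]


result = keep_by_length(["hi", "hello world", ""], min_length=3)
```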
- kerb.preprocessing.filter_by_pattern(texts, pattern, keep_matches=True, flags=0)[source]
Filter texts by regex pattern.
- Parameters:
- Return type:
- Returns:
Filtered list of texts
Examples
>>> filter_by_pattern(["hello", "world", "hi"], r"^h", keep_matches=True)
['hello', 'hi']
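The `keep_matches` flag can be read as "keep matching texts" versus "keep non-matching texts". The sketch below assumes `re.search` semantics (a match anywhere in the text); whether kerb uses `search` or `match` is not stated, and the helper name is hypothetical.

```python
import re


def keep_by_pattern(texts, pattern, keep_matches=True, flags=0):
    """Hypothetical helper: keep texts that do (or do not) match the pattern."""
    compiled = re.compile(pattern, flags)
    return [
        text for text in texts
        if bool(compiled.search(text)) == keep_matches
    ]


matching = keep_by_pattern(["hello", "world", "hi"], r"^h")
non_matching = keep_by_pattern(["hello", "world", "hi"], r"^h", keep_matches=False)
```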
- kerb.preprocessing.filter_profanity(text, replacement='***')[source]
Remove or mask profane content.
- Parameters:
- Return type:
- Returns:
Filtered text
Examples
>>> filter_profanity("This is clean text")
'This is clean text'
- kerb.preprocessing.filter_pii(text, replacement='[REDACTED]')[source]
Remove or mask personally identifiable information.
- Parameters:
- Return type:
- Returns:
Text with PII removed
Examples
>>> filter_pii("Email me@example.com or call 555-1234")
'Email [REDACTED] or call [REDACTED]'
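Regex-based redaction of this kind might look like the sketch below. The two patterns are illustrative assumptions that cover simple email addresses and US-style phone numbers only; they are not kerb's patterns, and real PII detection needs far broader coverage.

```python
import re

# Illustrative patterns (assumptions), deliberately simple.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b(?:\d{3}[-.\s])?\d{3}[-.\s]\d{4}\b")


def redact(text: str, replacement: str = "[REDACTED]") -> str:
    """Hypothetical helper: mask email addresses and phone numbers."""
    text = EMAIL_RE.sub(replacement, text)
    return PHONE_RE.sub(replacement, text)


masked = redact("Email me@example.com or call 555-1234")
```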
- kerb.preprocessing.detect_spam(text, threshold=0.5)[source]
Detect spam or low-quality content.
- Parameters:
- Return type:
- Returns:
True if text is likely spam
Examples
>>> detect_spam("BUY NOW!!! CLICK HERE!!!")
True
- kerb.preprocessing.filter_by_quality(texts, min_score=0.5)[source]
Filter by quality metrics.
- Parameters:
- Return type:
- Returns:
List of high-quality texts
Examples
>>> filter_by_quality(["Good text here.", "x", "Another good one."])
['Good text here.', 'Another good one.']
- kerb.preprocessing.filter_non_ascii(text, replacement='', keep_extended=True)[source]
Filter or replace non-ASCII characters.
- Parameters:
- Return type:
- Returns:
ASCII-filtered text
Examples
>>> filter_non_ascii("Hello 世界")
'Hello '
- kerb.preprocessing.classify_content_type(text)[source]
Classify text content type.
- Parameters:
text (str) – Input text
- Return type:
- Returns:
ContentType enum value
Examples
>>> classify_content_type("def foo():\n pass")
<ContentType.CODE: 'code'>
- kerb.preprocessing.detect_code(text)[source]
Detect if text contains code.
Examples
>>> detect_code("def foo(): return True")
True
- kerb.preprocessing.detect_sentiment(text)[source]
Basic sentiment detection.
Examples
>>> detect_sentiment("I love this!")
'positive'
- kerb.preprocessing.measure_readability(text)[source]
Calculate readability score (0-1, higher is more readable).
Examples
>>> score = measure_readability("This is simple text.")
>>> score > 0.5
True
- kerb.preprocessing.count_words(text)[source]
Smart word counting.
Examples
>>> count_words("Hello world, this is a test")
6
- kerb.preprocessing.count_sentences(text)[source]
Smart sentence counting.
Examples
>>> count_sentences("Hello. World! How are you?")
3
- kerb.preprocessing.count_paragraphs(text)[source]
Count paragraphs.
Examples
>>> count_paragraphs("Para 1\n\nPara 2\n\nPara 3")
3
- kerb.preprocessing.expand_contractions(text)[source]
Expand English contractions.
- Parameters:
text (str) – Input text with contractions
- Return type:
- Returns:
Text with expanded contractions
Examples
>>> expand_contractions("I'm doesn't can't")
'I am does not cannot'
- kerb.preprocessing.standardize_numbers(text)[source]
Convert number words to digits.
Examples
>>> standardize_numbers("I have three apples and five oranges")
'I have 3 apples and 5 oranges'
- kerb.preprocessing.standardize_dates(text)[source]
Normalize date formats.
- Parameters:
text (str) – Input text with dates
- Return type:
- Returns:
Text with standardized dates (YYYY-MM-DD)
Examples
>>> standardize_dates("Meeting on 12/25/2024")
'Meeting on 2024-12-25'
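The conversion in the example can be sketched with a regular expression and `datetime.date`. The sketch assumes US-style MM/DD/YYYY input only and leaves impossible dates untouched; which input formats kerb recognizes is not documented above, and `us_dates_to_iso` is a hypothetical name.

```python
import re
from datetime import date

# Assumption: only MM/DD/YYYY with a four-digit year is handled.
US_DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def us_dates_to_iso(text: str) -> str:
    """Hypothetical helper: rewrite MM/DD/YYYY dates as YYYY-MM-DD."""

    def convert(match):
        month, day, year = (int(part) for part in match.groups())
        try:
            return date(year, month, day).isoformat()
        except ValueError:
            # Not a real calendar date; leave the original text in place.
            return match.group(0)

    return US_DATE_RE.sub(convert, text)


converted = us_dates_to_iso("Meeting on 12/25/2024")
```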
- kerb.preprocessing.extract_entities(text, entity_type=None)[source]
Extract named entities (basic).
- Parameters:
- Return type:
- Returns:
List of extracted entities
Examples
>>> extract_entities("Apple Inc. is in California")
['Apple Inc.', 'California']
- kerb.preprocessing.segment_sentences(text)[source]
Sentence segmentation.
Examples
>>> segment_sentences("Hello world. How are you?")
['Hello world.', 'How are you?']
- kerb.preprocessing.segment_words(text)[source]
Word segmentation (tokenization).
Examples
>>> segment_words("Hello, world!")
['Hello', 'world']
- kerb.preprocessing.preprocess_batch(texts, operations=None, **kwargs)[source]
Apply preprocessing pipeline to batch.
- Parameters:
- Return type:
- Returns:
List of preprocessed texts
Examples
>>> preprocess_batch([" HELLO ", " WORLD "], [str.lower, str.strip])
['hello', 'world']
- kerb.preprocessing.preprocess_pipeline(*operations)[source]
Create custom preprocessing pipeline.
- Parameters:
*operations (Callable) – Preprocessing functions to chain
- Return type:
- Returns:
Pipeline function
Examples
>>> pipeline = preprocess_pipeline(str.lower, str.strip)
>>> pipeline(" HELLO ")
'hello'
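Chaining callables left to right, as the example above does, is ordinary function composition. A minimal sketch with `functools.reduce` follows; `compose_pipeline` is a hypothetical name and the sketch says nothing about how kerb builds its pipeline object.

```python
from functools import reduce


def compose_pipeline(*operations):
    """Hypothetical helper: return a function applying operations in order."""

    def run(text):
        # Feed the output of each operation into the next one.
        return reduce(lambda value, operation: operation(value), operations, text)

    return run


pipeline = compose_pipeline(str.lower, str.strip)
cleaned = pipeline(" HELLO ")
```

Because the result is itself a callable taking one string, pipelines built this way can be nested inside other pipelines.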
- kerb.preprocessing.truncate_text(text, max_length, strategy='end', suffix='...')[source]
Truncate text intelligently.
- Parameters:
- Return type:
- Returns:
Truncated text
Examples
>>> truncate_text("Hello world", max_length=8)
'Hello...'
>>> truncate_text("Hello world", max_length=8, strategy=TruncateStrategy.MIDDLE)
'He...ld'
>>> truncate_text("This is a sentence. And another one.", max_length=20, strategy="smart")
'This is a sentence....'
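The "end" and "middle" strategies can be sketched as follows. The arithmetic is chosen so that it reproduces the first two example outputs above, but it is an assumption about the behaviour rather than kerb's code; the "smart" strategy is not covered and `truncate_sketch` is a hypothetical name.

```python
def truncate_sketch(text, max_length, strategy="end", suffix="..."):
    """Hypothetical helper: shorten text to roughly max_length characters."""
    if len(text) <= max_length:
        return text
    budget = max_length - len(suffix)  # characters left for the original text
    if budget <= 0:
        return text[:max_length]
    if strategy == "end":
        return text[:budget] + suffix
    if strategy == "middle":
        half = budget // 2
        # Keep an equal number of characters from both ends.
        return text[:half] + suffix + text[len(text) - half:]
    raise ValueError(f"unknown strategy: {strategy!r}")


end_cut = truncate_sketch("Hello world", max_length=8)
middle_cut = truncate_sketch("Hello world", max_length=8, strategy="middle")
```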
- kerb.preprocessing.split_long_text(text, max_length, overlap=0, preserve_words=True)[source]
Split text exceeding length limit.
- Parameters:
- Return type:
- Returns:
List of text chunks
Examples
>>> split_long_text("Hello world test", max_length=8)
['Hello', 'world', 'test']
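Word-preserving splitting can be sketched as a greedy packer that adds words to a chunk until the limit would be exceeded. The sketch ignores the `overlap` parameter and keeps any single word longer than `max_length` whole; both choices are assumptions, and `split_on_words` is a hypothetical name.

```python
def split_on_words(text, max_length):
    """Hypothetical helper: greedy word-preserving chunking."""
    chunks, current = [], ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_length or not current:
            # Either the word fits, or the chunk is empty and must take it anyway.
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


chunks = split_on_words("Hello world test", max_length=8)
```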
Text cleaning and preprocessing for LLM inputs.