Preprocessing Module

Text preprocessing utilities for LLM applications.

This module provides comprehensive text preprocessing tools for cleaning, normalizing, and preparing text data for LLM processing.

Usage Examples:
>>> # Common usage - normalize text
>>> from kerb.preprocessing import normalize_text
>>> clean = normalize_text("  Hello   World!  ", lowercase=True)
>>> # Text operations
>>> from kerb.preprocessing.text import (
...     normalize_whitespace,
...     remove_special_chars,
...     truncate_text
... )
>>> # Language detection
>>> from kerb.preprocessing.language import detect_language
>>> result = detect_language("Bonjour le monde")
>>> # Content filtering
>>> from kerb.preprocessing.filtering import filter_by_length
>>> filtered = filter_by_length(["hi", "hello world"], min_length=5)
>>> # Batch processing
>>> from kerb.preprocessing.batch import preprocess_batch
>>> processed = preprocess_batch(["  text1  ", "  text2  "])
Organization:
  • Top-level: Core functions and most common operations

  • Submodules: Specialized implementations organized by functionality
    • text: Text normalization, cleaning, case handling

    • language: Language detection and filtering

    • deduplication: Text deduplication operations

    • filtering: Content filtering and quality control

    • analysis: Content analysis and classification

    • transforms: Advanced text transformations

    • batch: Batch processing utilities

    • enums: Enumeration types

    • types: Data classes and type definitions

class kerb.preprocessing.NormalizationLevel(*values)[source]

Bases: Enum

Text normalization intensity.

MINIMAL = 'minimal'
STANDARD = 'standard'
AGGRESSIVE = 'aggressive'
class kerb.preprocessing.LanguageDetectionMode(*values)[source]

Bases: Enum

Language detection strategy.

FAST = 'fast'
ACCURATE = 'accurate'
SIMPLE = 'simple'
class kerb.preprocessing.DeduplicationMode(*values)[source]

Bases: Enum

Deduplication strategy.

EXACT = 'exact'
FUZZY = 'fuzzy'
SEMANTIC = 'semantic'
class kerb.preprocessing.ContentType(*values)[source]

Bases: Enum

Text content type classification.

PLAIN_TEXT = 'plain_text'
CODE = 'code'
MARKDOWN = 'markdown'
HTML = 'html'
JSON = 'json'
MIXED = 'mixed'
UNKNOWN = 'unknown'
class kerb.preprocessing.LanguageResult(language, confidence, alternatives=<factory>)[source]

Bases: object

Language detection result.

language: str
confidence: float
alternatives: List[Tuple[str, float]]
__init__(language, confidence, alternatives=<factory>)
class kerb.preprocessing.QualityMetrics(length, word_count, avg_word_length, sentence_count, avg_sentence_length, special_char_ratio, digit_ratio, uppercase_ratio, readability_score)[source]

Bases: object

Text quality metrics.

length: int
word_count: int
avg_word_length: float
sentence_count: int
avg_sentence_length: float
special_char_ratio: float
digit_ratio: float
uppercase_ratio: float
readability_score: float
__init__(length, word_count, avg_word_length, sentence_count, avg_sentence_length, special_char_ratio, digit_ratio, uppercase_ratio, readability_score)
class kerb.preprocessing.NormalizationConfig(level=NormalizationLevel.STANDARD, lowercase=False, remove_urls=True, remove_emails=True, remove_extra_spaces=True)[source]

Bases: object

Configuration for text normalization operations.

level

Normalization intensity level

lowercase

Convert to lowercase

remove_urls

Remove URLs from text

remove_emails

Remove email addresses

remove_extra_spaces

Remove redundant whitespace

level: NormalizationLevel = 'standard'
lowercase: bool = False
remove_urls: bool = True
remove_emails: bool = True
remove_extra_spaces: bool = True
__init__(level=NormalizationLevel.STANDARD, lowercase=False, remove_urls=True, remove_emails=True, remove_extra_spaces=True)
kerb.preprocessing.normalize_text(text, level=NormalizationLevel.STANDARD, lowercase=False, remove_urls=True, remove_emails=True, remove_extra_spaces=True, config=None)[source]

Comprehensive text normalization with configurable intensity.

Parameters:
  • text (str) – Input text to normalize

  • level (NormalizationLevel) – Normalization intensity level (ignored if config is provided)

  • lowercase (bool) – Convert to lowercase (ignored if config is provided)

  • remove_urls (bool) – Remove URLs from text (ignored if config is provided)

  • remove_emails (bool) – Remove email addresses (ignored if config is provided)

  • remove_extra_spaces (bool) – Remove redundant whitespace (ignored if config is provided)

  • config (Optional[NormalizationConfig]) – NormalizationConfig object with all parameters (recommended)

Return type:

str

Returns:

Normalized text

Examples

>>> # Using config object (recommended)
>>> from kerb.preprocessing import NormalizationConfig, NormalizationLevel, normalize_text
>>> config = NormalizationConfig(
...     level=NormalizationLevel.STANDARD,
...     lowercase=True,
...     remove_urls=True
... )
>>> normalized = normalize_text("Check this: https://example.com", config=config)
>>> # Using individual parameters (backward compatible)
>>> normalized = normalize_text("HELLO WORLD", lowercase=True)
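The precedence rule described above (individual keyword arguments are ignored once config is provided) can be sketched with the standard library alone. SketchConfig and normalize_sketch are hypothetical stand-ins, not kerb's classes, and the URL regex is a deliberately crude assumption.

```python
import re
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical stand-in for a normalization config object."""
    lowercase: bool = False
    remove_urls: bool = True
    remove_extra_spaces: bool = True


def normalize_sketch(
    text: str,
    lowercase: bool = False,
    remove_urls: bool = True,
    remove_extra_spaces: bool = True,
    config: Optional[SketchConfig] = None,
) -> str:
    # When a config is passed it wins; the loose keyword arguments are ignored.
    if config is None:
        config = SketchConfig(lowercase, remove_urls, remove_extra_spaces)
    if config.remove_urls:
        text = re.sub(r"https?://\S+", "", text)  # crude URL pattern (assumption)
    if config.lowercase:
        text = text.lower()
    if config.remove_extra_spaces:
        text = re.sub(r"\s+", " ", text).strip()
    return text
```

Funnelling both call styles into one config object keeps the processing steps in a single code path, which is one plausible reason to recommend the config form.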
kerb.preprocessing.normalize_whitespace(text)[source]

Normalize whitespace and newlines.

Parameters:

text (str) – Input text

Return type:

str

Returns:

Text with normalized whitespace

Examples

>>> normalize_whitespace("Hello   world\n\n\ntest")
'Hello world\n\ntest'
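One way to get the behavior shown in the example is a pair of regular expressions: collapse runs of spaces and tabs, then cap consecutive newlines at two. This is a minimal sketch under that assumption; normalize_whitespace_sketch is a hypothetical name and kerb's actual rules may differ.

```python
import re


def normalize_whitespace_sketch(text: str) -> str:
    # Collapse runs of spaces/tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Keep paragraph breaks but cap them at one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```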
kerb.preprocessing.normalize_unicode(text, form='NFKC')[source]

Normalize unicode characters.

Parameters:
  • text (str) – Input text

  • form (str) – Unicode normalization form (NFC, NFD, NFKC, NFKD)

Return type:

str

Returns:

Unicode-normalized text

Examples

>>> normalize_unicode("café")  # Normalizes different accent representations
'café'
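The difference between the normalization forms is easier to see with explicit code points. The snippet below uses only the standard library's unicodedata module; it illustrates the forms themselves and makes no claim about how kerb wraps them.

```python
import unicodedata

decomposed = "cafe\u0301"   # 'e' + COMBINING ACUTE ACCENT (5 code points)
composed = "caf\u00e9"      # precomposed 'é' (4 code points)

# NFC composes base letter + combining mark into one code point.
nfc = unicodedata.normalize("NFC", decomposed)

# NFKC also folds compatibility characters, e.g. the 'fi' ligature U+FB01.
nfkc_ligature = unicodedata.normalize("NFKC", "\ufb01le")

# NFC leaves compatibility characters alone.
nfc_ligature = unicodedata.normalize("NFC", "\ufb01le")

print(nfc == composed)             # True
print(nfkc_ligature)               # file
print(nfc_ligature == "\ufb01le")  # True
```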
kerb.preprocessing.normalize_quotes(text)[source]

Convert smart quotes to standard quotes.

Parameters:

text (str) – Input text

Return type:

str

Returns:

Text with standard quotes

Examples

>>> normalize_quotes('“Hello” and ‘world’')
'"Hello" and \'world\''
kerb.preprocessing.normalize_dashes(text)[source]

Convert various dashes to standard forms.

Parameters:

text (str) – Input text

Return type:

str

Returns:

Text with standard dashes

Examples

>>> normalize_dashes("em—dash and en–dash")
'em-dash and en-dash'
kerb.preprocessing.remove_accents(text)[source]

Remove diacritical marks from text.

Parameters:

text (str) – Input text

Return type:

str

Returns:

Text without accents

Examples

>>> remove_accents("café résumé")
'cafe resume'
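A common standard-library technique for stripping diacritics is to decompose the text (NFD) and drop the combining marks. The sketch below shows that technique; remove_accents_sketch is a hypothetical name, and whether kerb uses the same approach is an assumption.

```python
import unicodedata


def remove_accents_sketch(text: str) -> str:
    # NFD splits 'é' into 'e' + a combining accent.
    decomposed = unicodedata.normalize("NFD", text)
    # unicodedata.combining() is non-zero for combining marks; drop those.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```

Letters that have no canonical decomposition, such as 'ø' or 'ß', pass through this approach unchanged.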
kerb.preprocessing.clean_html(text, keep_newlines=True)[source]

Remove HTML tags and entities.

Parameters:
  • text (str) – Input text with HTML

  • keep_newlines (bool) – Keep newlines from <br> and <p> tags

Return type:

str

Returns:

Plain text without HTML

Examples

>>> clean_html("<p>Hello <b>world</b></p>")
'Hello world'
kerb.preprocessing.clean_markdown(text, keep_structure=False)[source]

Remove or normalize markdown formatting.

Parameters:
  • text (str) – Input markdown text

  • keep_structure (bool) – Keep basic structure (headings, lists)

Return type:

str

Returns:

Plain or lightly formatted text

Examples

>>> clean_markdown("# Hello **world**")
'Hello world'
kerb.preprocessing.remove_urls(text, replacement='')[source]

Remove or replace URLs.

Parameters:
  • text (str) – Input text

  • replacement (str) – String to replace URLs with

Return type:

str

Returns:

Text without URLs

Examples

>>> remove_urls("Check https://example.com for info")
'Check  for info'
kerb.preprocessing.remove_emails(text, replacement='')[source]

Remove or replace email addresses.

Parameters:
  • text (str) – Input text

  • replacement (str) – String to replace emails with

Return type:

str

Returns:

Text without email addresses

Examples

>>> remove_emails("Contact me@example.com")
'Contact '
kerb.preprocessing.remove_phone_numbers(text, replacement='')[source]

Remove or replace phone numbers.

Parameters:
  • text (str) – Input text

  • replacement (str) – String to replace phone numbers with

Return type:

str

Returns:

Text without phone numbers

Examples

>>> remove_phone_numbers("Call 555-123-4567")
'Call '
kerb.preprocessing.remove_special_chars(text, keep_basic=True)[source]

Remove special characters with options.

Parameters:
  • text (str) – Input text

  • keep_basic (bool) – Keep basic punctuation (.,!?;:)

Return type:

str

Returns:

Text with special characters removed

Examples

>>> remove_special_chars("Hello@#$world!")
'Hello world!'
kerb.preprocessing.remove_extra_whitespace(text)[source]

Remove redundant whitespace.

Parameters:

text (str) – Input text

Return type:

str

Returns:

Text with single spaces only

Examples

>>> remove_extra_whitespace("Hello    world")
'Hello world'
kerb.preprocessing.remove_control_chars(text)[source]

Remove control characters.

Parameters:

text (str) – Input text

Return type:

str

Returns:

Text without control characters

Examples

>>> remove_control_chars("Hello\x00world\x01")
'Helloworld'
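Control characters can be identified by their Unicode general category ("Cc"). The following is a minimal sketch of that idea; the function name and the choice to keep newlines and tabs are assumptions, not documented kerb behavior.

```python
import unicodedata


def remove_control_chars_sketch(text: str, keep: str = "\n\t") -> str:
    # Category "Cc" covers C0/C1 controls such as NUL, SOH and DEL.
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )
```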
kerb.preprocessing.strip_punctuation(text, keep_internal=True)[source]

Remove punctuation with options.

Parameters:
  • text (str) – Input text

  • keep_internal (bool) – Keep punctuation within words (e.g., apostrophes)

Return type:

str

Returns:

Text with punctuation removed

Examples

>>> strip_punctuation("Hello, world!")
'Hello world'
kerb.preprocessing.normalize_case(text, mode='sentence')[source]

Smart case normalization.

Parameters:
  • text (str) – Input text

  • mode (Union[CaseMode, str]) – Case mode (CaseMode enum or string: “lower”, “upper”, “title”, “sentence”)

Return type:

str

Returns:

Case-normalized text

Examples

>>> normalize_case("HELLO WORLD", mode=CaseMode.SENTENCE)
'Hello world'
>>> normalize_case("hello world", mode="title")
'Hello World'
kerb.preprocessing.to_title_case(text)[source]

Convert to title case.

Parameters:

text (str) – Input text

Return type:

str

Returns:

Title-cased text

Examples

>>> to_title_case("hello world from python")
'Hello World From Python'
kerb.preprocessing.to_sentence_case(text)[source]

Convert to sentence case.

Parameters:

text (str) – Input text

Return type:

str

Returns:

Sentence-cased text

Examples

>>> to_sentence_case("hello world. this is a test.")
'Hello world. This is a test.'
kerb.preprocessing.preserve_acronyms(text, acronyms=None)[source]

Smart case conversion preserving acronyms.

Parameters:
  • text (str) – Input text

  • acronyms (Optional[List[str]]) – List of acronyms to preserve (default: common ones)

Return type:

str

Returns:

Text with preserved acronyms

Examples

>>> preserve_acronyms("nasa and fbi are agencies", ["NASA", "FBI"])
'NASA and FBI are agencies'
kerb.preprocessing.detect_language(text, mode=LanguageDetectionMode.FAST)[source]

Detect text language with multiple strategies.

Uses langdetect library if available, otherwise falls back to heuristic-based detection supporting 50+ languages.

Parameters:
  • text (str) – Input text

  • mode (LanguageDetectionMode) – Detection mode:
    • FAST: Quick heuristic-based detection
    • ACCURATE: Use langdetect library if available
    • SIMPLE: Basic character range detection

Return type:

LanguageResult

Returns:

LanguageResult with detected language and confidence

Examples

>>> result = detect_language("Hello world")
>>> result.language
'en'
>>> result = detect_language("Bonjour le monde")
>>> result.language
'fr'
>>> result = detect_language("こんにちは世界")
>>> result.language
'ja'
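To give a feel for what "basic character range detection" can mean, the sketch below counts letters per Unicode block and reports the dominant script with a share-based confidence. The range table, labels, and function name are all hypothetical; mapping a script to a language code (for example kana to 'ja') would need further rules and is not shown.

```python
from collections import Counter

# Hypothetical script table: (label, first code point, last code point).
SCRIPT_RANGES = [
    ("kana", 0x3040, 0x30FF),      # Hiragana + Katakana
    ("cjk", 0x4E00, 0x9FFF),       # CJK Unified Ideographs
    ("cyrillic", 0x0400, 0x04FF),
    ("latin", 0x0041, 0x024F),     # Basic Latin through Latin Extended-B
]


def dominant_script(text: str):
    """Return (script label, share of letters in that script)."""
    counts = Counter()
    for ch in text:
        if not ch.isalpha():
            continue  # ignore digits, spaces and punctuation
        cp = ord(ch)
        for label, lo, hi in SCRIPT_RANGES:
            if lo <= cp <= hi:
                counts[label] += 1
                break
    if not counts:
        return "unknown", 0.0
    label, n = counts.most_common(1)[0]
    return label, n / sum(counts.values())
```

The share of letters doubles as a rough confidence value, similar in spirit to the confidence field of LanguageResult.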
kerb.preprocessing.detect_language_batch(texts, mode=LanguageDetectionMode.FAST)[source]

Batch language detection.

Parameters:
  • texts (List[str]) – List of texts

  • mode (LanguageDetectionMode) – Detection mode

Return type:

List[LanguageResult]

Returns:

List of LanguageResult objects

Examples

>>> results = detect_language_batch(["Hello", "Bonjour"])
>>> [r.language for r in results]
['en', 'fr']
kerb.preprocessing.is_language(text, language, threshold=0.5)[source]

Check whether text is in a specific language.

Parameters:
  • text (str) – Input text

  • language (str) – Language code to check (e.g., ‘en’, ‘fr’)

  • threshold (float) – Confidence threshold

Return type:

bool

Returns:

True if the text is detected as the specified language

Examples

>>> is_language("Hello world", "en")
True
kerb.preprocessing.filter_by_language(texts, language, threshold=0.5)[source]

Filter texts by language.

Parameters:
  • texts (List[str]) – List of texts

  • language (str) – Language code to filter for

  • threshold (float) – Confidence threshold

Return type:

List[str]

Returns:

List of texts in specified language

Examples

>>> filter_by_language(["Hello", "Bonjour"], "en")
['Hello']
kerb.preprocessing.get_supported_languages()[source]

Get list of supported languages.

Returns heuristic-supported languages. With langdetect library installed, 55+ languages are supported. Without it, 20+ languages are supported through character-based and pattern detection.

Return type:

List[str]

Returns:

List of language codes

Examples

>>> langs = get_supported_languages()
>>> "en" in langs
True
>>> len(langs) >= 20
True
kerb.preprocessing.deduplicate_exact(texts, keep_order=True)[source]

Remove exact duplicates.

Parameters:
  • texts (List[str]) – List of texts

  • keep_order (bool) – Preserve original order

Return type:

List[str]

Returns:

List with duplicates removed

Examples

>>> deduplicate_exact(["a", "b", "a", "c"])
['a', 'b', 'c']
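The effect of keep_order can be illustrated with two standard idioms: dict.fromkeys preserves first-seen order, while a plain set does not guarantee any order. This is a sketch of the idea with a hypothetical function name, not kerb's code.

```python
from typing import List


def dedupe_exact_sketch(texts: List[str], keep_order: bool = True) -> List[str]:
    if keep_order:
        # dict keys are unique and remember insertion order.
        return list(dict.fromkeys(texts))
    # A set is enough when order is irrelevant; result order is arbitrary.
    return list(set(texts))
```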
kerb.preprocessing.deduplicate_fuzzy(texts, similarity_threshold=0.9, keep_order=True)[source]

Remove fuzzy/near duplicates.

Parameters:
  • texts (List[str]) – List of texts

  • similarity_threshold (float) – Similarity threshold (0-1)

  • keep_order (bool) – Preserve original order

Return type:

List[str]

Returns:

List with fuzzy duplicates removed

Examples

>>> deduplicate_fuzzy(["hello world", "hello  world", "goodbye"])
['hello world', 'goodbye']
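Near-duplicate removal with a similarity threshold can be sketched using difflib.SequenceMatcher from the standard library. The similarity measure kerb actually uses is not stated here, so treat the metric and the function name as assumptions.

```python
from difflib import SequenceMatcher
from typing import List


def dedupe_fuzzy_sketch(texts: List[str], similarity_threshold: float = 0.9) -> List[str]:
    kept: List[str] = []
    for candidate in texts:
        # Compare against everything kept so far; first occurrence wins.
        is_duplicate = any(
            SequenceMatcher(None, candidate, existing).ratio() >= similarity_threshold
            for existing in kept
        )
        if not is_duplicate:
            kept.append(candidate)
    return kept
```

Each candidate is compared with every kept text, so the cost grows quadratically with the number of distinct texts; large corpora usually need hashing or blocking first.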
kerb.preprocessing.deduplicate_semantic(texts, similarity_threshold=0.85, embed_fn=None)[source]

Remove semantically similar texts.

Parameters:
  • texts (List[str]) – List of texts

  • similarity_threshold (float) – Semantic similarity threshold (0-1)

  • embed_fn (Optional[Callable]) – Optional embedding function (uses simple fallback if None)

Return type:

List[str]

Returns:

List with semantic duplicates removed

Examples

>>> deduplicate_semantic(["hello", "hi", "goodbye"])
['hello', 'goodbye']
kerb.preprocessing.deduplicate_lines(text, keep_order=True)[source]

Remove duplicate lines.

Parameters:
  • text (str) – Input text

  • keep_order (bool) – Preserve line order

Return type:

str

Returns:

Text with duplicate lines removed

Examples

>>> deduplicate_lines("line1\nline2\nline1\nline3")
'line1\nline2\nline3'
kerb.preprocessing.deduplicate_sentences(text, keep_order=True)[source]

Remove duplicate sentences.

Parameters:
  • text (str) – Input text

  • keep_order (bool) – Preserve sentence order

Return type:

str

Returns:

Text with duplicate sentences removed

Examples

>>> deduplicate_sentences("Hello. World. Hello.")
'Hello. World.'
kerb.preprocessing.find_duplicates(texts, mode=DeduplicationMode.EXACT)[source]

Find duplicate texts without removing them.

Parameters:
  • texts (List[str]) – List of texts

  • mode (DeduplicationMode) – Deduplication strategy

Return type:

List[List[int]]

Returns:

List of index groups representing duplicates

Examples

>>> find_duplicates(["a", "b", "a", "c", "b"])
[[0, 2], [1, 4]]
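For exact matching, the index groups in the example can be produced by bucketing positions under each text. The sketch below covers only the exact case; the function name is hypothetical and the fuzzy and semantic modes are not modeled.

```python
from collections import defaultdict
from typing import List


def find_duplicates_sketch(texts: List[str]) -> List[List[int]]:
    positions = defaultdict(list)
    for index, text in enumerate(texts):
        positions[text].append(index)
    # Only texts seen more than once form a duplicate group.
    return [group for group in positions.values() if len(group) > 1]
```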
kerb.preprocessing.compute_text_hash(text, algorithm='md5')[source]

Compute stable text hash for deduplication.

Parameters:
  • text (str) – Input text

  • algorithm (str) – Hash algorithm (md5, sha1, sha256)

Return type:

str

Returns:

Hex hash string

Examples

>>> hash1 = compute_text_hash("hello")
>>> hash2 = compute_text_hash("hello")
>>> hash1 == hash2
True
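A stable hash of this kind is straightforward to build on hashlib. The sketch below assumes UTF-8 encoding and uses hypothetical function names; it also shows how such a hash can back a memory-light exact deduplication pass.

```python
import hashlib
from typing import List


def text_hash_sketch(text: str, algorithm: str = "md5") -> str:
    # UTF-8 encoding is an assumption; the hash is stable across runs and machines.
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()


def dedupe_by_hash_sketch(texts: List[str], algorithm: str = "sha256") -> List[str]:
    seen = set()
    kept = []
    for text in texts:
        digest = text_hash_sketch(text, algorithm)
        if digest not in seen:  # store digests instead of full texts
            seen.add(digest)
            kept.append(text)
    return kept
```

Unlike Python's built-in hash(), which is salted per process for strings, these digests are reproducible, which is what makes them usable as deduplication keys across runs.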
kerb.preprocessing.filter_by_length(texts, min_length=None, max_length=None, unit='chars')[source]

Filter texts by length constraints.

Parameters:
  • texts (List[str]) – List of texts

  • min_length (Optional[int]) – Minimum length

  • max_length (Optional[int]) – Maximum length

  • unit (str) – Length unit - “chars”, “words”, “sentences”

Return type:

List[str]

Returns:

Filtered list of texts

Examples

>>> filter_by_length(["hi", "hello world", ""], min_length=3)
['hello world']
kerb.preprocessing.filter_by_pattern(texts, pattern, keep_matches=True, flags=0)[source]

Filter texts by regex pattern.

Parameters:
  • texts (List[str]) – List of texts

  • pattern (str) – Regex pattern

  • keep_matches (bool) – Keep matching texts (False to keep non-matching)

  • flags (int) – Regex flags

Return type:

List[str]

Returns:

Filtered list of texts

Examples

>>> filter_by_pattern(["hello", "world", "hi"], r"^h", keep_matches=True)
['hello', 'hi']
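The keep_matches switch amounts to comparing "did the pattern match?" against the desired outcome. The sketch below uses re.search; whether kerb uses search or match semantics is an assumption, and the function name is hypothetical.

```python
import re
from typing import List


def filter_by_pattern_sketch(
    texts: List[str], pattern: str, keep_matches: bool = True, flags: int = 0
) -> List[str]:
    regex = re.compile(pattern, flags)  # compile once, reuse for every text
    return [t for t in texts if bool(regex.search(t)) == keep_matches]
```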
kerb.preprocessing.filter_profanity(text, replacement='***')[source]

Remove or mask profane content.

Parameters:
  • text (str) – Input text

  • replacement (str) – Replacement string for profanity

Return type:

str

Returns:

Filtered text

Examples

>>> filter_profanity("This is clean text")
'This is clean text'
kerb.preprocessing.filter_pii(text, replacement='[REDACTED]')[source]

Remove or mask personally identifiable information.

Parameters:
  • text (str) – Input text

  • replacement (str) – Replacement string for PII

Return type:

str

Returns:

Text with PII removed

Examples

>>> filter_pii("Email me@example.com or call 555-1234")
'Email [REDACTED] or call [REDACTED]'
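Pattern-based redaction of the kind shown in the example can be sketched with two regular expressions. Both patterns below are simplified assumptions that cover only plain emails and US-style phone numbers; they are not kerb's patterns and will miss many real-world PII formats.

```python
import re

# Simplified, hypothetical patterns for illustration only.
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b(?:\d{3}[-.\s])?\d{3}[-.\s]\d{4}\b")


def redact_sketch(text: str, replacement: str = "[REDACTED]") -> str:
    text = EMAIL_RE.sub(replacement, text)
    return PHONE_RE.sub(replacement, text)
```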
kerb.preprocessing.detect_spam(text, threshold=0.5)[source]

Detect spam or low-quality content.

Parameters:
  • text (str) – Input text

  • threshold (float) – Spam score threshold (0-1)

Return type:

bool

Returns:

True if text is likely spam

Examples

>>> detect_spam("BUY NOW!!! CLICK HERE!!!")
True
kerb.preprocessing.filter_by_quality(texts, min_score=0.5)[source]

Filter by quality metrics.

Parameters:
  • texts (List[str]) – List of texts

  • min_score (float) – Minimum quality score (0-1)

Return type:

List[str]

Returns:

List of high-quality texts

Examples

>>> filter_by_quality(["Good text here.", "x", "Another good one."])
['Good text here.', 'Another good one.']
kerb.preprocessing.filter_non_ascii(text, replacement='', keep_extended=True)[source]

Filter or replace non-ASCII characters.

Parameters:
  • text (str) – Input text

  • replacement (str) – Replacement for non-ASCII chars

  • keep_extended (bool) – Keep extended ASCII (128-255)

Return type:

str

Returns:

ASCII-filtered text

Examples

>>> filter_non_ascii("Hello 世界")
'Hello '
kerb.preprocessing.classify_content_type(text)[source]

Classify text content type.

Parameters:

text (str) – Input text

Return type:

ContentType

Returns:

ContentType enum value

Examples

>>> classify_content_type("def foo():\n    pass")
<ContentType.CODE: 'code'>
kerb.preprocessing.detect_code(text)[source]

Detect if text contains code.

Parameters:

text (str) – Input text

Return type:

bool

Returns:

True if text appears to be code

Examples

>>> detect_code("def foo(): return True")
True
kerb.preprocessing.detect_sentiment(text)[source]

Basic sentiment detection.

Parameters:

text (str) – Input text

Return type:

str

Returns:

“positive”, “negative”, or “neutral”

Examples

>>> detect_sentiment("I love this!")
'positive'
kerb.preprocessing.measure_readability(text)[source]

Calculate readability score (0-1, higher is more readable).

Parameters:

text (str) – Input text

Return type:

float

Returns:

Readability score

Examples

>>> score = measure_readability("This is simple text.")
>>> score > 0.5
True
kerb.preprocessing.count_words(text)[source]

Smart word counting.

Parameters:

text (str) – Input text

Return type:

int

Returns:

Word count

Examples

>>> count_words("Hello world, this is a test")
6
kerb.preprocessing.count_sentences(text)[source]

Smart sentence counting.

Parameters:

text (str) – Input text

Return type:

int

Returns:

Sentence count

Examples

>>> count_sentences("Hello. World! How are you?")
3
kerb.preprocessing.count_paragraphs(text)[source]

Count paragraphs.

Parameters:

text (str) – Input text

Return type:

int

Returns:

Paragraph count

Examples

>>> count_paragraphs("Para 1\n\nPara 2\n\nPara 3")
3
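Paragraph counting is commonly done by splitting on blank lines. The sketch below assumes that a paragraph is any non-empty block separated by at least one blank line; the function name is hypothetical.

```python
import re


def count_paragraphs_sketch(text: str) -> int:
    # A blank line (possibly containing spaces) separates paragraphs.
    blocks = re.split(r"\n\s*\n", text.strip())
    return sum(1 for block in blocks if block.strip())
```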
kerb.preprocessing.expand_contractions(text)[source]

Expand English contractions.

Parameters:

text (str) – Input text with contractions

Return type:

str

Returns:

Text with expanded contractions

Examples

>>> expand_contractions("I'm doesn't can't")
'I am does not cannot'
kerb.preprocessing.standardize_numbers(text)[source]

Convert number words to digits.

Parameters:

text (str) – Input text

Return type:

str

Returns:

Text with standardized numbers

Examples

>>> standardize_numbers("I have three apples and five oranges")
'I have 3 apples and 5 oranges'
kerb.preprocessing.standardize_dates(text)[source]

Normalize date formats.

Parameters:

text (str) – Input text with dates

Return type:

str

Returns:

Text with standardized dates (YYYY-MM-DD)

Examples

>>> standardize_dates("Meeting on 12/25/2024")
'Meeting on 2024-12-25'
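The conversion in the example can be sketched with a regex substitution. The sketch assumes US-style MM/DD/YYYY input, which is what the example implies but is not stated explicitly; it does not validate month or day ranges, and the names are hypothetical.

```python
import re

# Assumes month/day/year order (US style); other orders are ambiguous.
US_DATE_RE = re.compile(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b")


def standardize_dates_sketch(text: str) -> str:
    def to_iso(match: re.Match) -> str:
        month, day, year = match.groups()
        return f"{year}-{int(month):02d}-{int(day):02d}"

    return US_DATE_RE.sub(to_iso, text)
```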
kerb.preprocessing.extract_entities(text, entity_type=None)[source]

Extract named entities (basic).

Parameters:
  • text (str) – Input text

  • entity_type (Optional[str]) – Type of entities to extract (None for all)

Return type:

List[str]

Returns:

List of extracted entities

Examples

>>> extract_entities("Apple Inc. is in California")
['Apple Inc.', 'California']
kerb.preprocessing.segment_sentences(text)[source]

Sentence segmentation.

Parameters:

text (str) – Input text

Return type:

List[str]

Returns:

List of sentences

Examples

>>> segment_sentences("Hello world. How are you?")
['Hello world.', 'How are you?']
kerb.preprocessing.segment_words(text)[source]

Word segmentation (tokenization).

Parameters:

text (str) – Input text

Return type:

List[str]

Returns:

List of words

Examples

>>> segment_words("Hello, world!")
['Hello', 'world']
kerb.preprocessing.preprocess_batch(texts, operations=None, **kwargs)[source]

Apply preprocessing pipeline to batch.

Parameters:
  • texts (List[str]) – List of texts to preprocess

  • operations (Optional[List[Callable]]) – List of preprocessing functions

  • **kwargs – Arguments to pass to operations

Return type:

List[str]

Returns:

List of preprocessed texts

Examples

>>> preprocess_batch(["  HELLO  ", "  WORLD  "], [str.lower, str.strip])
['hello', 'world']
kerb.preprocessing.preprocess_pipeline(*operations)[source]

Create custom preprocessing pipeline.

Parameters:

*operations (Callable) – Preprocessing functions to chain

Return type:

Callable

Returns:

Pipeline function

Examples

>>> pipeline = preprocess_pipeline(str.lower, str.strip)
>>> pipeline("  HELLO  ")
'hello'
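Chaining operations into a pipeline is function composition applied left to right. The sketch below shows one way to build it with functools.reduce and how a batch helper can reuse it; both function names are hypothetical and keyword-argument forwarding is left out.

```python
from functools import reduce
from typing import Callable, List


def pipeline_sketch(*operations: Callable[[str], str]) -> Callable[[str], str]:
    def run(text: str) -> str:
        # Feed the output of each operation into the next, in the order given.
        return reduce(lambda acc, op: op(acc), operations, text)

    return run


def batch_sketch(texts: List[str], operations: List[Callable[[str], str]]) -> List[str]:
    run = pipeline_sketch(*operations)
    return [run(text) for text in texts]
```

Operations run in the order given, so placing a stripping step before or after a step that adds or removes characters can change the result.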
kerb.preprocessing.truncate_text(text, max_length, strategy='end', suffix='...')[source]

Truncate text intelligently.

Parameters:
  • text (str) – Input text

  • max_length (int) – Maximum length

  • strategy (Union[TruncateStrategy, str]) – Truncation strategy (TruncateStrategy enum or string: “end”, “middle”, “start”, “smart”)

  • suffix (str) – Suffix to add when truncated

Return type:

str

Returns:

Truncated text

Examples

>>> truncate_text("Hello world", max_length=8)
'Hello...'
>>> truncate_text("Hello world", max_length=8, strategy=TruncateStrategy.MIDDLE)
'He...ld'
>>> truncate_text("This is a sentence. And another one.", max_length=20, strategy="smart")
'This is a sentence....'
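The character-budget arithmetic behind the "end", "start", and "middle" strategies can be sketched as follows. This is an illustration that happens to reproduce the first two examples above; the even split used for "middle" is an assumption, the function name is hypothetical, and the "smart" strategy is not modeled.

```python
def truncate_sketch(text: str, max_length: int, strategy: str = "end", suffix: str = "...") -> str:
    if len(text) <= max_length:
        return text  # nothing to do
    keep = max(max_length - len(suffix), 0)  # characters of original text to keep
    if strategy == "end":
        return text[:keep] + suffix
    if strategy == "start":
        return suffix + text[len(text) - keep:]
    if strategy == "middle":
        half = keep // 2  # same number of characters from each side (assumption)
        return text[:half] + suffix + text[len(text) - half:]
    raise ValueError(f"unsupported strategy: {strategy!r}")
```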
kerb.preprocessing.split_long_text(text, max_length, overlap=0, preserve_words=True)[source]

Split text exceeding length limit.

Parameters:
  • text (str) – Input text

  • max_length (int) – Maximum length per chunk

  • overlap (int) – Overlap between chunks

  • preserve_words (bool) – Don’t split words

Return type:

List[str]

Returns:

List of text chunks

Examples

>>> split_long_text("Hello world test", max_length=8)
['Hello', 'world', 'test']
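Word-preserving splitting is typically a greedy packing loop: add words to the current chunk until the next one would exceed the limit. The sketch below shows that loop without overlap support; the function name is hypothetical and kerb's handling of overlap and over-long words may differ.

```python
from typing import List


def split_words_sketch(text: str, max_length: int) -> List[str]:
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        if len(candidate) <= max_length:
            current = candidate
        else:
            if current:
                chunks.append(current)
            # A single word longer than max_length becomes its own oversized chunk.
            current = word
    if current:
        chunks.append(current)
    return chunks
```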
