Preprocessing Module

Text preprocessing utilities for LLM applications.

This module provides comprehensive text preprocessing tools for cleaning, normalizing, and preparing text data for LLM processing.

Usage Examples:

>>> # Common usage - normalize text
>>> from kerb.preprocessing import normalize_text
>>> clean = normalize_text("  Hello   World!  ", lowercase=True)

>>> # Text operations
>>> from kerb.preprocessing.text import (
...     normalize_whitespace,
...     remove_special_chars,
...     truncate_text
... )

>>> # Language detection
>>> from kerb.preprocessing.language import detect_language
>>> result = detect_language("Bonjour le monde")

>>> # Content filtering
>>> from kerb.preprocessing.filtering import filter_by_length
>>> filtered = filter_by_length(["hi", "hello world"], min_length=5)

>>> # Batch processing
>>> from kerb.preprocessing.batch import preprocess_batch
>>> processed = preprocess_batch(["  text1  ", "  text2  "])

Organization:

Top-level: Core functions and most common operations
Submodules: Specialized implementations organized by functionality
- text: Text normalization, cleaning, case handling
- language: Language detection and filtering
- deduplication: Text deduplication operations
- filtering: Content filtering and quality control
- analysis: Content analysis and classification
- transforms: Advanced text transformations
- batch: Batch processing utilities
- enums: Enumeration types
- types: Data classes and type definitions

class kerb.preprocessing.NormalizationLevel(*values)[source]

Bases: Enum

Text normalization intensity.

MINIMAL = 'minimal'

STANDARD = 'standard'

AGGRESSIVE = 'aggressive'

class kerb.preprocessing.LanguageDetectionMode(*values)[source]

Bases: Enum

Language detection strategy.

FAST = 'fast'

ACCURATE = 'accurate'

SIMPLE = 'simple'

class kerb.preprocessing.DeduplicationMode(*values)[source]

Bases: Enum

Deduplication strategy.

EXACT = 'exact'

FUZZY = 'fuzzy'

SEMANTIC = 'semantic'

class kerb.preprocessing.ContentType(*values)[source]

Bases: Enum

Text content type classification.

PLAIN_TEXT = 'plain_text'

CODE = 'code'

MARKDOWN = 'markdown'

HTML = 'html'

JSON = 'json'

MIXED = 'mixed'

UNKNOWN = 'unknown'

class kerb.preprocessing.LanguageResult(language, confidence, alternatives=<factory>)[source]

Bases: object

Language detection result.

language: str

confidence: float

alternatives: List[Tuple[str, float]]

__init__(language, confidence, alternatives=<factory>)
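
The `alternatives=<factory>` default in the signature suggests a dataclass field built with a default factory. The sketch below shows that pattern with a hypothetical stand-in class, `LanguageResultSketch`; the assumption that the factory is `list` is not confirmed by this page.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class LanguageResultSketch:
    """Hypothetical stand-in mirroring the documented fields."""
    language: str
    confidence: float
    # Assumed: each instance gets its own empty list of (code, score) pairs.
    alternatives: List[Tuple[str, float]] = field(default_factory=list)


first = LanguageResultSketch("fr", 0.92)
second = LanguageResultSketch("en", 0.75, alternatives=[("de", 0.10)])
```

A default factory avoids the shared-mutable-default pitfall: `first.alternatives` and any other instance's list are separate objects.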

class kerb.preprocessing.QualityMetrics(length, word_count, avg_word_length, sentence_count, avg_sentence_length, special_char_ratio, digit_ratio, uppercase_ratio, readability_score)[source]

Bases: object

Text quality metrics.

length: int

word_count: int

avg_word_length: float

sentence_count: int

avg_sentence_length: float

special_char_ratio: float

digit_ratio: float

uppercase_ratio: float

readability_score: float

__init__(length, word_count, avg_word_length, sentence_count, avg_sentence_length, special_char_ratio, digit_ratio, uppercase_ratio, readability_score)

class kerb.preprocessing.NormalizationConfig(level=NormalizationLevel.STANDARD, lowercase=False, remove_urls=True, remove_emails=True, remove_extra_spaces=True)[source]

Bases: object

Configuration for text normalization operations.

level: Normalization intensity level

lowercase: Convert to lowercase

remove_urls: Remove URLs from text

remove_emails: Remove email addresses

remove_extra_spaces: Remove redundant whitespace

level: NormalizationLevel = 'standard'

lowercase: bool = False

remove_urls: bool = True

remove_emails: bool = True

remove_extra_spaces: bool = True

__init__(level=NormalizationLevel.STANDARD, lowercase=False, remove_urls=True, remove_emails=True, remove_extra_spaces=True)

kerb.preprocessing.normalize_text(text, level=NormalizationLevel.STANDARD, lowercase=False, remove_urls=True, remove_emails=True, remove_extra_spaces=True, config=None)[source]

Comprehensive text normalization with configurable intensity.

Parameters:

text (str) – Input text to normalize
level (NormalizationLevel) – Normalization intensity level (ignored if config is provided)
lowercase (bool) – Convert to lowercase (ignored if config is provided)
remove_urls (bool) – Remove URLs from text (ignored if config is provided)
remove_emails (bool) – Remove email addresses (ignored if config is provided)
remove_extra_spaces (bool) – Remove redundant whitespace (ignored if config is provided)
config (Optional[NormalizationConfig]) – NormalizationConfig object with all parameters (recommended)

Return type:

str

Returns:

Normalized text

Examples

>>> # Using config object (recommended)
>>> from kerb.preprocessing import NormalizationConfig, NormalizationLevel
>>> config = NormalizationConfig(
...     level=NormalizationLevel.STANDARD,
...     lowercase=True,
...     remove_urls=True
... )
>>> normalized = normalize_text("Check this: https://example.com", config=config)

>>> # Using individual parameters (backward compatible)
>>> normalized = normalize_text("HELLO WORLD", lowercase=True)
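
The precedence rule described above (individual flags are ignored once `config` is passed) can be sketched without the library. `SketchConfig` and `normalize_sketch` are hypothetical names covering only two of the documented options; this is not kerb's implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SketchConfig:
    """Hypothetical two-field stand-in for NormalizationConfig."""
    lowercase: bool = False
    remove_extra_spaces: bool = True


def normalize_sketch(text: str, lowercase: bool = False,
                     remove_extra_spaces: bool = True,
                     config: Optional[SketchConfig] = None) -> str:
    # A supplied config wins; the keyword flags are only used to build one.
    if config is None:
        config = SketchConfig(lowercase=lowercase,
                              remove_extra_spaces=remove_extra_spaces)
    if config.remove_extra_spaces:
        text = " ".join(text.split())
    if config.lowercase:
        text = text.lower()
    return text
```

With this shape, `normalize_sketch("HELLO", lowercase=True, config=SketchConfig())` leaves the text uppercase, because the config's `lowercase=False` takes precedence.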

kerb.preprocessing.normalize_whitespace(text)[source]

Normalize whitespace and newlines.

Parameters: text (str) – Input text
Return type: str
Returns: Text with normalized whitespace

Examples

>>> normalize_whitespace("Hello   world\n\n\ntest")
'Hello world\n\ntest'
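
One way to get the behaviour in the example above is two regular-expression passes. This is a sketch under the assumption that horizontal whitespace collapses to one space and runs of newlines collapse to a single blank line; `normalize_whitespace_sketch` is a hypothetical name.

```python
import re


def normalize_whitespace_sketch(text: str) -> str:
    # Collapse runs of spaces and tabs into a single space.
    text = re.sub(r"[ \t]+", " ", text)
    # Collapse three or more consecutive newlines into one blank line.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```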

kerb.preprocessing.normalize_unicode(text, form='NFKC')[source]

Normalize unicode characters.

Parameters:

text (str) – Input text
form (str) – Unicode normalization form (NFC, NFD, NFKC, NFKD)

Return type:

str

Returns:

Unicode-normalized text

Examples

>>> normalize_unicode("café")  # Normalizes different accent representations
'café'
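
The example above looks like a no-op because both strings render identically. The standard-library snippet below makes the difference visible by spelling out the code points; it uses `unicodedata.normalize` directly rather than the kerb wrapper.

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as a single code point
decomposed = "cafe\u0301"   # 'e' followed by a combining acute accent

# The two strings look the same but compare unequal until normalized.
same_before = composed == decomposed
nfkc = unicodedata.normalize("NFKC", decomposed)
nfkd = unicodedata.normalize("NFKD", composed)

# The compatibility forms (NFKC/NFKD) also fold characters such as the 'fi' ligature.
ligature = unicodedata.normalize("NFKC", "\ufb01")
```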

kerb.preprocessing.normalize_quotes(text)[source]

Convert smart quotes to standard quotes.

Parameters: text (str) – Input text
Return type: str
Returns: Text with standard quotes

Examples

>>> normalize_quotes("“Hello” and ‘world’")
'"Hello" and \'world\''

kerb.preprocessing.normalize_dashes(text)[source]

Convert various dashes to standard forms.

Parameters: text (str) – Input text
Return type: str
Returns: Text with standard dashes

Examples

>>> normalize_dashes("em—dash and en–dash")
'em-dash and en-dash'

kerb.preprocessing.remove_accents(text)[source]

Remove diacritical marks from text.

Parameters: text (str) – Input text
Return type: str
Returns: Text without accents

Examples

>>> remove_accents("café résumé")
'cafe resume'
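
A common standard-library technique for stripping diacritics is to decompose the text and drop the combining marks. The sketch below assumes that approach; whether kerb does the same is not stated here, and `remove_accents_sketch` is a hypothetical name.

```python
import unicodedata


def remove_accents_sketch(text: str) -> str:
    # NFD splits each accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize("NFD", text)
    # Keep only characters that are not combining marks.
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
```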

kerb.preprocessing.clean_html(text, keep_newlines=True)[source]

Remove HTML tags and entities.

Parameters:

text (str) – Input text with HTML
keep_newlines (bool) – Keep newlines from <br> and <p> tags

Return type:

str

Returns:

Plain text without HTML

Examples

>>> clean_html("<p>Hello <b>world</b></p>")
'Hello world'

kerb.preprocessing.clean_markdown(text, keep_structure=False)[source]

Remove or normalize markdown formatting.

Parameters:

text (str) – Input markdown text
keep_structure (bool) – Keep basic structure (headings, lists)

Return type:

str

Returns:

Plain or lightly formatted text

Examples

>>> clean_markdown("# Hello **world**")
'Hello world'

kerb.preprocessing.remove_urls(text, replacement='')[source]

Remove or replace URLs.

Parameters:

text (str) – Input text
replacement (str) – String to replace URLs with

Return type:

str

Returns:

Text without URLs

Examples

>>> remove_urls("Check https://example.com for info")
'Check  for info'

kerb.preprocessing.remove_emails(text, replacement='')[source]

Remove or replace email addresses.

Parameters:

text (str) – Input text
replacement (str) – String to replace emails with

Return type:

str

Returns:

Text without email addresses

Examples

>>> remove_emails("Contact me@example.com")
'Contact '

kerb.preprocessing.remove_phone_numbers(text, replacement='')[source]

Remove or replace phone numbers.

Parameters:

text (str) – Input text
replacement (str) – String to replace phone numbers with

Return type:

str

Returns:

Text without phone numbers

Examples

>>> remove_phone_numbers("Call 555-123-4567")
'Call '

kerb.preprocessing.remove_special_chars(text, keep_basic=True)[source]

Remove special characters with options.

Parameters:

text (str) – Input text
keep_basic (bool) – Keep basic punctuation (.,!?;:)

Return type:

str

Returns:

Text with special characters removed

Examples

>>> remove_special_chars("Hello@#$world!")
'Hello world!'

kerb.preprocessing.remove_extra_whitespace(text)[source]

Remove redundant whitespace.

Parameters: text (str) – Input text
Return type: str
Returns: Text with single spaces only

Examples

>>> remove_extra_whitespace("Hello    world")
'Hello world'

kerb.preprocessing.remove_control_chars(text)[source]

Remove control characters.

Parameters: text (str) – Input text
Return type: str
Returns: Text without control characters

Examples

>>> remove_control_chars("Hello\x00world\x01")
'Helloworld'
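
Control characters can be identified by their Unicode category (`Cc`). The sketch below filters on that category and, as an assumption of this example only, keeps newlines and tabs, which are themselves in `Cc`. `remove_control_chars_sketch` is a hypothetical name.

```python
import unicodedata


def remove_control_chars_sketch(text: str) -> str:
    # Assumed policy: newline and tab are control characters worth keeping.
    keep = {"\n", "\t"}
    return "".join(
        ch for ch in text
        if ch in keep or unicodedata.category(ch) != "Cc"
    )
```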

kerb.preprocessing.strip_punctuation(text, keep_internal=True)[source]

Remove punctuation with options.

Parameters:

text (str) – Input text
keep_internal (bool) – Keep punctuation within words (e.g., apostrophes)

Return type:

str

Returns:

Text with punctuation removed

Examples

>>> strip_punctuation("Hello, world!")
'Hello world'

kerb.preprocessing.normalize_case(text, mode='sentence')[source]

Smart case normalization.

Parameters:

text (str) – Input text
mode (Union[CaseMode, str]) – Case mode (CaseMode enum or string: “lower”, “upper”, “title”, “sentence”)

Return type:

str

Returns:

Case-normalized text

Examples

>>> normalize_case("HELLO WORLD", mode=CaseMode.SENTENCE)
'Hello world'

>>> normalize_case("hello world", mode="title")
'Hello World'

kerb.preprocessing.to_title_case(text)[source]

Convert to title case.

Parameters: text (str) – Input text
Return type: str
Returns: Title-cased text

Examples

>>> to_title_case("hello world from python")
'Hello World From Python'

kerb.preprocessing.to_sentence_case(text)[source]

Convert to sentence case.

Parameters: text (str) – Input text
Return type: str
Returns: Sentence-cased text

Examples

>>> to_sentence_case("hello world. this is a test.")
'Hello world. This is a test.'

kerb.preprocessing.preserve_acronyms(text, acronyms=None)[source]

Smart case conversion preserving acronyms.

Parameters:

text (str) – Input text
acronyms (Optional[List[str]]) – List of acronyms to preserve (default: common ones)

Return type:

str

Returns:

Text with preserved acronyms

Examples

>>> preserve_acronyms("nasa and fbi are agencies", ["NASA", "FBI"])
'NASA and FBI are agencies'

kerb.preprocessing.detect_language(text, mode=LanguageDetectionMode.FAST)[source]

Detect text language with multiple strategies.

Uses langdetect library if available, otherwise falls back to heuristic-based detection supporting 50+ languages.

Parameters:

text (str) – Input text
mode (LanguageDetectionMode) – Detection mode: FAST (quick heuristic-based detection), ACCURATE (use the langdetect library if available), or SIMPLE (basic character-range detection)

Return type:

LanguageResult

Returns:

LanguageResult with detected language and confidence

Examples

>>> result = detect_language("Hello world")
>>> result.language
'en'
>>> result = detect_language("Bonjour le monde")
>>> result.language
'fr'
>>> result = detect_language("こんにちは世界")
>>> result.language
'ja'

kerb.preprocessing.detect_language_batch(texts, mode=LanguageDetectionMode.FAST)[source]

Batch language detection.

Parameters:

texts (List[str]) – List of input texts
mode (LanguageDetectionMode) – Detection mode

Return type:

List[LanguageResult]

Returns:

List of LanguageResult objects

Examples

>>> results = detect_language_batch(["Hello", "Bonjour"])
>>> [r.language for r in results]
['en', 'fr']

kerb.preprocessing.is_language(text, language, threshold=0.5)[source]

Check whether text is in a specific language.

Parameters:

text (str) – Input text
language (str) – Language code to check (e.g., ‘en’, ‘fr’)
threshold (float) – Confidence threshold

Return type:

bool

Returns:

True if the text is detected as the specified language

Examples

>>> is_language("Hello world", "en")
True

kerb.preprocessing.filter_by_language(texts, language, threshold=0.5)[source]

Filter texts by language.

Parameters:

texts (List[str]) – List of texts
language (str) – Language code to filter for
threshold (float) – Confidence threshold

Return type:

List[str]

Returns:

List of texts in specified language

Examples

>>> filter_by_language(["Hello", "Bonjour"], "en")
['Hello']

kerb.preprocessing.get_supported_languages()[source]

Get list of supported languages.

Returns heuristic-supported languages. With langdetect library installed, 55+ languages are supported. Without it, 20+ languages are supported through character-based and pattern detection.

Return type: List[str]
Returns: List of language codes

Examples

>>> langs = get_supported_languages()
>>> "en" in langs
True
>>> len(langs) >= 20
True

kerb.preprocessing.deduplicate_exact(texts, keep_order=True)[source]

Remove exact duplicates.

Parameters:

texts (List[str]) – List of texts
keep_order (bool) – Preserve original order

Return type:

List[str]

Returns:

List with duplicates removed

Examples

>>> deduplicate_exact(["a", "b", "a", "c"])
['a', 'b', 'c']
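
Order-preserving exact deduplication is usually a one-liner in Python, because `dict` keeps insertion order. The sketch below shows that idiom; the sorted fallback for `keep_order=False` is an assumption made here for determinism, not documented behaviour.

```python
from typing import List


def deduplicate_exact_sketch(texts: List[str], keep_order: bool = True) -> List[str]:
    if keep_order:
        # dict keys are unique and remember first-seen order.
        return list(dict.fromkeys(texts))
    # Assumed fallback: order is not meaningful, so sort for stable output.
    return sorted(set(texts))
```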

kerb.preprocessing.deduplicate_fuzzy(texts, similarity_threshold=0.9, keep_order=True)[source]

Remove fuzzy/near duplicates.

Parameters:

texts (List[str]) – List of texts
similarity_threshold (float) – Similarity threshold (0-1)
keep_order (bool) – Preserve original order

Return type:

List[str]

Returns:

List with fuzzy duplicates removed

Examples

>>> deduplicate_fuzzy(["hello world", "hello  world", "goodbye"])
['hello world', 'goodbye']
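
The similarity measure behind `deduplicate_fuzzy` is not specified on this page. As an illustration of how `similarity_threshold` behaves, the sketch below swaps in `difflib.SequenceMatcher` from the standard library and keeps a text only if it is not too close to anything already kept. `deduplicate_fuzzy_sketch` is a hypothetical name.

```python
from difflib import SequenceMatcher
from typing import List


def deduplicate_fuzzy_sketch(texts: List[str],
                             similarity_threshold: float = 0.9) -> List[str]:
    kept: List[str] = []
    for text in texts:
        # Keep the text only if every earlier survivor is below the threshold.
        if all(SequenceMatcher(None, text, seen).ratio() < similarity_threshold
               for seen in kept):
            kept.append(text)
    return kept
```

The comparison is quadratic in the number of kept texts, which is fine for small batches but worth noting before applying it to large corpora.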

kerb.preprocessing.deduplicate_semantic(texts, similarity_threshold=0.85, embed_fn=None)[source]

Remove semantically similar texts.

Parameters:

texts (List[str]) – List of texts
similarity_threshold (float) – Semantic similarity threshold (0-1)
embed_fn (Optional[Callable]) – Optional embedding function (uses simple fallback if None)

Return type:

List[str]

Returns:

List with semantic duplicates removed

Examples

>>> deduplicate_semantic(["hello", "hi", "goodbye"])
['hello', 'goodbye']
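
To show how `embed_fn` and `similarity_threshold` could interact, here is a sketch that compares embeddings by cosine similarity. The vectors are hand-written toy values standing in for a real embedding model, and the function names are hypothetical; kerb's fallback embedding is not reproduced here.

```python
import math
from typing import Callable, List, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def deduplicate_semantic_sketch(texts: List[str],
                                embed_fn: Callable[[str], Sequence[float]],
                                similarity_threshold: float = 0.85) -> List[str]:
    kept, kept_vectors = [], []
    for text in texts:
        vector = embed_fn(text)
        # Drop the text if it is too similar to any text already kept.
        if all(cosine(vector, seen) < similarity_threshold for seen in kept_vectors):
            kept.append(text)
            kept_vectors.append(vector)
    return kept


# Toy 2-D "embeddings": greetings point one way, farewells another.
TOY_VECTORS = {"hello": [1.0, 0.1], "hi": [0.9, 0.2], "goodbye": [0.1, 1.0]}
result = deduplicate_semantic_sketch(["hello", "hi", "goodbye"], TOY_VECTORS.__getitem__)
```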

kerb.preprocessing.deduplicate_lines(text, keep_order=True)[source]

Remove duplicate lines.

Parameters:

text (str) – Input text
keep_order (bool) – Preserve line order

Return type:

str

Returns:

Text with duplicate lines removed

Examples

>>> deduplicate_lines("line1\nline2\nline1\nline3")
'line1\nline2\nline3'

kerb.preprocessing.deduplicate_sentences(text, keep_order=True)[source]

Remove duplicate sentences.

Parameters:

text (str) – Input text
keep_order (bool) – Preserve sentence order

Return type:

str

Returns:

Text with duplicate sentences removed

Examples

>>> deduplicate_sentences("Hello. World. Hello.")
'Hello. World.'

kerb.preprocessing.find_duplicates(texts, mode=DeduplicationMode.EXACT)[source]

Find duplicate texts without removing them.

Parameters:

texts (List[str]) – List of texts
mode (DeduplicationMode) – Deduplication mode

Return type:

List[List[int]]

Returns:

List of index groups representing duplicates

Examples

>>> find_duplicates(["a", "b", "a", "c", "b"])
[[0, 2], [1, 4]]
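
For the EXACT mode, the index groups in the example above can be produced by bucketing positions per text. This is a sketch of that idea only; `find_duplicates_sketch` is a hypothetical name and the fuzzy and semantic modes are not covered.

```python
from collections import defaultdict
from typing import List


def find_duplicates_sketch(texts: List[str]) -> List[List[int]]:
    positions = defaultdict(list)
    for index, text in enumerate(texts):
        positions[text].append(index)
    # Only texts seen more than once form a duplicate group.
    return [group for group in positions.values() if len(group) > 1]
```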

kerb.preprocessing.compute_text_hash(text, algorithm='md5')[source]

Compute stable text hash for deduplication.

Parameters:

text (str) – Input text
algorithm (str) – Hash algorithm (md5, sha1, sha256)

Return type:

str

Returns:

Hex hash string

Examples

>>> hash1 = compute_text_hash("hello")
>>> hash2 = compute_text_hash("hello")
>>> hash1 == hash2
True
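
A stable text hash with the three documented algorithms can be built on `hashlib`. The sketch assumes UTF-8 encoding before hashing, which this page does not confirm; `compute_text_hash_sketch` is a hypothetical name.

```python
import hashlib

SUPPORTED = {"md5", "sha1", "sha256"}


def compute_text_hash_sketch(text: str, algorithm: str = "md5") -> str:
    if algorithm not in SUPPORTED:
        raise ValueError(f"unsupported algorithm: {algorithm}")
    # Assumed: text is encoded as UTF-8 so equal strings hash identically.
    return hashlib.new(algorithm, text.encode("utf-8")).hexdigest()
```

MD5 is adequate for spotting identical texts but is not collision-resistant; `sha256` is the safer choice when inputs may be adversarial.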

kerb.preprocessing.filter_by_length(texts, min_length=None, max_length=None, unit='chars')[source]

Filter texts by length constraints.

Parameters:

texts (List[str]) – List of texts
min_length (Optional[int]) – Minimum length
max_length (Optional[int]) – Maximum length
unit (str) – Length unit - “chars”, “words”, “sentences”

Return type:

List[str]

Returns:

Filtered list of texts

Examples

>>> filter_by_length(["hi", "hello world", ""], min_length=3)
['hello world']
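
The `unit` parameter changes what "length" means. The sketch below illustrates one plausible reading: characters via `len`, words via whitespace splitting, sentences via splitting on terminal punctuation. The counting rules and function names are assumptions of this example.

```python
import re
from typing import List, Optional


def measure(text: str, unit: str = "chars") -> int:
    if unit == "chars":
        return len(text)
    if unit == "words":
        return len(text.split())
    if unit == "sentences":
        # Assumed rule: sentences end at '.', '!' or '?'.
        return len([s for s in re.split(r"[.!?]+", text) if s.strip()])
    raise ValueError(f"unknown unit: {unit}")


def filter_by_length_sketch(texts: List[str], min_length: Optional[int] = None,
                            max_length: Optional[int] = None,
                            unit: str = "chars") -> List[str]:
    kept = []
    for text in texts:
        size = measure(text, unit)
        if min_length is not None and size < min_length:
            continue
        if max_length is not None and size > max_length:
            continue
        kept.append(text)
    return kept
```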

kerb.preprocessing.filter_by_pattern(texts, pattern, keep_matches=True, flags=0)[source]

Filter texts by regex pattern.

Parameters:

texts (List[str]) – List of texts
pattern (str) – Regex pattern
keep_matches (bool) – Keep matching texts (False to keep non-matching)
flags (int) – Regex flags

Return type:

List[str]

Returns:

Filtered list of texts

Examples

>>> filter_by_pattern(["hello", "world", "hi"], r"^h", keep_matches=True)
['hello', 'hi']
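
The `keep_matches` and `flags` parameters map naturally onto `re.compile` and `re.search`. A minimal sketch, assuming search (not full-match) semantics; `filter_by_pattern_sketch` is a hypothetical name.

```python
import re
from typing import List


def filter_by_pattern_sketch(texts: List[str], pattern: str,
                             keep_matches: bool = True, flags: int = 0) -> List[str]:
    regex = re.compile(pattern, flags)
    # Keep a text when "it matched" agrees with "we want matches".
    return [t for t in texts if bool(regex.search(t)) == keep_matches]
```

Passing `keep_matches=False` inverts the filter, and `flags=re.IGNORECASE` makes `r"^h"` match `"Hello"` as well.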

kerb.preprocessing.filter_profanity(text, replacement='***')[source]

Remove or mask profane content.

Parameters:

text (str) – Input text
replacement (str) – Replacement string for profanity

Return type:

str

Returns:

Filtered text

Examples

>>> filter_profanity("This is clean text")
'This is clean text'

kerb.preprocessing.filter_pii(text, replacement='[REDACTED]')[source]

Remove or mask personally identifiable information.

Parameters:

text (str) – Input text
replacement (str) – Replacement string for PII

Return type:

str

Returns:

Text with PII removed

Examples

>>> filter_pii("Email me@example.com or call 555-1234")
'Email [REDACTED] or call [REDACTED]'

kerb.preprocessing.detect_spam(text, threshold=0.5)[source]

Detect spam or low-quality content.

Parameters:

text (str) – Input text
threshold (float) – Spam score threshold (0-1)

Return type:

bool

Returns:

True if text is likely spam

Examples

>>> detect_spam("BUY NOW!!! CLICK HERE!!!")
True

kerb.preprocessing.filter_by_quality(texts, min_score=0.5)[source]

Filter by quality metrics.

Parameters:

texts (List[str]) – List of texts
min_score (float) – Minimum quality score (0-1)

Return type:

List[str]

Returns:

List of high-quality texts

Examples

>>> filter_by_quality(["Good text here.", "x", "Another good one."])
['Good text here.', 'Another good one.']

kerb.preprocessing.filter_non_ascii(text, replacement='', keep_extended=True)[source]

Filter or replace non-ASCII characters.

Parameters:

text (str) – Input text
replacement (str) – Replacement for non-ASCII chars
keep_extended (bool) – Keep extended ASCII (128-255)

Return type:

str

Returns:

ASCII-filtered text

Examples

>>> filter_non_ascii("Hello 世界")
'Hello '

kerb.preprocessing.classify_content_type(text)[source]

Classify text content type.

Parameters: text (str) – Input text
Return type: ContentType
Returns: ContentType enum value

Examples

>>> classify_content_type("def foo():\n    pass")
<ContentType.CODE: 'code'>

kerb.preprocessing.detect_code(text)[source]

Detect if text contains code.

Parameters: text (str) – Input text
Return type: bool
Returns: True if text appears to be code

Examples

>>> detect_code("def foo(): return True")
True

kerb.preprocessing.detect_sentiment(text)[source]

Basic sentiment detection.

Parameters: text (str) – Input text
Returns: “positive”, “negative”, or “neutral”
Return type: str

Examples

>>> detect_sentiment("I love this!")
'positive'

kerb.preprocessing.measure_readability(text)[source]

Calculate readability score (0-1, higher is more readable).

Parameters: text (str) – Input text
Return type: float
Returns: Readability score

Examples

>>> score = measure_readability("This is simple text.")
>>> score > 0.5
True

kerb.preprocessing.count_words(text)[source]

Smart word counting.

Parameters: text (str) – Input text
Return type: int
Returns: Word count

Examples

>>> count_words("Hello world, this is a test")
6

kerb.preprocessing.count_sentences(text)[source]

Smart sentence counting.

Parameters: text (str) – Input text
Return type: int
Returns: Sentence count

Examples

>>> count_sentences("Hello. World! How are you?")
3

kerb.preprocessing.count_paragraphs(text)[source]

Count paragraphs.

Parameters: text (str) – Input text
Return type: int
Returns: Paragraph count

Examples

>>> count_paragraphs("Para 1\n\nPara 2\n\nPara 3")
3

kerb.preprocessing.expand_contractions(text)[source]

Expand English contractions.

Parameters: text (str) – Input text with contractions
Return type: str
Returns: Text with expanded contractions

Examples

>>> expand_contractions("I'm doesn't can't")
'I am does not cannot'

kerb.preprocessing.standardize_numbers(text)[source]

Convert number words to digits.

Parameters: text (str) – Input text
Return type: str
Returns: Text with standardized numbers

Examples

>>> standardize_numbers("I have three apples and five oranges")
'I have 3 apples and 5 oranges'

kerb.preprocessing.standardize_dates(text)[source]

Normalize date formats.

Parameters: text (str) – Input text with dates
Return type: str
Returns: Text with standardized dates (YYYY-MM-DD)

Examples

>>> standardize_dates("Meeting on 12/25/2024")
'Meeting on 2024-12-25'
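
The example above implies that slash-separated dates are read month-first. The sketch below handles only that one format (`MM/DD/YYYY`) with a regular expression; the month-first assumption and the name `standardize_dates_sketch` belong to this example, and other formats the library may support are not covered.

```python
import re


def standardize_dates_sketch(text: str) -> str:
    def to_iso(match: "re.Match[str]") -> str:
        month, day, year = match.groups()
        # Zero-pad month and day to produce YYYY-MM-DD.
        return f"{year}-{int(month):02d}-{int(day):02d}"

    # Assumed input format: month/day/four-digit year.
    return re.sub(r"\b(\d{1,2})/(\d{1,2})/(\d{4})\b", to_iso, text)
```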

kerb.preprocessing.extract_entities(text, entity_type=None)[source]

Extract named entities (basic).

Parameters:

text (str) – Input text
entity_type (Optional[str]) – Type of entities to extract (None for all)

Return type:

List[str]

Returns:

List of extracted entities

Examples

>>> extract_entities("Apple Inc. is in California")
['Apple Inc.', 'California']

kerb.preprocessing.segment_sentences(text)[source]

Sentence segmentation.

Parameters: text (str) – Input text
Return type: List[str]
Returns: List of sentences

Examples

>>> segment_sentences("Hello world. How are you?")
['Hello world.', 'How are you?']

kerb.preprocessing.segment_words(text)[source]

Word segmentation (tokenization).

Parameters: text (str) – Input text
Return type: List[str]
Returns: List of words

Examples

>>> segment_words("Hello, world!")
['Hello', 'world']

kerb.preprocessing.preprocess_batch(texts, operations=None, **kwargs)[source]

Apply preprocessing pipeline to batch.

Parameters:

texts (List[str]) – List of texts to preprocess
operations (Optional[List[Callable]]) – List of preprocessing functions
**kwargs – Arguments to pass to operations

Return type:

List[str]

Returns:

List of preprocessed texts

Examples

>>> preprocess_batch(["  HELLO  ", "  WORLD  "], [str.lower, str.strip])
['hello', 'world']

kerb.preprocessing.preprocess_pipeline(*operations)[source]

Create custom preprocessing pipeline.

Parameters: *operations (Callable) – Preprocessing functions to chain
Return type: Callable
Returns: Pipeline function

Examples

>>> pipeline = preprocess_pipeline(str.lower, str.strip)
>>> pipeline("  HELLO  ")
'hello'
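
Chaining operations left to right is a small function-composition exercise. The sketch below builds a pipeline with `functools.reduce` and reuses it for batch processing; both function names are hypothetical, and the handling of `**kwargs` in `preprocess_batch` is left out.

```python
from functools import reduce
from typing import Callable, List, Optional


def preprocess_pipeline_sketch(*operations: Callable[[str], str]) -> Callable[[str], str]:
    def pipeline(text: str) -> str:
        # Feed the output of each operation into the next, in order.
        return reduce(lambda value, op: op(value), operations, text)
    return pipeline


def preprocess_batch_sketch(texts: List[str],
                            operations: Optional[List[Callable[[str], str]]] = None) -> List[str]:
    run = preprocess_pipeline_sketch(*(operations or []))
    return [run(text) for text in texts]
```

With no operations, the pipeline is the identity function, so an empty `operations` list returns the texts unchanged in this sketch.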

kerb.preprocessing.truncate_text(text, max_length, strategy='end', suffix='...')[source]

Truncate text intelligently.

Parameters:

text (str) – Input text
max_length (int) – Maximum length
strategy (Union[TruncateStrategy, str]) – Truncation strategy (TruncateStrategy enum or string: “end”, “middle”, “start”, “smart”)
suffix (str) – Suffix to add when truncated

Return type:

str

Returns:

Truncated text

Examples

>>> truncate_text("Hello world", max_length=8)
'Hello...'

>>> truncate_text("Hello world", max_length=8, strategy=TruncateStrategy.MIDDLE)
'He...ld'

>>> truncate_text("This is a sentence. And another one.", max_length=20, strategy="smart")
'This is a sentence....'
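
The first two examples above are consistent with reserving room for the suffix inside `max_length`. The sketch below implements the "end", "start" and "middle" strategies under that assumption; the "smart" strategy is omitted and `truncate_sketch` is a hypothetical name.

```python
def truncate_sketch(text: str, max_length: int,
                    strategy: str = "end", suffix: str = "...") -> str:
    if len(text) <= max_length:
        return text
    # Characters of the original text that fit next to the suffix.
    room = max_length - len(suffix)
    if room <= 0:
        return suffix[:max_length]
    if strategy == "end":
        return text[:room] + suffix
    if strategy == "start":
        return suffix + text[-room:]
    if strategy == "middle":
        half = room // 2
        if half == 0:
            return text[:room] + suffix
        # Keep an equal number of characters from each side.
        return text[:half] + suffix + text[-half:]
    raise ValueError(f"unknown strategy: {strategy}")
```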

kerb.preprocessing.split_long_text(text, max_length, overlap=0, preserve_words=True)[source]

Split text exceeding length limit.

Parameters:

text (str) – Input text
max_length (int) – Maximum length per chunk
overlap (int) – Overlap between chunks
preserve_words (bool) – Don’t split words

Return type:

List[str]

Returns:

List of text chunks

Examples

>>> split_long_text("Hello world test", max_length=8)
['Hello', 'world', 'test']
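
With `preserve_words=True`, chunking can be done greedily: add words to the current chunk until the next word would exceed `max_length`. The sketch below shows that approach without the `overlap` parameter; keeping an over-long word whole is an assumption of this example, and `split_long_text_sketch` is a hypothetical name.

```python
from typing import List


def split_long_text_sketch(text: str, max_length: int) -> List[str]:
    chunks: List[str] = []
    current = ""
    for word in text.split():
        candidate = f"{current} {word}" if current else word
        # Assumed: a single word longer than max_length stays whole.
        if len(candidate) <= max_length or not current:
            current = candidate
        else:
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks
```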
