Tokenizer Module

Tokenizer utilities for counting tokens across different LLM models.

This module provides comprehensive token counting support for:

- OpenAI models (GPT-4o, GPT-4o-mini, etc.) via tiktoken
- HuggingFace models (BERT, Llama, etc.) via transformers
- Fast approximation methods for quick estimates

Key features:

- Tokenizer enum for explicit, type-safe tokenizer specification
- count_tokens: Count tokens for a single text
- batch_count_tokens: Count tokens for multiple texts
- count_tokens_for_messages: Count tokens in chat message format
- truncate_to_token_limit: Truncate text to fit token limits
- tokens_to_chars / chars_to_tokens: Convert between tokens and characters

kerb.tokenizer.count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)

Count tokens in text using the specified tokenizer.

Parameters:

text (str) – Text to count tokens for
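tokenizer (Union[Tokenizer, str]) – Tokenizer to use. Can be a Tokenizer enum value or a HuggingFace model name (e.g., "bert-base-uncased", "meta-llama/Llama-2-7b-hf"). Defaults to Tokenizer.CL100K_BASE, the tiktoken encoding for GPT-4 and GPT-3.5-turbo. tiktoken itself maps GPT-4o and GPT-4o-mini to o200k_base, so counts for those models under this default may not match the API's exactly.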

Returns:

Token count

Return type:

int

Examples

>>> count_tokens("Hello world!", tokenizer=Tokenizer.CL100K_BASE)
3

>>> count_tokens("Hello world!", tokenizer=Tokenizer.P50K_BASE)
3

>>> count_tokens("Hello world!", tokenizer="bert-base-uncased")
4

>>> count_tokens("Hello world!", tokenizer=Tokenizer.CHAR_4)
3
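
Because the tokenizer argument accepts either an enum member or a HuggingFace model name, some dispatch on its type has to happen before counting. The following is a minimal sketch of one way that dispatch could look. The local Tokenizer class only mirrors the documented enum values, and resolve_backend is a hypothetical helper for illustration, not part of kerb's documented API.

```python
from enum import Enum
from typing import Union


class Tokenizer(Enum):
    """Local stand-in that mirrors the documented enum values."""
    CL100K_BASE = "cl100k_base"
    P50K_BASE = "p50k_base"
    R50K_BASE = "r50k_base"
    P50K_EDIT = "p50k_edit"
    CHAR_4 = "approximate_char_4"
    CHAR_5 = "approximate_char_5"
    WORD = "approximate_word"


def resolve_backend(tokenizer: Union[Tokenizer, str]) -> str:
    """Classify a tokenizer spec as 'tiktoken', 'approximate', or 'huggingface'."""
    if isinstance(tokenizer, Tokenizer):
        # Assumption: approximation members are recognizable by their value prefix.
        if tokenizer.value.startswith("approximate_"):
            return "approximate"
        return "tiktoken"
    if isinstance(tokenizer, str):
        # Assumption: any plain string is treated as a HuggingFace model name.
        return "huggingface"
    raise TypeError(f"Unsupported tokenizer spec: {tokenizer!r}")
```

Checking the enum case first keeps the string branch free for arbitrary model names, which is one plausible reason to prefer enum members for the built-in encodings.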

kerb.tokenizer.batch_count_tokens(texts, tokenizer=Tokenizer.CL100K_BASE)

Count tokens for multiple texts.

Parameters:

texts (List[str]) – List of texts to count tokens for
tokenizer (Union[Tokenizer, str]) – Tokenizer to use. Defaults to Tokenizer.CL100K_BASE.

Returns:

List of token counts

Return type:

List[int]

Examples

>>> texts = ["Hello world!", "How are you?", "Good morning!"]
>>> batch_count_tokens(texts, tokenizer=Tokenizer.CL100K_BASE)
[3, 4, 3]
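
Per-text counts are typically summed to check a batch against an overall budget. The sketch below shows that pattern without depending on kerb: approx_count is a hypothetical stand-in counter (about 4 characters per token, rounded up), so its numbers will differ from the exact tiktoken counts in the example above.

```python
from typing import Callable, List


def approx_count(text: str) -> int:
    """Hypothetical stand-in counter: about 4 characters per token, rounded up."""
    return -(-len(text) // 4)  # ceiling division without importing math


def batch_count(
    texts: List[str],
    count: Callable[[str], int] = approx_count,
) -> List[int]:
    """Apply a single-text counter to each text, preserving input order."""
    return [count(t) for t in texts]


texts = ["Hello world!", "How are you?", "Good morning!"]
counts = batch_count(texts)
total = sum(counts)          # aggregate usage for the whole batch
fits = total <= 16           # compare against an illustrative budget
```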

kerb.tokenizer.count_tokens_for_messages(messages, tokenizer=Tokenizer.CL100K_BASE)

Count tokens for a list of chat messages including format overhead.

OpenAI chat models wrap each message in special tokens, and this function adds that formatting overhead to the token count of the message contents. It works best with tiktoken tokenizers (CL100K_BASE, P50K_BASE, etc.).

Parameters:

messages (List[dict]) – List of message dicts with "role" and "content" keys. Example: [{"role": "user", "content": "Hello!"}]
tokenizer (Union[Tokenizer, str]) – Tokenizer to use. Defaults to Tokenizer.CL100K_BASE.

Returns:

Total token count including message formatting overhead

Return type:

int

Examples

>>> messages = [
...     {"role": "system", "content": "You are a helpful assistant."},
...     {"role": "user", "content": "Hello!"}
... ]
>>> count_tokens_for_messages(messages, tokenizer=Tokenizer.CL100K_BASE)
28
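
To make the idea of format overhead concrete, here is a minimal sketch of the counting recipe OpenAI has published for gpt-3.5-turbo and gpt-4 style chat formatting: a fixed number of framing tokens per message plus a fixed number that prime the reply. The constants and the word_count stand-in are assumptions for illustration; kerb may use different values, so totals from this sketch need not match the documented example.

```python
from typing import Callable, Dict, List

# Assumed constants, taken from OpenAI's published counting recipe for
# gpt-3.5-turbo / gpt-4 style chat formatting. kerb may use other values.
TOKENS_PER_MESSAGE = 3    # framing tokens wrapped around every message
REPLY_PRIMING_TOKENS = 3  # tokens that prime the assistant's reply


def count_message_tokens(
    messages: List[Dict[str, str]],
    count: Callable[[str], int],
) -> int:
    """Sum content tokens plus fixed per-message and per-reply overhead."""
    total = REPLY_PRIMING_TOKENS
    for message in messages:
        total += TOKENS_PER_MESSAGE
        # Both the role string and the content string are tokenized.
        for value in message.values():
            total += count(value)
    return total


def word_count(text: str) -> int:
    """Hypothetical stand-in counter: one token per whitespace-separated word."""
    return len(text.split())


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
total = count_message_tokens(messages, word_count)
```

Passing the counter as a parameter keeps the overhead logic separate from the choice of tokenizer, so the same function works with an exact or an approximate counter.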

kerb.tokenizer.truncate_to_token_limit(text, max_tokens, tokenizer=Tokenizer.CL100K_BASE, preserve_end=False, ellipsis='...')

Truncate text to fit within a token limit.

Parameters:

text (str) – Text to truncate
max_tokens (int) – Maximum number of tokens
tokenizer (Union[Tokenizer, str]) – Tokenizer to use. Defaults to Tokenizer.CL100K_BASE.
preserve_end (bool) – If True, keep the end of the text instead of the beginning. Defaults to False.
ellipsis (str) – String appended (or prepended, when preserve_end is True) to indicate truncation. Defaults to "...".

Returns:

Truncated text

Return type:

str

Examples

>>> text = "This is a long text that needs to be truncated."
>>> truncate_to_token_limit(text, max_tokens=5, tokenizer=Tokenizer.CL100K_BASE)
'This is a long...'
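
The truncation logic can be sketched with a whitespace tokenizer standing in for a real one. This version assumes the ellipsis counts against the token budget, which is inferred from the example output above (four words plus the ellipsis for max_tokens=5) rather than confirmed by the source. truncate_words is a hypothetical name, and a real implementation would decode token IDs instead of joining words.

```python
def truncate_words(
    text: str,
    max_tokens: int,
    preserve_end: bool = False,
    ellipsis: str = "...",
) -> str:
    """Truncate text to max_tokens whitespace-separated tokens."""
    tokens = text.split()
    if len(tokens) <= max_tokens:
        return text  # already within the limit, return unchanged

    # Assumption: the ellipsis itself consumes part of the budget.
    budget = max(max_tokens - len(ellipsis.split()), 0)

    if preserve_end:
        # Keep the last `budget` tokens and mark the cut at the front.
        kept = tokens[len(tokens) - budget:]
        return ellipsis + " ".join(kept)

    # Keep the first `budget` tokens and mark the cut at the end.
    kept = tokens[:budget]
    return " ".join(kept) + ellipsis


text = "This is a long text that needs to be truncated."
head = truncate_words(text, max_tokens=5)
tail = truncate_words(text, max_tokens=5, preserve_end=True)
```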

class kerb.tokenizer.Tokenizer(*values)

Bases: Enum

Enumeration of supported tokenizers for token counting.

Using explicit tokenizers instead of model names provides better control and consistency for LLM developers.

Tiktoken Encodings (OpenAI):

CL100K_BASE: GPT-4, GPT-3.5-turbo, text-embedding-ada-002 (tiktoken maps GPT-4o and GPT-4o-mini to o200k_base)
P50K_BASE: Code models (Codex, text-davinci-002, text-davinci-003)
R50K_BASE: GPT-3 models (davinci, curie, babbage, ada)

Approximation Methods:

CHAR_4: Fast approximation using 4 chars/token (good for GPT-like models)
CHAR_5: Fast approximation using 5 chars/token (good for BERT-like models)
WORD: Word-based approximation (1.3 tokens/word average)

CL100K_BASE = 'cl100k_base'

P50K_BASE = 'p50k_base'

R50K_BASE = 'r50k_base'

P50K_EDIT = 'p50k_edit'

CHAR_4 = 'approximate_char_4'

CHAR_5 = 'approximate_char_5'

WORD = 'approximate_word'

property method: str: Get the tokenization method for this tokenizer.

kerb.tokenizer.tokens_to_chars(token_count, tokenizer=Tokenizer.CL100K_BASE)

Estimate character count from token count.

Parameters:

token_count (int) – Number of tokens
tokenizer (Tokenizer) – Tokenizer for estimation. Defaults to Tokenizer.CL100K_BASE.

Returns:

Estimated character count

Return type:

int

Examples

>>> tokens_to_chars(100, tokenizer=Tokenizer.CL100K_BASE)
400

kerb.tokenizer.chars_to_tokens(char_count, tokenizer=Tokenizer.CL100K_BASE)

Estimate token count from character count.

Parameters:

char_count (int) – Number of characters
tokenizer (Tokenizer) – Tokenizer for estimation. Defaults to Tokenizer.CL100K_BASE.

Returns:

Estimated token count

Return type:

int

Examples

>>> chars_to_tokens(400, tokenizer=Tokenizer.CL100K_BASE)
100
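
Both conversions in the examples are consistent with a fixed ratio of 4 characters per token for CL100K_BASE. A minimal sketch of that arithmetic follows. The ratio is inferred from the documented outputs, floor division is an assumption about rounding, and the function names are hypothetical.

```python
CHARS_PER_TOKEN = 4  # assumed ratio, consistent with 100 tokens <-> 400 chars


def tokens_to_chars_estimate(token_count: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate how many characters a given number of tokens covers."""
    return token_count * chars_per_token


def chars_to_tokens_estimate(char_count: int, chars_per_token: int = CHARS_PER_TOKEN) -> int:
    """Estimate how many tokens a given number of characters needs."""
    # Assumption: partial tokens are dropped (floor division).
    return char_count // chars_per_token


# Example: size a character budget for an 8,000-token prompt window.
char_budget = tokens_to_chars_estimate(8000)
```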

kerb.tokenizer.estimate_cost(token_count, model='gpt-4o', is_input=True)

Estimate API cost based on token usage.

Parameters:

token_count (int) – Number of tokens
model (str) – Model name for pricing. Defaults to "gpt-4o".
is_input (bool) – Whether tokens are input (True) or output (False). Defaults to True.

Returns:

Estimated cost in USD

Return type:

float

Examples

>>> estimate_cost(1000, model="gpt-4o", is_input=True)
0.005

>>> estimate_cost(1000, model="gpt-4o-mini", is_input=False)
0.0006

Note

Pricing is approximate and may change. Check official pricing for accuracy.
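
The cost calculation itself is a lookup followed by a multiplication. The sketch below uses a hypothetical price table whose two entries are back-derived from the documented examples (0.005 USD per 1,000 gpt-4o input tokens, 0.0006 USD per 1,000 gpt-4o-mini output tokens). It does not reflect current pricing or kerb's internal table.

```python
from typing import Dict, Tuple

# Hypothetical USD prices per 1M tokens, back-derived from the two
# documented examples. Not current pricing and not kerb's actual table.
PRICES_PER_MILLION: Dict[Tuple[str, str], float] = {
    ("gpt-4o", "input"): 5.00,
    ("gpt-4o-mini", "output"): 0.60,
}


def estimate_cost_sketch(token_count: int, model: str, is_input: bool = True) -> float:
    """Return the estimated cost in USD for token_count tokens."""
    direction = "input" if is_input else "output"
    try:
        price = PRICES_PER_MILLION[(model, direction)]
    except KeyError:
        # Fail loudly instead of silently guessing a price.
        raise ValueError(f"No price known for {model} ({direction})") from None
    return token_count * price / 1_000_000


gpt4o_input = estimate_cost_sketch(1000, "gpt-4o", is_input=True)
mini_output = estimate_cost_sketch(1000, "gpt-4o-mini", is_input=False)
```

Keeping input and output prices as separate table entries matters because providers usually charge more for generated tokens than for prompt tokens.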

kerb.tokenizer.optimize_token_usage(text, max_tokens=None, tokenizer=Tokenizer.CL100K_BASE)

Analyze and suggest optimizations for token usage.

Parameters:

text (str) – Text to analyze
max_tokens (Optional[int]) – Maximum token limit. If provided, the analysis reports whether the text exceeds it.
tokenizer (Tokenizer) – Tokenizer to use. Defaults to Tokenizer.CL100K_BASE.

Returns:

Analysis results including:

token_count: Actual token count
char_count: Character count
tokens_per_char: Ratio of tokens to characters (token_count / char_count)
exceeds_limit: Whether text exceeds max_tokens (if provided)
suggested_action: Recommended action based on analysis

Return type:

dict

Examples

>>> result = optimize_token_usage("Hello world!", max_tokens=10)
>>> result["token_count"]
3
>>> result["exceeds_limit"]
False
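
The shape of the returned dict can be illustrated with a small stand-alone sketch. The keys follow the list above. The approximate counter, the handling of a missing max_tokens (exceeds_limit set to None), and the suggested_action strings are all assumptions, since the source does not specify them.

```python
import math
from typing import Any, Dict, Optional


def analyze_usage(text: str, max_tokens: Optional[int] = None) -> Dict[str, Any]:
    """Build an analysis dict with the documented keys (hypothetical logic)."""
    # Stand-in counter: about 4 characters per token, rounded up.
    token_count = math.ceil(len(text) / 4)
    char_count = len(text)

    # Assumption: exceeds_limit is None when no limit was supplied.
    exceeds = None if max_tokens is None else token_count > max_tokens

    return {
        "token_count": token_count,
        "char_count": char_count,
        # Guard against division by zero for empty input.
        "tokens_per_char": token_count / char_count if char_count else 0.0,
        "exceeds_limit": exceeds,
        # Hypothetical action strings, for illustration only.
        "suggested_action": "truncate or summarize" if exceeds else "none",
    }


result = analyze_usage("Hello world!", max_tokens=10)
```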
