Tokenizer Module
Tokenizer utilities for counting tokens across different LLM models.
This module provides token counting support for:
- OpenAI models (GPT-4o, GPT-4o-mini, etc.) via tiktoken
- HuggingFace models (BERT, Llama, etc.) via transformers
- Fast approximation methods for quick estimates
Key features:
- Tokenizer enum for explicit, type-safe tokenizer specification
- count_tokens: Count tokens for a single text
- batch_count_tokens: Count tokens for multiple texts
- count_tokens_for_messages: Count tokens in chat message format
- truncate_to_token_limit: Truncate text to fit token limits
- tokens_to_chars / chars_to_tokens: Convert between tokens and characters
- kerb.tokenizer.count_tokens(text, tokenizer=Tokenizer.CL100K_BASE)
Count tokens in text using the specified tokenizer.
- Parameters:
- Returns:
Token count
- Return type:
Examples
>>> count_tokens("Hello world!", tokenizer=Tokenizer.CL100K_BASE)
3
>>> count_tokens("Hello world!", tokenizer=Tokenizer.P50K_BASE)
3
>>> count_tokens("Hello world!", tokenizer="bert-base-uncased")
4
>>> count_tokens("Hello world!", tokenizer=Tokenizer.CHAR_4)
3
- kerb.tokenizer.batch_count_tokens(texts, tokenizer=Tokenizer.CL100K_BASE)
Count tokens for multiple texts.
- Parameters:
- Returns:
List of token counts
- Return type:
Examples
>>> texts = ["Hello world!", "How are you?", "Good morning!"]
>>> batch_count_tokens(texts, tokenizer=Tokenizer.CL100K_BASE)
[3, 4, 3]
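Batch counting amounts to applying one counter to every text and keeping the results in input order. The following is a minimal sketch of that pattern, not the library's implementation: `char_4_estimate` and `batch_count` are hypothetical names, and the counter is a stand-in that assumes roughly 4 characters per token.

```python
from typing import Callable, List


def char_4_estimate(text: str) -> int:
    """Stand-in counter (assumption): about 4 characters per token, rounded up."""
    return -(-len(text) // 4)  # ceiling division without importing math


def batch_count(texts: List[str],
                counter: Callable[[str], int] = char_4_estimate) -> List[int]:
    """Apply the same counter to each text, preserving input order."""
    return [counter(text) for text in texts]


counts = batch_count(["Hello world!", "How are you?", "Good morning!"])
print(counts)
```

With this stand-in the result is `[3, 3, 4]`, which differs from the `[3, 4, 3]` shown for CL100K_BASE above. Character-based estimates are useful for budgeting but should not be expected to match a real tokenizer on individual strings.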
- kerb.tokenizer.count_tokens_for_messages(messages, tokenizer=Tokenizer.CL100K_BASE)
Count tokens for a list of chat messages including format overhead.
OpenAI chat models format messages with special tokens. This function accounts for the overhead of message formatting. Works best with tiktoken tokenizers (CL100K_BASE, P50K_BASE, etc.).
- Parameters:
- Returns:
Total token count including message formatting overhead
- Return type:
Examples
>>> messages = [
...     {"role": "system", "content": "You are a helpful assistant."},
...     {"role": "user", "content": "Hello!"}
... ]
>>> count_tokens_for_messages(messages, tokenizer=Tokenizer.CL100K_BASE)
28
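To illustrate what "format overhead" means, the sketch below adds a fixed number of tokens per message plus a fixed number for priming the reply, on top of the tokens in each field. The two constants and the function names are illustrative assumptions; the overhead actually applied by count_tokens_for_messages is not documented here and may well differ, and the counter is a character-based stand-in rather than tiktoken.

```python
from typing import Callable, Dict, List

# Illustrative overhead values (assumptions, not taken from the library).
TOKENS_PER_MESSAGE = 3    # wrapper tokens around each message
REPLY_PRIMING_TOKENS = 3  # tokens that prime the assistant's reply


def char_4_estimate(text: str) -> int:
    """Stand-in counter (assumption): about 4 characters per token, rounded up."""
    return -(-len(text) // 4)


def count_message_tokens(messages: List[Dict[str, str]],
                         counter: Callable[[str], int] = char_4_estimate) -> int:
    """Sum per-field token counts plus fixed per-message and reply overhead."""
    total = REPLY_PRIMING_TOKENS
    for message in messages:
        total += TOKENS_PER_MESSAGE
        # Both the role and the content contribute tokens.
        total += sum(counter(value) for value in message.values())
    return total


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello!"},
]
total = count_message_tokens(messages)
print(total)
```

The point of the sketch is the structure of the sum, not the specific number it prints: the total grows with the number of messages even when their content is short.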
- kerb.tokenizer.truncate_to_token_limit(text, max_tokens, tokenizer=Tokenizer.CL100K_BASE, preserve_end=False, ellipsis='...')
Truncate text to fit within a token limit.
- Parameters:
  - text (str) – Text to truncate
  - max_tokens (int) – Maximum number of tokens
  - tokenizer (Union[Tokenizer, str]) – Tokenizer to use. Defaults to Tokenizer.CL100K_BASE.
  - preserve_end (bool) – If True, keep the end of the text instead of the beginning. Defaults to False.
  - ellipsis (str) – String to indicate truncation. Defaults to '...'.
- Returns:
Truncated text
- Return type:
Examples
>>> text = "This is a long text that needs to be truncated."
>>> truncate_to_token_limit(text, max_tokens=5, tokenizer=Tokenizer.CL100K_BASE)
'This is a long...'
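The interaction between `max_tokens`, `preserve_end`, and `ellipsis` can be sketched with a character budget in place of a real tokenizer. This is a simplified illustration under the assumption of 4 characters per token; `truncate_by_estimate` is a hypothetical name, and the library itself presumably cuts on real token boundaries, so its output will differ (compare the example above).

```python
def truncate_by_estimate(text: str, max_tokens: int, chars_per_token: int = 4,
                         preserve_end: bool = False, ellipsis: str = "...") -> str:
    """Truncate text to an approximate token budget expressed in characters."""
    budget = max(max_tokens, 0) * chars_per_token
    if len(text) <= budget:
        return text  # already within the budget, nothing to cut
    if budget <= len(ellipsis):
        return ellipsis[:budget]  # no room for any of the original text
    keep = budget - len(ellipsis)  # reserve space for the truncation marker
    if preserve_end:
        return ellipsis + text[len(text) - keep:]
    return text[:keep] + ellipsis


text = "This is a long text that needs to be truncated."
head = truncate_by_estimate(text, max_tokens=5)
tail = truncate_by_estimate(text, max_tokens=5, preserve_end=True)
print(head)
print(tail)
```

Reserving room for the marker inside the budget keeps the result within the limit; appending it after the cut would overshoot by the length of the ellipsis.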
- class kerb.tokenizer.Tokenizer(*values)
Bases: Enum
Enumeration of supported tokenizers for token counting.
Using explicit tokenizers instead of model names provides better control and consistency for LLM developers.
- Tiktoken Encodings (OpenAI):
  - CL100K_BASE: GPT-4o, GPT-4o-mini, text-embedding-ada-002
  - P50K_BASE: Code models (Codex, text-davinci-002, text-davinci-003)
  - R50K_BASE: GPT-3 models (davinci, curie, babbage, ada)
- Approximation Methods:
  - CHAR_4: Fast approximation using 4 chars/token (good for GPT-like models)
  - CHAR_5: Fast approximation using 5 chars/token (good for BERT-like models)
  - WORD: Word-based approximation (1.3 tokens/word average)
- CL100K_BASE = 'cl100k_base'
- P50K_BASE = 'p50k_base'
- R50K_BASE = 'r50k_base'
- P50K_EDIT = 'p50k_edit'
- CHAR_4 = 'approximate_char_4'
- CHAR_5 = 'approximate_char_5'
- WORD = 'approximate_word'
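The three approximation members correspond to simple arithmetic on the text. The sketch below shows one way to compute them from the ratios stated above (4 chars/token, 5 chars/token, 1.3 tokens/word). The function names, the whitespace-based word split, and the choice to round up are assumptions; the library may split or round differently.

```python
import math
import re


def approx_char_tokens(text: str, chars_per_token: int = 4) -> int:
    """Estimate tokens as the character count divided by a fixed ratio, rounded up."""
    return math.ceil(len(text) / chars_per_token)


def approx_word_tokens(text: str, tokens_per_word: float = 1.3) -> int:
    """Estimate tokens from whitespace-separated words, rounded up."""
    words = re.findall(r"\S+", text)
    return math.ceil(len(words) * tokens_per_word)


sample = "Hello world!"
print(approx_char_tokens(sample, 4))  # CHAR_4-style estimate
print(approx_char_tokens(sample, 5))  # CHAR_5-style estimate
print(approx_word_tokens(sample))     # WORD-style estimate
```

For "Hello world!" all three estimates come out at 3. Longer or less typical text (code, non-English, long identifiers) tends to spread the three estimates apart.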
- kerb.tokenizer.tokens_to_chars(token_count, tokenizer=Tokenizer.CL100K_BASE)
Estimate character count from token count.
- Parameters:
- Returns:
Estimated character count
- Return type:
Examples
>>> tokens_to_chars(100, tokenizer=Tokenizer.CL100K_BASE)
400
- kerb.tokenizer.chars_to_tokens(char_count, tokenizer=Tokenizer.CL100K_BASE)
Estimate token count from character count.
- Parameters:
- Returns:
Estimated token count
- Return type:
Examples
>>> chars_to_tokens(400, tokenizer=Tokenizer.CL100K_BASE)
100
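Both examples are consistent with a fixed ratio of 4 characters per token for CL100K_BASE (100 tokens to 400 characters and back). A minimal sketch of the two conversions under that assumption follows. The function names and the rounding direction are hypothetical, and the ratios the library uses for other tokenizers are not stated in this reference.

```python
import math


def tokens_to_chars_estimate(token_count: int, chars_per_token: float = 4.0) -> int:
    """Estimate how many characters a given number of tokens covers."""
    return int(token_count * chars_per_token)


def chars_to_tokens_estimate(char_count: int, chars_per_token: float = 4.0) -> int:
    """Estimate how many tokens a given number of characters needs, rounded up."""
    return math.ceil(char_count / chars_per_token)


print(tokens_to_chars_estimate(100))
print(chars_to_tokens_estimate(400))
```

Rounding the token estimate up is the conservative choice when the result feeds a context-window budget, since underestimating tokens is the costlier mistake.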
- kerb.tokenizer.estimate_cost(token_count, model='gpt-4o', is_input=True)
Estimate API cost based on token usage.
- Parameters:
- Returns:
Estimated cost in USD
- Return type:
Examples
>>> estimate_cost(1000, model="gpt-4o", is_input=True)
0.005
>>> estimate_cost(1000, model="gpt-4o-mini", is_input=False)
0.0006
Note
Pricing is approximate and may change. Check official pricing for accuracy.
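The two examples imply per-token rates of 5.00 USD per million input tokens for gpt-4o and 0.60 USD per million output tokens for gpt-4o-mini. The sketch below shows the arithmetic with a small lookup table. The table layout and function name are hypothetical, and the two rates not implied by the examples are placeholders marked as such, not confirmed prices.

```python
# USD per one million tokens. Only the two rates marked "from examples" are
# implied by this reference; the others are placeholder assumptions.
PRICE_PER_MILLION_USD = {
    "gpt-4o": {"input": 5.00,         # from examples
               "output": 15.00},      # placeholder assumption
    "gpt-4o-mini": {"input": 0.15,    # placeholder assumption
                    "output": 0.60},  # from examples
}


def estimate_cost_sketch(token_count: int, model: str = "gpt-4o",
                         is_input: bool = True) -> float:
    """Multiply the token count by the per-token rate for the chosen direction."""
    if model not in PRICE_PER_MILLION_USD:
        raise ValueError(f"no pricing entry for model {model!r}")
    direction = "input" if is_input else "output"
    rate = PRICE_PER_MILLION_USD[model][direction]
    return token_count * rate / 1_000_000


input_cost = estimate_cost_sketch(1000, model="gpt-4o", is_input=True)
output_cost = estimate_cost_sketch(1000, model="gpt-4o-mini", is_input=False)
print(input_cost, output_cost)
```

Keeping input and output rates separate matters because providers commonly charge more for generated tokens than for prompt tokens.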
- kerb.tokenizer.optimize_token_usage(text, max_tokens=None, tokenizer=Tokenizer.CL100K_BASE)
Analyze and suggest optimizations for token usage.
- Parameters:
- Returns:
- Analysis results including:
token_count: Actual token count
char_count: Character count
tokens_per_char: Token to character ratio
exceeds_limit: Whether text exceeds max_tokens (if provided)
suggested_action: Recommended action based on analysis
- Return type:
Examples
>>> result = optimize_token_usage("Hello world!", max_tokens=10)
>>> result["token_count"]
3
>>> result["exceeds_limit"]
False
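The returned fields can be derived from a token count and the text length. The following sketch builds a dictionary with the same keys as listed above. The function name, the character-based stand-in counter, and the wording of `suggested_action` are assumptions; the library's real suggestions are not documented in this reference.

```python
from typing import Any, Dict, Optional


def char_4_estimate(text: str) -> int:
    """Stand-in counter (assumption): about 4 characters per token, rounded up."""
    return -(-len(text) // 4)


def analyze_token_usage(text: str, max_tokens: Optional[int] = None) -> Dict[str, Any]:
    """Build an analysis dict with the keys described in this reference."""
    token_count = char_4_estimate(text)
    char_count = len(text)
    exceeds_limit = max_tokens is not None and token_count > max_tokens
    return {
        "token_count": token_count,
        "char_count": char_count,
        # Guard against division by zero for empty input.
        "tokens_per_char": token_count / char_count if char_count else 0.0,
        "exceeds_limit": exceeds_limit,
        # Hypothetical wording; the library's messages may differ.
        "suggested_action": "truncate or summarize" if exceeds_limit else "none",
    }


result = analyze_token_usage("Hello world!", max_tokens=10)
print(result["token_count"], result["exceeds_limit"])
```

When `max_tokens` is omitted, `exceeds_limit` is simply False in this sketch, which mirrors the "if provided" qualifier in the field description.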