Embedding Module

Embedding utilities for converting text to vector representations.

This module provides flexible embedding generation with multiple model options:
  • Local/hash-based (no dependencies, for testing)

  • Sentence Transformers (local ML models, high quality)

  • OpenAI API (cloud-based, highest quality)

Usage Examples:

# Common usage - core functions
from kerb.embedding import embed, embed_batch

vec = embed("Hello world")
vecs = embed_batch(["Hello", "World"])

# Provider-specific usage
from kerb.embedding.providers import OpenAIEmbedder, LocalEmbedder
from kerb.embedding.providers import SentenceTransformerEmbedder

embedder = OpenAIEmbedder(model_name="text-embedding-3-large")
vec = embedder.embed("Hello")

# Utilities
from kerb.embedding.utils import cosine_similarity, euclidean_distance

similarity = cosine_similarity(vec1, vec2)

class kerb.embedding.EmbeddingModel(*values)[source]

Bases: Enum

Enum for embedding models.

For custom models not listed here, use a plain string instead.

LOCAL = 'local'
ALL_MINILM_L6_V2 = 'all-MiniLM-L6-v2'
ALL_MINILM_L12_V2 = 'all-MiniLM-L12-v2'
ALL_MPNET_BASE_V2 = 'all-mpnet-base-v2'
PARAPHRASE_MINILM_L6_V2 = 'paraphrase-MiniLM-L6-v2'
PARAPHRASE_MPNET_BASE_V2 = 'paraphrase-mpnet-base-v2'
TEXT_EMBEDDING_3_SMALL = 'text-embedding-3-small'
TEXT_EMBEDDING_3_LARGE = 'text-embedding-3-large'
TEXT_EMBEDDING_ADA_002 = 'text-embedding-ada-002'
class kerb.embedding.ModelBackend(*values)[source]

Bases: Enum

Enum for embedding backends.

LOCAL = 'local'
SENTENCE_TRANSFORMERS = 'sentence_transformers'
OPENAI = 'openai'
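
The two enums suggest that a model name is mapped to a backend before any embedding is computed. The sketch below shows one way such a mapping could look. It is not kerb's actual routing logic: `resolve_backend` is a hypothetical helper, the enum classes are local stand-ins that copy only a few of the documented values, and the fallback rule for unknown strings is an assumption.

```python
from enum import Enum
from typing import Union


# Local stand-ins mirroring a few documented values; the real classes
# live in kerb.embedding.
class EmbeddingModel(Enum):
    LOCAL = "local"
    ALL_MINILM_L6_V2 = "all-MiniLM-L6-v2"
    TEXT_EMBEDDING_3_SMALL = "text-embedding-3-small"


class ModelBackend(Enum):
    LOCAL = "local"
    SENTENCE_TRANSFORMERS = "sentence_transformers"
    OPENAI = "openai"


def resolve_backend(model: Union[str, EmbeddingModel]) -> ModelBackend:
    """Hypothetical helper: pick a backend from an enum member or a plain string."""
    name = model.value if isinstance(model, EmbeddingModel) else model
    if name == "local":
        return ModelBackend.LOCAL
    if name.startswith("text-embedding-"):
        return ModelBackend.OPENAI
    # Assumption: any other name is treated as a Sentence Transformers model.
    return ModelBackend.SENTENCE_TRANSFORMERS


backend = resolve_backend(EmbeddingModel.ALL_MINILM_L6_V2)
```

Accepting both enum members and plain strings keeps known models discoverable while still allowing custom model names.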
kerb.embedding.embed(text, model=EmbeddingModel.LOCAL, dimensions=384, api_key=None, **kwargs)[source]

Generate an embedding vector for text.

Parameters:
  • text (str) – The text to embed

  • model (Union[str, EmbeddingModel]) – Model to use:
      - EmbeddingModel.LOCAL - Hash-based (default, no dependencies)
      - EmbeddingModel.ALL_MINILM_L6_V2 - Sentence Transformers (384 dim)
      - EmbeddingModel.ALL_MPNET_BASE_V2 - Sentence Transformers (768 dim)
      - EmbeddingModel.TEXT_EMBEDDING_3_SMALL - OpenAI (1536 dim)
      - EmbeddingModel.TEXT_EMBEDDING_3_LARGE - OpenAI (3072 dim)
      - Or use a string for custom models: "custom-model-name"

  • dimensions (int) – Dimension for local embeddings (default: 384)

  • api_key (Optional[str]) – OpenAI API key (or set OPENAI_API_KEY env var)

  • **kwargs – Additional model-specific parameters

Returns:

Embedding vector (normalized to unit length)

Return type:

List[float]

Examples

# Using enum (recommended for known models)
vec = embed("Hello, world!")
vec = embed("Hello", model=EmbeddingModel.ALL_MINILM_L6_V2)
vec = embed("Hello", model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL, api_key="sk-…")

# Using string for custom models
vec = embed("Hello", model="my-custom-sentence-transformer")

kerb.embedding.embed_batch(texts, model=EmbeddingModel.LOCAL, dimensions=384, batch_size=32, api_key=None, **kwargs)[source]

Generate embeddings for multiple texts efficiently.

Parameters:
  • texts (List[str]) – List of texts to embed

  • model (Union[str, EmbeddingModel]) – Model to use (see embed() for options)

  • dimensions (int) – Dimension for local embeddings

  • batch_size (int) – Batch size for processing

  • api_key (Optional[str]) – OpenAI API key (or set OPENAI_API_KEY env var)

  • **kwargs – Additional model-specific parameters

Returns:

List of embedding vectors

Return type:

List[List[float]]

Examples

# Using enum
embeddings = embed_batch(["doc1", "doc2", "doc3"])
embeddings = embed_batch(docs, model=EmbeddingModel.ALL_MINILM_L6_V2)
embeddings = embed_batch(docs, model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL)

# Using string for custom models
embeddings = embed_batch(docs, model="custom-model")

async kerb.embedding.embed_async(text, model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL, api_key=None, **kwargs)[source]

Generate embedding asynchronously (wrapper for API-based models).

Parameters:
  • text (str) – Text to embed

  • model (Union[str, EmbeddingModel]) – Embedding model to use

  • api_key (Optional[str]) – API key (for OpenAI models)

  • **kwargs – Additional model parameters

Returns:

Embedding vector

Return type:

List[float]

Note

Native async is currently supported only for OpenAI models. Local models run synchronously in a thread pool.
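
Running a synchronous model in a thread pool can be sketched with the standard library alone. The example below is an illustration of the general pattern, not kerb's implementation; `slow_local_embed` and `embed_in_thread` are hypothetical names, and `asyncio.to_thread` requires Python 3.9 or later.

```python
import asyncio
from typing import List


def slow_local_embed(text: str) -> List[float]:
    """Hypothetical stand-in for a blocking, local embedding call."""
    return [float(len(text)), 1.0]


async def embed_in_thread(text: str) -> List[float]:
    # asyncio.to_thread runs the blocking function in a worker thread,
    # so the event loop stays free to service other coroutines.
    return await asyncio.to_thread(slow_local_embed, text)


result = asyncio.run(embed_in_thread("Hello"))
```

The coroutine is awaitable like any API-backed call, but the computation itself is still synchronous and occupies one worker thread while it runs.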

Examples

>>> import asyncio
>>> embedding = asyncio.run(embed_async("Hello world"))
async kerb.embedding.embed_batch_async(texts, model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL, api_key=None, batch_size=100, max_concurrent=5, **kwargs)[source]

Generate embeddings for multiple texts asynchronously.

Parameters:
  • texts (List[str]) – Texts to embed

  • model (Union[str, EmbeddingModel]) – Embedding model to use

  • api_key (Optional[str]) – API key (for OpenAI models)

  • batch_size (int) – Number of texts per API call

  • max_concurrent (int) – Maximum concurrent requests (for API models)

  • **kwargs – Additional model parameters

Returns:

List of embedding vectors

Return type:

List[List[float]]

Examples

>>> import asyncio
>>> texts = ["Hello", "World", "AI"]
>>> embeddings = asyncio.run(embed_batch_async(texts))
kerb.embedding.embed_batch_stream(texts, model=EmbeddingModel.LOCAL, batch_size=32, api_key=None, **kwargs)[source]

Stream embeddings for large datasets (memory efficient).

Yields embeddings one at a time instead of loading all into memory. Useful for processing very large datasets.

Parameters:
  • texts (List[str]) – Texts to embed

  • model (Union[str, EmbeddingModel]) – Embedding model to use

  • batch_size (int) – Number of texts to process per batch

  • api_key (Optional[str]) – API key (for API-based models)

  • **kwargs – Additional model parameters

Yields:

Tuple[int, List[float]] – (index, embedding) pairs

Examples

>>> texts = ["text1", "text2", ...]  # Large list
>>> for idx, embedding in embed_batch_stream(texts, batch_size=100):
...     # Process embedding immediately
...     print(f"Processed {idx}")
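
The streaming pattern described above (embed one batch at a time, yield each result with its original index) can be sketched as a plain generator. This is a minimal illustration under assumptions, not the library's code: `stream_embeddings` and `toy_embed_many` are hypothetical names, and the toy embedder just returns the text length.

```python
from typing import Callable, Iterator, List, Tuple

Vector = List[float]


def stream_embeddings(
    texts: List[str],
    embed_many: Callable[[List[str]], List[Vector]],
    batch_size: int = 32,
) -> Iterator[Tuple[int, Vector]]:
    """Yield (index, embedding) pairs, holding only one batch in memory."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(texts), batch_size):
        batch = texts[start:start + batch_size]
        # Indices are offsets into the original list, not into the batch.
        for offset, vector in enumerate(embed_many(batch)):
            yield start + offset, vector


def toy_embed_many(batch: List[str]) -> List[Vector]:
    """Hypothetical batch embedder used only for this example."""
    return [[float(len(text))] for text in batch]


pairs = list(stream_embeddings(["a", "bb", "ccc"], toy_embed_many, batch_size=2))
```

Because the generator is lazy, a consumer that writes each vector to disk or a vector store never needs the full result list in memory.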
async kerb.embedding.embed_batch_stream_async(texts, model=EmbeddingModel.TEXT_EMBEDDING_3_SMALL, batch_size=100, api_key=None, max_concurrent=5, **kwargs)[source]

Stream embeddings asynchronously for large datasets.

Parameters:
  • texts (List[str]) – Texts to embed

  • model (Union[str, EmbeddingModel]) – Embedding model to use

  • batch_size (int) – Number of texts per API call

  • api_key (Optional[str]) – API key (for API-based models)

  • max_concurrent (int) – Maximum concurrent requests

  • **kwargs – Additional model parameters

Yields:

Tuple[int, List[float]] – (index, embedding) pairs

Examples

>>> import asyncio
>>> async def process():
...     texts = ["text1", "text2", ...]
...     async for idx, embedding in embed_batch_stream_async(texts):
...         print(f"Processed {idx}")
>>> asyncio.run(process())
class kerb.embedding.LocalEmbedder(dimensions=384)[source]

Bases: object

Local hash-based embedder

This is a simple, deterministic embedding that requires no external models. Suitable for testing, prototyping, or when you don’t need semantic quality.

Parameters:

dimensions (int) – Embedding dimension (default: 384)

Examples

embedder = LocalEmbedder(dimensions=512)
vec = embedder.embed("Hello world")
vecs = embedder.embed_batch(["Hello", "World"])

__init__(dimensions=384)[source]

Initialize the local embedder.

Parameters:

dimensions (int) – Embedding dimension

embed(text)[source]

Generate embedding for a single text.

Parameters:

text (str) – Text to embed

Returns:

Embedding vector

Return type:

List[float]

embed_batch(texts)[source]

Generate embeddings for multiple texts.

Parameters:

texts (List[str]) – Texts to embed

Returns:

List of embedding vectors

Return type:

List[List[float]]

class kerb.embedding.OpenAIEmbedder(model_name='text-embedding-3-small', api_key=None)[source]

Bases: object

OpenAI embedding provider.

Requires: pip install openai

Parameters:
  • model_name (str) – OpenAI model name (default: “text-embedding-3-small”)

  • api_key (Optional[str]) – OpenAI API key (or set OPENAI_API_KEY env var)

Examples:

embedder = OpenAIEmbedder(model_name="text-embedding-3-large")
vec = embedder.embed("Hello world")
vecs = embedder.embed_batch(["Hello", "World"])

# Async usage
import asyncio
async def main():
    vec = await embedder.embed_async("Hello")
asyncio.run(main())
__init__(model_name='text-embedding-3-small', api_key=None)[source]

Initialize the OpenAI embedder.

Parameters:
  • model_name (str) – OpenAI model name

  • api_key (Optional[str]) – OpenAI API key

embed(text, **kwargs)[source]

Generate embedding for a single text.

Parameters:
  • text (str) – Text to embed

  • **kwargs – Additional API parameters

Returns:

Embedding vector

Return type:

List[float]

embed_batch(texts, batch_size=100, **kwargs)[source]

Generate embeddings for multiple texts.

Parameters:
  • texts (List[str]) – Texts to embed

  • batch_size (int) – Number of texts per API call

  • **kwargs – Additional API parameters

Returns:

List of embedding vectors

Return type:

List[List[float]]

async embed_async(text, **kwargs)[source]

Generate embedding asynchronously.

Parameters:
  • text (str) – Text to embed

  • **kwargs – Additional API parameters

Returns:

Embedding vector

Return type:

List[float]

async embed_batch_async(texts, batch_size=100, max_concurrent=5, **kwargs)[source]

Generate embeddings asynchronously for multiple texts.

Parameters:
  • texts (List[str]) – Texts to embed

  • batch_size (int) – Number of texts per API call

  • max_concurrent (int) – Maximum concurrent requests

  • **kwargs – Additional API parameters

Returns:

List of embedding vectors

Return type:

List[List[float]]

class kerb.embedding.SentenceTransformerEmbedder(model_name='all-MiniLM-L6-v2')[source]

Bases: object

Sentence Transformers embedding provider (runs locally).

Requires: pip install sentence-transformers

Parameters:

model_name (str) – Model name (default: “all-MiniLM-L6-v2”)

Examples

embedder = SentenceTransformerEmbedder(model_name="all-mpnet-base-v2")
vec = embedder.embed("Hello world")
vecs = embedder.embed_batch(["Hello", "World"])

__init__(model_name='all-MiniLM-L6-v2')[source]

Initialize the Sentence Transformer embedder.

Parameters:

model_name (str) – Model name

embed(text, **kwargs)[source]

Generate embedding for a single text.

Parameters:
  • text (str) – Text to embed

  • **kwargs – Additional model parameters

Returns:

Embedding vector

Return type:

List[float]

embed_batch(texts, batch_size=32, **kwargs)[source]

Generate embeddings for multiple texts.

Parameters:
  • texts (List[str]) – Texts to embed

  • batch_size (int) – Batch size for processing

  • **kwargs – Additional model parameters

Returns:

List of embedding vectors

Return type:

List[List[float]]

kerb.embedding.local_embed(text, dimensions=384)[source]

Generate embedding using local hash-based method.

This is a simple, deterministic embedding that requires no external models. Suitable for testing, prototyping, or when you don’t need semantic quality.

Parameters:
  • text (str) – Text to embed

  • dimensions (int) – Embedding dimension

Returns:

Normalized embedding vector

Return type:

List[float]
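
To make "hash-based" concrete, the sketch below shows one common way to build a deterministic embedding without a model: hash each token into a bucket and normalize the result. It is an assumption about the general technique, not a description of what `local_embed` actually does; `hash_embed`, the whitespace tokenization, and the choice of SHA-256 are all illustrative.

```python
import hashlib
import math
from typing import List


def hash_embed(text: str, dimensions: int = 384) -> List[float]:
    """Illustrative hashing-trick embedding, normalized to unit length."""
    vector = [0.0] * dimensions
    for token in text.lower().split():
        digest = hashlib.sha256(token.encode("utf-8")).digest()
        # The first four bytes pick the bucket, the fifth picks the sign.
        index = int.from_bytes(digest[:4], "big") % dimensions
        sign = 1.0 if digest[4] % 2 == 0 else -1.0
        vector[index] += sign
    norm = math.sqrt(sum(x * x for x in vector))
    # An empty text (or tokens that cancel out) leaves the zero vector as-is.
    return [x / norm for x in vector] if norm > 0 else vector


vec = hash_embed("Hello world", dimensions=16)
```

Vectors built this way are reproducible across runs and machines, but they capture token overlap only, not meaning, which is why this kind of embedding suits tests and prototypes rather than semantic search.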

kerb.embedding.openai_embed(text, model_name='text-embedding-3-small', api_key=None, **kwargs)[source]

Generate embedding using OpenAI API.

Requires: pip install openai

Parameters:
  • text (str) – Text to embed

  • model_name (str) – OpenAI model name (default: “text-embedding-3-small”)

  • api_key (Optional[str]) – OpenAI API key (or set OPENAI_API_KEY env var)

  • **kwargs – Additional API parameters

Returns:

Embedding vector

Return type:

List[float]

Popular models:

  • “text-embedding-3-small” (1536 dim, cost-effective)

  • “text-embedding-3-large” (3072 dim, highest quality)

  • “text-embedding-ada-002” (1536 dim, legacy)

kerb.embedding.openai_embed_batch(texts, model_name='text-embedding-3-small', api_key=None, batch_size=100, **kwargs)[source]

Generate embeddings for multiple texts using OpenAI API.

Processes texts in batches to stay within API limits.

Parameters:
  • texts (List[str]) – Texts to embed

  • model_name (str) – OpenAI model name (default: “text-embedding-3-small”)

  • api_key (Optional[str]) – OpenAI API key (or set OPENAI_API_KEY env var)

  • batch_size (int) – Number of texts per API call (max 2048 for OpenAI)

  • **kwargs – Additional API parameters

Returns:

List of embedding vectors

Return type:

List[List[float]]

async kerb.embedding.openai_embed_async(text, model_name='text-embedding-3-small', api_key=None, **kwargs)[source]

Generate embedding using OpenAI API asynchronously.

Requires: pip install openai

Parameters:
  • text (str) – Text to embed

  • model_name (str) – OpenAI model name (default: “text-embedding-3-small”)

  • api_key (Optional[str]) – OpenAI API key (or set OPENAI_API_KEY env var)

  • **kwargs – Additional API parameters

Returns:

Embedding vector

Return type:

List[float]

Examples

>>> import asyncio
>>> embedding = asyncio.run(openai_embed_async("Hello world"))
async kerb.embedding.openai_embed_batch_async(texts, model_name='text-embedding-3-small', api_key=None, batch_size=100, max_concurrent=5, **kwargs)[source]

Generate embeddings for multiple texts using OpenAI API asynchronously.

Processes texts in batches with concurrent requests for improved performance.

Parameters:
  • texts (List[str]) – Texts to embed

  • model_name (str) – OpenAI model name (default: “text-embedding-3-small”)

  • api_key (Optional[str]) – OpenAI API key (or set OPENAI_API_KEY env var)

  • batch_size (int) – Number of texts per API call (max 2048 for OpenAI)

  • max_concurrent (int) – Maximum concurrent API requests

  • **kwargs – Additional API parameters

Returns:

List of embedding vectors

Return type:

List[List[float]]

Examples

>>> import asyncio
>>> texts = ["Hello", "World", "AI"]
>>> embeddings = asyncio.run(openai_embed_batch_async(texts))
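
The combination of batch_size and max_concurrent can be sketched with an `asyncio.Semaphore`. The code below is a minimal illustration of bounded concurrency, not the library's implementation: `embed_batches_concurrently` and `toy_embed_many` are hypothetical, and no network call is made.

```python
import asyncio
from typing import Awaitable, Callable, List

Vector = List[float]


async def embed_batches_concurrently(
    texts: List[str],
    embed_many: Callable[[List[str]], Awaitable[List[Vector]]],
    batch_size: int = 100,
    max_concurrent: int = 5,
) -> List[Vector]:
    """Split texts into batches and run at most max_concurrent at a time."""
    semaphore = asyncio.Semaphore(max_concurrent)
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]

    async def run_one(batch: List[str]) -> List[Vector]:
        async with semaphore:  # waits while max_concurrent calls are in flight
            return await embed_many(batch)

    # gather returns results in argument order, so output lines up with texts.
    results = await asyncio.gather(*(run_one(batch) for batch in batches))
    return [vector for batch_result in results for vector in batch_result]


async def toy_embed_many(batch: List[str]) -> List[Vector]:
    """Hypothetical async embedder standing in for an API request."""
    await asyncio.sleep(0)
    return [[float(len(text))] for text in batch]


vectors = asyncio.run(
    embed_batches_concurrently(["a", "bb", "ccc"], toy_embed_many,
                               batch_size=2, max_concurrent=2)
)
```

Capping in-flight requests is what keeps a large job under provider rate limits while still overlapping network latency.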
kerb.embedding.sentence_transformer_embed(text, model_name='all-MiniLM-L6-v2', **kwargs)[source]

Generate embedding using Sentence Transformers (local ML model).

Requires: pip install sentence-transformers

Parameters:
  • text (str) – Text to embed

  • model_name (str) – Model name (default: “all-MiniLM-L6-v2”)

  • **kwargs – Additional model parameters

Returns:

Embedding vector

Return type:

List[float]

Popular models:
  • “all-MiniLM-L6-v2” (384 dim, fast)

  • “all-mpnet-base-v2” (768 dim, quality)

  • “all-MiniLM-L12-v2” (384 dim, balanced)

kerb.embedding.sentence_transformer_embed_batch(texts, model_name='all-MiniLM-L6-v2', batch_size=32, **kwargs)[source]

Generate embeddings for multiple texts using Sentence Transformers.

More efficient than calling sentence_transformer_embed repeatedly.

Parameters:
  • texts (List[str]) – Texts to embed

  • model_name (str) – Model name (default: “all-MiniLM-L6-v2”)

  • batch_size (int) – Batch size for processing

  • **kwargs – Additional model parameters

Returns:

List of embedding vectors

Return type:

List[List[float]]

kerb.embedding.cosine_similarity(vector1, vector2)[source]

Calculate cosine similarity between two vectors.

Parameters:
  • vector1 (List[float]) – First embedding vector

  • vector2 (List[float]) – Second embedding vector

Returns:

Cosine similarity score between -1 and 1 (1 = same direction, 0 = orthogonal, -1 = opposite direction)

Return type:

float

Examples

from kerb.embedding import embed, cosine_similarity
sim = cosine_similarity(embed("hello"), embed("hi"))
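
For reference, cosine similarity is the dot product divided by the product of the two L2 norms. The pure-Python sketch below follows that definition; the function name `cosine` is illustrative, and returning 0.0 for a zero vector is an assumption about edge-case handling rather than documented behavior.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine of the angle between a and b: dot(a, b) / (|a| * |b|)."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # assumption: the angle is undefined, so report no similarity
    return dot / (norm_a * norm_b)


same_direction = cosine([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine([1.0, 0.0], [0.0, 1.0])
opposite = cosine([1.0, 0.0], [-1.0, 0.0])
```

The score depends only on direction, so two vectors of different length that point the same way still score 1.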

kerb.embedding.euclidean_distance(vector1, vector2)[source]

Calculate Euclidean (L2) distance between two vectors.

Parameters:
  • vector1 (List[float]) – First embedding vector

  • vector2 (List[float]) – Second embedding vector

Returns:

Euclidean distance (0 = identical, higher = more different)

Return type:

float

kerb.embedding.manhattan_distance(vector1, vector2)[source]

Calculate Manhattan (L1) distance between two vectors.

Parameters:
  • vector1 (List[float]) – First embedding vector

  • vector2 (List[float]) – Second embedding vector

Returns:

Manhattan distance

Return type:

float

kerb.embedding.dot_product(vector1, vector2)[source]

Calculate dot product between two vectors.

Parameters:
  • vector1 (List[float]) – First embedding vector

  • vector2 (List[float]) – Second embedding vector

Returns:

Dot product score

Return type:

float
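
The three measures above follow directly from their textbook definitions. The sketch below spells them out in pure Python; the short function names are illustrative and are not the library's implementations.

```python
import math
from typing import List


def euclidean(a: List[float], b: List[float]) -> float:
    """L2 distance: square root of the summed squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def manhattan(a: List[float], b: List[float]) -> float:
    """L1 distance: sum of the absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))


def dot(a: List[float], b: List[float]) -> float:
    """Dot product: sum of the element-wise products."""
    return sum(x * y for x, y in zip(a, b))


l2 = euclidean([0.0, 0.0], [3.0, 4.0])
l1 = manhattan([0.0, 0.0], [3.0, 4.0])
dp = dot([1.0, 2.0], [3.0, 4.0])
```

For unit-length vectors, the dot product equals the cosine similarity and the squared Euclidean distance equals 2 minus twice the cosine, so all three produce the same ranking. Note that the distances are smaller for closer vectors, while the dot product is larger.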

kerb.embedding.batch_similarity(query_vector, vectors, metric='cosine')[source]

Calculate similarity between a query vector and multiple vectors.

Parameters:
  • query_vector (List[float]) – Query embedding vector

  • vectors (List[List[float]]) – List of embedding vectors to compare

  • metric (str) – Distance metric (“cosine”, “euclidean”, “manhattan”, “dot”)

Returns:

Similarity/distance scores

Return type:

List[float]

Examples

from kerb.embedding import embed, embed_batch, batch_similarity
query = embed("search query")
docs = embed_batch(["doc1", "doc2", "doc3"])
scores = batch_similarity(query, docs, metric="cosine")

kerb.embedding.top_k_similar(query_vector, vectors, k=5, metric='cosine', return_scores=False)[source]

Find top-k most similar vectors to a query vector.

Parameters:
  • query_vector (List[float]) – Query embedding vector

  • vectors (List[List[float]]) – List of embedding vectors to search

  • k (int) – Number of top results to return

  • metric (str) – Distance metric (“cosine”, “euclidean”, “manhattan”, “dot”)

  • return_scores (bool) – If True, return (index, score) tuples

Returns:

Top-k indices (or index-score pairs)

Return type:

Union[List[int], List[Tuple[int, float]]]

Examples

from kerb.embedding import embed, embed_batch, top_k_similar
query = embed("search query")
docs = embed_batch(["doc1", "doc2", "doc3"])
indices = top_k_similar(query, docs, k=2)
# Or with scores
results = top_k_similar(query, docs, k=2, return_scores=True)
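
A brute-force top-k search scores every candidate against the query and keeps the k best. The sketch below does this for the cosine metric using `heapq.nlargest`; `top_k` and `cosine` are illustrative helpers, and how the library orders results for distance metrics is not covered here.

```python
import heapq
import math
from typing import List, Tuple, Union


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def top_k(
    query: List[float],
    vectors: List[List[float]],
    k: int = 5,
    return_scores: bool = False,
) -> Union[List[int], List[Tuple[int, float]]]:
    """Return indices of the k vectors most similar to query, best first."""
    scores = [cosine(query, vector) for vector in vectors]
    # nlargest avoids sorting the full list when k is small.
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    if return_scores:
        return [(index, scores[index]) for index in best]
    return best


candidates = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
indices = top_k([1.0, 0.0], candidates, k=2)
with_scores = top_k([1.0, 0.0], candidates, k=2, return_scores=True)
```

This linear scan is fine for thousands of vectors; larger collections usually call for an approximate nearest-neighbor index instead.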

kerb.embedding.normalize_vector(vector)[source]

Normalize a vector to unit length (L2 norm = 1).

Parameters:

vector (List[float]) – Input vector

Returns:

Normalized vector

Return type:

List[float]

kerb.embedding.vector_magnitude(vector)[source]

Calculate the magnitude (L2 norm) of a vector.

Parameters:

vector (List[float]) – Input vector

Returns:

Vector magnitude

Return type:

float
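
Normalization and magnitude are closely related: normalizing divides each component by the magnitude. A minimal sketch follows, with illustrative names; returning the zero vector unchanged is an assumption about edge-case handling.

```python
import math
from typing import List


def magnitude(vector: List[float]) -> float:
    """L2 norm: square root of the sum of squared components."""
    return math.sqrt(sum(x * x for x in vector))


def normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length; a zero vector is returned as a copy."""
    norm = magnitude(vector)
    if norm == 0.0:
        return list(vector)  # assumption: nothing sensible to scale
    return [x / norm for x in vector]


length = magnitude([3.0, 4.0])
unit = normalize([3.0, 4.0])
```

Normalizing once at indexing time lets later comparisons use a plain dot product in place of the full cosine formula.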

kerb.embedding.mean_pooling(vectors)[source]

Calculate the mean of multiple vectors (centroid).

Useful for averaging embeddings of multiple texts.

Parameters:

vectors (List[List[float]]) – List of vectors to average

Returns:

Mean vector

Return type:

List[float]

Examples

from kerb.embedding import embed_batch, mean_pooling
# Average embeddings of multiple sentences
sentences = ["First sentence.", "Second sentence.", "Third sentence."]
embeddings = embed_batch(sentences)
avg_embedding = mean_pooling(embeddings)

kerb.embedding.weighted_mean_pooling(vectors, weights)[source]

Calculate weighted mean of multiple vectors.

Parameters:
  • vectors (List[List[float]]) – List of vectors

  • weights (List[float]) – Weight for each vector (will be normalized)

Returns:

Weighted mean vector

Return type:

List[float]

Examples

from kerb.embedding import embed_batch, weighted_mean_pooling
embeddings = embed_batch(["important", "less important"])
weighted_avg = weighted_mean_pooling(embeddings, weights=[0.8, 0.2])
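
The phrase "will be normalized" means the weights are rescaled to sum to 1 before averaging, so only their ratios matter. The sketch below shows that computation; `weighted_mean` is an illustrative name, and the validation rules are assumptions.

```python
from typing import List


def weighted_mean(vectors: List[List[float]], weights: List[float]) -> List[float]:
    """Weighted average of vectors; weights are normalized by their sum."""
    if not vectors or len(vectors) != len(weights):
        raise ValueError("need one weight per vector")
    total = sum(weights)
    if total == 0:
        raise ValueError("weights must not sum to zero")
    dimension = len(vectors[0])
    return [
        sum(weight * vector[i] for vector, weight in zip(vectors, weights)) / total
        for i in range(dimension)
    ]


# Weights 3 and 1 behave exactly like 0.75 and 0.25.
pooled = weighted_mean([[1.0, 0.0], [0.0, 1.0]], [3.0, 1.0])
# Equal weights reduce to ordinary mean pooling.
plain_mean = weighted_mean([[1.0, 2.0], [3.0, 4.0]], [1.0, 1.0])
```

The pooled vector is generally not unit length, so it may need renormalizing before it is compared with a dot product.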

kerb.embedding.max_pooling(vectors)[source]

Apply max pooling across multiple vectors (element-wise maximum).

Parameters:

vectors (List[List[float]]) – List of vectors

Returns:

Max-pooled vector

Return type:

List[float]
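
Element-wise maximum means each output component is the largest value seen at that position across all input vectors. A short sketch, assuming all vectors share one dimension; `max_pool` is an illustrative name.

```python
from typing import List


def max_pool(vectors: List[List[float]]) -> List[float]:
    """Take the maximum of each component across all vectors."""
    if not vectors:
        raise ValueError("need at least one vector")
    # zip(*vectors) regroups the data column by column.
    return [max(column) for column in zip(*vectors)]


pooled = max_pool([[1.0, 5.0, -2.0], [3.0, 2.0, -4.0]])
```

Unlike mean pooling, max pooling keeps the strongest signal per dimension, so one input with a large component dominates that position.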

kerb.embedding.embedding_dimension(vector)[source]

Get the dimension of an embedding vector.

Parameters:

vector (List[float]) – Embedding vector

Returns:

Vector dimension

Return type:

int

kerb.embedding.pairwise_similarities(vectors, metric='cosine')[source]

Calculate pairwise similarities between all vectors.

Returns a similarity matrix where element [i][j] is the similarity between vectors[i] and vectors[j].

Parameters:
  • vectors (List[List[float]]) – List of embedding vectors

  • metric (str) – Distance metric to use

Returns:

N x N similarity matrix

Return type:

List[List[float]]

Examples

from kerb.embedding import embed_batch, pairwise_similarities
docs = embed_batch(["doc1", "doc2", "doc3"])
sim_matrix = pairwise_similarities(docs)
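
Because cosine similarity is symmetric, an N x N matrix needs only the upper triangle computed; the lower triangle can be mirrored. The sketch below illustrates this for the cosine metric; `pairwise_cosine` and `cosine` are illustrative helpers, not the library's code.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def pairwise_cosine(vectors: List[List[float]]) -> List[List[float]]:
    """Build the symmetric N x N cosine similarity matrix."""
    n = len(vectors)
    matrix = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i, n):  # upper triangle, including the diagonal
            score = cosine(vectors[i], vectors[j])
            matrix[i][j] = score
            matrix[j][i] = score  # mirror into the lower triangle
    return matrix


matrix = pairwise_cosine([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
```

Time and memory both grow quadratically with the number of vectors, so this approach suits small collections.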

kerb.embedding.cluster_embeddings(vectors, threshold=0.8)[source]

Simple clustering of embeddings based on similarity threshold.

Groups embeddings that are similar above the threshold.

Parameters:
  • vectors (List[List[float]]) – List of embedding vectors

  • threshold (float) – Similarity threshold for clustering (0-1)

Returns:

List of clusters (each cluster is a list of indices)

Return type:

List[List[int]]

Examples

from kerb.embedding import embed_batch, cluster_embeddings
docs = embed_batch(["doc1", "doc2 similar to 1", "doc3 different"])
clusters = cluster_embeddings(docs, threshold=0.7)
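
The exact grouping strategy is not specified above. One simple possibility is a greedy single pass: each vector joins the first cluster whose first member is similar enough, otherwise it starts a new cluster. The sketch below implements that assumption; `greedy_clusters` is a hypothetical name, and the library may use a different rule.

```python
import math
from typing import List


def cosine(a: List[float], b: List[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norms if norms else 0.0


def greedy_clusters(vectors: List[List[float]], threshold: float = 0.8) -> List[List[int]]:
    """Greedy threshold clustering; returns lists of indices into vectors."""
    clusters: List[List[int]] = []
    for index, vector in enumerate(vectors):
        for cluster in clusters:
            # Assumption: compare against the cluster's first member only.
            if cosine(vectors[cluster[0]], vector) >= threshold:
                cluster.append(index)
                break
        else:
            clusters.append([index])  # no cluster was close enough
    return clusters


clusters = greedy_clusters([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]], threshold=0.8)
```

A greedy pass is order-dependent: shuffling the input can change the clusters, which is acceptable for quick deduplication but not for stable groupings.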

Embedding generation and similarity search helpers.