Building scalable feature pipelines and retrieval-augmented generation systems with distributed computing.
Introduction
In today’s data-driven world, organizations are increasingly turning to distributed computing to handle large-scale machine learning workloads. When it comes to feature engineering and retrieval-augmented generation (RAG) systems, the combination of Feast and Ray provides a powerful solution for building scalable, production-ready pipelines.
This blog post explores how Feast’s integration with Ray enables distributed processing for both traditional feature engineering and modern RAG applications, with support for Kubernetes deployment through KubeRay.
Why Feast + Ray for Distributed Processing?
Modern ML applications face several scaling challenges:
- Large Datasets: Processing millions of documents for embedding generation
- Complex Transformations: CPU-intensive operations like text processing and feature engineering
- Real-time Requirements: Low-latency retrieval for RAG applications
- Resource Management: Efficient utilization of compute resources across clusters
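Feast’s integration with Ray addresses these challenges by running feature computations, transformations, and joins as distributed work on Ray, from a local machine up to a Kubernetes cluster.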
Getting Started
1. Install Dependencies
pip install feast[ray]
2. Initialize Ray RAG Template
feast init -t ray_rag my_rag_project
cd my_rag_project/feature_repo
3. Configure Feature Store
# feature_store.yaml
project: my_project
registry: data/registry.db
provider: local
offline_store:
  type: ray
  storage_path: data/ray_storage
  broadcast_join_threshold_mb: 100
  max_parallelism_multiplier: 2
  target_partition_size_mb: 64
batch_engine:
  type: ray.engine
  max_workers: 12
  enable_optimization: true
  broadcast_join_threshold_mb: 100
  target_partition_size_mb: 64
online_store:
  type: milvus
  path: data/online_store.db
  vector_enabled: true
  embedding_dim: 384
  index_type: "FLAT"
  metric_type: "COSINE"
4. Apply and Materialize
feast apply
feast materialize --disable-event-timestamp
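Before materializing, it can help to confirm that the vectors you plan to store match the `embedding_dim` declared for the online store. The sketch below is a plain-Python illustration, not part of Feast: the `check_embedding` helper is hypothetical, and the dictionary simply mirrors the `online_store` values from the configuration above. The `all-MiniLM-L6-v2` model used later in this post produces 384-dimensional vectors, which is why the configuration uses 384.

```python
# Values copied from the online_store section of feature_store.yaml.
online_store = {"type": "milvus", "embedding_dim": 384, "metric_type": "COSINE"}


def check_embedding(vector, store_config):
    """Hypothetical helper: return a list of problems (empty means the vector fits)."""
    problems = []
    expected = store_config["embedding_dim"]
    if len(vector) != expected:
        problems.append(f"expected {expected} dimensions, got {len(vector)}")
    if not all(isinstance(value, float) for value in vector):
        problems.append("embedding values should be floats")
    return problems


# A vector of the right size passes; a truncated one is reported.
ok_problems = check_embedding([0.0] * 384, online_store)
bad_problems = check_embedding([0.0] * 128, online_store)
```

Running a check like this on one sample embedding is cheap compared with discovering a dimension mismatch after a full materialization run.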

Feast supports three execution modes for Ray integration:
- Local Mode (Development)
offline_store:
  type: ray
  storage_path: data/ray_storage
  # Conservative settings for local development
  broadcast_join_threshold_mb: 25
  max_parallelism_multiplier: 1
  target_partition_size_mb: 16
  enable_ray_logging: false
- Remote Ray Cluster
offline_store:
  type: ray
  storage_path: s3://my-bucket/feast-data
  ray_address: "ray://my-cluster.example.com:10001"
- KubeRay (Kubernetes)
offline_store:
  type: ray
  storage_path: s3://my-bucket/feast-data
  use_kuberay: true
  kuberay_conf:
    cluster_name: "feast-ray-cluster"
    namespace: "feast-system"
    auth_token: "${RAY_AUTH_TOKEN}"
    auth_server: "https://api.openshift.com:6443"
    skip_tls: false
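The three configurations differ only in which keys appear under `offline_store`. As a rough illustration, the sketch below classifies a configuration dictionary by those keys. The `resolve_ray_mode` function and its precedence (KubeRay first, then a remote address, then local) are assumptions made for this example and are not necessarily how Feast resolves the mode internally.

```python
def resolve_ray_mode(offline_store):
    """Hypothetical helper: guess the execution mode from the config keys."""
    if offline_store.get("use_kuberay"):
        return "kuberay"
    if offline_store.get("ray_address"):
        return "remote"
    return "local"


# Dictionaries mirroring the three YAML examples above.
local_conf = {"type": "ray", "storage_path": "data/ray_storage"}
remote_conf = {
    "type": "ray",
    "storage_path": "s3://my-bucket/feast-data",
    "ray_address": "ray://my-cluster.example.com:10001",
}
kuberay_conf = {
    "type": "ray",
    "storage_path": "s3://my-bucket/feast-data",
    "use_kuberay": True,
    "kuberay_conf": {"cluster_name": "feast-ray-cluster", "namespace": "feast-system"},
}

modes = [resolve_ray_mode(c) for c in (local_conf, remote_conf, kuberay_conf)]
```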
Component Responsibilities
- Ray Compute Engine: Executes distributed feature computations, transformations, and joins
- Ray Offline Store: Handles data I/O operations, reading from various sources (Parquet, CSV, etc.)
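To make this division of labor concrete, here is a toy sketch that uses plain Python lists in place of Ray Datasets. One function plays the offline store's read role, and the others play compute-engine stages (join, filter, aggregate, transform) chained one after another. Every function name and the sample data are hypothetical; Feast's real nodes operate on distributed datasets and are considerably more involved.

```python
from collections import defaultdict


def read_node(rows):
    """Offline-store role: hand the source rows to the pipeline."""
    return list(rows)


def join_node(entities, features, key):
    """Attach feature rows to the entity rows that share the same key."""
    by_key = defaultdict(list)
    for row in features:
        by_key[row[key]].append(row)
    return [{**entity, **feat} for entity in entities for feat in by_key[entity[key]]]


def filter_node(rows, predicate):
    """Keep only the rows that satisfy the predicate."""
    return [row for row in rows if predicate(row)]


def aggregate_node(rows, key, field):
    """Sum one numeric field per key."""
    totals = defaultdict(float)
    for row in rows:
        totals[row[key]] += row[field]
    return [{key: k, f"{field}_sum": total} for k, total in totals.items()]


def transform_node(rows, fn):
    """Apply a row-level transformation."""
    return [fn(row) for row in rows]


entity_df = [{"driver_id": 1}, {"driver_id": 2}]
source = [
    {"driver_id": 1, "trips": 3.0},
    {"driver_id": 1, "trips": 4.0},
    {"driver_id": 2, "trips": -1.0},  # invalid value, removed by the filter
    {"driver_id": 3, "trips": 9.0},   # not in entity_df, dropped by the join
]

features = read_node(source)
joined = join_node(entity_df, features, key="driver_id")
valid = filter_node(joined, lambda row: row["trips"] >= 0)
summed = aggregate_node(valid, key="driver_id", field="trips")
output = transform_node(summed, lambda row: {**row, "is_active": row["trips_sum"] > 5})
```

Because each stage only consumes the previous stage's output, the stages form a simple chain that an engine can schedule and parallelize independently.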
The Ray compute engine follows a DAG-based architecture:
EntityDF → RayReadNode → RayJoinNode → RayFilterNode → RayAggregationNode → RayTransformationNode → Output
RAG Implementation with Ray
Distributed Embedding Generation
One of the most powerful use cases for Feast + Ray is distributed embedding generation for RAG systems:
from datetime import timedelta

from feast import BatchFeatureView, Entity, Field, FileSource
from feast.types import Array, Float32, String
from feast.transformation.ray_transformation import RayTransformation


# Embedding processor for distributed Ray processing
class EmbeddingProcessor:
    """Generate embeddings using SentenceTransformer model."""

    def __init__(self):
        # Imported here so the model is loaded on the Ray worker, not the driver
        import torch
        from sentence_transformers import SentenceTransformer

        device = "cuda" if torch.cuda.is_available() else "cpu"
        self.model = SentenceTransformer("all-MiniLM-L6-v2", device=device)

    def __call__(self, batch):
        """Process batch and generate embeddings."""
        descriptions = batch["Description"].fillna("").tolist()
        embeddings = self.model.encode(
            descriptions,
            show_progress_bar=False,
            batch_size=128,
            normalize_embeddings=True,
            convert_to_numpy=True,
        )
        batch["embedding"] = embeddings.tolist()
        return batch


# Ray native UDF for distributed processing
def generate_embeddings_ray_native(ds):
    """Distributed embedding generation using Ray Data."""
    max_workers = 8
    batch_size = 2500
    # Optimize partitioning for available workers
    num_blocks = ds.num_blocks()
    if num_blocks < max_workers:
        ds = ds.repartition(max_workers)
    result = ds.map_batches(
        EmbeddingProcessor,
        batch_format="pandas",
        concurrency=max_workers,
        batch_size=batch_size,
    )
    return result


# Feature view with Ray transformation.
# `document` (an Entity) and `movies_source` (a FileSource) are not shown here;
# they are assumed to be defined elsewhere in the feature repository.
document_embeddings_view = BatchFeatureView(
    name="document_embeddings",
    entities=[document],
    mode="ray",  # Native Ray Dataset mode
    ttl=timedelta(days=365 * 100),
    schema=[
        Field(name="document_id", dtype=String),
        Field(name="embedding", dtype=Array(Float32), vector_index=True),
        Field(name="movie_name", dtype=String),
        Field(name="movie_director", dtype=String),
    ],
    source=movies_source,
    udf=generate_embeddings_ray_native,
    online=True,
)
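Passing a class rather than a function to `map_batches` matters here: Ray Data constructs the class once per worker and then reuses that instance for every batch the worker receives, so the expensive model load in `__init__` is not repeated per batch. The sketch below imitates that lifecycle in plain Python, with no Ray involved. `CountingProcessor` and `map_batches_sim` are hypothetical stand-ins, and the round-robin assignment is a simplification of how Ray actually schedules batches.

```python
class CountingProcessor:
    """Stand-in for EmbeddingProcessor: setup in __init__, work in __call__."""

    instances_created = 0

    def __init__(self):
        # Pretend this is the expensive model load.
        CountingProcessor.instances_created += 1
        self.batches_seen = 0

    def __call__(self, batch):
        self.batches_seen += 1
        # Placeholder "embedding": the length of each text.
        return [len(text) for text in batch]


def map_batches_sim(rows, processor_cls, batch_size, concurrency):
    """Sequential imitation: one processor per worker, batches dealt round-robin."""
    workers = [processor_cls() for _ in range(concurrency)]
    batches = [rows[i:i + batch_size] for i in range(0, len(rows), batch_size)]
    results = []
    for index, batch in enumerate(batches):
        worker = workers[index % concurrency]
        results.extend(worker(batch))
    return results, workers


rows = [f"doc {i}" for i in range(10)]
lengths, workers = map_batches_sim(rows, CountingProcessor, batch_size=3, concurrency=2)
# Ten rows in batches of three give four batches, handled by only two instances.
```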
Vector Search Integration
Feast integrates with vector databases like Milvus for efficient similarity search:
online_store:
  type: milvus
  path: data/online_store.db
  vector_enabled: true
  embedding_dim: 384
  index_type: "FLAT"
  metric_type: "COSINE"
RAG Query Example
from feast import FeatureStore
from sentence_transformers import SentenceTransformer

# Initialize feature store
store = FeatureStore(repo_path=".")

# Generate query embedding
model = SentenceTransformer("all-MiniLM-L6-v2")
query_embedding = model.encode(["sci-fi movie about space"])[0].tolist()

# Retrieve similar documents
results = store.retrieve_online_documents_v2(
    features=[
        "document_embeddings:embedding",
        "document_embeddings:movie_name",
        "document_embeddings:movie_director",
    ],
    query=query_embedding,
    top_k=5,
).to_dict()

# Display results
for i in range(len(results["document_id_pk"])):
    print(f"{i+1}. {results['movie_name'][i]}")
    print(f"   Director: {results['movie_director'][i]}")
    print(f"   Distance: {results['distance'][i]:.3f}")
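For intuition about what the retrieval call does with `index_type: "FLAT"` and `metric_type: "COSINE"`, a FLAT index compares the query against every stored vector, and the cosine metric ranks candidates by the angle between vectors. The sketch below is a minimal pure-Python version of that exhaustive search. It is not Milvus or Feast code; the function names and the three-dimensional toy vectors are made up for illustration.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def top_k(query, documents, k):
    """Score every document against the query and keep the k best."""
    scored = [(cosine_similarity(query, vec), name) for name, vec in documents.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [(name, score) for score, name in scored[:k]]


# Toy 3-dimensional "embeddings"; real ones in this post have 384 dimensions.
documents = {
    "space_drama": [0.9, 0.1, 0.0],
    "car_chase": [0.0, 0.2, 0.9],
    "space_documentary": [0.8, 0.3, 0.1],
}
query = [1.0, 0.0, 0.0]
nearest = top_k(query, documents, k=2)
```

Exhaustive search is exact but scales linearly with the number of stored vectors, which is one reason larger deployments often move to approximate index types.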

# feast init -t ray_rag
Downloading real IMDB movie data for RAG demonstration...
Attempting to download IMDB dataset...
Kaggle API found, checking authentication...
Dataset URL: https://www.kaggle.com/datasets/yashgupta24/48000-movies-dataset
Dataset downloaded successfully!
Found dataset file: final_data.csv
Dataset shape: (48513, 18)
Columns: ['id', 'url', 'Name', 'PosterLink', 'Genres', 'Actors', 'Director', 'Description', 'DatePublished', 'Keywords', 'RatingCount', 'BestRating', 'WorstRating', 'RatingValue', 'ReviewAurthor', 'ReviewDate', 'ReviewBody', 'duration']
Successfully downloaded Kaggle dataset with 48513 movies
Copied CSV to: /feast-projects/handy_wildcat/feature_repo/data/final_data.csv
Ray RAG template initialized successfully!
To get started:
1. cd handy_wildcat/feature_repo
2. feast apply
3. feast materialize --disable-event-timestamp
4. python test_workflow.py
Creating a new Feast repository in /feast-projects/handy_wildcat.

# python test_workflow.py
Feature views: 1
Vector similarity search with Feast ...
Query: 'Science fiction movie'
Top 3 results:
1. Science Fiction
   Director: Danny Deprez | Genres: Adventure,Family,Mystery,Sci-Fi,Thriller
2. The Scientist
   Director: Zach LeBeau | Genres: Drama,Sci-Fi
3. Life
   Director: Daniel Espinosa | Genres: Horror,Sci-Fi,Thriller
Query: 'Action movie with explosions and car chases'
Top 3 results:
1. Blastfighter
   Director: Lamberto Bava | Genres: Action,Crime,Drama,Mystery
2. Blast
   Director: Anthony Hickox | Genres: Action,Thriller
3. Wreckage
   Director: John Asher | Genres: Horror,Thriller
Query: 'Space exploration and time travel film'
Top 3 results:
1. The Space Movie
   Director: Tony Palmer | Genres: Documentary
2. An Adventure in Space and Time
   Director: Terry McDonough | Genres: Biography,Drama,History
3. Voyage of Time: Life's Journey
   Director: Terrence Malick | Genres: Documentary,Drama
What was demonstrated:
- Ray-based distributed embedding generation
- Milvus vector storage and retrieval
- Similarity search
- Raw Data to Search workflow
Next steps:
- Scale to larger datasets
- Connect to distributed Ray cluster
Whether you’re building traditional feature pipelines or modern RAG systems, Feast + Ray offers the scalability and performance needed for production workloads. The integration supports everything from local development to large-scale Kubernetes deployments, making it an ideal choice for organizations looking to scale their ML infrastructure.