Salmon Run — GeistHaus

Book Review: Software Engineering for Data Scientists

Sujit Pal Feb 2, 2026 Updated Feb 2, 2026

As a Software Engineer (backend Web Development then Search) turned Data Scientist, I was particularly interested in what the book Software Engineering for Data Scientists by Andrew Treadway had to say about the reverse transition. Transitioning between sub-disciplines is a given in our industry -- I started life as a sales/support engineer, then moved to application programming, then back and


Book Review: Transformers In Action

Sujit Pal Jan 10, 2026 Updated Jan 10, 2026

The Attention Is All You Need paper proposed the Transformer Architecture as an improvement to the dominant encoder-decoder models of the time (both recurrent and convolutional). These models used an attention mechanism to connect the encoder and decoder parts, but the Transformer Architecture flipped the script, putting the Attention Mechanism at the center. An early implementation of the


Trip Report: PyData Global 2025

Sujit Pal Dec 26, 2025 Updated Jan 9, 2026

I attended PyData Global 2025 earlier this month. I had hoped to write this up earlier, but I've been busy, so I am only now getting the time, on Christmas morning. Merry Christmas to all my readers and best wishes for a Happy New 2026; hopefully it will be even better and more exciting (on the technology front) than this one! Taking stock of this year earlier today, I think I have some serious catching


Book Review: Time Series Forecasting using Foundation Models

Sujit Pal Oct 12, 2025 Updated Oct 12, 2025

As someone who primarily works in NLP and Search in the Health Domain, I don't have much use for Time Series. However, while exploring the Financial domain based on personal interest, I have been curious about Time Series for some time. Recently I attended the OpenHPI course Time Series Analysis taught by Mario Tormo Romero (even did the quizzes and the certificate of completion!). I was familiar


Book Review: Statistics every Programmer Needs

Sujit Pal Sep 20, 2025 Updated Sep 20, 2025

I recently read Statistics every Programmer Needs by Gary Sutton. I am probably a good target audience for the book since I used to be a software developer who transitioned into data science some 10 years ago, then into machine learning with neural networks and transformers, and more recently, to Generative AI with Large Language Models. During this time, I have read numerous books on statistics


Book Review: Hands-On Artificial Intelligence for IoT

Sujit Pal Jun 28, 2025 Updated Jun 28, 2025

For those in similar professional circles as I am in, i.e. looking forward into the Generative AI space, yet with one foot pragmatically and firmly stuck in Machine Learning (ML) and Deep Learning (DL) techniques of the (recent, ok, not very distant) past, you will find Dr Amita Kapoor's recent book Hands-On Artificial Intelligence for IoT: Expert Machine Learning and Deep Learning Techniques for


Book Review: Essential Graph RAG

Sujit Pal Jun 15, 2025 Updated Jun 15, 2025

Coming from a background of Knowledge Graph (KG) backed Medical Search, I don't need to be convinced about the importance of manually curated structured knowledge on the quality of search results. Traditional search is being rapidly replaced with Generative AI using a technique called Retrieval Augmented Generation (RAG), where the pipeline produces an answer summarizing the search results


Packaging ML Pipelines from Experiment to Deployment

Sujit Pal Dec 31, 2024 Updated Dec 31, 2024

As ML Engineers, we are generally tasked with solving some business problem with technology. Typically it involves leveraging data assets that your organization already owns or can acquire. Generally, unless it is a very simple problem, there would be more than one ML model involved, maybe different types of models depending on the sub-task, maybe other supporting tools such as a Search Index


Trip Report - PyData Global 2024

Sujit Pal Dec 9, 2024 Updated Dec 9, 2024

I attended PyData Global 2024 last week. It's a virtual conference, so I was able to attend it from the comfort of my home, although presentations seem to be scheduled to be maximally convenient, time-wise, for folks in the US East Coast and Western Europe, so some of them were a bit early for me. There were four main tracks -- the General Track, the Data / Data Science Track, the AI / ML track


Using Knowledge Graphs to enhance Retrieval Augmented Generation

Sujit Pal Oct 6, 2024 Updated Oct 6, 2024

Retrieval Augmented Generation (RAG) has become a popular approach to harness LLMs for question answering using your own corpus of data. Typically, the context to augment the query that is passed into the Large Language Model (LLM) to generate an answer comes from a database or search index containing your domain data. When it is a search index, the trend is to use Vector search (HNSW ANN based)


Experiments with Prompt Compression

Sujit Pal Jul 30, 2024 Updated Jul 30, 2024

I recently came across Prompt Compression (in the context of Prompt Engineering on Large Language Models) on this short course on Prompt Compression and Query Optimization from DeepLearning.AI. Essentially it involves compressing the prompt text using a trained model to drop non-essential tokens. The resulting prompt is shorter (and in cases of the original context being longer than the LLM's


Table Extraction from PDFs using Multimodal (Vision) LLMs

Sujit Pal Jul 1, 2024 Updated Jul 1, 2024

A couple of weeks ago a colleague and I participated in an internal hackathon where the task was to come up with an interesting use case using the recent multi-modal Large Language Models (LLMs). Multi-modal LLMs take not only text inputs via their prompt like earlier LLMs, but can also accept non-text modalities such as images and audio. Some examples of multi-modal LLMs are GPT-4o from OpenAI,


Book Report: Pandas Workout

Sujit Pal Jun 24, 2024 Updated Jun 24, 2024

Unlike many Data Scientists, I didn't automatically reach for Pandas when I needed to analyze data. I came upon this discipline (Data Science) as a Java Software Engineer who used Python for scripting, so I was quite comfortable operating on JSON / CSV / text files directly, loading data into relational databases and running SQL against them, and building visualizations with Matplotlib. So when


Finetuning RAGAS Metrics using DSPy

Sujit Pal May 18, 2024 Updated May 18, 2024

Last month, I decided to sign up for the Google AI Hackathon, where Google provided access to their Gemini Large Language Model (LLM) and tasked participants with building a creative application on top of it. I have worked with Anthropic's Claude and OpenAI's GPT-3 at work previously, and I was curious to see how Gemini stacked up against them. I was joined in that effort by David Campbell and


Performance Analysis of Float vs Byte vs Binary Vectors on OpenSearch

Sujit Pal May 15, 2024 Updated May 16, 2024

I've been working on an application where, given an input string, the objective is to recommend an output string that is similar to the input string, for some notion of similarity. A machine learning model, in this case a SentenceTransformers model, is taught this notion of similarity by showing it many examples of input-output pairs. The model's weights are then used to encode the part to be


KGC/HCLS 2024 Trip Report

Sujit Pal May 7, 2024 Updated May 9, 2024

I was at KGC (Knowledge Graph Conference) 2024, which is happening May 6-10 at Cornell Tech. I was presenting (virtually) at their Health Care and Life Sciences (HCLS) workshop, so my speaker's pass was only valid for today for the HCLS portion of KGC. My trip report covers a few talks that I attended here. Attending virtually was a bit chaotic as sessions went over sometimes, so you might leave a


Book Report: Machine Learning for Drug Discovery

Sujit Pal Mar 23, 2024 Updated Mar 24, 2024

Drug Discovery is a field where biochemists (and more recently computer scientists) turn ideas into potential medications. I first came across a few applications in this area when checking out how to build Graph Neural Networks (GNN) as part of auditing the CS224W: Machine Learning with Graphs course from Stanford, some learnings of which I recycled into my Deep Learning with Graphs tutorial at


Hierarchical (and other) Indexes using LlamaIndex for RAG Content Enrichment

Sujit Pal Mar 17, 2024 Updated Mar 17, 2024

At our weekly This Week in Machine Learning (TWIML) meetings, (our leader and facilitator) Darin Plutchok pointed out a LinkedIn blog post on Semantic Chunking that has been recently implemented in the LangChain framework. Unlike more traditional chunking approaches that use number of tokens or separator tokens as a guide, this one chunks groups of sentences into semantic units by breaking them


Thoughts on using LangChain LCEL with Claude

Sujit Pal Feb 25, 2024 Updated Feb 25, 2024

I got into Natural Language Processing (NLP) and Machine Learning (ML) through Search. And this led me into Generative AI (GenAI), which led me back to Search via Retrieval Augmented Generation (RAG). RAG started out relatively simple -- take a query, generate search results, use search results as context for a Large Language Model (LLM) to generate an abstractive summary of the results. Back


Book Report: Allen B Downey's Probably Overthinking It

Sujit Pal Feb 3, 2024 Updated Feb 4, 2024

I have read Allen Downey's books on statistics in the past, when trying to turn myself from a Software Engineer into what Josh Wills says a Data Scientist is -- someone who is better at statistics than a Software Engineer and better at software than a statistician (with somewhat limited success in the first area, I will hasten to add). Last year, I had the good fortune to present at PyData Global


Knowledge Graph Aligned Entity Linker using SentenceTransformers

Sujit Pal Jan 1, 2024 Updated Jan 1, 2024

Most of us are familiar with Named Entity Recognizers (NERs) that can recognize spans in text as belonging to a small number of classes, such as Person (PER), Organization (ORG), Location (LOC), etc. These are usually multi-class classifier models, trained on input sequences to return BIO (Begin-Inside-Outside) tags for each token. However, recognizing entities in a Knowledge Graph (KG) using this
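
To make the BIO tagging scheme mentioned in this excerpt concrete, here is a minimal sketch in plain Python. It is not taken from the post: the function name `bio_tags`, the example sentence, and the span format (start, end-exclusive, label) are all assumptions chosen for illustration.

```python
def bio_tags(tokens, spans):
    """Assign one BIO tag per token.

    spans is a list of (start, end, label) tuples over token indices,
    with end exclusive. This span format is an assumption for the sketch.
    """
    # Every token starts out as "O" (outside any entity).
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        # The first token of an entity gets the B- prefix...
        tags[start] = f"B-{label}"
        # ...and any remaining tokens of the same entity get the I- prefix.
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"
    return tags


# Hypothetical example sentence with one PER and one LOC entity.
tokens = ["Barack", "Obama", "visited", "New", "York", "City", "."]
spans = [(0, 2, "PER"), (3, 6, "LOC")]
tags = bio_tags(tokens, spans)

for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A token classifier of the kind described is trained to predict one such tag per token, so the number of output classes grows with the number of entity types.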


PyData Global 2023: Trip Report

Sujit Pal Dec 9, 2023 Updated Dec 9, 2023

I had the opportunity to present at PyData Global this year. It is a virtual conference that ran over 3 days in multiple tracks from December 6 to 8. I talked about Building Learning to Rank models for search using Large Language Models. For those attending the conference, I already shared the links to the slides and the associated code on its Discord channel, but for those who are not, they are


Building Learning to Rank Models with Generative AI

Sujit Pal Dec 3, 2023 Updated Dec 3, 2023

Generative AI has been the new cool kid on the AI / ML block since early this year. Like everyone else, I continue to be amazed and wowed with each successive success story as they break existing benchmark records and showcase novel applications built on top of their new functionality. I was also lucky to be involved in a Generative AI project since the middle of this year, which gave me access


A PySpark idiom for efficient Model Inference

Sujit Pal Oct 7, 2023 Updated Dec 11, 2023

I recently needed to build an Apache Spark (PySpark) job where the task was (among other things) to use a Language Model (LM) to encode text into vectors. This is an embarrassingly parallel job where the mapping from text to encoding is one-to-one, so something like Spark works very well here. We could, in theory at least, achieve an N-fold performance improvement by horizontally partitioning the data into N
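
The general shape of that idea (split the data into N partitions, encode each partition independently, then concatenate the results in order) can be sketched without Spark. This is only an illustration under assumptions, not the post's PySpark code: `partition`, `encode_partition`, and `encode_all` are hypothetical names, and the "encoder" is a toy stand-in for a real language model.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(items, n):
    """Split items into n contiguous partitions of near-equal size."""
    size, extra = divmod(len(items), n)
    parts, start = [], 0
    for i in range(n):
        # The first `extra` partitions take one additional item each.
        end = start + size + (1 if i < extra else 0)
        parts.append(items[start:end])
        start = end
    return parts


def encode_partition(texts):
    """Toy stand-in for a language model: one small vector per text."""
    return [[float(len(t)), float(t.count(" ") + 1)] for t in texts]


def encode_all(texts, n):
    """Encode every text by processing n partitions concurrently."""
    parts = partition(texts, n)
    with ThreadPoolExecutor(max_workers=n) as pool:
        # pool.map yields results in the same order as the input partitions.
        results = pool.map(encode_partition, parts)
        return [vec for part in results for vec in part]


texts = ["a b", "hello", "one two three", "x"]
vectors = encode_all(texts, 2)
print(vectors)
```

Because each text maps to exactly one vector, the partitions never need to communicate, which is what makes the job embarrassingly parallel. Threads are used here only to show the structure; a real speedup depends on running the partitions on separate workers.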


BMI 702 Review Part IV -- Biomedical Imaging

Sujit Pal Jun 25, 2023 Updated Jun 25, 2023

Here is Part IV of my ongoing review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundation of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous reviews in this series, they are listed below. BMI 702 Review Part I BMI 702 Review Part II (Graph Learning) BMI 702 Review Part III (
