GeistHaus

Salmon Run

Part of feedburner.com

Swimming upstream on the technology tide, one technology at a time. A collection of articles, tips, and random musings on application development and system design.

stories
Book Review: Software Engineering for Data Scientists
data-science, python, software-engineering
As a Software Engineer (backend Web Development then Search) turned Data Scientist, I was particularly interested in what the book Software Engineering for Data Scientists by Andrew Treadway had to say about the reverse transition. Transitioning between sub-disciplines is a given in our industry -- I started life as a sales/support engineer, then moved to application programming, then back and
tag:blogger.com,1999:blog-7583720.post-5766091247983370548
Book Review: Transformers In Action
transformers
The Attention Is All You Need paper proposed the Transformer Architecture as an improvement to the dominant encoder-decoder models of the time (both recurrent and convolutional). These models used an attention mechanism to connect the encoder and decoder parts, but the Transformer Architecture flipped the script, putting the Attention Mechanism at the center. An early implementation of the
tag:blogger.com,1999:blog-7583720.post-2407699836506847262
Trip Report: PyData Global 2025
data-science, python
I attended PyData Global 2025 earlier this month. I had hoped to write this up earlier, but I've been busy, so I am only now getting the time, on Christmas morning. Merry Christmas to all my readers and best wishes for a Happy New 2026; hopefully it will be even better and more exciting (on the technology front) than this one! Taking stock of this year earlier today, I think I have some serious catching
tag:blogger.com,1999:blog-7583720.post-1413917009070635405
Book Review: Time Series Forecasting using Foundation Models
python, time-series
As someone who primarily works in NLP and Search in the Health Domain, I don't have much use for Time Series. However, while exploring the Financial domain based on personal interest, I have been curious about Time Series for some time. Recently I attended the OpenHPI course Time Series Analysis taught by Mario Tormo Romero (even did the quizzes and the certificate of completion!). I was familiar
tag:blogger.com,1999:blog-7583720.post-3119510747610131509
Book Review: Statistics every Programmer Needs
statistics
I recently read Statistics every Programmer Needs by Gary Sutton. I am probably a good target audience for the book since I used to be a software developer who transitioned into data science some 10 years ago, then into machine learning with neural networks and transformers, and more recently, to Generative AI with Large Language Models. During this time, I have read numerous books on statistics
tag:blogger.com,1999:blog-7583720.post-3625547917329916954
Book Review: Hands-On Artificial Intelligence for IoT
deep-learning, iot, machine-learning, python, time-series
If you are in similar professional circles as I am, i.e. looking forward into the Generative AI space, yet with one foot pragmatically and firmly stuck in Machine Learning (ML) and Deep Learning (DL) techniques of the (recent, ok, not very distant) past, you will find Dr Amita Kapoor's recent book Hands-On Artificial Intelligence for IoT: Expert Machine Learning and Deep Learning Techniques for
tag:blogger.com,1999:blog-7583720.post-555974058258313411
Book Review: Essential Graph RAG
knowledge-graph, large-language-models, retrieval-augmented-generation, search
Coming from a background of Knowledge Graph (KG) backed Medical Search, I don't need to be convinced about the importance of manually curated structured knowledge to the quality of search results. Traditional search is being rapidly replaced with Generative AI using a technique called Retrieval Augmented Generation (RAG), where the pipeline produces an answer summarizing the search results
tag:blogger.com,1999:blog-7583720.post-6720469508089089268
Packaging ML Pipelines from Experiment to Deployment
machine-learning, python, software-engineering
As ML Engineers, we are generally tasked with solving some business problem with technology. Typically it involves leveraging data assets that your organization already owns or can acquire. Generally, unless it is a very simple problem, there would be more than one ML model involved, maybe different types of models depending on the sub-task, maybe other supporting tools such as a Search Index
tag:blogger.com,1999:blog-7583720.post-4027855785487586520
Trip Report - PyData Global 2024
general, python
I attended PyData Global 2024 last week. It's a virtual conference, so I was able to attend it from the comfort of my home, although presentations seem to be scheduled to be maximally convenient, time-wise, for folks on the US East Coast and in Western Europe, so some of them were a bit early for me. There were four main tracks -- the General Track, the Data / Data Science Track, the AI / ML track
tag:blogger.com,1999:blog-7583720.post-5810773948471156097
Using Knowledge Graphs to enhance Retrieval Augmented Generation
generative-ai, graph, question-answering, retrieval-augmented-generation, search
Retrieval Augmented Generation (RAG) has become a popular approach to harness LLMs for question answering using your own corpus of data. Typically, the context to augment the query that is passed into the Large Language Model (LLM) to generate an answer comes from a database or search index containing your domain data. When it is a search index, the trend is to use Vector search (HNSW ANN based)
tag:blogger.com,1999:blog-7583720.post-3351294832206390671
Experiments with Prompt Compression
generative-ai, information-retrieval, large-language-models
I recently came across Prompt Compression (in the context of Prompt Engineering on Large Language Models) on this short course on Prompt Compression and Query Optimization from DeepLearning.AI. Essentially it involves compressing the prompt text using a trained model to drop non-essential tokens. The resulting prompt is shorter (and in cases of the original context being longer than the LLM's
tag:blogger.com,1999:blog-7583720.post-8297257525824569369
Table Extraction from PDFs using Multimodal (Vision) LLMs
large-language-models, vision-models
A couple of weeks ago a colleague and I participated in an internal hackathon where the task was to come up with an interesting use case using the recent multi-modal Large Language Models (LLMs). Multi-modal LLMs take not only text inputs via their prompt like earlier LLMs, but can also accept non-text modalities such as images and audio. Some examples of multi-modal LLMs are GPT-4o from OpenAI,
tag:blogger.com,1999:blog-7583720.post-1724681603841217995
Book Report: Pandas Workout
data-analysis, data-management, data-science, pandas, python
Unlike many Data Scientists, I didn't automatically reach for Pandas when I needed to analyze data. I came upon this discipline (Data Science) as a Java Software Engineer who used Python for scripting, so I was quite comfortable operating on JSON / CSV / text files directly, loading data into relational databases and running SQL against them, and building visualizations with Matplotlib. So when
tag:blogger.com,1999:blog-7583720.post-2292899831695767070
Finetuning RAGAS Metrics using DSPy
evaluation, generative-ai, information-retrieval, large-language-models, python
Last month, I decided to sign up for the Google AI Hackathon, where Google provided access to their Gemini Large Language Model (LLM) and tasked participants with building a creative application on top of it. I have worked with Anthropic's Claude and OpenAI's GPT-3 at work previously, and I was curious to see how Gemini stacked up against them. I was joined in that effort by David Campbell and
tag:blogger.com,1999:blog-7583720.post-2003044845915300861
Performance Analysis of Float vs Byte vs Binary Vectors on OpenSearch
search, vector-search
I've been working on an application where, given an input string, the objective is to recommend an output string that is similar to the input string, for some notion of similarity. A machine learning model, in this case a SentenceTransformers model, is taught this notion of similarity by showing it many examples of input-output pairs. The model's weights are then used to encode the part to be
tag:blogger.com,1999:blog-7583720.post-2267706794105371634
KGC/HCLS 2024 Trip Report
conference, knowledge-graph
I was at KGC (Knowledge Graph Conference) 2024, which is happening May 6-10 at Cornell Tech. I was presenting (virtually) at their Health Care and Life Sciences (HCLS) workshop, so my speaker's pass was only valid for today for the HCLS portion of KGC. My trip report covers a few talks that I attended here. Attending virtually was a bit chaotic as sessions went over sometimes, so you might leave a
tag:blogger.com,1999:blog-7583720.post-7969812542594217370
Book Report: Machine Learning for Drug Discovery
biomedical-informatics, general, machine-learning
Drug Discovery is a field where biochemists (and more recently computer scientists) turn ideas into potential medications. I first came across a few applications in this area when checking out how to build Graph Neural Networks (GNN) as part of auditing the CS224W: Machine Learning with Graphs course from Stanford, some learnings of which I recycled into my Deep Learning with Graphs tutorial at
tag:blogger.com,1999:blog-7583720.post-8966343464351206346
Hierarchical (and other) Indexes using LlamaIndex for RAG Content Enrichment
generative-ai, large-language-models, python, summarization
At our weekly This Week in Machine Learning (TWIML) meetings, our leader and facilitator Darin Plutchok pointed out a LinkedIn blog post on Semantic Chunking that has been recently implemented in the LangChain framework. Unlike more traditional chunking approaches that use number of tokens or separator tokens as a guide, this one chunks groups of sentences into semantic units by breaking them
tag:blogger.com,1999:blog-7583720.post-2444654388244073272
Thoughts on using LangChain LCEL with Claude
large-language-models, prompt-engineering, question-answering, question-generation
I got into Natural Language Processing (NLP) and Machine Learning (ML) through Search. And this led me into Generative AI (GenAI), which led me back to Search via Retrieval Augmented Generation (RAG). RAG started out relatively simple -- take a query, generate search results, use search results as context for a Large Language Model (LLM) to generate an abstractive summary of the results. Back
tag:blogger.com,1999:blog-7583720.post-13596624551983867
Book Report: Allen B Downey's Probably Overthinking It
statistics
I have read Allen Downey's books on statistics in the past, when trying to turn myself from a Software Engineer into what Josh Wills says a Data Scientist is -- someone who is better at statistics than a Software Engineer and better at software than a statistician (with somewhat limited success in the first area, I will hasten to add). Last year, I had the good fortune to present at PyData Global
tag:blogger.com,1999:blog-7583720.post-6145352586582115069
Knowledge Graph Aligned Entity Linker using SentenceTransformers
knowledge-graph, named-entity-linking, named-entity-recognition, nlp, sentence-transformers, transformers
Most of us are familiar with Named Entity Recognizers (NERs) that can recognize spans in text as belonging to a small number of classes, such as Person (PER), Organization (ORG), Location (LOC), etc. These are usually multi-class classifier models, trained on input sequences to return BIO (Begin-Inside-Outside) tags for each token. However, recognizing entities in a Knowledge Graph (KG) using this
tag:blogger.com,1999:blog-7583720.post-1026445863886131169
PyData Global 2023: Trip Report
general, python
I had the opportunity to present at PyData Global this year. It is a virtual conference that ran over 3 days in multiple tracks from December 6 to 8. I talked about Building Learning to Rank models for search using Large Language Models. For those attending the conference, I already shared the links to the slides and the associated code on its Discord channel, but for those who are not, they are
tag:blogger.com,1999:blog-7583720.post-5896316554942919450
Building Learning to Rank Models with Generative AI
Generative AI has been the new cool kid on the AI / ML block since early this year. Like everyone else, I continue to be amazed and wowed with each successive success story as they break existing benchmark records and showcase novel applications built on top of their new functionality. I was also lucky to be involved in a Generative AI project since the middle of this year, which gave me access
tag:blogger.com,1999:blog-7583720.post-9162304331871008979
A PySpark idiom for efficient Model Inference
machine-learning, python, spark
I recently needed to build an Apache Spark (PySpark) job where the task was (among other things) to use a Language Model (LM) to encode text into vectors. This is an embarrassingly parallel job where the text-to-encoding mapping is one to one, so something like Spark works very well here. We could, in theory at least, achieve an N-fold performance improvement by horizontally partitioning the data into N
tag:blogger.com,1999:blog-7583720.post-6305567007053567601
BMI 702 Review Part IV -- Biomedical Imaging
biomedical-informatics, image-processing
Here is Part IV of my ongoing review of the Biomedical Artificial Intelligence (BMI 702) course, part of Harvard's Foundation of Biomedical Informatics 2023 Spring session, taught by Prof Marinka Zitnik and her team. If you want to check out my previous reviews in this series, they are listed below. BMI 702 Review Part I BMI 702 Review Part II (Graph Learning) BMI 702 Review Part III (
tag:blogger.com,1999:blog-7583720.post-3144991686109696486