GeistHaus
log in · sign up

Babbling Fish: A personal blog

Part of babbling.fish

A personal blog exploring topics in data.

stories primary
The Keeb Life
When I got my ZSA Moonlander keyboard, I felt like I had to learn how to type again for the first time. It felt like being in a foreign…
Show full content

A view from the Sea-to-Sky Gondola, Squamish, BC
A view from the Sea-to-Sky Gondola, Squamish, BC

When I got my ZSA Moonlander keyboard, I felt like I had to learn how to type again for the first time. It felt like being in a foreign country where you don’t know the language, it was great.

Just to make things harder on myself, I decided to get the blank keys. I have always been typing by touch, and besides, how hard could this be?

It would take me around two months of typing practice to get back up to my normal typing speed.

I got a split keyboard because I wanted to improve my posture. A split keyboard allows me to more easily pull back my shoulders as I type because my hands are further apart on the desk. But then I was like, why stop there? If I am getting a split keyboard, it may as well be ortholinear, meaning the keys are arranged in vertical columns like a spreadsheet.

All of this this takes some getting used to, but once you get the hang of it it feels more efficient. If you take the typing lesson the ZSA website–which I highly recommend–it explains how each finger is responsible for one or two columns. The ortholinear layouts makes it so the keys are all easily reachable by the same finger without needing to move left or right.

More than anything, the new keyboard revealed that I had acquired some bad habits. It turns out that I type mostly with my left hand, probably because the most frequent letters in the English language are on the left side of the keyboard. I had to fight the urge for my left hand to press the Y, H, and N keys, which now sat way out of reach. Funnily enough, when I use a regular keyboard I still type this way.

Speaking of which, I now regularly switch between the Moonlander and normal keyboards without issue. After my initial immersion into the Moonlander, I temporarily had to relearn how to use a regular keyboard. This was much quicker than learning to use the Moonlander.

The Moonlander has onboard firmware that allows users to customize the keyboard layout. You are given access to powerful tools like modifiers and layers. At first, I wanted to create custom layers so that my favorite shortcuts would be the same between MacOS and Windows, as I frequently switch between the two. This turned out to be not such a great idea. It’s not enough just to swap the control with the command key, as I naively thought at first. That will work for many of the most common shortcuts, but there are many more shortcuts that are too different.

At my last job, I had a 2019 MacBook Pro and the escape key was a touch button. That thing was so hard to find! I was always getting trapped inside of vi. Maybe as an onvercorreciton, I now have multiple escape keys that are easily accessible from both hands.

One area I was not expecting to run into issues is with the way games implement their keyboard controls. In the default layout there is no tab key on the left side, something that is easy enough to fix. Some games require you to hold down keys that I mapped to be only tappable, like the escape key. For example, you have to hold down the escape key to skip cutscenes in Baldur’s Gate 3. I considered implementing a gaming layer as something I could toggle on. I decided to nix the idea as the cost of remapping my brain was too high.

The default layout created by the ZSA creators is quite innovative and intuitive. The right thumb is given a tab, enter, and escape key, with the up and down arrows in reach of the first two fingers of the right hand. The design here makes it easy for you to use the terminal or a search box, where the tab is autocomplete and the arrows are used to select from a drop down of options. The left hand is given the usual space bar, but also a backspace and the command key (which I use often). I like putting my thumbs to work, the default layout gives the thumbs little to do. There is a single button press for changing your language on windows and it always feels great.

The default Moonlander layout on Oryx
The default Moonlander layout on Oryx

On a final note, words cap (which capitalizes the following word after a tap) is far superior in my opinion to a toggleable caps lock key. Unless you are psychopath that writes in all caps, most people only need to capitalize one word at a time. If you are like me and write a lot of SQL at your job, then capitalizing one word surrounded by spaces is the name of the game. And not having to hold down a key as you type makes things just a little bit easier.

If you would like to see my custom layout, you can find it on Oryx here.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/keeb-life/
The System Traps of Applying to Medical School
When I read Thinking in Systems: A Primer by Donella Meadows, it had me thinking about systems in my life. In University, I was a pre-med…
Show full content

A Red Lacewing Butterfly, Vancouver Island, BC
A Red Lacewing Butterfly, Vancouver Island, BC

When I read Thinking in Systems: A Primer by Donella Meadows, it had me thinking about systems in my life. In University, I was a pre-med student, so my pre-med friends and I spent a lot of time talking about the med school application process.

It’s no secret that applying to medical school is expensive. People spend many years and a lot of money (e.g. MCAT classes, essay editing services, prerequisite classes, application fees, travel expenses for interviews, medical masters degree, etc) in order to earn that acceptance.

Let’s look at the application process as a system using the tools Meadows provides in Thinking in Systems.

Tragedy of the Commons

Applications to med school are done via the AAMC, a one stop shop for filling out the necessary requirements then sending it off to multiple programs with a push of a button. Before electronic applications people would apply by mailing physical paper applications. The friction of applying limited people to apply either to local programs or big name institutions. This allowed smaller programs to cater to a more local population without competition from the wider applicant pool.

The ability to apply to any school across the country so easily has greatly increased the number of applicants per program. To increase the chances of being accepted applicants apply to as many programs as possible, most people I know applied to at least 20 programs. Some apply to as many as 40 programs, ringing up costs of more than $5,000 for application fees alone.

Let’s say the probability of acceptance for a given program is the number of seats divided by the number of applicants. This makes the assumption that everyone who applies is equally likely to get in, which is not true, but good enough for the point I am trying to make.

Applying to an additional program increases your chances of being accepted into any program. Yet by having more applicants per program, the probability of acceptance goes down for all applicants to a given program.

The probability of getting into any program is the aggregation of the probability of getting into each program the applicant applies to.
The probability of getting into any program is the aggregation of the probability of getting into each program the applicant applies to.

So we have a tragedy of the commons, people acting in individual interest harm the collective. Everyone wants to get in so everyone applies to as many programs as they can afford to. The more programs people apply to the less people who get accepted at each program.

Winner Takes All

If you are an adcom (admissions committee), your job is to select the best applicants of the people who have applied. Only the best applicants are the same group of people who apply to many schools. The best candidates are the ones who have the most resources for things like research, volunteer trips, and paying for med school application fees. So they are able to apply to the most programs.

You may think that the best applicants know they are the best, and therefore only apply to the best programs. This isn’t true, they know how competitive the process is and want to get into any program. Of course they could reject an interview once they receive an acceptance at a program they like. But the acceptances are sent out much later after the interviews are conducted. So candidates are incentivized to accept as many interviews as is financially feasible (travel expenses) in order to improve their odds.

Many programs have moved their interviews online since the start of the COVID-19 pandemic. Now there is even less friction to accepting every interview invite you receive.

The result is the most competitive applicants get all the interview slots leaving disproportionately fewer slots for the applicants in the middle of the distribution. Or in other words, the winner takes all. These winners will only go to one school, yet they will attend many interviews, limited only by how many programs they applied to. While the average candidate may receive no interview invites that cycle.

All Together Now

These two system traps work together to create a system where only a select subset of the applicants to a given program receive an invite to interview. The problem is that in the middle of the distribution are capable people who are willing and able to perform the task, but are not given the opportunity.

Universities claim to have a “holistic process” of selection yet quantifiable aspects such as GPA and MCAT scores are used to rank students, often acting as a first pass eliminating 40% of applicants. This is done to reduce the number of applications the adcom has to read. Some programs receive as many as 7 thousand applications a year and only around 5% can be invited for an interview. So some kind of triage is required as a first pass.

People may argue that it’s ok if only the most qualified candidates get an interview or acceptance letter. They might say that this is a feature and not a bug. I think it unnecessarily filters out quality candidates and unfairly rewards the affluent. For a system that claims to be a meroticracy, I think it is doing a lousy job at doing so. As for a more equitable system, I find using a lottery based on a ranking system to be an interesting direction.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/systems-thinking/
The Enshitification of APIs
With the rising popularity of large language models (LLMs), the value of natural language text is increasing. Recently Reddit, following the…
Show full content

A Snowy Owl missing its left eye, North Island Wildlife Recovery Centre, Errington, BC
A Snowy Owl missing its left eye, North Island Wildlife Recovery Centre, Errington, BC

With the rising popularity of large language models (LLMs), the value of natural language text is increasing. Recently Reddit, following the example set by Twitter (aka X), decided to dramatically increase the price of using their APIs. To the point where using it at scale would be astronomically expensive and totally infeasible. (An API is a programmatic way to access information, in this case it refers to accessing data with a RESTful HTTP request without a web browswer.)

The goal of the API pricing change is to reduce access to this data and squeeze the large corporate customers who are willing to pay up. The natural progression of enshitification continues, where companies begin by offering customers value in order to onboard them. And then they gradually increase monetization and degrading the user experience. In other words, they are making their websites worse on purpose in order to increase sales.

I call this enshitification, and it is a seemingly inevitable consequence arising from the combination of the ease of changing how a platform allocates value, combined with the nature of a “two sided market,” where a platform sits between buyers and sellers, holding each hostage to the other, raking off an ever-larger share of the value that passes between them.

Not a shocking or new development, but something that will have an impact on the open source community. Now it will be impossinle for academics and independent teams to download all the natural text data from these sites for training LLMs in a cheap or conveinent way. And once again data will continue to become a resource centralized and concentrated by a few powerful corporations.

If LLMs are going to be as successful as everyone hopes they will be, then a very important question will be: with what data was it trained? The internet is a big place and there are a lot of niche communities, will they all be represented? With the goal to create a more equitable model then it will be imperative to start with a dataset that is carefully curated to remove hateful speech, misinformation, and conspiracy theories. But will also need to be as broad and diverse as possible to capture the entire probability space of language.

Twitter and Reddit are without a doubt the largest repositories of natural language that exist on the internet today. Users submitted their text for the benfit of other users, not to enrich corporations and increase their data from competition.

This leaves only one recourse: web scraping. The only way academics and individual contributors can get access to this data would be using a combination of browsers running on forward proxies to get this data without the API. Since web scraping is a violation of the terms of service for these websites, the data cannot be shared openly and easily between contributors. The scale of the data will also make hosting expensive and impractical for the open source community. It leaves little hope that the data will remain available for those outside the corporations that pay for it.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/no-more-apis/
Weak Foe Ahead
Spoiler Warning: The following article contains spoilers about enemies, items, bosses, and areas for Elden Ring by FromSoftware. Elden Ring…
Show full content

Let us learn together. Art by 澈 (Che)
Let us learn together. Art by 澈 (Che)

Spoiler Warning: The following article contains spoilers about enemies, items, bosses, and areas for Elden Ring by FromSoftware.

Elden Ring was the most successful video game launch outside of the Fifa and Call of Duty Franchises.

One way to measure the current popularity of a game is the concurrent player count, this measures how many people are currently playing. Typically the concurrent player count peaks on launch day. What made the Elden Ring launch exceptional was that the peak concurrent player count on PC occurred a week after launch.

Not only does this show Elden Ring has a remarkably high retention rate, it also means it was spreading through word-of-mouth. As of writing, 30.5% of players have beat Horah Loux (penultimate final boss) on Steam. Having 30% of players complete a 90+ hour game that sold millions of copies has to be some kind of world record.

Chart of Elden Ring concurrent player count for the last three months
Chart of Elden Ring concurrent player count for the last three months
Chart of Halo Infinite (low retention game) concurrent player count for comparison.
Chart of Halo Infinite (low retention game) concurrent player count for comparison.

What I want to talk about is my favorite new feature in the game, the mimic tear. The mimic tear is a spirit summons that creates a copy of your character with all the same equipment.

When I approached the area with the mimic tear I encountered a message that read “Weak Foe Ahead.” They were right. My mimic could not use my Unsheathe ability properly, so they were quite easy to kill by out ranging them.

The mimic tear makes the game easier, making it essentially an accessibility feature. After grinding against Radahn, unlocking Nokron and getting the mimic felt like an awesome reward. The day I unlocked the mimic I went on and defeated 4 of the bosses I had been stuck on! I was finally able to access the late game content.

Against Rykard, the mimic was not impacted by magma interrupt stagger. So they went up into the magma surrounding him and just kept smacking him non stop till he died. It was pretty funny to watch.

Playing with a mimic changes how you play. It’s super helpful when your mimic draws aggro and gives you time to heal or attack. Unless you are running a high vigor and endurance build, your mimic can’t tank that much damage. You have to develop a strategy of drawing aggro away from your mimic to keep it alive. This creates a sort of metagame where you need to learn when to draw aggro.

For my first playthrough, I used the Samurai starting class and was running a bleed build with the katana (early game Uchigatana, mid game Nagakiba, late game Rivers of Blood) mostly with the unsheathe ash of war. I factored the mimic into my build and made sure that our bleed buildup could stack.

The mimic uses whatever weapon is equipped in its right hand when summoned, and they have unlimited arrows. I would switch to my bow with bleed arrows before spawning my mimic so they would keep their distance and stay alive longer. While I closed the distance and engaged with the sword. There are so many opportunities for creative strategies, especially for multiclass builds like spellblade.

I was on my twelfth attempt against Mohg, after dying repeatedly in the second phase. When my mimic and I managed to wipe out his remaining health bar during the 30 second damage window while he was transitioning to phase two. There was something really satisfying when my mimic and I would proc hemorrhage multiple times in quick succession. Hemorrhage takes ten or fifteen percent off the total max health, so used correctly you can absolutely melt bosses with it.

What I love most about this feature is that it gives you someone to play co-op with. Even when the player count was high after launch, it was difficult to find summon signs in specific areas at certain times of day. Over time as player count goes down this issue will only get worse. I have no doubt this feature will be appreciated even more in the future.

Many people (myself included) do not have a group of people to play with. When I came to FromSoft games, they were already well past their launch on PC and there was no one playing online. I have always wanted to be part of a group of people progressing together, helping each other out on hard bosses.

At times the game can be scary. Especially dark areas filled with enemies hiding and waiting to jump out at you.

In the Mogwyn palace, the interior is pitch black and enemies come out of pools of blood on the ground to ambush you. I felt safe with my mimic, they could lock on to enemies before I even saw them. When two enemies spawned, I would take one and they would take the other. Playing with the mimic is like playing with an older sibling (I have two older brothers) as a kid. You have someone with seemingly magic powers to protect you. Now you never have to play alone.

Rest here weary traveler
Rest here weary traveler

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/weak-foe-ahead/
Daylight Savings Time
With the senate passing a bill, the US has finally decided to make daylight savings time (DST) permanent in 2023 (pending approval by…
Show full content

With the senate passing a bill, the US has finally decided to make daylight savings time (DST) permanent in 2023 (pending approval by congress and the president).

mars

While all programmers work with time, the work of a Data Engineer especially involves working with timestamps in the database. Some tasks commonly done by data engineers include reporting, logging, and tracking the arrival time of events.

I am going to talk about some of the work that will be involved in the transition. Also, I will discuss strategies that I have seen used to handle time changes for scheduling and reporting.

Goodbye Backwards Compatibility

It is considered best practice to store timestamps in UTC format in the database. One benefit is that the UTC time zone does not have DST changes.

If you stored your timestamps in a local timezone like Pacific Time (e.g. America/Los_Angeles) then permanent DST will break your backwards compatibility. All of the timestamps that are stored in PST are no longer valid. As PST will no longer exist as a valid time zone. If you wanted to map them to the present day equivalent you would need to add one hour and change the timezone info.

Committing such a change would be a major pain. The timestamp would likely be spread out across several different tables and stored in files inside an object store like S3. Changing the value everywhere without breaking anything would be near-impossible.

The only thing that makes sense to me would be to gradually age out the old data. A team could migrate to UTC this year and start collecting records with clean timestamps. Then they can set a retention policy and delete the old data with the dirty timestamps. Hopefully they can do this before the execution of the change in 2023.

Goodbye Scheduling Pain

In our current system, the twice-a-year hour change causes issues with scheduling and orchestration. As an example, let’s take a popular open source workflow management tool for scheduling like Apache Airflow. Airflow runs on UTC time, and uses UTC timestamps exclusively in its data model.

We often have reports that are meant to be delivered by a specific time in a local time zone. For example, we want to see a summary of yesterday’s performance at 0800 PT. The job that delivers the report is being run at a specific time in UTC. So when there is an hour change, the report is now running at the wrong time.

There are two solutions to this problem, the simple solution is to manually change the schedule_interval twice a year. Move it up one hour in March, and back one hour in November. Another solution is to write a script to change the schedules automatically. This solution ends up being surprisingly time consuming.

There is a similar problem when it comes to running scripts that aggregate data from the previous calendar day. It is common to run reporting scripts at around midnight for a local US time zone. Because we want to process yesterday’s data for reporting as soon as yesterday is over. Since reporting is typically grouped by calendar day.

What makes this scenario more complicated than the above is these aggregation scripts have several upstream and downstream dependencies. A single change to scheduling can have a chain reaction causing other scripts to change their time as well.

days since

Hello Job Security

In all reality this problem will be easy to fix for most companies that are using servers connected to the public internet and running the latest version of their open source software. For example, running apt upgrade would patch all the software to include the new timezone information.

Somewhere out there is old enterprise software with datetime libraries that were developed in-house back in the 80s or 90s. Doing a major version update on a forty year old software library may require a total rewrite of the code itself.

Such a task would need to be done by specialists with a very strong understanding of the code. Surely some COBOL programmers who have been kept on retainer will be happy to hear about this change.

Conclusion

The fact that the US decided to go with DST permanently is a somewhat existential problem. For much of human history, solar noon has been defined as the time of day when the sun is highest in the sky based on a sundial. Now we are saying that noon is actually one hour after solar noon.

Just kidding, all times are made up so who cares. I definitely prefer DST because it gives me more daylight hours after I finish working for the day. I also noticed more people in my neighborhood walking outside at times that would have been dark before DST.

While there will be work involved in the transition, in the long run this should reduce work for programmers. The bi-annual time change has always been a risk that creates unnecessary issues and complexity.

praise the sun

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/dst/
Learning Modern Greek
When I started learning Modern Greek I felt frustrated when I typed “Greek Learning Resources” into Google. Every item on the first page was…
Show full content

A view of Quandary Peak (left), Blue River CO
A view of Quandary Peak (left), Blue River CO

When I started learning Modern Greek I felt frustrated when I typed “Greek Learning Resources” into Google. Every item on the first page was for Ancient Greek. It is more common for American universities to teach Ancient Greek than Modern Greek. Why is it so hard to find good resources for Modern Greek?

Eventually I did find some great resources for learning Modern Greek, it just took a lot of digging. Here I have compiled my favorite resources so others can avoid the effort I had to do.

The resources are arranged by level using the European CEFT system. Using these resources, I created a deliberate practice routine to teach myself Greek.

Absolute Beginner (A1)

Language Transfer is a series of 120 podcasts, each around ten minutes long. The podcasts follow a formula the creator calls “The Thinking Method.” The Thinking Method is a way of learning the fundamental rules of a language by presenting one rule at a time with examples, then allowing the learner to think of sentences that utilize that rule.

Because the podcasts are short they are easy to do once a day in under fifteen minutes. Making them the perfect way to begin a language learning routine because it provides structure. The concepts covered get progressively more complicated, and towards the end different concepts start to be combined together.

What I like about the Thinking Method is it focuses on learning by doing. It teaches the form of the language through demonstration, and it encourages the learner to actively engage with what they’ve learned.

This is a more generative way of learning a language that mimics to some extent how children naturally learn language, by hearing it and then speaking it. Greek (like all languages) has surprisingly few rules that when combined can produce beautiful and complex language. The focus of this program is to teach you the basic rules so you can identify correct sentences from incorrect sentences. Then you can start thinking in the language and self-correct your own mistakes.

This podcast was a major game changer for me. I had been struggling with learning on my own for some time when I had found it. Afterwards I felt like I had a really solid foundation and could finally wrap my head around the first principles.

Resources

Here is a collection of resources I find useful through my language learning journey. I’d recommend getting comfortable with using them early on. They are useful for when you are reading, writing, or listening. I would highly recommend avoiding Google Translate. It’s not very good at translating between English and Greek. And some people might use it as a crutch.

WordReference

WordReference is the most comprehensive English to greek dictionary I have found. It also has slang and info for different dialects. This is handy for looking up the meaning of words you don’t know.

Cooljugator

Cooljugator Is a website that allows you to look up the verb conjugations for a given verb. It can help you when you are having trouble remembering a specific conjugation. Or if you see a word for the first time and you want to know if it’s irregular.

Wikitionary

Wikitionary Is helpful for looking up the Etymology of words. Greek is the oldest continuously used language in the world. Many words have a rich etymology and surprising associations with words you already know!

I have intentionally not included a website for pronunciation, though there are many out there. Greek is a language with a one-to-one ratio between letters and sounds. There is no ambiguity in what a word should sound like. The challenge is pronouncing it correctly! Once you learn the alphabet and the vowel sounds you will have all the tools you need to guess a word’s pronunciation. Developing an accurate accent is something that would come much later.

Upper Level Beginner (A2)

At this stage you have a basic understanding of Modern Greek. I define this as being able to conjugate verbs in the various tenses; use pronouns correctly; and comprehension of sentences with the help of a dictionary.

The goal in this phase is to grow vocabulary, improve listening comprehension, reading comprehension, and the ability to generate language. These four things should be worked on simultaneously, if you have time then use all three resources every day.

Reading

Modern Greek for Classicists is a Graded Reader designed for beginners. A graded reader is a text designed to use grammar and vocabulary that is easy enough for people learning a language. It gets progressively more difficult after each chapter. Don’t worry if you are not familiar with Greek mythology, It is useful to anyone learning Modern Greek.

What I love about Graded readers is that they are the appropriate difficulty for a beginner. Even with my limited vocabulary I could understand 95% of the words on the page. This makes you feel like you are really reading Greek. When you read a language with high comprehension things begin to click as you see the different pieces come together in real time.

At the end of each chapter are exercises to test your comprehension and to practice your reading and writing. This is a good opportunity to practice writing by answering the questions. Or what I like to do is to answer the questions verbally by talking to myself.

Note: The book costs $20 and shipping is around $5 for those in the US when I bought it. This is not an affiliate link or an advertisement.

Vocabulary

ClozeMaster is a game available on web and mobile that utilizes cloze deletion tests for learning vocabulary and natural phrases. A cloze test is a sentence with a word deleted and you have to guess the word. This game is free for the basic features with no limits on how much you can play.

I have tried flash cards as well as other gamification vocab apps, I find cloze deletions to be the most effective method. What is enjoyable about ClozeMaster is that it uses natural language for the example sentences. So it’s a fun way to pick up expressions and phrases, to learn how people actually speak.

I like to practice speaking by talking to myself as I play the game. Using the example sentence as a starting point, I’ll make small adjustments, adding a clause, changing the tone, swapping the object with the subject. This helps the vocab stick in my mind as well as helps me practice generating the language.

Listening Comprehension

Peppa the Pig (Πεππά το Γουρανάκι) Is a youtube channel that makes educational cartoons for preschoolers. The language is simple enough that it is accessible to beginners. I have found that if I try to use materials that are too difficult for me it will feel really discouraging. So it’s better to use something that is the appropriate difficulty.

One downside is that a lot of the vocabulary is not relevant to being an adult. For example when peppa plays with dolls or jumps in puddles. These are not words adults use with a ton of frequency. The main thing here is to focus on understanding what is being said. Greek is spoken quickly and it takes the brain some training to hear where the gaps are between words and phonemes.

Lower Level Intermediate (B1)

At the intermediate level you can start expanding your practice to some more challenging resources.

Easy Greek is a great way to practice listening to Greek as its spoken naturally. In these videos Dimitris walks around Athens interviewing people on various topics. He then captions the video with Greek and English subtitles.

As I watch the videos I read the Greek subtitles and listen for when they’re spoken. When there’s a word I don’t know I pause the video and look at the English translation.

There are several ways to engage with this material. Sometimes I will listen to a phrase then pause, think about what was said, then read the subtitles to check if I’m right.

If you become a Patreon supporter, they also provide worksheets and online flashcards, as well as audio recording with slower speaking. I’ll watch the video over and over again until I can watch it without reading the subtitles, and only listening.

Once you can start understanding Greek without subtitles you are ready to watch TV in Greek. This is probably my favorite way to practice because it creates a low friction way to practice listening comprehension for an extended period of time.

There are a lot of shows and movies dubbed in Greek on Netflix. I would recommend starting with some of the Dreamworks animated shows. For example, I really enjoyed Kipo, it has great characters, story, and voice acting. Blood of Zeus is surprisingly approachable because the characters speak in short simple phrases.

Conclusion

The journey of language acquisition never ends because there is always room for improvement. While this sequence can be followed in the beginning as you get more advanced your practice will need to be more self-directed. It’s best to use as many different types of resources as possible. For example you could use a combination of different mediums like books, podcasts, tv, and audiobooks.

As you get better the number of resources that are accessible increases. Eventually something as advanced as Greek TV and podcasts will become accessible. At that point practicing can consist of speaking with natives or watching a funny sitcom like Στο Παρά or Είσαι το Ταίρι Μου.

There are no shortcuts so the best thing to do is love the process. Approach the language each day with a beginner’s mind, and have fun with it! If you are like me, progress may be slow. As long as you are enjoying the process that will help sustain your practice.

Edit: Republished to my new blog.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/learn-greek/
Preparing for the SQL Interview
When applying to be a data engineer at a big company, you will likely need to pass a SQL test. I’d like to take what I’ve learned and share…
Show full content

Wild California Poppy
Wild California Poppy

When applying to be a data engineer at a big company, you will likely need to pass a SQL test. I’d like to take what I’ve learned and share it with prospective candidates. If you feel stuck on a question you can cycle through this list in your mind to find ideas on how to proceed.

You should already know some basic SQL. If you don’t there are a lot of high quality tutorials out there. I took the SQL tutorial on codeacademy ages ago, and I thought it was pretty good.

General tips

You should use PostgreSQL (PG) to practice for your interview. PG is an open source database with a very robust feature set. Also, it is the only SQL dialect supported on Coderpad. Coderpad is used by several large companies e.g. Google, Meta, Amazon.

It is easy to spin up your own PG instance using docker or installing directly to your machine via homebrew or apt. It mostly uses standard SQL, the universal core spec of the SQL programming language. It also has support for analytical functions which are important for more difficult questions.

It’s good to know some basics about query optimization in PG. For example, you should know that common table expressions (CTEs) are implemented as temporary tables. This is an easy way to get a performance boost by using intermediate tables.

The Problem

Let’s use a leetcode problem as an example. Take some time to try and solve this on your own.

The question is to find the cancellation rate for each day between 2013-10-01 and 2013-10-03, but only counting trips when the driver and the user are not banned. Here is the schema from the linked problem for reference.

Trips
+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| id          | int      |
| client_id   | int      |
| driver_id   | int      |
| city_id     | int      |
| status      | enum     |
| request_at  | date     |     
+-------------+----------+
id is the primary key for this table.
The table holds all taxi trips. Each trip has a unique id, while client_id and driver_id are foreign keys to the users_id at the Users table.
Status is an ENUM type of ('completed', 'cancelled_by_driver', 'cancelled_by_client').

Users
+-------------+----------+
| Column Name | Type     |
+-------------+----------+
| user_id     | int      |
| banned      | enum     |
| role        | enum     |
+-------------+----------+
user_id is the primary key for this table.
The table holds all users. Each user has a unique user_id, and role is an ENUM type of ('client', 'driver', 'partner').
banned is an ENUM type of ('Yes', 'No').

This problem has been copied from LeetCode for educational purposes only. This website is not monetized in any form.

Avoid Many-to-Many Joins

Writing SQL is really all about normalization. Every table needs to have a column or set of columns that makes a record unique. Without this we would have no way to know if duplicates exist.

Duplicates can easily be introduced by accident through a many-to-many join. A many-to-many join will take the cartesian product between the two sets, quickly multiplying the size of the result set.

When you perform a join between two tables using a column that is not normalized, the resulting set will contain duplicates. As an initial step you’ll need to normalize your data before performing the joins using CTEs. This is basically true for every SQL question I encountered.

Let’s take a look at the example problem. A naive approach would be to do a self-join on the trips table joining on the request_at column. Doing so would result in a many-to-many join as several records in the table can have the same request_at.

To answer this question we will need to create two CTEs: one for the number of canceled trips, and another for the total trips. These CTEs will need to be normalized first before you can join them together.

WITH canceled AS (
   SELECT t.request_at, count(*) `cancelled_count`
     FROM Trips t
    INNER JOIN Users c ON t.client_id = c.users_id
    INNER JOIN Users d ON t.driver_id = d.users_id
    WHERE c.banned = 'No' AND d.banned = 'No' AND t.status IN (2,3)
    GROUP BY t.request_at 
), total_trips AS (
   SELECT t.request_at, count(*) `total_count`
     FROM Trips t
    INNER JOIN Users c ON t.client_id = c.user_id
    INNER JOIN Users d ON t.driver_id = d.user_id
    WHERE c.banned = 'No' AND d.banned = 'No'
    GROUP BY t.request_at 
)

Here is an example on how to create two CTEs to normalize on the request_at column. Now we can join these two CTEs together with a a one-to-one join.

Know how to handle NULLs

A visual description of null. Most SQL Dialects do not have an undefined field.
A visual description of null. Most SQL Dialects do not have an undefined field.

Anyone who has worked with SQL knows that NULL is a tricky concept. I was frequently asked to demonstrate that I understand how to handle null values appropriately in my interviews.

When comparing NULL you need to use the IS operator because NULL is a singleton (NULL is NULL returns true). NULL is not equal to NULL (NULL = NULL would return false). When you do arithmetic with a number and a NULL then you will get a NULL result since they are not comparable. For this reason, you must use the IFNULL function to convert NULL into zero before using it in arithmetic.

At the same time you have to avoid divide by zero errors. If you try to divide by zero that will throw an error. If you divide by NULL that will make the result NULL, which is what we want.

Here we look at how we will calculate the cancellation rate based on the two CTEs above.

SELECT
    tt.request_at AS `Day`, 
    ROUND(
        IFNULL(c.cancelled_count,0) / 
        IF(tt.total_count=0,NULL,tt.total_count), 
        2) AS `Cancellation Rate`
FROM
total_trips tt
LEFT OUTER JOIN canceled c ON tt.request_at = c.request_at
WHERE tt.request_at >="2013-10-01" AND tt.request_at <= "2013-10-03";

When you are performing an outer join, we can expect to see some null values in our result set. In this query the cancelled_count or the total_count could be null. By wrapping them in an IFNULL statement and setting them to 0, we avoid having our ratios converted to NULL inapropriately.

Let’s say we want to join between the trips and users table. But there are users in the trips table that are not in the users table, as well as the converse. We don’t want to have user_id columns in our result set as that will make downstream processing more complicated. So what we can do is use a COALESCE statement so we can find the first non-null values across both columns. The COALESCE function can take many arguments so this can be done across multiple columns.

SELECT DISTINCT COALESCE(users.user_id, trips.client_id)
  FROM trips
  FULL OUTER JOIN users 
      ON trips.client_id = users.user_id
Know how to use Having with Group By

This is an easy one so be sure to know it. A common followup question for a query involving a GROUP BY is to ask a question that requires adding a HAVING clause. Let’s say we modify our problem to return the cancellation rate only for days that have more than X canceled trips. To do this we could add a HAVING clause to filter out days with less than X canceled trips.

WITH canceled AS (
  SELECT t.request_at, count(*) `cancelled_count`
    FROM
    Trips t
    INNER JOIN Users c ON t.client_id = c.users_id
    INNER JOIN Users d ON t.driver_id = d.users_id
    WHERE c.banned = 'No' AND d.banned = 'No' AND t.status IN (2,3)
    GROUP BY t.request_at
    HAVING COUNT(*) > 10000 
)
How to use Row Number to get the Kth Item

Whenever you see them asking for the kth ordered item, you should instantly think of ROW_NUMBER. Similarly you may need to use Rank, but this is less common because it is harder to use. With Rank you can end up having multiple rows with the same ranking and that makes the code more complex.

If you learn only one analytical function, make it this one. Let’s say we want the driver with the fifth most completed trips.

WITH trips AS (
 SELECT driver_id, COUNT(1) as num_trips
   FROM trips
  WHERE status = 'completed'
)
 SELECT driver_id 
   FROM (
         SELECT driver_id,
                ROW_NUMBER() OVER 
                (PARTITION BY driver_id ORDER BY num_trips DESC) as row_num
           FROM trips ) T
  WHERE T.row_num = 5

This query will only return the driver the fifth most trips. There are other analytical functions that are useful, so be sure to read up on some of the others. Some functions are easier to write questions for than others so keep that in mind when you decide how much effort to put into studying each.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; } .default-dark { background-color: #1E1E1E; color: #D4D4D4; } .default-dark .mtk4 { color: #569CD6; } .default-dark .mtk1 { color: #D4D4D4; } .default-dark .mtk11 { color: #DCDCAA; } .default-dark .mtk8 { color: #CE9178; } .default-dark .mtk7 { color: #B5CEA8; } .default-dark .grvsc-line-highlighted::before { background-color: var(--grvsc-line-highlighted-background-color, rgba(255, 255, 255, 0.1)); box-shadow: inset var(--grvsc-line-highlighted-border-width, 4px) 0 0 0 var(--grvsc-line-highlighted-border-color, rgba(255, 255, 255, 0.5)); }
http://babbling.fish/sql-interview/
How to create a Markov Chain Monte Carlo Model
There has been a trend of retail investors adding their shares to the Direct Registration System (DRS). One question that has been…
Show full content

There has been a trend of retail investors adding their shares to the Direct Registration System (DRS). One question that has been concerning the r/superstonk subreddit is how to approximate the number of shares that have been added to DRS via ComputerShare.

This model is built on several strong assumptions and premises. The most important piece of information is from the GameStop 10Q 2021 Q3 filing, where they declared that there are currently 5.2 millions shares directly registered through October 31st, 2021. We can combine this with information being gathered by redditors to build a probabilistic model.

Computershare Accounts

Computer Share account numbers are generated using a mod11 algorithm. A mod11 algorithm is a monotonic generator that uses a tenth digit as a check digit. By taking the mod11 of a given account number we can find where it is in the sequence.

If we can find the maximum account number, we can then infer the total number of accounts that exist. Reddit user u/stopfuckingwithme has been tracking the maximum account number posted to reddit for a given day.

A post on October 30th, 2021 found that based on the current maximum account number, there were at least 72 thousand accounts in existence. We will be using this number when creating our model.

Number of Shares per Account

Members of the reddit community have been posting screenshots and letters of the shares they have added to ComputerShare. Reddit user u/Roid_Rage_Smurf created DRSBOT that collects information on the number of shares users have registered. The bot allows users to self-report the number of shares for a given post, which are then confirmed by witnesses. The number of shares is then stored in a database and displayed in summary tables in the comments.

This data set represents a small sample of accounts, from which we can infer more about the distribution.

I will be using this comment from October 31st, 2021 as my data source for this analysis.

Multiplier model

I have seen other people on r/superstonk estimate the total number of DRS shares by multiplying the number of shares by the mean or median. These methods result in over and under counting respectively.

The sample is skewed to the right by users who have many shares. The shareholder with the most shares has over 1000 times more shares than the median shareholder. Using the mean multiplied by the number of accounts to extrapolate will result in overcounting by a large margin.

Using the median has the opposite problem. When multiplying the number of accounts by the median we end up undercounting. This method doesn’t account for the users with a lot of shares and only counts users with the average amount of shares.

The Median model underestimates while the mean model overestimates the ground truth (correct) value.
The Median model underestimates while the mean model overestimates the ground truth (correct) value.

The method outlined here should produce a more accurate estimation because it will simulate the nuances of the particular distribution. Rather than using a multiplier model (what was described above) this model will take into account the shape of the data in order to arrive at a more accurate estimate.

Markov Chain Monte Carlo

A Markov Chain Monte Carlo (MCMC) is a method for generating a statistical prediction by simulating data with a random number generation. We encapsulate our assumptions about the data into a probability distribution, then we generate samples of data using the probability distribution. We do this simulation many times and save the results each time. Then we can combine the results of each simulation to form a probability distribution of the possible outcomes.

One really great thing about this method is that it works with limited information. I was unable to get the raw data from DRSBOT (hit me up u/Roid_Rage_Smurf) but that is OK. All we need to know is the shape of the data. And that info is being provided in the DRSBOT comments on every post.

Not only will this provide us with a prediction. It will also provide us with a range of possible values, and a probability value assigned to each. This is valuable information for knowing the confidence interval of our model.

Picking a distribution

I choose a log-normal distribution as my generator. We know empirically and intuitively that the number of shares per account follows a power law. In other words it follows an exponential relationship, with the majority of the shares being held by a minority of the shareholders. This is also known as the Pareto Principle.

Log-normal distributions for different means and standard deviations
Log-normal distributions for different means and standard deviations

I wanted to represent the core distribution as a normal distribution. I suspect that the majority of shareholders fall within a narrow band in terms of the number of shares they own in DRS form. I also wanted to allow exponential values to account for whales, with a lower incidence compared to the mean. The log-normal distribution does a fair job of accomplishing both these things.

A quick note, I do not believe the data is actually normal in nature. I have simply chosen this generator because it was convenient and fit well enough.

MCMC model

First, let’s look at the distribution of the data on October 31st according to DRSBOT.

Number of Shares Number of People with at Least X Shares x 1633 xx 3290 xxx 1893 xxxx 192 xxxxx 7

As you can see, most people have only a few shares. We would say the distribution is left skewed with a fat tail to the right. Hopefully this convinces you that the distribution is log-normal.

Like I mentioned, small holders are being undersampled by DRSBOT. This is being corrected in the model with a higher proportion of users with 1-100 shares compared to shareholders with over 100 shares.

DRSBOT did not contain data from users with fractional shares on October 31st, while my model does represent users with less than one share.

Now let’s run the simulation ten thousand times using the number of accounts on 2022-01-01 assuming there are 110 K registered shares.

The predicted number of shares for my model after running ten thousand simulations.
The predicted number of shares for my model after running ten thousand simulations.

According to my model the most likely estimate is 8 million shares. You can see the graph as a fat tail to the right. That is because the exponential nature of the model produced some samples with a result much higher than the mean. Because the output is normally distributed, we can assume the mean will always be the center due to the central tendency of normal distributions.

Assumptions

Our distribution is highly skewed to the right. In other words, most people have registered a small amount of shares while a few users have registered a large amount of shares. For this reason, the median is going to be a more reliable indicator of the average shareholder compared to the mean. Because the mean can be skewed by large numbers. I mean average here to be analogous to the mode, or the most common number.

In my opinion, it is safe to assume that our sample is greatly undersampling users with a small amount of shares. Assuming there were 72K shares on October 30th, 2021. DRSBOT had registered the shares from 3K accounts. That means our sample contains less than 5% of the total population. Nearly half the sample is from shareholders with over 99 shares. It also makes intuitive sense that whales would be overrepresented in our sample since whales like to flex to inspire the rest of the troops. While most people do not post to reddit or may not meet the karma requirement.

The median number of shares held on October 31st according to DRSBOT was 40. I am assuming that 95% of Computershare shareholders have less than 50 shares. So I am going to say the real median is closer to 25. I choose a standard deviation of 5 because my intuition tells me that 95% of users have between 5 to 40 shares. I also found these numbers experimentally. I played around with the parameters until the output of the model matched 5.2M when there are 72K shareholders.

P.S. I wrote the analysis using the PyMC3 library in python. I originally intended to share the notebook but I have since lost it before publishing.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/mcmc/
The Data Problem for Startups
The hard pill to swallow for many startup founders is that getting data from a business is expensive and time consuming. Oftentimes…
Show full content

Muir Beach Coast Trail, Marin County CA
Muir Beach Coast Trail, Marin County CA

The hard pill to swallow for many startup founders is that getting data from a business is expensive and time consuming. Oftentimes engineers are searching for ways to speed up the process with automation and standard operating procedures. These help to an extent but fundamentally they cannot scale without also scaling the number of employees.

The Problem

Let’s say we have a company that does analytical modeling for insurance companies. In order to perform their analysis, the startup needs to move the client’s data from their database into the startup’s database.

This process is painful and error-prone. The client may be using an IBM mainframe from the 1980s and can only export the data as fixed character length files once a day at 2 AM when the load is lowest. Another client may be able to send you files over an object store in a more reasonable format like Parquet. Even so, as you start to work with this customer’s data you’ll find they have made small, yet extremely important, differences in the way they represent their data.

I am very skeptical of these startups that rely on business data and assume they can onboard hundreds of small clients. The idea is that if process of on boarding clients is cheap and quick enough, the startup can quickly scale up.

The most obvious problem is that clients do not have the same data models. Once they send the data over someone on the startup side then has to do the boring work of mapping everything from the client’s schema to the startup’s schema. This is where small difference become big. Say one client represents refunds as a negative transaction, while another uses positive numbers but identifies it as a refund with a flag. Someone has to write business logic to transform the clients data into a normalized format that is consistent with other clients.

Landing one big client is not the silver bullet answer either. Since that one client is going to expect premium treatment given how much they are spending. They were probably promised the moon by the sales team and the engineering team is going to be constantly working to push new features to them. The large client will have unique problems for their scale of data transfer that won’t generalize to other smaller companies or even other large companies.

How did we get here?

So why is this the case? The developers who designed the database for the business application did not have this use-case in mind. They are thinking of data access patterns to support fast single row transactions. Maybe this requires a heavy amount of normalization. When it comes time to pull this data out, it requires someone with deep institutional knowledge of the data model to write queries to pull out the relevant data.

Even then sending the data is a pain as well. Most companies do not have a way to send many small files in parallel, a method of data transfer that is fast and uses less memory. Instead they will send big files with each file corresponding to a given query for a given day.

The companies that rely on data that can scale easily are ones that are consumer facing and generate their data with machines. This data is oftentimes produced by client devices logging user interactions with application.

It is much more difficult for startups that rely on getting their data from other businesses. Especially when those businesses are not tech companies. These kind of companies tend to have small IT teams and don’t have the necessary staff and expertise to create a low friction hand-off of the data.

Conclusion

Sure they’ll be able to get faster and take on more clients. But scaling will always be linear. If they want to on board more clients that means more engineers and more solution engineers. They will never achieve the sort of parabolic growth they are hoping for.

I’d love to be proven wrong, can anyone provide me with an example of a startup that managed to scale this way?

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/startup-data-problem/
What is a Decision Tree?
Decisions trees are a simple, yet surprisingly powerful data structure used in several machine learning models. They are used in ensemble…
Show full content

Henry Cowell Redwoods State Park, Santa Cruz CA
Henry Cowell Redwoods State Park, Santa Cruz CA

Decisions trees are a simple, yet surprisingly powerful data structure used in several machine learning models. They are used in ensemble models like XGBoost and Random Forest. An ensemble model is when the output is averaged across several different methods to obtain a better result. These are often powerful models that are versatile and easy to use.

A decision tree is a flowchart-like structure in which each internal node represents a “test” on an attribute (e.g. whether a coin flip comes up heads or tails), each branch represents the outcome of the test, and each leaf node represents a class label (decision taken after computing all attributes). The paths from root to leaf represent classification rules.

Wikipedia

This diagram shows a decision tree created on the Titanic data set that predicts if a passenger survived or not. You can find this data set included inside of the scikit-learn pacakge by default. (sibsp == Number of Siblings/Spouses Aboard)
This diagram shows a decision tree created on the Titanic data set that predicts if a passenger survived or not. You can find this data set included inside of the scikit-learn pacakge by default. (sibsp == Number of Siblings/Spouses Aboard)

Take it as it Comes

Random forests do not require the data to be normalized. Meaning you can throw your data into a model as-is. This greatly reduces the amount of time investment needed to produce a first prototype.

A model built with decision trees does not make the assumption the data is normally distributed. Normal distributions are commonly found in nature when some sort of inertia prevents very high or very low values. Many machine learning models require the input to be normalized in order to make the search space dense and centered on the origin (0).

Data does not need to be normalized because the business rules that power decision trees work on both ordinal and nominal (categorical) data. We could have a rule say where the color is equal to orange, or where the height is great than 5.

A common example of a normal distribution is the height of a population. The physiology of our bodies prevent us from being ten feet tall or 2 inches tall, our bodies are designed to operate within a certain range of height. Extreme outlier are not rare, but impossible. On the other hand an example of a non-normal distribution would be something like the number of followers an account has on Instagram. Many accounts will be 0 or less than 100. A few accounts will be in the millions, many thousand times greater than the population median.

Recursion

Decision trees by their design approach a problem recursively. Each iteration of the decision tree splits the remaining population into two discrete groups. Also, the feature that was used for the last node is removed from the pool of potential features. Each iteration reduces the dimensions of both the row and column space.

This property is called recursion. The problem gets smaller and the calculation becomes faster. Making this method effective for even very large data sets with many features. It has the trade off that nodes towards the root of the tree have a much higher impact than nodes lower down.

Parallel

Methods like Random Forest are made up of a collection of decision trees. The output of each decision tree can be calculated in parallel. This creates a simple way to score large data sets using multiple threads, processes, or machines. One example would be to use a map-reduce machine cluster like Hadoop to quickly score a large collection of input samples in parallel, then aggregating the results at the end.

Interpretation

Perhaps the strongest attribute of decision trees is in its ability to be interpreted. Most libraries that utilize an ensemble of trees will return a “feature importance” scoring show the impact each feature made on the final score. We can actually print out the decision trees themselves graphically like the image above. The closer a feature is to the root, the larger it’s impact. Because the amount of samples at each node decreases as you go down the tree.

If we find our random forest model useful, we can figure out exactly which features are most heavily weighted. Then we can conclude that those features are most important for influencing the outcomes we are measuring.

For transnational research, knowing why a model works is important. Seeing how the features are being used allow us to interpret the results and share the explanations. These attributes are important for scientific research.

Models that use deep learning are considered a black box. It is not possible to interpret what the model is doing, since it is taking a nonlinear transformation of the features represented as a vector. The deep learning model is using the associations and connections between the features in a complex way that is impossible to untangle. We tend to use deep learning for optimization problems when it is not important to understand how the model works.

Conclusion

Decision trees are a powerful tool but they come with limitations. They are extremely low bias, meaning they tend to overfit the data. Using a decision tree will basically tell you which records the training set are most similar to your input. Since plugging our input into the decision tree will lead us to a leaf node that will contain samples from our training set.

This is not always a useful thing. Let’s say we are interested in predicting something that lies outside our training set. The decision trees would have trouble generalizing since the trees were constructed using only seen data.

Decision trees are considered to be low variance. So they would be a poor fit for modeling a system with high variance, when we expect outcomes to differ from our training.

In this article I focused on the pros of using decision trees. I hope it has convinced you they are a useful data structure in your statistical modeling tool set.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/decision-trees/
Opt-in IDFA on iOS 14.5
After announcing that the Identifier for Advertisers (IDFA) would become opt-in with the release of iOS 14 in (announced in June, 2020) the…
Show full content

After announcing that the Identifier for Advertisers (IDFA) would become opt-in with the release of iOS 14 in (announced in June, 2020) the launch was pushed back. Supposedly to give advertisers more time to prepare for the change. Finally the update was released with udpate 14.5 earlier this year.

Attribution 101

The IDFA is used by advertisers for a process called attribution. Attribution is connecting an acquired user with the ad that brought them into the app. This is done by using cookies in the user’s browser to record which ads they’ve clicked on, and what apps they have installed. Attribution is necessary for the advertiser to calculate key performance indicators (KPIs).

An example of a KPI is return on investment (ROI) calculated by dividing the amount of revenue from acquired users divided by the cost to acquire those users. An ad campaign reaches the break-even point if its ROI is 100%.

The IDFA plays a critical role in how attribution is done on mobile with iOS. Without having IDFA enabled the accuracy and reliability of attribution for iPhone users will decrease siginificantly. Prior to the update around 80% of iOS users had IDFA enabled for their apps. After it is likely to drop to around 20-40%.

Without attribution it will be difficult to calculate the ROI for a given user. Advertisers will essentially be flying blind unable to objectively evaluate the performance of an ad campaign using data. This will make it more difficult to pay a fair price on an ad since the advertiser will not know how much money that ad will generate.

Users on iPhone are a particularly important segment because on average they have more discretionary income compared to Android users. Most iPhone models are expensive, targeting the “flagship” tier of phone. While Android has a solid lineup of phones under $200 that monopolize the low to mid tier of phones.

Impact on MMPs

The stakeholder most impacted by this change will be the Mobile Measurement Partners (MMPs). They sell attribution as a service to advertisers. The utility of their service depends on having reliable accuracy. The IDFA is used for “user level attribution” which is by far the most helpful form of attribution. User level means that we can see which individual users were part of a specific campaign. We can then evaluate the value of each user, and when aggregating users in a campaign we can do so intelligently factoring in variability between users.

Part of the IDFA opt-in update was a push to MMPs to move away from user level attribution and towards SKAdNetwork. A service provided by Apple that provides aggregate-level info on an ad campaign, lacking information about individual users. This can be used to evaluate the ROI of a campaign but is not helpful for evaluating the quality of a user. Companies tend to have several KPIs measuring engagement for users beyond how much money they spend.

The other work around has been to use a method called fingerprinting. Fingerprinting relies on other device metadata like IP address or user agent to probabilistically match a user. The main problem is that this method is less accurate, the lack of IDFA will make matching on the advertisers backend more difficult as well.

Apple has said fingerprinting is not allowed under their new policy. But they have not started enforcing this policy strictly and several MMPs continue to use fingerprinting and other user level attribution methods. Some MMPs have gone ahead and started following Apple’s new policy, while other’s are not, giving them a competitive advantage. (I am being intentionaly vague about which MMPs I am talking about.) The major question is when and how Apple will choose to enforce these new policies and what penalties, if any, will come.

Conclusion

The MMPs are massive companies and Apple’s stance is likely to shake up the very lucrative marketing industry. Google may not be far behind in making the Google Advertising ID opt-in as well. While the policies have already changed, there has not been much change in enforcement.

One immediate change has been the number of users who have IDFA enabled, which changed as soon as the update went out. This proves that when people are given a choice they will choose privacy. I applaud Apple’s decision to give user’s the choice to protect their data. The marketing industry will need to learn to adapt to this new changing ecosystem.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; }
http://babbling.fish/rip-idfa/
Add Images to Gatsby Blog
When I created this blog I ran into a issue. I wanted to store my blog posts as plain text markdown files in Github. But I also wanted to…
Show full content

white cat

When I created this blog I ran into a issue. I wanted to store my blog posts as plain text markdown files in Github. But I also wanted to have large high resolution images. It is not advisable to store large binary data types like jpegs in git. The object gets copied on each commit. Over time images would increase the size of the repo until operations like git clone would run slowly, making the repo frustrating to work with.

I set out to find a way to add images to Gatsby starter blog template in a low friction way. I did not want use a CMS, that felt like overkill. I find they provide more features than I need, and the abstraction is not useful for a personal blog where I am the only contributor.

Most gatsby plugins built for a CMS assume that the CMS defines the blog posts completely and the website simply renders what is returned by the CMS. I wanted a mixture of markdown combined with photos stored in an object store.

white cat on trellis

Benefits of Markdown

There are several advantages to using markdown. It is interoperable, and not tied to a specific platform. The use of markdown has been a main stay in static blogging websites going back to some of the earliest blogs on the web. I also wanted my articles to be readable from Github without going to the website at all. It also allows people to contribute to the blog via a pull request.

Having the articles in Github gives me a sense of ownership. Since Github will likely remain free forever it feels like a safe bet. While a CMS could hide your data behind a pay wall one day once they decide to eliminate their free tier.

cat with eye infection

Gatsby Image

A requirement I gave myself was that I wanted to use the gatsby-image library. Because it offers great features like lazy loading, blurry image placeholder, image resizing, and optimization. The library only works if the files are present at build time. This rules out the use of a CMS because these work by requesting the image after the website has been built.

The solution I came up with was to store the images in S3 and download them to the local file system in the prebuild npm script hook. This way the images are present at build time but are not in the git repo itself.

cat behind tire

Implementation

I decided to write a script in python to pull images into the repo during build time. I choose python because it is installed by default in Ubuntu 20.04, the docker image I am using to build this website. The script is simple and pulls down images and puts them into a directory that matches their key prefix in S3.

In package.json you can create “pre” scripts. For example, if you create a script named prebuild it will automatically get executed before the build script. I placed the script here so it would run before the build.

  "scripts": {
    "prebuild": "python3 get_photos.py",
    "build": "GATSBY_EXPERIMENTAL_PAGE_BUILD_ON_DATA_CHANGES=true gatsby build --log-pages",
    "predevelop": "python3 get_photos.py",
    "develop": "gatsby develop",
  }
Deployment to CDN

I am using Netlify as the CDN for this blog. The caching of python libraries is done automatically when it detects a requirements.txt file. It also automatically caches all of the gatsby artifacts in the .cache created during build time.

I want to add the ability to cache these photos so they don’t need to be downloaded from S3 on every build. I have not yet figured out how to do this. The size of the blog currently is small and the transfer fees cost cents to do. The images get served up by the CDN itself for the live site, so I only pay for the transfer cost between S3 and Netlify during the build. Netlify is hosted in AWS so this download is incredibly fast, as it can utilize the AWS wired ethernet connection.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; } .default-dark { background-color: #1E1E1E; color: #D4D4D4; } .default-dark .mtk1 { color: #D4D4D4; } .default-dark .mtk8 { color: #CE9178; } .default-dark .mtk12 { color: #9CDCFE; } .default-dark .grvsc-line-highlighted::before { background-color: var(--grvsc-line-highlighted-background-color, rgba(255, 255, 255, 0.1)); box-shadow: inset var(--grvsc-line-highlighted-border-width, 4px) 0 0 0 var(--grvsc-line-highlighted-border-color, rgba(255, 255, 255, 0.5)); }
http://babbling.fish/cat-gallery/
SQL for Distributed Systems
Writing SQL for the purpose of data aggregation and reporting requires a separate mindset than writing SQL for the backend of a web…
Show full content

Writing SQL for the purpose of data aggregation and reporting requires a separate mindset than writing SQL for the backend of a web application. For a web application we typically are updating a specific user’s records. In data warehousing application we need to think of ways to efficiently backfill our data and run our SQL quickly at scale.

A backfill is when we want to populate data in a table for the past X days. To do so our SQL has to be repeatable and horizontally scaleable. We need to populate data for multiple days at a time in a way that does not leak data or result in duplication. Below are a few tips to achieve this goal.

Templates

Template libraries in python were originally developed for the purpose of web development as a way to generate static HTML content on the server side. Static content can be cheaply hosted on CDNs using fast web servers like NGINX. Templates are very useful for generating SQL because they allow for parameterization and embedding python logic inside a SQL script.

A common pattern in data warehousing is to parametrise a date for a given script. We are typically processing yesterday’s data for today’s report. Using a template, we can make it so yesterday’s date is dynamically generated in python based on the day we would like the report for.

Templating is also useful when you need to use an imperative programming language for tricky logic. For example, if you need to do complicated time zone conversions you can do so in python and write the code into the template itself. Having the python inside a template with a SQL extension allows your IDE to use SQL syntax highlighting, facilitates separation of concerns, and makes the database logic more discoverable using file search.

It will be tempting to embed the SQL into your application code. The issue is that the engineer has to read through all the application code to understand how it interacts with the database. The whole advantage of using SQL is that it is an abstraction, decoupled from the application that can be reasoned about independently. Some of my favorite SQL features are that it has interoperability across levels of the organization (including analysts, product team, data engineers, data scientists), and is declarative in nature. By embedding it you lose some of its strongest features.

Don’t
-- PostgreSQL
SELECT *
   FROM transaction
WHERE transaction_time
BETWEEN DATE(NOW() AT TIMEZONE 'America/New_York')
          AND DATE(NOW() AT TIMEZONE 'America/New_York' - INTERVAL ‘1 DAY’);

The problem with using the now stored function is that the script can only be used to process data relative to the current moment. A backfill is when we want to reprocess data in the past. Code written using a stored function to get the current time cannot be used to do a backfill without manually changing the script.

Let’s say we define a python function convert_to_eastern that takes a timestamp and converts it to the eastern time zone. We pass this function to the render funcion of our template library so that it is available inside the template.

Do
/* PostgreSQL with Jinja2 templating from Apache Airflow */

SELECT *
   FROM transaction
 WHERE transaction_time
BETWEEN DATE(NOW() AT TIMEZONE 'America/New_York')
         AND DATE(NOW() AT TIMEZONE 'America/New_York' - INTERVAL ‘1 DAY’);
 BETWEEN ‘{{ convert_to_eastern(execution_date.date().isoformat()) }}’
        AND  ‘{{ convert_to_eastern(execution_date - macros.timedelta(days=1)).isoformat() }}’

Using templated SQL we can define the execution_date as a parameter. Doing a backfill would then consist of running the same task iteratively with all the dates from some point in the past to today via task scheduler like Apache Airflow.

Prefer Overwriting

This lesson is somewhat counterintuitive, the primary purpose of a SQL database is to store stateful information about users or a business. It is natural then to think update statements are a good way to change the state for a given record. The problem with updates is that they are poorly optimized in data warehouses that are designed for high throughput reads and writes.

This is because the database uses immutable data structures under the hood for performance reasons. For example a database could be using compressions and encoding of strongly typed columns to achieve some of its performance gains. An update is often implemented as a delete and then insert, and too many delete operations can really hurt performance. Since a record is not stored contiguously in a column oriented database, a delete will touch several different blocks of memory to delete a single row.

When writing SQL for data warehousing it’s preferable to overwrite partitions then to upsert (in general there are exceptions of course). This has the benefit of simplifying some operations and being really fast in many cases. Let’s say we want to backfill for some previous days. We have written our code to insert rows into a table with the assumption that the rows are not already in the table. This method helps us avoid rewriting our insert as an update and insert, and writing additional deduplication logic. If we simply overwrite the existing partition we can reuse our insert statement knowing the data already there will simply be removed.

Let’s say we have a table in Postgres partitioned on a datetime column. If we write our code to overwrite the partition then the SQL logic is the same every time regardless if we are backfilling or not. When we run the code again it will just blow away what’s already there and replace it with what we want.

Don’t
-- hiveSQL
UPDATE transaction
SET
      amount = loading.amount,
      transaction_time = loading.transaction_time
FROM loading
WHERE transaction.id = loading.id;

INSERT INTO transaction
SELECT * FROM loading
WHERE (transaction_id) NOT IN (SELECT DISTINCT transaction_id FROM transaction);

In the above query first we update the records if they already exist in the target table, then we insert records that do not exist. We may also need a third step to delete any records that could have been produced as duplicates.

Do
-- hiveSQL
INSERT OVERWRITE TABLE transaction
PARTITION (transaction_time=transaction_time)
SELECT * FROM loading;

Another reason overwriting partitions is fast is because the old partition gets deallocated from memory instantly without an actual disk write. Meaning this will only take as long as writing the new data to disk, the old data does not impact performance.

Intermediate tables

It is often useful to use intermediate tables when writing SQL, especially in data warehouses that use columnar storage to achieve fast write speeds. One benefit is that it allows you to break your query into smaller more manageable pieces, making the SQL easier to read and reason about. The other benefit is performance, when you have the data you want to work with in an intermediate table you can add an index (or sort order and partitioning) to fit the subsequent joins and queries to optimize performance.

When using intermediate tables you should create a local temporary table scoped to your database session. This will prevent issues from running the same query in parallel across different processes with different arguments. Another reason to avoid using the same table is to avoid exclusive (X) locks when performing deletes and updates that can slow down performance or even cause deadlock.

Don’t
TRUNCATE intermediate_table;
INSERT INTO intermediate_table SELECT * FROM table

It will be tempting to use a permanent table as an intermediate, then to truncate and insert into it. The problem is this method does not scale to distributed systems and can lead to weird behavior if two processes are writing to the same table. One process could truncate the table immediately after a different process inserts into the same table. In a distributed system you want processes to be isolated from each other so they can safely run in parallel.

Do
-- postgreSQL
CREATE TEMP TABLE intermediate_table AS SELECT * FROM table;
Conclusion

This article went over a few tips to developing SQL for data warehouses that gets executed by a task runner in a cluster of workers. There are many other techniques that are dependent on the specific database vendor. The most optimal way to develop your SQL will depend on which database vendor you are using.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; } .default-dark { background-color: #1E1E1E; color: #D4D4D4; } .default-dark .mtk3 { color: #6A9955; } .default-dark .mtk4 { color: #569CD6; } .default-dark .mtk1 { color: #D4D4D4; } .default-dark .mtk8 { color: #CE9178; } .default-dark .mtk7 { color: #B5CEA8; } .default-dark .mtk11 { color: #DCDCAA; } .default-dark .grvsc-line-highlighted::before { background-color: var(--grvsc-line-highlighted-background-color, rgba(255, 255, 255, 0.1)); box-shadow: inset var(--grvsc-line-highlighted-border-width, 4px) 0 0 0 var(--grvsc-line-highlighted-border-color, rgba(255, 255, 255, 0.5)); }
http://babbling.fish/elt-cookbook-sql/
How to Crawl the Web with Scrapy
Web scraping is the process of downloading data from a public website. For example, you could scrape ESPN for stats of baseball players and…
Show full content

Web scraping is the process of downloading data from a public website. For example, you could scrape ESPN for stats of baseball players and build a model to predict a team’s odds of winning based on their players stats and win rates. Below are a few use-cases for web scraping.

  • Monitoring the prices of your competitors for price matching (competitive pricing).
  • Collecting statistics from various websites to create a dashboard e.g. COVID-19 dashboards.
  • Monitoring financial forums and twitter to calculate sentiment for specific assets.

One use-case I will demonstrate is scraping the website indeed.com for job postings. Let’s say you are looking for a job but you are overwhelmed with the number of listings. You could set up a process to scrape indeed every day. Then you can write a script to automatically apply to the postings that meet certain criteria.

Disclaimer: Web scraping indeed is in violation of their terms of use. This article is meant for educational purposes only. Before scraping a website be sure to read their terms of service and follow the guidelines of their robots.txt.

Data warehousing Considerations

Our spider will crawl all the pages available for a given search query every day, so we expect to store a lot of duplicates. If a post is up for multiple days then we will have a duplicate for each day the post is up. In order to be tolerant of duplication we will design a pipeline that captures everything then filters the data to create a normalized data model we can use for analysis.

First the data will be parsed from the web page then put into a semi-structured data structure, like JSON. From here the data structure will be stored in an object store (e.g. S3, GS). An object store is a useful starting place for the capture of our data. It is cheap, scaleable, and can change flexibly with our data model. Once the data is in our object store the work of the web scraper is done, and data has been captured.

The next step is to denormalize the data into something more useful for analysis. As previously mentioned the data contains duplicates. I would choose to use a SQL database because it has powerful analytical queries. It will also give me the ability to separate different entities, like companies, job postings, and locations. First all posts will go into a fact table (large write only table) with a timestamp showing when the posting was scraped, and when it was inserted into the table. From here we can de-normalize the data into a stateful table representing currently active postings.

A merge statement could be written to update and insert posting into our table representing live postings. From here we will also want to delete postings that have been deleted or have expired. Now we have a table that has been normalized, or in other words all the duplicates have been removed.

Setting up the Project

For this project I will be using Scrapy, because it comes with useful features and abstractions that will save you time and effort. For example, scrapy makes it easy to push your structured data into an object store like S3 or GCS. This is done by adding your credentials along with the bucket name and path to the configuration files generated by scrapy. The purpose is for long term storage and to have an immutable copy generated each time we run our scraper. Because S3 is a limitless object store, it is the perfect place for long term storage that will scale easily with any project.

To highlight a few more features, scrapy uses the Twisted framework for asynchronous web requests. This means the program can do work as it is waiting for the website server to respond to a request, instead of wasting time by waiting idly. Scrapy has an active community, so you can ask for help and look at examples from other projects. It also provides some more advanced options like running in a cluster with Redis, and user-agent spoofing but those are outside the scope of this tutorial.

Let’s start by creating a virtual environment in python and installing the dependencies. Then initializing a blank project where we will have our web crawlers. Be sure to execute this code in the top level directory of your project.

python3 -m venv venv
source ./venv/bin/activate
pip install scrapy lxml BeautifulSoup4 jupyterlab pandas
scrapy startproject jobs
cd jobs/jobs/
scrapy genspider indeed indeed.com

The code for parsing the web page will go in the file indeed.py in the spiders/ directory. A spider is an abstraction of a web crawler that generates HTTP requests and parses the page that is returned. There is a separate abstraction for processing and storage of the information called an ItemPipeline. Separation of such abstractions allow for decoupling, flexibility, and horizontal scaling. For example, You could send the result of the spider to multiple places without modifying the internal logic of the spider’s code. A good practice would be to save the results as a file in an object store for long term storage, and in a database for deduping and adhoc querying.

Development Environment

You can think of web scraping as reverse engineering other people’s work. There are many tools available to make web development more composable and manageable, such as the use of reusable templates or ES6 modules. These allow a web developer to reuse the same code in multiple places with the goal to combine simple pieces into something more complex. When you are scraping a website all you have is the actual rendered web page, we don’t have access to components that were used to build that page. So we have to work backwards using any tricks we can to get what we want.

The very first step of any web scraping project is to open the page you want to scrape in your web browser and explore the DOM with the “Inspect Element” in your browser of choice. Using the developer tools in your browser you can explore the structure of the DOM, or the skeleton of the page. Feel free to open the page now and explore it with developer tools. Throughout the process we will be using the browser to quickly navigate across the dom, in a visual and interactive way.

I like to develop the parsing code for my web scraping using the scrapy shell with iPython, the interactivity allows for quick feedback loops, allowing for lots of trial and error. The scrapy shell drops you into the scrapy context with all of the helpers and convenience functions already instantiated. These same objects are available to the spider during runtime. It also gives you access to all the features of iPython. You can start your shell with the code below.

scrapy shell 'https://www.indeed.com/q-medical-assistant-l-Boston,-MA-jobs.html'

The most important object here is the response. This contains the HTTP response from the web server to our HTTP GET request. It contains the HTML of the page, as well headers and other information associated with the HTTP response. The basic feedback loop is using the browser to identify what we want to parse, then testing that parsing code in the terminal.

Starting URLs

If you look at the link above, you may notice the URL contains our search query. This is how parameters are passed in HTTP GET requests (this example does not use the standardized format). We can utilize this information to programmatically try different search queries.

In the url, following the letter q we see our query which corresponds with the job title. After the letter l is the location. Let’s say in our use case we want to search for multiple locations and job titles. For example, medical assistants are also called patient care assistants.

In scrapy we can pass our spider several urls as starting points for scraping. We can pass it URLs that correspond with multiple locations and job titles to get this behavior. In this example I use the product function to generate every combination of location and job title, then I pass it to the spider as our starting point.

from itertools import product

job_titles = ["Medical Assistant", "Patient Care Technician", "Patient Care Assistant"]
states = ["MA"]
cities = ["Boston", "Cambridge", "Somerville", "Dorchester"]
urls = []
for (job_title, state, city) in product(job_titles, states, cities):
     urls.append(f"https://www.indeed.com/q-{'-'.join(job_title.split())}-l-{city},-{state}-jobs.html")

class IndeedSpider(scrapy.Spider):
    name = "indeed"
    allowed_domains = ["indeed.com"]
    start_urls = urls

Maybe you noticed that searching the same job title in adjacent cities would yield overlapping results. In other words, searching for the same title in Cambridge and Boston will return duplicates. One major challenge of any data project is deduplication. One strategy we could use is to have an application cache like redis, our program could check if a listing has already been parsed based on a natural primary key made up of its job title, company name, location, and posting date. We could even look for a unique ID in the DOM generated by the server. For simplicity, we will do deduplication at the very end once everything has been saved to our object store.

Parsing the Page

I will be using the python library BeautifulSoup4to parse the HTML because this is the library I am the most familiar with. By default, scrapy comes with CSS selectors and XPath selectors, both are powerful ways to write queries against the DOM. For this tutorial, you will need to import beautiful soup into our shell then parse the HTML into a BeautifulSoup object. I like BeautifulSoup because of the simplicity of the find API.

from bs4 import BeautifulSoup
soup = BeautifulSoup(response.text, features="lxml")

Please note that this parsing code will become broken when they rename their CSS classes. If you are finding that the examples are not working, then try fixing them to work with how the website is structured and named today.

To begin we will need to find a way to parse a list of all the jobs on a given page. We want to find a way to capture the top level node of each listing. Once we have the parent node of each listing we can then iterate on how to parse the attributes of each individual listing. Currently, each listing has a top level anchor element (<a>) with class="tapItem". We can use the CSS class to select all of these nodes that represent individual listings.

listings = soup.find_all("a", {"class": "tapItem"})

In this tutorial we will target the attributes job title, employer, location, and job description. The first three attributes can be found in the search results page, while the job description will require following a link to the job descriptions page. Starting with the attributes available on this page we can use the CSS class to target different attributes of the listing within each parent node.

for listing in listings:
    job_title = listing.find("h2", {"class": "jobTitle"}).get_text().strip()
    summary = listing.find("div", {"class": "job-snippet"}).get_text().strip()    # strip newlines
    company = listing.find("span", {"class": "companyName"}).get_text().strip()
    location = listing.find("div", {"class": "companyLocation"}).get_text().strip()

I wrote this code by finding the job title in the inspect element devtools then iterating on the code in iPython. In each case I found that selecting by HTML element and CSS class was sufficient to get the information I needed.

Now we need to retrieve the job description located on a separate page. To do this we will send another HTTP request to retrieve the page with the job description on it using the link we found on the search results page. We get the link url by combining the relative path found in the href of the anchor tag with the URL of the search results page found in our response object. Then we will ask scrapy to schedule the request with the asynchronous event loop.

We will be combining the job description (jd) with information we found on this page page, so we will pass the parsed attributes to the callback function so they can all be stored in the same item. Scrapy requires using the yield statement because functions are executed by an asynchronous scheduler. The parse_jd callback function will return a dictionary representing the job listing.

posting = {"job_title": job_title, "summary": summary, "company": company, "location": location}
jd_page = listing.get("href")
if jd_page is not None:
    yield response.follow(jd_page, callback=self.parse_jd, cb_kwargs=posting)

Now all that’s left to do is parse the job description and then yield the item for collection. Luckily for us the job description has a unique id. This is the easiest way to select a specific element. We want to save the URL for the job description, because if we end up applying we’ll need the link to find the apply button.

def parse_jd(self, response, **posting):
    soup = BeautifulSoup(response.text, features="lxml")
    jd = soup.find("div", {"id": "jobDescriptionText"}).get_text()
    url = response.url
    posting.update({"job_description": jd, "url": url})
    yield posting

Parsing is typically the most challenging and time consuming phase of writing a spider. Websites will change over time, so you will need to modify the code when it breaks. It can be useful to add a validation step that checks for None or empty strings, then raises an error. This way you get notified when the code is no longer working. This should only be done for critical path information, as missing information can be common.

The final step is to tell our scraper to go to the next page of the search results. We want the crawler to retrieve every post currently available, not just results on the first page. When the next page button is no longer available then we will know that we’re done.

next_page = soup.find("a", {"aria-label":  "Next"}).get("href")
if next_page is not None:
    next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

With these simple sets of instructions we now have a fairly robust process for extracting all the important details. The ability to follow the next page link means the crawler will scrape all the available results without any additional coding. Now we are done writing the spider and we can move on to the results.

Saving the Results

Now that the parser is written we can start utilizing some scrapy features. We have an option to use an ItemPipline to send each Item into a file object store or a database. We could utilize a highly available distributed database with a high write throughput (e.g. DynamoDB, Cassandra), and insert the items into a table as it’s running.

For this project I will use the builtin feeds option to create extracts from the command line. I would choose JSON lines because the JSON encoder will properly escape newline characters and quotes as part of the process of marshalling into JSON. This could save you some pain later on compared to CSV, where a single extra newline character or quotation mark can lead to parsing errors and headaches. Also a semi-structured format is useful when the schema is not static, each item may vary in what attributes it contains.

There is a way to specify the feed in a config file, but I will show you how to do it from the command line. We will create a JSON Lines file containing all the scraped data to our local filesystem. From there we can start interacting with the results to extract value.

scrapy crawl indeed -o jobs.jl

This will take some time to run depending on how many jobs titles and locations you have configured. All websites have some form of rate limiting. The rate limiting can be implemented by the CDN like Cloudflare, or by a load balancer/reverse proxy. Rate limiting prevents Denial of Service (DoS) attacks from taking down a web server. Scrapy will perform an exponential backoff until it gets a 200 response code, this means it will wait a little longer after each failure until a request is successful.

Analyzing the Results

Now that we have a file containing all of the postings we can begin our analysis. One way to do analysis is to use Jupyter notebooks. Jupyter notebooks are useful for exploratory data analysis and nonlinear programming. When we are coding for the purpose of discovery and analysis we don’t know what the end state looks like. As we are writing code we will need to change the order of things, and make big changes, that’s why it’s called nonlinear programming. Jupyter makes this style of programming easier.

Jupyter notebooks facilitate this type of development by utilizing cells and iPython. By using iPython as it’s backend, it allows you to work in a REPL environment. A REPL allows you to quickly be able to see the output of the code you executed and hold onto objects after they are created. The second piece is cells which can be moved, cut, copied, and deleted. Cells make it simple to change the order of execution, and change the scope of objects. I found the notebooks a good place to develop my analysis of the results.

Conclusion

The next step would be to aggregate and normalize the data, analyze it, then create some sort of user interface for accessing it. For example you could have a website that displays all the scraped websites sorted and filtered on custom criteria. You could use keyword detection to prioritize the listings that offer the opportunities you are most interested in.

Using scrapy to write a spider will get you past the first step, parsing data from a web page and saving it. This is the first component in any data pipeline that relies on data from web crawling. Once you have captured the data you can then start extracting value from it for whatever application you wish.

You can download a python file with all the code from this tutorial.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; } .default-dark { background-color: #1E1E1E; color: #D4D4D4; } .default-dark .mtk1 { color: #D4D4D4; } .default-dark .mtk11 { color: #DCDCAA; } .default-dark .mtk8 { color: #CE9178; } .default-dark .mtk15 { color: #C586C0; } .default-dark .mtk4 { color: #569CD6; } .default-dark .mtk10 { color: #4EC9B0; } .default-dark .mtk12 { color: #9CDCFE; } .default-dark .mtk3 { color: #6A9955; } .default-dark .grvsc-line-highlighted::before { background-color: var(--grvsc-line-highlighted-background-color, rgba(255, 255, 255, 0.1)); box-shadow: inset var(--grvsc-line-highlighted-border-width, 4px) 0 0 0 var(--grvsc-line-highlighted-border-color, rgba(255, 255, 255, 0.5)); }
http://babbling.fish/scraping-for-a-job/
Python for Distributed Systems
When writing data workflows to be horizontally scaleable it is important to make scripts idempotent. Idempotency means that a script can be…
Show full content

When writing data workflows to be horizontally scaleable it is important to make scripts idempotent. Idempotency means that a script can be run multiple times without side effects or dependencies. When you run the same script with the same input you should always get the same result. If a function changes the input, relies on a stateful variable then it is not idempotent.

The assumption allows code to be executed inside a stateless distributed workflow executor like Apache Airlfow or Luigi. Inside these environments the same code is executed across multiple machines in parallel with different input.

Idempotency also makes DevOps easier beacuse your team members know they can safely run any workflow without having to know what it does. This abstraction lets other engineers know that a worfklow can be rerun and it will not impact other workflows. The dependencies should encapsulated in the relationships between the tasks in the directed acyclic graph (DAG) itself.

Here are some common patterns useful for working with data pipelines and distributed systems in an idempotent way. The overall theme is that scripts need to be idempotent because they will fail, and backfilling data requires running multiple instances at the same to avoid things taking too long.

Loading Files into the Database

One common operation done in scripting is to load a file into the database. When loading the file it has to be temporarily stored inside the computer’s memory or storage inside a buffer. It’s important to use buffers in an idempotent way. It also means the script does not produce side effects, like mutating its input or locking a globally used file.

One common gotcha is to use a hardcoded file path as a buffer. This will result in issues if multiple processes try reading or writing to the same file at once.

Below is an example, this script downloads a file from s3 to a file named transaction.csv, then copies that file into the database, and finally deletes it. If two processes try to write to this file at the same time that could result in duplication. Even worse, if one process deletes the file before the other has finished writing, then we could have data loss!

Don’t
import boto3
import os
import psycopg2

# Download file from S3 to be copied into database
s3 = boto3.client("s3")
s3.download_file("BUCKET_NAME", "OBJECT_NAME", "transaction.csv") 

conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
with open("transaction.csv", "r") as f:
    next(f)  # skip headers line
    cur.copy_from(f, transaction, sep=",")
conn.commit()
os.remove("transaction.csv") 

Instead use a temporary file or an in-memory string or bytes buffer. An in-memory buffer can work if the amount of data is small or the machine has a lot of memory. For the in-memory buffer you can use the builtin io.StringIO object for storing the unicode string in memory, or io.Bytes for storing encoded binary data.

Using a temporary file is a more generalizable solution since files can get as large as the file system allows. For very large files this method is impractical and too slow, so you will need to find a way to keep your file size low.

The tempfile library allows for the creation of temporary directories and files with globally unique tables, on POSIX systems this works by utilizing the /tmp directory and grabbing a random string from /dev/urandom. Using tempfile with a context manager will automatically dereference the file to be garbage collected by the OS when you are done.

Do
import boto3
import psycopg2
import tempfile

s3 = boto3.client("s3")
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
s3.download_file("BUCKET_NAME", "OBJECT_NAME", "transaction.csv")
with tempfile.NamedTemporaryFile(mode="wb") as f:
# or
# with io.BytesIO() as f:
   s3.download_file("BUCKET_NAME", "OBJECT_NAME", f.name)
   f.seek(0)
   cur.copy_from(f, transaction, sep=",")
 conn.commit()
os.remove("transaction.csv")

Both methods support context managers (the with closures) so the memory / storage gets freed up automatically.

Using transactions

When executing SQL queries from python, it’s important not to leave the database in a partial or incomplete state. One way of accomplishing this is to wrap your queries inside of a transaction then only commit the transaction at the end. This guarantees that the database will only be updated once the script is complete.

Don’t
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
cur = conn.cursor()
cur.execute("DELETE FROM transaction WHERE created_at = '2019-02-12'")
conn.commit()
cur.execute("INSERT INTO transaction WHERE created_at = '2019-02-12 SELECT * FROM loading'")
conn.commit()

This script uses a DELETE and INSERT pattern to de-duplicate data. If the script were to fail after the DELETE before the INSERT the database would be left in a bad state with missing data.

Do
conn = psycopg2.connect("host=localhost dbname=postgres user=postgres")
with conn.cursor() as cur:
    cur.execute("DELETE FROM transaction WHERE created_at = '2019-02-12'")
    cur.execute("INSERT INTO transaction WHERE created_at = '2019-02-12 SELECT * FROM loading'")
    conn.commit()

By commiting at the end we know that the table will not be missing any data.

Functional Programming

When writing code that transforms data it is important to use the concept of “pure” functions from functional programming. A pure function is one that can execute without mutating its input or creating side effects. And of course a pure function is idempotent and always returns the same value for the same input. In practice this means creating a new object to return to the calling function rather than mutating an object passed as input.

As a side note, mutating state in a database is by definition not something that can be done in a 100% repeatable way. Since the goal is to change the state permanently to the current state. When the state changes in the future we will get a different result. Functional programming is meant to be used for changing the state of an application in a safe way, or transforming data in a repeatable way. In web development there is often a file named utils.js with pure functions that can be used to transform data in a safe way.

Don’t
def min_max_scale(s: list):
    min_val = min(s)
    max_val = max(s)
    for i in range(len(s)):
        s[i] = (s[i] - min_val) / (max_val - min_val)
    return s

The input list is being modified by the function. If the list is referenced elsewhere in the program this can cause unintended issues.

Do
def min_max_scale(s: list):
    min_val = min(s)
    max_val = max(s)
    new_series = []
    for i in range(len(s)):
       s[i] = (s[i] - min_val) / (max_val - min_val)
       new_series = (s[i] - min_val) / (max_val - min_val)
    return new_series

Also, any resources used to complete the action need to be safely closed. One additional note is to avoid modifying global state variables. If a function references a global variable it should be in a read-only context.

Conclusion

The theme throughout these examples is to write code that won’t produce side effects. Running code in an environment like Apache Airflow makes the assumption that the code is stateless. In other words it will run from a clean worker machine without any previous memory of the task. Any state should be stored in a database for long term storage. The program should also be able to run in parallel with different parameters. As this is the requirement for running a backfill.

These strategies have the added benefit of being resilient for failure. You should always work under the assumption that things will fail. Having an ability to recover from a failure and fill in missing data is critical to build a redundant and resilient system.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; } .default-dark { background-color: #1E1E1E; color: #D4D4D4; } .default-dark .mtk15 { color: #C586C0; } .default-dark .mtk1 { color: #D4D4D4; } .default-dark .mtk3 { color: #6A9955; } .default-dark .mtk8 { color: #CE9178; } .default-dark .mtk11 { color: #DCDCAA; } .default-dark .mtk12 { color: #9CDCFE; } .default-dark .mtk7 { color: #B5CEA8; } .default-dark .mtk4 { color: #569CD6; } .default-dark .mtk10 { color: #4EC9B0; } .default-dark .grvsc-line-highlighted::before { background-color: var(--grvsc-line-highlighted-background-color, rgba(255, 255, 255, 0.1)); box-shadow: inset var(--grvsc-line-highlighted-border-width, 4px) 0 0 0 var(--grvsc-line-highlighted-border-color, rgba(255, 255, 255, 0.5)); }
http://babbling.fish/elt-cookbook-python/
Getting Started with Vertica
Vertica is a distributed SQL database used for data warehousing and analytics. In this article I will go over the basic concepts needed to…
Show full content

Vertica is a distributed SQL database used for data warehousing and analytics. In this article I will go over the basic concepts needed to effectively design schemas and queries.

Records in Vertica are stored in columnar order, rather than row order like in OLTP databses such as MySQL and PostgreSQL. This allows for massively parrallel processing, as well as highly encodable and compressible data. Vertical storage is a common feature among most major data wharehouse technologies like BigQuery and RedShift.

One advantage is being able to have “wide” tables with many columns, without impactings performance. In a row order database, adding a column locks the table during the alter and will hurt performance of preexisting queries because it increases the amount of memory each row takes up. With larger rows that means there are less rows per page and it will take longer to paginate through the rows when doing a lookup.

For a column order database adding columns is instaneous and has no impact on performance. Wide tables are helpful in analytics especially when creating tidy data. It also allows for quick prototyping and deployment of new tables or altering existing tables. Data analysis tends to be an adhoc process with fast changing requirements, so its important to give yourself the flexibility to chagne your data model.

Massively Parallel Processing

Vertica is designed to run in a cluster of nodes. The nodes have a shared nothing architecture meaning each node has its own hardware resources and the nodes communicate over the internet. Being distributed also makes it highly available and scalable, with the ability to add or subtract nodes. This comes at the cost of consistency, any change must be propagated across all the nodes. For this reason, Vertica cannot efficiently update a large volume of rows.

Vertica is designed to do fact table to dimension table joins. A fact table is a large table that is append-only, like an event log. While a dimension table is a smaller table that is used to provide a description. For example, let’s say the transaction fact table contains a product code, we could have a product dimension table that contains the product type, description, and other details. If we want to add more information about a product we can do so by adding it to the dimension table. Fact tables are designed to be immutable as modifying records is expensive. The dimension tables are replicated across each node, so that joins can be done locally on each node.

When designed correctly, Vertica will perform a join locally on each node then collate the responses at the end.

Projections

In Vertica Projections are what is stored in the physical storage layer. They are analogous to tables in other SQL databases, with similar features like strongly typed columns and uniqueness constraints. A projection can contain a set of columns from multiple tables like a materialized view. Every projection stores a separate copy of the data. One projection can contain a subset of another projection, but with a different segmentation and sorting order.

Every table has a super projection containing all the columns of a logical table. You can create a projection optimized for a specific query. Projections offer high availability via buddy projections which are copies of the same projection, but the columns are stored on different nodes to create redundancy. Another added benefit is that if a node goes down the buddy projections allow Vertica to continue functions.

Since every projection stores a complete copy of the data, mutating a projection will result in all of the buddy projections and derivative projections being mutated as well. Bad performance on delete operations has the potential to slow down the whole cluster through cascading projections.

Segmentation

Segmentation is where the data is stored. For large tables, the goal is to evenly distribute the data across the cluster for parallel query execution to avoid shuffling. This is done by using a hash key on a unique identifier like a user_id, a hash key can contain multiple columns. If records need to be shuffled between nodes for a join you will see GLOBAL RESEGMENT GROUPS in the query plan. This is a hint that segmentation is a possible way to improve performance.

For small dimension tables you want it to be replicated across all nodes. To do this you mark the table as UNSEGMENTED ALL NODES this will create a copy of the table in each node in the cluster. This will allow the each node to load the table into memory for quick hash joins.

-- fact table of transactions
CREATE TABLE transactions (
    transaction_id INT NOT NULL,
    user_id INT NOT NULL,
    product_id INT NOT NULL,
    quantity INT NOT NULL,
    vendor VARCHAR(255) NOT NULL,
    dollar_amount NUMERIC  NOT NULL,
    execution_timestamp TIMESTAMPTZ
) -- sort columns are ordered by cardinality and based on user queries
  ORDER BY vendor, product_id, user_id 
  SEGMENTED BY HASH(user_id) ALL NODES 
;

-- dimension table containing information about products
CREATE TABLE products (
   product_id INT NOT NULL,
   product_type INT NOT NULL,
   product_description VARCHAR(1000) NOT NULL
) ORDER BY product_type, product_id  
  UNSEGMENTED ALL NODES              
;
Sorting and Cardinality

The sort order determines how the data is stored on disk. This is the primary way query performance is optimized. The order by clause determines the sort order. The goal is to use columns that get used in join, where, and group by clauses.

Because the data is stored in column arrays, that allows Vertica to use typed arrays. Typed arrays are highly compressible and encodeable since they are homogenous in nature. An advantage of encoding is that data does not have to be decoded to be processed, the smaller footprint of decoded data helps reduce IO across the network and disk.

The sort order functions analogously to an index in another database. Say for example you had a table that contained addresses with a field for state. If I write a query that selects users that live in California, Vertica only has to scan users where the state is California. Because the data is ordered, Vertica can quickly find California while avoiding a scan of all the other states. This kind of filtering is what leads to very fast queries on large amounts of data.

Using a sort column in a join allows a merge join, a special optimized type of join that works by using two pointers on two sorted lists. When used correctly with segmentation a merge join allows records to be quickly joined on a node without any shuffling.

One of the options for compression is Run Length Encoding, which works well on low cardinality data. Run Length Encoding (RLE) works by sorting the data then taking the counts of each value. Rather than storing each individual value it stores the counts for each unique value. When you include a column with RLE in the where clause Vertica will do a predicate push down and only access the values you want excluding the rest.When used in a where clause it allows the query optimizer to scan only a fraction of the rows, making the query more selective and much faster.

Run Length Encoding is a form of compression that relies on the repetition of values.
Run Length Encoding is a form of compression that relies on the repetition of values.

When a column with RLE is used in a group by clause, assuming there aren’t too many groups, then the optimizer can process each group more quickly. Because the data for a single group are colocated in the same block of memory.

A heuristic used for designing sort order is to use columns in ascending cardinliaty. It is a similar priniciple to a [Huffman Encoding](Link needed) where data is stored in a hierachical data structure optimized for fast access.

When a table is grouped by the columns it is sorted by that allows a GROUPBY PIPELINED. This allows Vertica to process one group at a time so it typically uses less memery and is faster than GROUPBY HASH.

Conclusion

Keeping the techniques mentioned here in mind is useful for speeding up the performance of any slow query and designing tables. I have seen improvements of 5x to 500x speedups just by adding segmentation or changing the sort order. If something is running really slowly it’s likely there are some opportunities to improve it.

When working with column order databases it requires a separate mindset. A production OLTP databse must be optimized for fast performance to support the customer facing production application. This means using normalization to make your database if ast.

On the other hand, a data wharehouse is optimized for ease of use and clarity of analysis. De-normalization is helpful in that it makes the relationship betweem objects self evident and encapsulated within the data model itself. While writing analytics queries the fewer joins the better, as every join is an opportunity to introduce erroneous duplication.

It is easy to forget that Vertica works entirely differently than a conventional SQL database. The fact that it uses SQL gives the false impression that you can simply work with the same mental models. In order to really get the most performance out keep in mind that Vertica is a columnar database, that utilizes segmentation and sort order to achieve massive data scaling.

.grvsc-container { overflow: auto; position: relative; -webkit-overflow-scrolling: touch; padding-top: 1rem; padding-top: var(--grvsc-padding-top, var(--grvsc-padding-v, 1rem)); padding-bottom: 1rem; padding-bottom: var(--grvsc-padding-bottom, var(--grvsc-padding-v, 1rem)); border-radius: 8px; border-radius: var(--grvsc-border-radius, 8px); font-feature-settings: normal; line-height: 1.4; } .grvsc-code { display: table; } .grvsc-line { display: table-row; box-sizing: border-box; width: 100%; position: relative; } .grvsc-line > * { position: relative; } .grvsc-gutter-pad { display: table-cell; padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } .grvsc-gutter { display: table-cell; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter::before { content: attr(data-content); } .grvsc-source { display: table-cell; padding-left: 1.5rem; padding-left: var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)); padding-right: 1.5rem; padding-right: var(--grvsc-padding-right, var(--grvsc-padding-h, 1.5rem)); } .grvsc-source:empty::after { content: ' '; -webkit-user-select: none; -moz-user-select: none; user-select: none; } .grvsc-gutter + .grvsc-source { padding-left: 0.75rem; padding-left: calc(var(--grvsc-padding-left, var(--grvsc-padding-h, 1.5rem)) / 2); } /* Line transformer styles */ .grvsc-has-line-highlighting > .grvsc-code > .grvsc-line::before { content: ' '; position: absolute; width: 100%; } .grvsc-line-diff-add::before { background-color: var(--grvsc-line-diff-add-background-color, rgba(0, 255, 60, 0.2)); } .grvsc-line-diff-del::before { background-color: var(--grvsc-line-diff-del-background-color, rgba(255, 0, 20, 0.2)); } .grvsc-line-number { padding: 0 2px; text-align: right; opacity: 0.7; } .default-dark { background-color: #1E1E1E; color: #D4D4D4; } .default-dark .mtk3 { color: #6A9955; } .default-dark .mtk4 { color: #569CD6; } .default-dark .mtk1 { color: #D4D4D4; } .default-dark .mtk11 { color: #DCDCAA; } .default-dark .mtk7 { color: #B5CEA8; } .default-dark .grvsc-line-highlighted::before { background-color: var(--grvsc-line-highlighted-background-color, rgba(255, 255, 255, 0.1)); box-shadow: inset var(--grvsc-line-highlighted-border-width, 4px) 0 0 0 var(--grvsc-line-highlighted-border-color, rgba(255, 255, 255, 0.5)); }
http://babbling.fish/vertica-survival-guide/