Bibliographic Wilderness

ActiveRecord neighbor vector search, with per-document max

jrochkind Feb 18, 2026

I am doing LLM “RAG” with rails ActiveRecord, postgres with the pgvector extension for vector similarity searches, and the neighbor gem. I am fairly new to all of this stuff, figuring it out by doing it.

I realized that for a particular use, I wanted to get some document diversity. I wanted to do a search of my chunks ranked by embedding vector similarity, getting the top k (say 12) chunks, but in some cases I only want, say, 2 chunks per document. So: the top 12 chunks by vector similarity, such that at most 2 chunks per document (an interview, in my app) are represented in those 12.

I decided I wanted to do this purely in SQL, hey, I’m using pgvector, wouldn’t it be most efficient to have pg do the 2-per-document limit?

Note: This may be a use case that isn’t a good idea! I have come to realize that maybe I want to just fetch 12*3 or *4 docs into ruby, and apply my “only 2 per document” limit there? Because I may want to do other things there anyway that I can’t do in postgres, like apply a cross-model re-ranker? So I dunno, but for now I did it anyway.
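For comparison, here is a minimal sketch of that "do it in ruby" alternative. It is not code from my app: the `Chunk` struct is a hypothetical stand-in for the ActiveRecord objects, and `limit_per_document` is an illustrative name. It assumes the over-fetched list is already ordered by distance.

```ruby
# Hypothetical stand-in for a Chunk record; in a real app these would be
# ActiveRecord objects from the neighbor query, already sorted by distance.
Chunk = Struct.new(:id, :document_id, :neighbor_distance, keyword_init: true)

# Walk the over-fetched, distance-ordered list once, keeping a chunk only if
# its document has not yet reached max_per_document, and stop at top_k.
def limit_per_document(chunks, max_per_document:, top_k:)
  seen = Hash.new(0)
  kept = []
  chunks.each do |chunk|
    break if kept.size >= top_k
    next if seen[chunk.document_id] >= max_per_document

    seen[chunk.document_id] += 1
    kept << chunk
  end
  kept
end

candidates = [
  Chunk.new(id: 1, document_id: "a", neighbor_distance: 0.10),
  Chunk.new(id: 2, document_id: "a", neighbor_distance: 0.12),
  Chunk.new(id: 3, document_id: "a", neighbor_distance: 0.15),
  Chunk.new(id: 4, document_id: "b", neighbor_distance: 0.20),
  Chunk.new(id: 5, document_id: "c", neighbor_distance: 0.25)
]

top = limit_per_document(candidates, max_per_document: 2, top_k: 4)
puts top.map(&:id).inspect
```

Because the list is walked in distance order, the result stays ordered by distance, and a re-ranker could be slotted in before the cap is applied.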

So this was some fancy SQL, and I was having trouble figuring out how to do it myself, so I asked ChatGPT, sure. It gave me an initial answer that worked, but…

- It turned out to be over-complicated; a simpler (to my understanding anyway) approach was possible.
- It turned out not to be performant: it was not using my postgres ‘HNSW’ indexes to make vector searches faster, and/or was insisting on sorting the entire table first, defeating the point of the indexes. How’d I know? Well, I noticed it was slower than expected (several seconds, or at times much more, to return), and then I did postgres explain/analyze… which I had trouble understanding… so I fed the results to ChatGPT and/or Claude, who confirmed, yeah buddy, this is a bad query, it’s not using your vector index properly.

I had to go on a few back and forths with both ChatGPT and Claude (this is just talking to them in a GUI, not actually using Claude Code or whatever), to get to a pattern that did use my index effectively. They kept suggesting things to me that either just didn’t work, or didn’t actually use the index, etc. I had to actually understand what they were suggesting, and tweak it myself, and have a dialog with them…

But I eventually got to this cool method that can take an arbitrary ActiveRecord relation which has already had the neighbor nearest_neighbors query applied to it… and wraps it in a larger query using CTEs that can limit the results to max-per-document.

I wondered if I should try to share this somewhere (would neighbor gem want a PR?), except… I’m realizing, like I said above, maybe this is not actually a very useful use case, and it’s better to do it in ruby… I’m still not necessarily getting the performance I expected either, although the analyze/explain says the indexes should be used properly.

So I just share here. Note the original base_relation may have its own internal joins to enforce additional conditions on retrieval etc. This assumes each Chunk ActiveRecord model has a document_id attribute, which we use to group for max-per-document.

    # We need to take base_relation and use it as a Postgres CTE (Common Table Expression)
    # to select from, but adding on a ROW_NUMBER window function, that lets us limit
    # to top max_per_interview
    #
    # Kinda tricky, especially to do with good index usage. Got solution from google and talking
    # to LLMs, including having them look at pg explain/analyze output.
    #
    # @param base_relation [ActiveRecord::Relation] original relation, it can have joins and conditions.
    #   It MUST have already had vector distance ordering applied to it with `neighbor` gem.
    #
    # @param max_per_interview [Integer] maximum results to include per interview (grouped by document_id)
    #
    # @param inner_limit [Integer] how many to OVER-FETCH in inner limit, to have enough even after
    #    applying max-per-interview.
    #
    # @return [ActiveRecord::Relation] that's been wrapped in a query to enforce max_per_interview limits. It does
    #   not have an overall limit set, caller should do that if desired, otherwise will be effectively
    #   limited by inner_limit.
    def wrap_relation_for_max_per_interview(base_relation:, max_per_interview:, inner_limit:)
      # In the inner CTE, have to fetch oversampled, so we can wind up with
      # hopefully enough in outer. Leaving inner unlimited would be a performance problem:
      # with a limit, the index means postgres doesn't need to calculate distance for every row.
      base_relation = base_relation.limit(inner_limit)

      # Now we have another CTE that assigns doc_rank within partitioned
      # interviews, from base. Raw SQL is just way easier here.
      partitioned_ranked_cte = Arel.sql(<<~SQL.squish)
        SELECT base.*,
          ROW_NUMBER() OVER (
            PARTITION BY document_id
            ORDER BY neighbor_distance
          ) AS doc_rank
        FROM base
      SQL

      # A wrapper SQL that incorporates both those CTEs, limiting to
      # doc_rank of how many we want per-interview, and overall making sure to
      # again order by vector neighbor_distance that must already have been included
      # in the base relation.
      base_relation.klass
        .select("*") # just pass through from underlying CTE queries.
        .with(base: base_relation)
        .with(partitioned_ranked: partitioned_ranked_cte)
        .from("partitioned_ranked")
        .where("doc_rank <= ?", max_per_interview)
        .order(Arel.sql("neighbor_distance"))
    end

Like I said, I am new to this LLM stuff, curious what others have to say here.

http://bibwild.wordpress.com/?p=13494

Help fund attorney for artist charged with transporting zines(?!?)

jrochkind Dec 4, 2025

I know Des Revol, and know them to be an incredibly kind, solid, reliable person.

For real he’s facing federal charges and threat of deportation because of subversive political pamphlets found in his trunk.

Des was not at the Prairieland demonstration. Instead, on July 6, after receiving a phone call from his wife in jail (one of the initial ten), Des was followed by Federal Bureau of Investigation (“FBI”) agents in Denton, Texas. They pretextually pulled him over due to a minor traffic violation and quickly arrested him at gunpoint. He was later charged with alleged “evidence tampering and obstruction of justice” based on a box of political pamphlets that he purportedly moved in his truck from his home (not his wife’s) to another house. This type of literature can be found in any activist house or independent bookstore. Des was briefly held at the Johnson County Jail, and then transferred to a federal prison, FMC Fort Worth, where he has been held ever since.

He is also currently on an ICE hold, and has been publicly targeted and doxxed on social media by both prominent fascists and ICE. Moreover, right after his arrest, his family experienced a brutal and intimidating nine-hour FBI raid of their home. Police confiscated everything from electronics to stickers and more zines.

I’m a librarian (and software engineer, but I have a librarian’s MLIS degree and have made a career in libraries). I know that if collecting and distributing controversial, dissident, and even “subversive” political literature is subject to this kind of state repression, our entire society is in trouble.

Attorneys are expensive. And they are all so busy right now.

If you can spare a few bucks, care about a free society, and feel that supporting Des is a good way to do it, please help contribute at his GoFundMe.

More info in this article from the Intercept, and at Des’ support website.

Des says:

I want to be very clear. I did not participate. I was not aware nor did I have any knowledge about the events that transpired on July 4 outside the Prairieland Detention Center. Despite not having any knowledge or not having been near the area at all, I was violently arrested at gunpoint for allegedly making a “wide turn.” My feeling is that I was only arrested because I’m married to Mari Rueda, who is being accused of being at the noise demo showing support to migrants who are facing deportation under deplorable conditions. For this accusation, she’s being threatened with a life sentence in prison.

My charge is allegedly having a box containing magazine “zines,” books, and artwork. Items that are in the possession of millions of people in the United States. Items that are available free online, and available to purchase at stores and online even at places like Amazon. Items that should be protected under the First Amendment “freedom of speech.” If this is happening to me now, it’s only a matter of time before it happens to you.

I believe there’s been almost 20 people arrested in supposed relation to this public noise demo. More than half of those were arrested days later despite not being in the area and are now facing a slew of outrageous charges, in what seems like a political persecution to instill fear on people exercising their First Amendment right.

http://bibwild.wordpress.com/?p=13433

Whisper-generated transcripts used in presentation of archival video

jrochkind Jul 16, 2025

Here at the Science History Institute, we have a fairly small, but growing, body of video/film in our Digital Collections, at present just over 100 items, around 70 hours total.

We wanted to add transcripts/captions to these videos, for accessibility to those who are hearing impaired, for searchability of video transcript content, and for general usability. We do not have the resources to do any manual transcription or even really Quality Assurance, but we decided that OpenAI whisper automated transcription software was of sufficient quality to be useful.

We have implemented whisper-produced transcriptions. We use them for on-screen text track captions; for an accompanying on-the-side transcript; and for indexing for searching in our collection.

I’ll talk about some of the choices we made and things we discovered, including: our experience using whisper to transcribe; implementing a text track for captions in the video screen (and some Safari weirdness with untitled empty track); synchronized transcript elsewhere on the page; improving the default video.js skin/theme; and trying to encourage Google to index transcript text.

*Baseline: The Chemist*, an amusing 1970s kind of impressionistic/conceptual promotional video for… chemists being really cool?

Some other interesting videos in our collection

Atomic Energy As a Force For Good (1955)
Putting Scientific Information to Work (1967) (of interest to my librarian colleagues: an early marketing video for the cutting edge citation indexing database of ISI, then the “Institute for Scientific Information”. Go to 31:53 for a weird 60s style recap of important events of the 60s?)
Pick of the Pod (1939)
Proposition 65: Troubled Waters for California (1986)

OpenAI Whisper Hosted API

Many of our library/museum/archives peers use the open source Whisper implementation, or a fork/variation of it, and at first I assumed I would do the same. However, we deploy largely on heroku, and I quickly determined that the RAM requirements (at least for medium and above models) and disk space requirements (a pip install openai-whisper added tens of gigs) were somewhere in between inconvenient and infeasible on the heroku cedar platform, at least for our budget.

These limitations and costs change on the new heroku fir platform, so at first I thought we might have to wait until we migrate there — but then I noticed whisper also existed, of course, on the commercial OpenAI API platform.

This is not exactly the same product as the open source OpenAI whisper, and exactly how it differs is not public. The hosted whisper does not let you (or require you to?) choose a model, it just uses whatever it uses. It has fewer options, and in the open source realm there are forks or techniques with even more options and features, like diarization or attempting to segment multi-lingual recordings by language. With the hosted commercial implementation, you just get what you get.

But on the plus side, it’s of course convenient not to have to provision your own resources. It is priced at $0.006 per minute of source audio, so that’s only around $25 to transcribe our meager 70 hour corpus, no problem, and no problem if we keep adding 70-200 hours of video a year as currently anticipated. If we start adding substantially more, we can reconsider our implementation.
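The arithmetic behind that estimate, as a tiny sketch. The only input is the per-minute price quoted above; `transcription_cost` is just an illustrative helper name.

```ruby
PRICE_PER_MINUTE = 0.006 # USD per minute of source audio, as quoted above

# Estimated cost in USD to transcribe the given number of hours of audio.
def transcription_cost(hours)
  (hours * 60 * PRICE_PER_MINUTE).round(2)
end

puts transcription_cost(70)  # the current corpus
puts transcription_cost(200) # upper end of the anticipated yearly additions
```

So even the high end of a year's anticipated additions comes to well under $100.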

Details of whisper API usage implementation

Whisper hosted API has a maximum filesize of 25 MB. Some of our material is up to two hours in length, and audio tracks simply extracted from this material routinely exceeded this limit. But by using ffmpeg to transcode to the opus encoding in an ogg container, using the opus voip profile optimized for voice, at a 16k bitrate — even 2 hours of video is comfortably under 25MB. This particular encoding was found often recommended on forums, with reports that downsampling audio like this can even result in better whisper results; we did not experiment, but it did seem to perform adequately.

ffmpeg -nostdin -y -i input_video.mp4 -vn -map_metadata -1 -ac 1 -c:a libopus -b:a 16k -application voip ./output.oga
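A rough sketch of driving that ffmpeg invocation from ruby, for anyone who wants to script it. This is not our production code: the method names are illustrative, and the size check assumes "25 MB" means 25 * 1024 * 1024 bytes, which I have not verified against the API.

```ruby
require "open3"

# Assumption: the hosted API's "25 MB" is treated here as 25 * 1024 * 1024 bytes.
MAX_UPLOAD_BYTES = 25 * 1024 * 1024

# Build the argument list for the ffmpeg invocation shown above.
def opus_extract_command(input_path, output_path)
  [
    "ffmpeg", "-nostdin", "-y",
    "-i", input_path,
    "-vn",                  # drop the video stream
    "-map_metadata", "-1",  # strip source metadata
    "-ac", "1",             # downmix to mono
    "-c:a", "libopus",
    "-b:a", "16k",
    "-application", "voip", # opus profile tuned for voice
    output_path
  ]
end

# Run ffmpeg and fail loudly if it errors or the result is still too big.
def extract_audio(input_path, output_path)
  _stdout, stderr, status = Open3.capture3(*opus_extract_command(input_path, output_path))
  raise "ffmpeg failed: #{stderr}" unless status.success?
  raise "audio still exceeds upload limit" if File.size(output_path) > MAX_UPLOAD_BYTES

  output_path
end
```

Passing the arguments as an array rather than one shell string avoids any quoting problems with odd filenames.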

Whisper can take a single source language argument. We already have metadata in our system recording the language of source material, so if there is only one listed, we supply that. Whisper can’t really handle multi-lingual content. Almost all of our current video corpus is only English, but we do have one video that is mixed English and Korean, with fairly poor audio quality; the whisper API refused to transcribe that, returning an error message (after a wait). When I tried that with open source whisper just out of curiosity, it did transcribe it, very slowly, but all the Korean passages were transcribed as “hallucinated” English. So erroring out may actually be a favor to us.

You can give whisper a “prompt” — it’s not conversational instructions, but is perhaps treated more like a glossary of words used. We currently give it our existing metadata “description” field, and that resulted in successful transcription of a word that never caught on, “zeugmatography” (inventor of MRI initially called it that), as well as correct spelling of “Eleuthère Irénée”. If it’s really just a glossary, we might do even better by taking all metadata fields, and just listing unique words once per word (or even trying to focus on less common words). But for now description as-is works well.

Here’s our ruby implementation, pretty simple, using the ruby-openai gem for convenience.
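What follows is a minimal sketch of what such a call can look like, not a copy of our actual code. It uses the ruby-openai gem's `client.audio.transcribe`. The helper names, the `OPENAI_API_KEY` environment variable, and the choice of `response_format: "vtt"` are assumptions for illustration, and I am not certain what shape the gem returns for non-JSON formats, so check that before relying on it.

```ruby
# Build the request parameters: only pass a language when exactly one is
# known, and pass the metadata description as the glossary-like prompt.
def whisper_parameters(audio_io, languages:, description:)
  params = {
    model: "whisper-1",
    file: audio_io,
    response_format: "vtt" # assumption: we want WebVTT back
  }
  # languages are expected as ISO-639-1 codes, e.g. "en"
  params[:language] = languages.first if languages.size == 1
  params[:prompt] = description if description && !description.empty?
  params
end

# Send a local audio file to the hosted whisper API.
def transcribe(audio_path, languages:, description:)
  require "openai" # the ruby-openai gem
  client = OpenAI::Client.new(access_token: ENV.fetch("OPENAI_API_KEY"))
  File.open(audio_path, "rb") do |file|
    client.audio.transcribe(
      parameters: whisper_parameters(file, languages: languages, description: description)
    )
  end
end
```

Keeping the parameter-building separate from the network call makes the language/prompt rules easy to test without hitting the API.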

I had at one point wanted to stream my audio, stored on S3, directly to a HTTP POST to API, without having to download the whole thing to a local temporary copy first. But ruby’s lack of a clear contract/API/shape of a “stream” object strikes again, making interoperability painful. This fairly simple incompat was just the first of many I encountered; patching this one locally just let me onto the next one, etc. One of my biggest annoyances in ruby honestly!

Results?

As others have found, the results of whisper are quite good, better than any other automated tool our staff had experimented with, and we think the benefits to research and accessibility remain despite what errors do exist. There isn’t much to say about all the things it gets right, and by listing the things it doesn’t I might give you the wrong idea, but it really does work quite well.

As mentioned, it can’t really handle multi-lingual texts
Errors and hallucinations were certainly noticed. In one case it accurately transcribed a musical passage as simply ♪, but oddly labelled it as “Dance of the Sugar Plum Fairies” (it was not). An audience clapping was transcribed as repeated utterances of “ok”. This example might be more troubling: some totally imaginary dialog replacing what is pretty unintelligible dialog in the original.
Perhaps the most troubling noticed is invented copyright attributions, such as © transcript Emily Beynon (apparently a common one?) — and some other names too. Putting imaginary erroneous copyright declarations in is not great. I am contemplating post-processing to strip any cue beginning with ©, which I think can’t possibly be legitimate?
Wide differences in how long the cues are, although consistent within a piece. But some pieces are transcribed with long paragraph-sized cues, and others just phrase by phrase. I am considering post-processing to join tiny phrase cues into sentences, up to so many words.
It seems to not infrequently, well into a video, start losing the synchronization of timing, getting 5, 10, or even 15 seconds behind? This is weird and I haven’t seen it commented upon before. The text is still as correct as ever, so mostly an inconvenience. See for instance at 9:09 in Baseline: The Chemist, definitely annoying. By 10:23 it’s caught up again, but quickly gets behind again, etc.
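The post-processing idea of stripping any cue beginning with © could be sketched like this. It is something I am contemplating, not something we run; `strip_copyright_cues` is an illustrative name, and the parsing assumes simple WebVTT where blocks are separated by blank lines.

```ruby
# Remove cues whose text begins with "©" from a WebVTT document.
# Blocks without a timing line (the WEBVTT header, NOTE blocks) pass through.
def strip_copyright_cues(vtt)
  blocks = vtt.strip.split(/(?:\r?\n){2,}/)
  kept = blocks.reject do |block|
    lines = block.lines.map(&:strip)
    timing_index = lines.index { |line| line.include?("-->") }
    next false unless timing_index

    cue_text = lines[(timing_index + 1)..].join(" ").strip
    cue_text.start_with?("©")
  end
  kept.join("\n\n") + "\n"
end

sample = <<~VTT
  WEBVTT

  00:00:01.000 --> 00:00:02.000
  Hello there

  00:00:03.000 --> 00:00:04.000
  © transcript Emily Beynon
VTT

puts strip_copyright_cues(sample)
```

Joining tiny phrase cues into sentences would be a similar pass over the same blocks, merging adjacent cues and keeping the first start time and last end time.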

We don’t really have the resources to QA even our fairly small collection, so we are choosing to follow in the footsteps of WGBH and their American Archive of Public Broadcasting, and publish it anyway, with a warning influenced by theirs:

I think in the post-pandemic zoom world, most users are used to automatically generated captions and all their errors, and understand the deal.

WGBH digitizes around 37K items a year, far more than we do. They also run an instance of FixIt+ for public-contributed “crowd-sourced” transcription corrections. While I believe FixIt+ is open source (or a really old version of it is?) and some other institutions may run it, we don’t think we’d get enough public attention, we only have a small number of videos, and we can’t really afford to stand up our own FixIt+ even if it is available. But it does seem like there is an unfilled need for someone to run a hosted FixIt+ service, charging a reasonable rate to institutions that will only need a handful of transcripts corrected a year?

We did implement an admin feature to allow upload of corrected WebVTT, which will be used in preference to the direct ASR (Automated Speech Recognition) ones. As we don’t anticipate this being done in bulk, right now staff just downloads the ASR WebVTT, uses the software of their choice to edit it, and then uploads a corrected version. This can be done for egregious errors as noticed, or using whatever policy/workflow our archival team thinks appropriate. We also have an admin feature to disable transcription for material it does not work well for, such as multi-lingual or silent material, or material with other problems.

Text Track Captions on Video

We were already using video.js for our video display. It provides APIs based on HTML5 video APIs, in some cases polyfilling/ponyfilling, in some cases just delegating to underlying APIs. It has good support for text tracks. At present, by default it uses ‘native’ text tracks instead of its own implementation on Safari (maybe only there?). You can force emulated text tracks, but it seemed advisable to stick to default native. This does mean it’s important to test on multiple browsers; there were some differences in Safari that required workarounds (more below).

So, for text tracks we simply provide a WebVTT file in a <track> element under the <video> element. Auto-generated captions (ASR, or “Automated Speech Recognition”, compare to OCR), don’t quite fit the existing categories of “captions” vs “subtitles” — we label them as kind captions and give them an English label “Auto-captions”, which we think/hope is a common short name for these.

Safari adding extra “Untitled” track for untagged HLS

For the most part, this just works, but there was one idiosyncrasy that took me a while to diagnose and determine the appropriate fix for. We deliver our video as HLS with a .m3u8 playlist. There is a newer metadata element in the .m3u8 playlist that can label the presence or absence of subtitles embedded in the HLS. But in the absence of this metadata, Safari (both MacOS and iOS I believe) insists on adding a text caption track called “Untitled”, which in our case will be blank. This has been noticed by some, but there is not as much discussion on the internet as I’d expect, to be honest!

One solution would be adding the metadata saying no text track is present embedded in HLS (since we want to deliver text tracks as external in <track> element instead). Somewhat astoundingly, simply embedding a fixed static value of CLOSED-CAPTIONS=NONE in the playlist (per the HLS spec this is an attribute of the EXT-X-STREAM-INF tag) on AWS Elemental MediaConvert (which I use) seems to take you into the “Professional Tier” costing 60% more! I suppose you could manually post-process the .m3u8 manifests yourself… including my existing ones…
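For the record, the manual post-processing route might look roughly like the sketch below. I did not go this way, so it is untested against Safari. It assumes the attribute goes on each EXT-X-STREAM-INF line of the master playlist (where the HLS spec, RFC 8216, defines CLOSED-CAPTIONS), and `declare_no_closed_captions` is an illustrative name.

```ruby
# Append CLOSED-CAPTIONS=NONE to every EXT-X-STREAM-INF tag in a master
# playlist that does not already carry a CLOSED-CAPTIONS attribute.
# All other lines pass through untouched.
def declare_no_closed_captions(playlist)
  playlist.each_line.map do |line|
    content = line.chomp
    if content.start_with?("#EXT-X-STREAM-INF:") && !content.include?("CLOSED-CAPTIONS=")
      "#{content},CLOSED-CAPTIONS=NONE\n"
    else
      line
    end
  end.join
end

master = <<~M3U8
  #EXTM3U
  #EXT-X-STREAM-INF:BANDWIDTH=800000,RESOLUTION=640x360
  low.m3u8
  #EXT-X-STREAM-INF:BANDWIDTH=2400000,RESOLUTION=1280x720
  high.m3u8
M3U8

puts declare_no_closed_captions(master)
```

The check for an existing attribute makes the rewrite safe to run more than once over the same manifest.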

Instead, our solution is simply, when on Safari, hook into events on video element to remove a text track with empty string language and title, which is what characterizes these. I adapted from similar solution in ramp, who chose this direction. They wrote theirs to apply to “mobile which is not android”; I found it actually was needed on Safari (iOS or MacOS Safari too), and indeed not Android Chrome (or iOS Chrome!).

I lost at least a few days figuring out what was going on here and how to fix it, hopefully you, dear reader, won’t have to!

Synchronized Transcript on page next to video

In addition to the text track caption in the video player, I wanted to display a synchronized transcript on the page next to/near the video. It should let you scroll through the transcript independent of the video, and click on a timestamp to jump there.

Unsure of how best to fit this on the screen with what UX — I decided to look at YouTube and base my design on what they did. (On YouTube, you need to expand description and look for a “show transcript” button at bottom of it — I did make my ‘show transcript’ button easier to find!)

It shows up next to the video, or when on a narrow screen right below it, in a ‘window in window’ internal scrolling box. I used some CSS to try to make the video and the transcript fit wholly on the screen at any screen size. An inner scrolling window that’s higher than the parent window I consider a UX nightmare to avoid!

Looking at YouTube, I realized that the feature that highlighted the current cue as the video played was also one I wanted to copy. That was the trickiest thing to implement.

I ended up using the HTML5 media element api and the events emitted by it and associated child objects, based on the text track with cues I had already loaded in my video.js-enhanced html5 video player. I can let the browser track cue changes and listen for events when they change, to highlight current cue.

If a track is set to mode hidden, then the user agent will still track the text cues and emit events for when they change, even though they aren’t displayed. Video.js (and probably native players) by default have UI that toggles between shown and disabled (which does not track cue changes), so I had to write a bit of custom code to switch non-selected text tracks to hidden instead of disabled
- (Some browsers and/or video.js polyfill code may have been emitting cueChange events even on disabled tracks, contrary to or not required by spec — important to test on all browsers!)
After that, it’s just listening to the cueChange HTML5 video event emitted on the track of our auto-captions, to know that we need to de-highlight any old cues, and highlight the new ones.
Had to write code to map from the HTML5 video Cue object returned as active cue, and find the div/span on page to highlight. It’s as simple as putting start time in a data- attribute, and matching it to startTime on Cue, except we’re string-matching, so it’s important to output identically, including digits after the decimal place etc.
At first I didn’t realize I could use the user-agent’s own cue-tracking code, and was trying to catch an event on every timeUpdate event, and calculate which cues included that timestamp myself. In addition to being way more work than required (the HTML5 video API has this feature for you to use) — safari wasn’t emitting timeUpdate events unless the status bar with current time was actually on screen!
In general, the media element api and events seemed to be an area with, for 2025, an unusual level of differences between browsers, or at least between more native Safari and more emulated video.js in other browsers. It definitely is important to do lots of cross browser testing. While I use it rarely, when I do I couldn’t do without BrowserStack and its free offerings for open source.
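On the string-matching point for cue start times, here is one way the server side could keep the format predictable. This is a sketch, not our actual markup: the `data-start-time` attribute name and helper names are hypothetical, and it assumes the browser side formats the cue time the same way (for instance with `cue.startTime.toFixed(3)`).

```ruby
require "cgi"

# Parse a WebVTT timestamp ("HH:MM:SS.mmm" or "MM:SS.mmm") into integer
# milliseconds, avoiding floats so formatting is exact.
def vtt_timestamp_to_ms(timestamp)
  clock, millis = timestamp.strip.split(".", 2)
  parts = clock.split(":").map(&:to_i)
  parts.unshift(0) while parts.size < 3
  hours, minutes, seconds = parts
  ((hours * 60 + minutes) * 60 + seconds) * 1000 + millis.to_i
end

# Seconds with exactly three decimal places, e.g. "62.050".
def cue_start_attribute(timestamp)
  seconds, millis = vtt_timestamp_to_ms(timestamp).divmod(1000)
  format("%d.%03d", seconds, millis)
end

# Render one transcript cue with its start time in a data attribute
# (hypothetical attribute name).
def cue_html(start_timestamp, text)
  %(<span class="cue" data-start-time="#{cue_start_attribute(start_timestamp)}">#{CGI.escapeHTML(text)}</span>)
end

puts cue_html("00:09:09.000", "Baseline: The Chemist")
```

Fixing the number of decimals on both sides sidesteps the difference between how ruby and javascript print floats by default.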

Improved Video Controls

The default video.js control bar seems to me to have undesirably small buttons and text, and is just not quite right in several ways. And there don’t seem to be very many alternative open source themes or skins (video.js seems to use both words for this), and what do exist are often kind of pushing on “interesting” aesthetics instead of being neutral/universal?

Adding the caption button was squeezing the default control bar tight, especially on small screens. With that and the increased attention to our videos that transcripts would bring, we decided to generally improve the UX of the controls, but in a neutral way that was still generic and non-branded. Again, I was guided by both youtube and the ramp player (here’s one ramp example), and also helped by ramp’s implementation (although beware some skin/theme elements are dispersed in other CSS too, not all in this file).

Before (default video.js theme)

After (local tweaked)

Scrubber/progress bar extends all the way across the screen, above the control bar (ala youtube and ramp)
- Making sure captions stay above the now higher controls was tricky. I think this approach using translateY works pretty well, but hadn’t seen it before? Also required a bit of safari-specific css for safari’s “native text tracks”. And some nice slide up/down animation on control bar show/hide matching youtube seems nice.
- buttons split between right and left, like again both youtube and ramp. Volume on right only cause it was somewhat easier.
Buttons themselves made bigger by default, and the icons on the buttons take up a larger portion of the button square. (They were all so tiny before!)
Underline the CC button when a text track is visible. From both youtube and ramp.
- Required some simple javascript to add/remove the visibility class appropriately.
Current time showing as current / total instead of by default elapsed, now matching youtube and what some of our users asked for. (Default video.js has some weird spacing that you have to really trim down once you show current and total).
Use newer CSS @container queries to make buttons smaller and/or remove some buttons when screen is smaller (had some weird problems with this actually crashing the video player in my actual markup though).

While fairly minor changes, I think it results in much better look and usability for a general purpose neutral theme/skin than video.js ships with out of the box. While relatively simple, it still took me a week or so to work through.

If there’s interest, I would find time to polish it up further and release it as more easily re-usable open source product, let me know?

Google indexable transcripts

One of the most exciting things about adding transcripts for our videos, is that text is now searchable and discoverable in our own web app.

It would be awfully nice if Google would index it too, so people could find otherwise hidden mentions of things they might want in videos. In the past, I’ve had trouble getting Google to index other kinds of transcripts and item text like OCRs. While hypothetically Google is visiting with javascript and can click on things like tabs or disclosure “show” buttons, conventional wisdom seems to be that Google doesn’t like to index things that aren’t on the initial page and require a click to see. That matches my experience, although others have had other experiences.

In an attempt to see if I could get google to index, I made a separate page with just transcript text — it links back to the main item page (with video player), and even offers clickable timecodes that will link back to player at that time. This transcript-only page is the href on the “Show Transcript” button, although a normal human user ordinarily would get JS executing to show transcript on same page instead when clicking on that link, you can right-click “open in new tab” to get it if you want. These extra transcript pages are also listed in my SiteMap.

There are already a few of these transcript pages showing up in google, so it seems to be a potentially useful move.

That isn’t to say how much SEO juice they have; but first step is getting them in the index, which I had trouble doing before with things that required a tab or ‘show’ click to be shown. So we’ll keep an eye on it! Of course, another option is making the transcript on-page right from the start without requiring a click to show, but I’m not sure if that really serves the user?

We also marked up our item pages with schema.org content for video, including tags around the transcript text (which is initially in DOM, but requires a ‘show transcript’ click to be visible). I honestly would not expect this to do much for increasing indexing of transcripts… I think according to google this is intended to give you a “rich snippet” for video (but not to change indexing)… but some people think Google doesn’t do too much of that anyway, and to have any chance I’d probably have to provide a persistent link to video as a contentUrl which I don’t really do. Or maybe it could make my content show up in Google “Video” tab results… but no luck there yet either. Honestly I don’t think this is going to do much of anything, but it shouldn’t hurt.

Acknowledgements

Thanks to colleagues in Code4Lib and Samvera community slack chats, for sharing their prior experiences with whisper and with video transcripts — and releasing open source code that can be used as a reference — so I didn’t have to spend my time rediscovering what they already had!

Especially generous were Mason Ballengee and Dananji Withana who work on the ramp project. And much thanks to Ryan “Harpo” Harbert for two sequential years of Code4Lib conference presentations on whisper use at WGBH (2024 video, 2025 video), and also Emily Lynema for a 2025 whisper talk.

I hope I have helped pass on a portion of their generosity by trying to share all this stuff above to keep others from having to re-discover it!

http://bibwild.wordpress.com/?p=13266

Using CloudFlare Turnstile to protect certain pages on a Rails app

jrochkind Jan 16, 2025

I work at a non-profit academic institution, on a site that manages, searches, and displays digitized historical materials: The Science History Institute Digital Collections.

Much of our stuff is public domain, and regardless we put this stuff on the web to be seen and used and shared. (Within the limits of copyright law and fair use; we are not the copyright holders of most of it). We have no general problem with people scraping our pages.

The problem is that, like many of us, our site is being overwhelmed with poorly behaved bots. Lately one of the biggest problems is with bots clicking on every possible combination of facet limits in our “faceted search”; this is not useful for them, and it overwhelms our site. “Search” pages are among the most resource-constrained categories of page on our present site, adding to the injury. Peers say even if we scaled up (auto or not), the bots sometimes scale up to match anyway!

One option would be putting some kind of “Web Application Firewall” (WAF) in front of the whole app. Our particular combination of team and budget and platform (heroku) makes a lot of these options expensive for us in licensing, staff time to manage, or both. Another option is certainly putting the whole thing behind the (ostensibly free) CloudFlare CDN and using its built-in WAF, but we’d like to avoid giving our DNS over to CloudFlare, I’ve heard mixed reviews of CloudFlare free staying free, and generally am trying to avoid contributing to CloudFlare’s monopolistic, unaccountable control of the internet.

Although ironically then, the solution we arrived at is still using CloudFlare, but Cloudflare’s Turnstile “captcha replacement”, one of those things that gives you the “check this box” or more often entirely non-interactive “checking if you are a bot” UXs.

[If you’re a tldr look at the code type, here’s the initial implementation PR in our open repo, there are some bug fixes since then
Update March 18 2025: There is now a gem implementation, bot_challenge_page. It is pre-1.0 and still evolving as we learn more about the problem space]

While this still might unfortunately lock people using unconventional browsers etc out (just the latest of many complaints on HackerNews), we can use this to only protect our search pages. Most of our traffic comes directly from Google to an individual item detail page, which we can now leave completely out of it. We have complete control of allow-listing traffic based on whatever characteristics, when to present the challenge, etc. And it turns out we had a peer at another institution who had taken this approach and found it successful, so that was encouraging.

How it works: Overview

While typical documented Turnstile usage involves protecting form submissions, we actually want to protect certain urls, even when accessed via GET. Would this actually work well? What’s the best way to implement it?

Fortunately, when asking around on a chat for my professional community of librarian and archivist software hackers, Joe Corall from Lehigh University said they had done the exact same thing (even in response to the same problem, bots combinatorially exploring every possible facet value), and had super usefully written it up, and it had been working well for them.

Joe’s article and the flowchart it contains are worth looking at. His implementation is as a Drupal plugin (and used in at least several Islandora instances); the VuFind library discovery layer recently implemented a similar approach. We have a Rails app, so needed to implement it ourselves, but with Joe paving the way (and patiently answering our questions, so we could start with the parameters that worked for him), it was pretty quick work, buoyed by the confidence this approach wasn’t just an experiment in the blue, but had worked for a similar peer.

Meter the rate of access, either per IP address, or as Joe did, in buckets per sub-net of client IP address.
Once client has crossed a rate limit boundary (in Joe’s case 20 requests per 24 hour period), redirect them to a page which displays the Turnstile challenge, and has the original destination in a query param in the URL.
Once they have passed the Turnstile challenge, redirect them back to their original destination, which now lets them in because you’ve stored their challenge pass in some secure session state.
In that session state record that they passed, and let them avoid a challenge again for a set period of time.

Joe allow-listed certain client domain names based on reverse IP lookup, but I’ve started without that, not wanting the performance hit on every request if I can avoid it. Joe also allow-listed their “on campus” IPs, but we are not a university and only have a few staff “on campus” and I always prefer to show the staff the same thing our users are seeing — if it’s inconvenient and intolerable, we want to feel the pain so we fix it, instead of never even seeing the pain and not knowing our users are getting it!

I’m going to explain and link to how we implemented this in a Rails app, and our choices of parameters for the various parameterized things. But also I’ll tell you we’ve written this in a way that paves the way to extracting to a gem — kept everything consolidated in a small number of files and very parameterized — so if there’s interest let me know. (Code4Lib-ers, our slack is a great place to get in touch, I’m jrochkind).

Ruby and Rails details, and our parameters

Here’s the implementing PR. It is written in such a way to keep the code consolidated for future gem extraction, all in the BotDetectController class, which means kind of weirdly there is some code to inject in class methods in the controller. While it does Turnstile now, it’s written with variable/class names such that analogous products could be made available.

Rack-attack to meter

We were already using rack-attack to rate-limit. We added a “track” monitor with our code to decide when a client had passed a rate-limit gate to require a challenge. We start with allowing 10 requests per 12 hours (Joe at Lehigh did 20 per 24 hours), batched together in subnets. (Joe did subnets too, but we do smaller /24 (ie x.y.z.*) for ipv4 instead of Joe’s larger /16 (x.y.*.*)).
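To make the subnet batching concrete, here is a minimal sketch in plain Ruby of turning a client IP into a bucket key. The helper name subnet_bucket is hypothetical, and the /64 prefix for IPv6 is an assumption of this sketch (the post only describes the IPv4 case); the value it returns is the kind of thing you might hand back from a rack-attack track block as the discriminator.

```ruby
require "ipaddr"

# Hypothetical helper: collapse a client IP to the subnet used as the
# rate-limit bucket key. /24 for IPv4 matches the choice described above;
# /64 for IPv6 is an assumption made for this sketch.
def subnet_bucket(ip_string, ipv4_prefix: 24, ipv6_prefix: 64)
  ip = IPAddr.new(ip_string)
  prefix = ip.ipv4? ? ipv4_prefix : ipv6_prefix
  # IPAddr#mask zeroes out the host bits, leaving the network address
  "#{ip.mask(prefix)}/#{prefix}"
end

puts subnet_bucket("203.0.113.57")                  # 203.0.113.0/24
puts subnet_bucket("203.0.113.200")                 # 203.0.113.0/24 (same bucket)
puts subnet_bucket("203.0.113.57", ipv4_prefix: 16) # 203.0.0.0/16 (Lehigh-style)
```

A smaller subnet means fewer unrelated clients share one counter, at the cost of a botnet spread over many subnets getting more free requests overall.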

Note that rack-attack does not use sliding/rolling windows for rate limits, but fixed windows that reset after the window period. This makes a difference especially when you use such a long period as we are, but it’s not a problem with our very low count per period, and it does keep the RAM use extremely efficient (just an integer count per rate limit bucket).
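To illustrate the fixed-window behavior, here is a toy counter in plain Ruby. This is not rack-attack's code (rack-attack keeps its counts in a cache store); the class name FixedWindowCounter and the in-memory Hash are assumptions for illustration. The point is that the window number is just the clock divided by the period, so a counter resets at the window boundary no matter when the requests arrived.

```ruby
# Toy fixed-window counter, illustration only. One integer per
# (bucket, window) pair is all the state that is needed.
class FixedWindowCounter
  def initialize(limit:, period:)
    @limit = limit
    @period = period # window length in seconds
    @counts = Hash.new(0)
  end

  # Counts this request and returns true once the bucket has gone over
  # the limit within the current window.
  def over_limit?(bucket, now: Time.now)
    window = now.to_i / @period # all requests in one window share this number
    @counts.delete_if { |(_bucket, w), _count| w < window } # forget old windows
    (@counts[[bucket, window]] += 1) > @limit
  end
end

counter = FixedWindowCounter.new(limit: 10, period: 12 * 60 * 60)
late = Time.at(43_199) # last second of window 0
early = Time.at(43_200) # first second of window 1

results = 11.times.map { counter.over_limit?("203.0.113.0/24", now: late) }
puts results.last                                        # true (11th request)
puts counter.over_limit?("203.0.113.0/24", now: early)   # false (count reset)
```

So a client can get up to twice the limit by straddling a boundary, which is the difference from a sliding window; with a limit this low that hardly matters.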

When the rate limit is reached, the rack-attack block just sets a key/value in the rack_env to tell another component that a challenge is required. (setting in the session may have worked, but we want to be absolutely sure this will work even if client is not storing cookies, and this is really only meant as this-request state, so rack env seemed the good way to set state in rack-attack that could be seen in a rails controller)

Rails before_action filter to enforce challenge

There’s a Rails before_action filter that we just put on the application-wide ApplicationController, that looks for the “bot challenge key” required in the rack env — if present, and there isn’t anything in the session saying they have already passed a bot challenge, then we redirect to a “challenge” page, that will display/activate Turnstile.

We simply put the original/destination URL in a query param on that page. (And include logic to refuse to redirect to anything but a relative path on same host, to avoid any nefarious uses).
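As a rough sketch of that "relative path on same host only" check, using just Ruby's standard URI library: the helper name safe_local_path and the "/" fallback are hypothetical choices for this example, not necessarily what our PR does.

```ruby
require "uri"

# Hypothetical helper: accept only a same-host relative path (plus query
# string) as a redirect destination; anything else becomes the fallback.
def safe_local_path(dest, fallback: "/")
  uri = URI.parse(dest.to_s)
  # A scheme or host means an absolute or protocol-relative ("//host") URL
  return fallback if uri.scheme || uri.host

  path = uri.path.to_s
  return fallback unless path.start_with?("/") && !path.start_with?("//")

  uri.query ? "#{path}?#{uri.query}" : path
rescue URI::InvalidURIError
  # Unparseable input is never trusted
  fallback
end

puts safe_local_path("/catalog?q=chemistry")   # /catalog?q=chemistry
puts safe_local_path("https://evil.example/")  # /
puts safe_local_path("//evil.example/catalog") # /
```

Failing closed to a known-safe path seemed the simplest choice for a sketch; a real implementation might prefer to show an error instead.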

The challenge controller

One action in our BotDetectController just displays the Turnstile challenge. The Cloudflare Turnstile callback gives us a token that we must verify server-side with the Turnstile API, to confirm the challenge was really passed.

The front-end does a JS/xhr/fetch request to the second action in our BotDetectController. The back-end verify action makes the API call to Turnstile, and if the challenge passed, sets a value in the Rails (encrypted and signed, secure) session with the time of the pass, so the before_action guard can give the user access.
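For the server-side verification step, here is a minimal sketch using only Ruby's standard library. The method name turnstile_passed? and the injectable http_post argument (there so the logic can be exercised without a network call) are assumptions of this sketch, not our actual controller code. The endpoint and the secret/response/remoteip form fields are Cloudflare's siteverify API.

```ruby
require "net/http"
require "json"
require "uri"

TURNSTILE_VERIFY_URI = URI("https://challenges.cloudflare.com/turnstile/v0/siteverify")

# Hypothetical helper: returns true only if Turnstile confirms the token.
# http_post is injectable purely so this can be tested offline.
def turnstile_passed?(token, secret:, remote_ip: nil, http_post: Net::HTTP.method(:post_form))
  params = { "secret" => secret, "response" => token }
  params["remoteip"] = remote_ip if remote_ip

  response = http_post.call(TURNSTILE_VERIFY_URI, params)
  # The JSON body carries a boolean "success" field
  JSON.parse(response.body)["success"] == true
rescue JSON::ParserError
  false # an unreadable answer counts as a failed challenge
end
```

In a controller you would call this with the token the front-end sent and your secret key, and only on a true result record the pass in the session.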

If the JS in front gets a go-ahead from the back-end, it uses JS location.replace to go to the original destination. This conveniently removes the challenge page from the user’s browser history, as if it never happened, browser back button still working great.

In most cases the challenge page, if non-interactive, won’t be displayed for more than a few seconds. (The language has been tweaked since these screenshots).

We currently have a ‘pass’ good for 24 hours — once you pass a turnstile challenge, if your cookies/session are intact, you won’t be given another one for 24 hours no matter how much traffic. All of this is easily configurable.
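The decision the before_action guard has to make can be sketched in plain Ruby like this. The session keys ("bot_pass_at", "bot_pass_ip") and method names are hypothetical, a plain Hash stands in for the Rails session, and the IP comparison reflects that our pass is tied to the client IP to avoid session replay.

```ruby
PASS_TTL = 24 * 60 * 60 # seconds a pass stays good (the 24 hours above)

# Hypothetical: what the verify action would store after a passed challenge.
def record_pass!(session, ip:, now: Time.now)
  session["bot_pass_at"] = now.to_i
  session["bot_pass_ip"] = ip
end

# Hypothetical: what the before_action guard would ask. `flagged` stands
# for "rack-attack set the challenge-required key in the rack env".
def challenge_needed?(flagged:, session:, ip:, now: Time.now)
  return false unless flagged # under the rate limit: never challenge

  passed_at = session["bot_pass_at"]
  return true if passed_at.nil?                # never passed
  return true if session["bot_pass_ip"] != ip  # pass belongs to another IP

  now.to_i - passed_at >= PASS_TTL             # pass expired?
end

session = {}
record_pass!(session, ip: "203.0.113.5", now: Time.at(1_000))
puts challenge_needed?(flagged: true, session: session, ip: "203.0.113.5", now: Time.at(4_600))  # false
puts challenge_needed?(flagged: true, session: session, ip: "203.0.113.5", now: Time.at(87_400)) # true
```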

If the challenge DOES fail for some reason, the user may be looking at the Challenge page with one of two kinds of failures, and some additional explanatory text and contact info.

Limitations and omissions

This particular flow only works for GET requests. It could be expanded to work for POST requests (with an invisible JS created/submitted form?), but our initial use case didn’t require it, so for now the filter just logs a warning and fails for POST.

This flow also isn’t going to work for fetch/ajax requests, it’s set up for ordinary navigation, since it redirects to a challenge then redirects back. Our use case is only protecting our search pages — but the blacklight search in our app has a JS fetch for “facet more” behavior. Couldn’t figure out a good/easy way to make this work, so for now we added an exemption config, and just exempt requests to the #facet action that look like they’re coming from fetch. Not bothered that an “attacker” could escape our bot detection for this one action; our main use case is stopping crawlers crawling indiscriminately, and I don’t think it’ll be a problem.

To get through the bot challenge requires a user-agent to have both JS and cookies enabled. JS may have been required before anyway (not sure), but cookies were not. Oh well. Only search pages are protected by the bot challenge.

The Lehigh implementation does a reverse-lookup of the client IP, and allow-lists clients from IP’s that reverse lookup to desirable and well-behaved bots. We don’t do that, in part because I didn’t want the performance hit of the reverse-lookup. We have a Sitemap, and in general, I’m not sure we need bots crawling our search results pages at all… although I’m realizing as I write this that our “Collection” landing pages are included (as they show search results)… may want to exempt them, we’ll see how it goes.

We don’t have any client-based allow-listing… but would consider just exempting any client that has a user-agent admitting it’s a bot, all our problematic behavior has been from clients with user-agents appearing to be regular browsers (but obviously automated ones, if they are being honest).

Possible extensions and enhancements

We could possibly only enable the bot challenge when the site appears “under load”, whether that’s a certain number of overall requests per second, a certain machine load (but any auto-scaling can make that an issue), or size of heroku queue (possibly same).

We could use more sophisticated fingerprinting for rate limit buckets. Instead of IP-address-based, colleague David Cliff from Northeastern University has had success using HTTP user-agent, accept-encoding, and accept-language to fingerprint actors across distributed IPs, writing:

I know several others have had bot waves that have very deep IP address pools, and who fake their user agents, making it hard to ban.

We had been throttling based on the most common denominator (url pattern), but we were looking for something more effective that gave us more resource headroom.

On inspecting the requests in contrast to healthy user traffic we noticed that there were unifying patterns we could use, in the headers.

We made a fingerprint based on them, and after blocking based on that, I haven’t had to do a manual intervention since.

def fingerprint
  result = "#{env["HTTP_ACCEPT"]} | #{env["HTTP_ACCEPT_ENCODING"]} | #{env["HTTP_ACCEPT_LANGUAGE"]} | #{env["HTTP_COOKIE"]}"
  Base64.strict_encode64(result)
end

…the common rule we arrived at mixed positive/negative discrimination using the above

request.env["HTTP_ACCEPT"].blank? && request.env["HTTP_ACCEPT_LANGUAGE"].blank? && request.env["HTTP_COOKIE"].blank? && (request.user_agent.blank? || !request.user_agent.downcase.include?("bot".downcase))

so only a bot that left the fields blank and lied with a non-bot user agent would be affected

We could also base rate limit or “discriminators” for rate limit buckets on info we can look up from the client IP address, either a DNS or network lookup (performance worries), or perhaps a local lookup using the free MaxMind databases that also include geocoding and some organizational info.

Does it work?

Too early to say, we just deployed it!

I sometimes get annoyed when people blog like this, but being the writer, I realized that if I wait a month to see how well it’s working to blog — I’ll never blog! I have to write while it’s fresh and still interesting to me.

But encouraged that colleagues say very similar approaches have worked for them. Thanks again to Joe Corall for paving the way with a Drupal implementation, blogging it, discussing it on chat, and answering questions! And all the other librarian and cultural heritage technologists sharing knowledge and collaboration on this and many other topics!

I can say that already it is being triggered a lot, by bots that don’t seem to get past it. This includes Googlebot and Meta-ExternalAgent (which I guess is AI-related; we have no particular use-based objections we are trying to enforce here, just trying to preserve our resources). While Google also has no reason to combinatorially explore every facet combination (and has a sitemap), I’m not sure if I should exempt known resource-considerate bots from the challenge (and whether to do so by trusting user-agent or not; our actual problems have all been with ordinary-browser-appearing user-agents).

Update 27 Jan 2025

Our original config — allowing 10 search results per IP subnet before turnstile challenge — was not enough to keep the bot traffic from overwhelming us. Too many botnets had enough IPs making apparently fewer than 10 requests each.

Lowering that to 2 requests was enough to reduce traffic enough. (Keep in mind that a user should only get one challenge per 24 hours unless IP address changes — although that makes me realize that people using Apple’s “private browsing” feature may get more, hmm).

Pretty obvious on these heroku dashboard graphs where our successful turnstile config was deployed, right?

I think I would be fine going down to challenge on first search results, since a human user should still only get one per 24 hour period — but since the “success passed” mark in session is tied to IP address (to avoid session replay for bots to avoid the challenge), I am now worried about Apple “private browsing”! In today’s environment with so many similar tests, I wonder if private browsing is causing problems for users and bot protections?

You can see on the graph a huge number of 3xx responses — those are our redirects to challenge page! The redirect to and display of the challenge page seem to be cheap enough that they aren’t causing us a problem even in high volume — which was the intent, nice to see it confirmed at least with current traffic.

We are only protecting our search result page, not our item detail pages (which people often get to directly from Google); this also seems successful. The real problem was the volume of hits from so many bots trying to combinatorially explore every possible facet limit, which we have now put a stop to.

http://bibwild.wordpress.com/?p=13100

Accessing capybara-screenshot artifacts on Github CI

jrochkind Dec 5, 2024

We test our Rails app with rspec and capybara.

For local testing, we use the capybara-screenshot plugin which “Automatically save screen shots when a Capybara scenario fails”, even when the tests were running in a headless browser you couldn’t see at all. This can be very helpful in debugging tricky capybara failures, especially ones that are “flaky” and hard to reproduce failure on.

We run all our tests automatically as CI in Github Actions.

I was running into some capybara browser tests that were failing flakily and inconsistently on Github Actions, but I could not manage to reproduce locally at all. What was going on? It would be super helpful to have access to the capybara-screenshots generated on the github actions run.

Is there a way to do it? Yes! Store them as Github Actions “artifacts“. My last two steps of my github workflow .yml look like this, the one that runs rspec, and then the one that saves any capybara-screenshot screenshot artifacts!

        - name: Run tests
          run: |
            bundle exec rspec

        - name: Archive capybara failure screenshots
          uses: actions/upload-artifact@v4
          if: failure()
          with:
            name: dist-without-markdown
            path: tmp/capybara/*.png
            if-no-files-found: ignore

I already had capybara-screenshot set up.

Now, if a capybara test fails, I can look at the screenshot filename reported for that particular failed test in the Github CI log.

And then, down under the “Archive capybara failure screenshots” action, I can find a clickable URL which, when clicked, downloads a zip file that contains any/all archived capybara screenshots. If there is more than one, I can match filenames to the filename reported in a certain spec failure.

And I confirmed that last step with an if: failure() does not change the failure status of the job — the job is still marked by Github CI as failed, as it should be, but the archiving step still runs to archive the failure artifacts.

Very handy!

http://bibwild.wordpress.com/?p=13072

Getting rspec/capybara browser console output for failed tests

jrochkind Oct 8, 2024

I am writing some code that does some smoke tests with capybara in a browser of some Javascript code. Frustratingly, it was failing when run in CI on Github Actions, in ways that I could not reproduce locally. (Of course it ended up being a configuration problem on CI, which you’d expect in this case). But this fact especially made me really want to see browser console output — especially errors, for failed tests, so I could get a hint of what was going wrong beyond “Well, the JS code didn’t load”.

I have some memory of being able to configure a setting in some past capybara setup, to make error output in browser console automatically fail a test and output? But I can’t find any evidence of this on the internet, and at least I’m pretty sure there is no way to do that with my current use of selenium-webdriver and headless chrome to run capybara tests.

So I worked out this hacky way to add any browser console output to the failure message on failing tests only. It requires using some “private” rspec API, but this is all I could figure out. I would be curious if anyone has a better way to accomplish this goal.

Note that my goal is a bit different than “make a test fail if there’s error output in browser console”, although I’m potentially interested in that too, here I wanted: for a test that’s already failing, get the browser console output, if any, to show up in failure message.

# hacky way to inject browser logs into failure message for failed ones
  after(:each) do |example|
    if example.exception
      browser_logs = page.driver.browser.logs.get(:browser).collect { |log| "#{log.level}: #{log.message}" }

      if browser_logs.present?
        # pretty hacky internal way to get browser logs into 
        # existing long-form failure message, when that is
        # stored in exception associated with assertion failure
        new_exception = example.exception.class.new("#{example.exception.message}\n\nBrowser console:\n\n#{browser_logs.join("\n")}\n")
        new_exception.set_backtrace(example.exception.backtrace)

        example.display_exception = new_exception
      end
    end
  end

I think by default, with selenium headless chrome, you should get browser console that only includes error/warn log levels but not info, but if you aren’t getting what you want or want more you need to make a custom Capybara driver with custom loggingPrefs config that may look something like this:

Capybara.javascript_driver = :my_headless_chrome

Capybara.register_driver :my_headless_chrome do |app|
  Capybara::Selenium::Driver.load_selenium
  browser_options = ::Selenium::WebDriver::Chrome::Options.new.tap do |opts|
    opts.args << '--headless'
    opts.args << '--disable-gpu'
    opts.args << '--no-sandbox'
    opts.args << '--window-size=1280,1696'

    opts.add_option('goog:loggingPrefs', browser: 'ALL')
  end
  Capybara::Selenium::Driver.new(app, browser: :chrome, options: browser_options)
end

http://bibwild.wordpress.com/?p=12999

keyword-like arguments to JS functions using destructuring

jrochkind Oct 2, 2024

I am, unusually for me, spending some time writing some non-trivial Javascript, using ES modules.

In my usual environment of ruby, I have gotten used to really preferring keyword arguments to functions for clarity. More than one positional argument makes me feel bad.

I vaguely remembered there is new-fangled way to exploit modern JS features to do this with JS, including default values, but was having trouble finding it. Found it! It involves using “destructuring”. Putting it here for myself, and in case this text gives someone else (perhaps another rubyist) better hits for their google searches than I was getting!

function freeCar({name = "John", color, model = "Honda"} = {}) {
  console.log(`Hi ${name}, you get a ${color} ${model}`);
}

freeCar({name: "Joe", color: "Green", model: "Lincoln"})
// Hi Joe, you get a Green Lincoln

freeCar({color: "RED"})
// Hi John, you get a RED Honda

freeCar()
// Hi John, you get a undefined Honda

freeCar({})
// Hi John, you get a undefined Honda

http://bibwild.wordpress.com/?p=12983

Sort order of results in OCLC AssignFAST/fastsuggest API

jrochkind Jul 25, 2024

We have long used the free OCLC AssignFAST API to power an auto-suggest/auto-complete in some of our staff metadata entry forms. (Note: While OCLC calls this service “AssignFAST” in docs, the base URL instead includes the term fastsuggest as in http://fast.oclc.org/searchfast/fastsuggest?, which may be confusing for some!)

Recently our staff reported they thought the sort order had changed in the results returned — before they think they could enter, say, “Philadelphia” and get “Philadelphia–Pennsylvania” as a suggested response — which was the one they wanted. But now, that result wasn’t even in the first 15 results, and thus wasn’t included in our drop-down auto-suggest menu. It looks like the current result order is strictly alphabetical, so includes a lot of obscure or unlikely to be useful hits from a query matching any part of a term.

We contacted OCLC support at this form we found for “Contact the FAST team”. They got back to us relatively quickly (hooray!), but unfortunately could neither confirm nor deny that any changes had happened to the order of results from that API any time recently, or any time at all.

But they did, eventually, tell us that we could in fact control the sort order of the API response, and make it return results in order of usage, by adding a query param to the API request — &sort=usage+desc. I would guess “usage” means frequency of use in the OCLC catalog record database.

Eg, https://fast.oclc.org/searchfast/fastsuggest?&query=philadelphia&query&queryReturn=suggestall%2Cidroot%2Cauth%2Ctype&suggest=autoSubject&rows=20&sort=usage+desc
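If you are building that request from Ruby, a minimal sketch might look like the following. The helper name fast_suggest_url is hypothetical; the parameter names and values are just the ones from the example URL above, with the sort param added. URI.encode_www_form turns the space in "usage desc" into the + seen in the URL.

```ruby
require "uri"

# Hypothetical helper: build a fastsuggest URL that asks for results
# ordered by usage, using the same params as the example URL above.
def fast_suggest_url(query, rows: 20)
  params = {
    "query" => query,
    "queryReturn" => "suggestall,idroot,auth,type",
    "suggest" => "autoSubject",
    "rows" => rows,
    "sort" => "usage desc" # the undocumented param; space is encoded as "+"
  }
  "https://fast.oclc.org/searchfast/fastsuggest?#{URI.encode_www_form(params)}"
end

puts fast_suggest_url("philadelphia")
```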

This works, and seems to return results more like what we’re expecting/used to?

It also seems to return results more like what their own web page at https://fast.oclc.org/searchfast/ returns?

It’s also entirely undocumented on the AssignFAST API documentation, and they don’t seem inclined to update the docs? I do not know if there might be other useful values selectable as sort fields, in addition to usage.

I make this post in part for Google-ability for anyone else looking to solve this mystery.

The AssignFAST doc page also includes lots of broken links to examples. In general, I think we probably should be grateful this free API still exists at all; I think OCLC is putting only minimal if any resources to even keeping it alive, let alone enhancing it or responding to current user need; I’d expect it to disappear at some point.

I remember a time when I first entered the profession where I imagined we would develop all sorts of innovative digital services and APIs from which a new generation of library technological ecosystem would be built, and OCLC, our community-owned non-profit cooperative, would be at the center of that doing innovative things that they were well-placed to do with their access to metadata ecosystems. Of that fantasized future, what remains in 2024 is the leftover unsupported frozen-on-the-vine initial experiments from that era of excitement and innovation, and a shrunken ecosystem of vendors trying to figure out how to wring every last drop of revenue from their customers.

http://bibwild.wordpress.com/?p=12857

Cloudfront in front of S3 using response-content-disposition

jrochkind Jun 18, 2024

At the Science History Institute Digital Collections, we have a public collection of digitized historical materials (mostly photographic images of pages). We store these digitized assets (originals as well as various resizes and thumbnails used on our web pages) in AWS S3.

Currently, we provide access to these assets directly from S3. For some of our deliveries, we also use the S3 feature of a response-content-disposition query parameter in a signed expiring S3 url, to have the response include an HTTP Content-Disposition header with a filename and often attachment disposition, so when the end-user saves the file they get a nice humanized filename (instead of our UUID filename on S3), supplied dynamically at download time — while still sending the user directly to S3, avoiding the need for a custom app proxy layer.

While currently we’re sending the user directly to urls in S3 buckets set with public non-authenticated access, we understand a better practice is putting a CDN in front like AWS’s own CloudFront. In addition to the geographic distribution of a CDN, we believe this will give us: better more consistent performance even in the same AWS region; possibly some cost savings (although it’s difficult for me to compare the various different charges over our possibly unusual access patterns); and additionally access to using AWS WAF in front of traffic, which was actually our most immediate motivation.

But can we keep using the response-content-disposition query param feature to dynamically specify a content-disposition header via the URL? It turns out you certainly can keep using response-content-disposition through CloudFront. But we found it a bit confusing to set up, and think through the right combination of features and their implications, with not a lot of clear material online.

So I try to document here the basic recipe we have used, as well as discuss considerations and details!

Recipe for CloudFront distribution forwarding response-content-disposition to S3

We need CloudFront to forward the response-content-disposition query parameter to S3; by default it leaves off the query string (after ? in a URL) when forwarding to origin. You might reach for a custom Origin Request Policy, but it turns out we’re not going to need it, because a Cache Policy will take care of it for us.
If we’re returning varying content-disposition headers, we need a non-default Cache Policy such that the cache key varies based on response-content-disposition too — otherwise changing the content-disposition in query param might get you a cached response with old stale content-disposition.
- We can create a Cache Policy based on the managed CachingOptimized policy, but adding the query params we are interested in.
- It turns out including URL query params in a Cache Policy automatically leads to them being included in origin requests, so we do NOT need a custom Origin Request Policy. Only a custom Cache Policy that includes response-content-disposition

OK, but for the S3 origin to actually pay attention to the response-content-disposition query param, you need to set up a CloudFront Origin Access Control (OAC) given access to the S3 bucket, and set to “sign requests”, since S3 only respects this param for signed requests.
- You don’t actually need to restrict the bucket to only allow requests from CloudFront, but you probably want to make sure all your bucket’s requests are going through CloudFront?
- You don’t need to set the CloudFront distro to Restrict viewer access, but there may be security implications of setting up response-content-disposition forwarding with a non-restricted distro? More discussion below.
- Some older tutorials you may find use AWS “Origin Access Identity (OAI)” for this, but OAC is the new non-deprecated way; don’t follow those tutorials.
- Setting this all up has a few steps, but this CloudFront documentation page leads you through it.

At this point your Cloudfront distribution is working to forward response-content-disposition query params, and return the resultant content-disposition headers in the response. Cloudfront forwards on all response headers from origin by default if you haven’t set a distribution behavior “Response headers policy”. Even setting a response headers policy like Managed-CORS-with-preflight-and-SecurityHeadersPolicy (which is what I often need), it seems it forwards on other response headers like content-disposition no problem.

You may likely also want to Restrict viewer access to the Cloudfront distribution. (See security considerations below). See AWS docs Serve private content with signed URLs and signed cookies.

Security Implications of Public Cloudfront with response-content-disposition

An S3 bucket can be set to allow public access, as I’ve done with some buckets with public content. But to use the response-content-disposition or response-content-type query param to construct a URL that dynamically chooses a content-disposition or content-type — you need to use an S3 presigned url (or some other form of auth I guess), even on a public bucket! “These parameters cannot be used with an unsigned (anonymous) request.”

Is this design intentional? If this wasn’t true, anyone could construct a URL to your content that would return a response with their chosen content-type or content-disposition headers. I can think of some general vague hypothetical ways this could be used maliciously, maybe?

But by setting up a CloudFront distribution as above, it is possible to set things up so an unsigned request can do exactly that: a request to http://mydistro.cloudfront.net/content.jpg?response-content-type=application%2Fx-malicious will just work without being signed. Is that a potential security vulnerability? I’m not sure, but if so you should not set this up without also setting the distribution to have restricted viewer access and require (eg) signed urls. That will require all urls to the distribution to be signed though, not just the ones with the potentially sensitive params.

What if you want to use public un-signed URLs when they don’t have these sensitive params; but require signed URLs when they do have these params? (As we want the default no-param URLs to be long-cacheable, we don’t want them all to be unique time-limited!)

Since CloudFront “restricted access” is set for the entire distribution/behavior, you’d maybe need to use different distributions both pointed at the same origin (but with different config). Or perhaps different “behaviors” at different prefix paths within the same distribution. Or maybe there is a way to use custom Cloudfront functions or lambdas to implement this, or restrict it in some other way? I don’t know much about that. It is certainly more convoluted to try to set up something like how S3 alone works, where straight URLs are public and persistent, but URLs specifying response headers are signed and expiring.
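
If you did go the two-distributions (or two-behaviors) route, the app generating URLs would need to decide which one to point at. A minimal Ruby sketch of that decision follows. `needs_signed_url?` and `SENSITIVE_PARAMS` are hypothetical names, and treating exactly these two params as the sensitive ones is an assumption. This only helps the app pick a URL; it doesn't enforce anything on the CloudFront side.

```ruby
require "uri"

# Query params that ask S3 to override response headers. Treating exactly
# this list as "sensitive" is an assumption for illustration.
SENSITIVE_PARAMS = %w[response-content-disposition response-content-type].freeze

# Hypothetical helper: true when the URL carries any sensitive param and so
# should go to the signed (restricted) distribution or behavior.
def needs_signed_url?(url)
  query = URI.parse(url).query.to_s
  URI.decode_www_form(query).any? { |name, _| SENSITIVE_PARAMS.include?(name) }
end

plain    = "https://mydistro.cloudfront.net/content.jpg"
download = "https://mydistro.cloudfront.net/content.jpg?response-content-disposition=attachment"

needs_signed_url?(plain)    # false: the public, long-cacheable URL is fine
needs_signed_url?(download) # true: use the signed distribution
```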

Other Considerations

You may want to turn on logging for your CloudFront distro. You may want to add tags to make cost analysis easier.

In my buckets, all keys have unique names using UUID or content digests, such that all URLs should be immutable and cacheable forever. I want the actual user-agents making the request to get far-future cache-control headers. I try to set S3 cache-control metadata with far-future expiration. But if some got missed or I change my mind about what these should look like, it is cumbersome (and has some costs) to try to check/reset metadata on many keys. Perhaps I want the CloudFront distro/behavior to force add/overwrite far-future cache-control header itself? I could do that either with a custom response headers policy (might want to start with one of the managed policies, and copy/paste it modifying to add cache-control header), or perhaps a custom origin request policy that added on a S3 response-cache-control query param to ask S3 to return a far-future cache-control header. (You might want to make sure you aren’t telling the user-agent to cache error messages from origin though!)

You may be interested in limiting to a CloudFront price class to control costs.

Terraform example

Terraform files demonstrating what is described here can be found: https://gist.github.com/jrochkind/4edcc8a4a1abf090a771a3e0324f6187

More detailed explanation below.

Detailed Implementation Notes and Examples

Custom Cache Policy

Creating cache policies is discussed in the AWS docs.

The fact that a Cache Policy results in query params being included in origin requests comes from the documentation on Control origin requests with a policy:

Although the two kinds of policies are separate, they are related. All URL query strings, HTTP headers, and cookies that you include in the cache key (using a cache policy) are automatically included in origin requests. Use the origin request policy to specify the information that you want to include in origin requests, but not include in the cache key. Just like a cache policy, you attach an origin request policy to one or more cache behaviors in a CloudFront distribution.

You set a cache policy for your distribution (or specific behavior) by editing a Behavior here:

I created the Cache Policy with TTL values from the “CachingOptimized” managed policy, and added the query params I was interested in:

Which looks like this in terraform:

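resource "aws_cloudfront_distribution" "example-test2" {
  # etc
  default_cache_behavior {
    # Reference the custom policy defined below. (The literal ID
    # 658327ea-f89d-4fab-a63d-7e88639e58f6 is the managed CachingOptimized policy.)
    cache_policy_id = aws_cloudfront_cache_policy.jrochkind-test-caching-optimized-plus-s3-params.id
  }
}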

resource "aws_cloudfront_cache_policy"  "jrochkind-test-caching-optimized-plus-s3-params" {
  name        = "jrochkind-test-caching-optimized-plus-s3-params"
  comment     = "Based on Managed-CachingOptimized, but also forwarding select S3 query params"
  default_ttl = 86400
  max_ttl     = 31536000
  min_ttl     = 1
  parameters_in_cache_key_and_forwarded_to_origin {
    enable_accept_encoding_brotli = true
    enable_accept_encoding_gzip   = true

    cookies_config {
      cookie_behavior = "none"
    }
    headers_config {
      header_behavior = "none"
    }
    query_strings_config {
      query_string_behavior = "whitelist"
      query_strings {
        items = [
          "response-content-disposition",
          "response-content-type"
        ]
      }
    }
  }
}

CloudFront Origin Access Control (OAC) to sign requests to S3

Covered in CloudFront docs Restrict access to an Amazon Simple Storage Service origin, which lead you through it pretty nicely.

While you could leave off the parts that actually restrict access (say allowing public access), and just follow the parts for setting up an OAC to sign requests… you probably also want to restrict access to s3 so only CloudFront has it, not the public?

Give the S3 bucket a (permissions) Policy allowing access by your specific Cloudfront Distribution
Create a CloudFront Origin Access Control (OAC) object of Origin type S3. Then in CloudFront Origin settings, set “Origin Access” to use your new OAC object.

Relevant terraform follows. (You may want to use templating feature for the json policy, shown in complete example above).

resource "aws_cloudfront_distribution" "example-test2" {
    # etc
    origin {
        connection_attempts = 3
        connection_timeout  = 1
        domain_name         = aws_s3_bucket.example-test2.bucket_regional_domain_name
        origin_id           = aws_s3_bucket.example-test2.bucket_regional_domain_name
        origin_access_control_id = aws_cloudfront_origin_access_control.example-test2.id
    }
}

resource "aws_s3_bucket_policy" "example-test2" {
    bucket = "example-test2"
    
    policy = jsonencode(
        {
            Id        = "PolicyForCloudFrontPrivateContent"
            Statement = [
                {
                    Action    = "s3:GetObject"
                    Condition = {
                        StringEquals = {
                            "AWS:SourceArn" = aws_cloudfront_distribution.example-test2.arn
                        }
                    }
                    Effect    = "Allow"
                    Principal = {
                        Service = "cloudfront.amazonaws.com"
                    }
                    Resource  = "arn:aws:s3:::example-test2/*"
                    Sid       = "AllowCloudFrontServicePrincipal"
                  },
            ]
            Version   = "2008-10-17"
        }
    )
}

resource "aws_cloudfront_origin_access_control" "example-test2" {
  description                       = "Cloudfront signed s3"
  name                              = "example-test2"
  origin_access_control_origin_type = "s3"
  signing_behavior                  = "always"
  signing_protocol                  = "sigv4"
}

Restrict public access to CloudFront

We want to require signed urls with our CloudFront distro, similar to what would be required with a non-public S3 bucket directly. Be aware that CloudFront uses a different signature algorithm and type of key than s3 and expirations can be further out.

See AWS doc at Serve private content with signed URLs and signed cookies.

Create a public/private RSA key pair
- openssl genrsa -out private_key.pem 2048
- extract just the public key with openssl rsa -pubout -in private_key.pem -out public_key.pem
- Upload the public_key.pem to CloudFront “Public Keys”, and keep the private key in a secure place yourself.
Create a CloudFront “Key Group”, and select that public key from select menu
In the Distribution “Behavior”, select “Restrict Viewer Access”, to a “Trusted Key Group”, and choose the Trusted Key Group you just created.
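
The key pair step can also be done from Ruby instead of the openssl command line, if that fits your tooling better. This is just a sketch using Ruby's standard openssl library; the variable names are mine, and you would still upload the public key and create the key group as described in the steps above.

```ruby
require "openssl"

# Generate a 2048-bit RSA key pair, the same size as the
# `openssl genrsa ... 2048` command uses.
key = OpenSSL::PKey::RSA.new(2048)

private_pem = key.to_pem            # keep this secret
public_pem  = key.public_key.to_pem # this is what goes to CloudFront "Public Keys"
```

You would then write private_pem somewhere secure and upload public_pem, same as with the files produced by the openssl commands.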

Now all CloudFront URLs for this distribution/behavior will need to be signed to work, or else you’ll get an error Missing Key-Pair-Id query parameter or cookie value. See Use signed URLs. (you could also use a signed cookie, but that’s not useful to me right now).

You’ll need the private key to sign a URL. Note that CloudFront uses an entirely different key signing algorithm, protocol, and key than s3 signed urls! Shrine’s S3 docs have a good ruby example of using ruby AWS SDK Aws::CloudFront::UrlSigner, which will by default use a “canned” policy. (I’m not sure what default expiration you’ll get without specifying it in the call, as in that example.)

Using a “canned” policy that just has a simple expiration, passing a custom expiration for 7 days in the future might look something like this:

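require "aws-sdk-cloudfront"
# key_pair_id and private_key_path are placeholders: use your CloudFront public key ID and key file.
signer = Aws::CloudFront::UrlSigner.new(key_pair_id: "YOUR_PUBLIC_KEY_ID", private_key_path: "private_key.pem")
signed_url = signer.signed_url(
  "https://mydistro.cloudfront.net/content.jpg?response-content-disposition=etc",
  expires: Time.now.utc.to_i + 7 * 24 * 60 * 60,
)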

Terraform for creating restricted cloudfront access as above:

resource "aws_cloudfront_public_key" "example-test2" {
  comment     = "public key used by our app for signing urls"
  encoded_key = file("public_key-example-test2.pem")
  name        = "example-test2"
}

resource "aws_cloudfront_key_group" "example-test2" {
  comment = "key group used by our app for signing urls"
  items   = [aws_cloudfront_public_key.example-test2.id]
  name    = "example-test2"
}

resource "aws_cloudfront_distribution" "example-test2" {
  # etc
  default_cache_behavior {
    # etc
    trusted_key_groups = [aws_cloudfront_key_group.example-test2.id]
  }
}

(Warning: with terraform aws provider v5.53.0, to have terraform remove the trusted_key_groups and have the distro be public again, you have to leave in trusted_key_groups = [], rather than remove the key entirely. Perhaps that’s part of how terraform works.)

http://bibwild.wordpress.com/?p=12742

Run your Rails gem CI on rails main branch

jrochkind May 30, 2024

attr_json is basically an ActiveRecord extension. It works with multiple versions of Rails, so definitely runs CI on each version it supports.

But a while ago on attr_json, I set up CI to run on Rails main unreleased branch. I already was using appraisal to test under multiple Rails versions.

(which I recommend; sure it seems easy enough to do this ‘manually’ with conditionals in your Gemspec or separate Gemfiles and BUNDLE_GEMFILE — but as soon as you start needing things like different extra dependencies (version of rspec-rails anyone?) for different Rails versions… stop reinventing the wheel, appraisal just works).

So I added one more appraisal block for rails-edge, pretty straightforward. (This example also uses combustion which I don’t necessarily recommend, I think recent Rails dummy app generated by rails plugin new is fine, unlike Rails back in 5.x or whatever).

The “edge rails” CI isn’t required to pass for PR’s to be merged. I put it in its own separate Github Actions workflow, in part so I can give it its own badge on the README. (The way things are currently set up, I think you don’t even get “edge rails CI” feedback on the PR; it would be ideal to get it as feedback, but make it clear it’s in its own category and failures aren’t a blocker).

I intend this to tell the person looking at the README considering using the gem, and evaluating its health and making guesses about its maintenance level and effective cost of ownership: Hey, this maintainer is continually testing on unreleased Rails Edge. That’s a pretty good sign! Especially that it’s green, means it’s working on unreleased rails edge. And when the next Rails release happens, we already know it’s in a state to work on it, I won’t have to delay my Rails upgrade for this dependency.

And if a change happens on Rails edge main branch that breaks my build — I find out when it happens. If you don’t look at whether your code passes the build on (eg) Rails 7.2 until it’s released, and you find a bunch of failures — it turns out that was basically deferred maintenance waiting for you.

I find out about breakages when they happen. I fix them when I have time, but seeing that red build breakage on “Future Rails Versions” is a big motivator to get it green. (I might have called that “edge Rails” in retrospect, I think that’s a generally understood term?). And when Rails 7.2 really is released — I just need to change my gemspec to allow Rails 7.2, and release attr_json, I don’t have deferred maintenance on compat with latest Rails release piling up for me, and I can release an attr_json supporting the new Rails release immediately, and not be a blocker for my users upgrading to latest Rails release on their schedule.

This has worked out very well for me, and I would really encourage all maintainers of Rails plugins/engines to run CI on Rails edge.

http://bibwild.wordpress.com/?p=12696

Consider a small donation to rubyland.news?

jrochkind Nov 27, 2023

I started rubyland.news a few years ago because it was a thing I wanted to see for the Ruby community. I had been feeling a shrinking of the ruby open source collaborative community, it felt like the room was emptying out.

If you find value in Rubyland News, just a few dollars contribution on my Github Sponsors page would be so appreciated.

I wanted to make people writing about ruby and what they were doing with it visible to each other and to the community, in order to try to (re)build/preserve/strengthen a self-conception as a community, connect people to each other, provide entry to newcomers, and just make it easier to find ruby news.

I develop and run rubyland.news in my spare time, as a hobby project, all by myself, on custom Rails software. I have never accepted, and will never accept, money for editorial placement; the feeds included in rubyland.news are exclusively based on my own judgement of what will serve readers and the community well.

Why am I asking for money?

The total cost of Rubyland News, including hosting and the hostname itself, is around $180 a year. Current personal regular monthly donations add up to about $100 a year, from five individual sponsors (thank you!!!!)

I pay for this out of my pocket. I’m doing totally fine, no need to worry about me, but I do work for an academic non-profit, and don’t have the commercial market software engineer income some may assume.

Some donations would also help motivate me to keep putting energy into this, showing me that the project really does have value to the community. If I am funded to exceed my costs, I might also add resources necessary for additional features (like a non-limited DB to keep a searchable history around?)

You can donate one-time or monthly on my Github Sponsors page. The suggested levels are $1 and $5 per month. If I get an increase of $5-$10/month in contributions this year, I will consider it a huge success, it really makes a difference!

If you donate $5/month or more, and would like to be publicly listed/thanked, I am very happy to do so, just let me know!

If you don’t want to donate or can’t spare the cash, but do want to send me an email telling me about your use of rubyland news, I would love that too! I really don’t get much feedback! And would love to know any features you want or need. (With formerly-known-as-twitter being on the downslide, are there similar services you’d like to see rubyland.news published to?) (jonathan at rubyland.news)

Thanks

Thanks to anyone who donates anything at all
also to anyone who sends me a note to tell me that they value Rubyland News (seriously, I get virtually no feedback — telling me things you’d like to be better/different is seriously appreciated too! Or things you like about how it is now. I do this to serve the community, and appreciate feedback and suggestions!)
To anyone who reads Rubyland News at all
To anyone who blogs about ruby, especially if you have an RSS feed, especially if you are doing it as a hobbyist/community-member for purposes other than business leads!
To my current monthly github sponsors, it means a lot!
To anyone contributing in their own way to any part of open source communities for reasons other than profit, sometimes without much recognition, to help create free culture that isn’t just about exploiting each other!

http://bibwild.wordpress.com/?p=12535

Beware sinatra, rails 7.1, rack 3, resque bundler dependency resolution

jrochkind Nov 9, 2023

tldr practical advice for google: If you use resque 2.6.0 or less, and Rails 7.1, and are getting an error: cannot load such file -- rack/showexceptions, then you probably need to add rack "~> 2.0" to your Gemfile!

The latest version of the ruby gem sinatra, as I write this, is 3.1.0, and it does not yet support the recently released rack 3. It correctly specifies that in its gemspec, with gem "rack", "~> 2.2", ">= 2.2.4"

[And as of this writing, that is true in sinatra github main branch too, no work has been done to allow rack 3.x]

The new Rails 7.1 does work with and allow Rack 3.x, as well as still working with Rack 2.x, it allows any rack >= 2.2.4 (specifying it will be compatible with a future rack 4.x too, which seems dangerous, for reasons, read on)

There is a version of sinatra that (wrongly) specifies working with rack 3.x: Sinatra 1.0 (Released March 2010!) specifies in its gemspec that it will work with any rack >= 1.0. They quickly corrected that in Sinatra 1.1a to say “~> 1.1”, meaning “1.x greater than or equal to 1.1 only”.

But sinatra 1.0 is still there in the repo, as a target for bundler dependency resolution, claiming to work fine with rack 3.x. By the way, sinatra 1.0 is wrong about that, it certainly does not work with rack 3.x. One error you might get from it is cannot load such file -- rack/showexceptions on boot, which is a lot better than a subtler error that only shows up at runtime, for sure!

Do you see where this is going?

I am in process of updating my app to Rails 7.1. I didn’t even know my app had a sinatra dependency… but it turns out it did, my app uses resque latest version 2.6.0, which has a dependency on sinatra >= 0.9.2

So okay, poor bundler has to take this dependency tree and create a resolution for it. Rails 7.1 allows rack 2 or 3; resque 2.6.0 allows any sinatra at all; sinatra 1.0 allows any rack, but sinatra 3.1.0 only allows rack 2.x.

There are two possible resolutions that satisfy those restrictions (really more than two if you can use any old version of a dependency), but the one bundler picked was:

rack 3.0.8
sinatra 1.0

Which then failed CI because sinatra 1.0 doesn’t really work with rack 3.x.

The other possible resolution would have been rack 2.2.8 and sinatra 3.1.0.

That’s the one I actually want.

To help it along I just need to add rack "~> 2.0" to my Gemfile. This was a bit confusing to debug!
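
The constraints above can be checked by hand with RubyGems' own Gem::Requirement and Gem::Version classes. This is only a sketch of the version arithmetic, using the requirement strings quoted in this post; it is not a reproduction of what bundler's resolver does internally, and the variable names are mine.

```ruby
require "rubygems"

# What each gem says about rack, as described above.
sinatra_1_0_on_rack   = Gem::Requirement.new(">= 1.0")
sinatra_3_1_0_on_rack = Gem::Requirement.new("~> 2.2", ">= 2.2.4")
rails_7_1_on_rack     = Gem::Requirement.new(">= 2.2.4")
gemfile_pin           = Gem::Requirement.new("~> 2.0")

rack3 = Gem::Version.new("3.0.8")
rack2 = Gem::Version.new("2.2.8")

# The resolution bundler picked: rack 3.0.8 + sinatra 1.0 (valid on paper).
picked_ok = [sinatra_1_0_on_rack, rails_7_1_on_rack].all? { |r| r.satisfied_by?(rack3) }
# The wanted resolution: rack 2.2.8 + sinatra 3.1.0.
wanted_ok = [sinatra_3_1_0_on_rack, rails_7_1_on_rack].all? { |r| r.satisfied_by?(rack2) }
# The Gemfile pin rules out rack 3 and still allows rack 2.2.8.
pin_allows_rack3 = gemfile_pin.satisfied_by?(rack3)
pin_allows_rack2 = gemfile_pin.satisfied_by?(rack2)
```

Both picked_ok and wanted_ok come out true, which is why bundler was free to choose either; the pin makes pin_allows_rack3 false, leaving only the resolution with a rack 2.x.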

What is the problem? The danger of open-ended gem dependencies

So the problem here is sinatra 1.0 (ten years ago!) claiming it supported any rack version no matter how high! It should have said ~> 1.0 meaning “1.x, but not 2” — how could it possibly predict it would work with rack 2, or 3, or 4, or 9.0?

If sinatra 1.0 had put an upper bound on the version of rack it would work with, bundler would have done the ‘correct’ (to us humans) resolution out of the box, cause the ‘wrong’ one it did would not have been available as satisfying all restrictions. Doing an open-ended spec like this leaves a bomb that can get someone more than a decade later, as it did here.

And Rails is still doing that! actionpack 7.1.x says it works with any rack >= 2.2.4. It ought to add in a < 4 there: it knows it works with rack 2.x and 3.x, but how can it predict it works with rack 4.x or 5.x, which don’t exist at all yet? It’s leaving the same bomb for bundler dependency resolution in the future that sinatra 1.0 did, and there’s no real way to fix it once the versions are out there.
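
For gem maintainers, declaring an upper bound might look roughly like the sketch below. The gemspec is hypothetical (the name "example-plugin" and its version are made up); the ">= 2.2.4", "< 4" pair is the bound suggested above for actionpack, not something actionpack actually declares.

```ruby
require "rubygems"

# Hypothetical gemspec, only to show a bounded rack dependency.
spec = Gem::Specification.new do |s|
  s.name    = "example-plugin"
  s.version = "0.1.0"
  s.add_dependency "rack", ">= 2.2.4", "< 4"
end

rack_req = spec.dependencies.first.requirement
allows   = ->(v) { rack_req.satisfied_by?(Gem::Version.new(v)) }

allows.call("2.2.8") # true
allows.call("3.0.8") # true
allows.call("4.0.0") # false: a future rack 4 has to be allowed explicitly
```

When rack 4 eventually exists and has been tested against, the bound gets raised in a new release, and old releases never claim compatibility they couldn't have known about.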

Alternately, if sinatra released a version that did support rack 3, and said so, bundler would preferentially choose that version, with rack 3, and we wouldn’t have a problem. (Bundler’s dependency resolution is actually really amazing, it’s amazing how often it makes the “right” choice among many possible versions that would satisfy all dependency restrictions) I’m not sure how much maintenance energy sinatra is getting, but eventually it’s going to have to get there or there’s going to be a conflict with something that has sinatra in its dependency tree and also has something that requires rack 3 in its dependency tree.

And more immediately… resque says it works with any sinatra >= 0.9.2 (released in 2009)…. but does it really? Who knows. Releasing a resque that says it needs, oh, sinatra >= 2.0 (released 2017) might help bundler come to a more satisfying dependency resolution… or could just result in bundler deciding to use an old version of resque so it can use an old version of sinatra which says (incorrectly!) it supports rack 3…. hard to predict. But maybe I’ll PR resque. But resque is also not exactly overflowing with maintenance applied to it these days…

Eventually I just need to switch away from resque. I have my eye on good_job.

http://bibwild.wordpress.com/?p=12478

S3 CORS headers proxied by CloudFront require HEAD not just GET?

jrochkind Oct 9, 2023

I’m not totally sure what happened, but the tldr is that at the end of last week, our video.js-played HLS videos served from an S3 bucket (via CloudFront) appear to have started requiring us to list “HEAD” in the “AllowedMethods” for CORS configuration, in addition to pre-existing “GET”.

I’m curious if anyone else has any insight into what’s going on there… I have some vague guesses at the end, but still don’t really have a handle on it.

Our setup: HLS video from S3 buckets

We use the open-source video.js to display some video, in the HLS format. Which involves linking to a .m3u8 manifest file, which is the first file the user-agent will request.

When implementing, we discovered that if the .m3u8 and other HLS files are on a different domain than the web page running the JS, you need the server hosting the HLS files to supply CORS headers. Makes sense, reasonable.

Our HLS files are on a public S3 bucket. We also have a simple Cloudfront distribution in front of the public S3 bucket.

We set this CORS policy on the S3 bucket, probably one I just found/copy/pasted at some point. (CORS policies on S3 are now set, I think, only in JSON form; in the past they could be XML and you can find XML examples too). (warning, may not be sufficient)

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 43200
    }
]

And for a long time, that just worked. The S3 bucket responded with proper CORS headers for video.js to work. The CloudFront distribution appropriately cached/forwarded the response with those headers. (note * as allowed origin, so the cache is not origin-specific, which should be fine then!)

Last week it broke? How I investigated

Some time around Wednesday Oct 4-Thursday Oct 5th, our browser video display started breaking. In a very hard to reproduce way.

Some viewers got the error from video.js it gives when it can’t fetch the video source (for instance, a network failure might give you this same error message):

“Media could not be loaded, either because the server or network failed or because the format is not supported.”

(and, by the way, this error could happen on new videos at new urls that didn’t exist 24 hours previous…)

Once a developer managed to reproduce this, looking in Dev Tools console in the browser, we could see a CORS error reported:

Access to XMLHttpRequest at ‘[url]’ has been blocked by CORS policy: No ‘Access-Control-Allow-Origin’ header is present on the requested resource.

It took me a bit to figure out how to investigate whether CORS headers were being returned appropriately or not. It turns out that S3, at least, only returns the CORS headers when an Origin header is present in the request, and it matches the CORS rules (the second condition, in this case, should be universal, as our allowed origin is *). Maybe this is how CORS always works?

So we could investigate like so, using verbose mode to see headers from a GET request:

curl -H "Origin: https://our-example.org" -v "https://some-s3-or-cloudfront/etc"

Doing this, I discovered that for some people a cloudfront request as above would return CORS headers (we’re looking for eg Access-Control-Allow-Origin: * in the response!), and other times it wouldn’t! Cloudfront headers include a x-amz-cf-pop header, which reminded me, right, there are different Cloudfront POPs different people could be connecting to… okay, so some Cloudfront POPs are returning the CORS headers others not? Which kind of violates my model of CloudFront, I thought POPs would be synchronized to always return the same content, but who knows.

But okay then, was the S3 original source returning CORS headers?

Well, to make matters more confusing, I made a mistake which ultimately led me to the solution too. Instead of doing curl -v, I had originally been doing curl -I, which I had come to think as “just show me the response headers not body”, but of course actually is a synonym for --head and tells curl to do a HTTP HEAD method request.

And I configured S3 to allow only GET method, so, no, when I did a HEAD request to the direct S3 source, no CORS headers were included, duh. If I did it with GET they were.
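
Here is the behaviour I was seeing, written out as a toy Ruby model. To be clear, this is only my mental model of what S3 appeared to do (CORS headers only when there is an Origin header and both origin and method match a rule), not S3's actual algorithm; `cors_headers` is a made-up function.

```ruby
# Toy model of the observed behaviour, not S3's actual algorithm:
# CORS headers come back only when the request has an Origin header and
# both its origin and its HTTP method match the rule.
def cors_headers(rule, method:, origin:)
  return {} if origin.nil?
  return {} unless rule["AllowedMethods"].include?(method)
  return {} unless rule["AllowedOrigins"].any? { |o| o == "*" || o == origin }

  allow_origin = rule["AllowedOrigins"].include?("*") ? "*" : origin
  { "Access-Control-Allow-Origin" => allow_origin }
end

get_only = { "AllowedMethods" => ["GET"], "AllowedOrigins" => ["*"] }

cors_headers(get_only, method: "GET",  origin: "https://our-example.org") # header present
cors_headers(get_only, method: "HEAD", origin: "https://our-example.org") # empty, like curl -I
cors_headers(get_only, method: "GET",  origin: nil)                       # empty, no Origin sent
```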

I actually didn’t totally realize what was going on at first (really forgot that -I was a HEAD request to curl, not a GET where it only showed me response headers!)…. but something about this experience, and while googling seeing an occasional S3 CORS example that included HEAD as well as GET in allowed methods…

Led me to try just adding HEAD to my AllowedMethods… So now this is my public S3 bucket’s CORS policy:

[
    {
        "AllowedHeaders": [
            "*"
        ],
        "AllowedMethods": [
            "GET",
            "HEAD"
        ],
        "AllowedOrigins": [
            "*"
        ],
        "ExposeHeaders": [],
        "MaxAgeSeconds": 43200
    }
]

And… this seemed to fix things? Along with clearing the CloudFront cache though, to make sure it wasn’t serving bad response headers from cache, so that could have played a role too.

At this point I really don’t understand what was going on, why/how I fixed it more or less accidentally, how I got lucky enough to fix it… or honestly if I even really did fix it?

What is going on anyway?

We have had this system in place for over a year, with no changes I know about — no changes to S3 or CloudFront configuration, or to video.js version. What changed?

I feel like the symptoms probably mean that CloudFront is sometimes doing a HEAD request to S3 for these files, and caching the response headers, and then using those cached response headers from a HEAD request on a GET request response… but why would it do that? And again why would it start encountering this situation now after a long time working fine?

At first I wondered, wait… we’ve had this setup for about a year… and we tell CloudFront to cache these responses (with content-unique URLs) for the HTTP max cache age of a year…. has our content just started to exceed its year max-age… so now CloudFront is maybe doing some conditional HEAD requests to S3 to see if its cache is still good (it is, Etag unchanged)… and for some reason it uses the CORS headers it gets back from there to update its cached headers, while still using its original cached body?

That seemed maybe plausible (if unclear whether it was a defensible thing for CloudFront to do), but then I remembered — no, we are seeing this problem too with new content and URLs that have only existed for less than 24 hours, so it can’t be a case of year-old content that CloudFront has been caching for a year.

I’m pretty mystified. Why this started breaking now after working for months, with no known changes. Has something on S3’s end changed with how it executes CORS policies to produce CORS headers? Or something on CloudFront changed with how it forwards/caches them? Or something in browsers or video.js changed with regard to exactly what requests are made? (is the browser now making HEAD requests for this content, and requiring CORS headers on response, in places it didn’t before? But that doesn’t explain why CloudFront POPs were giving me unexpectedly inconsistent results to GET requests, sometimes including CORS headers in response sometimes not!)

AND I don’t really understand why I have to include HEAD in my S3 CORS policy at all — I hadn’t been expecting to need to authorize HEAD requests via CORS, I expected video.js would be doing GET requests, and that’s all I’d need to authorize.

So I seem to have fixed the problem… but I never like it when I don’t understand my own solution. Have I really fixed it, or will it just come back?

Googling I can not find anything that seems relevant to this at all. Should anyone using CloudFront in front of a public S3 bucket, where responses need origin * CORS headers — always include HEAD as well as GET in AllowedMethods? Is this really such a weird situation? Why can’t I find anyone talking about it? Is it for some reason special to HLS video?

So anyway, I blog this. Hoping that someone else running into a mysterious problem will find this post, when I could find nothing! And hoping for the even slimmer chance that someone will see this who thinks they know exactly what was going on and can explain it!


Investigating OCR and Text PDFs from Digital Collections

jrochkind Jul 18, 2023

At the Science History Institute Digital Collections, we have a fairly small collection compared to some peers (~70,000 images) of historical materials. Many of those images are of text: books, pamphlets, advertisements, memos, etc.


We haven’t previously done any OCR (Optical Character Recognition), but started thinking about doing that. In addition to using captured text for the site-wide search, it made sense to us to look to providing:

“search inside the book/work”, with results highlighted on in-browser page images, such as provided by this example from Internet Archive using their own viewer, or this example from NCSU using UniversalViewer (search for say “corn”, note how results are highlighted on the page image)
Provide downloadable PDFs with a “text layer” — select-copyable and searchable text in the PDF, as in this corresponding example from Internet Archive.

Both of those use cases require not just text out of OCR, but text with position information so it can be overlaid on the page image.

We decided to investigate starting with the PDF-with-text-layer as the first product, because I (naively, I think!) believed this would be straightforward to do, and because we have user research indications that some portion of our users really love PDFs (which I think would be common among at least any user groups of academics, probably others too).

I had to do a lot of research to understand the technologies, techniques, and tools that are out there in this domain. So I capture my findings here in a giant blog post, partially to capture my own notes, and ideally to help give someone else a head start. (This recorded conference presentation from Merlijn Wajer at Internet Archive is also a good overview of the technical ecosystem! Merlijn and the IA are very central figures in what open source work is going on in this area!)

I was a bit bewildered to notice that few of our peers seemed to offer PDF downloads (especially with a text layer), as I’m pretty confident our collective users would want this. I then discovered the tooling is somewhat limited, and used to be worse. I’m not sure and am curious why our sector/field/industry hasn’t invested more in software development to create better tooling here!

tldr Summary of findings and plan

So after some initial research, I had discovered that several peers used the hOCR format to represent OCR-with-position output, to power their in-browser online “highlight a search result on a scanned page” interfaces.

I somewhat naively figured I could use my choice of OCR engines (the open source tesseract seemed most popular; or maybe Amazon Textract?) to produce HOCR, which I’d do on image ingest.

And then I imagined (and thought I saw confirmed based on a bit of googling), I’d have my choice of tools to combine the hOCR with raster images into PDFs. Ideally I could use a fast compiled-C tool that could easily be installed via apt on ubuntu.

It turns out that was over-optimistic. hOCR isn’t quite as widely interoperable a standard as hoped. Tools for rendering a PDF based on hOCR are in fact very limited (and mostly python). There is a field of abandoned and not-super-robust partial solutions.

In fact, I only found one piece of software that could do this job well, the Internet Archive’s archive-pdf-tools. And even it does not currently seem to do as good a job of positioning text in the PDF as tesseract itself does (although it intends to port tesseract’s logic). It’s also a bit difficult to install, and may not be installable on MacOS. It’s not super widely-used software.

Later I also discovered a clever kind of compression meant for this kind of textual PDF, called Mixed Raster Content (MRC). When this works, it can really reduce the file-size of otherwise enormous PDFs of hundreds of pages of page images. And in fact there is only one working open-source implementation of this as far as I can tell, again the Internet Archive’s archive-pdf-tools. (There are implementations in commercial software, I have not evaluated them, more later).

The rest of this post will be some lengthy notes going into my findings about PDF technology; and evaluation of all the software I could find and easily evaluate that would render hOCR to a PDF or be otherwise useful; and tips and tricks and difficulties in using that software.

But in the end, I identified basically only two realistic paths to get to PDFs with text layer from my scanned TIFFs.

1. Have tesseract output a “text-only” PDF (a tiny PDF that includes only the invisible “text layer”), and then use another tool (such as qpdf) to combine it with raster images.
- tesseract just does a better job of laying out text in the PDFs it outputs than any other tool I could find (and there aren’t many). Although archive-pdf-tools intends to match tesseract, even with a port of tesseract’s logic, it’s not currently doing so.
- Optionally, you could take the output of this, and try to run it through archive-pdf-tools’ experimental pdfcomp tool, to apply the MRC compression to a PDF it did not itself create. (I haven’t yet figured out how to access/run this experimental tool)
- If we aren’t doing MRC compression, look into using JP2 instead of JPEG: it turns out PDF supports jp2, and it may be a smaller file for the same quality.
- I’m still going to need the hOCR for future online-search applications, so I’ll be having tesseract output both hOCR and the text-layer PDF, one way or another, and storing them both.
- This does not give us the opportunity to manually correct hOCR. If we wanted to correct PDFs (perhaps for accessibility), it would have to be directly on the PDF, and would not apply corrections to other uses of the hOCR.
- Details on this approach below in the tesseract section.
2. Have tesseract output hOCR (probably at time of ingest), and then use archive-pdf-tools’ recode_pdf tool to assemble the hOCR with raster images into PDFs.
- At present, we won’t get quite as good text positioning as with tesseract.
- It’s a bit harder to get installed (and may not be installable on our dev box). That would also apply if we tried to use pdfcomp for compression in path 1, but in path 1 it’s an optional add-on; here we’d have to get it solved from the start.
- But the text positioning is better than anything else I found but tesseract, and we’ll get good MRC compression out of it too.
- This would give us an opportunity to apply corrections to hOCR (perhaps for accessibility remediation) and (re-)generate PDFs accordingly, if we had a workflow and tooling for that.
- recode_pdf starts with TIFFs, so the PDF-generation process is going to be a bit slower and more resource-intensive.
- Details of this approach below in the archive-pdf-tools section.

With either of those paths, it might be convenient to generate single-image single-page PDFs at file ingestion time, and then combine them into an aggregated multi-page PDF on demand. This makes it somewhat easier to deal with the fact that our app allows staff to add/remove or publish/unpublish individual pages on demand, which would invalidate generated PDFs. This approach would wind up with duplicated copies of an embedded font, but tesseract’s embedded “glyphless” font is only ~600 bytes, less than 1% of the likely sizes of outputs.
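
As a rough sketch of the "combine single-page PDFs on demand" idea, the snippet below builds a qpdf invocation that concatenates per-page PDFs in a given order. The file names and the `merge_pages` helper are hypothetical; it assumes qpdf is installed and uses its `--empty --pages … --` page-selection form. It is one possible way to do the merge, not a tested part of the workflow described here.

```python
import subprocess
from typing import List, Sequence


def qpdf_merge_command(page_pdfs: Sequence[str], output_pdf: str) -> List[str]:
    """Build a qpdf command that concatenates single-page PDFs in order."""
    if not page_pdfs:
        raise ValueError("need at least one page PDF to merge")
    # "--empty" starts from a blank document; "--pages ... --" lists the inputs.
    return ["qpdf", "--empty", "--pages", *page_pdfs, "--", output_pdf]


def merge_pages(page_pdfs: Sequence[str], output_pdf: str) -> None:
    """Run the merge; raises CalledProcessError if qpdf reports a failure."""
    subprocess.run(qpdf_merge_command(page_pdfs, output_pdf), check=True)


if __name__ == "__main__":
    # Hypothetical file names: only the currently published pages are included.
    print(qpdf_merge_command(["page_0001.pdf", "page_0003.pdf"], "work.pdf"))
```

Because the page list is just an argument, unpublishing a page means regenerating the aggregate from the remaining per-page files rather than re-running OCR.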

Anyway, these are pretty much the only options I came up with after much investigation of software that didn’t quite cut the mustard. It turns out going from hOCR to positioned text in a PDF is non-trivial: different tools do it differently, and some not as well as others. Other open source software investigated (there isn’t a lot!):

ocropus/hocr-tools: a python package including a hocr-pdf tool for rendering PDF from hOCR. Didn’t do a great job positioning the hOCR, and was unable to position lines that aren’t completely horizontal on a diagonal, which tesseract and archive-pdf-tools could.
eloops/hocr2pdf: a Javascript package that was meant by its original author as just a proof-of-concept experiment and hadn’t been touched in a while; did not do a good job of positioning.
Exactimage hocr2pdf: At first appeared to be the compiled C hocr=>pdf tool I had imagined existed. But it seems to be old unmaintained software, and I could not get it to work with contemporary tesseract hOCR.
pdfbeads: Ancient ruby software that can hypothetically do hOCR positioning and an MRC-like compression. I could not get its MRC to work for me; its hOCR positioning was inferior to archive-pdf-tools’; and it’s weird zombie software with unclear mainline source repository.

You’ve now gotten the important bits of this post summarized. In the remainder of this post, we have more musings on state of the field, context of technologies available, and notes on individual software packages reviewed — it’s a LOT of stuff. I am not certain how useful others may find these notes on what I have discovered!

Other options? Commercial options? State of the market?

I just couldn’t find many tools for eg hOCR rendering — although there may be more in the .Net world. There are some relevant commercial offerings here, that deal with OCR and PDF generation. They are often Windows-only, and often GUI software meant for someone to be operating as part of a scanning workflow. I think the market may be “corporate document management”. Some (or maybe just one?) of them claim to do MRC compression. Some of them have cloud “SDKs”. (as far as actual local SDKs, the market seems to be only for .Net).

I got the feeling that there was a lot of collaborative open-source energy on these techniques, for purposes of “ebooks” and “scanned books” 10-15 years ago (around the time of Google Books introduction?), but that it sort of petered out. This does not seem to be something our library and archive institutions have invested in. Thanks to the Internet Archive for being the main player working in this field and releasing open source tools! (Here is a video from the Internet Archive’s Merlijn Wajer that explains their procedures and how in 2020 they moved to an open source stack here. It also serves as a great overview of the technologies and tools discussed in this blog post.)

With few open source options, I would be potentially willing to pay for an appropriate tool at the right price. But the publicly-available documentation and general “developer experience” of commercial tools tends to be even worse than open source, it’s very difficult to even figure out if it’s going to work for you. I have a few notes on commercial tools in the “MRC Compression” section below, but mostly I have not spent the time to understand the market.

OCR: Tesseract is the open source option

Optical Character Recognition, or OCR, is the process of taking an image, and extracting the text from it as text.

As far as I can tell, Tesseract is the only current widely used (or at all?) open source OCR option.

There were other packages at one time popular, but for instance I don’t believe “Cuneiform” is currently being maintained or getting much use. (Wikipedia says last cuneiform release was in 2011, so).

Tesseract is currently at version 5.x (5.0 released Nov 2021) — but Ubuntu 22 apt repo still only has the latest 4.x release. And when I tried asking library field peers, it seems most are currently still using latest 4.x release. Tesseract 4.0 actually introduced “a new neural network-based recognition engine” (although it still supports using models with the old engine too, I think), so earlier than 4.x would be a really different product, but 4.x already has you on the new engine.

Tesseract works with human-language-based models, so you have to tell it which languages you expect in a document (you can tell it more than one). It has official support for a lot of human languages (including some historical early-modern ones). It does not, as far as I know, have official support for handwriting (rather than type-set) recognition.

It is also possible to train your own models for tesseract, and some people may be sharing non-official trained models for certain kinds of materials. I am not certain if I’ve seen any such that use the new “neural network-based recognition engine”, and at any rate I haven’t spent any time investigating this area.

On ubuntu, you can install tesseract with apt-get install tesseract-ocr, and install individual language packs with eg apt-get install tesseract-ocr-deu (you need to look up the appropriate tesseract language code). On MacOS, you can install tesseract with brew install tesseract, and install all supported language packs at once with brew install tesseract-lang.

For officially supported language packs, there are “FAST” and “BEST” model variants available. The distribution packages above will install the “FAST” packages. The “FAST” packages are smaller on disk and intended to result in much faster operation, with only slightly decreased accuracy. If you want to install and use the “BEST” packages instead… I am not sure how, and have not spent time with them or comparing.

Other OCR options? Commercial? AWS Textract?

I looked briefly at AWS Textract. It only handles six major European languages. BUT it claims to be able to recognize hand-writing? We def have hand-written items in the collection, would be big if it worked well.

It has all sorts of fancy tools for recognizing structured text on various types of business documents (invoices, business cards) that are mostly not of concern to us. It does not produce hOCR, but does produce its own JSON-based format that maybe could be converted to hOCR, although a converter isn’t included in this project I found of other hOCR conversions.

If I understand the pricing properly, at $15/1000 pages it’s quite expensive. We estimated the price of CPU time on heroku using tesseract to be 100x less.
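
To put those numbers side by side, here is a back-of-the-envelope calculation. It assumes the ~70,000 page images mentioned at the top of this post, the $15/1000 pages figure above, and takes the "100x less" estimate at face value; none of these are precise quotes.

```python
def cost_per_pages(pages: int, price_per_1000: float) -> float:
    """Total cost in dollars for a per-1000-pages price."""
    return pages / 1000 * price_per_1000


PAGES = 70_000                 # approximate collection size from this post
TEXTRACT_PER_1000 = 15.0       # dollars, as read from the pricing page above
TESSERACT_FACTOR = 100         # the rough "100x less" estimate above

textract_total = cost_per_pages(PAGES, TEXTRACT_PER_1000)
tesseract_total = textract_total / TESSERACT_FACTOR

print(f"Textract:  ${textract_total:,.2f}")
print(f"tesseract: ${tesseract_total:,.2f} (estimated CPU time)")
```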

Perhaps we’d investigate in the future to expand our processes to OCR’ing handwriting. But first the lower-hanging fruit.

There are other commercial OCR solutions, including Google Cloud Vision, and lots and lots of Windows-based “document management workflow” solutions, that I haven’t really even looked at.

Note on Accessibility and OCR

Automatically OCR’d text does not necessarily produce an “accessible” copy, say for people with vision impairments. While current OCR results from eg tesseract are surprisingly good, and provide a good product for “searchability”, they still include too many errors to be simply read as a primary text, as you can see if you look at the text alone.

Additionally, in PDF form, I am told for accessibility for many purposes the text really needs to be “tagged” in a way that simple OCR will not produce.

It is almost certain that we do not have the resources to produce this level of accessibility for the tens of thousands of page images in our corpus. While adding machine-generated OCR may increase accessibility somewhat for some purposes, contexts, and users, it definitely is going to leave a lot of people out, people who have vision impairments among others.

Another possible intervention is that we could provide a clear function for users to request accessible/remediated PDF (or other) copies on a per-work basis. It’s still not totally clear how we’d best provide that, whether we’d do remediation in-house (and using what tools and workflows), or perhaps send PDFs to vendors to produce accessible copies (this is not cheap, it looks like maybe $1/page or more for PDFs with accessible tagging, although I’m not certain).

In my imagination, a well-engineered process for remediating OCR might involve producing hOCR, then correcting the hOCR that is then used to (re-)generate PDFs. This way the corrections would also apply to other uses of the (h)OCR such as indexing for collection-wide search in Solr, or for search-inside-the-page with highlighting features offered via a web browser.

However, the tooling for this seems to be pretty limited, and this kind of workflow does not in fact seem to be common. hocrjs is a possibly still-maintained tool for viewing hOCR in a browser. It could be a building block for making a GUI for reviewing/fixing hOCR (which maybe the Internet Archive has for their own use, see this video?). Here is a more full-featured proof-of-concept for actually editing/correcting hOCR. Alternately, hocr-proofreader seems to be a proof-of-concept not “finished” into actually supporting some kinds of review and editing of hOCR. While the notes suggest it’s not ‘finished’, it is a very impressive proof-of-concept; check out the demo!

Of course, even if you corrected typos in the hOCR, that wouldn’t necessarily give you enough for the accessible “tagging” in the PDF. (Is there even a format that can capture OCR-with-position and all the semantics necessary to produce PDF tagging too? I don’t think it’s hOCR. The state of the ecosystem is underwhelming here).

A more realistic approach for the existing eco-system might be remediating a PDF as PDF (either sending to a vendor, or in-house with tools like Acrobat Pro or Abbyy FineReader) — and then extracting the (corrected) text from it as hOCR, to put into our system for other uses. The Internet Archive archive-hocr-tools project has a script that can extract a text layer from PDF to hocr, although it’s not mentioned in project readme (I might PR this), I’m not sure how I found it!

Some PDF tech details

What is a “PDF with text-layer” anyway?

PDFs don’t actually have “layers” or “text layers”. But this is shorthand for a PDF that includes actual computer-readable text in addition, in these cases, to a “raster” (pixel-based) image of a photograph of a physical text.

The PDF text isn’t in a “layer”, it’s just individual pieces of text positioned in the PDF. PDF actually has a “rendering mode” (constant 3) for non-displayed text. (See this Stack Overflow for some discussion).

In the kind of PDFs we’re talking about there are non-displayed text objects positioned in the same place/size as the words in the picture, so you can select (to copy and paste) text, and it looks like you are selecting the image itself. And you can “search within the document” in a PDF viewer, and it will highlight your results, looking like it’s highlighting the image itself.
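
To make "rendering mode 3" concrete, here is a minimal sketch that emits the PDF content-stream operators for one piece of invisible text. It only produces the operator text for a single word using a simple literal string; it does not write a complete PDF, and real tools (tesseract included) encode text and fonts differently. The function name and the `F1` font resource name are made up for illustration.

```python
def invisible_text_ops(text: str, x: float, y: float, font_size: float,
                       font_name: str = "F1") -> str:
    """Return content-stream operators that place `text` invisibly at (x, y).

    Coordinates are in PDF user-space units (1/72 inch), origin at bottom-left.
    """
    # Backslashes and parentheses must be escaped inside a PDF literal string.
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return "\n".join([
        "BT",                                 # begin text object
        f"/{font_name} {font_size:g} Tf",     # select font resource and size
        "3 Tr",                               # text rendering mode 3: invisible
        f"1 0 0 1 {x:g} {y:g} Tm",            # text matrix: translate to (x, y)
        f"({escaped}) Tj",                    # show the string
        "ET",                                 # end text object
    ])


if __name__ == "__main__":
    print(invisible_text_ops("corn", 72, 700, 12))
```

A viewer never paints these glyphs, but selection and search still operate on them, which is why the text appears to belong to the image underneath.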

Even though the text is not displayed, it needs to be associated with a font. There are fonts that are “built-in” to PDF, but they can only display characters in traditional “Latin-1” character sets. Displaying text in this pre-Unicode-ascendance format is a bit tricky if you are trying to do it yourself with raw bytes. Fonts in a PDF can be embedded in the PDF itself (and typically are for this sort of thing) to make sure the text can be displayed (or possibly even interpreted at all?) on a machine without the chosen font installed.

The text that isn’t even going to be displayed can get by with just a bare-bones stub of a font, a “glyphless” font, since it doesn’t actually have to display, it just needs to be encoded as machine-readable text. Tesseract, for instance, seems to use its own TrueType “glyphless font” that weighs only 572 bytes. It has in the past sometimes had to be tweaked; almost anything you want to do with a PDF ends up being non-trivial to do reliably.

HOCR and Alto: Formats to Represent OCR data with positions

You could do an OCR operation and just get text out. But if you want to overlay the text on top of the scanned image for select-copy-paste or search-result-highlighting, you need position information too.

Are there standard interchangeable formats that encode this information? Yes…. sort of.

The most popular one seems to be hOCR. It literally is an HTML document, with <p> for paragraphs and <div>s and <span>s, that embeds positional and other information in title attributes. (Flashback to “HTML microformats” for anyone else? Nevermind).
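
As an illustration of how positions ride along in title attributes, the sketch below pulls word text and bounding boxes out of a small hand-written hOCR fragment using only the Python standard library. The sample markup is an assumption modeled on the `ocrx_word` / `bbox` conventions that tesseract emits; real hOCR from other producers may differ, and the parser assumes word spans do not contain nested spans.

```python
from html.parser import HTMLParser


def parse_title(title):
    """Split an hOCR title like 'bbox 1 2 3 4; x_wconf 96' into a dict."""
    props = {}
    for part in title.split(";"):
        tokens = part.split()
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


class WordCollector(HTMLParser):
    """Collect (text, (x0, y0, x1, y1)) for every ocrx_word span."""

    def __init__(self):
        super().__init__()
        self.words = []
        self._bbox = None   # bbox of the word span we are currently inside
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "span" and "ocrx_word" in classes:
            props = parse_title(attrs.get("title") or "")
            if "bbox" in props:
                self._bbox = tuple(int(v) for v in props["bbox"])
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "span" and self._bbox is not None:
            self.words.append(("".join(self._text).strip(), self._bbox))
            self._bbox = None


def extract_words(hocr):
    collector = WordCollector()
    collector.feed(hocr)
    return collector.words


SAMPLE = """<div class='ocr_page' title='bbox 0 0 1200 1600'>
<p class='ocr_par'><span class='ocr_line' title='bbox 100 200 400 230'>
<span class='ocrx_word' title='bbox 100 200 180 230; x_wconf 96'>Corn</span>
<span class='ocrx_word' title='bbox 190 200 300 230; x_wconf 91'>starch</span>
</span></p></div>"""

if __name__ == "__main__":
    for text, bbox in extract_words(SAMPLE):
        print(text, bbox)
```

The bounding boxes are in pixels of the source image, so any tool that overlays them on a page needs to scale them to whatever size the page is displayed at.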

When I asked around among colleagues to see what they were using to power online on-scanned-page search-highlighting, the answer was hOCR. tesseract can output hOCR. There were several tools I found that could take hOCR as input.

The thing is… it’s unclear how well hOCR actually serves as a mutually-intelligible interchange format. Going back to 2016, there has been some concern that hOCR allows too much variation and hOCR from different producers may not truly be mutually intelligible. I think some of the tools I found that take “hOCR” as input may really only work with tesseract hOCR, and maybe even only certain versions of tesseract.

At the moment, there seem to be very few pieces of currently-maintained software that produce hOCR directly. (tesseract and… maybe there’s another open source package called Kraken? And a couple other barely- or non-maintained little-used open source packages).

As far as I can tell, most proprietary/commercial solutions can not read or write hOCR; they mostly use their own proprietary XML formats, if anything. Hypothetically you could translate from and to hOCR, and for some formats there are tools that claim to do so. Github cneud/ocr-conversion is a repository of scripts to convert between various OCR-position formats; it contains scripts to convert FROM several vendor formats (including Abbyy) to hOCR, but not usually the reverse.

There is another similar format, endorsed by the Library of Congress, called ALTO, which some think is technically superior, but it doesn’t seem to be supported by very many (any?) tools. (Tesseract can output ALTO, although it isn’t very well documented).

The end result is that this field isn’t quite as standards-based inter-operable as I had hoped/assumed.

MRC compression

So, raster (pixel) images are big, especially when you have hundreds of them. In our current application, we’re making PDFs out of 1200 pixel JPGs (made at default JPG compression level). The PDF for one particular 700-page book is 325MB. That’s a big file.

You could reduce the resolution or image quality. But 1200pixels is already only ~150dpi for an 8.5″/11″ page, and increasing JPG compression may introduce noticeable artifacts in some images — although we could experiment with this more. (If you do want to reduce byte size, do you get better perceived quality for the reduced size with less resolution or more JPG compression? I suspect keeping the resolution but increasing compression is the way to go, but I’m not sure).

However, it turns out someone (maybe these guys in 1998?) invented a very clever way to apply higher compression with less loss of perceived quality, specifically for the kinds of images likely in scanned books or scanned text. Called “Mixed Raster Content” (MRC), or “hyper-compression”, it involves separating the page “background” (which can be highly compressed) from any embedded graphics and text (which can’t be compressed as much without noticeable problems, especially the text), putting them in separate images with separate levels of compression and/or resolutions, then combining them back together with a “mask”, in a way that PDF technology supports.

More information on MRC can be found on Wikipedia, this vendor’s marketing page, this other vendor’s marketing page, or the Internet Archive’s archive-pdf-tools README. Merlijn from Internet Archive also explains it in a conference presentation.

My sense is that MRC compression is more of a technique than an exact algorithm. Different implementations may do it differently, and have output that can be more or less successful. There can be bugs or areas for improvement, that can differ between tools. The different layers can be split purely by automated image processing, but also can use the (eg) hOCR file to identify regions with text that need higher fidelity than backgrounds.
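
To show the general shape of the idea (and only that), here is a toy sketch on a tiny grayscale "page" stored as nested lists. It splits the page into a mask, a foreground and a background by a simple brightness threshold, downsamples only the background, and recomposes. This is an assumption-laden illustration, not the algorithm archive-pdf-tools or any commercial tool uses; real implementations use far smarter segmentation and real image codecs.

```python
def _mean(values, default):
    values = list(values)
    return sum(values) // len(values) if values else default


def split_layers(gray, threshold=128):
    """Split a grayscale page (rows of 0-255 ints) into mask, foreground, background."""
    mask = [[px < threshold for px in row] for row in gray]   # True where "ink"
    ink_fill = _mean((px for row in gray for px in row if px < threshold), 0)
    paper_fill = _mean((px for row in gray for px in row if px >= threshold), 255)
    # Foreground keeps ink pixels; everything else gets a flat fill (compresses well).
    foreground = [[px if m else ink_fill for px, m in zip(row, mrow)]
                  for row, mrow in zip(gray, mask)]
    # Background keeps paper pixels; holes left by the ink are painted over.
    background = [[paper_fill if m else px for px, m in zip(row, mrow)]
                  for row, mrow in zip(gray, mask)]
    return mask, foreground, background


def downsample(layer, factor):
    """Keep every `factor`-th pixel in both directions (crude, lossy)."""
    return [row[::factor] for row in layer[::factor]]


def recompose(mask, foreground, small_background, factor):
    """Scale the background back up and let the mask pick ink vs. paper per pixel."""
    height, width = len(mask), len(mask[0])
    return [[foreground[y][x] if mask[y][x]
             else small_background[y // factor][x // factor]
             for x in range(width)]
            for y in range(height)]


if __name__ == "__main__":
    page = [
        [240, 240, 240, 240],
        [240,  10,  10, 240],
        [240, 240, 240, 240],
        [240, 240, 240, 240],
    ]
    mask, fg, bg = split_layers(page)
    small_bg = downsample(bg, 2)          # background stored at 1/4 the pixels
    print(recompose(mask, fg, small_bg, 2))
```

In this toy, the ink pixels survive at full resolution while the paper is stored at a quarter of the pixel count; on a real scan the background would lose detail, which is the trade MRC is making.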

I believe the Internet Archive’s archive-pdf-tools is the only functional open source implementation of MRC encoding.

One commercial tool that may do some kind of MRC compression is the suite of tools known as “GDPicture” (the company behind that has merged with competitor Orpalis making things even more confusing). They do advertise supporting MRC compression. I had a brief phone call with a sales engineer, who wasn’t super familiar with this feature but confirmed they had it, and gave me an overview of the products in general. There is a page at avepdf.com that is “powered by GDPicture MRC Compression SDK” that will let you apply MRC compression to existing PDFs for free… but only a couple an hour, so I haven’t managed to totally wrap my head around it. Hypothetically, then, the PassportPDF cloud SDK from the same company would give me access to “GDPicture MRC Compression” — but I haven’t yet managed to figure out how. (But see if you can at the API reference?). Figuring out what is available from proprietary projects can sometimes seem even more challenging than from open source.

The market-leader Abbyy also says they support MRC, including via an SDK? One of the first or most popular commercial tools to apply this technique may have been called “LuraTech”, I’m not sure the current status of that software.

Evaluating Internet Archive recode_pdf, compared to alternatives

When I ran internet archive’s recode_pdf (with arg --bg-downsample 3 and otherwise default arguments) on full-resolution TIFFs, it resulted in PDFs that were about 6% the size of a PDF I made from a full-res JPG! Or about 50% the size of the PDFs we make from 1200px JPGs — still a significant reduction. Looking at them side-by-side… in one of my samples the MRC-compressed PDFs did have some visible artifacts, but text is still sharp. In two other cases, no visible artifacts.

I tried to test the free trial at avepdf.com, but the extreme rate limit and cumbersome manual browser process made it hard to test a lot. I tested with PDFs that included lossless full-res PNG images, to avoid any lossy=>lossy quality issues. My initial reaction is that the text seems noticeably less sharp in the avepdf MRC-compressed PDF, even at “low” compression level, but if you zoom in, the text seems to get sharp again, which I don’t understand. My impression of image quality is of course subjective, and it’s hard to compare. avepdf MRC compression at “low” or “medium” compression seems to produce approximately the same size as my recode_pdf output.

If we end up not using MRC, then our 1200px JPG PDFs would be maybe ~2x the size of the recode_pdf full-resolution MRC PDFs. I learned from Merlijn’s presentation that PDF actually supports embedded JPEG2000 (jp2), which their MRC technique uses, instead of JPEG, and that jp2 may compress smaller for the same quality. Switching to jp2 instead of jpeg and playing around with maximum compression without artifacts across my sample size… I can get my 1200px JPG PDFs to be about the same size as the recode_pdf full-res MRC compressed PDFs, although of course at reduced resolution.

note on dpi and PDF page size and variation

PDF as a format is based on a 72 dots-per-inch (dpi) standard grid, with objects sometimes measured in actual inches. (It was a format meant for encoding things to be printed physically!).

You can embed an image of a given resolution, say 500×500 pixels in a PDF, but say it should take up however many “inches” you want, and it will be scaled on display. And the page size can be a given number of “inches” high and wide, which will determine how big it displays on a screen in most viewers.

The TIFF format also has a dpi value embedded in it, which sort of says how big in inches the TIFF (or the thing photographed) was. Some of the tools I tested detected the dpi from the source TIFF, and used it to determine the PDF page size. Others did not, and used a default or guessed size.

Many tools allow you to pass a dpi argument that they will use to determine the “page size” in the resulting PDF. In my understanding this should not affect actual image resolution or much other than initial zoom level or size of page if printed. If it does with a given tool, I don’t understand what is going on.
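
The relationship between pixels, dpi and PDF page size is simple arithmetic, sketched below. The function name is made up; the only facts relied on are that a PDF unit is 1/72 inch and that a dpi value says how many pixels make up one inch.

```python
POINTS_PER_INCH = 72  # PDF user-space units per inch


def page_size_points(width_px: int, height_px: int, dpi: float):
    """PDF page size (in 1/72-inch units) for an image at a claimed dpi."""
    return (width_px / dpi * POINTS_PER_INCH, height_px / dpi * POINTS_PER_INCH)


if __name__ == "__main__":
    # Same pixels, different claimed dpi: the page size changes, the image doesn't.
    print(page_size_points(1200, 1500, 150))
    print(page_size_points(1200, 1500, 300))
```

Doubling the dpi halves the page dimensions while the embedded pixels stay exactly the same, which matches the expectation above that dpi should only change page size and initial zoom.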

In my tests, I generally did not supply an explicit dpi value, to have one less knob to twiddle. So resulting PDF page sizes can vary.

Available Software to make text PDF from hOCR+images

Source Test Material and Methodology

To try out different tools and techniques, I started with three somewhat representative images from our collection.

A fairly ordinary page of single-column clear text from a book
A page where the photo has text more at an angle and contains figures and several text blocks
A graphical advert that has text headlines and blocks in several places and sizes

Note on embedded thumbnails: Our original TIFFs in our actual repository often have an embedded second image, a tiny ~100px thumbnail. (did you know TIFFs can contain more than one image file?). This is something software involved in some of our photographing workflow at some points in history did without us totally knowing. It can really mess up various image tools, including some included in these tests (I had some really confusing errors at some points, thanks to @MerlijnWajer for helping me out.). So the first thing I did was extract just the first image with vips copy original.tiff just_one_image.tiff (verified with imagemagick identify, which will tell you how many images are in there). (This may also have stripped some metadata, but preserved DPI metadata)

Tesseract — can create PDF with text layer directly

So you can ask tesseract to do the OCR and output PDF with text layers.

You have very little control of the raster image in the output: tesseract will convert your TIFF to a JPG (no control over JPG compression level), of the same resolution as the TIFF you used as input. This results in a pretty large PDF file. For our one-page samples it was 3.5MB to 5MB per page, which is a lot when we consider we will want PDFs for books hundreds of pages long.

You want to give tesseract the full-resolution TIFF for best OCR, but maybe want to use smaller files in the PDF. Or maybe you want to manually correct the OCR output before making a PDF?

One obvious option is having Tesseract generate an hOCR file with OCR-positional info, and using another tool to combine the hOCR with a raster image into a PDF. But it turns out no other tools I found actually render the tesseract-produced hOCR with text positioned as well as Tesseract itself does.

It made me wish tesseract had an option to take the HOCR (that you have perhaps edited), and combine it with images (of your choice of resolution and compression quality) to make a PDF, using tesseract’s superior HOCR-rendering. It turns out, things along those lines have been suggested, but rejected by tesseract developers who don’t want to get into the general business of creating PDFs.

Instead, they introduced a feature to create a “text-only” PDF — a very small PDF that actually only has the invisible glyphless text layer. The idea, as shown in that ticket, is that you can then use external tools to merge that with images or a PDF with raster images, to create the actual PDF with your choice of raster image and text layer.

I did get this to work pretty well, with qpdf as my merge utility. I merged the tesseract (invisible) text-only PDF with my “legacy” PDF of 1200px-wide JPGs, using these commands:

$ tesseract source.tiff source.tesseract_text_only -l eng -c textonly_pdf=1 pdf

$ qpdf image_only.pdf --underlay source.tesseract_text_only.pdf -- output_image_plus_text.pdf

One caveat — PDF pages have an inherent page size (usually expressed in inches, believe it or not). If the two PDFs you are merging are exactly the same size, that’s fine. If the text-only PDF is bigger than the image one (in PDF inches), that’s fine — that qpdf command will scale it down to match, and the output is just right. But if the text-only PDF is smaller, that qpdf command will just embed it in the middle, and the embedded invisible text won’t be properly aligned with the visible text on raster image.

You can supply dpi arguments to most PDF-creating utilities (including tesseract and recode_pdf below), which basically just affect the PDF inches size set on the resulting PDF. So you just want to make sure to do this to ensure the text-only PDF is at least as large in PDF-inches as the image PDF.
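To make the "PDF inches" comparison concrete, here is a minimal Python sketch of the arithmetic. It assumes the PDF default of 72 points per inch, and that a tool derives page size simply as pixels divided by DPI. The function names and the sample pixel dimensions are hypothetical, not taken from any of the tools discussed here.

```python
POINTS_PER_INCH = 72  # default PDF user unit: 1 point = 1/72 inch


def page_size_points(width_px, height_px, dpi):
    """Page size in PDF points for an image placed at the given DPI."""
    return (width_px / dpi * POINTS_PER_INCH,
            height_px / dpi * POINTS_PER_INCH)


def text_layer_is_large_enough(text_page, image_page):
    """True if the text-only page is at least as large as the image page
    in both dimensions, which is the case described above as merging well."""
    return text_page[0] >= image_page[0] and text_page[1] >= image_page[1]


# Hypothetical example: a 4800x6000 px TIFF OCR'd at 300 dpi,
# and a 1200x1500 px JPG placed in the image PDF at 150 dpi.
text_page = page_size_points(4800, 6000, 300)
image_page = page_size_points(1200, 1500, 150)
print(text_page, image_page, text_layer_is_large_enough(text_page, image_page))
```

If the check comes back false, lowering the dpi argument given to tesseract should make the text-only page larger in PDF inches.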

This works, but doesn’t accommodate the use case where we might want to correct errors in the OCR by editing an HOCR file, before producing the PDF. I haven’t found any way to take advantage of tesseract’s superior layout of OCRd text in the PDF, while correcting the OCR content before the PDF is produced. You can of course edit the PDF directly, but this is cumbersome, and doesn’t get you a corrected HOCR file you can use for other purposes too.

I’m probably going to need HOCR anyway for other purposes. You can have tesseract produce PDF and HOCR in one go if you want. (Btw it turns out tesseract can also produce alto although I’m not sure where this is documented).

tesseract dhc6a4r.tiff scratch/test.tesseract -l eng -c textonly_pdf=1 pdf hocr

Beware that if you produce individual tesseract PDFs with text content and try to combine them… you’ll wind up with duplicate copies of tesseract’s “glyphless font” embedded, one per each source PDF. I haven’t found a good way to merge/de-duplicate them, but I think the embedded glyphless font is only 527 bytes?

Other tools can take the hocr tesseract produces, and use it to position a text layer on a PDF… with mixed results. None currently do as well as tesseract’s own PDF positioning. It turns out going from tesseract hOCR to correctly positioned text on PDF is not a trivial operation?

archive-pdf-tools: recode_pdf — a sophisticated, and supported, tool

The Internet Archive’s archive-pdf-tools is a currently maintained, well-written package in python, extracted from their own workflow and shared. It began with an effort at the Archive that began in 2020 to move to an open source pipeline.

The recode_pdf command takes a TIFF and HOCR, and renders a PDF with text layer, and compressed with the sophisticated MRC compression. It may be the only open-source implementation of MRC compression technique.

It has quite a few non-python dependencies. The installation directions specified for ubuntu worked well for me on ubuntu. One C dependency, jbig2enc, does not exist in the standard Ubuntu package manager. It built from source fine for me on ubuntu, but that gives me some challenges for trying to get it installed on heroku. jbig2enc also has a non-standard-location apt package and a snap, as well as a brew package (I think from former Code4Libber Misty De Meo?). jbig2enc appears mostly unmaintained (although it does have an occasional trivial new PR merged, so it’s not totally abandoned); but it also appears to have a variety of forks out there with different bugfixes/patches, so I’m not sure all those sources are actually the same code!

I am having a bit of trouble installing archive-pdf-tools reliably on MacOS, but that may be corrected soon or may be my own fault.

recode_pdf’s rendering of text from the hOCR file delivered by tesseract is currently not as good as the rendering tesseract itself does when making PDFs, at least for my samples. I describe my observations in this issue filed at archive-pdf-tools.

In fact, archive-pdf-tools’s HOCR rendering is ported from tesseract (and writes PDFs directly with raw bytes, not using a PDF library of any kind). So why isn’t its rendering/positioning as good? Not yet clear.

This inferior HOCR rendering is unfortunate, because this is otherwise for sure the most mature/supported open-source HOCR rendering solution I found, which does do a better job of positioning than any other open source code I found. It’s also the only working open-source MRC compression implementation I found.

It was interesting to see the MRC compression. The output PDFs, which have as many pixels as our full-size source images (but under increased lossy compression), for the most part really do look just as good as PDFs with a much larger byte size, while being very small on disk. (There are compression artifacts in some samples though.) The archive-pdf-tools MRC-compressed PDFs are about 10% of the size of tesseract’s PDFs created with full-size JPGs. For our two high-text images they were about 50% of the size of our 1200px-wide JPG PDFs; for the graphical image with less text, it was about 80% the size of our 1200px-wide JPG PDF.

As this is the only open-source implementation known for MRC compression, it would be nice to be able to apply it de-coupled from the HOCR rendering. There has been some discussion and work on creating a pdfcomp executable for this, but it seems to still be ongoing. I have not managed to figure out how to test it myself yet. (It’s not clear to me if you are going to have quality problems giving it PDF input that is already JPG lossy compressed, or if this won’t matter in the end).

recode_pdf --bg-downsample 3 --from-imagestack source.tiff --hocr-file source.tesseract.hocr -o output.recode_pdf.pdf

While I was running only on one page at a time, I believe if you are running on multiple pages, recode_pdf wants a single HOCR file, with multiple pages, in the right order to correspond to the order of TIFF input arguments.

(Note, it turns out you CAN use recode_pdf without jbig2enc, by telling it to use a different inferior compression algorithm, with --mask-compression ccitt. For my three samples, this resulted in 13-25% larger file output. In the case of the mostly graphical one only, it made the PDF output larger than my 1200px JPG output.)

ocropus/hocr-tools: hocr-pdf (python) — has some problems

The hocr-tools package in python includes an hocr-pdf command that is intended to combine HOCR and JPGs to produce PDFs with text layers.

I installed hocr-tools 1.3.0 on my MacOS laptop with simple pip install hocr-tools.

The way hocr-pdf takes its input is a bit confusing: you need to run it on a directory which includes only source files, where a corresponding JPG and HOCR have the same name but for suffix. (JPEG must end in .jpg not .jpeg!)
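Since that naming convention is easy to get wrong, a small pre-flight check can help. This is only a sketch of the pairing rule as described above (one .jpg and one .hocr per base name, nothing else in the directory); the function name is hypothetical and it is not part of hocr-tools.

```python
from pathlib import PurePath


def unpaired_stems(filenames):
    """Return base names that lack exactly one .jpg plus one .hocr file."""
    suffixes_by_stem = {}
    for name in filenames:
        path = PurePath(name)
        suffixes_by_stem.setdefault(path.stem, []).append(path.suffix)
    return sorted(stem for stem, suffixes in suffixes_by_stem.items()
                  if sorted(suffixes) != [".hocr", ".jpg"])


# page2 is flagged because its image ends in .jpeg rather than .jpg
problems = unpaired_stems(["page1.jpg", "page1.hocr", "page2.jpeg", "page2.hocr"])
print(problems)
```

In practice you would pass it the result of listing the input directory before running hocr-pdf.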

hocr-pdf ./directory > output.pdf

The apache-licensed source code creates a PDF using a python PDF generation library, which is different from some code (such as archive-pdf-tools) that writes raw PDF bytes. So it may be a good place to look to understand the/an algorithm, possibly for porting to another language, if you want to use a PDF library rather than write raw PDF bytes. I considered this at one point; I’m not sure if (eg) ruby’s prawn has analogous features to all being used from the python PDF library, and I’m not sure how hard it would be.

It did not like it when I tried using it with a JPG of a different, smaller resolution than the TIFF the HOCR was created from: it produced wrongly scaled output. There are some tools/scripts available to resize HOCR coordinates (javascript, ruby), which I believe are what you’d need to do this.
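For illustration, here is a rough Python sketch of what resizing HOCR coordinates involves. It is not one of the existing scripts mentioned above, and the function name is made up. It assumes bounding boxes appear as "bbox x0 y0 x1 y1" inside title attributes, and it scales only those; other pixel-valued properties (such as a baseline offset) would presumably need the same treatment and are not handled here.

```python
import re

# Matches "bbox x0 y0 x1 y1" inside an hOCR title attribute.
BBOX_RE = re.compile(r"\bbbox (\d+) (\d+) (\d+) (\d+)")


def scale_hocr_bboxes(hocr_text, factor):
    """Return hOCR text with every bbox multiplied by factor and rounded."""
    def _scale(match):
        coords = [round(int(value) * factor) for value in match.groups()]
        return "bbox " + " ".join(str(c) for c in coords)
    return BBOX_RE.sub(_scale, hocr_text)


# Hypothetical word element; scale to a quarter of the original resolution.
sample = '<span class="ocrx_word" title="bbox 100 200 300 260; x_wconf 96">liquor</span>'
scaled = scale_hocr_bboxes(sample, 0.25)
print(scaled)
```

The factor would be the ratio of the smaller image's pixel width to the width of the TIFF the HOCR was generated from.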

To begin with as a demonstration, though, I just used it with a full-size JPG converted from the source TIFF at same resolution.

I did not get great results — the page sizes were weird. For the standard and graphical pages, the image was cut off, not entirely in the PDF — it seems to insist on 8.5″/11″ aspect ratio/page size? For the intermediate “diagonal” page, the page just took up a portion of the canvas, it was too small. The text still did line up with the image, but it seems like perhaps some assumptions about DPI we are not meeting, or other bugs in how the tool calculates PDF page size. I have not yet spent time to report these problems on Github Issues, because other problems encountered probably make this tool unsuitable for me anyway.

In all cases, the HOCR rendering is… OK. I would say it is about the same quality as archive-pdf-tools’s, although it is not identical to archive-pdf-tools, even from the same HOCR file. Apparently HOCR positioning is non-trivial.

On the “diagonal” page, hocr-pdf didn’t make the lines too high like recode_pdf — but it seemed incapable of including angled lines at all, the lines are rendered straight, which makes them not match up with the actual image text. (Try selecting the line “The liquor…” at the bottom). This seems to make it pretty unsuitable for use with our actual input corpus.

hocr-pdf also strangely bloated the size of resulting PDFs. Creating a PDF from a JPG that was 3.3M, the resulting PDF was 4.1M! (Compare to tesseract-produced PDF of 3.5M, which makes sense, adding just 200K for textual info). And the PDFs it created generated lots of warning-complaints from poppler and other PDF tools.

eloops/hocr2pdf (js) — proof of concept without great rendering

When looking for any open source HOCR rendering code I could find, I found this package on github. At the time I found it, it hadn’t had a commit in many years, and from the commit history and repo activity it was unclear if it had ever really been used at scale, and it didn’t have a license on it. At that early point, if it was working code in Javascript (which uses a PDF-generating library instead of writing raw PDF bytes), I was potentially interested in porting its logic to ruby.

I got the author’s email address from the commit history, and emailed them to inquire. Stephen Poole kindly got back to me to confirm this was basically a proof-of-concept that was never used for real work. Stephen kindly added an MIT license in case I wanted to use it.

Curious, I wanted to test it on my test images and hocr. I was able to get it to work after realizing it needed an old version of the cheerio HTML parser, and fixing up the example in the README.

It didn’t do a great job of rendering. Trying to highlight-select lines, it was often impossible to select a line continuously, perhaps because the words on the line ended up with very different heights and baseline positions. It was not able to render the diagonal text diagonally in the diagonal example. (Try selecting “This effect, especially as regards purples” in the diagonal file to see both issues).

An interesting example, mostly demonstrating that positioning rendered HOCR even as well as archive-pdf-tools does is not necessarily trivial.

Exactimage hocr2pdf — didn’t work for me at all

At first I imagined I was going to find a compiled executable available through package managers that simply combined hocr and images to make a PDF, as if this were a normal thing.

At first that’s what it looked like the ExactImage hocr2pdf tool was. It is available via “brew install exactimage” or “apt-get install exactimage”.

The problem is… it didn’t work for me.

At first I had trouble getting it to take my inputs at all, it said “Error reading input file.” If I opened the TIFF in MacOS preview and re-exported as a TIFF again, I could get it to read it.

But it produced weird PDFs with no scanned images at all, and just a portion of the HOCR text rendered visibly in giant font.

It is an old package that doesn’t seem to be getting maintenance; the docs suggest it was written for use with HOCR from the (also non-maintained) cuneiform OCR package. Either I don’t understand how to use it, or HOCR has changed enough over time/between vendors that it can’t handle contemporary tesseract HOCR.

hocr2pdf -i scan.tiff -o test.pdf < ocr.hocr

pdfbeads (ruby) — a historical artifact, of unclear current utility

Researching this stuff, I found mention of this mythical project “pdfbeads”, which was written, in ruby, over 10 years ago, and appeared to be targeted at creating “ebook” PDFs from scans. There was a lot of energy in this domain back then, and at one point this was a well-known package with implementations of some things not found elsewhere.

It did/does both HOCR rendering and a kind of compression that seems to be similar to MRC, if not being MRC, although it’s not referred to as such in pdfbeads code or docs.

I am not certain when it was first written, because it was originally in a “rubyforge” repo, and rubyforge is gone, along with its commit history and discussions that were there, which is sad. Some “forks” of pdfbeads exist on github, but none of them copied history from the original rubyforge (svn?) repo. Some claim to do things like “update for ruby 2.0”. For instance d235j/pdfbeads (which has a version number of 1.1.1), and ifad/pdfbeads (which has a version number of 1.0.11).

OK, the weird thing is… pdfbeads got a rubygems release 1.1.3 in Jan 2022, only a year ago, the first rubygems release since 2014. I have no idea if the repo this release came from is public, or really where to find the code for this release (other than in the gem package); rubygem metadata for “homepage” still points to rubyforge!

But a CHANGELOG file is captured in the rubygem package, which rubygems.org conveniently shows us in a diff, so we can see what features have been added/changed in the latest release.

The READMEs found in all those locations do have an email address for the pdfbeads author, Alexey Kryukov. I tried emailing him for info (and if there is a public repo), but haven’t heard back.

I was initially interested in pdfbeads because I thought it might have a useful ruby implementation of HOCR rendering (writing direct raw PDF bytes, it looks like), and because I thought it might be the only other identified open source implementation of MRC-style compression!

pdfbeads input methods are kind of confusing — not sure if it wants an HOCR file per image, or one combined one like archive-pdf-tools. I tested it on just one image/hocr at a time. Input files can’t have more than one . (period) in them, which had me stuck for a bit. It will leave a lot of intermediate files around, so is best run in a scratch or per-work directory. Using latest 1.1.3 release from rubygems.

 pdfbeads dhc6a4r.tiff dhc6a4r.hocr -f -o dhc6a4r.pdfbeads.pdf

Whatever compression it’s supposed to be doing isn’t working at all for me. That output a 15M PDF, which is 5x the size of the PDF tesseract outputs from the same TIFF input! So… negative compression for me? Extracting images from the produced PDF shows it was making multiple image overlays MRC-style (and that it decided to downsample the pixel resolution from the source TIFF, by different factors for different images, maybe depending on DPI), but I guess its algorithm just didn’t work well with my input? Maybe it expects black and white input only?

There is probably something I don’t understand about how it is intended to be used. I have found it hard to find instructions/documentation (here’s an HTML doc page at some historical version?), and hard for me to understand.

The HOCR rendering was okay on some pages, but had some serious problems on others. On our image test #2, with “diagonal” text, the diagonal angle of the rendered lines was correct, but they were wrongly vertically offset from their true positions by about half a line? And our #3 graphical image, the line beginning “for home users” was just plain missing, although other lines were positioned well?

Overall, I’m not sure what’s going on with this code.

Ocrmypdf — a high-level tool for adding OCR to PDFs, usually with tesseract

For completeness, I thought I’d mention Ocrmypdf, because it is something that’s actually still maintained/developed (which seems to be unusual in this field!), with a lot of functionality.

It seems focused on the use case of having PDFs of scans, say from a photocopy machine, and wanting to have a “just works” tool that takes that as input and leaves you with a text layer. It’s sort of a high-level integration of lots of other tools to try to give you this one-click solution. It itself is written in python.

While it does have its own python implementation of HOCR rendering/positioning, by default it uses a “sandwich” mode to have tesseract position the OCR’d text, rather than its own HOCR renderer. It does say its own HOCR renderer “has the best compatibility with Mozilla’s PDF.js viewer”, but also warns it doesn’t currently handle non-Latin Unicode properly.

It does not do MRC compression, but is interested in it, and began talking to Merlijn from Internet Archive about it, which led to the archive-pdf-tools attempts at the pdfcomp tool.

I didn’t spend too much time actually investigating this, when I saw that it by default just used tesseract for text rendering, and didn’t implement MRC. I haven’t actually tested its built-in HOCR rendering; I only just noticed now that OcrMyPDF docs suggest you might want to use it for “better compatibility with Mozilla’s PDF.js viewer”?

jbrinley/HocrConverter (python) — one more

I didn’t take the time to actually play with this one yet, but for completeness — former Code4Libber Jon Brinley has some 13-year-old python code at https://github.com/jbrinley/HocrConverter/blob/master/HocrConverter.py, which also links to a blog post of his at https://xplus3.net/2009/04/02/convert-hocr-to-pdf/

A copyright notice in OcrMyPDF source for Jon Brinley suggests maybe their implementation came first from here? Maybe.

10-15 years ago, people were doing a lot of work in this area that just kind of… stalled out?


OCFL and “source of truth” — two options

jrochkind Mar 21, 2023

One of the great things about conferences is how different sessions can play off each other, and how lots of people interested in the same thing are in the same place (virtual or real) at the same time, to bounce ideas off each other.

I found both of those things coming into play to help elucidate what I think is an important issue in how software might use the Oxford Common File Layout (OCFL). This was prompted by the Code4Lib 2023 session The Oxford Common File Layout – Understanding the specification, institutional use cases and implementations, with presentations by Tom Wrobel, Stefano Cossu, and Arran Griffith (recorded video here).

OCFL is a specification for laying files out in a disk-like storage system, in a way that is suitable for long-term preservation. It uses a standard simple layout that is both human- and machine-readable, and that would allow someone (some software) at a future point to reconstruct digital objects and metadata from the record left on disk.

The role of OCFL in a software system: Two choices

After the conference presentation, Matt Lincoln from JStor Labs asked a question in Slack chat that had been rising up in my mind too, but which Matt said more clearly than was in my mind at the time! This prompted a discussion on Slack, largely but not entirely between me and Stefano Cossu, which I found to be very productive, and which I’m going to detail here with my own additional glosses, but first let’s start with Matt’s question.

(I will insert slack links to quotes in this piece; you probably can’t see the sources unless you are a member of the Code4Lib workspace).

For the OCFL talk, I’m still unclear what the relationship is/can/will be in these systems between the database supporting the application layer, and the filesystem with all the OCFL-laid-out objects. Does DB act as a source of truth and OCFL as a copy? OCFL as source of truth and DB as cache? No db at all, and just r/w directly to OCFL? If I’m a content manager and edit an item’s metadata in the app’s web interface, does that request get passed to a DB and THEN to OCFL? Is the web app reading/writing directly to the OCFL filesystem without mediating DB representation? Something else?
Matt Lincoln

I think Matt, utilizing the helpful term “source of truth”, accurately identifies two categories of use of OCFL in a software system — and in fact, that different people in the OCFL community — even different presenters in this single OCFL conference presentation — had been taking different paths, and maybe assuming that everyone else was on the same page as them, or at least not frequently drawing out the difference and consequences of these two paths.

Stefano Cossu, one of the presenters from the OCFL talk at Code4Lib, described it this way in a Slack response:

IMHO OCFL can either act as a source from which you derive metadata, or a final destination for preservation derived from a management or access system, that you don’t want to touch until disaster hits. It all depends on how your ideal information flow is. I believe Fedora is tied to OCFL which is its source of truth, upon which you can build indices and access services, but it doesn’t necessarily need to be that way.
Stefano Cossu

It turns out that both paths are challenging in different ways; there is no magic bullet. I think this is a foundational question for the software engineering of systems that use OCFL for preservation, with significant implications on the practice of digital preservation as a whole.

First, let’s say a little bit more about what the paths are.

“OCFL as a source of truth”

If you are treating OCFL as a “source of truth”, the files stored in OCFL are the main primary location of your data.

When the software wants to add, remove, or change data, it will probably happen to the OCFL first, or at any rate won’t be considered a successful change until it is reflected in OCFL.

There might be other layers on top providing alternate access to the OCFL, some kind of “index” to OCFL for faster and/or easier access to the data, but these are considered “derivative”, and can always be re-created from just the OCFL. The OCFL is “the data”, everything else is “derivative” and can be re-created by an automated process from the OCFL on disk.
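As a rough illustration of what "re-created from just the OCFL" can mean, here is a minimal Python sketch that derives a lookup index from one object's inventory. It assumes the inventory.json shape from the OCFL spec, where "manifest" maps digests to stored content paths and each version's "state" maps digests to logical paths. The function name, the sample object, and the shortened digests are all made up for the example.

```python
def files_in_head(inventory):
    """Map each logical path in the head version to a stored content path.

    Assumes 'manifest' maps digest -> content paths and each version's
    'state' maps digest -> logical paths, as in an OCFL inventory.json.
    """
    head = inventory["head"]
    state = inventory["versions"][head]["state"]
    manifest = inventory["manifest"]
    index = {}
    for digest, logical_paths in state.items():
        content_path = manifest[digest][0]  # any stored copy will do
        for logical_path in logical_paths:
            index[logical_path] = content_path
    return index


# Hypothetical two-version object with shortened, made-up digests.
inventory = {
    "id": "example:object-1",
    "head": "v2",
    "manifest": {
        "aaa111": ["v1/content/image.tiff"],
        "bbb222": ["v1/content/metadata.json"],
        "ccc333": ["v2/content/metadata.json"],
    },
    "versions": {
        "v1": {"state": {"aaa111": ["image.tiff"], "bbb222": ["metadata.json"]}},
        "v2": {"state": {"aaa111": ["image.tiff"], "ccc333": ["metadata.json"]}},
    },
}
index = files_in_head(inventory)
print(index)
```

A real index would be loaded from the inventory.json files on disk and would likely live in a database or search engine, but the point is the same: it is derived data that can be thrown away and rebuilt.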

This may be what some of the OCFL designers were assuming everyone would do; as we’ll see, it makes certain things possible, and provides the highest level of confidence in our preservation activities.

“OCFL off to the side”

Alternately, you might write an application more or less using standard architectures for writing (eg) web applications. The data is probably in a relational database system (rdbms) like postgres or MySQL, or some other data store meant for supporting application development.

When the application makes a change to the data, it’s made to the primary data store.

Then the data is “mirrored” to OCFL. Possibly after every change, or possibly periodically. The OCFL can be thought of as a kind of “backup”: a backup in a specific standard format meant to support long-term preservation and interoperability. I’m calling this “off to the side”, Stefano above calls it “final destination”, in either case contrasted with “source of truth”.

It’s possible you haven’t stored all the data the application uses to OCFL, only the data you want to backup “for long-term preservation purposes”. (Stefano later suggests this is their practice, in fact). Maybe there is some data you think is necessary only for the particular present application’s functionalities (say, to support back-end accounts and workflows), which you think of as accidental, ephemeral, contextual, or system-specific and non-standard– and which you don’t see any use to storing for long-term preservation.

In this path, if ALL you have is the OCFL, you aren’t intending that you can necessarily stand your actual present application back up. Maybe you didn’t store all the data you’d need for that; maybe you don’t have existing software capable of translating the OCFL back to the form the application actually needs it in to function. Or if you are intending that, the challenge is greater to accomplish it, as we’ll see.

So why would you do this? Well, let’s start with that.

Why not OCFL as a source of truth?

There’s really only one reason — because it makes application development a lot harder. What do I mean by “a lot harder”? I mean, it’s going to take more development time, and more development care and decisions, you’re going to have more trouble achieving reasonable performance in a large-scale system — and you’re going to make more mistakes, have more bugs and problems, more initial deliveries that have problems. It’s not all “up-front” cost or known cost, but as you continue to develop the system, you’re going to keep struggling with these things. You honestly have increased chance of failure.

Why?

In the Slack thread, Stefano Cossu spoke up for OCFL to be a “final destination”, not the “source of truth” for the daily operating software:

I personally prefer OCFL to be the final destination, since if it’s meant to be for preservation, you don’t want to “stir” the medium by running indexing and access traffic, increasing the chances of corruption.
Stefano Cossu

If you’re using it as the actual data store for a running application, instead of leaving it off to the side as a backup, it perhaps increases the chances of bugs affecting data reliability.

The problem with that setup [OCFL as source of truth] is that a preservation system has different technical requirements from an access system. E.g. you may not want store (and index) versioning information in your daily-churn system. Or you may want to use a low-cost, low-performance medium for preservation
Stefano Cossu

OCFL is designed to rebuild knowledge (not only data, but also the semantic relationships between resources) without any supporting software. That’s what I intend for long-term preservation. In order to do that, you need to serialize everything in a way that is very inefficient for daily use.
Stefano Cossu

The form that OCFL prescribes is cumbersome to use for ordinary daily functionality. It makes it harder to achieve the goals you want for your actually running software.

I think Stefano is absolutely right about all of this, by the way, and also thank him for skillfully and clearly delineating a perspective that may, explicitly or not, actually be somewhat against the stream of some widespread OCFL assumptions.

One aspect of the cumbersomeness is that writes to OCFL need to be “synchronized” with regard to concurrency. The contents of a new version written to OCFL are expressed as deltas on the previous version, so if another version is added while you are preparing yours, your version will be wrong. You need to use some form of locking, whether optimistic locking or naive pessimistic locks.
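The optimistic variant can be sketched with a toy in-memory model in Python. This is not a real OCFL client; the class and method names are hypothetical. The idea it shows is only this: remember which head version your changes were based on, and refuse to commit if the head has moved in the meantime.

```python
import threading


class StaleVersionError(Exception):
    """Raised when the object gained a version while ours was being prepared."""


class ToyVersionedObject:
    """In-memory stand-in for one OCFL object; not a real OCFL client."""

    def __init__(self):
        self._lock = threading.Lock()
        self.versions = [{}]  # versions[0] plays the role of v1

    def head(self):
        """Return (head version number, copy of that version's state)."""
        with self._lock:
            return len(self.versions), dict(self.versions[-1])

    def commit(self, based_on, changes):
        """Append a new version only if head is still the version we read."""
        with self._lock:
            if len(self.versions) != based_on:
                raise StaleVersionError(f"head moved past v{based_on}")
            new_state = dict(self.versions[-1])
            new_state.update(changes)
            self.versions.append(new_state)
            return len(self.versions)


obj = ToyVersionedObject()
version_a, _ = obj.head()                    # writer A reads v1
version_b, _ = obj.head()                    # writer B also reads v1
obj.commit(version_a, {"title": "from A"})   # A wins, creating v2
try:
    obj.commit(version_b, {"title": "from B"})
    conflict = False
except StaleVersionError:
    conflict = True                          # B must re-read head and retry
```

On a real filesystem or object store the check-and-write step is the hard part, because there is no in-process lock shared by every writer; that is exactly the kind of problem an rdbms already solves for you.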

Whereas a relational database system is built on decades of work to ensure ACID (atomicity, consistency, isolation, durability) with regard to writes, while also trying to optimize performance within these constraints (which can be a real tension) — with OCFL we don’t have the built-up solutions (tools and patterns) for this to the same extent.

Application development gets a lot harder

In general, building a (say) web app on a relational database system is a known problem with a huge corpus of techniques, patterns, shared knowledge, and toolsets available. A given developer may be more or less experienced or skilled; different developers may disagree on optimal choices in some cases. But those choices are being made from a very established field, with deep shared knowledge on how to build applications rapidly (cheaply), with good performance and reliability.

When we switch to OCFL as the primary “source of truth” for an app, we in some ways are charting new territory and have to figure out and invent the best ways to do certain things, with much less support from tooling, the “literature” (even including blogs you find on google etc), and a much smaller community of practice.

The Fedora repository platform is in some sense meant to be a kind of “middleware” to make this lift easier. In its version 6 incarnation, its own internal data store is OCFL. It doesn’t give you a user-facing app. It gives you a “middleware” you can access over a more familiar HTTP API with clear semantics, and you don’t have to deal with the underlying OCFL (or in previous incarnations other internal formats) yourself. (Seth Erickson’s ocfl_index could be thought of as similar peer “middleware” in some ways, although it’s read-only, it doesn’t provide for writing).

But it’s still not the well-trodden path of rapid web application development on top of an rdbms.

I think that the samvera (née hydra) community really learned this to some extent the hard way: trying to build on top of this novel architecture really raised the complexity, cost, and difficulty of implementing the user-facing application (with implications on succession, hiring, and retention too). I’m not saying this happened because the Fedora team did something wrong, I’m saying a novel architecture like this inherently and necessarily raises the difficulty over a well-trodden architectural path. (Although it’s possible to recognize the challenge and attempt to ameliorate it with features that make things easier on developers, it’s not possible to eliminate it.)

Some samvera peer institutions have left the Fedora-based architecture, I think as a result of this experience. Where I work at Science History Institute, we left sufia/hydra/samvera to write a closer to “just plain Rails app”, and I believe it successfully and seriously increased our capacity to meet organizational and business needs within our available software engineering capacity. I personally would be really reluctant to go back to attempting to use Fedora and/or OCFL as a “source of truth”, instead of more conventional web app data storage patterns.

So… that’s why you might not… but what do you lose?

What do you lose without OCFL as source of truth?

The trade-off is real though. I think some of the assumptions about what OCFL provides are actually based on assuming OCFL is the source of truth in your application.

Mike Kastellec’s Code4Lib presentation just before the OCFL one, on How to Survive a Disaster [Recovery] really got me thinking about backups and reliability.

Many of us have heard (or worse, found out ourselves the hard way) the adage: You don’t really know if you have a good backup unless you regularly go through the practice of recovery using it, to test it. Many have found that what they thought was their backup — was missing, was corrupt, or was not in a format suitable for supporting recovery. Because they hadn’t been verifying it would work for recovery, they were just writing to it but not using it for anything.

(Where I work, we try to regularly use our actual backups as the source of sync’ing from a production system to a staging system, in part as a method of incorporating backup recovery verification into our routine).

How is a preservation copy analogous? If your OCFL is not your source of truth, but just “off to the side” as a “preservation copy” — it can easily be a similar “write-only” copy. How do you know what you have there is sufficient to serve as a preservation copy?

Just as with backups, there are (at least) two categories of potential problem: It could be there are bugs in your synchronization routines, such that what you thought was being copied to OCFL was not, or not on the schedule you thought, or was getting corrupted or lost in transit. But the other category, even worse — it could be that your design had problems, and what you chose to sync to OCFL left out some crucial things that these future consumers of your preservation copy would have needed to fully restore and access the data. Stefano also wrote:

We don’t put everything in OCFL. Some resources are not slated for long-term preservation. (or at least, we may not in the future, but we do now)

If you are using the OCFL as your daily “source of truth”, you at least know the data you have stored in OCFL is sufficient to run your current system. Or at least you haven’t noticed any bugs with it yet, and if anyone notices any you’ll fix them ASAP.

The goal of preservation is that some future system will be able to use these files to reconstruct the objects and metadata in a useful way… It’s good to at least know it’s sufficient for some system, your current system. If you are writing to OCFL and not using it for anything… it reminds us of writing to a backup that you never restore from. How do you know it’s not missing things, by bug or by misdesign?
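One partial answer for the "by bug" category is to routinely compare what the application believes it has preserved against what the OCFL actually records. Below is a minimal Python sketch, assuming the application tracks a digest per file and that the OCFL inventory has the usual "head" and per-version "state" (digest to logical paths) fields. The function name and sample data are hypothetical. It does nothing for the "by misdesign" category, where the expected list itself is incomplete.

```python
def preservation_gaps(expected, inventory):
    """Compare what the app believes is preserved against an OCFL inventory.

    expected: dict of logical path -> digest, as recorded by the application.
    Returns (missing, mismatched) lists of logical paths.
    """
    head_state = inventory["versions"][inventory["head"]]["state"]
    preserved = {path: digest
                 for digest, paths in head_state.items()
                 for path in paths}
    missing = sorted(p for p in expected if p not in preserved)
    mismatched = sorted(p for p in expected
                        if p in preserved and preserved[p] != expected[p])
    return missing, mismatched


# Hypothetical data with shortened, made-up digests.
expected = {
    "image.tiff": "aaa111",
    "metadata.json": "ccc333",   # app has newer metadata than the OCFL copy
    "transcript.txt": "ddd444",  # never made it to OCFL
}
inventory = {
    "head": "v1",
    "versions": {
        "v1": {"state": {"aaa111": ["image.tiff"], "bbb222": ["metadata.json"]}},
    },
}
missing, mismatched = preservation_gaps(expected, inventory)
print(missing, mismatched)
```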

Do you even intend the OCFL to be sufficient to bring up your current system (I think some do, some don’t, some haven’t thought about it), and if you do, how do you know it meets your intents?

OCFL and Completeness and Migrations

The OCFL web page lists as one of its benefits (which I think can also be understood as design goals for OCFL):

Completeness, so that a repository can be rebuilt from the files it stores

If OCFL is your application’s “source of truth”, you have this necessarily, in the sense of that almost being the definition of OCFL being the “source of truth”. (Maybe this suggests at least some OCFL designers were assuming it as source of truth.)

But if your OCFL is “off to the side”… do you even have that? I guess it depends on if you intended the OCFL to be transformable back to your application’s own internal source of truth, and if that intention was successful. If we’re talking about data from your application being written “off to the side” to OCFL, and then later transformed back to your application — I think we’re talking about what is called “round-tripping” the data.

There was another Code4Lib presentation about repository migration at Stanford, in the Slack discussion happening about that presentation, Stanford’s Justin Coyne and Mike Giarlo wrote:

I don’t recommend “round trip mappings”. I was a developer on this project. It’s very challenging to not lose data when going from A -> B -> A
Justin Coyne

We spent sooooo much time on getting these round-trip mappings correct. Goodness gracious.
Mike Giarlo

So, if you want to make your OCFL “off to the side” provide this quality of completeness via round-trippability, you probably have to be focusing on it intentionally, and then it’s still going to be really hard, maybe one of the hardest (most time-consuming, most buggy) aspects of your application, or at least its persistence layer.

I found this presentation about repository migration really connecting my neurons to the OCFL discussion generally — when i thought about this I realized, well, that makes sense, woah, is one description of “preservation” activities actually: a practice of trying to plan and provide for unknown future migrations not yet fully spec’d?

So, while we were talking about repository migrations on Slack, and how challenging the data migrations were (several conf presentations dealt with data migrations in repositories) Seth Erickson made a point about OCFL:

One of the arguments for OCFL is that the repository software should upgradeable/changeable without having to migrate the data… (that’s the aspiration, anyways)
Seth Erickson

If the vision is that with nothing more than an OCFL storage system, we can point new software to it and be up and running without a data migration — I think we can see this is basically assuming OCFL as the “source of truth”, and also talking about the same thing the OCFL webpage calls “completeness” again.

And why is this vision aspirational? Well, to begin with, we don’t actually have very many repository systems that use OCFL as a source of truth. We may only have Fedora — that is, systems that use Fedora as middleware. Or maybe ocfl_index too, although it being only read-only and also middleware that doesn’t necessarily have user-facing software built on it yet, it’s probably currently a partial entry at most.

If we had multiple systems that could already do this, we’d be a lot more confident it would work out — but of course, the expense and difficulty of building a system using OCFL as the “source of truth” is probably a large part of why we don’t!

OK, do we at least have multiple systems based on fedora? Well… yes. Even before Fedora was based on OCFL, it would hypothetically be possible to upgrade/change repository software without a data migration if both source and target software were based on Fedora… except, in fact, it was not possible to do this between Samvera sufia/hydra and Islandora, despite both being based on fedora, because even though they both used fedora, their metadata stored in Fedora (or OCFL) was not consistent. A whole giant topic we’re not going to cover here, except to point out it’s a huge challenge for that vision of “completeness” providing for software changes without data migration, a huge challenge that we have seen in practice, without necessarily seeing a success in practice. (Even within hyrax alone, there are currently two different possible fedora data layouts, using traditional activefedora with “wings” adapter or instead valkyrie-fedora adapter, requiring data migration between them!)

And if we think of the practice of preservation as being trying to maximize chances of providing for migration to future unknown systems with unknown needs… then we see it’s all aspirational (that far-future digital preservation is an aspirational endeavor is of course probably not a controversial thing to say either).

But the little bit of paradox here is that while “completeness” makes it more likely you will be able to easily change systems without data loss, the added cost of developing systems that achieve “completeness” via OCFL as “source of truth” means — you will probably have much fewer, if any, choices of suitable systems to change to, or resources available to develop them!

So… what do we do? Can we split the difference?

I think the first step is acknowledging the issue, the tension here between completeness via “OCFL as source-of-truth” and, well, ease of software development. There is no magic answer that optimizes everything, there are trade-offs.

That quality of “completeness” of data (“source of truth”) is going to make your software much more challenging to develop. Take longer, take more skill, have more chance of problems and failures. And another way to say this is: Within a given amount of engineering resources, you will be delivering fewer features that matter to your users and organization, because you are spending more of your resources on implementing on a more challenging architecture.

What you get out of this is aspirationally increased chances of successful preservation. This doesn’t mean you shouldn’t do it, digital preservation is necessarily aspirational. I’m not sure how one balances this cost and benefit (it will likely be different for different institutions), but I think we should be careful not to be routinely under-estimating the cost or over-estimating the size or confidence of benefits from the “source of truth” approach. Undoubtedly many institutions will still choose to develop OCFL as a source of truth, especially using middleware intended to ease the burden, like Fedora.

I will probably not be one of them at my current institution — the cost is just too high for us, we can’t give up the capacity to relatively rapidly meet other organizational and user needs. But I’d like to look at incorporating OCFL as “off to the side” preservation copy anyway in the future.

(And Stefano and I are definitely not the only ones considering this or doing it. Many institutions are using an “off to the side” “final destination” approach to preservation copies, if not with OCFL, then with some of its progenitors or peers like BagIt or Stanford’s MOAB. The “off to the side” approach is not unusual, and for good reasons! We can acknowledge it and talk about it without shame!)

If you are developing instead with OCFL as a “off to the side” (or “final destination”), are there things you can do to try to get closer to the benefits of OCFL as “source of truth”?

The main thing I can think of involves “round-trippability”

Yes, commit to storing all of your objects and metadata necessary to restore a working current system in your OCFL
And commit to storing it round-trippably
One way to ensure/enforce this would be — every time you write a new version to OCFL, run a job that serializes those objects and metadata to OCFL, and back to your internal format, and verify that it is still equivalent. Verify the round-trip.
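As a minimal sketch of what "verify the round-trip" could look like, here is a toy version in plain ruby. Everything in it is an assumption for illustration: `serialize_for_preservation` and `restore_from_preservation` are hypothetical stand-ins for whatever actually writes to and reads from your OCFL copy, and JSON stands in for the real on-disk format.

```ruby
require 'json'

# Hypothetical stand-in for "write this record to the preservation copy".
def serialize_for_preservation(record)
  JSON.generate(record)
end

# Hypothetical stand-in for "read it back into the internal format".
def restore_from_preservation(serialized)
  JSON.parse(serialized)
end

# True only if the record survives internal -> preservation -> internal unchanged.
def round_trips?(record)
  restore_from_preservation(serialize_for_preservation(record)) == record
end

plain = { "title" => "Oral history interview", "page_count" => 12 }
lossy = { title: "Oral history interview", page_count: 12 }

puts round_trips?(plain) # true: string keys and JSON-native values survive
puts round_trips?(lossy) # false: symbol keys come back as strings
```

Even in this toy, something as small as symbol keys breaks equivalence, which is the kind of quiet loss a scheduled round-trip check is meant to catch.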

Round-trippability doesn’t just happen on its own, and ensuring it will definitely significantly increase the cost of your development. As the Stanford folks said from experience, round-trippability is a headache and a major cost! But, it could conceivably get you a lot of the confidence in “completeness” that “source of truth” OCFL gets you. And as it still is “off to the side”, it still allows you to write your application using whatever standard (or innovative in different directions) architectures you want, you don’t have the novel data persistence architecture design involved in all of your feature development to meet user and business needs.

This will perhaps arrive at a better cost/benefit balance for some institutions.

There may be other approaches or thoughts, this is hopefully the beginning of a long conversation and practice.

http://bibwild.wordpress.com/?p=11155

Escaping/encoding URI components in ruby 3.2

jrochkind Feb 14, 2023

Thanks to zverok_kha’s awesome writeup of Ruby changes, I noticed a new method released in ruby 3.2: CGI.escapeURIComponent

This is the right thing to use if you have an arbitrary string that might include characters not legal in a URI/URL, and you want to include it as a path component or part of the query string:

require 'cgi'

url = "https://example.com/some/#{ CGI.escapeURIComponent path_component }" + 
  "?#{CGI.escapeURIComponent my_key}=#{CGI.escapeURIComponent my_value}"

The docs helpfully refer us to RFC3986, a rare citation in the wild world of confusing and vaguely-described implementations of escaping (to various different standards and mistakes) for URLs and/or HTML
This will escape / as %2F, meaning you can use it to embed a string with / in it inside a path component, for better or worse
This will escape a space ( ) as %20, which is correct and legal in either a query string or a path component
There is also a reversing method available CGI.unescapeURIComponent
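A quick illustration of the points above, assuming ruby 3.2 or later (the input string is made up):

```ruby
require 'cgi'

# An arbitrary string containing a slash and a space.
original = "reports/2023 final"

escaped  = CGI.escapeURIComponent(original)
restored = CGI.unescapeURIComponent(escaped)

puts escaped              # reports%2F2023%20final
puts restored == original # true
```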

What if I am running on a ruby previous to 3.2?

Two things in standard library probably do the equivalent thing. First:

require 'cgi'
CGI.escape(input).gsub("+", "%20")

CGI escape but take the +s it encodes space characters into, and gsub them into the more correct %20. This will not be as performant because of the gsub, but it works.

This, I noticed once a while ago, is what ruby aws-sdk does… well, except it also unescapes %7E back to ~, which does not need to be escaped in a URI. But… generally… it is fine to percent-encode ~ as %7E. Or copy what aws-sdk does, hoping they actually got it right to be equivalent?
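One thing worth convincing yourself of is that the gsub can't clobber a real plus sign: CGI.escape has already turned any literal + in the input into %2B, so every + left in its output stands for a space. A small sketch (the helper name `escape_uri_component_compat` is my own, not from any library):

```ruby
require 'cgi'

# Hypothetical helper name; wraps the CGI.escape + gsub approach for rubies before 3.2.
def escape_uri_component_compat(input)
  CGI.escape(input).gsub("+", "%20")
end

input = "rock + roll/blues"

puts CGI.escape(input)                  # rock+%2B+roll%2Fblues
puts escape_uri_component_compat(input) # rock%20%2B%20roll%2Fblues
```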

Or you can use:

require 'erb'
ERB::Util.url_encode(input)

But it’s kind of weird to have to require the ERB templating library just for URI escaping. (and would I be shocked if ruby team moves erb from “default gem” to “bundled gem”, or further? Causing you more headache down the road? I would not). (btw, ERB::Util.url_encode leaves ~ alone!)

Do both of these things do exactly the same thing as CGI.escapeURIComponent? I can’t say for sure, see discussion of CGI.escape and ~ above. Sure is confusing. (there would be a way to figure it out, take all the chars in various relevant classes in the RFC spec and test them against these different methods. I haven’t done it yet).
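For anyone who wants to run that comparison, here is a rough sketch of how it might look. It only covers printable ASCII (a fuller check would also want multi-byte UTF-8 input), the method name `escaping_disagreements` is made up, and I'm not claiming anything here about what it reports on any given ruby version.

```ruby
require 'cgi'
require 'erb'

# Returns [char, {label => escaped}] pairs for every char the escapers disagree on.
def escaping_disagreements(chars)
  escapers = {
    "CGI.escape + gsub"    => ->(s) { CGI.escape(s).gsub("+", "%20") },
    "ERB::Util.url_encode" => ->(s) { ERB::Util.url_encode(s) }
  }
  # CGI.escapeURIComponent only exists on ruby 3.2+, so include it when available.
  if CGI.respond_to?(:escapeURIComponent)
    escapers["CGI.escapeURIComponent"] = ->(s) { CGI.escapeURIComponent(s) }
  end

  chars.filter_map do |char|
    outputs = escapers.transform_values { |escaper| escaper.call(char) }
    [char, outputs] if outputs.values.uniq.size > 1
  end
end

printable_ascii = (0x20..0x7E).map(&:chr)
found = escaping_disagreements(printable_ascii)

found.each { |char, outputs| puts "#{char.inspect}: #{outputs.inspect}" }
puts "#{found.size} printable ASCII characters escaped differently"
```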

What about URI.escape?

In old code I encounter, I often see places using URI.escape to prepare URI query string values…

# don't do this, don't use URI.escape
url = "https://example.com?key=#{ URI.escape value }"

# not this either, don't use URI.escape
url = "https://example.com?" + 
   query_hash.collect { |k, v| "#{URI.escape k}=#{URI.escape v}"}.join("&")

This was never quite right, in that URI.escape was a huge mess… intending to let you pass in whole URLs that were not legal URLs in that they had some illegal characters that needed escaping, and it would somehow parse them and then escape the parts that needed escaping… this is a fool’s errand and not something it’s possible to do in a clear consistent and correct way.

But… it worked out okay because the output of URI.escape overlapped enough with (the new RFC 3986-based) CGI.escapeURIComponent that it mostly (or maybe even always?) worked out. URI.escape did not escape a /… but it turns out / is probably actually legal in a query string value anyway, it’s optional to escape it to %2F in a query string? I think?

And people used it in this scenario, I’d guess, because its name made it sound like the right thing? Hey, I want to escape something to put it in a URI, right? And then other people copied from code they saw, etc.

But URI.escape was an unpredictable bad idea from the start, and was deprecated by ruby, then removed entirely in ruby 3.0!

When it went away, it was a bit confusing to figure out what to replace it with. Because if you asked, sometimes people would say “it was broken and wrong, there is nothing to replace it”, which is technically true… but the code escaping things for inclusion in, eg, query strings, still had to do that… and then the “correct” behavior for this actually only existed in the ruby stdlib in the erb module (?!?) (where few had noticed it before URI.escape went away)… and CGI.escapeURIComponent which is really what you wanted didn’t exist yet?
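For what it's worth, on ruby 3.2+ the old query_hash example can be rewritten along these lines. This is just a sketch; `build_query` is a made-up helper name, and it assumes keys and values that are fine to call to_s on.

```ruby
require 'cgi'

# Hypothetical helper: escape each key and value as a URI component, then join.
def build_query(params)
  params.map { |key, value|
    "#{CGI.escapeURIComponent(key.to_s)}=#{CGI.escapeURIComponent(value.to_s)}"
  }.join("&")
end

url = "https://example.com/search?" + build_query({ "q" => "rock & roll", "page" => 2 })
puts url # https://example.com/search?q=rock%20%26%20roll&page=2
```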

Why is this so confusing and weird?

Why was this functionality in ruby stdlib non-existent/tucked away? Why are there so many slightly different implementations of “uri escaping”?

Escaping is always a confusing topic in my experience — and a very very confusing thing to debug when it goes wrong.

The long history of escaping in URLs and HTML is even more confusing. Like, turning a space into a + was specified for application/x-www-form-urlencoded format (for encoding an HTML form as a string for use as a POST body)… and people then started using it in url query strings… but I think possibly that was never legal, or perhaps the specifications were incomplete/inconsistent on it.

But it was so commonly done that most things receiving URLs would treat a literal + as an encoded space… and then some standards were retroactively changed to allow it for compatibility with common practice…. maybe. I’m not even sure I have this right.
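You can see both conventions living side by side in the ruby stdlib. A small sketch, assuming ruby 3.2+ for the URIComponent methods:

```ruby
require 'cgi'
require 'uri'

# Form-style (application/x-www-form-urlencoded) encoding: space becomes +
puts URI.encode_www_form_component("rock & roll") # rock+%26+roll

# URI-component-style encoding: space becomes %20
puts CGI.escapeURIComponent("rock & roll")        # rock%20%26%20roll

# Decoding the same input gives different answers depending on the convention:
puts URI.decode_www_form_component("a+b") # a b  (+ treated as an encoded space)
puts CGI.unescape("a+b")                  # a b  (same form-style behavior)
puts CGI.unescapeURIComponent("a+b")      # a+b  (+ left alone)
```

So the same "a+b" means two different things depending on which decoder the receiving end happens to use.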

And then, as with the history of the web in general, there has been a progression of standards slightly altering this behavior, leapfrogging with actual common practice, where technically illegal things became common and accepted, and then standards tried to cope… and real world developers had trouble understanding there might be different rules for legal characters/escaping in HTML vs URIs vs application/x-www-form-urlencoded strings vs HTTP headers…. and then language stdlib implementers (including but not limited to ruby) implemented things with various understandings according to various RFCs (or none, or buggy), documented only with words like “Escapes the string, replacing all unsafe characters with codes.” (unsafe according to what standard? For what purpose?)

PHEW.

It being so confusing, lots of people haven’t gotten it right — I swear that AWS S3 uses different rules for how to refer to spaces in filenames than AWS MediaConvert does, such that I couldn’t figure out how to get AWS MediaConvert to actually input files stored on S3 with spaces in them, and had to just make sure to not use spaces in filenames on S3 destined for MediaConvert. But maybe I was confused! But honestly I’ve found it’s best to avoid spaces in filenames on S3 in general, because S3 docs and implementation can get so confusing and maybe inconsistent/buggy on how/when/where they are escaped. Because like we’re saying…

Escaping is always confusing, and URI escaping is really confusing.

Which is I guess why the ruby stdlib didn’t actually have a clearly labelled provided-with-this-intention way to escape things for use as a URI component until ruby 3.2?

Just use CGI.escapeURIComponent in ruby 3.2+, please.

What about using the Addressable gem?

When the horrible URI.escape disappeared and people that had been wrongly using it to escape strings for use as URI components needed some replacement and the ruby stdlib was confusing (maybe they hadn’t noticed ERB::Util.url_encode or weren’t confident it did the right thing and gee I wonder why not), some people turned to the addressable gem.

This gem for dealing with URLs does provide ways to escape strings for use in URLs… it actually provides two different algorithms depending on whether you want to use something in a path component or a query component.

require 'addressable'

Addressable::URI.encode_component(query_param_value, Addressable::URI::CharacterClasses::QUERY)

Addressable::URI.encode_component(path_component, Addressable::URI::CharacterClasses::PATH)

Note Addressable::URI::CharacterClasses::QUERY vs Addressable::URI::CharacterClasses::PATH? Two different routines? (Both by the way escape a space to %20 not +).

I think that while some things need to be escaped in (eg) a path component and don’t need to be in a query component, the specs also allow some things that don’t need to be escaped to be escaped in both places, such that you can write an algorithm that produces legally escaped strings for both places, which I think is what CGI.escapeURIComponent is. Hopefully we’re in good hands.

On Addressable, neither the QUERY nor PATH variant escapes /, but CGI.escapeURIComponent does escape it to %2F. PHEW.

You can also call Addressable::URI.encode_component with no second arg, in which case it seems to escape CharacterClasses::RESERVED + CharacterClasses::UNRESERVED from this list. Whereas PATH is, it looks like there, equivalent to UNRESERVED with SOME of RESERVED (SUB_DELIMS but only some of GENERAL_DELIMS), and QUERY is just path plus ? as needing escaping…. (CGI.escapeURIComponent btw WILL escape ? to %3F).

PHEW, right?

Anyhow

Anyhow, just use CGI.escapeURIComponent to… escape your URI components, just like it says on the lid.

Thanks to /u/f9ae8221b for writing it and answering some of my probably annoying questions on reddit and github.

http://bibwild.wordpress.com/?p=10480

attr_json 2.0 release: ActiveRecord attributes backed by JSON column

jrochkind Feb 9, 2023

attr_json is a gem to provide attributes in ActiveRecord that are serialized to a JSON column, usually postgres jsonb, multiple attributes in a json hash. In a way that can be treated as much as possible like any other “ordinary” (database column) ActiveRecord attribute.

It supports arrays and nested models as hashes, and the embedded nested models can also be treated much as an ordinary “associated” record. For instance, the CI build tests it with cocoon, and I’ve had a report that it works well with stimulus nested forms, but I don’t currently know how to use those. (PR welcome for a test in build?)

An example:

# An embedded model, if desired
class LangAndValue
  include AttrJson::Model

  attr_json :lang, :string, default: "en"
  attr_json :value, :string
end

class MyModel < ActiveRecord::Base
   include AttrJson::Record

   # use any ActiveModel::Type types: string, integer, decimal (BigDecimal),
   # float, datetime, boolean.
   attr_json :my_int_array, :integer, array: true
   attr_json :my_datetime, :datetime

   attr_json :embedded_lang_and_val, LangAndValue.to_type
end

model = MyModel.create!(
  my_int_array: ["101", 2], # it'll cast like ActiveRecord
  my_datetime: DateTime.new(2001,2,3,4,5,6),
  embedded_lang_and_val: LangAndValue.new(value: "a sentence in default language english")
)

By default it will serialize attr_json attributes to a json_attributes column (this can also be specified differently), and the above would be serialized like so:

{
  "my_int_array": [101, 2],
  "my_datetime": "2001-02-03T04:05:06Z",
  "embedded_lang_and_val": {
    "lang": "en",
    "value": "a sentence in default language english"
  }
}

Oh, attr_json also supports some built-in construction of postgres jsonb contains (“@>“) queries, with proper rails type-casting, through embedded models with keypaths:

MyModel.jsonb_contains(
  my_datetime: Date.today,
  "embedded_lang_and_val.lang" => "de"
) # an ActiveRelation, you can chain on whatever as usual

And it supports in-place mutations of the nested models, which I believe is important for them to work “naturally” as ruby objects.

my_model.embedded_lang_and_val.lang = "de"
my_model.embedded_lang_and_val_change 
# => will correctly return changes in terms of models themselves
my_model.save!

There are some other gems in this “space” of ActiveRecord attribute json serialization, with different fits for different use cases, created either before or after I created attr_json — but none provide quite this combination of features — or, I think, have architectures that make this combination feasible (I could be wrong!). Some to compare are jsonb_accessor, store_attribute, and store_model.

One use case where I think attr_json really excels is when using Rails Single-Table Inheritance, where different sub-classes may have different attributes.

And especially for a “content management system” type of use case, where on top of that single-table inheritance polymorphism, you can have complex hierarchical data structures, in an inheritance hierarchy, where you don’t actually want or need the complexity of an actual normalized rdbms schema for the data that has both some polymorphism and some heterogeneity. We get some aspects of a schema-less json-document-store, but embedded in postgres, without giving up rdbms features or ordinary ActiveRecord affordances.

Slow cadence, stability and maintainability

While the 2.0 release includes a few backwards incompats, it really should be an easy upgrade for most if not everyone. And it comes three and a half years after the 1.0 release. That’s a pretty good run.

Generally, I try to really prioritize backwards compatibility and maintainability, doing my best to avoid anything that could provide backwards incompat between major releases, and trying to keep major releases infrequent. I think that’s done well here.

I know that management of rails “plugin” dependencies can end up a nightmare, and I feel good about avoiding this with attr_json.

attr_json was actually originally developed for Rails 4.2 (!!), and has kept working all the way to Rails 7. The last attr_json 1.x release actually supported (in same codebase) Rails 5.0 through Rails 7.0 (!), and attr_json 2.0 supports 6.0 through 7.0. (also grateful to the quality and stability of the rails attributes API originally created by sgrif).

I think this successfully makes maintenance easier for downstream users of attr_json, while also demonstrating success at prioritizing maintainability of attr_json itself: it hasn’t needed a whole lot of work on my end to keep working across Rails releases. Occasionally changes to the test harness are needed when a new Rails version comes out, but I actually can’t think of any changes needed to implementation itself for new Rails versions, although there may have been a few.

Because, yeah, it is true that this is still basically a one-maintainer project. But I’m pleased it has successfully gotten some traction from other users: 390 github “stars” is respectable if not huge, with occasional Issues and PR’s from third parties. I think this is a testament to its stability and reliability, rather than to any (almost non-existent) marketing I’ve done.

“Slow code”?

In working on this and other projects, I’ve come to think of a way of working on software that might be called “slow code”. To really get stability and backwards compatibility over time, one needs to be very careful about what one introduces into the codebase in the first place. And very careful about getting the fundamental architectural design of the code solid in the first place — coming up with something that is parsimonious (few architectural “concepts”) and consistent and coherent, but can handle what you will want to throw at it.

This sometimes leads me to holding back on satisfying feature requests, even if they come with pull requests, even if it seems like “not that much code” — if I’m not confident it can fit into the architecture in a consistent way. It’s a trade-off.

I realize that in many contemporary software development environments, it’s not always possible to work this way. I think it’s a kind of software craftsmanship for shared “library” code (mostly open source) that… I’m not sure how much our field/industry accommodates development with (and the development of) this kind of craftsmanship these days. I appreciate working for a non-profit academic institute that lets me develop open source code in a context where I am given the space to attend to it with this kind of care.

The 2.0 Release

There aren’t actually any huge changes in the 2.0 release, mostly it just keeps on keeping on.

Mostly, 2.0 tries to make things adhere even closer and more consistently to what is expected of Rails attributes.

The “Attributes” API was still brand new in Rails 4.2 when this project started, but now that it has shown itself solid and mature, we can always create a “cover” Rails attribute in the ActiveRecord model, instead of making it “optional” as attr_json originally did. Which provides for some code simplification.

Some rough edges that were sanded involved making Time/Date attributes timezone-aware in the way Rails usually does transparently. And with some underlying Rails bugs/inconsistencies having been long-fixed in Rails, they can now store milliseconds in JSON serialization rather than just whole seconds too.
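To illustrate why the precision matters, here is a sketch using only the ruby stdlib Time class. This is not attr_json's own code, just a demonstration of what is lost when a time is serialized to whole seconds (the helper `survives_iso8601?` is made up for the example).

```ruby
require 'time'

# Hypothetical helper: does a time survive an ISO 8601 round-trip at this precision?
def survives_iso8601?(time, fraction_digits)
  Time.iso8601(time.iso8601(fraction_digits)) == time
end

# The last argument to Time.utc is microseconds, so this is 6.789 seconds.
moment = Time.utc(2001, 2, 3, 4, 5, 6, 789_000)

puts moment.iso8601    # 2001-02-03T04:05:06Z
puts moment.iso8601(3) # 2001-02-03T04:05:06.789Z

puts survives_iso8601?(moment, 0) # false: the .789 is dropped
puts survives_iso8601?(moment, 3) # true
```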

I try to keep a good CHANGELOG, which you can consult for more.

The 2.0 release is expected to be a very easy migration for anyone on 1.x. If anyone on 1.x finds it challenging, please get in touch in a github issue or discussion, I’d like to make it easier for you if I can.

For my Library-Archives-Museums Rails people….

The original motivation for this came from trying to move off samvera (nee hydra) sufia/hyrax to an architecture that was more “Rails-like”, but realizing that the way we wanted to model our data in a digital collections app along the lines of sufia/hyrax would be rather too complicated to do with a reasonably normalized rdbms schema.

So… can we model things in the database in JSON — similar to how valkyrie-postgres would actually model things in postgres — but while maintaining an otherwise “Rails-like” development architecture? The answer: attr_json.

So, you could say the main original use case for attr_json was to persist a “PCDM“-ish data model ala sufia/hyrax, those kinds of use cases, in an rdbms, in a way that supported performant SQL queries (minimal queries per page, avoiding n+1 queries), in a Rails app using standard Rails tools and conventions, without an enormously complex expansive normalized rdbms schema.

While the effort to base hyrax on valkyrie is still ongoing, in order to allow postgres vs fedora (vs other possible future stores) to be a swappable choice in the same architecture — I know at least some institutions (like those of the original valkyrie authors) are using valkyrie in homegrown app directly, as the main persistence API (instead of ActiveRecord).

In some sense, valkyrie-postgres (in a custom app) vs attr-json (in a custom app) are two paths to “step off” the hyrax-fedora architecture. They both result in similar things actually stored in your rdbms (and we both chose postgres, for similar reasons, including I think good support for json(b)). They both have advantages and disadvantages. Valkyrie-postgres kind of intentionally chooses not to use ActiveRecord (at least not in controllers/views etc, not in your business logic); one advantage of this is getting around some of the known, widely-commented-upon deficiencies and complaints with Rails standard ActiveRecord architecture.

Whereas I followed a different path with attr_json — how can we store things in postgres similarly, but while still using ActiveRecord in a very standard Rails way — how can we make it as standard a Rails way as possible? This maintains the disadvantages people sometimes complain about Rails architecture, but with the benefit of sticking to the standard Rails ecosystem, having less “custom community” stuff to maintain or figure out (including fewer lines of code in attr-json), being more familiar or accessible to Rails-experienced or trained developers.

At least that’s the idea, and several years later, I think it’s still working out pretty well.

In addition to attr_json, I wrote a layer on top to provide some parts on top of attr_json, that I thought would be both common and somewhat tricky in writing a pcdm/hyrax-ish digital collections app as “standard Rails as much as it makes sense”. This is kithe and it hasn’t had very much uptake. The only other user I’m aware of (who is using only a portion of what kithe provides; but kithe means to provide for that as a use case) is Eric Larson at https://github.com/geobtaa/geomg.

However, meanwhile, attr_json itself has gotten quite a bit more uptake — from wider Rails developer community, not our library-museum-archives community. attr_json’s 390 github stars isn’t that big in the wider world of things, but it’s pretty big for our corner of the world. (Compare to 160 for hyrax or 721 for blacklight). That the people using attr_json, and submitting Issues or Pull Requests largely aren’t library-museum-archives developers, I consider positive and encouraging, that it’s escaped the cultural-heritage-rails bubble, and is meeting a more domain-independent or domain-neutral need, at a lower level of architecture, with a broader potential community.

http://bibwild.wordpress.com/?p=10343

A tiny donation to rubyland.news would mean a lot

jrochkind Dec 23, 2022

I started rubyland.news in 2016 because it was a thing I wanted to see for the ruby community. I had been feeling a shrinking of the ruby open source collaborative community, it felt like the room was emptying out.

If you find value in Rubyland News, just a few dollars contribution on my Github Sponsors page would be so appreciated.

I’ve been solely responsible for its development, and editorial and technical operations. I think it’s been a success. I don’t have analytics, but it seems to be somewhat known and used.

Rubyland.news has never been a commercial project. I have never tried to “monetize” it. I don’t even really highlight my personal involvement much. I have in the past occasionally had modest paid sponsorship barely enough to cover expenses, but decided it wasn’t worth the effort.

I have never provided, and never would provide, any kind of paid content placement, because I think that would be counter to my aims and values. I have had offers, specifically asking for paid placement not labelled as such, because apparently this is how the world works now, but I would consider that an unethical violation of trust.

It’s purely a labor of love, in attempted service to the ruby community, building what I want to see in the world as an offering of mutual aid.

So why am I asking for money?

The operations of Rubyland News don’t cost much, but they do cost something. A bit more since Heroku eliminated free dynos.

I currently pay for it out of my pocket, and mostly always have modulo occasional periods of tiny sponsorship. My pockets are doing just fine, but I do work for an academic non-profit, so despite being a software engineer the modest expenses are noticeable.

Sure, I could run it somewhere cheaper than heroku (and eventually might have to) — but I’m doing all this in my spare time, I don’t want to spend an iota more time or psychic energy on (to me) boring operational concerns than I need to. (But if you want to volunteer to take care of setting up, managing, and paying for deployment and operations on another platform, get in touch! Or if you are another platform that wants to host rubyland news for free!)

It would be nice to not have to pay for Rubyland News out of my pocket. But also, some donations would, as much as be monetarily helpful, also help motivate me to keep putting energy into this, showing me that the project really does have value to the community.

I’m not looking to make serious cash here. If I were able to get just $20-$40/month in donations, that would about pay my expenses (after taxes, cause I’d declare if I were getting that much), and I’d be overjoyed. Even 5 monthly sustainers at just $1 would really mean a lot to me, as a demonstration of support.

You can donate one-time or monthly on my Github Sponsors page. The suggested levels are $1 and $5.

(If you don’t want to donate or can’t spare the cash, but do want to send me an email telling me about your use of rubyland news, I would love that too! I really don’t get much feedback! jonathan at rubyland.news)

Thanks

Thanks to anyone who donates anything at all
Also to anyone who sends me a note to tell me that they value Rubyland News (seriously, I get virtually no feedback; telling me things you’d like to be better/different is seriously appreciated too! Or things you like about how it is now. I do this to serve the community, and appreciate feedback and suggestions!)
To anyone who reads Rubyland News at all
To anyone who blogs about ruby, especially if you have an RSS feed, especially if you are doing it as a hobbyist/community-member for purposes other than business leads!
To my current single monthly github sponsor, for $1, who shall remain unnamed because they listed their sponsorship as private
To anyone contributing in their own way to any part of open source communities for reasons other than profit, sometimes without much recognition, to help create free culture that isn’t just about exploiting each other!

http://bibwild.wordpress.com/?p=9795


vite-ruby for JS/CSS asset management in Rails

jrochkind Nov 29, 2022

I recently switched to vite and vite-ruby for managing my JS and CSS assets in Rails. I was switching from a combination of Webpacker and sprockets, and I moved all of my Webpacker and most of my sprockets to vite.


Note that vite-ruby has smooth ready-made integrations for Padrino, Hanami, and jekyll too, and possibly hook points for integrations with arbitrary ruby, plus could always just use vite without vite-ruby — but I’m using vite-ruby with Rails.

I am finding it generally pretty agreeable, so I thought I’d write up some of the things I like about it for others. And a few other notes.

I am definitely definitely not an expert in Javascript build systems (or JS generally), which both defines me as an audience for build tools, but also means I don’t always know how these things might compare with other options. The main other option I was considering was jsbundling-rails with esbuild and cssbundling-rails with SASS, but I didn’t get very far into the weeds of checking those out.

I moved almost all my JS and (S)CSS into being managed/built by vite.

My context

I work on a monolith “full stack” Rails application, with a small two-developer team.

I do not do any very fancy Javascript — this is not React or Vue or anything like that. It’s honestly pretty much “JQuery-style” (although increasingly I try to do it without jquery itself using just native browser API, it’s still pretty much that style).

Nonetheless, I have accumulated non-trivial Javascript/NPM dependencies, including things like video.js, @shopify/draggable, fontawesome (v4), openseadragon. I need package management and I need building.

I also need something dirt simple. I don’t really know what I’m doing with JS, my stack may seem really old-fashioned, but here it is. Webpacker had always been a pain, I started using it to have something to manage and build NPM packages, but was still mid-stream in trying to switch all my sprockets JS over to webpacker when it was announced webpacker was no longer recommended/maintained by Rails. My CSS was still in sprockets all along.

Vite

One thing to know about vite is that it’s based on the idea of using different methods in dev vs production to build/serve your JS (and other managed assets). In “dev”, you ordinarily run a “vite server” which serves individual JS files, whereas for production you “build” more combined files.

Vite is basically an integration that puts together tools like esbuild and (in production) rollup, as well as integrating optional components like sass — making them all just work. It intends to be simple and provide a really good developer experience where doing simple best practice things is simple and needs little configuration.

vite-ruby tries to make that “just works” developer experience as good as Rubyists expect when used with ruby too — it intends to integrate with Rails as well as webpacker did, just doing the right thing for Rails.

Things I am enjoying with vite-ruby and Rails

You don’t need to run a dev server (like you do with jsbundling-rails and cssbundling-rails)
- If you don’t run the vite dev server, you’ll wind up with auto-built vite on-demand as needed, same as webpacker basically did.
- This can be slow, but it works and is awesome for things like CI without having to configure or set up anything. If there have been no changes to your source, it is not slow, as it doesn’t need to re-build.
- If you do want to run the dev server for much faster build times, hot module reload, better error messages, etc, vite-ruby makes it easy, just run ./bin/vite dev in a terminal.
If you DO run the dev server — you have only ONE dev-server to run, that will handle both JS and CSS
- I’m honestly really trying to avoid the foreman approach taken by jsbundling-rails/cssbundling-rails, because of how it makes accessing the interactive debugger at a breakpoint much more complicated. Maybe with only one dev server (that is optional), I can handle running it manually without a procfile.

Handling SASS and other CSS with the same tool as JS is pretty great generally — you can even @import CSS from a javascript file, and also @import plain CSS too to aggregate into a single file server-side (without sass). With no non-default configuration, it just works, and will spit out stylesheet <link> tags, and it means your css/sass is going through the same processing whether you import it from .js or .css.
- I handle fontawesome 4 this way. Include "font-awesome": "^4.7.0" in my package.json, then @import "font-awesome/css/font-awesome.css"; just works, and from either a .js or a .css file. It actually spits out not only the fontawesome CSS file, but also all the font files referenced from it and included in the npm package, in a way that just works. Amazing!!
- Note how you can reference things from NPM packages with just package name. On google for some tools you find people doing contortions involving specifically referencing node-modules, I’m not sure if you really have to do this with latest versions of other tools but you def don’t with vite, it just works.

In general, I really appreciate vite’s clear opinionated guidance and focus on developer experience. Understanding all the options from the docs is not as hard because there are fewer options, but it does everything I need it to. vite-ruby successfully carries this into ruby/Rails; its documentation is really good, without being enormous. In Rails, it just does what you want, automatically.

Vite supports source maps for SASS!
- Not currently on by default, you have to add a simple config.
- Unfortunately sass sourcemaps are NOT supported in production build mode, only in dev server mode. (I think I found a ticket for this, but can’t find it now)
- But that’s still better than the official Rails options? I don’t understand how anyone develops SCSS without sourcemaps!
  - But even though sprockets 4.x finally supported JS sourcemaps, it does not work for SCSS! Even though there is an 18-month-old PR to fix it, it goes unreviewed by Rails core and unmerged.
  - Possibly even more surprisingly, SASS sourcemaps don’t seem to work for the newer cssbundling-rails=>sass solution either. https://github.com/rails/cssbundling-rails/issues/68
  - Previous to this switch, I was still using sprockets old-style “comments injected into CSS built files with original source file/line number” — that worked. But to give that up, and not get working scss sourcemaps in return? I think that would have been a blocker for me against cssbundling-rails/sass anyway… I feel like there’s something I’m missing, because I don’t understand how anyone is developing sass that way.

If you want to split up your js into several built files (“chunks”), I love how easy it is. It just works. Vite/rollup will do it for you automatically for any dynamic runtime imports, which it also supports: just write import with parens, inside a callback or whatever, and it just works.

Things to be aware of

vite and vite-ruby by default will not create .gz variants of built JS and CSS
- Depending on your deploy environment, this may not matter, maybe you have a CDN or nginx that will automatically create a gzip and cache it.
- But in e.g. a default heroku Rails deploy, it really really does. Default Heroku deploy uses the Rails app itself to deliver your assets. The Rails app will deliver content-encoding gzip if the .gz file is there. If it’s not… when you switch to vite from webpacker/sprockets, you may now be delivering uncompressed JS and CSS with no other changes to your environment, with non-trivial performance implications, but ones you may not notice.
- Yeah, you could probably configure your CDN you hopefully have in front of your heroku app static assets to gzip for you, but you may not have noticed.
- Fortunately it’s pretty easy to configure
  - For me, I do some kind of ugly JS to configure it only when I’m not using dev-mode autoBuild (in dev but without running a vite dev server), because it really slows down autoBuild
  - Since I migrated over, the vite-plugin-rails plugin also does it by default. (I’m not using that, actually)
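
Since missing .gz variants are easy to overlook, here is a minimal sketch of a check you could run after a production build. The method name assets_missing_gzip is hypothetical, and the build directory you pass in is an assumption (vite-ruby's default output is, I believe, public/vite, but check your own config). It only looks for sibling ".gz" files on disk and says nothing about what your CDN or web server does.

```ruby
# Hypothetical helper, not part of vite-ruby: list built .js/.css files under
# `dir` that have no sibling "<file>.gz" next to them.
def assets_missing_gzip(dir, extensions: %w[.js .css])
  Dir.glob(File.join(dir, "**", "*")).select do |path|
    File.file?(path) &&
      extensions.include?(File.extname(path)) &&
      !File.exist?("#{path}.gz")
  end.sort
end
```

Something like `puts assets_missing_gzip("public/vite")` after building would then print any assets that the Rails app would have to serve uncompressed.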

There are some vite NPM packages involved (vite itself as well as some vite-ruby plugins), as well as the vite-ruby gem, and you have to keep them up to date in sync. You don’t want to be using a new version of vite NPM packages with too-old gem, or vice versa. (This is kind of a challenge in general with ruby gems with accompanying npm packages)
- But vite_ruby actually includes a utility to check this on boot and complain if they’ve gotten out of sync! As well as tools for syncing them! Sweet!
- But that can be a bit confusing sometimes if you’re running CI after an accidentally-out-of-sync upgrade, and all your tests are now failing with the failed sync check. But no big deal.

Things I like less

vite-ruby itself doesn’t seem to have a CHANGELOG or release notes, which I don’t love.
Vite is a newer tool written for modern JS, it mostly does not support CommonJS/node require, preferring modern import. In some cases that I can’t totally explain require in dependencies seems to work anyway… but something related to this stuff made it apparently impossible for me to import an old not-very-maintained dependency I had been importing fine in Webpacker. (I don’t know how it would have done with jsbundling-rails/esbuild). So all is not roses.

Am I worried that this is a third-party integration not blessed by Rails?

The vite-ruby maintainer ElMassimo is doing an amazing job. It is currently very well-maintained software, with frequent releases, quick turnaround from bug report to release, and ElMassimo is very responsive in github discussions.

But it looks like it is just one person maintaining. We know how open source goes. Am I worried that in the future some release of Rails might break vite-ruby in some way, and there won’t be a maintainer to fix it?

I mean… a bit? But let’s face it… Rails officially blessed solutions haven’t seemed very well-maintained for years now either! The three year gap of abandonware between the first sprockets 4.x beta and final release, followed by more radio silence? The fact that for a couple years before webpacker was officially retired it seemed to be getting no maintenance, including requiring dependency versions with CVEs that just stayed that way? Not much documentation (i.e. Rails Guide) support for webpacker ever, or jsbundling-rails still?

One would think it might be a new leaf with css/jsbundling-rails… but I am still baffled by there being no support for sass sourcemaps in cssbundling-rails and sass! Official rails support doesn’t necessarily get you much “just works” DX when it comes to asset handling for years now.

Let’s face it, this has been an area where being in the Rails github org and/or being blessed by Rails docs has been no particular reason to expect maintenance or expect you won’t have problems down the line anyway. It’s open source, nobody owes you anything, maintainers spend time on what they have interest to spend time on (including time to review/merge/maintain others’ PRs, which is def non-trivial time!). It just is what it is.

While the vite-ruby code provides a pretty great Rails-integrated DX, it’s also actually mostly pretty simple code, especially when it comes to the Rails touch points most at risk of breaking with a new Rails release; it’s not doing anything too convoluted.

So, you know, you take your chances, I feel good about my chances compared to a css/jsbundling-rails solution. And if someday I have to switch things over again, oh well — Rails just pulled webpacker out from under us quicker than expected too, so you take your chances regardless!

(thanks to colleague Anna Headley for first suggesting we take a look at vite in Rails!)

http://bibwild.wordpress.com/?p=9845


Using engine_cart with Rails 6.1 and Ruby 3.1

jrochkind Jun 13, 2022

Rails does not seem to generally advertise ruby version compatibility, but it seems to be the case that Rails 6.1, I believe, works with Ruby 3.1, as long as you manually add three dependencies to your Gemfile.


gem "net-imap"
gem "net-pop"
gem "net-smtp"

(Here’s a somewhat cryptic gist from one (I think) Rails committer with some background. Although it doesn’t specifically and clearly tell you to add these dependencies for Rails 6.1 and ruby 3.1… it won’t work unless you do. You can find other discussion of this on the net.)

Or you can instead add one line to your Gemfile, opting in to using the pre-release mail gem 2.8.0.rc1, which includes these dependencies for ruby 3.1 compatibility. Mail is already a Rails dependency; but pre-release gems (whose version numbers end in something including letters after a third period) won’t be included by bundler unless you mention a pre-release version (whose version number ends in…) explicitly in Gemfile.

gem "mail", ">= 2.8.0.rc1"
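
To make the pre-release rule a little more concrete, here is a small sketch using RubyGems' own Gem::Version and Gem::Requirement classes. The constant names are just for illustration. As far as I understand it, whether a requirement itself mentions a pre-release version is the signal that lets pre-release gems be considered at all, but this snippet only shows the version arithmetic, not Bundler's resolver.

```ruby
require "rubygems" # normally already loaded; explicit here for clarity

MAIL_RC    = Gem::Version.new("2.8.0.rc1")
MAIL_FINAL = Gem::Version.new("2.8.0")

# A letter in any segment marks a version as pre-release.
MAIL_RC.prerelease?    # => true
MAIL_FINAL.prerelease? # => false

# Pre-releases sort before the final release they lead up to.
MAIL_RC < MAIL_FINAL   # => true

# A requirement counts as "pre-release" only if it names one explicitly.
OPT_IN  = Gem::Requirement.new(">= 2.8.0.rc1")
DEFAULT = Gem::Requirement.new("~> 2.7")

OPT_IN.prerelease?     # => true
DEFAULT.prerelease?    # => false

# The opt-in requirement is also satisfied by the eventual final release.
OPT_IN.satisfied_by?(MAIL_FINAL) # => true
```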

Once mail 2.8.0 final is released, if I understand what’s going on right, you won’t need to do any of this. Since it won’t be a pre-release version, bundler will just use it when you run bundle update on a Rails app, it expresses the dependencies you need for ruby 3.1, and Rails 6.1 will Just Work with ruby 3.1. Phew! I hope it gets released soon (it’s been about 7 weeks since 2.8.0.rc1).

Engine cart

Engine_cart is a gem for dynamically creating Rails apps at runtime for use in CI build systems, mainly to test Rails engine gems. It’s in use in some collaborative open source communities I participate in. It has plusses (actually integration testing real app generation) and minuses (kind of a maintenance nightmare, it turns out), and I don’t generally recommend it. If you haven’t heard of it before and are wondering “Does jrochkind think I should use this for testing engine gems in general?”, this is not an endorsement. In general it can add a lot of pain.

But it’s in use in some projects I sometimes help maintain.

How do you get a build using engine_cart to successfully test under Rails 6.1 and ruby 3.1? Since if it were “manual” you’d have to add a line to a Gemfile…

It turns out you can create a ./spec/test_app_templates/Gemfile.extra file, with the necessary extra gem calls:

gem "net-imap"
gem "net-pop"
gem "net-smtp"

# OR, above OR below, don't need both

gem "mail", ">= 2.8.0.rc1"

I think ./spec/test_app_templates/Gemfile.extra is a “magic path” used by engine_cart… or if the app I’m working on is setting it, I can’t figure out why/how! But I also can’t quite figure out why/if engine_cart is defaulting to it…
Adding this to your main project Gemfile is not sufficient, it needs to be in Gemfile.extra
Some projects I’ve seen have a line in their Gemfile using eval_gemfile and referencing the Gemfile.extra… which I don’t really understand… and does not seem to be necessary to me… I think maybe it’s leftover from past versions of engine_cart best practices?
To be honest, I don’t really understand how/where the Gemfile.extra is coming in, and I haven’t found any documentation for it in engine_cart. So if this doesn’t work for you… you probably just haven’t properly configured engine_cart to use the Gemfile.extra in that location, which the project I’m working on has done in some way?

Note that you may still get an error produced in build output at some point of generating the test app:

run  bundle binstubs bundler
rails  webpacker:install
You don't have net-smtp installed in your application. Please add it to your Gemfile and run bundle install
rails aborted!
LoadError: cannot load such file -- net/smtp

But it seems to continue and work anyway!

None of this should be necessary when mail 2.8.0 final is released, it should just work!

The above is of course always including those extra dependencies, for all builds in your matrix, when they are only necessary for Rails 6.1 (not 7!) and ruby 3.1. If you’d instead like to guard it to only apply for that build, and your app is using the RAILS_VERSION env variable convention, this seems to work:

# ./spec/test_app_templates/Gemfile.extra
#
# Only necessary until mail 2.8.0 is released, allow us to build with engine_cart
# under Rails 6.1 and ruby 3.1, by opting into using pre-release version of mail
# 2.8.0.rc1
#
# https://github.com/mikel/mail/pull/1472

if ENV['RAILS_VERSION'] && ENV['RAILS_VERSION'] =~ /^6\.1\./ && RUBY_VERSION =~ /^3\.1\./
  gem "mail", ">= 2.8.0.rc1"
end
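
If the regex matching feels fragile, the same guard can be sketched with RubyGems version comparisons instead. This is only an illustration under assumptions: needs_mail_prerelease? is a hypothetical helper name, and it assumes RAILS_VERSION holds a plain version number like "6.1.6". Note it is slightly looser than the regex above, since a bare "6.1" also counts as Rails 6.1.

```ruby
# Hypothetical helper (not part of engine_cart or Bundler): does the
# Rails 6.1 + ruby 3.1 workaround apply to this build?
def needs_mail_prerelease?(rails_version, ruby_version = RUBY_VERSION)
  # nil, or values that aren't plain version numbers (e.g. "~> 6.1", "main"),
  # never trigger the workaround.
  return false if rails_version.nil? || !Gem::Version.correct?(rails_version)

  rails_61 = Gem::Requirement.new("~> 6.1.0") # >= 6.1.0 and < 6.2
  ruby_31  = Gem::Requirement.new("~> 3.1.0") # >= 3.1.0 and < 3.2

  rails_61.satisfied_by?(Gem::Version.new(rails_version)) &&
    ruby_31.satisfied_by?(Gem::Version.new(ruby_version))
end
```

The Gemfile.extra line would then presumably read `gem "mail", ">= 2.8.0.rc1" if needs_mail_prerelease?(ENV["RAILS_VERSION"])`, though I have not verified how engine_cart evaluates that file.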

http://bibwild.wordpress.com/?p=9703
