
Parse huge XML files quick with Rust + Serde + quick-xml

capnfabs.net

Recently, I wondered whether songs were increasingly being released with entirely lowercase titles. I ended up using MusicBrainz’ library as a data source, which comes packaged as a Postgres database, but the data source I considered initially was one of Discogs’ monthly data dumps, which are made available for download as a set of gzipped XML files. The file I was interested in, the releases dataset, is 11.62 GB gzipped and 74 GB once decompressed. I wanted to iterate through every record and check (a) entry quality and (b) whether the track title was lowercase. Normally I’d use some kind of serialization framework, gesture at the shape of the data, and then tell the framework to deserialize the whole object, but that’s not an option when the file you’re working with is several times the size of your computer’s RAM.
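To make the constraint concrete, here is a rough sketch of the streaming idea: read the input one small piece at a time and keep only counters in memory, rather than deserializing the whole document. It uses only the Rust standard library and a deliberately naive tag scanner, so it is not the Serde + quick-xml approach named in the title. The function names `count_titles` and `is_all_lowercase` are hypothetical, and the sketch assumes titles appear as plain `<title>…</title>` elements with no attributes, CDATA, or entity decoding.

```rust
use std::io::{self, BufRead};

/// True if the title has at least one cased letter and no uppercase letters.
fn is_all_lowercase(title: &str) -> bool {
    let mut saw_lowercase = false;
    for c in title.chars() {
        if c.is_uppercase() {
            return false;
        }
        if c.is_lowercase() {
            saw_lowercase = true;
        }
    }
    saw_lowercase
}

/// Streams through `reader` and returns (total titles, all-lowercase titles).
/// Memory use is bounded by the longest run of bytes between two '>' characters,
/// not by the size of the whole file.
/// Assumption: titles look like `<title>text</title>` (hypothetical simplification).
fn count_titles<R: BufRead>(mut reader: R) -> io::Result<(u64, u64)> {
    let mut chunk = Vec::new();
    let mut title = String::new();
    let mut in_title = false;
    let (mut total, mut lowercase) = (0u64, 0u64);

    loop {
        chunk.clear();
        // Read up to and including the next '>' (or to end of input).
        if reader.read_until(b'>', &mut chunk)? == 0 {
            break;
        }
        let piece = String::from_utf8_lossy(&chunk);
        // Split the piece into leading text and the trailing tag, if any.
        let pos = piece.rfind('<').unwrap_or(piece.len());
        let (text, tag) = piece.split_at(pos);

        if in_title {
            if tag == "</title>" {
                title.push_str(text);
                total += 1;
                if is_all_lowercase(&title) {
                    lowercase += 1;
                }
                title.clear();
                in_title = false;
            } else {
                // A literal '>' or nested markup inside the title: keep it raw.
                title.push_str(&piece);
            }
        } else if tag == "<title>" {
            in_title = true;
        }
    }
    Ok((total, lowercase))
}
```

Because `count_titles` accepts any `BufRead`, it could be pointed at a `std::io::BufReader` wrapping the decompressed dump, or at a gzip decoder from a separate crate. A real XML parser is still the better choice for anything beyond a rough count, since this scanner ignores entities, attributes, and CDATA sections.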
