15+ years later, Microsoft morged my diagram
How Microsoft continvoucly morged my Git branching diagram.
A few days ago, people started tagging me on Bluesky and Hacker News about a diagram on Microsoft's Learn portal. It looked... familiar.

In 2010, I wrote A successful Git branching model and created a diagram to go with it. I designed that diagram in Apple Keynote, at the time obsessing over the colors, the curves, and the layout until it clearly communicated how branches relate to each other over time. I also published the source file so others could build on it. That diagram has since spread everywhere: in books, talks, blog posts, team wikis, and YouTube videos. I never minded. That was the whole point: sharing knowledge and letting the internet take it by storm!

What I did not expect was for Microsoft, a trillion-dollar company, some 15+ years later, to apparently run it through an AI image generator and publish the result on their official Learn portal, without any credit or link back to the original.

Close-up of the "continvoucly morged" text

The AI rip-off was not just ugly. It was careless, blatantly amateurish, and lacking any ambition, to put it gently. Microsoft unworthy. The carefully crafted visual language and layout of the original, the branch colors, the lane design, the dot and bubble alignment that made the original so readable: all of it had been muddled into a laughable form. Proper AI slop.

Arrows missing and pointing in the wrong direction, and the obvious "continvoucly morged" text quickly gave it away as a cheap AI artifact.

It had the rough shape of my diagram though. Enough actually so that people recognized the original in it and started calling Microsoft out on it and reaching out to me. That so many people were upset about this was really nice, honestly. That, and "continvoucly morged" was a very fun meme—thank you, internet! 😄

Oh god yes, Microsoft continvoucly morged my diagram there for sure 😬

Vincent Driessen (@nvie.com) 2026-02-16T20:55:54.762Z

Other than that, I find this whole thing mostly very saddening. Not because some company used my diagram. As I said, it's been everywhere for 15 years and I've always been fine with that. What's dispiriting is the (lack of) process and care: take someone's carefully crafted work, run it through a machine to wash off the fingerprints, and ship it as your own. This isn't a case of being inspired by something and building on it. It's the opposite of that. It's taking something that worked and making it worse. Is there even a goal here beyond "generating content"?

What's slightly worrying me is that this time around, the diagram was both well-known enough and obviously AI-slop-y enough that it was easy to spot as plagiarism. But we all know there will be more and more content like this that isn't as well-known, or that will soon be mutated or disguised in such advanced ways that the plagiarism will no longer be recognizable as such.

I don't need much here. A simple link back and attribution to the original article would be a good start. I would also be interested in understanding how this Learn page at Microsoft came to be, what the goals were here, and what the process has been that led to the creation of this ugly asset, and how there seemingly has not been any form of proof-reading for a document used as a learning resource by many developers.

Till next 'tim'.

https://nvie.com/posts/15-years-later/
Why .every() on an empty list is true

We can all pretty intuitively understand how .some() and .every() predicate expressions on lists work.

Let's define a list of Simpsons:

const simpsons = [
  { name: 'Homer', age: 39 },
  { name: 'Marge', age: 37 },
  { name: 'Bart', age: 13 },
  { name: 'Lisa', age: 11 },
  { name: 'Maggie', age: 4 },
];

And a predicate to tell if someone is an adult:

function isAdult(simpson) {
  return simpson.age >= 18;
}

Then, these statements are pretty obvious:

simpsons.some(isAdult);   // => true, because Homer and Marge are adults
simpsons.every(isAdult);  // => false, because not everyone is an adult

But it's not immediately obvious why these are the defaults for an empty list:

[].some(isAdult);   // => false
[].every(isAdult);  // => true

Every person in the empty list is an adult? That sounds weird when you say it.

It doesn't matter what predicate you pass in here though, it will not affect the result. In other words, every person in the empty list is also NOT an adult 😉

One intuition that can help with this is to think of .some() and .every() as chains of logical "or" and "and" expressions:

// .some() is a chain of "or"s
isAdult(homer) || isAdult(marge) || isAdult(bart) || ...

// .every() is a chain of "and"s
isAdult(homer) && isAdult(marge) && isAdult(bart) && ...

You can tack a constant to the beginning of these chains in a way that keeps them logically equivalent:

false || isAdult(homer) || isAdult(marge) || isAdult(bart) || ...
true  && isAdult(homer) && isAdult(marge) && isAdult(bart) && ...

And that's why those are the defaults for the empty list!
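The same reasoning can be checked mechanically. The sketch below uses Python rather than JavaScript (a choice made purely for illustration; Python's any() and all() built-ins are the counterparts of .some() and .every()). The helper names some, every, and is_adult are hypothetical. Each helper folds the chain starting from its neutral constant, so an empty list simply returns that constant:

```python
from functools import reduce

def some(predicate, items):
    # A chain of "or"s, seeded with False (the neutral element of "or")
    return reduce(lambda acc, item: acc or predicate(item), items, False)

def every(predicate, items):
    # A chain of "and"s, seeded with True (the neutral element of "and")
    return reduce(lambda acc, item: acc and predicate(item), items, True)

def is_adult(age):
    return age >= 18

ages = [39, 37, 13, 11, 4]
print(some(is_adult, ages), every(is_adult, ages))  # True False
print(some(is_adult, []), every(is_adult, []))      # False True
```

With nothing to fold over, the seed value is all that remains, which is exactly the "default" for the empty list.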

https://nvie.com/posts/why-every-on-an-empty-list-is-true/
Git power tools for daily use

Every developer has their own favorite Git tricks they use daily. Here are some of my favorite ones I have been using for as long as I can remember.

First of all, I should mention that most of these commands are bundled in my git-toolbelt project. If you'd like to use them, all you need to do is install it like so:

$ brew install nvie/tap/git-toolbelt

Quickly opening modified files

While working on a branch, I often find the need to re-open the files I was working on. The Git toolbelt project contains a command to show you all locally modified files. It will only report files that still exist locally, so this overview won't include deleted files.

$ git modified
controllers/foo.py
README.md

This is super useful to quickly open all locally modified files in your editor. Definitely one of my most-used commands throughout the day:

$ vim $(git modified)

After quitting your editor, you can easily re-open the files you're working on this way.

To also include any files modified in the index (files that are git add'ed already), use the -i flag:

$ git modified -i

You can also pass it a commit (a SHA or any other reference, like HEAD~1), which will list all files modified in that commit:

$ git modified HEAD~1

I have the following aliases set up in my shell, for quickly opening a specific set of files:

  • vc: vim locally modified files (not indexed)
  • vca: vim all locally modified files (including the ones indexed)
  • vch: vim files modified in the last commit (HEAD)
  • vc HEAD~1: vim all files modified in the second-last commit

Fixing up the last commit

You're probably familiar with git commit --amend to incorporate the currently-staged changes into the last commit, effectively rewriting the last commit. The toolbelt offers a similar command called git fixup, which will do the same, but without prompting for the commit message. So it's like a quicker version of commit --amend.

$ git fixup

This is a great way to build up a commit incrementally. A very typical flow for me looks like:

$ git add -p    # Pick bits to commit
$ git commit
$ git add -p    # Pick more bits
$ git fixup     # Add those to the last commit

"Emptying" the last commit

Sometimes I make a mistake and I accidentally commit too much, or something I didn't intend to commit. For example, an extra file I accidentally added, or a patch within a file that I didn't want to include. Here's how I fix that:

$ git delouse

This "empties" the last commit. Think of "emptying" as keeping the commit message and the author/date info, but "moving" all of its changes back into the work tree.

Technically:

  • Soft-resets the last commit, which means it will remove the last commit from the branch, and put back the contents of that commit into the work tree (basically reverting to the state right before committing). File contents aren't changed by this, only the Git commit disappears.
  • Adds an empty commit with the same commit message and author details as the commit that was just removed.

The net result of these actions is that it appears as if the last commit on the branch is "emptied" back into your work tree. This command is non-destructive, since all of your files remain untouched. They are now just local changes again.

This allows re-adding all changes again. Just use git add -p to select the bits to commit, and then git fixup (see previous section) to keep changing the last commit, effectively rebuilding it up from scratch.

Because git delouse kept the commit message and author information around in that empty commit, the original commit info is never lost, and you don't have to re-enter the commit message whenever you run git fixup, which makes this whole process super cheap.

Typical flow:

$ git commit -m 'Add login screen'

# Oops! Checked in a secret key with that... let's fix this mistake!
$ git delouse

# Retry adding stuff
$ git add -p   # This time, don't add the secret key
$ git fixup    # Rewrites the previous commit

And if you make a mistake, you can just run git delouse again and start over, as often as you want. Since none of these commands destroy your local changes, this allows you to carefully craft your commit contents without the risk of losing any data.

Splitting up a commit into pieces

It's also a great way to split up a commit! For example, suppose you are adding a bugfix but you also renamed a variable to have clearer meaning. When submitting the code for review, you realize that the variable rename adds a lot of noise to the actual change. You may then decide it's a good idea to split up this commit into two pieces: one that atomically just changes the variable name everywhere, and one that fixes the bug. You can then point to the bugfix commit when asking for a code review.

How would that work?

$ git commit -m 'Bugfix for login screen'

# Oops, I should've split this one up. Let's start over!
$ git delouse
$ git add -p     # Just pick the bugfix bits
$ git fixup
$ git add -p     # Now pick the var rename bits
$ git commit -m 'Rename variable name to be clearer'

These three commands have become indispensable in my day-to-day Git routine. If you like them, let me know!

https://nvie.com/posts/git-power-tools/
An intro to decoders
Today, I’m thrilled to publicly announce a new open source project that we’ve been using at SimpleContacts in production for months: decoders (https://www.npmjs.com/package/decoders).

To get started:

$ npm install decoders

Here’s a quick example of what decoders can do:

import { guard, number, object, optional, string } from 'decoders';

// Define your decoder
const decoder = object({
  name: string,
  age: optional(number),
});

// Build the runtime checker ("guard") once
const verify = guard(decoder);

// Use it
const unsafeData = JSON.parse(request.body);
//                            ^^^^^^^^^^^^ Could be anything!
const data = verify(unsafeData);
//    ^^^^ Guaranteed to be a person!

// Now, Flow/TypeScript will _know_ the type of data!
data.name;   // string
data.age;    // number | void
data.city;
//   ^^^^ 🎉 Type error! Property `city` is missing in `data`.

Why?

When writing JavaScript programs (whether that’s for the server, or the browser), one tool that has become indispensable for maintainable code bases is a type checker like Flow or TypeScript. Disclaimer: I’m mainly a Flow user, but everything in this post also applies to TypeScript (= also great). Using a static type checker makes it possible to change large JS code bases in ways that weren’t possible before.

One area where Flow (or TypeScript) coverage is typically hard to achieve is when dealing with external data. Any form of user input, an HTTP request body, or even the results of a database query is “external” from your app’s perspective. How can we type those things?

For example, say your app wants to do something with data coming in from a POST request with some JSON body:

const data = JSON.parse(request.body);

The type of data here will be “any”. The reason is of course that we’re dealing with a static type checker. So even though Flow will know that the input to JSON.parse() must be a string, it doesn’t know which string and the type of JSON.parse()’s return value will be defined by the value of the string at runtime. In other words, it could be anything.

For example:

typeof JSON.parse('42');              // number
typeof JSON.parse('"hello"');         // string
typeof JSON.parse('{"name": "Joe"}'); // object

Statically, it’s impossible to know the return type. That’s why Flow can only define this type signature as:

JSON.parse :: (value: string) => any;

Worse, even, is that using these any-typed values may implicitly (unknowingly) turn off type checking, even for code that’s type-safe otherwise.

For example, suppose you feed an implicitly-any value to a type-safe function like:

function greet(name: string): string {
  return 'Hi, ' + name + '!';
}

const data = JSON.parse(request.body);
greet(data.name);

Then Flow will just accept this, because data is any, and thus data.name is any. But of course this isn’t safe! In this example, data cannot and should not be trusted. Flow lets arbitrary values get passed into greet() anyway, despite its type annotation!

Especially in real applications this puts a significant practical cap on Flow’s usefulness. Using any (whether implicit or explicit) is completely unsafe, and should be avoided like the plague.

Decoders to the Rescue

How, then, can we statically type these seemingly dynamic beasts? We can do so if we change our perspective on the problem a little bit.

Rather than trying to let Flow infer the type of a dynamic expression (which is impossible), what if we would have a way to instead specify the type we are expecting, and have an automatic type-check injected at runtime that will verify those assumptions? This way, Flow is able to know, statically, what the runtime type will be.

As you might have guessed, this is exactly what the decoders library offers.

You can use decoders’ library of composable building blocks that allow you to specify the shape of your expected output value:

import type { Decoder } from 'decoders';
import { guard, number, object, string } from 'decoders';

type Point = {
    x: number,
    y: number,
};

const pointDecoder = object({
    x: number,
    y: number,
});
const asPoint = guard(pointDecoder);

const p1: Point = asPoint({ x: 42, y: 123 });
const p2: Point = asPoint({ x: -3, y: 0, z: 1 });

There are a few interesting pieces to this example.

First of all, you’ll notice the similarity between the Point type, and the structure of the decoder.

Also note that, by wrapping any value in an asPoint() call, Flow will know—statically—that p1 and p2 will be Point instances. And therefore you get full type support in your editor like tab completion, and full Flow type safety like you’re used to elsewhere.

How? Because if the data does not match the decoder’s description of the data, the call to asPoint() will throw a runtime error. This will be the case in the unhappy path, for example:

const p3: Point = asPoint({ x: 42 });
//                ^^^^^^^^^^^^^^^^^^ Runtime error: Missing "y" key
const p4: Point = asPoint(123);
//                ^^^^^^^^^^^^ Runtime error: Must be object

Composition

Decoders come with batteries included and these base decoders are designed to be infinitely composable building blocks, which you can assemble into complex custom decoders.

The simplest decoders are the scalar types: number, boolean, and string. From there, you can compose them into higher order decoders like object(), array(), optional(), or nullable() to create more complex types.

For example, starting with a decoder for Points:

const point = object({
  x: number,
  y: number,
});

In terms of types:

point            // Decoder<Point>
array(point)     // Decoder<Array<Point>>
optional(point)  // Decoder<Point | void>
nullable(point)  // Decoder<Point | null>

The decoders library also comes with a special regex() decoder, which is like the string decoder but additionally performs a regex match and only allows string values that match:

const hexcolor = regex(
    /^#[0-9a-f]{6}$/,
    'Must be hex color',  // Shown in error output
);

You can then reuse these new decoders above by composing them into a polygon decoder. Notice the reuse of the hexcolor and the point decoders here.

const polygon = object({
  fill: hexcolor,
  stroke: optional(hexcolor),
  points: array(point),
});

You can then reuse that complex definition in a list:

const polygons = array(polygon);

You get the point. The final output type that this decoder produces will be:

Array<{|
    fill: string,
    stroke: string | void,
    points: Array<{|
        x: number,
        y: number,
    |}>,
|}>;

Notice how the fill and stroke fields here end up as normal strings. Statically, Flow only knows that they are going to be string values, but at runtime, they will only contain hex color values that matched the regex. (Decoders are therefore more expressive than the type system in describing what values are allowed.)

Note: It is not recommended to go fully overboard with this feature. Decoders are best kept simple and straightforward, staying close to the values they express, and not performing too much "magic" at runtime.

The best way to discover which decoders are available is to look at the reference docs.

Error messages

Human readable and helpful error messages are considered important. That’s why decoders will always try to emit very readable error messages at runtime, inlined right into the actual data. An example of such a message would be:

Decode error:
{
  "firstName": "Vincent",
  "age": "37",
         ^^^^
         Either:
         - Must be undefined
         - Must be number
}
^ Missing key "name"

This is a complex error message, but optimized to be very readable to the human eye when outputted to a console.

The same error information can also be presented as a list of error messages for outputting in API responses. In this case, the input data isn't echoed back as part of the error message:

[
  'Value at key "age": Either:\n- Must be undefined\n- Must be number',
  'Missing key: "name"'
]

(For those interested, this inline object annotation is performed by debrief.js.)

Decoders vs Guards?

When you have composed your decoder, it’s often useful to turn the outermost decoder into a “guard”. A guard offers a slightly more convenient API, but is very much like a decoder. It’s also callable on unverified inputs, but it will throw a runtime error if validation fails. Guards are therefore typically easier to work with: using the guard, you can focus on the happy path and handle any validation errors with the normal exception handling mechanism.

Invoking the decoder directly on an input value will not throw a runtime error and instead return a so-called “decode result”. A “decode result” is a value that represents either an OK value or an Error, both of which you'll need to “unpack” to do anything useful with it.

For example, given this decoder definition:

const decoder = object({
  name: string,
  age: optional(number),
});
const verify = guard(decoder);

decoder('invalid data');  // Won't throw
verify('invalid data');   // Throws

If you want to programmatically handle the decode result, you can use a decoder directly and inspect the decode result. If you're just interested in the data and not in handling any decoding errors, use a guard.

In terms of types:

type Decoder<T> = any => DecodeResult<T>;
type Guard<T> = any => T;

// The guard() helper builds a guard for a decoder of the same type
guard: <T>(Decoder<T>) => Guard<T>;

(For those interested, the DecodeResult type is powered by the Result type from lemons.)

Give it a whirl!

Please try it out and let me know about your experiences.

https://nvie.com/posts/introducing-decoders/
pip-tools 1.0 released

Yesterday, pip-tools version 1.0 was silently released, officially introducing the pip-compile and pip-sync tools, and replacing the current pip-dump and pip-review tools.

I've blogged before about these ideas in Pinning Your Packages and Better Package Management. During the last year, I've been slowly working on the future branch on the pip-tools repo, and have been using the new tools there. The pip-sync script was the only thing that was still delaying the release, but since Hugo Peixoto contributed this one recently, it's now ready to switch over.

So it's now time to switch over to the new tools if you've been using the old ones.

Old: pip-review, pip-dump
New: pip-compile, pip-sync

How to upgrade

If you're using pip-tools 0.x, you'll notice that its main commands, pip-review and pip-dump are gone. Instead, you'll find two new commands, pip-compile and pip-sync, which should allow you to do the same things, but arguably in a more solid way.

Typical usage:

  • pip install pip-tools
  • Record your top-level dependencies in requirements.in. Everything you directly use in your source code should be a top-level dependency.
  • Don't pin them—unless you want them pinned, of course.
  • Put both requirements.in and requirements.txt under version control.
  • Then, run pip-compile. This will produce a requirements.txt that pins the high-level requirements to the highest versions found on PyPI to match the given requirements.
  • Using pip-sync now will install/upgrade/uninstall everything so that your virtual env exactly matches what's in requirements.txt.

For more information, see the README of the new tools.

Let me know how it works for you!

https://nvie.com/posts/pip-tools-10-released/
Beautiful code

When was the last time you looked at code someone else wrote? (Debugging doesn't count!) When did you actually study it to learn from it? Perhaps even ponder over its beauty?

It's surprising how uncommon it is in our industry to look at existing code just to learn from it. In almost any other engineering or art field, people constantly study the results of their peers. Books on architecture are a great example. What makes a certain design so beautiful or effective? Can I learn something from it to make me a better engineer? I feel we would benefit as an industry if we would collectively take a little more time to reflect and study. We should ask ourselves those questions more often, and allocate study time for it occasionally.

The cover of Beautiful Code

Last week, I ordered the book Beautiful Code, by Greg Wilson and Andy Oram. I would recommend this book to any fellow professional programmer. (All of the book's proceeds go to Amnesty International.)

The book's concept is simple: each of the 33 chapters is written by a well-respected professional programmer who answers the question: "What is the most beautiful code you've ever seen?" after which they discuss elaborately why they think it's beautiful.

Beauty, as it turns out, comes in many shapes and forms. The topics in the book vary from an elegant algorithm, to a design that lets all the surrounding puzzle pieces fall into place perfectly. From the cleverness of code generators, to the expressiveness of a language construct.

Naturally, some chapters have been more interesting than others. But all of them have left me with either:

  • a deep admiration for an elegant solution or cleverness;
  • an interesting perspective on good design;
  • an appreciation of seemingly banal things, or things I previously did not find beauty in;
  • insight into how to articulate why exactly something is beautiful (how meta)
https://nvie.com/posts/beautiful-code/
Beautiful map
I was reading Igor Kalnitsky's blog post on why Python's map() is mad (https://kalnytskyi.com/2015/06/14/mad-map/), and wanted to provide a different perspective. In fact, I would call the design of Python's map() beautiful instead.

First off, what does map(f, xs) represent mathematically in the first place? It should invoke function f(x) for every x in xs. Functions, of course, can take many arguments—single argument functions are just the simplest case. So what would be reasonable to assume map(f, xs, ys) would do? In the blog post, Igor suggests the behaviour should be to chain xs and ys, but chances are they represent completely different things, so chaining them would lead to a heterogenous collection of items. Mathematically, you would expect the function calls made to be f(x1, y1), f(x2, y2), ...

Note that this is different from zip()'ing the function arguments. A function f with 2 arguments is different from a function f with 1 argument, expecting a tuple.

Compare:

def f(x, y):
    return x * y

map(f, ['a', 'b', 'c'], [1, 2, 3])    # ['a', 'bb', 'ccc']

to

def f(pair):
    x, y = pair
    return x * y

map(f, zip(['a', 'b', 'c'], [1, 2, 3]))  # ['a', 'bb', 'ccc']

The confusion around the items appearing to be zipped is caused by the implicit behaviour in Python 2 when the first argument is None. I think it's handled as a special case, which is unfortunate. A more consistent behaviour would have been to treat None like the identity function, which fails when given more than one iterable:

Python 2.7.9 (default, Dec 19 2014, 06:00:59)
>>> map(lambda x: x, ['a'], [1])
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: <lambda>() takes exactly 1 argument (2 given)
>>> map(None, ['a'], [1])
[('a', 1)]

The TypeError would have been the sane thing to do, since the identity function should only ever take one argument.

Therefore, my advice would be to never use the implicit None as the first argument. It is broken under Python 3 anyway.

To zip() or to zip_longest()?

The fact that the behaviour changed in Python 3 is unfortunate, but I think it changed for the better. The problem with zip_longest()-like default semantics is that it will only ever work with finite iterables. If only one of the given iterables is infinite, the map will be infinite too. Now, perhaps this is what you want, but in that case you should probably be explicit about it anyway. I think using zip()-like semantics as the default makes perfect sense. It enables the following usage in Python 3:

>>> from itertools import count
>>> 
>>> def f(x, y):
...     return x * y
... 
>>> for x in map(f, ['a', 'b', 'c'], count(1)):
...     print(x)
...
a
bb
ccc

Compare this to Python 2's map behaviour, which would do:

>>> for x in map(f, ['a', 'b', 'c'], count(1)):
...     print(x)
...
a
bb
ccc
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in f
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'

Because it tries to invoke f(None, 4) the fourth time, which happens to fail. If it would not fail, it would produce results infinitely.

But what if you actually want zip_longest()-like behaviour? Well, you can either make all arguments be infinite iterables, or you can explicitly wrap your arguments in a zip_longest() wrapper, and pass that to starmap(), which will take an iterable of tuples and spread it over the arguments to f(), just like map:

>>> from itertools import count, islice, starmap, zip_longest
>>>
>>> result = starmap(f, zip_longest(['a', 'b', 'c'], count(1), fillvalue='?'))
>>> for x in islice(result, 7):
...     print(x)
...
a
bb
ccc
????
?????
??????
???????

As a bonus, you can pass in a fillvalue this way, instead of being stuck with the assumption of None (which could happen to be a valid value within the iterable).

However, personally, in this case, I'd prefer the following, more readable version that avoids the zip_longest() and starmap() calls:

from itertools import chain, count, repeat
map(f, chain(['a', 'b', 'c'], repeat('?')), count(1))

Note how you can thus make the map result infinite by simply making all iterables infinite. Consuming iterables until the first one is exhausted (so zip()-like), thus, is the sanest default behaviour, and the most beautiful of the options.
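As a quick check that the two spellings agree, here is a small sketch. It reuses the f() from the examples above; the variable names explicit and padded are hypothetical, and islice() is only there to cut the infinite results short:

```python
from itertools import chain, count, islice, repeat, starmap, zip_longest

def f(x, y):
    return x * y

letters = ['a', 'b', 'c']

# Explicit zip_longest()-like semantics, spread over f() by starmap()
explicit = starmap(f, zip_longest(letters, count(1), fillvalue='?'))

# The same effect, obtained by padding the finite iterable until it is infinite
padded = map(f, chain(letters, repeat('?')), count(1))

first = list(islice(explicit, 5))
print(first)  # ['a', 'bb', 'ccc', '????', '?????']
assert first == list(islice(padded, 5))
```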

Python 3 gets it right

I'm glad that Python 3 changed map() to be sane in every way:

  • It returns a lazy iterator instead of directly producing a list
  • It disallows the ambiguous None first argument
  • It consumes the iterables until the first one is exhausted.
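
These three properties can be observed directly. In the sketch below, the calls list is a hypothetical bookkeeping aid that records when f() actually runs:

```python
from itertools import count

calls = []

def f(x, y):
    calls.append((x, y))  # record each invocation to make laziness visible
    return x * y

result = map(f, ['a', 'b', 'c'], count(1))
assert calls == []        # nothing has been computed yet: map() is lazy

values = list(result)     # consuming the iterator triggers the calls
print(values)             # ['a', 'bb', 'ccc']
print(calls)              # [('a', 1), ('b', 2), ('c', 3)]

try:
    list(map(None, ['a'], [1]))  # the Python 2 special case is gone
except TypeError as exc:
    print('TypeError:', exc)
```

Even though count(1) is infinite, list(result) terminates, because the map stops as soon as the shortest iterable is exhausted.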

Bonus: your own zip()

Did you know you could express zip() with map()? It's easy, now that you know the exact semantics:

def zip(*iterables):
    return map(lambda *args: tuple(args), *iterables)
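
A quick sanity check of that definition, as a sketch: the function is renamed to my_zip here (a hypothetical name) so that it can be compared against the built-in zip().

```python
from itertools import count

def my_zip(*iterables):
    # Same definition as above, renamed so the built-in stays available
    return map(lambda *args: tuple(args), *iterables)

letters = ['a', 'b', 'c']
print(list(my_zip(letters, count(1))))  # [('a', 1), ('b', 2), ('c', 3)]
assert list(my_zip(letters, count(1))) == list(zip(letters, count(1)))
```

One difference worth knowing: the built-in zip() accepts zero arguments and returns an empty iterator, whereas map() requires at least one iterable, so my_zip() raises a TypeError when called without arguments.
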
https://nvie.com/posts/beautiful-map/
Thinking in streams

In my previous post, Use More Iterators, I have outlined how to harvest some low hanging fruit when it comes to converting your functions to generators. In this series of posts I want to take it to the next level and introduce a few powerful constructs that can assist you when working with streams.

Streams

Previously, I've compared Python's generators to value factories (producing values lazily) and talked about their composability. I want to pay some more attention to these concepts in this blog post.

One particular concept that fits generators like a glove is to use them to stream data. Streams help you express solutions to many data manipulation flows and processes elegantly. Of course, this idea is not novel: the concept of streams finds its roots in the early 60's (as all good CS ideas do).

How do streams fit generators? Since a generator is a function that returns a "value factory", it's a natural component to act as a "node" in a network of such generators. Each such component takes input, does something with it, and emits output.
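
As a minimal sketch of such a network: the node names numbers, keep_even, and square below are hypothetical, and an endless counter stands in for a real data source. The pulled list records what the source emits, which shows that values only flow when the end of the stream asks for them:

```python
from itertools import islice

pulled = []

def numbers():
    # Source node: an endless stream that records every value it emits
    n = 0
    while True:
        n += 1
        pulled.append(n)
        yield n

def keep_even(iterable):
    # Filter node: passes on only the even values
    return (n for n in iterable if n % 2 == 0)

def square(iterable):
    # Transform node: emits the square of every input value
    for n in iterable:
        yield n * n

stream = islice(square(keep_even(numbers())), 3)
assert pulled == []   # wiring up the nodes pulls nothing from the source

print(list(stream))   # [4, 16, 36]
print(pulled)         # [1, 2, 3, 4, 5, 6]
```

The source never runs ahead of its consumers: only the six values needed to produce three results are ever generated.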

A Little Word Puzzle

Take a look at this example to generate a little word puzzle. It generates a list of the first 5 dictionary words of 20 characters or more that aren't names, and hides their vowels:

import re
from itertools import islice

vowels_re = re.compile('[aeiouy]')

def all_words():
    with open('/usr/share/dict/words') as f:
        for line in f:
            yield line.strip()

def keep_long_words(iterable, min_len):
    return (word for word in iterable if len(word) >= min_len)

def exclude_names(iterable):
    return (word for word in iterable if word.lower() == word)

def hide_vowels(iterable):
    for word in iterable:
        yield vowels_re.sub('.', word)

def limit(iterable, n):
    return islice(iterable, 0, n)

stream = all_words()
stream = keep_long_words(stream, 20)
stream = exclude_names(stream)
stream = hide_vowels(stream)
stream = limit(stream, 5)

for word in stream:
    print(word)

This will print the following list:

.bd.m.n.h.st.r.ct.m.
.c.t.lm.th.lc.rb.n.l
.c.t.lph.n.lh.dr.z.n.
.m.n..c.t.ph.n.t.d.n.
.n.rch..nd.v.d..l.st

Wrap(Wrap(Wrap(…)))

The variable stream is used to incrementally build up an entire stream (a network of stream processors). It starts with all_words(), the generator that emits all words from the dictionary file.

Then with each further step, stream is "wrapped" in another generator, which is used to chain the generators together. The emitted output of all_words() will now be consumed by the keep_long_words() generator, emitting only the words from the input stream that match the length criterion.

We keep "wrapping" these with another filter step (exclude_names()) and a manipulation step (hide_vowels()), and finally limit the stream to return at most 5 items.

As a result, the variable stream is re-assigned a few times. There is a nice advantage to this approach: it avoids using any further variables, and allows us to build up the entire stream line by line. The order in which we build it up resembles the way the data flows.

Lastly, you can comment out a line in the middle and the code still works. If you decide you do want names in the result list, simply comment out this line:

stream = all_words()
stream = keep_long_words(stream, 20)
# stream = exclude_names(stream)  # comment this out to skip this step
stream = hide_vowels(stream)
stream = limit(stream, 5)

And the stream will still be valid. This is especially useful for trying out your streams while you're still developing them. Without using this intermediate variable, the equivalent would look like this:

stream = limit(hide_vowels(exclude_names(keep_long_words(all_words(), 20))), 5)
output = list(stream)

This does the same thing, but is rather messy, and gets unreadable quickly.

A DSL to Compose Streams

Putting together the pieces of the stream as we did above is relatively clunky. There is a better way. What if we could express the thing above like this?

stream = (all_words()
          >> keep_long_words(20)
          >> exclude_names()
          >> hide_vowels()
          >> limit(5))
output = list(stream)

This syntax would have the best of both worlds: it allows you to elegantly chain together generators using the >> operator without using an intermediate variable. If A and B are streams, then the result of A >> B is the composition of both streams, applying A first to its input, then applying B to the output of A.

We can actually build this. Let's define a Task, a base class for each such component that's chainable using the >> operator. It looks like this:

class Task:
    def pipe(self, other_task):
        return ChainedTask(self, other_task)

    def process(self, inputs):
        raise NotImplementedError

    def __iter__(self):
        return self.process([])

    def __rshift__(self, other):
        return self.pipe(other)


class ChainedTask(Task):
    def __init__(self, task1, task2):
        self.task1 = task1
        self.task2 = task2

    def process(self, inputs):
        return self.task2.process(self.task1.process(inputs))

But how do we get from the generator functions of the example above to these tasks that support the >> operator? We need to convert them to tasks by implementing the process() method. Here's the exclude_names() function converted:

class exclude_names(Task):
    def process(self, inputs):
        return (word for word in inputs if word.lower() == word)

And here is an example of converting a function with arguments. The argument moves to the constructor of the class:

class keep_long_words(Task):
    def __init__(self, min_len):
        self.min_len = min_len

    def process(self, iterable):
        return (word for word in iterable if len(word) >= self.min_len)

Another notable case is the "starting" generator: the one spitting out the dictionary words. As a generator function, this did not take any inputs, but as a Task, it still receives the inputs argument, but should ignore it. This way, we can treat it like any other task, and we'll see an example of how this is useful later on:

class all_words(Task):
    def process(self, inputs):  # inputs arg is ignored
        with open('/usr/share/dict/words') as f:
            for line in f:
                yield line.strip()

Making Compositions

Using this, we can now start to make some abstractions. We can assign a series of chained tasks to a variable and insert where we need it. This can be used to clarify what is going on in those steps, or for reusability of a component.

def puzzlify():
    return (keep_long_words(20)
            >> exclude_names()
            >> hide_vowels())

stream = all_words() >> puzzlify() >> limit(5)
output = list(stream)

Note that calling puzzlify() will return a Task instance (the one that chains together the three subtasks). Then, this task instance is further chained into the larger example. Also note that puzzlify() itself is a perpetual processor: there's no start or end defined by it. The context it's used in defines the start and the end.

The chaining primitive allows you to craft complex data streams in an elegant fashion.
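To see the pieces working together, here is a minimal self-contained sketch. It repeats the Task and ChainedTask classes from above and adds hide_vowels and limit as tasks. The post doesn't show those two converted, so their conversions are my assumption. A hypothetical from_list task stands in for all_words(), so the example doesn't depend on /usr/share/dict/words, and puzzlify() takes the minimum length as an argument here to suit the tiny word list.

```python
import re
from itertools import islice

vowels_re = re.compile('[aeiouy]')


class Task:
    def pipe(self, other_task):
        return ChainedTask(self, other_task)

    def process(self, inputs):
        raise NotImplementedError

    def __iter__(self):
        return self.process([])

    def __rshift__(self, other):
        return self.pipe(other)


class ChainedTask(Task):
    def __init__(self, task1, task2):
        self.task1 = task1
        self.task2 = task2

    def process(self, inputs):
        return self.task2.process(self.task1.process(inputs))


class from_list(Task):  # hypothetical stand-in for all_words()
    def __init__(self, words):
        self.words = words

    def process(self, inputs):  # inputs arg is ignored
        return iter(self.words)


class keep_long_words(Task):
    def __init__(self, min_len):
        self.min_len = min_len

    def process(self, inputs):
        return (word for word in inputs if len(word) >= self.min_len)


class exclude_names(Task):
    def process(self, inputs):
        return (word for word in inputs if word.lower() == word)


class hide_vowels(Task):  # assumed conversion, not shown in the post
    def process(self, inputs):
        return (vowels_re.sub('.', word) for word in inputs)


class limit(Task):  # assumed conversion, not shown in the post
    def __init__(self, n):
        self.n = n

    def process(self, inputs):
        return islice(inputs, 0, self.n)


def puzzlify(min_len):
    return keep_long_words(min_len) >> exclude_names() >> hide_vowels()


words = ['cat', 'Amsterdam', 'generator', 'iterators', 'streaming']
stream = from_list(words) >> puzzlify(8) >> limit(2)
output = list(stream)
print(output)  # ['g.n.r.t.r', '.t.r.t.rs']
```

Nothing is read from the word list until list() starts pulling values through the chain, and limit(2) stops pulling as soon as it has two results.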

Complex Operations

The composition operation (>>) isn't the only task common with streams. Another one is the split-and-join operation. In this scenario, you may want to perform multiple operations on a single stream independent of each other. This can be achieved using the & operator, as in A & B.

This will split the inputs and feed copies of the inputs to both processes. After applying each task, the results get merged back, exhausting A first, then B.

Suppose we would want to filter our dictionary for words that are anagrams or contain the substring anana (or both). You could do it as follows:

stream = (all_words()
          >> (is_anagram() & contains_anana()))
for word in stream:
    print(word)

The real power of this split-and-join operator comes when you combine it to perform different actions on each "side" of the split. For example, if you want to uppercase all of the anagrams, but lowercase all of the words containing the "anana" substring:

stream = (all_words()
          >> (is_anagram() >> uppercase()
              &
              contains_anana() >> lowercase()))
for word in stream:
    print(word)

And here are the results:

...
SOOLOOS
TEBBET
TEBET
TENET
TERRET
ULULU
YARAY
anana
ananaplas
ananaples
ananas
banana
bananaland
bananalander
...

This is all done in a single stream.
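The post doesn't show how the & operator is implemented, so the following is only one possible sketch. It assumes a hypothetical SplitJoinTask class that uses itertools.tee() to copy the inputs and itertools.chain() to exhaust the first branch before the second. The source, keep_if, and apply tasks are also made up for this illustration and are not the tasks used above.

```python
from itertools import chain, tee


class Task:
    def process(self, inputs):
        raise NotImplementedError

    def __iter__(self):
        return self.process([])

    def __rshift__(self, other):
        return ChainedTask(self, other)

    def __and__(self, other):
        return SplitJoinTask(self, other)


class ChainedTask(Task):
    def __init__(self, task1, task2):
        self.task1 = task1
        self.task2 = task2

    def process(self, inputs):
        return self.task2.process(self.task1.process(inputs))


class SplitJoinTask(Task):  # hypothetical name
    def __init__(self, task1, task2):
        self.task1 = task1
        self.task2 = task2

    def process(self, inputs):
        # tee() hands each branch its own copy of the input stream
        left, right = tee(inputs)
        # chain() exhausts the first branch before starting the second
        return chain(self.task1.process(left), self.task2.process(right))


class source(Task):  # hypothetical starting task
    def __init__(self, items):
        self.items = items

    def process(self, inputs):  # inputs arg is ignored
        return iter(self.items)


class keep_if(Task):  # hypothetical filter task
    def __init__(self, predicate):
        self.predicate = predicate

    def process(self, inputs):
        return (item for item in inputs if self.predicate(item))


class apply(Task):  # hypothetical mapping task
    def __init__(self, func):
        self.func = func

    def process(self, inputs):
        return (self.func(item) for item in inputs)


def reads_same_backwards(word):
    return word.lower() == word.lower()[::-1]


words = ['Tenet', 'banana', 'level', 'ananas']
stream = (source(words)
          >> (keep_if(reads_same_backwards) >> apply(str.upper)
              &
              keep_if(lambda w: 'anana' in w) >> apply(str.lower)))
output = list(stream)
print(output)  # ['TENET', 'LEVEL', 'banana', 'ananas']
```

One trade-off of this approach: because the first branch is exhausted before the second one starts, tee() has to buffer every input item for the second branch. That is fine for a dictionary file, but it would not work for an infinite input stream.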

Why >> and &?

You may ask why I did not follow the more familiar syntax that most shells follow, using A | B. The reason is Python's operator precedence.
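The precedence argument isn't spelled out here, so what follows is my reading of it rather than the author's explanation. In Python, >> binds tighter than &, while | binds looser than &. The hypothetical Expr class below only records how Python groups an expression, which makes the difference visible.

```python
class Expr:
    """Records how Python groups operators by building a string."""

    def __init__(self, text):
        self.text = text

    def __rshift__(self, other):
        return Expr('(%s >> %s)' % (self.text, other.text))

    def __or__(self, other):
        return Expr('(%s | %s)' % (self.text, other.text))

    def __and__(self, other):
        return Expr('(%s & %s)' % (self.text, other.text))


A, B, C, D = (Expr(name) for name in 'ABCD')

with_shift = (A >> B & C >> D).text
with_pipe = (A | B & C | D).text

print(with_shift)  # ((A >> B) & (C >> D))
print(with_pipe)   # ((A | (B & C)) | D)
```

With >>, each side of the & is chained first and the two chains are then split and joined, which is the grouping used in the uppercase/lowercase example. With |, the & would grab B and C first, so every split would need explicit parentheses.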

https://nvie.com/posts/thinking-in-streams/
Modifying Deeply-Nested Structures
Yesterday, a friend asked me how I would solve a certain problem he was facing. He did have a working solution, but felt like he could make it more generally applicable. Not one to shy away from a good challenge, I decided to take it on and see how I would solve it. In this blog post you can read about my solution.

The Problem

Consider this JSON document. Return the same JSON document, but with the points list sorted by stop time in descending order.

{
  "timestamp": 1412282459,
  "res": [
    {
      "group": "1",
      "catlist": [
        {
          "cat": "1",
          "start": "none",
          "stop": "none",
          "points": [
              {"point": "1", "start": "13.00", "stop": "13.35"},
              {"point": "2", "start": "11.00", "stop": "14.35"}
          ]
        }
      ]
    }
  ]
}

Note: It's kept short for brevity: the actual document contained many more items in each of the nested lists (so multiple groups, categories, and points), but this snippet covers the overall structure.

Analysis

This is an incredibly common task I'm sure any programmer with a sufficiently long career has encountered in one shape or form.

The simple, ad hoc, solution to the problem above is:

def sort_points(d):
    for group in d['res']:
        for cat in group['catlist']:
            cat['points'] = sorted(cat['points'], reverse=True, key=lambda p: p['stop'])

Some downsides, in no particular order:

  • Not reusable. This function can only deal with dicts of the exact same structure. It's required to know the key names and the type of data living at each nesting level. Any other dict will make it choke;
  • Brittle. The algorithm itself needs to be changed when the document gets nested in another document, or when its structure should change. Nest it in another dictionary, and you need an extra for loop in there. Pass it just a category, and you need to take out a for loop;
  • Lacks abstraction. The algorithm mixes traversing the dict with modifying it—these should be two separate things;
  • Changes the dict in-place. There's no need to rely on using mutable data structures here. It should work for dicts you cannot change, too.

Solution

How do we solve this more elegantly?

The data itself isn't interesting in the slightest. This is a problem of structure only. Let's try to break out the pieces that are specific for this problem and see if we can factor out a generic piece, taking the specific parts as arguments.

The three key tasks we need to perform:

  1. traverse the entire structure recursively;
  2. determine if we've arrived at a given designated path;
  3. change the nested structure at that location (using given function).

Note that these steps already show our function params. We need a way of specifying the "path" to drill down into, and specify what transformation to apply to those targeted elements.


Traversal

First and foremost, we need a way of traversing arbitrarily nested structures. Given that this is only JSON data (so limited to strings, numbers, lists and dicts), we can start with a recursive function that will walk the structure and produce a new output which will essentially be a deep copy of the input:

def traverse(obj):
    if isinstance(obj, dict):
        return {k: traverse(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [traverse(elem) for elem in obj]
    else:
        return obj  # no container, just values (str, int, float)

Here are some examples:

>>> traverse(3)
3
>>> traverse('ohai')
'ohai'
>>> traverse(['ohai', {'name': 'Vincent', 'age': 32}])
['ohai', {'name': 'Vincent', 'age': 32}]

Path Matching

To specify the target path into the structure, we need to come up with a syntax that can express those, a mini-language. Here's an example:

'res[].catlist[].points'

This notation clearly specifies the steps to take to drill down into the structure to arrive at any nested element. Note that each step is explicit: it's either a step into a dict key (the string), or into a list (the empty list). This convenient string notation is of course just sugar for the following:

['res', [], 'catlist', [], 'points']
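The conversion from the string notation to the list form is not shown in the post. A minimal sketch of such a helper could look like this. The name to_path() matches the call used later on, but this body is my assumption, and it does not handle keys that themselves contain a dot or "[]".

```python
def to_path(spec):
    """Convert 'res[].catlist[].points' to ['res', [], 'catlist', [], 'points']."""
    path = []
    for part in spec.split('.'):
        # Count and strip trailing '[]' markers: each one is a step into a list
        depth = 0
        while part.endswith('[]'):
            part = part[:-2]
            depth += 1
        if part:
            path.append(part)              # a step into a dict key
        path.extend([] for _ in range(depth))  # steps into lists
    return path


print(to_path('res[].catlist[].points'))
# ['res', [], 'catlist', [], 'points']
```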

How can we use this structure? Let's take the traverse() function and change it to keep a record of the paths it's traversing along as it traverses:

def traverse(obj, path=None):
    if path is None:
        path = []

    if isinstance(obj, dict):
        return {k: traverse(v, path + [k]) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [traverse(elem, path + [[]]) for elem in obj]
    else:
        return obj

But we're not doing anything with the path we're tracking this way yet.


Applying an Action

Now we need to use that path and in the interesting case perform an action. One way is to add that to the traverse() function, but it would do more than just traversing that way. Let's update the traverse function with a callback argument that will get called for every node in the structure (every leaf, dict entry, or list item). This would fully decouple traversing the structure from modifying it.

def traverse(obj, path=None, callback=None):
    if path is None:
        path = []

    if isinstance(obj, dict):
        value = {k: traverse(v, path + [k], callback)
                 for k, v in obj.items()}
    elif isinstance(obj, list):
        value = [traverse(elem, path + [[]], callback)
                 for elem in obj]
    else:
        value = obj

    # if a callback is provided, call it to get the new value
    if callback is None:
        return value
    else:
        return callback(path, value)

Now the traverse() function is really generic and can be used to replace any node in the structure. We can now implement our traverse_modify() function that will look for a specific node, and update it. In this example, the transformer() function is our callback that will be invoked on every node in the structure. If the current path matches the target path, it will perform the action.

def traverse_modify(obj, target_path, action):
    target_path = to_path(target_path)  # converts 'foo.bar' to ['foo', 'bar']

    # This will get called for every path/value in the structure
    def transformer(path, value):
        if path == target_path:
            return action(value)
        else:
            return value

    return traverse(obj, callback=transformer)

Back to the Problem

There we have it: a generic, reusable function that abstracted out each individual interaction: traversal, path matching, and action. To solve our original problem with it, now simply call:

from operator import itemgetter

def sort_points(points):
    """Will sort a list of points."""
    return sorted(points, reverse=True, key=itemgetter('stop'))

traverse_modify(doc, 'res[].catlist[].points', sort_points)
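For reference, here is everything assembled into one runnable sketch, applied to the document from the problem statement. The traverse(), traverse_modify(), and sort_points() functions are the ones shown above. The to_path() body is my assumption, since the post only describes what it does.

```python
from operator import itemgetter


def to_path(spec):  # assumed implementation, see the note above
    path = []
    for part in spec.split('.'):
        depth = 0
        while part.endswith('[]'):
            part = part[:-2]
            depth += 1
        if part:
            path.append(part)
        path.extend([] for _ in range(depth))
    return path


def traverse(obj, path=None, callback=None):
    if path is None:
        path = []

    if isinstance(obj, dict):
        value = {k: traverse(v, path + [k], callback)
                 for k, v in obj.items()}
    elif isinstance(obj, list):
        value = [traverse(elem, path + [[]], callback)
                 for elem in obj]
    else:
        value = obj

    if callback is None:
        return value
    return callback(path, value)


def traverse_modify(obj, target_path, action):
    target_path = to_path(target_path)

    def transformer(path, value):
        if path == target_path:
            return action(value)
        return value

    return traverse(obj, callback=transformer)


def sort_points(points):
    """Will sort a list of points."""
    # Note: 'stop' holds strings, so this compares them as text
    return sorted(points, reverse=True, key=itemgetter('stop'))


doc = {
    "timestamp": 1412282459,
    "res": [{
        "group": "1",
        "catlist": [{
            "cat": "1", "start": "none", "stop": "none",
            "points": [
                {"point": "1", "start": "13.00", "stop": "13.35"},
                {"point": "2", "start": "11.00", "stop": "14.35"},
            ],
        }],
    }],
}

result = traverse_modify(doc, 'res[].catlist[].points', sort_points)
print([p['point'] for p in result['res'][0]['catlist'][0]['points']])  # ['2', '1']
print([p['point'] for p in doc['res'][0]['catlist'][0]['points']])     # ['1', '2']
```

The second print shows that the input document is left untouched: traverse() builds a new structure instead of changing the dict in-place.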

Here's the gist to the complete solution.

Go Wild

This blog post was meant to provide some insight into a typical software engineering approach to problem solving. Breaking down a problem into pieces, and abstracting out the essence is perhaps the most joyful thing to do as an engineer.

The constructed traverse_modify() function could be written in many different ways if you would like to give it more power. Examples are supporting the following path specs:

qux[].(foo|bar)[]  # match both foo or bar
results[0..4]      # only apply action to first 4 results
foo.bar.*          # match any key value
foo.bar.**.qux     # match 0 or more levels of keys
...

Another extension would be to allow it to take multiple target-path/action combinations and apply them all in a single traversal.

All of that is left as an exercise to the reader though, to keep this post clear and concise.

Summary

The fantastic thing is that picking a good abstraction for the traverse() function opens a door to an entire world of possibilities to modify arbitrary Python object structures. In the end, you get more than you initially set out for.

If you build something cool based on this however, please let me know :)

https://nvie.com/posts/modifying-deeply-nested-structures/
Use More Iterators
One of my top favorite features of the Python programming language is generators. They are so useful, yet I don't encounter them often enough when reading open source code. In this post I hope to outline their simplest use case and hope to encourage any readers to use them more often.

This post assumes you know what containers and iterators are. I've explained these concepts in a previous blog post. In a follow-up post, I elaborate on what can be achieved with thinking in streams a bit more.

Why?

Why are iterators a good idea? Code using iterators can avoid intermediate variables, be shorter, run lazily, consume less memory, run faster, compose well, and look more beautiful. In short: it is more elegant.

"The moment you've made something iterable, you've done something magic with your code. As soon as something's iterable, you can feed it to list(), set(), sorted(), min(), max(), heapify(), sum(), ‥. Many of the tools in Python consume iterators."

— Raymond Hettinger (source)

Recently, Clojure added transducers to the language, which is a concept pretty similar to generators in Python. (I highly recommend watching Rich Hickey's talk at Strange Loop 2014 where he introduces them.)

In the video, he talks about "pouring" one collection into another, which I think is a verb that very intuitively describes the nature of iterators in relationship to datastructures. I'm going to write about this idea in more detail in a future blog post.

Example

Here's an example of a pattern commonly seen:

def get_lines(f):
    result = []
    for line in f:
        if not line.startswith('#'):
            result.append(line)
    return result

lines = get_lines(f)

Now look at the equivalent thing as a generator:

def get_lines(f):
    for line in f:
        if not line.startswith('#'):
            yield line

lines = list(get_lines(f))

The Benefits

Not much of a difference at first sight, but the benefits are pretty substantial.

  • No bookkeeping. You don't have to create an empty list, append to it, and return it. One more variable gone;
  • Hardly consumes memory. No matter how large the input file is, the iterator version does not need to buffer the entire file in memory;
  • Works with infinite streams. The iterator version still works if f is an infinite stream (e.g. stdin);
  • Faster results. Results can be consumed immediately, not after the entire file is read;
  • Speed. The iterator version runs faster than building a list the naive way;
  • Composability. The caller gets to decide how it wants to use the result.

The last bullet is by far the most important one. Let's dig in.

Composability

Composability is key here. Iterators are incredibly composable. In the example, a list is built explicitly. What if the caller actually needs a set? In practice, many people will either create a second, set-based version of the same function, or simply wrap the call in a set(). Surely that works, but it is a waste of resources. Imagine the large file again. First a list is built from the entire file. Then it's passed to set() to build another collection in memory. Then the original list is garbage collected.

With generators, the function just "emits" a stream of objects. The caller gets to decide into what collection those objects get poured.

Want a set instead of a list?

uniq_lines = set(get_lines(f))

Want just the longest line from the file? The file will be read entirely, but at most two lines are kept in memory at all times:

longest_line = max(get_lines(f), key=len)

Want just the first 10 lines from the file? No more than 10 lines will be read from the file, no matter how large it is:

from itertools import islice

head = list(islice(get_lines(f), 0, 10))

Loop Like a Native

Update: At PyCon 2013, Ned Batchelder gave a great talk, Loop Like a Native, that perfectly reflects what I tried to explain in this blog post. I highly recommend watching it.

Summary

Don't collect data in a result variable. You can almost always avoid it. You gain readability, speed, a smaller memory footprint, and composability in return.
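The laziness claims can be made observable with a small sketch. The numbered_lines() generator below is a hypothetical stand-in for an infinite input stream, and the lines_read list exists only to record what the source actually handed out. get_lines() is the generator version from the example.

```python
from itertools import islice

lines_read = []  # records every line the source produced


def numbered_lines():
    """An endless source: odd lines are comments, even lines are data."""
    n = 0
    while True:
        n += 1
        if n % 2:
            line = '# comment %d\n' % n
        else:
            line = 'line %d\n' % n
        lines_read.append(line)
        yield line


def get_lines(f):
    for line in f:
        if not line.startswith('#'):
            yield line


head = list(islice(get_lines(numbered_lines()), 0, 3))
print(head)             # ['line 2\n', 'line 4\n', 'line 6\n']
print(len(lines_read))  # 6
```

Even though the source never ends, only six lines were ever produced: just enough to find three that are not comments. A version that collects into a result list first would never return here.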

https://nvie.com/posts/use-more-iterators/
Iterables vs. Iterators vs. Generators
Occasionally I've run into situations of confusion on the exact differences between the following related concepts in Python:

  • a container
  • an iterable
  • an iterator
  • a generator
  • a generator expression
  • a {list, set, dict} comprehension

I'm writing this post as a pocket reference for later.

Containers

Containers are data structures holding elements, and that support membership tests. They are data structures that live in memory, and typically hold all their values in memory, too. In Python, some well known examples are:

  • list, deque, …
  • set, frozenset, …
  • dict, defaultdict, OrderedDict, Counter, …
  • tuple, namedtuple, …
  • str

Containers are easy to grasp, because you can think of them as real life containers: a box, a cupboard, a house, a ship, etc.

Technically, an object is a container when it can be asked whether it contains a certain element. You can perform such membership tests on lists, sets, or tuples alike:

>>> assert 1 in [1, 2, 3]      # lists
>>> assert 4 not in [1, 2, 3]
>>> assert 1 in {1, 2, 3}      # sets
>>> assert 4 not in {1, 2, 3}
>>> assert 1 in (1, 2, 3)      # tuples
>>> assert 4 not in (1, 2, 3)

Dict membership will check the keys:

>>> d = {1: 'foo', 2: 'bar', 3: 'qux'}
>>> assert 1 in d
>>> assert 4 not in d
>>> assert 'foo' not in d  # 'foo' is not a _key_ in the dict

Finally you can ask a string if it "contains" a substring:

>>> s = 'foobar'
>>> assert 'b' in s
>>> assert 'x' not in s
>>> assert 'foo' in s  # a string "contains" all its substrings

The last example is a bit strange, but it shows how the container interface renders the object opaque. A string does not literally store copies of all of its substrings in memory, but you can certainly use it that way.

NOTE:
Even though most containers provide a way to produce every element they contain, that ability does not make them a container but an iterable (we'll get there in a minute).

Not all containers are necessarily iterable. An example of this is a Bloom filter. Probabilistic data structures like this can be asked whether they contain a certain element, but they are unable to return their individual elements.
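To make that concrete, here is a toy sketch of such a structure. It is a simplified, hypothetical BloomFilter class written for illustration, not a production implementation. It defines __contains__() but no __iter__(), so it supports membership tests while refusing to be iterated.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter: supports membership tests, but not iteration."""

    def __init__(self, size=1024, num_hashes=3):
        self.size = size
        self.num_hashes = num_hashes
        self.bits = [False] * size

    def _positions(self, item):
        # Derive several bit positions per item by salting one hash function
        for seed in range(self.num_hashes):
            data = ('%d:%s' % (seed, item)).encode('utf-8')
            yield int(hashlib.sha256(data).hexdigest(), 16) % self.size

    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos] = True

    def __contains__(self, item):
        # May report false positives, never false negatives
        return all(self.bits[pos] for pos in self._positions(item))


seen = BloomFilter()
seen.add('foo')
print('foo' in seen)  # True

try:
    iter(seen)
except TypeError as exc:
    print('not iterable:', exc)
```

The filter only stores bits, not the elements themselves, which is why there is nothing it could hand back one by one.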

Iterables

As said, most containers are also iterable. But many more things are iterable as well. Examples are open files, open sockets, etc. Where containers are typically finite, an iterable may just as well represent an infinite source of data.

An iterable is any object, not necessarily a data structure, that can return an iterator (with the purpose of returning all of its elements). That sounds a bit awkward, but there is an important difference between an iterable and an iterator. Take a look at this example:

>>> x = [1, 2, 3]
>>> y = iter(x)
>>> z = iter(x)
>>> next(y)
1
>>> next(y)
2
>>> next(z)
1
>>> type(x)
<class 'list'>
>>> type(y)
<class 'list_iterator'>

Here, x is the iterable, while y and z are two individual instances of an iterator, producing values from the iterable x. Both y and z hold state, as you can see from the example. In this example, x is a data structure (a list), but that is not a requirement.

NOTE:
Often, for pragmatic reasons, iterable classes will implement both __iter__() and __next__() in the same class, and have __iter__() return self, which makes the class both an iterable and its own iterator. It is perfectly fine to return a different object as the iterator, though.
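As an illustration of the second option, here is a sketch with two hypothetical classes: Countdown is the iterable, and CountdownIterator is the separate iterator it returns.

```python
class Countdown:
    """An iterable: every call to __iter__() returns a fresh iterator."""

    def __init__(self, start):
        self.start = start

    def __iter__(self):
        return CountdownIterator(self.start)


class CountdownIterator:
    """The iterator: holds the state of one pass over the values."""

    def __init__(self, current):
        self.current = current

    def __iter__(self):
        return self

    def __next__(self):
        if self.current <= 0:
            raise StopIteration
        value = self.current
        self.current -= 1
        return value


c = Countdown(3)
first = list(c)    # [3, 2, 1]
second = list(c)   # [3, 2, 1] again: a new iterator each time

it = iter(c)
once = list(it)    # [3, 2, 1]
again = list(it)   # []: this iterator is exhausted
```

Keeping the two apart means the iterable can be looped over many times, while each iterator is used up after a single pass.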

Finally, when you write:

x = [1, 2, 3]
for elem in x:
    ...

This is what actually happens:

When you disassemble this Python code, you can see the explicit call to GET_ITER, which is essentially like invoking iter(x). The FOR_ITER is an instruction that will do the equivalent of calling next() repeatedly to get every element, but this does not show from the byte code instructions because it's optimized for speed in the interpreter.

>>> import dis
>>> x = [1, 2, 3]
>>> dis.dis('for _ in x: pass')
  1           0 SETUP_LOOP              14 (to 17)
              3 LOAD_NAME                0 (x)
              6 GET_ITER
        >>    7 FOR_ITER                 6 (to 16)
             10 STORE_NAME               1 (_)
             13 JUMP_ABSOLUTE            7
        >>   16 POP_BLOCK
        >>   17 LOAD_CONST               0 (None)
             20 RETURN_VALUE
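In plain Python, the for loop behaves roughly like the sketch below. This is an approximation for illustration, not what the interpreter literally runs. The seen list is only there to give the loop body something to do.

```python
x = [1, 2, 3]
seen = []

iterator = iter(x)  # the GET_ITER step
while True:
    try:
        elem = next(iterator)  # the FOR_ITER step, once per iteration
    except StopIteration:
        break  # the iterator is exhausted, so the loop ends
    seen.append(elem)  # the loop body

print(seen)  # [1, 2, 3]
```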

Iterators

So what is an iterator then? It's a stateful helper object that will produce the next value when you call next() on it. Any object that has a __next__() method is therefore an iterator. How it produces a value is irrelevant.

So an iterator is a value factory. Each time you ask it for "the next" value, it knows how to compute it because it holds internal state.

There are countless examples of iterators. All of the itertools functions return iterators. Some produce infinite sequences:

>>> from itertools import count
>>> counter = count(start=13)
>>> next(counter)
13
>>> next(counter)
14

Some produce infinite sequences from finite sequences:

>>> from itertools import cycle
>>> colors = cycle(['red', 'white', 'blue'])
>>> next(colors)
'red'
>>> next(colors)
'white'
>>> next(colors)
'blue'
>>> next(colors)
'red'

Some produce finite sequences from infinite sequences:

>>> from itertools import islice
>>> colors = cycle(['red', 'white', 'blue'])  # infinite
>>> limited = islice(colors, 0, 4)            # finite
>>> for x in limited:                         # so safe to use for-loop on
...     print(x)
red
white
blue
red

To get a better sense of the internals of an iterator, let's build an iterator producing the Fibonacci numbers:

>>> class fib:
...     def __init__(self):
...         self.prev = 0
...         self.curr = 1
... 
...     def __iter__(self):
...         return self
... 
...     def __next__(self):
...         value = self.curr
...         self.curr += self.prev
...         self.prev = value
...         return value
...
>>> f = fib()
>>> list(islice(f, 0, 10))
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

Note that this class is both an iterable (because it sports an __iter__() method), and its own iterator (because it has a __next__() method).

The state inside this iterator is fully kept inside the prev and curr instance variables, which are used for subsequent calls to the iterator. Every call to next() does two important things:

  1. Modify its state for the next next() call;
  2. Produce the result for the current call.

Central idea: a lazy factory
From the outside, the iterator is like a lazy factory that is idle until you ask it for a value, which is when it starts to buzz and produce a single value, after which it turns idle again.

Generators

Finally, we've arrived at our destination! Generators are my absolute favorite Python language feature. A generator is a special kind of iterator: the elegant kind.

A generator allows you to write iterators much like the Fibonacci sequence iterator example above, but in an elegant succinct syntax that avoids writing classes with __iter__() and __next__() methods.

Let's be explicit:

  • Any generator also is an iterator (not vice versa!);
  • Any generator, therefore, is a factory that lazily produces values.

Here is the same Fibonacci sequence factory, but written as a generator:

>>> def fib():
...     prev, curr = 0, 1
...     while True:
...         yield curr
...         prev, curr = curr, prev + curr
...
>>> f = fib()
>>> list(islice(f, 0, 10))
[1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

Wow, isn't that elegant? Notice the magic keyword that's responsible for the beauty:

yield

Let's break down what happened here: first of all, take note that fib is defined as a normal Python function, nothing special. Notice, however, that there's no return keyword inside the function body. The return value of the function will be a generator (read: an iterator, a factory, a stateful helper object).

Now when f = fib() is called, the generator (the factory) is instantiated and returned. No code will be executed at this point: the generator starts in an idle state initially. To be explicit: the line prev, curr = 0, 1 is not executed yet.

Then, this generator instance is wrapped in an islice(). This is itself also an iterator, so idle initially. Nothing happens, still.

Then, this iterator is wrapped in a list(), which will consume all items of its argument and build a list from them. To do so, it will start calling next() on the islice() instance, which in turn will start calling next() on our f instance.

But one step at a time. On the first invocation, the code will finally run a bit: prev, curr = 0, 1 gets executed, the while True loop is entered, and then it encounters the yield curr statement. It will produce the value that's currently in the curr variable and become idle again.

This value is passed to the islice() wrapper, which will produce it (because it's not past the 10th value yet), and list can add the value 1 to the list now.

Then, it asks islice() for the next value, which will ask f for the next value, which will "unpause" f from its previous state, resuming with the statement prev, curr = curr, prev + curr. Then it re-enters the next iteration of the while loop, and hits the yield curr statement, returning the next value of curr.

This happens until the output list is 10 elements long and when list() asks islice() for the 11th value, islice() will raise a StopIteration exception, indicating that the end has been reached, and list will return the result: a list of 10 items, containing the first 10 Fibonacci numbers. Notice that the generator doesn't receive the 11th next() call. In fact, it will not be used again, and will be garbage collected later.
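The walkthrough above can be checked with a small tracing sketch. It is the same fib() generator, with an events list added purely for illustration, and a slice of 3 instead of 10 to keep the trace short.

```python
from itertools import islice

events = []  # records the order in which things happen


def fib():
    events.append('start')  # runs on the first next() call only
    prev, curr = 0, 1
    while True:
        events.append('yield %d' % curr)
        yield curr
        prev, curr = curr, prev + curr


f = fib()
events.append('created')   # no generator code has run yet
sliced = islice(f, 0, 3)
events.append('sliced')    # still nothing has run
result = list(sliced)      # now the values are pulled through

print(result)  # [1, 1, 2]
print(events)  # ['created', 'sliced', 'start', 'yield 1', 'yield 1', 'yield 2']
```

The trace shows that the generator body only starts once list() asks for the first value, and that the generator is never asked for a fourth value.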

Types of Generators

There are two types of generators in Python: generator functions and generator expressions. A generator function is any function in which the keyword yield appears in its body. We just saw an example of that. The appearance of the keyword yield is enough to make the function a generator function.

The other type of generator is the generator equivalent of a list comprehension. Its syntax is really elegant for a limited use case.

Suppose you use this syntax to build a list of squares:

>>> numbers = [1, 2, 3, 4, 5, 6]
>>> [x * x for x in numbers]
[1, 4, 9, 16, 25, 36]

You could do the same thing with a set comprehension:

>>> {x * x for x in numbers}
{1, 4, 36, 9, 16, 25}

Or a dict comprehension:

>>> {x: x * x for x in numbers}
{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36}

But you can also use a generator expression (note: this is not a tuple comprehension):

>>> lazy_squares = (x * x for x in numbers)
>>> lazy_squares
<generator object <genexpr> at 0x10d1f5510>
>>> next(lazy_squares)
1
>>> list(lazy_squares)
[4, 9, 16, 25, 36]

Note that, because we read the first value from lazy_squares with next(), its state is now at the "second" item, so when we consume it entirely by calling list(), that will only return the partial list of squares. (This is just to show the lazy behaviour.) This is as much a generator (and thus, an iterator) as the other examples above.

Summary

Generators are an incredibly powerful programming construct. They allow you to write streaming code with fewer intermediate variables and data structures. Besides that, they are more memory and CPU efficient. Finally, they tend to require fewer lines of code, too.

Tip to get started with generators: find places in your code where you do the following:

def something():
    result = []
    for ... in ...:
        result.append(x)
    return result

And replace it by:

def iter_something():
    for ... in ...:
        yield x

# def something():  # Only if you really need a list structure
#     return list(iter_something())

https://nvie.com/posts/iterators-vs-generators/
Writing a Command-Line Tool in Python
I used to find writing command line tools tedious. Not so much the writing of the core of the tool itself, but all the peripheral stuff you had to do to actually finish one.

The Language?

The first issue is to pick the language to implement it in: do I use Python, which I'm intimately familiar with, or a Unix shell script? With shell scripts, the syntax is pretty terrible, but the tool typically fits in a single file and there's hardly any overhead running them. On the other hand, making sure the tool works under all circumstances can be tricky. Shell scripts are notorious for breaking when you feed them arguments with spaces. The burden of making sure you properly quote all the variable interpolations in the script is on the programmer. It's possible to do, just unnecessarily hard.

On the other hand, Python is so much more expressive. There are a ton of libraries out there ready to use, and Python itself includes a lot of batteries already in its standard library, of course.

Distribution?

Python comes with its own set of problems, though. Python runtime environments are typically a mess, and I don't want to further pollute people's already cluttered global Python environments. With Python, installing a package is typically just a pip install <pkg> away, but it requires another tedious step: writing a setup.py.

When it comes to distributing the script, a shell script may be much easier. With shell scripts, it's a single file that needs to be copied somewhere, either manually or via a make install command. The latter involves adding a Makefile and dealing with subtle differences for each Unix platform, not to mention trying to run it on Windows machines.

Argument Parsing?

Each script will at some stage require some options or arguments. How should we do the argument parsing? Do I use getopt or getopts? Does it even matter? Can it take --long-form-options? Or do I resign myself to poor man's arg parsing again? The latter has too often become the default choice.

Standing on the Shoulders of Giants

Lately, a few fantastic projects have taken away most of the tedious work surrounding the building of command line tools, and almost make it trivial now.

Click

Click is a Python library written by Armin Ronacher that handles all command line option and argument parsing and comes with fantastic defaults. This project is a great step towards more consistent and standard CLIs. Besides solving option and argument parsing, it also packs a ton of useful features, like smart colorized terminal output, file abstractions, subcommands, and progress bar rendering.

It solves the argument parsing problem.

pipsi

Using pipsi (also by Armin!), users can install any Python command line script into an isolated Python runtime environment, so it solves the global cluttered Python environment problem entirely.

Cookiecutter

Cookiecutter (by the awesome Audrey Roy Greenfield) is a project generator, based on a predefined project template. It will read the template, ask the user a few questions to fill in the blanks, and generate a new project for you.

cookiecutter-python-cli

cookiecutter-python-cli is one such Cookiecutter template I wrote that uses all of the above: it sports a predefined setup.py, a package structure that's extensible, and test cases and a test runner to get you started.

Putting it Together

Let's build a new high quality CLI in Python in under 60 seconds now.

First, install pipsi and follow its instructions:

$ curl https://raw.githubusercontent.com/mitsuhiko/pipsi/master/get-pipsi.py | python

Next, using pipsi, install Cookiecutter in its own isolated runtime environment:

$ pipsi install cookiecutter

Now use Cookiecutter to create your brand new project, based on my CLI template:

$ cd ~/Desktop
$ cookiecutter https://github.com/nvie/cookiecutter-python-cli.git
Cloning into 'cookiecutter-python-cli'...
remote: Counting objects: 64, done.
remote: Total 64 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (64/64), done.
Checking connectivity... done.
full_name (default is "Vincent Driessen")?
email (default is "vincent@3rdcloud.com")?
github_username (default is "nvie")?
project_name (default is "My Tool")?
repo_name (default is "python-mytool")?
pypi_name (default is "mytool")?
script_name (default is "my-tool")?
package_name (default is "my_tool")?
project_short_description (default is "My Tool does one thing, and one thing well.")?
release_date (default is "2014-09-04")?
year (default is "2014")?
version (default is "0.1.0")?

When you're done, you'll have a project where you can run tox to run your test suite on all important Python versions. If you don't need the test cases, simply remove the tests/ directory.

$ cd python-mytool/
$ tox
...
  py26: commands succeeded
  py27: commands succeeded
  py33: commands succeeded
  py34: commands succeeded
  pypy: commands succeeded
  flake8: commands succeeded
  congratulations :)

Let's install and run it without further modifications:

$ pipsi install --editable .
...
$ my-tool
Hello, world.
$ my-tool --as-cowboy
Howdy, world.
$ my-tool --as-cowboy Vincent
Howdy, Vincent.
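
The generated source is not reproduced in this post. As a rough sketch only, a Click command that behaves like the session above could look as follows. The function name main and the help texts are assumptions, and the actual template may differ.

```python
import click


@click.command()
@click.option("--as-cowboy", "-c", is_flag=True, help="Greet as a cowboy.")
@click.argument("name", default="world", required=False)
def main(name, as_cowboy):
    """Print a greeting for NAME (defaults to "world")."""
    greeting = "Howdy" if as_cowboy else "Hello"
    click.echo("{0}, {1}.".format(greeting, name))
```

Click derives the --help output from the decorators and the docstring, so there is no hand-written parsing code to maintain. For the my-tool command to exist on the PATH, a function like this would still need to be registered as a console script in setup.py.
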

You can edit the setup.py to your liking. The default provided version should already work out of the box. When you're done implementing your tool, you can either upload it to PyPI or just keep it to yourself locally:

$ pipsi install <pkgname>  # install from PyPI
$ pipsi install --editable ../path/to/project/dir   # install locally

If you have any improvements for this template, please submit a pull request. Thanks!

https://nvie.com/posts/writing-a-cli-in-python-in-under-60-seconds/
Better Package Management

Update (Feb 2026): This article is really really old. You probably want to use uv today.
Update (March 2019): The Python packaging landscape has changed significantly since I first wrote this post. Your choice today is mostly between using pip-tools directly, using Pipenv (which is a Swiss army knife kind of tool that internally relies on pip-tools), or newer tooling like Poetry. A good post to help you decide which is the tool to best fit your use case is Python Application Dependency Management in 2018 by Hynek Schlawack.

You are managing your Python packages using pip and requirements.txt spec files already. Maybe, you are even pinning them too—that’s awesome. But how do you keep your environments clean and fresh?

Here’s what I think can be improved about the state of package management in Python.

Virtue 1: Declare only your top-level dependencies

Often, your project will only need a limited set of what I’ll call top-level package dependencies. A typical example is that you’ll depend on Django or Flask. But just putting those names in a requirements.txt file is inherently dangerous and will bite you at some point. If you don’t see why, read this post first.

So now you’re pinning them. If your app needs Flask, this will typically be in your requirements.txt file:

Flask==0.9
Jinja2==2.6
Werkzeug==0.8.3

Jinja2 and Werkzeug are in there, because Flask needs them. And since you don’t want fate to decide which versions of Jinja2 and Werkzeug you’ll get when deploying, you’re wisely pinning them.

The problem with this is that over time your requirements.txt file will accumulate all kinds of dependencies, and in reality, it’s not unusual that you’ll lose sight of which packages are still used, and which have become stale.

The following file is the result of depending on Flask and legit.

async==0.6.1
clint==0.3.1
Flask==0.9
gitdb==0.5.4
GitPython==0.3.2.RC1
Jinja2==2.6
legit==0.1.1
smmap==0.8.2
Werkzeug==0.8.3

Looking at this, I’d have no clue what smmap is, and why it’s needed in there.

Wouldn’t it be awesome to actually have a way of expressing only your top-level dependencies in a file called requirements.in, like this?

Flask>=0.9  # we use 0.9 features
legit       # any version will do for us

And “compiling” that to an actual requirements.txt:

async==0.6.1  # required by legit==0.1.1
clint==0.3.1  # required by legit==0.1.1
Flask==0.9
gitdb==0.5.4  # required by legit==0.1.1
GitPython==0.3.2.RC1  # required by legit==0.1.1
Jinja2==2.6  # required by Flask==0.9
legit==0.1.1
smmap==0.8.2  # required by legit==0.1.1
Werkzeug==0.8.3  # required by Flask==0.9

This tool exists, and is called pip-compile. Check it out on the future branch of pip-tools. (UPDATE This is now the master branch, available since 1.0.) I wrote this together with Bruno Renié over the last few months.

Let’s elaborate on this a bit. The .in file provides the file format that you’d actually want to use and maintain as a developer, while the result of compilation is the file that you want to use to build deterministic (and thus predictable) envs.

Note that there’s a fundamental difference here between “compiling” these .in files and compiling a file of source code: the result of the compilation itself isn’t deterministic. This means that compiling your requirements may lead to a different requirements.txt file depending on the moment you run it—because in the meantime some packages might have gotten updates in PyPI.

The point is to freeze the specs. Exactly why you were pinning your dependencies already.

As a consequence, you should put both files under version control. This plays well with PaaS providers like Heroku, too. The .in file is only used for your own maintenance convenience, while the .txt file is actually used to install to your env. The difference is, it’s now generated for you.

A Quick Note on Complex Dependencies
We’ve created pip-compile to be smart with respect to resolving complex dependency trees. For example, Flask 0.9 depends on Jinja2>=2.4. If another package, say Foo, declared Jinja2<2.6, you’ll end up having Jinja2==2.5 in your compiled requirements. It can figure this out. (Obviously, conflicts can occur, in which case compilation will fail.)
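
To make the Jinja2 example concrete, here is a minimal sketch of the idea: intersect all declared constraints and pick the highest version that satisfies them. This is not pip-compile's actual resolver. The candidate list is hypothetical, and the version parsing is deliberately naive (dotted integers only).

```python
import operator

# Two-character operators come first so ">=" is not mistaken for ">".
OPERATORS = [
    (">=", operator.ge),
    ("<=", operator.le),
    ("==", operator.eq),
    ("<", operator.lt),
    (">", operator.gt),
]


def parse_version(text):
    """Turn "2.4.1" into (2, 4, 1) so versions compare numerically."""
    return tuple(int(part) for part in text.split("."))


def satisfies(version, spec):
    """Check one version against one constraint such as ">=2.4"."""
    for symbol, compare in OPERATORS:
        if spec.startswith(symbol):
            bound = spec[len(symbol):]
            return compare(parse_version(version), parse_version(bound))
    raise ValueError("unsupported constraint: " + spec)


def pick_version(candidates, specs):
    """Return the highest candidate that satisfies every constraint."""
    matching = [v for v in candidates if all(satisfies(v, s) for s in specs)]
    if not matching:
        raise LookupError("conflict: nothing satisfies " + ", ".join(specs))
    return max(matching, key=parse_version)


# Hypothetical list of available Jinja2 releases, for illustration only.
available = ["2.4", "2.4.1", "2.5", "2.6"]
print(pick_version(available, [">=2.4", "<2.6"]))
```

With these candidates, the constraints from Flask (>=2.4) and Foo (<2.6) leave 2.5 as the highest match. Contradictory constraints raise an error, which corresponds to a failed compilation.
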

Virtue 2: Have your envs reflect your specs

The next step, then, would be to rebuild your actual virtualenvs by having them reflect exactly what’s in your (compiled) spec file. Let’s replay the example above.

Recall that we have this in our requirements.in file:

Flask>=0.9
legit

Then we run pip-compile, and get:

async==0.6.1  # required by legit==0.1.1
clint==0.3.1  # required by legit==0.1.1
Flask==0.9
gitdb==0.5.4  # required by legit==0.1.1
GitPython==0.3.2.RC1  # required by legit==0.1.1
Jinja2==2.6  # required by Flask==0.9
legit==0.1.1
smmap==0.8.2  # required by legit==0.1.1
Werkzeug==0.8.3  # required by Flask==0.9

Now, to actually install that into our environment, we typically run:

$ pip install -r requirements.txt

But frankly, this isn’t enough. To actually reliably mimic the spec file, the env might need to uninstall some packages as well. This can actually be very important. Suppose you have a package that’s already installed in your env, say requests. Your code is using it, but you forgot to add it to requirements.txt. That way, running pip install -r requirements.txt will work fine, but deploying this code will break due to an ImportError.

Meet pip-sync. This tool will install all required packages into your env, but will additionally uninstall everything else in there. Combined with pip-compile, this makes for package management nirvana. Say you don’t need legit anymore, and want to remove it as a project dependency.

First, remove that top-level dependency from the .in file:

Flask
# legit  # comment out, or remove

Then run pip-compile to update the compiled spec file:

Flask==0.9
Jinja2==2.6  # required by Flask==0.9
Werkzeug==0.8.3  # required by Flask==0.9

The unused dependencies are removed automatically. Now we need to sync that back to our actual env:

$ pip-sync
Uninstalling package async
Uninstalling package clint
Uninstalling package gitdb
Uninstalling package GitPython
Uninstalling package smmap

This will now uninstall legit and all its dependencies from the virtualenv (unless some other package still depends on them). Your virtualenv is crisp and clean.
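
The core of syncing can be thought of as a diff between what is installed and what the compiled spec file asks for. The sketch below illustrates that idea with plain dictionaries. It is not pip-sync's implementation, the function name is made up, and it ignores details such as leaving pip itself alone.

```python
def plan_sync(installed, required):
    """Compare {name: version} mappings and return (to_install, to_uninstall)."""
    # Anything missing or at the wrong version must be (re)installed.
    to_install = {
        name: version
        for name, version in required.items()
        if installed.get(name) != version
    }
    # Anything installed that the spec does not mention must go.
    to_uninstall = sorted(
        (name for name in installed if name not in required),
        key=str.lower,
    )
    return to_install, to_uninstall


installed = {
    "async": "0.6.1", "clint": "0.3.1", "Flask": "0.9", "gitdb": "0.5.4",
    "GitPython": "0.3.2.RC1", "Jinja2": "2.6", "legit": "0.1.1",
    "smmap": "0.8.2", "Werkzeug": "0.8.3",
}
required = {"Flask": "0.9", "Jinja2": "2.6", "Werkzeug": "0.8.3"}

to_install, to_uninstall = plan_sync(installed, required)
print(to_install)
print(to_uninstall)
```

The uninstall half is exactly what a plain pip install -r requirements.txt never does.
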

I would propose that PaaS providers adopt pip-sync over pip install -r requirements.txt, as environments are automatically cleaned up that way.

Project context and roadmap

As said, over the last few months, Bruno Renié and I have been working on a better version of the pip-tools project, one that would let us do exactly the above. We’ve not been very public about it, but you might have noticed the future branch. Basically, this would replace the existing pip-dump command with something inherently more manageable.

I do solicit feedback on all this, so feel free to get in touch.

https://nvie.com/posts/better-package-management/
Pin Your Packages
Make your Python production deployments predictable and deterministic by pinning your dependencies.

In building your Python application and its dependencies for production, you want to make sure that your builds are predictable and deterministic. Therefore, always pin your dependencies.

Update: A newer blog post about the future of pip-tools is available too: Better Package Management.

Pin Explicitly

Don’t ever use these styles in requirements.txt:

  • lxml
  • lxml>=2.2.0
  • lxml>=2.2.0,<2.3.0

Instead, pin them:

  • lxml==2.3.4

If you don’t, you can never know what you’ll get when you run pip install. Even if you rebuild the env every time, you still can’t predict it. The outcome relies on a) what’s currently installed, and b) what’s the current version on PyPI.
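
If you want to enforce this rule mechanically, a small check along these lines could flag every line that is not pinned with ==. It is a simplified sketch: the function name is made up, and it does not understand extras, -r includes, URLs, or environment markers.

```python
import re

# A bare package name, followed by "==" and an exact version.
PINNED = re.compile(r"^[A-Za-z0-9._-]+==[^=<>!~\s]+$")


def unpinned_requirements(text):
    """Return the requirement lines that are not pinned to one exact version."""
    problems = []
    for raw_line in text.splitlines():
        line = raw_line.split("#", 1)[0].strip()  # drop comments and whitespace
        if line and not PINNED.match(line):
            problems.append(line)
    return problems


example = """\
lxml
lxml>=2.2.0
lxml>=2.2.0,<2.3.0
lxml==2.3.4
"""
print(unpinned_requirements(example))
```

Only the last line of the example passes. The first three are reported.
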

Eventually, all of your environments, and those of your team members, will run out of sync. Even worse, this cannot be fixed by rerunning pip install. It’s just waiting for bad things to happen in production.

The only way to make your builds deterministic is to pin every single package dependency (even the dependency’s dependencies).

WARNING: Don’t pin by default when you’re building libraries! Only use pinning for end products.

The biggest complaint from folks regarding explicit pinning is that you won’t benefit from updates that way. Well, yes, you won’t. But think about it. It’s impossible to distinguish between a new release that fixes bugs and one that introduces them. You are leaving it up to coincidence. There is only one way to retake control: pin every dependency.

Check for Updates Automatically

So: we want to pin packages, but don’t want to let them become outdated. The solution: use a tool that can check for updates. This is exactly what I built pip-tools for.

pip-tools is the collective name for two tools: pip-review and pip-dump.

pip-review

It will check for available updates of all packages currently installed in your environment, and report about them when available:

$ pip-review
requests==0.14.0 available (you have 0.13.2)
redis==2.6.2 available (you have 2.4.9)
rq==0.3.2 available (you have 0.3.0)
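
Conceptually, such a report boils down to comparing what you have against what is newest. The sketch below illustrates this with hard-coded data. It is not how pip-review is implemented: the latest versions are supplied by hand instead of being looked up on PyPI, and versions are compared for inequality only.

```python
def review(installed, latest):
    """Build report lines for packages whose newest version differs.

    `installed` is a list of (name, version) pairs, `latest` maps each
    name to the newest known version. Both are supplied by the caller.
    """
    lines = []
    for name, have in installed:
        newest = latest.get(name, have)
        if newest != have:
            lines.append("{0}=={1} available (you have {2})".format(name, newest, have))
    return lines


installed = [("requests", "0.13.2"), ("redis", "2.4.9"), ("rq", "0.3.0"), ("Flask", "0.9")]
latest = {"requests": "0.14.0", "redis": "2.6.2", "rq": "0.3.2", "Flask": "0.9"}

for line in review(installed, latest):
    print(line)
```
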

You can also install them automatically:

$ pip-review --auto
... 

or interactively decide whether you want to install each package:

$ pip-review --interactive
requests==0.14.0 available (you have 0.13.2)
Upgrade now? [Y]es, [N]o, [A]ll, [Q]uit y
... 
redis==2.6.2 available (you have 2.4.9)
Upgrade now? [Y]es, [N]o, [A]ll, [Q]uit n
rq==0.3.2 available (you have 0.3.0)
Upgrade now? [Y]es, [N]o, [A]ll, [Q]uit y
... 

It’s advisable to pick a fixed schedule to run pip-review. For example, every Monday during a weekly standup meeting with your engineering team. Make it a point on the agenda. You discuss pip-review’s output, inspect changelogs, or just blindly upgrade them. The important part is that you do it explicitly. You have the chance to run with the upgraded versions for a while in a development environment, before pushing those versions to production.

pip-dump

Whereas pip-review solves the problem of how to check for updates of pinned packages, pip-dump focuses on the problem of how to dump those definitions into requirements files, managed under version control.

Typically, in Python apps, you include a requirements.txt file in the root of your project directory, and you run pip freeze > requirements.txt periodically. While this works for simple projects, it doesn’t scale. Some packages are installed for development or personal purposes only, and you don’t want those to end up in requirements.txt, to go to production, or to be visible to your other team members.

pip-dump provides a smarter way to dump requirements. It understands the convention of separating requirements into multiple files, following the naming convention:

  • requirements.txt is the main (and default) requirements file;
  • dev-requirements.txt, or test-requirements.txt, or actually, *requirements.txt, are secondary dependencies.

When you have a requirements.txt and dev-requirements.txt file in your project, with the following content:

# requirements.txt
Flask==0.9

# dev-requirements.txt
ipython

Then simply running pip-dump will result in the following output:

# requirements.txt
Flask==0.9
Jinja2==2.6
Werkzeug==0.8.3

# dev-requirements.txt
ipython==0.13

It keeps the files sorted for tidiness, and to reduce the chance of merge conflicts in version control.

You can even put packages in an optional file called .pipignore. This is useful if you want to keep some packages installed in your local environment, but don’t want to have them reflected in your requirements files.

Contributing

pip-tools 0.x is relied on by many already on a daily/weekly basis. It’s worth noting that we’re working on Better Package Management too, which will be the future of pip-tools. If you want to contribute, please shout out.

https://nvie.com/posts/pin-your-packages/
Open Sourcing: the Ultimate Isolation
Some thoughts on how I like to write libraries as "open source" projects.

Reflecting on how I build software lately, I noticed a pattern. I tend to write libraries in absolute isolation, as if they were open sourced and the world is watching along.

Let me try to explain why this works for me.

Where Theory Fails

“The difference between theory and practice is that, in theory, there is none.”

We have all been schooled to isolate units of software into reusable components. Software engineering literature has referred to this as separation of concerns for decades. It reduces the big problem into smaller non-overlapping problems.

We obviously try doing so, by putting related logic into modules, libraries and whatnot. Yet, in practice, so many real world projects fail at their attempts and end up evolving into something unnecessarily complicated.

The main problem with this is that it becomes increasingly hard to comprehend and reason about your software, not to mention the “increased fun” maintaining it.

Why is it so hard to actually achieve this in practice? Separating concerns apparently is easier said than done.

How Things Evolve

The need for a new library often arises when solving a larger problem top-down. In the quest of solving a larger problem, you need to create a smaller component first that is required to get to a fully working solution. This is what most of our work as engineers is about—while solving a larger problem, we run into bumps along the road. When we do, we stop, fix the bump, and continue on our journey to solving the large problem.

In our rush to arrive at our end destination, we want to fix any bumps as quickly as possible. For many good reasons, mostly. We might have a deadline, or we are afraid to get lost in details and lose focus on the bigger problem we were actually solving.

In short, we tend to see those bumps as unsolicited chores that are blocking us and we want to spend as little time as possible at overcoming them. From a quality perspective, however, this may not be the best route to take. So the least you can do is create that Technical Debt ticket, feel less guilty, and move on :)

I’ve never seen a project work any more gloriously than this.

Step Back, Breathe, Accept, Open Source

Running into these bumps sucks. You’re frustrated that you’re held up and are continuously thinking: dammit, I don’t want to deal with this now. The reality often is that you don’t have a choice.

Instead, step back a few steps, take a deep breath, and accept that you’ll have to spend more time on this problem than budgeted. This gives you the mental rest to make a good engineering decision without too much frustration involved.

What I like to do at this moment, is to start a new open source project.

Not necessarily a public one, but I do set it up like it is and actually consider it to be, or eventually become, open. I start out with a README describing the problem and the API I’d like it to have. And in the case of Python, I also create a setup.py so integrating this into the original project is only a pip install away.

Then, just start implementing it.

Let me try to highlight the benefits this approach provides.

You’re More Likely To Do It Right

The pressure you get from pretending (or knowing) that many others will read your code, pushes you to do it right. I’d ask myself continuously: would this be an API that I’d show off to the outside world and be proud of? Could I truly explain this API in a README so that people would understand? If not, I don’t implement it like that and push harder.

Many eyeballs make you feel more responsible. Writing stuff in private for yourself, doesn’t.

No Cheating

A big difference between starting an actual new project, or developing it as one-of-many internal libraries, is that it’s impossible to rely on other parts of the end product. For example, that convenient project-specific helper function you already wrote is easily included in a module, but not so much in another project.

In an open source project, you simply cannot cheat on yourself this way and you’re forced to come up with a better solution. This might feel inconvenient at first, but remember that it’s easy to write complex software and it takes more care and dedication to write simpler software.

As a way of visualising this approach: compare programming to electrical engineering for a minute. Say you have to create a circuit board of some sort.

The chip on this board is analogous to an open source software project. Its internals are nicely abstracted away, the pins of the chip form its API, it’s probably well-tested, well-documented and can be reused immediately. It is physically impossible to connect to any of the internals of it—which is exactly the point of abstraction.

Looking at the circuit board, everything about this falls into place pretty obviously.

As programmers, we often fool ourselves that we’re isolating logic into modules/libraries, while in fact we’re merely organising it. Modules will oftentimes still contain project-specific dependencies. (As a good litmus test, move that module to an empty directory and use it. If it breaks, it wasn’t truly isolated.)
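
For single-file Python modules, this litmus test can be automated. The following is a minimal sketch under a few assumptions: the function name is made up, it only checks that the module imports (not that it works), and project code that is installed or on PYTHONPATH would still be found.

```python
import shutil
import subprocess
import sys
import tempfile
from pathlib import Path


def imports_in_isolation(module_file):
    """Copy a single-file module to an empty directory and try to import it."""
    module_file = Path(module_file)
    with tempfile.TemporaryDirectory() as empty_dir:
        shutil.copy(str(module_file), empty_dir)
        # Use a fresh interpreter so already-imported project code cannot help.
        code = "import sys; sys.path.insert(0, '.'); import {0}".format(module_file.stem)
        result = subprocess.run(
            [sys.executable, "-c", code],
            cwd=empty_dir,
            stdout=subprocess.DEVNULL,
            stderr=subprocess.DEVNULL,
        )
    return result.returncode == 0
```

A module that quietly imports a project-specific helper fails this check as soon as the helper is no longer sitting next to it.
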

The curse with programming is that it’s so easy to create these dependencies. They are only one import statement away. Developers live in a world where that temptation continuously lurks.

But by isolating code into a stand-alone project, you can remove this temptation wholly, thereby reducing ways of cheating on yourself.

Simplicity Pays

Another big benefit of truly isolating your libraries this way, is that you are forced to think about its API. It’s the only way of interacting with the library after all. Doing this, you’ll naturally feel the urge to simplify. Complex APIs lead to complicated documentation and complex tests. The opposite applies, too, fortunately, and you’ll naturally be inclined to simplify.

A concrete example of this: When you are hacking in a web environment, you most likely have “the request” or “the DB connection” at your disposal any time. When you put your library in a module, it’s easy for these to become implicit dependencies of your library. Your library may pretty well work outside of a request context, however, and in fact, the only thing you actually need from the request could be a User instance. When you build your library as a separate project, these decisions fall into place effortlessly. In the end, this makes your library more decoupled, more generic, and overall cleaner. And as such, simpler.
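
In code, the decoupled version of such a library function could look like the sketch below. The User class, the function name, and the request.user attribute in the comment are all hypothetical and only serve the illustration.

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str
    email: str


def welcome_message(user):
    """Needs only a user, so it also works outside of any request context."""
    return "Welcome back, {0} <{1}>".format(user.name, user.email)


# Inside a web app, the caller unpacks the request and passes on only
# what the library needs, for example:
#     welcome_message(request.user)
print(welcome_message(User("Vincent", "vincent@example.com")))
```

Because the function never sees the request, it can be called from a background job, a shell, or a test just as easily as from a view.
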

True Isolation™ is the ultimate catalyst of simplicity.

Sharing Can Pay Off, Too

Even if you’re only using this technique privately to produce better software for yourself, this pays off already in a technical sense.

But open sourcing can also pay off in non-technical ways. When it fits your company’s strategy, you now have the choice to actually publish your project at any time, since it’s been written for the public from the beginning. If it solves a common problem, others may like it and take interest in following it or even contributing to it. This may open up a whole new world of users providing feedback and improvements through code or documentation contributions.

Your company may come across to talented people as an interesting place to work. Your open source project can be your company’s banner. We’ve seen this with companies like Joyent (of Node.js fame), 10gen (of MongoDB fame) and Opscode (of Chef fame), just to name a few. Open sourcing has been an important marketing value to these companies and they have attracted many talented folks through their high-quality work.

Just always remember that simpler projects have a much lower barrier for contributors, so these are more likely to receive patches. Which by itself is another good reason to simplify your libraries :)

How I Built RQ

Many of the things I created recently, I created this way. A few months ago, I needed a super-simple solution to put work in the background. I was working on a startup idea, which I was creating a proof of concept for. It was a small Flask web app, and I used this snippet initially to offload work to the background. It did the work fine, but I soon needed it to do more, so I kept tweaking it until it was no longer a snippet, but a library. Although it was nicely organised in a directory, it was still tailored to the specific product I was creating.

This is where I decided to step back and started building that library like an open source project, which became RQ, of course. After using it privately for about four months, its API kept changing quite a bit, but its use became more general over time. I started reusing it for other projects I was working on, until I considered it stable enough. I believed it could be of help to other Python engineers, so I decided to open source RQ in March.

Eventually I dropped the original startup idea, but I still have RQ. Had I not open sourced it, it would now be buried with the rest of that project’s code.

Open sourcing pays off. Even if you do it in private.

https://nvie.com/posts/open-sourcing-is-the-ultimate-isolation/
Introducing RQ

Today, I’m open sourcing a project that I’ve been working on for the last few months. It is a Python library to put work in the background, that you’d typically use in a web context. It is designed to be simple to set up and use, and be of help in almost any modern Python web stack.

Existing solutions

Of course, there already exist a few solutions to this problem. Celery (by the excellent @asksol) is by far the most popular Python framework for working with asynchronous tasks. It is agnostic about the underlying queueing implementation, which is quite powerful, but also poses a learning curve and requires a fair amount of setup.

Don’t get me wrong: I think Celery is a great library. In fact, I’ve contributed to Celery myself in the past. My experience, however, is that as your Python web project grows, there comes a moment where you want to start offloading small pieces of code into the background. Setting up Celery for these cases is a substantial effort that isn’t done swiftly and might hold you back.

I wanted something simpler. Something that you’d use in all of your Python web projects, not only the big and serious ones.

Redis as a broker

In many modern web stacks, chances are that you’re already using Redis (by @antirez). Besides being a kick-ass key value store, Redis also provides semantics to build a perfect queue implementation. The commands RPUSH, LPOP and BLPOP are all it takes.
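
To see why these commands are enough for a queue, here is a tiny in-memory stand-in that mimics the semantics of RPUSH and LPOP on a Redis list. It does not talk to Redis at all, the class name is made up, and the blocking behaviour of BLPOP is not modelled.

```python
from collections import deque


class InMemoryLists:
    """In-memory imitation of Redis list semantics, for illustration only."""

    def __init__(self):
        self._lists = {}

    def rpush(self, key, value):
        """Append value to the tail of the list at key; return the new length."""
        items = self._lists.setdefault(key, deque())
        items.append(value)
        return len(items)

    def lpop(self, key):
        """Remove and return the head of the list at key, or None if empty."""
        items = self._lists.get(key)
        return items.popleft() if items else None


store = InMemoryLists()
store.rpush("queue", "first job")
store.rpush("queue", "second job")
print(store.lpop("queue"))
```

Pushing on one end and popping from the other gives first-in, first-out ordering, which is all a basic job queue needs.
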

Inspired by Resque (by defunkt) and the simplicity of this Flask snippet (by @mitsuhiko), I’ve challenged myself to imagine just how hard a job queue library really should be.

Introducing RQ

I wanted a solution that was lightweight, easy to adopt, and easy to grasp. So I devised a simple queueing library for Python, and dubbed it RQ. In a nutshell, you define a job like you would any normal Python function.

def myfunc(x, y):
    return x * y

Now, with RQ, it is ridiculously easy to put it in the background like this:

from rq import use_connection, Queue

# Connect to Redis
use_connection()

# Offload the "myfunc" invocation
q = Queue()
q.enqueue(myfunc, 318, 62)

This puts the equivalent of myfunc(318, 62) on the default queue. Now, in another shell, run a separate worker process to perform the actual work:

$ rqworker
12:46:56:
12:46:56: *** Listening on default...
12:47:35: default: mymodule.myfunc(318, 62) (38d9c157-e997-40e2-8d20-574a97ec5a99)
12:47:35: Job OK, result = 19716
12:47:35:
12:47:35: *** Listening on default...
...

To poll for the asynchronous result in the web backend, you can use:

>>> r = q.enqueue(myfunc, 318, 62)
>>> r.return_value
None
>>> import time
>>> time.sleep(2)
>>> r.return_value
19716

I must admit, though, that polling for job results through return_value isn’t very useful and probably won’t be a pattern that you’d use in your day-to-day work. (I would certainly recommend against doing that, at least.)

There’s extensive documentation available at: http://nvie.github.com/rq.

Near-zero configuration

RQ was designed to be as easy as possible to start using immediately inside your Python web projects. You only need to pass it a Redis connection to use, because I didn’t want it to create new connections implicitly.

To use the default Redis connection (to localhost:6379), you only have to do this:

from rq import use_connection
use_connection()

You can reuse an existing Redis connection that you are already using and pass it into RQ’s use_connection function:

import redis
from rq import use_connection

my_connection = redis.Redis(host='example.com', port=6379)
use_connection(my_connection)

There are more advanced ways of connection management available however, so please pick your favorite. You can safely mix your Redis data with RQ, as RQ prefixes all of its keys with rq:.

Building your own queueing system

RQ offers functionality to put work on queues. It provides FIFO-semantics per queue, but how many queues you create is up to you. For the simplest cases, simply using the default queue suffices already.

>>> q = Queue()
>>> q.name
'default'

But you can name your queues however you want:

>>> lo = Queue('low')
>>> hi = Queue('high')
>>> lo.enqueue(myfunc, 2, 3)
>>> lo.enqueue(myfunc, 4, 5)
>>> hi.enqueue(myfunc, 6, 7)
>>> lo.count
2
>>> hi.count
1

Both queues are equally important to RQ. None of these has higher priority as far as RQ is concerned. But when you start a worker, you are defining queue priority by the order of the arguments:

$ rqworker high low
12:47:35:
12:47:35: *** Listening on high, low...
12:47:35: high: mymodule.myfunc(6, 7) (cc183988-a507-4623-b31a-f0338031b613)
12:47:35: Job OK, result = 42
12:47:35:
12:47:35: *** Listening on high, low...
12:47:35: low: mymodule.myfunc(2, 3) (95fe658e-b23d-4aff-9307-a55a0ee55650)
12:47:36: Job OK, result = 6
12:47:36:
12:47:36: *** Listening on high, low...
12:47:36: low: mymodule.myfunc(4, 5) (bfb89229-3ce4-463c-abf8-f19c2808cb7c)
12:47:36: Job OK, result = 20
...

First, all work on the high queue is done (with FIFO semantics), then low is emptied. If meanwhile work is enqueued on high, that work takes precedence over the low queue again after the currently running job is finished.
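
The ordering rule described above can be pictured with a small, self-contained sketch. It does not use RQ or Redis at all: the queues are plain in-memory deques, and next_job is a hypothetical helper that only mimics how a worker started as "rqworker high low" would choose its next job.

```python
from collections import deque


def next_job(queues, priority):
    """Pop the oldest job from the first non-empty queue, in priority order.

    Hypothetical helper for illustration; this is not RQ's implementation.
    """
    for name in priority:
        if queues[name]:
            # FIFO within a queue: take from the left end.
            return name, queues[name].popleft()
    return None


# Same jobs as in the example above, enqueued in the same order.
queues = {"high": deque(), "low": deque()}
queues["low"].append((2, 3))
queues["low"].append((4, 5))
queues["high"].append((6, 7))

order = []
job = next_job(queues, ["high", "low"])
while job is not None:
    order.append(job)
    job = next_job(queues, ["high", "low"])

print(order)  # [('high', (6, 7)), ('low', (2, 3)), ('low', (4, 5))]
```

Because the queue list is scanned from the start before every job, anything enqueued on high in the meantime is picked up as soon as the current job finishes.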

No rocket science here, just what you’d expect.

Insight over performance

One of the things I missed most in other queueing systems is a decent view of what’s going on in the system. For example:

  • What queues exist?
  • How many messages are on each queue?
  • What workers are listening on what queues?
  • Who’s idle or busy?
  • What actual messages are on the queue?

RQ provides an answer to all of these questions (except for the last one, currently), via the rqinfo tool.

$ rqinfo
high       |██████████████████████████ 20
low        |██████████████ 12
default    |█████████ 8
3 queues, 40 jobs total

Bricktop.19233 idle: low
Bricktop.19232 idle: high, default, low
Bricktop.18349 idle: default
3 workers, 3 queues

Showing only a subset of queues (including empty ones):

$ rqinfo high archive
high       |██████████████████████████ 20
archive    | 0
2 queues, 20 jobs total

Bricktop.19232 idle: high
1 workers, 2 queues

If you want to parse the output of this script, you can specify the --raw flag to disable the fancy drawing. Example:

$ rqinfo --raw
queue high 20
queue low 12
queue default 8
worker Bricktop.19233 idle low
worker Bricktop.19232 idle high,default,low
worker Bricktop.18349 idle default
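
As a rough sketch of what parsing this output could look like, the function below reads the raw lines into two dictionaries. The line layout is inferred only from the sample shown above ("queue NAME COUNT" and "worker NAME STATE QUEUES"), and parse_rqinfo_raw is a hypothetical helper, not something that ships with RQ.

```python
def parse_rqinfo_raw(text):
    """Parse `rqinfo --raw` style lines into (queues, workers) dictionaries.

    Assumes the layout of the sample output only; hypothetical helper.
    """
    queues = {}
    workers = {}
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue  # skip blank lines
        if fields[0] == "queue":
            # queue <name> <job count>
            queues[fields[1]] = int(fields[2])
        elif fields[0] == "worker":
            # worker <name> <state> <comma-separated queue names>
            listens = fields[3].split(",") if len(fields) > 3 else []
            workers[fields[1]] = {"state": fields[2], "queues": listens}
    return queues, workers


sample = """\
queue high 20
queue low 12
queue default 8
worker Bricktop.19233 idle low
worker Bricktop.19232 idle high,default,low
worker Bricktop.18349 idle default
"""

queues, workers = parse_rqinfo_raw(sample)
print(queues)
print(workers["Bricktop.19232"]["queues"])
```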

You can also sort the same data by queue:

$ rqinfo --by-queue
high       |██████████████████████████ 20
low        |██████████████ 12
default    |█████████ 8
3 queues, 40 jobs total

high:    Bricktop.19232 (idle)
low:     Bricktop.19233 (idle), Bricktop.19232 (idle)
default: Bricktop.18349 (idle), Bricktop.19232 (idle)
3 workers, 3 queues

By default, these monitoring commands autorefresh every 2.5 seconds, but you can change the refresh interval if you want to. See the monitoring docs for more info.

Limitations

RQ does not try to solve all of your queueing needs. But its codebase is relatively small and certainly not overly complex. Nonetheless, I think it will be helpful for all of the most basic queueing needs that you’ll encounter during Python web development.

Of course, with all this also come some limitations:

  • It’s Python-only
  • It’s Redis-only
  • The workers are Unix-only

Please, give feedback

I’ve been using RQ in two and a half web projects I’ve worked on during the last few months, and I am currently at the point where I’m satisfied enough to open the curtains to the world. So you’re invited to play with it. I’m very curious to hear your thoughts about this.

If you’d like to contribute, please go fork me on GitHub.

https://nvie.com/posts/introducing-rq/
vim-flake8: Flake8 for Vim
Just a quick post to let you know that I discarded my `vim-pep8` and `vim-pyflakes` Vim plugins yesterday in favor of [vim-flake8](https://github.com/nvie/vim-flake8).

Just a quick post to let you know that I discarded my vim-pep8 and vim-pyflakes Vim plugins yesterday in favor of vim-flake8.

As you may know, PyFlakes is a static analysis tool that lets you catch static programming errors when you write them, not when you run into them at runtime. And pep8 is a Python style checking tool that enforces PEP8 guidelines on your code.

Flake8, though, seems to be a much better option to use these days. It integrates both pep8 and PyFlakes and even combines them with a cyclomatic complexity checker (which is irrelevant for the Vim plugin, by the way). To install Flake8, simply use:

$ pip install flake8

After installing the plugin in Vim, you can add the following command to your .vimrc file to have it executed after every save of a Python source file.

autocmd BufWritePost *.py call Flake8()

To keep the errors on a specific line from being reported, put a # noqa comment at the end of that line.

Installation

Assuming you already use vim-pathogen (which you really should), you can simply install the plugin by cloning the repository into the ~/.vim/bundle folder.

https://nvie.com/posts/vim-flake8-flake8-for-vim/
Introducing Times
Lately I’ve been getting sick of working with datetimes and timezones in Python. The standard library offers many different conversion routines, but does not prescribe a best practice way to deal with them. Luckily, Armin Ronacher did in his article [Dealing with Timezones in Python](http://lucumr.pocoo.org/2011/7/15/eppur-si-muove/).

Lately I’ve been getting sick of working with datetimes and timezones in Python. The standard library offers many different conversion routines, but does not prescribe a best practice way to deal with them. Luckily, Armin Ronacher did in his article Dealing with Timezones in Python.

The summary is to never ever work with local datetimes. When a local datetime is input, immediately convert it to universal time and only ever store or calculate with those. Only when presenting datetimes to the end user, convert them to local time again.

This seems simple enough, alright. But to actually do it in Python, you still have to think about how to implement it correctly. Every. Single. Time. pytz does help a bit here, but it still isn’t trivial. It should be.
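
For comparison, here is roughly what doing this by hand with only the standard library looks like. This is a sketch under a loud assumption: fixed UTC offsets stand in for real time zones (UTC+1 for Amsterdam in winter, UTC+13 for Auckland in summer), so daylight saving transitions are not handled. The helper names local_to_utc and utc_to_local are made up for this illustration.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offsets instead of real time zone rules (no DST handling).
AMSTERDAM_WINTER = timezone(timedelta(hours=1))
AUCKLAND_SUMMER = timezone(timedelta(hours=13))


def local_to_utc(local_naive, tz):
    """Interpret a naive local datetime in tz and return a naive UTC datetime."""
    aware = local_naive.replace(tzinfo=tz)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


def utc_to_local(utc_naive, tz):
    """Convert a naive UTC datetime to an aware datetime in tz, for display only."""
    return utc_naive.replace(tzinfo=timezone.utc).astimezone(tz)


# Input arrives in local time: convert immediately, then store and calculate in UTC.
alarm_utc = local_to_utc(datetime(2012, 2, 3, 9, 30), AMSTERDAM_WINTER)
print(alarm_utc)  # 2012-02-03 08:30:00

# Only when presenting to a user, convert back to a local time.
shown = utc_to_local(alarm_utc, AUCKLAND_SUMMER).strftime("%Y-%m-%d %H:%M:%S%z")
print(shown)  # 2012-02-03 21:30:00+1300
```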

Meet Times, a very small Python library to deal with conversions from universal to local timezones and vice versa. It’s focused on simplicity and opinionated about what is good practice.

Example use

Imagine you’re building a web app that allows your users to set an alarm. Say that someone in the Netherlands sets an alarm to 9:30 am. You can use times to simplify this:

>>> import times
>>> import datetime
>>> 
>>> local_time = datetime.datetime(2012, 2, 3, 9, 30, 0)
>>> universal_time = times.to_universal(local_time, 'Europe/Amsterdam')
>>> universal_time
datetime.datetime(2012, 2, 3, 8, 30)

Now, this universal_time variable is safe to store or calculate with.

Once you want to show this date to the user again, simply format it for the given timezone:

>>> times.format(universal_time, 'Europe/Amsterdam') 
'2012-02-03 09:30:00+0100'

If your app allows users to share alerts, it is just as easy to present the alert date to an end user in New Zealand as well:

>>> times.format(universal_time, 'Pacific/Auckland') 
'2012-02-03 21:30:00+1300'

Current time

If you ever need to record the current time, you can use

>>> times.now()
datetime.datetime(2012, 2, 2, 16, 4, 40, 283090)

Which is actually just an alias to datetime.datetime.utcnow().

Converting from other sources

I’ve added the ability to create universal times from two other sources: UNIX timestamps and date strings. To use any of these, simply pass them to the to_universal function, like so:

>>> import time
>>> time.time()
1328729274.982
>>> times.to_universal(1328729274.982)
datetime.datetime(2012, 2, 8, 19, 27, 54, 982000)

Note that UNIX timestamps must be in UTC (which the output of time.time() is). Local UNIX timestamps are not accepted.
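
To make the UTC requirement concrete, this is what the same conversion looks like with only the standard library. It is a sketch of the idea, not Times’ own code, and timestamp_to_universal is a name invented for this example.

```python
from datetime import datetime, timezone


def timestamp_to_universal(ts):
    """Turn a UNIX timestamp (seconds since the epoch, UTC) into a naive UTC datetime."""
    # Passing tz=timezone.utc avoids any dependence on the machine's local time zone.
    return datetime.fromtimestamp(ts, tz=timezone.utc).replace(tzinfo=None)


print(timestamp_to_universal(1328729274))  # 2012-02-08 19:27:54
```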

To create universal times from string representations, Times uses the advanced parser from the python-dateutil library. Time zones are automatically recognized if such info is encoded in the string representation. In any other case, you are required to provide it explicitly. Two examples to illustrate both variants:

>>> # Timezone-aware date formats don't require a source timezone
>>> date_str = '2012-02-08 19:27:54+0100'
>>> times.to_universal(date_str)
datetime.datetime(2012, 2, 8, 18, 27, 54)

>>> # Timezone-less date formats require an explicit source timezone
>>> date_str = '2012-02-08 19:27:54'
>>> times.to_universal(date_str, 'Asia/Singapore')
datetime.datetime(2012, 2, 8, 11, 27, 54)
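
The rule behind these two variants (use the offset in the string if there is one, otherwise insist on an explicit source time zone) can be sketched with the standard library alone. This is not how Times does it: Times relies on python-dateutil’s parser, while the sketch below only understands one fixed format, uses a fixed UTC+8 offset as a stand-in for Asia/Singapore, and parse_to_universal is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone

FORMAT = "%Y-%m-%d %H:%M:%S"  # the only layout this sketch understands


def parse_to_universal(date_str, source_tz=None):
    """Parse date_str and return a naive UTC datetime (hypothetical helper)."""
    try:
        # First try the timezone-aware variant, e.g. "...+0100".
        aware = datetime.strptime(date_str, FORMAT + "%z")
    except ValueError:
        if source_tz is None:
            raise ValueError("no UTC offset in string; pass source_tz explicitly")
        aware = datetime.strptime(date_str, FORMAT).replace(tzinfo=source_tz)
    return aware.astimezone(timezone.utc).replace(tzinfo=None)


# Assumption: a fixed offset standing in for the Asia/Singapore time zone.
SINGAPORE = timezone(timedelta(hours=8))

print(parse_to_universal("2012-02-08 19:27:54+0100"))        # 2012-02-08 18:27:54
print(parse_to_universal("2012-02-08 19:27:54", SINGAPORE))  # 2012-02-08 11:27:54
```
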

Installing

Times is on PyPI, so just pip install times to use it.

Of course, you can fork me on GitHub.

As usual, Times is licensed under the liberal terms of the BSD license.

https://nvie.com/posts/introducing-times/
Chords + Lyrics
I released my first iPad app to import/manage chords and lyrics.

It’s been quite a while since I took the time to update this blog. Many things have happened in the meantime, though. The most important news for me is that I launched an iPad app and founded a company called 3rd Cloud last week.

Hello, Chords + Lyrics!

An annoying problem amateur musicians might be familiar with is that chords or tablature websites all look very different, format their song data in various ways, and oftentimes are just plain ugly. Oh, and they’re generally plastered with ads, too. So to scratch our own itches, I teamed up with @jr00n from StudioWolff to create Chords + Lyrics.

Chords + Lyrics is a simple music manager for your iPad that allows you to easily import songs and lyrics from your favorite chords website (that means: any website—it recognizes the chords semantically):

Once imported, the chords become editable objects in the form of bubbles, which makes it easy for you to edit or finetune the imported songs. A carefully selected choice of fonts is used to create a readable and uniform look and feel for your song’s chords and lyrics:

Furthermore, once the editing is done, you can simply take Chords + Lyrics on stage with your band or solo performance and leave your stack of sheet music at home. When the device is rotated into landscape orientation, the user interface transforms into a big music stand that arranges the songs of choice as a stack of virtual music sheets in the order of your playlist:

You can get a sneak peek of the app by watching this video:

The app is sold at $5.99. Check us out in the App Store.

The Role of Appsterdam

Jeroen and I met during iOS Dev Camp last March and were immediately excited about the idea for Chords + Lyrics. We even received the Most Likely to Succeed award from Dom Sagolla and Mike Lee at the end of those two days, which got us even more excited about the project.

We started hacking away at it in our spare time for the next few months. At the same time, by some lucky coincidence, that same Mike Lee started building an awesome developer community for app developers: Appsterdam.

Many of our thanks therefore go out to all of the Appsterdammers, in particular Mike and Judy, for inspiring us at the moments we got stuck and for the helpful pieces of advice we got from them. Without the Appsterdam community, the project might never have seen the light of day, or would have been much less awesome.

Let this be the first of many apps!

https://nvie.com/posts/chords-lyrics/
A git-flow screencast
Mr. [Dave Bock](http://www.davebock.com/) of Code Sherpas put together a nice screencast demonstrating a few of the most important git-flow features on their [publications](http://www.codesherpas.com/portfolio/publications) page.

Mr. Dave Bock of Code Sherpas put together a nice screencast demonstrating a few of the most important git-flow features on their publications page.

Many thanks to Dave for creating this!

https://nvie.com/posts/a-git-flow-screencast/
How I boosted my Vim
Where I lay out the recent changes I made to my Vim setup.

A few weeks ago, I felt inspired by articles from Jeff Kreeftmeijer and Armin Ronacher. I took some time to configure and fine-tune my Vim environment. A lot of new stuff made it into my .vimrc file and my .vim directory. This blog post is a summary describing what I’ve added and how I use it in my daily work.

Before doing anything else, make sure you have the following line in your .vimrc file:

" This must be first, because it changes other options as side effect
set nocompatible

Step 0: make the customization process easier

Before you start configuring, it’s useful to install pathogen. Plugins in Vim are files that you drop in subdirectories of your .vim/ directory. Many plugins consist of only a single file that should be dropped in .vim/plugin, but some consist of multiple files. For example, they come with documentation, or ship syntax files. In those cases, files need to be dropped into .vim/doc and .vim/syntax. This makes it difficult to remove the plugin afterwards. After installing pathogen, you can simply unzip a plugin distribution into .vim/bundle/myplugin, under which the required subdirectories are created. Removing the plugin, then, is as simple as removing the myplugin directory.

So, download pathogen.vim, move it into the .vim/autoload directory (create it if necessary) and add the following lines to your .vimrc, to activate it:

" Use pathogen to easily modify the runtime path to include all
" plugins under the ~/.vim/bundle directory
call pathogen#helptags()
call pathogen#runtime_append_all_bundles()

Next, I’ve remapped the leader key to , (comma) instead of the default \ (backslash), just because I like it better. Since almost every key is already mapped to a command in Vim’s default configuration, there needs to be some sort of standard “free” key under which you can place custom mappings. This is called the “mapleader”, and can be defined like this:

" change the mapleader from \ to ,
let mapleader=","

Once that is done, this is a little tweak that is a time-saver while you’re building up your .vimrc. Here, we start using the leader key:

" Quickly edit/reload the vimrc file
nmap <silent> <leader>ev :e $MYVIMRC<CR>
nmap <silent> <leader>sv :so $MYVIMRC<CR>

This effectively maps the ,ev and ,sv keys to edit/reload .vimrc. (I got this from Derek Wyatt’s .vimrc file.)

Change Vim behaviour

One particularly useful setting is hidden. Its name isn’t too descriptive, though. It hides buffers instead of closing them. This means that you can have unwritten changes to a file and open a new file using :e, without being forced to write or undo your changes first. Also, undo buffers and marks are preserved while the buffer is open. This is an absolute must-have.

set hidden

These are some of the most basic settings that you probably want to enable, too:

set nowrap        " don't wrap lines
set tabstop=4     " a tab is four spaces
set backspace=indent,eol,start
                    " allow backspacing over everything in insert mode
set autoindent    " always set autoindenting on
set copyindent    " copy the previous indentation on autoindenting
set number        " always show line numbers
set shiftwidth=4  " number of spaces to use for autoindenting
set shiftround    " use multiple of shiftwidth when indenting with '<' and '>'
set showmatch     " set show matching parenthesis
set ignorecase    " ignore case when searching
set smartcase     " ignore case if search pattern is all lowercase,
                    "    case-sensitive otherwise
set smarttab      " insert tabs on the start of a line according to
                    "    shiftwidth, not tabstop
set hlsearch      " highlight search terms
set incsearch     " show search matches as you type

There is a lot more goodness in my .vimrc file, which is put in there with a lot of love. I’ve commented most of it, too. Feel free to poke around in it.

Also, I like Vim to have a large undo buffer, a large history of commands, ignore some file extensions when completing names by pressing Tab, and be silent about invalid cursor moves and other errors.

set history=1000         " remember more commands and search history
set undolevels=1000      " use many muchos levels of undo
set wildignore=*.swp,*.bak,*.pyc,*.class
set title                " change the terminal's title
set visualbell           " don't beep
set noerrorbells         " don't beep

I don't like Vim to ever write a backup file. I prefer more modern ways of protecting against data loss.

set nobackup
set noswapfile

There have been some passionate responses about this in comments, so a warning may be appropriate here. If you care about recovering after a Vim or terminal emulator crash, or you often load huge files into memory, do not disable the swapfile. I personally save/commit so often that the swap file adds nothing. Sometimes I consciously kill a terminal forcefully, and I only find the swap file recovery process annoying.

Use file type plugins

Vim can detect file types (by their extension, or by peeking inside the file). This enables Vim to load plugins, settings or key mappings that are only useful in the context of specific file types. For example, a Python syntax checker plugin only makes sense in a Python file. Finally, indenting intelligence is enabled based on the syntax rules for the file type.

filetype plugin indent on

To set some file type specific settings, you can now use the following:

autocmd filetype python set expandtab

To remain compatible with older versions of Vim that do not have the autocmd functions, always wrap those functions inside a block like this:

if has('autocmd')
    ...
endif

Enable syntax highlighting

Somewhat related to the file type plugins is the syntax highlighting of different types of source files. Vim uses syntax definitions to highlight source code. Syntax definitions simply declare where a function name starts, which pieces are commented out and what are keywords. To color them, Vim uses colorschemes. You can load custom color schemes by placing them in .vim/colors, then load them using the colorscheme command. You have to try what you like most. I like mustang a lot.

if &t_Co >= 256 || has("gui_running")
    colorscheme mustang
endif

if &t_Co > 2 || has("gui_running")
    " switch syntax highlighting on, when the terminal has colors
    syntax on
endif

In this case, mustang is only loaded if the terminal emulator Vim runs in supports at least 256 colors (or if you use the GUI version of Vim).

Hint: If you’re using a terminal emulator that can show 256 colors, try setting TERM=xterm-256color in your terminal configuration or in your shell’s .rc file.

Change editing behaviour

When you write a lot of code, you probably want to obey certain style rules. In some programming languages (like Python), whitespace is important, so you may not just swap tabs for spaces and even the number of spaces is important.

Vim can highlight whitespaces for you in a convenient way:

set list
set listchars=tab:>.,trail:.,extends:#,nbsp:.

These settings make Vim display tab characters, trailing whitespace and non-breaking spaces visibly, and additionally use the # sign at the end of lines to mark lines that extend off-screen. For more info, see :h listchars.

In some files, like HTML and XML files, tabs are fine and showing them is really annoying. You can disable that easily using an autocmd declaration:

autocmd filetype html,xml set listchars-=tab:>.

One caveat when setting listchars: if nothing happens, you have probably not enabled list, so try :set list, too.

Pasting large amounts of text into Vim

Every Vim user likes to enable auto-indenting of source code, so Vim can intelligently position your cursor on the next line as you type. This has one big ugly consequence however: when you paste text into your terminal-based Vim with a right mouse click, Vim cannot know it is coming from a paste. To Vim, it looks like text entered by someone who can type incredibly fast :) Since Vim thinks these are regular key strokes, it applies all auto-indenting and auto-expansion of defined abbreviations to the input, often resulting in cascading indents of paragraphs.

There is an easy option to prevent this, however. You can temporarily switch to “paste mode”, simply by setting the following option:

set pastetoggle=<F2>

Then, when in insert mode, ready to paste, if you press <F2>, Vim will switch to paste mode, disabling all kinds of smartness and just pasting a whole buffer of text. Then, you can disable paste mode again with another press of <F2>. Nice and simple. Compare paste mode disabled vs enabled:

Another great trick I read in a reddit comment is to use <C-r>+ to paste right from the OS paste board. Of course, this only works when running Vim locally (i.e. not over an SSH connection).

Enable the mouse

While using the mouse is considered a deadly sin among Vim users, there are a few features about the mouse that can really come to your advantage. Most notably—scrolling. In fact, it’s the only thing I use it for.

Also, if you are a rookie Vim user, setting this value will definitely make your Vim experience feel more natural.

To enable the mouse, use:

set mouse=a

However, this comes at one big disadvantage: when you run Vim inside a terminal, the terminal itself cannot control your mouse anymore. Therefore, you cannot select text anymore with the terminal (to copy it to the system clipboard, for example).

To be able to have the best of both worlds, I wrote this simple Vim plugin: vim-togglemouse. It maps <F12> to toggle your mouse “focus” between Vim and the terminal.

Small plugins like these are really useful, and have the additional benefit of lowering the barrier to learning the Vim scripting language. At its core, this plugin consists of only one simple function:

fun! s:ToggleMouse()
    if !exists("s:old_mouse")
        let s:old_mouse = "a"
    endif

    if &mouse == ""
        let &mouse = s:old_mouse
        echo "Mouse is for Vim (" . &mouse . ")"
    else
        let s:old_mouse = &mouse
        let &mouse=""
        echo "Mouse is for terminal"
    endif
endfunction

Get efficient: shortcut mappings

The following trick is a really small one, but a super-efficient one, since it strips off two full keystrokes from almost every Vim command:

nnoremap ; :

For example, to save a file, you type :w normally, which means:

  1. Press and hold Shift
  2. Press ;
  3. Release the Shift key
  4. Press w
  5. Press Return

This trick strips off steps 1 and 3 for each Vim command. It takes some time for your muscle memory to get used to this new ;w command, but once you use it, you don’t want to go back!

I also find this key binding very useful, since I like to reformat paragraph text often. Just set your cursor inside a paragraph and press Q (or select a visual block and press Q).

" Use Q for formatting the current paragraph (or selection)
vmap Q gq
nmap Q gqap

If you are still getting used to Vim and want to force yourself to stop using the arrow keys, add this:

map <up> <nop>
map <down> <nop>
map <left> <nop>
map <right> <nop>

If you like long lines with line wrapping enabled, this solves the problem that pressing down jumps your cursor “over” the current line to the next line. It changes behaviour so that it jumps to the next row in the editor (much more natural):

nnoremap j gj
nnoremap k gk

When you start to use Vim more professionally, you want to work with multiple windows open. Navigating requires you to press C-w first, then a navigation command (h, j, k, l). This makes it easier to navigate focus through windows:

" Easy window navigation
map <C-h> <C-w>h
map <C-j> <C-w>j
map <C-k> <C-w>k
map <C-l> <C-w>l

Tired of clearing highlighted searches by searching for “ldsfhjkhgakjks”? Use this:

nmap <silent> ,/ :nohlsearch<CR>

I used to have it mapped to :let @/="", but some users kindly pointed out that it is better to use :nohlsearch, because it keeps the search history intact.

It clears the search highlighting when you press ,/

Finally, a trick by Steve Losh for when you forgot to sudo before editing a file that requires root privileges (typically /etc/hosts). This lets you use w!! to do that after you opened the file already:

cmap w!! w !sudo tee % >/dev/null

Other cool plugins

In order not to make this article any longer than it already is, here’s a list of other plugins that are really worth checking out (I use each of them regularly):

  • localrc: lets you load specific Vim settings for any file in the same directory (or a subdirectory thereof). Comes in super handy for project-wide settings.

  • CtrlP: lets you open files or switch buffers quickly using fuzzy search. I'd highly recommend it.

Other resources

Some of the resources from where I have collected inspiration for my .vimrc file, plugins, and tricks:

I hope you like these tips. You can have a look at my full Vim configuration in my GitHub repo.

https://nvie.com/posts/how-i-boosted-my-vim/
A whole new blog
I've moved my blog to Nanoc.

Note: This blog post was written in 2010, and a lot has changed since then. I’m not using any of this anymore. The post is still available for posterity, but do take this into account when reading it.

Finally, I’ve made the move to a static blog engine! I’m using nanoc now (bye bye WordPress). nanoc is a very flexible and customizable static site generator, written by Denis Defreyne.

As with all static site generators, nanoc lets you write your source files in a simple markup language. Out of the box, nanoc offers you the choice of using Markdown, Textile, reStructuredText or plain HTML (with or without embedded Ruby). In fact, nanoc is nothing more than a generator honoring a Rules-file that tells it how to compile, layout and route the site’s items.

Compiling items

An “item” is a file on your website. It can be any kind of file, like a web site page (HTML), an image, a JavaScript or CSS file or an RSS feed. During the compile phase, you specify which sequential actions should be performed on the content of that item. These actions are called filters. Some examples of filters are an embedded Ruby filter, a Textile-to-HTML converter, a less compiler, or a CSS minifier. Filters can be chained, for example:

compile '/static/css/*/' do
   # compress CSS :)
   filter :less
   filter :rainpress
end

Which turns .less-files into compressed CSS:

Any filter you can imagine, nanoc can handle. nanoc comes with a lot of filters out of the box, but even writing your own filters really is a piece of cake.

Routing items

After compiling (i.e. transforming content through filters) comes the routing of the items. This is a means of assigning file names to compiled content. nanoc calculates default file names from the input, but you can use routing rules to override the default naming. A special case is a route of nil, which means the file isn’t written at all. I use this to test draft posts locally, like this (oh, did I mention the Rules file is 100% Ruby?):

route '/posts/*/' do
   if $include_drafts or @item[:published] then
      '/posts/' + @item.slug + '/index.html'
   end
end

Laying out items

Finally, layouts are applied. Layouts are kind of templates that can be used to “frame” the item’s contents. This is typically used for HTML files only, but isn’t limited to it. For example, the blog posts are compiled into (partial) HTML, and the layout rules put the site’s container HTML around it, adding CSS styling, jQuery scripts, the header, sidebars and footer and Google Analytics tracking (these go for each page). There’s a special extra layout rule for blog post pages, which additionally adds Disqus comments.

Summary

Each build of this blog also automatically:

  • Converts Textile content to HTML
  • Highlights syntax using pygments
  • Converts less to CSS
  • Minifies CSS
  • Minifies JavaScript
  • Downsizes source images
  • Generates redirect pages for alternative (old-style) URLs (for users who have existing bookmarks to old WordPress URLs)
  • Generates a new blog post RSS feed

In short, now that nanoc is fully configured to my wishes, I can simply focus on writing blog content, without preparing image content (it is done automatically), and without having to choose between either a “WYSIWYG” editor or writing HTML manually. And I can do it in an offline fashion, too, which was one of my main complaints about WordPress.

So I’m happy.

Oh, and since I have been converting my blog anyway, I also created a new look and feel for it. I hope you like it. Feel free to comment.

https://nvie.com/posts/a-whole-new-blog/
An upgrade of gitflow
Last week, I silently tagged gitflow 0.2. These are the most important changes since 0.1.

Last week, I silently tagged gitflow 0.2. The most important changes since 0.1 are:

  • Order of arguments changed to have a more “gitish” subcommand structure. For example, you now say: git flow feature start myfeature
  • Better initializer. git flow init now prompts interactively to set up a gitflow enabled repo.
  • Added a command to list all feature/release/hotfix/support branches, e.g.: git flow feature list
  • Made all merge/rebase operations failsafe, providing a non-destructive workflow in case of merge conflicts.
  • Easy diff’ing of all changes on a specific (or the current) feature branch: git flow feature diff [feature]
  • Add support for feature branch rebasing: git flow feature rebase
  • Some subactions now take name prefixes as their arguments, for convenience. For example, if you have feature branches called “experimental”, “refactoring” and “feature-X”, you could say: git flow feature finish ref
    And gitflow will know you mean the “refactoring” feature branch.
    These actions are: finish, diff and rebase.
  • Much better overall sanity checking.
  • Better portability (POSIX compliant code)
  • Better (more portable) flag parsing using Kate Ward’s shFlags.
  • Improved installer. To install git flow as a first-class Git subcommand, simply type: sudo make install
  • Major and minor bug fixes.
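
The name-prefix matching mentioned in the list above can be pictured with a small sketch. It is written in Python purely for illustration (gitflow itself is a shell script), resolve_prefix is a hypothetical helper that is not part of gitflow, and the handling of exact and ambiguous names is an assumption about sensible behaviour rather than a description of what gitflow does.

```python
def resolve_prefix(prefix, branches):
    """Return the one branch name that starts with prefix (hypothetical helper)."""
    if prefix in branches:
        # Assumption: an exact name wins, even if it is also a prefix of another name.
        return prefix
    matches = [name for name in branches if name.startswith(prefix)]
    if not matches:
        raise ValueError("no feature branch matches %r" % prefix)
    if len(matches) > 1:
        raise ValueError("%r is ambiguous: %s" % (prefix, ", ".join(sorted(matches))))
    return matches[0]


features = ["experimental", "refactoring", "feature-X"]
print(resolve_prefix("ref", features))  # refactoring
```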

That’s all for now.

https://nvie.com/posts/an-upgrade-of-gitflow/
gitflow 0.1 released
After the overwhelming attention and feedback on the Git branching model post, a general consensus was that this workflow would benefit from some form of proper scriptability. This post proposes the initial version of a tool I called git-flow.

After the overwhelming attention and feedback on the Git branching model post, a general consensus was that this workflow would benefit from some form of proper scriptability. The workflow works seamlessly if you perform the steps involved manually, but hey… manually is manually, really.

UPDATE 2/4/2010: I recommend NOT USING this very early release, but to jump on the current develop tip, which is much more mature. Release 0.2 is coming very soon.

An assisting tool (dubbed gitflow) was therefore created to provide simple, high-level commands to adopt the workflow into your own software development process. It’s free and it’s open source. Feel free to contribute to it if you like.

Fork me on Github: http://github.com/nvie/gitflow

This morning, the first working release, 0.1, was tagged, albeit a very basic one.

A quick walkthrough

The gitflow script essentially features six subcommands: paired start/finish commands for managing the different types of branches from the originating article:

  • Feature branches:
      • gitflow start feature <myfeature>
      • gitflow finish feature <myfeature>
  • Release branches:
      • gitflow start release <version-id>
      • gitflow finish release <version-id>
  • Hotfix branches:
      • gitflow start hotfix <version-id>
      • gitflow finish hotfix <version-id>

Each of these scripts reports exactly what actions were taken and what follow-up actions are required by the user. This output will be polished in future versions to improve the UX. An example output:

$ gitflow finish feature foo
Branches 'develop' and 'origin/develop' have diverged.
And local branch 'develop' is ahead of 'origin/develop'.
Switched to branch "develop"
Your branch is ahead of 'origin/develop' by 12 commits.
Merge made by recursive.
 README |    2 ++
 1 files changed, 2 insertions(+), 0 deletions(-)
Deleted branch foo (cd3effb).

Summary of actions:
- The feature branch 'foo' was merged into 'develop'
- Feature branch 'foo' has been removed
- You are now on branch 'develop'

Limitations

The script is still very limited at the moment, but future versions will fix that, too. Some of the main limitations:

  • Branch names (master, develop) and the remote repo name (origin) are currently fixed.
  • There is no support for dealing with merge conflicts yet.
  • There is no support for support-* branches.
  • There is no documentation.
  • There is no installer.

However, as this post is being written, some of these limitations are already being taken care of by community members. Power to open source!

https://nvie.com/posts/gitflow-01-released/
A successful Git branching model
In this post I present a Git branching strategy for developing and releasing version-based software.
Note of reflection · March 5, 2020

This model was conceived in 2010, now more than 10 years ago, and not very long after Git itself came into being. In those 10 years, git-flow (the branching model laid out in this article) has become hugely popular in many a software team to the point where people have started treating it like a standard of sorts — but unfortunately also as a dogma or panacea.

During those 10 years, Git itself has taken the world by storm, and the most popular type of software being developed with Git is shifting more towards web apps (at least in my filter bubble). Web apps are typically continuously delivered, not rolled back, and you don't have to support multiple versions of the software running in the wild.

This is not the class of software that I had in mind when I wrote the blog post 10 years ago. If your team is doing continuous delivery of software, I would suggest adopting a much simpler workflow (like GitHub flow) instead of trying to shoehorn git-flow into your team.

If, however, you are building software that is explicitly versioned, or if you need to support multiple versions of your software in the wild, then git-flow may still be as good a fit for your team as it has been for people in the last 10 years. In that case, please read on.

To conclude, always remember that panaceas don't exist. Consider your own context. Don't be hating. Decide for yourself.

In this post I present the development model that I’ve introduced for some of my projects (both at work and private) about a year ago, and which has turned out to be very successful. I’ve been meaning to write about it for a while now, but I’ve never really found the time to do so thoroughly, until now. I won’t talk about any of the projects’ details, merely about the branching strategy and release management.

Licensed under Creative Commons BY-SA
Why git?

For a thorough discussion on the pros and cons of Git compared to centralized source code control systems, see the web. There are plenty of flame wars going on there. As a developer, I prefer Git above all other tools around today. Git really changed the way developers think of merging and branching. From the classic CVS/Subversion world I came from, merging/branching has always been considered a bit scary (“beware of merge conflicts, they bite you!”) and something you only do every once in a while.

But with Git, these actions are extremely cheap and simple, and they are considered one of the core parts of your daily workflow, really. For example, in CVS/Subversion books, branching and merging is first discussed in the later chapters (for advanced users), while in every Git book, it’s already covered in chapter 3 (basics).

As a consequence of its simplicity and repetitive nature, branching and merging are no longer something to be afraid of. Version control tools are supposed to assist in branching/merging more than anything else.

Enough about the tools, let’s head onto the development model. The model that I’m going to present here is essentially no more than a set of procedures that every team member has to follow in order to come to a managed software development process.

Decentralized but centralized

The repository setup that we use, and that works well with this branching model, is one with a central “truth” repo. Note that this repo is only considered to be the central one (since Git is a DVCS, there is no such thing as a central repo at a technical level). We will refer to this repo as origin, since this name is familiar to all Git users.

Each developer pulls and pushes to origin. But besides the centralized push-pull relationships, each developer may also pull changes from other peers to form subteams. For example, this might be useful when two or more developers work together on a big new feature, without having to push the work in progress to origin prematurely. In the figure above, there are subteams of Alice and Bob, Alice and David, and Clair and David.

Technically, this means nothing more than that Alice has defined a Git remote, named bob, pointing to Bob’s repository, and vice versa.
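
As a rough illustration of such a peer setup, here is a minimal sketch that uses throwaway local directories in place of real machines. The directory layout, the branch name bigfeature and the file name are assumptions made up for this example. Only the remote name bob comes from the text above.

```shell
#!/bin/sh
# Sketch: Alice registers Bob's repository as a remote named "bob".
# All paths, the branch name and the file name are illustrative assumptions.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

work=$(mktemp -d)

# A stand-in for the central "truth" repo, with one commit on develop
git -c init.defaultBranch=develop init -q "$work/origin"
git -C "$work/origin" symbolic-ref HEAD refs/heads/develop
git -C "$work/origin" commit -q --allow-empty -m "Initial commit"

# Both developers clone from origin
git clone -q "$work/origin" "$work/alice"
git clone -q "$work/origin" "$work/bob"

# Bob starts a big feature locally, without pushing it to origin
cd "$work/bob"
git checkout -q -b bigfeature
echo "work in progress" > feature.txt
git add feature.txt
git commit -q -m "Bob's part of the big feature"

# Alice defines a remote named "bob" and picks up his branch directly
cd "$work/alice"
git remote add bob "$work/bob"
git fetch -q bob
git checkout -q -b bigfeature bob/bigfeature
git remote -v
```

Bob would do the mirror image (git remote add alice ...) to pull Alice's commits. Nothing in this exchange touches origin until one of them decides to push.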

The main branches

At the core, the development model is greatly inspired by existing models out there. The central repo holds two main branches with an infinite lifetime:

  • master
  • develop

The master branch at origin should be familiar to every Git user. Parallel to the master branch, another branch exists called develop.

We consider origin/master to be the main branch where the source code of HEAD always reflects a production-ready state.

We consider origin/develop to be the main branch where the source code of HEAD always reflects a state with the latest delivered development changes for the next release. Some would call this the “integration branch”. This is where any automatic nightly builds are built from.

When the source code in the develop branch reaches a stable point and is ready to be released, all of the changes should be merged back into master somehow and then tagged with a release number. How this is done in detail will be discussed further on.

Therefore, each time changes are merged back into master, this is a new production release by definition. We tend to be very strict about this, so that theoretically, we could use a Git hook script to automatically build and roll out our software to our production servers every time there is a commit on master.
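
One way such a hook could look is sketched below, assuming it is installed as the post-receive hook of the central repo. Git hands that hook one line per updated ref on standard input, in the form "old-sha new-sha ref-name". The function deploy_release is a hypothetical placeholder for whatever build and roll-out step you use; it is not part of Git. The last lines only simulate a push so that the sketch can be run on its own.

```shell
#!/bin/sh
# Sketch of a post-receive hook body. deploy_release is a hypothetical
# stand-in for a real build/roll-out script.
deploy_release() {
    echo "Deploying $1 to production"
}

# Reads "<old-sha> <new-sha> <ref-name>" lines, as Git supplies them on stdin
handle_push() {
    while read -r old_sha new_sha ref_name; do
        # Only updates of master count as production releases
        if [ "$ref_name" = "refs/heads/master" ]; then
            deploy_release "$new_sha"
        fi
    done
}

# Simulated push of two refs. In a real hook you would simply call
# handle_push and let it read what Git sends on stdin.
printf '%s\n' \
    "1111111 2222222 refs/heads/develop" \
    "3333333 4444444 refs/heads/master" | handle_push
```

Note that a server-side hook fires when master is updated by a push, which matches the rule that every change arriving on master is a release.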

Supporting branches

Next to the main branches master and develop, our development model uses a variety of supporting branches to aid parallel development between team members, ease tracking of features, prepare for production releases and to assist in quickly fixing live production problems. Unlike the main branches, these branches always have a limited lifetime, since they will be removed eventually.

The different types of branches we may use are:

  • Feature branches
  • Release branches
  • Hotfix branches

Each of these branch types has a specific purpose and is bound to strict rules as to which branches may be its originating branch and which branches must be its merge targets. We will walk through them in a minute.

By no means are these branches “special” from a technical perspective. The branch types are categorized by how we use them. They are of course plain old Git branches.

Feature branches

May branch off from: develop
Must merge back into: develop
Branch naming convention: anything except master, develop, release-*, or hotfix-*

Feature branches (sometimes called topic branches) are used to develop new features for the upcoming or a distant future release. When starting development of a feature, the target release in which this feature will be incorporated may well be unknown at that point. The essence of a feature branch is that it exists as long as the feature is in development, but will eventually be merged back into develop (to definitely add the new feature to the upcoming release) or discarded (in case of a disappointing experiment).

Feature branches typically exist in developer repos only, not in origin.

Creating a feature branch

When starting work on a new feature, branch off from the develop branch.

$ git checkout -b myfeature develop
Switched to a new branch "myfeature"

Incorporating a finished feature on develop

Finished features may be merged into the develop branch to definitely add them to the upcoming release:

$ git checkout develop
Switched to branch 'develop'
$ git merge --no-ff myfeature
Updating ea1b82a..05e9557
(Summary of changes)
$ git branch -d myfeature
Deleted branch myfeature (was 05e9557).
$ git push origin develop

The --no-ff flag causes the merge to always create a new commit object, even if the merge could be performed with a fast-forward. This avoids losing information about the historical existence of a feature branch and groups together all commits that together added the feature. Compare:

In the latter case, it is impossible to see from the Git history which of the commit objects together have implemented a feature—you would have to manually read all the log messages. Reverting a whole feature (i.e. a group of commits), is a true headache in the latter situation, whereas it is easily done if the --no-ff flag was used.

Yes, it will create a few more (empty) commit objects, but the gain is much bigger than the cost.
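
To make the revert argument concrete, here is a small sketch in a throwaway repository. The file names and commit messages are invented for the example. It merges a two-commit feature with --no-ff and then backs the whole feature out again with a single revert of the merge commit.

```shell
#!/bin/sh
# Sketch: a --no-ff merge groups a feature's commits, so the feature can be
# reverted as one unit. File names and messages are illustrative assumptions.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git -c init.defaultBranch=develop init -q .
git symbolic-ref HEAD refs/heads/develop
git commit -q --allow-empty -m "Initial commit"

# A feature consisting of two commits
git checkout -q -b myfeature
echo one > a.txt; git add a.txt; git commit -q -m "Feature: part 1"
echo two > b.txt; git add b.txt; git commit -q -m "Feature: part 2"

# Without --no-ff this would fast-forward and leave no trace of the branch
git checkout -q develop
git merge -q --no-ff -m "Merge branch 'myfeature' into develop" myfeature
git log --graph --oneline

# Revert the merge commit. "-m 1" names the first parent (develop as it was
# before the merge) as the mainline to return to.
git revert --no-edit -m 1 HEAD
```

After the revert, both a.txt and b.txt are gone from develop, while the history still shows the feature and its removal as two clearly marked commits.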

Release branches

May branch off from: develop
Must merge back into: develop and master
Branch naming convention: release-*

Release branches support preparation of a new production release. They allow for last-minute dotting of i’s and crossing t’s. Furthermore, they allow for minor bug fixes and preparing meta-data for a release (version number, build dates, etc.). By doing all of this work on a release branch, the develop branch is cleared to receive features for the next big release.

The key moment to branch off a new release branch from develop is when develop (almost) reflects the desired state of the new release. At least all features that are targeted for the release-to-be-built must be merged into develop at this point in time. Features targeted at future releases must not be; they have to wait until after the release branch is branched off.

It is exactly at the start of a release branch that the upcoming release gets assigned a version number—not any earlier. Up until that moment, the develop branch reflected changes for the “next release”, but it is unclear whether that “next release” will eventually become 0.3 or 1.0, until the release branch is started. That decision is made on the start of the release branch and is carried out by the project’s rules on version number bumping.

Creating a release branch

Release branches are created from the develop branch. For example, say version 1.1.5 is the current production release and we have a big release coming up. The state of develop is ready for the “next release” and we have decided that this will become version 1.2 (rather than 1.1.6 or 2.0). So we branch off and give the release branch a name reflecting the new version number:

$ git checkout -b release-1.2 develop
Switched to a new branch "release-1.2"
$ ./bump-version.sh 1.2
Files modified successfully, version bumped to 1.2.
$ git commit -a -m "Bumped version number to 1.2"
[release-1.2 74d9424] Bumped version number to 1.2
1 files changed, 1 insertions(+), 1 deletions(-)

After creating a new branch and switching to it, we bump the version number. Here, bump-version.sh is a fictional shell script that changes some files in the working copy to reflect the new version. (This can of course be a manual change—the point being that some files change.) Then, the bumped version number is committed.

This new branch may exist there for a while, until the release may be rolled out definitely. During that time, bug fixes may be applied in this branch (rather than on the develop branch). Adding large new features here is strictly prohibited. They must be merged into develop, and therefore, wait for the next big release.

Finishing a release branch

When the state of the release branch is ready to become a real release, some actions need to be carried out. First, the release branch is merged into master (since every commit on master is a new release by definition, remember). Next, that commit on master must be tagged for easy future reference to this historical version. Finally, the changes made on the release branch need to be merged back into develop, so that future releases also contain these bug fixes.

The first two steps in Git:

$ git checkout master
Switched to branch 'master'
$ git merge --no-ff release-1.2
Merge made by recursive.
(Summary of changes)
$ git tag -a 1.2

The release is now done, and tagged for future reference.

Edit: You may also want to use the -s or -u <key> flags to sign your tag cryptographically.

To keep the changes made in the release branch, we need to merge those back into develop, though. In Git:

$ git checkout develop
Switched to branch 'develop'
$ git merge --no-ff release-1.2
Merge made by recursive.
(Summary of changes)

This step may well lead to a merge conflict (probably even, since we have changed the version number). If so, fix it and commit.

Now we are really done and the release branch may be removed, since we don’t need it anymore:

$ git branch -d release-1.2
Deleted branch release-1.2 (was ff452fe).

Hotfix branches

May branch off from: master
Must merge back into: develop and master
Branch naming convention: hotfix-*

Hotfix branches are very much like release branches in that they are also meant to prepare for a new production release, albeit unplanned. They arise from the necessity to act immediately upon an undesired state of a live production version. When a critical bug in a production version must be resolved immediately, a hotfix branch may be branched off from the corresponding tag on the master branch that marks the production version.

The essence is that work of team members (on the develop branch) can continue, while another person is preparing a quick production fix.

Creating the hotfix branch

Hotfix branches are created from the master branch. For example, say version 1.2 is the current production release running live and causing trouble due to a severe bug, but changes on develop are still unstable. We may then branch off a hotfix branch and start fixing the problem:

$ git checkout -b hotfix-1.2.1 master
Switched to a new branch "hotfix-1.2.1"
$ ./bump-version.sh 1.2.1
Files modified successfully, version bumped to 1.2.1.
$ git commit -a -m "Bumped version number to 1.2.1"
[hotfix-1.2.1 41e61bb] Bumped version number to 1.2.1
1 files changed, 1 insertions(+), 1 deletions(-)

Don’t forget to bump the version number after branching off!

Then, fix the bug and commit the fix in one or more separate commits.

$ git commit -m "Fixed severe production problem"
[hotfix-1.2.1 abbe5d6] Fixed severe production problem
5 files changed, 32 insertions(+), 17 deletions(-)

Finishing a hotfix branch

When finished, the bugfix needs to be merged back into master, but also needs to be merged back into develop, in order to safeguard that the bugfix is included in the next release as well. This is completely similar to how release branches are finished.

First, update master and tag the release.

$ git checkout master
Switched to branch 'master'
$ git merge --no-ff hotfix-1.2.1
Merge made by recursive.
(Summary of changes)
$ git tag -a 1.2.1

Edit: You may also want to use the -s or -u <key> flags to sign your tag cryptographically.

Next, include the bugfix in develop, too:

$ git checkout develop
Switched to branch 'develop'
$ git merge --no-ff hotfix-1.2.1
Merge made by recursive.
(Summary of changes)

The one exception to the rule here is that, when a release branch currently exists, the hotfix changes need to be merged into that release branch, instead of develop. Back-merging the bugfix into the release branch will eventually result in the bugfix being merged into develop too, when the release branch is finished. (If work in develop immediately requires this bugfix and cannot wait for the release branch to be finished, you may safely merge the bugfix into develop now already as well.)
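
The choice of back-merge target could be scripted roughly as follows. This is only a sketch under assumptions: it takes the first branch matching release-* as the open release branch, and the version numbers and file name are made up. The merge into master and the tagging are left out here, since they are the same as shown earlier.

```shell
#!/bin/sh
# Sketch: merge a hotfix into the open release branch if one exists,
# otherwise into develop. Branch and file names are illustrative assumptions.
set -e
export GIT_AUTHOR_NAME=demo GIT_AUTHOR_EMAIL=demo@example.com
export GIT_COMMITTER_NAME=demo GIT_COMMITTER_EMAIL=demo@example.com

repo=$(mktemp -d)
cd "$repo"
git -c init.defaultBranch=master init -q .
git symbolic-ref HEAD refs/heads/master
git commit -q --allow-empty -m "Release 1.2"
git branch develop
git branch release-1.3 develop    # a release branch is currently open

# The hotfix itself, branched off master
git checkout -q -b hotfix-1.2.1 master
echo fix > fix.txt
git add fix.txt
git commit -q -m "Fixed severe production problem"

# Pick the target: the first release-* branch if there is one, else develop
target=$(git for-each-ref --format='%(refname:short)' 'refs/heads/release-*' | head -n 1)
target=${target:-develop}

git checkout -q "$target"
git merge -q --no-ff -m "Merge branch 'hotfix-1.2.1' into $target" hotfix-1.2.1
echo "Hotfix merged into $target"
```

In this run develop is left untouched. It receives the fix later, when release-1.3 is finished and merged back.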

Finally, remove the temporary branch:

$ git branch -d hotfix-1.2.1
Deleted branch hotfix-1.2.1 (was abbe5d6).

Summary

While there is nothing really shockingly new about this branching model, the “big picture” figure that this post began with has turned out to be tremendously useful in our projects. It forms an elegant mental model that is easy to comprehend and allows team members to develop a shared understanding of the branching and releasing processes.

A high-quality PDF version of the figure is provided here. Go ahead and hang it on the wall for quick reference at any time.

Update: And for anyone who requested it: here’s the gitflow-model.src.key of the main diagram image (Apple Keynote).


Git-branching-model.pdf

https://nvie.com/posts/a-successful-git-branching-model/
Auto-generate classes for your Core Data data model, revisited

A few months ago, I wrote about automatically generating classes for your Core Data entities and how to automate Xcode using user scripts, such that, when your model changed, you only needed to run your custom script again and your intermediate model files would reflect the new situation.

Well, the guys from the mogenerator project have come up with a far superior solution in the meantime. The newest version of mogenerator comes with an Xcode plugin named Xmo’d, which monitors your *.xcdatamodel file and, as soon as it changes, regenerates all of the necessary files.

This means that there is officially no more reason not to use mogenerator.

To set it up, download the installer package from their (improved) project website and install it. (Before installing, please read the important release note about the renamed method +newInManagedObjectContext:.)

When installed, all you need to do is Command-click your *.xcdatamodel file, click Get Info, switch to the Comments tab and add the string “xmod” to the comment field. This is the trigger for Xmo’d to start (re)generating your machine-classes (the underscored class files) when the data model changes. Brilliant!

Oh, the default location at which the generated files will be emitted, is in a folder named after your project, right next to where your *.xcdatamodel already sits:

Enjoy it and spread the word!

https://nvie.com/posts/auto-generate-classes-for-your-core-data-data-model-revisited/
Automatically generate classes for your Core Data data model

When designing a Core Data data model for your Xcode projects, you can choose to create Objective-C object wrappers for your entities, so that you can profit from type-safe code. The normal, tedious, workflow for this is that you select each entity from the model designer, select all of its attributes and relationships, Ctrl-click it and from the contextual menu first select “Copy Obj-C 2.0 Method Declarations To Clipboard”, paste it into the appropriate class header file, then do the same thing for the method implementations in the class implementation file. Waaaaaay too much work. Not to mention the manual copy-pastes are really hard to keep in sync once you start adding functionality to these class files, since you don’t want to overwrite those additions, but you want to keep replacing everything else.

Meet mogenerator

Fortunately, there is a great way of automating this process, using mogenerator. The tool can be downloaded as a DMG installer (Aral Balkan’s blog mentions a workaround for older Xcode versions, but for Xcode 3.1.3 it worked out of the box for me), or you can check out the sources from GitHub and build it yourself.

The mogenerator command line tool eases this generation process by reading the *.xcdatamodel file and generating both class files and intermediate class files for each entity. The intermediate classes (called machine classes) are continuously overwritten by subsequent regenerations, so you should never edit the contents of these files. The actual model object classes (called human classes) inherit from those intermediate classes with a default empty implementation, allowing for all manual extensions.

For example, when you design a model with two entities Foo and Bar, mogenerator can be invoked as follows:

$ mogenerator -m MyDocument.xcdatamodel -M Entities -H Model

The flag -m sets the input model file, while -M and -H specify the output directories where the machine and human classes should be generated respectively.

This does a few things:

  • In the Entities subdirectory, there will be generated header and implementation files for NSManagedObject subclasses called _Foo and _Bar;
  • In the Model subdirectory, there will be generated classes called Foo and Bar—respective subclasses of _Foo and _Bar. These are only created if not available yet. Otherwise, they are left as is.

Wrapping it up

The trick of how mogenerator works is that you can run the script as often as you want. After every change in your model, you’ll want to re-run the generation again to update the machine classes. You could easily leave Xcode, switch over to Terminal and issue the command above. But you’ll get quite tired of that after a few times.

Therefore, I’ve written a custom user script that can be added to Xcode (see figure), which does the following:

  • You can configure the output directories in the first lines of the script. There is no per-project configuration, so choose them as you would like to use them with all your projects;
  • Mind that these generated files are not automatically included in your Xcode project. Drag them there once and ideally put the machine generated classes into a group under “Other resource”, so you never have to see them again. Whenever you add a new class to your model, new files will be generated, so again you must drag the new files to reference those, of course!
  • The script can be run with any file in the project opened. It starts out with that file and walks up the directory tree to search for your Xcode project. If found, it executes all the rest from your project directory. (Suggestions are welcome, I could not find a better implementation since a variable like %%%{PBXProjectPath}%%% does not seem to exist.)
  • It invokes mogenerator to generate all model classes for the project. It is smart enough to detect whether you are using Brian Webster’s BWOrderedManagedObject in your project. If so, your generated machine classes will inherit from BWOrderedManagedObject instead of NSManagedObject.

To add this script to Xcode, open the menu Scripts (the icon) > Edit User Scripts… Click the “+”-button on the bottom-left and select “New shell script”. Set the values for Input, Directory, Output and Errors as in the screenshot above, then copy-paste the script below into the code window. Add a nice keyboard shortcut to this action to top it off :-) I’ve chosen ⌥⌘G for this.

Please feel free to leave any comments if this helped you.

#!/bin/sh
#
# Automatic (re)generation of model classes for all *.xcdatamodel files.
# Written by Vincent Driessen
#
# You are free to use this script in any way.
# The original blog post is http://nvie.com/archives/263
#

# Define output directories
MACHINE_DIR="Entities"
MODEL_DIR="Model"

# Look for the Xcode project directory for this file
cd `dirname "%%%{PBXFilePath}%%%"`
while [ `ls -d *.xcodeproj 2>/dev/null | wc -l` -eq 0 ]; do
    cd ..
    if [ "`pwd`" = "/" ]; then
        echo "No Xcode project found."
        exit 1
    fi
done

echo "Project directory is `pwd`"

#
# Check to see whether the base class is just a default (NSManagedObject) or
# maybe Brian Webster's excellent BWOrderedManagedObject.
# http://fatcatsoftware.com/blog/2008/per-object-ordered-relationships-using-core-data
#
# NOTE:
# The check really is quite arbitrary: if there exists a file called
# BWOrderedManagedObject.h somewhere below the project root directory, we
# assume that we want to use this as the base class for all generated classes.
#
EXTRA_FLAGS=
if [ `find . -name BWOrderedManagedObject.h | wc -l` -gt 0 ]; then
    # Plain assignment: "+=" is a bash extension and not portable under /bin/sh
    EXTRA_FLAGS="--base-class BWOrderedManagedObject"
fi

# Generate the model classes using mogenerator
for model in `find . -name '*.xcdatamodel'`; do
   # The output directories have to exist, so create them
   mkdir -p "${MACHINE_DIR}" "${MODEL_DIR}"
   mogenerator ${EXTRA_FLAGS} -m "${model}" -M "${MACHINE_DIR}" -H "${MODEL_DIR}"
done
https://nvie.com/posts/automatically-generate-classes-for-your-core-data-data-model/
NSManagedObjectContext extensions

The Core Data framework rules, and its API is really really powerful. But really, why does the Core Data API require us to write so much boilerplate code? Simple things need to be simple.

Why is the deletion of a managed object from the NSManagedObjectContext so easy:

[context deleteObject:someObject];

Compared to its creation:

[NSEntityDescription insertNewObjectForEntityForName:@"someObjectClassName"
                              inManagedObjectContext:context];

Extending NSManagedObjectContext

Add the following category on NSManagedObjectContext to all of your Core Data projects and your pains will be history.

@implementation NSManagedObjectContext(NSManagedObjectContextConvenienceMethods)

- (id)newObject:(Class)entity {
   return [NSEntityDescription insertNewObjectForEntityForName:[entity description]
                                        inManagedObjectContext:self];
}

@end

Now, a call to create a new object is as easy as deleting it.

[context newObject:[someEntity class]];

Further enhancements of NSManagedObject

Matt Gallagher has written an excellent article about how to further enhance NSManagedObject for adding simple, one-line fetch support. Be sure to check it out.

https://nvie.com/posts/nsmanagedobjectcontext-extensions/
NSPredicateEditor tutorial

Cocoa offers a nice visual editor for editing NSPredicate objects templates, called NSPredicateEditor. The NSPredicateEditor can be set up using code or in Interface Builder, which is preferable for simple use. The setup is fairly easy once you know how to do it. In this tutorial, we’ll be building a simple predicate editor example which shows the basic functionality of the predicate editor.

Setting up the AppDelegate

Begin by creating a new Xcode project (⌘⇧N). Name your project wisely and create a new class in the Classes group, called AppDelegate.

Switch to the header file and declare two IBOutlets for the main window and the sheet on which we’re going to display the editor in a few minutes. Also, add two IBActions called -openEditor: and -closeEditor:. Finally, add an ivar that holds the NSPredicate we’re going to be editing.

Next, we’re going to fire up Interface Builder to build the UI. Double click on the MainMenu.xib file under the Resources group.

Drag an NSObject object from the Library into the XIB and call it App Delegate. Hit ⌘6 and make it a subclass of the AppDelegate class we just created. Then, hook it up to the delegate property of the File’s Owner.

Drag a new NSWindow to the XIB file and call it Sheet. Make sure the checkbox “Visible At Launch” is deselected or the sheet will not display properly at runtime. Open the main window and add an NSButton and an NSTextView to it. To the sheet window, drag an NSPredicateEditor and an NSButton. They should look somewhat like this now:

Now, we can hook up the outlets and actions as usual. Hook up the Edit Predicate button on the main window to -openEditor: and the OK button on the sheet window to -closeEditor:. Then hook up the mainWindow and sheet outlets of the AppDelegate class to the respective NSWindow objects.

Configure the NSPredicateEditor

Once we have all of the connections between Xcode and Interface Builder set up, we can continue to configure the predicate editor itself, which is actually what this tutorial is all about. An NSPredicateEditor control uses a list of NSPredicateEditorRowTemplate objects that can handle individual (simple) NSPredicate objects. Combining these row templates enables the NSPredicateEditor to edit compound predicates. There is no limitation to the depth of nested compound predicates, although nesting too deep would not be advisable from a usability perspective.

In the edit window, click a few times until the “name contains” row template is selected. In this row template, you define which key paths are supported. Supported here means two things:

  • matching—given an existing predicate with this key path in it on the left-hand side, this row template can be used to alter the predicate;
  • generation—when using the editor to create new predicates, adding a new rule for this key path will generate a predicate for this key path.

Gotcha

A small gotcha, at least one that initially wrong-footed me, is that there is quite a difference between the rows that you see at design time in Interface Builder and the rows that are available at run time. At design time, you define the NSPredicateEditorRowTemplate objects, while at run time you see instances of them. Hence, the number of rows at design time is the number of different row templates available. At run time, however, the number of rows is the number of predicates within the compound predicate (each of which has an associated row template instance that handles it). Subtle difference.

In short, in Interface Builder, create a row template for each type of match that you want to allow. Typically, this means for each data type that you want to support. In our example, we have the following setup:

  • Row template #1 is for all string matches. Here, we have defined it for the key paths “firstname”, “lastname”, “address.street” and “address.city”. They, by definition, have the same allowed operators. If we want a different set of operators for a specific key path, we need to define a separate row template for it.
  • Row template #2 is for date matches, i.e. our “birthdate” key path.
  • Row template #3 is for all integer matches, i.e. our “address.number” key path.

The result looks like this:

Using bindings to connect the predicate to the UI

Next up, we simply connect both the text view from the main window and the predicate editor from the sheet window to the predicate key path using Cocoa bindings. In order to do so, select the NSPredicateEditor (first click the control to select the scroll view, then click again to select the inner NSPredicateEditor), hit ⌘4. Then, unfold the “Value” binding and hook it up to the App Delegate’s “predicate” key path.

Do the same for the text view in the main window, but this time hook it up to the “predicate.description” key path (since only strings can be displayed in a text view). When you do this, make sure that the text view is read-only, since the description property of objects should never be set.

Writing the code to wrap it all up

Finally, we have only a bit of code to write in our AppDelegate implementation, so let’s go:

//
//  AppDelegate.m
//  PredicateEditorTest
//
//  Created by Vincent on 20-07-09.
//

#import "AppDelegate.h"

// Each line but the last ends in a backslash so the macro spans all five strings
#define DEFAULT_PREDICATE @"(firstname = 'John' AND lastname = 'Doe') " \
                    @"OR birthdate > CAST('01/01/1985', 'NSDate') " \
                    @"OR address.city = 'Chicago' " \
                    @"AND address.street != 'Main Street' " \
                    @"OR address.number > 1000"

@implementation AppDelegate

- (id)init
{
   self = [super init];
   if (self != nil) {
      predicate = [[NSPredicate predicateWithFormat:DEFAULT_PREDICATE] retain];
   }
   return self;
}

- (void)dealloc
{
   [predicate release];
   [super dealloc];
}

- (IBAction)openEditor:(id)sender
{
   [NSApp beginSheet:sheet
      modalForWindow:mainWindow
      modalDelegate:nil
      didEndSelector:NULL
        contextInfo:nil];
}

- (IBAction)closeEditor:(id)sender
{
   [NSApp endSheet:sheet];
   [sheet orderOut:sender];
}

@end

In the -init method, we initialize the AppDelegate by setting and retaining a reference to a rather complex default predicate. When the XIB is loaded at run time, the text view shows exactly this predicate and it can be edited by invoking the edit sheet.

The actual implementations of the -openEditor: and -closeEditor: methods aren’t too exciting.

Downloading the source

You can download the source code for this tutorial as an Xcode project here.


PredicateEditorTest.zip

Have a blast!

https://nvie.com/posts/nspredicateeditor-tutorial/