How Microsoft continvoucly morged my Git branching diagram.
A few days ago, people started tagging me on Bluesky and Hacker News about
a diagram on Microsoft's Learn portal. It looked... familiar.
In 2010, I wrote A successful Git branching
model and created
a diagram to go with it. I designed that diagram in Apple Keynote, at the time
obsessing over the colors, the curves, and the layout until it clearly
communicated how branches relate to each other over time. I also published the
source file so others could build on it. That diagram has since spread
everywhere: in books, talks, blog posts, team wikis, and YouTube videos.
I never minded. That was the whole point: sharing knowledge and letting the
internet take it by storm!
What I did not expect was for Microsoft, a trillion-dollar company, some 15+
years later, to apparently run it through an AI image generator and publish the
result on their official Learn portal, without any credit or link back to the
original.
The AI rip-off was not just ugly. It was careless, blatantly amateurish, and
lacking any ambition, to put it gently. Microsoft unworthy. The carefully
crafted visual language and layout of the original, the branch colors, the lane
design, the dot and bubble alignment that made the original so readable—all of
it had been muddled into a laughable form. Proper AI slop.
Arrows were missing or pointing in the wrong direction, and the obvious
"continvoucly morged" text quickly gave it away as a cheap AI artifact.
It had the rough shape of my diagram though. Enough, actually, that people
recognized the original in it and started calling Microsoft out on it and
reaching out to me. That so many people were upset about this was really nice,
honestly. That, and "continvoucly morged" was a very fun meme—thank you,
internet! 😄
Oh god yes, Microsoft continvoucly morged my diagram there for sure 😬
Other than that, I find this whole thing mostly very saddening. Not because
some company used my diagram. As I said, it's been everywhere for 15 years and
I've always been fine with that. What's dispiriting is the (lack of) process
and care: take someone's carefully crafted work, run it through a machine to
wash off the fingerprints, and ship it as your own. This isn't a case of being
inspired by something and building on it. It's the opposite of that. It's
taking something that worked and making it worse. Is there even a goal here
beyond "generating content"?
What's slightly worrying me is that this time around, the diagram was both
well-known enough and obviously AI-slop-y enough that it was easy to spot as
plagiarism. But we all know there will just be more and more content like this
that isn't so well-known, or that will soon get mutated or disguised in such
advanced ways that the plagiarism will no longer be recognizable as such.
I don't need much here. A simple link back and attribution to the original
article would be a good start. I would also be interested in understanding how
this Learn page at Microsoft came to be, what the goals were here, and what the
process has been that led to the creation of this ugly asset, and how there
seemingly has not been any form of proof-reading for a document used as
a learning resource by many developers.
Every person in the empty list is an adult? That sounds weird when you say it.
It doesn't matter what predicate you pass in here though, it will not affect
the result. In other words, every person in the empty list is also NOT an
adult 😉
One intuition that can help with this is to think of .some() and .every()
as chains of logical "or" and "and" expressions:
// .some() is a chain of "or"s
isAdult(homer) || isAdult(marge) || isAdult(bart) || ...
// .every() is a chain of "and"s
isAdult(homer) && isAdult(marge) && isAdult(bart) && ...
You can tack a constant to the beginning of these chains in a way that keeps
them logically equivalent:
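The same identity can be checked in Python, where any() and all() play the roles of .some() and .every(). This is only a sketch in a different language than the JavaScript above; is_adult() and the sample ages are made up for illustration.

```python
def is_adult(age):
    return age >= 18

ages = [39, 36, 10]  # hypothetical sample data

# Prepending the identity element leaves each chain's value unchanged:
# False for a chain of "or"s, True for a chain of "and"s.
some_adult = False or is_adult(ages[0]) or is_adult(ages[1]) or is_adult(ages[2])
every_adult = True and is_adult(ages[0]) and is_adult(ages[1]) and is_adult(ages[2])

print(some_adult == any(is_adult(a) for a in ages))   # True
print(every_adult == all(is_adult(a) for a in ages))  # True

# With an empty list, only the constant remains.
empty = []
print(any(is_adult(a) for a in empty))      # False
print(all(is_adult(a) for a in empty))      # True
print(all(not is_adult(a) for a in empty))  # True
```

With nothing left in the chain but the constant, "some" is false and "every" is true, whatever the predicate is.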
Every developer has their own favorite Git tricks they use daily. Here are
some of my favorite ones I have been using for as long as I can remember.
First of all, I should mention that most of these commands are bundled in my
git-toolbelt project. If you'd like to use them, all you need to do
is install it like so:
While working on a branch, I often find the need to re-open the files I was
working on. The Git toolbelt project contains a command to show you all
locally modified files. It will only report files that still exist locally, so
this overview won't include deleted files.
$ git modified
controllers/foo.py
README.md
This is super useful to quickly open all locally modified files in your editor.
Definitely one of my most-used commands throughout the day:
$ vim $(git modified)
After quitting your editor, you can easily re-open the files you're working on
this way.
To also include any files modified in the index (files that are git add'ed
already), use the -i flag:
$ git modified -i
You can also pass it a commit SHA, which will open all files modified in that
commit:
$ git modified HEAD~1
I have the following aliases set up in my shell, for quickly opening
a specific set of files:
- vc: vim locally modified files (not indexed)
- vca: vim all locally modified files (including the ones indexed)
- vch: vim files modified in the last commit (HEAD)
- vc HEAD~1: vim all files modified in the second-last commit
You're probably familiar with git commit --amend to incorporate the
currently-staged changes into the last commit, effectively rewriting the last
commit. The toolbelt offers a similar command called git fixup, which will
do the same, but without prompting for the commit message. So it's like
a quicker version of commit --amend.
$ git fixup
This is a great way to build up a commit incrementally. A very typical flow for me looks like:
$ git add -p # Pick bits to commit
$ git commit
$ git add -p # Pick more bits
$ git fixup # Add those to the last commit
Sometimes I make a mistake and I accidentally commit too much, or something
I didn't intend to commit. For example, an extra file I accidentally added, or
a patch within a file that I didn't want to include. Here's how I fix that:
$ git delouse
This "empties" the last commit. Think of "emptying" as keeping the commit
message and the author/date info, but "moving" all of its changes back into the
work tree.
Technically:
1. Soft-resets the last commit, which means it will remove the last commit from
   the branch, and put back the contents of that commit into the work tree
   (basically reverting to the state right before committing). File contents
   aren't changed by this, only the Git commit disappears.
2. Adds an empty commit with the same commit message and author details as the
   commit that was just removed.
The net result of these actions is that it appears as if the last commit on
the branch is "emptied" back into your work tree. This command is
non-destructive, since all of your files remain untouched. They are now just
local changes again.
This allows re-adding all changes again. Just use git add -p to select the
bits to commit, and then git fixup (see previous section) to keep changing
the last commit, effectively rebuilding it from scratch.
Because git delouse kept the commit message and author information around in
that empty commit, the original commit info is never lost, and you don't have
to re-enter the commit message whenever you run git fixup, which makes this
whole process super cheap.
Typical flow:
$ git commit -m 'Add login screen'
# Oops! Checked in a secret key with that... let's fix this mistake!
$ git delouse
# Retry adding stuff
$ git add -p # This time, don't add the secret key
$ git fixup # Rewrites the previous commit
And if you make a mistake, you can just run git delouse again and start over,
as often as you want. Since none of these commands destroy your local changes,
this allows you to carefully craft your commit contents without the risk of
losing any data.
It's also a great way to split up a commit! For example, suppose you are
adding a bugfix but you also renamed a variable to have clearer meaning. When
submitting the code for review, you realize that the variable rename adds a lot
of noise to the actual change. You may then decide it's a good idea to split
up this commit into two pieces: one that atomically just changes the variable
name everywhere, and one that fixes the bug. You can then point to the bugfix
commit when asking for a code review.
How would that work?
$ git commit -m 'Bugfix for login screen'
# Oops, I should've split this one up. Let's start over!
$ git delouse
$ git add -p # Just pick the bugfix bits
$ git fixup
$ git add -p # Now pick the var rename bits
$ git commit -m 'Rename variable name to be clearer'
These three commands have become indispensable in my day-to-day Git routine.
If you like it, let me know!
Today, I’m thrilled to publicly announce a new open source project that we’ve been using at SimpleContacts in production for months:
[**decoders**](https://www.npmjs.com/package/decoders).
To get started:
$ npm install decoders
Here’s a quick example of what decoders can do:
import { guard, number, object, optional, string } from 'decoders';
// Define your decoder
const decoder = object({
  name: string,
  age: optional(number),
});
// Build the runtime checker ("guard") once
const verify = guard(decoder);
// Use it
const unsafeData = JSON.parse(request.body);
// ^^^^^^^^^^^^ Could be anything!
const data = verify(unsafeData);
// ^^^^ Guaranteed to be a person!
// Now, Flow/TypeScript will _know_ the type of data!
data.name; // string
data.age; // number | void
data.city;
// ^^^^ 🎉 Type error! Property `city` is missing in `data`.
When writing JavaScript programs (whether that’s for the server, or the
browser), one tool that has become indispensable for maintainable code bases is
a type checker like Flow or TypeScript. Disclaimer: I’m mainly
a Flow user, but everything in this post also applies to TypeScript (= also
great). Using a static type checker makes changing large JS code bases
possible in ways that weren’t possible before.
One area where Flow (or TypeScript) coverage is typically hard to achieve is
when dealing with external data. Think of any form of user input, an HTTP
request body, or even the results of a database query: these are all “external”
from your app’s perspective. How can we type those things?
For example, say your app wants to do something with data coming in from a POST
request with some JSON body:
const data = JSON.parse(request.body);
The type of data here will be “any”. The reason is of course that we’re
dealing with a static type checker. So even though Flow will know that the
input to JSON.parse() must be a string, it doesn’t know which string and
the type of JSON.parse()’s return value will be defined by the value of the
string at runtime. In other words, it could be anything.
Statically, it’s impossible to know the return type. That’s why Flow can only
define this type signature as:
JSON.parse :: (value: string) => any;
Worse, even, is that using these any-typed values may implicitly
(unknowingly) turn off type checking, even for code that’s type-safe otherwise.
For example, if you feed an implicitly-any value to a type-safe function
like:
function greet(name: string): string {
  return 'Hi, ' + name + '!';
}
const data = JSON.parse(request.body);
greet(data.name);
Then Flow will just accept this, because data is any, and thus data.name
is any. But of course this isn’t safe! In this example, data cannot and
should not be trusted. Flow lets arbitrary values get passed into greet()
anyway, despite its type annotation!
Especially in real applications this puts a significant practical cap on Flow’s
usefulness. Using any (whether implicit or explicit) is completely unsafe,
and should be avoided like the plague.
How, then, can we statically type these seemingly dynamic beasts? We can do so
if we change our perspective on the problem a little bit.
Rather than trying to let Flow infer the type of a dynamic expression (which
is impossible), what if we would have a way to instead specify the type we
are expecting, and have an automatic type-check injected at runtime that will
verify those assumptions? This way, Flow is able to know, statically, what the
runtime type will be.
As you might have guessed, this is exactly what the decoders library
offers.
You can use decoders’ library of composable building blocks that allow you to
specify the shape of your expected output value:
import type { Decoder } from 'decoders';
import { guard, number, object, string } from 'decoders';
type Point = {
  x: number,
  y: number,
};
const pointDecoder = object({
  x: number,
  y: number,
});
const asPoint = guard(pointDecoder);
const p1: Point = asPoint({ x: 42, y: 123 });
const p2: Point = asPoint({ x: -3, y: 0, z: 1 });
There are a few interesting pieces to this example.
First of all, you’ll notice the similarity between the Point type, and the
structure of the decoder.
Also note that, by wrapping any value in an asPoint() call, Flow will
know—statically—that p1 and p2 will be Point instances. And therefore
you get full type support in your editor like tab completion, and full Flow
type safety like you’re used to elsewhere.
How? Because if the data does not match the decoder’s description of the data,
the call to asPoint() will throw a runtime error. This will be the case
in the unhappy path, for example:
const p3: Point = asPoint({ x: 42 });
// ^^^^^^^^^^^^^^^^^^ Runtime error: Missing "y" key
const p4: Point = asPoint(123);
// ^^^^^^^^^^^^ Runtime error: Must be object
Decoders come with batteries included and these base decoders are designed to
be infinitely composable building blocks, which you can assemble into complex
custom decoders.
The simplest decoders you can use are those for the scalar types: number,
boolean, and string. From there, you can compose them
into higher order decoders like object(), array(),
optional(), or nullable() to create more complex
types.
The library also comes with a special regex() decoder, which is like the
string decoder, but will additionally perform a regex match and only allows
string values that match:
const hexcolor = regex(
  /^#[0-9a-f]{6}$/,
  'Must be hex color', // Shown in error output
);
You can then reuse these new decoders above by composing them into a polygon
decoder. Notice the reuse of the hexcolor and the point decoders here.
Notice how the fill and stroke fields here end up as normal strings.
Statically, Flow only knows that they are going to be string values, but at
runtime, they will only contain hex color values that matched the regex.
(Decoders are therefore more expressive than the type system in describing what
values are allowed.)
Note: It is not recommended to go fully overboard with this feature.
Decoders are best kept simple and straightforward, staying close to the
values they express, and not perform too much "magic" at runtime.
The best way to discover which decoders are available is to look at the
reference docs.
Human readable and helpful error messages are considered important. That’s why
decoders will always try to emit very readable error messages at runtime,
inlined right into the actual data. An example of such a message would be:
Decode error:
{
  "firstName": "Vincent",
  "age": "37",
         ^^^^
         Either:
         - Must be undefined
         - Must be number
}
^ Missing key "name"
This is a complex error message, but optimized to be very readable to the human
eye when outputted to a console.
The same error information can also be presented as a list of error messages
for outputting in API responses. In this case, the input data isn't echoed
back as part of the error message:
[
  'Value at key "age": Either:\n- Must be undefined\n- Must be number',
  'Missing key: "name"'
]
(For those interested, this inline object annotation is performed by
debrief.js.)
When you have composed your decoder, it’s often useful to turn the outermost
decoder into a “guard”. A guard offers a slightly more convenient API, but
is very much like a decoder. It’s also callable on unverified inputs, but it
will throw a runtime error if validation fails. Guards are therefore typically
easier to work with: using a guard, you can focus on the happy path and
handle any validation errors with normal exception handling mechanisms.
Invoking the decoder directly on an input value will not throw a runtime error
and instead return a so-called “decode result”. A “decode result” is a value
that represents either an OK value or an Error, both of which you'll need to
“unpack” to do anything useful with it.
If you want to programmatically handle the decode result, you can use a decoder
directly and inspect the decode result. If you're just interested in the data
and not in handling any decoding errors, use a guard.
In terms of types:
type Decoder<T> = any => DecodeResult<T>;
type Guard<T> = any => T;
// The guard() helper builds a guard for a decoder of the same type
guard: <T>(Decoder<T>) => Guard<T>;
(For those interested, the DecodeResult type is powered by
lemons’ Result type.)
Yesterday, pip-tools version 1.0 was silently released, officially introducing
the **pip-compile** and **pip-sync** tools, and replacing the current
**pip-dump** and **pip-review** tools.
I've blogged before about these ideas in Pinning Your Packages and Better
Package Management. During the last year, I've been slowly working on the
future branch on the pip-tools repo, and have been using the new tools there.
The pip-sync script was the only thing that was still delaying the release,
but since Hugo Peixoto contributed this one recently,
it's now ready to switch over.
So it's now time to switch over to the new tools if you've been using the old
ones.
If you're using pip-tools 0.x, you'll notice that its main commands, pip-review
and pip-dump are gone. Instead, you'll find two new commands, pip-compile and
pip-sync, which should allow you to do the same things, but arguably in a more
solid way.
Typical usage:
pip install pip-tools
Record your top-level dependencies in requirements.in. Everything you
directly use in your source code should be a top-level dependency.
Don't pin them—unless you want them pinned, of course.
Put both requirements.in and requirements.txt under version control.
Then, run pip-compile. This will produce a requirements.txt that pins the
high-level requirements to the highest versions found on PyPI to match the
given requirements.
Using pip-sync now will install/upgrade/uninstall everything so that your
virtual env exactly matches what's in requirements.txt.
For more information, see the
README of the new tools.
When was the last time you looked at code someone else wrote? (Debugging
doesn't count!) When did you _actually_ study it to learn from it? Perhaps
even ponder over its beauty?
It's surprising how uncommon it is in our industry to look at existing code
just to learn from it. In almost any other engineering or art field, people
constantly study the results of their peers. Books on architecture are a great
example. What makes a certain design so beautiful or effective? Can I learn
something from it to make me a better engineer? I feel we would benefit as
an industry if we would collectively take a little more time to reflect and
study. We should ask ourselves those questions more often, and allocate study
time for it occasionally.
Last week, I ordered the book Beautiful
Code, by Greg Wilson and
Andy Oram. I would recommend this book to any fellow professional programmer.
(All of the book's proceeds go to Amnesty International.)
The book's concept is simple: each of the 33 chapters is written by
a well-respected professional programmer who answers the question: "What is
the most beautiful code you've ever seen?" after which they discuss
elaborately why they think it's beautiful.
Beauty, as it turns out, comes in many shapes and forms. The topics in the
book vary from an elegant algorithm, to a design that lets all the surrounding
puzzle pieces fall into place perfectly. From the cleverness of code
generators, to the expressiveness of a language construct.
Naturally, some chapters have been more interesting than others. But all of
them have left me with either:
- a deep admiration for an elegant solution or cleverness;
- an interesting perspective on good design;
- an appreciation of seemingly banal things, or things I previously did not
  find beauty in;
- insight into how to articulate why exactly something is beautiful (how
  meta).
I was reading Igor Kalnitsky's blog post on why Python's [`map()` is
mad](https://kalnytskyi.com/2015/06/14/mad-map/), and wanted to provide
a different perspective. In fact, I would call the design of Python's `map()`
beautiful instead.
First off, what does map(f, xs) represent mathematically in the first place?
It should invoke function f(x) for every x in xs. Functions, of course,
can take many arguments—single argument functions are just the simplest case.
So what would be reasonable to assume map(f, xs, ys) would do? In the blog
post, Igor suggests the behaviour should be to chain xs and ys, but chances
are they represent completely different things, so chaining them would lead to
a heterogenous collection of items. Mathematically, you would expect the
function calls made to be f(x1, y1), f(x2, y2), ...
Note that this is different from zip()'ing the function arguments.
A function f with 2 arguments is different from a function f with
1 argument, expecting a tuple.
Compare:
def f(x, y):
    return x * y
map(f, ['a', 'b', 'c'], [1, 2, 3]) # ['a', 'bb', 'ccc']
to
def f(pair):
    x, y = pair
    return x * y
map(f, zip(['a', 'b', 'c'], [1, 2, 3])) # ['a', 'bb', 'ccc']
The confusion around the items appearing to be zipped is caused by the implicit
behaviour in Python 2 when the first argument is None. I think it's handled
as a special case, which is unfortunate. A more consistent behaviour would have been to
The fact that the behaviour changed in Python 3 is unfortunate, but I think it
changed for the better. The problem with zip_longest()-like default
semantics is that it will only ever work with finite iterables. If only one of
the given iterables is infinite, the map will be infinite too. Now, perhaps
this is what you want, but in that case you should probably be explicit about
it anyway. I think using zip()-like semantics as the default makes perfect
sense. It enables the following usage in Python 3:
>>> from itertools import count
>>>
>>> def f(x, y):
...     return x * y
...
>>> for x in map(f, ['a', 'b', 'c'], count(1)):
...     print(x)
...
a
bb
ccc
Compare this to Python 2's map behaviour, which would do:
>>> for x in map(f, ['a', 'b', 'c'], count(1)):
...     print(x)
...
a
bb
ccc
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "<stdin>", line 1, in f
TypeError: unsupported operand type(s) for *: 'NoneType' and 'int'
Because it tries to invoke f(None, 4) the fourth time, which happens to fail.
If it did not fail, it would keep producing results forever.
But what if you actually want zip_longest()-like behaviour? Well, you can
either make all arguments be infinite iterables, or you can explicitly wrap
your arguments in a zip_longest() wrapper, and pass that to starmap(),
which will take an iterable of tuples and spread it over the arguments to
f(), just like map:
>>> from itertools import count, islice, starmap, zip_longest
>>>
>>> result = starmap(f, zip_longest(['a', 'b', 'c'], count(1), fillvalue='?'))
>>> for x in islice(result, 7):
...     print(x)
...
a
bb
ccc
????
?????
??????
???????
As a bonus, you can pass in a fillvalue this way, instead of being stuck with
the assumption of None (which could happen to be a valid value within the
iterable).
However, personally, in this case, I'd prefer the following, more readable
version that avoids the zip_longest() and starmap() calls:
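A minimal sketch of what such a version could look like in Python 3, assuming the finite list is padded with an endless run of fill values via itertools.chain() and repeat(). The padding approach and the '?' fill value (borrowed from the example above) are assumptions for illustration.

```python
from itertools import chain, count, islice, repeat

def f(x, y):
    return x * y

# Pad the finite list with an endless supply of '?', so that both
# iterables passed to map() are infinite.
letters = chain(['a', 'b', 'c'], repeat('?'))
result = map(f, letters, count(1))

for x in islice(result, 7):
    print(x)
```

This prints the same seven lines as the zip_longest() and starmap() example, using only plain map().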
Note how you can thus make the map result infinite by simply making all
iterables infinite. Consuming iterables until the first one is exhausted (so
zip()-like), thus, is the sanest default behaviour, and the most beautiful of
the options.
In my previous post, [Use More Iterators][moar-iterators], I have outlined how
to harvest some low hanging fruit when it comes to converting your functions to
generators. In this series of posts I want to take it to the next level and
introduce a few powerful constructs that can assist you when working with
streams.
Previously, I've compared Python's generators to value factories
(producing values lazily) and talked about their
composability. I want to pay some more attention to these
concepts in this blog post.
One particular concept that fits generators like a glove is to use them to
stream data. Streams help you express solutions to many data manipulation
flows and processes elegantly. Of course, this idea is not novel: the concept
of streams finds its roots in the early 60's (as all good CS ideas do).
How do streams fit generators? Since a generator is a function that returns
a "value factory", it's a natural component to act as a "node" in a network of
such generators. Each such component takes input, does something with it, and
emits output.
Take a look at this example to generate a little word puzzle. It generates
a list of the first 5 dictionary words of 20 characters or more that aren't
names, and hides their vowels:
import re
from itertools import islice

vowels_re = re.compile('[aeiouy]')

def all_words():
    with open('/usr/share/dict/words') as f:
        for line in f:
            yield line.strip()

def keep_long_words(iterable, min_len):
    return (word for word in iterable if len(word) >= min_len)

def exclude_names(iterable):
    return (word for word in iterable if word.lower() == word)

def hide_vowels(iterable):
    for word in iterable:
        yield vowels_re.sub('.', word)

def limit(iterable, n):
    return islice(iterable, 0, n)

stream = all_words()
stream = keep_long_words(stream, 20)
stream = exclude_names(stream)
stream = hide_vowels(stream)
stream = limit(stream, 5)
for word in stream:
    print(word)
The variable stream is used to incrementally build up an entire stream
(network of stream processors). It starts with all_words(), the generator
that emits all dictionary words from the dictionary file.
Then with each further step, stream is "wrapped" in another generator, which
is used to chain the generators together. The emitted output of all_words()
will now be consumed by the keep_long_words() generator, emitting only the
words from the input stream that match the length criterion.
We keep "wrapping" these with another filter step (exclude_names()) and
a manipulation step (hide_vowels()), and finally limit the list to return
at most 5 items.
As a result, the variable stream is re-assigned a few times. There is a nice
advantage to this approach: it avoids using any further variables, and allows
us to build up the entire stream line by line. The order in which we build it
up resembles the way the data flows.
Lastly, you can comment out a line in the middle and the code still works. If
you decide you do want names in the result list, simply comment out this
line:
stream = all_words()
stream = keep_long_words(stream, 20)
# stream = exclude_names(stream) # comment this out to skip this step
stream = hide_vowels(stream)
stream = limit(stream, 5)
And the stream will still be valid. This is especially useful while trying out
your streams while you're still developing them. Without using this
intermediate variable, the equivalent would look like this:
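As a sketch, the nested form could be written as follows. To keep it runnable anywhere, a small hypothetical SAMPLE_WORDS list stands in for the dictionary file, and the length threshold and limit are lowered to 6 and 2 to fit that sample.

```python
import re
from itertools import islice

vowels_re = re.compile('[aeiouy]')

# Hypothetical stand-in for /usr/share/dict/words.
SAMPLE_WORDS = ['Python', 'banana', 'cat', 'streams', 'generator']

def all_words():
    return iter(SAMPLE_WORDS)

def keep_long_words(iterable, min_len):
    return (word for word in iterable if len(word) >= min_len)

def exclude_names(iterable):
    return (word for word in iterable if word.lower() == word)

def hide_vowels(iterable):
    return (vowels_re.sub('.', word) for word in iterable)

def limit(iterable, n):
    return islice(iterable, 0, n)

# The whole pipeline as one nested expression: no intermediate variable,
# but the calls read inside-out, in the reverse order of the data flow.
stream = limit(hide_vowels(exclude_names(keep_long_words(all_words(), 6))), 2)

result = list(stream)
print(result)  # ['b.n.n.', 'str..ms']
```

Dropping a step here means editing the middle of the expression and rebalancing parentheses, rather than commenting out one line.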
Putting together the pieces of the stream as we did above is relatively clunky.
There is a better way. What if we could express the thing above like this?
This syntax would have the best of both worlds: it allows you to elegantly
chain together generators using the >> operator without using an intermediate
variable. If A and B are streams, then the result of A >> B is the
composition of both streams, applying A first to its input, then applying
B:
We can actually build this. Let's define a Task, a base class for each such
component that's chainable using the >> operator. It looks like this:
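A minimal sketch of what such a base class could look like. Only the name Task, the process() method, and the >> operator come from the text; the Chain helper, the __iter__ shortcut, and the numbers and double sample tasks are assumptions made for this illustration.

```python
class Task:
    """Base class for a chainable stream component."""

    def process(self, inputs):
        # Subclasses turn an iterable of inputs into an iterable of outputs.
        raise NotImplementedError

    def __rshift__(self, other):
        # A >> B: feed the output of A into B.
        return Chain(self, other)

    def __iter__(self):
        # Iterating a task runs it with an empty input stream.
        return iter(self.process(()))


class Chain(Task):
    """The task that results from A >> B."""

    def __init__(self, first, second):
        self.first = first
        self.second = second

    def process(self, inputs):
        return self.second.process(self.first.process(inputs))


# Hypothetical sample tasks, only here to exercise the operator.
class numbers(Task):
    def process(self, inputs):  # starting task: inputs are ignored
        return iter(range(1, 6))


class double(Task):
    def process(self, inputs):
        return (x * 2 for x in inputs)


stream = numbers() >> double() >> double()
result = list(stream)
print(result)  # [4, 8, 12, 16, 20]
```

Because Chain is itself a Task, a chain can be chained again, which is what makes longer pipelines possible.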
But how do we get from the generator functions of the example above to these
tasks that support the >> operator? We need to convert them to tasks by
implementing the process() method. Here's the exclude_names() function
converted:
class exclude_names(Task):
    def process(self, inputs):
        return (word for word in inputs if word.lower() == word)
And here is an example of converting a function with arguments. The argument
moves to the constructor of the class:
class keep_long_words(Task):
    def __init__(self, min_len):
        self.min_len = min_len

    def process(self, iterable):
        return (word for word in iterable if len(word) >= self.min_len)
Another notable case is the "starting" generator: the one spitting out the
dictionary words. As a generator function, this did not take any inputs, but
as a Task, it still receives the inputs argument, but should ignore it. This
way, we can treat it like any other task, and we'll see an example of how this
is useful later on:
class all_words(Task):
    def process(self, inputs):  # inputs arg is ignored
        with open('/usr/share/dict/words') as f:
            for line in f:
                yield line.strip()
Using this, we can now start to make some abstractions. We can assign a series
of chained tasks to a variable and insert where we need it. This can be used
to clarify what is going on in those steps, or for reusability of a component.
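A sketch of such an abstraction, under stated assumptions: the text only names puzzlify() and says it chains three sub tasks, so which three, the compact Task and Chain definitions, the sample_words task, and the lowered thresholds are all hypothetical.

```python
import re
from itertools import islice

vowels_re = re.compile('[aeiouy]')


class Task:
    def process(self, inputs):
        raise NotImplementedError

    def __rshift__(self, other):
        return Chain(self, other)

    def __iter__(self):
        return iter(self.process(()))


class Chain(Task):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def process(self, inputs):
        return self.second.process(self.first.process(inputs))


class sample_words(Task):  # hypothetical stand-in for all_words()
    def process(self, inputs):
        return iter(['Python', 'banana', 'cat', 'streams', 'generator'])


class keep_long_words(Task):
    def __init__(self, min_len):
        self.min_len = min_len

    def process(self, inputs):
        return (word for word in inputs if len(word) >= self.min_len)


class exclude_names(Task):
    def process(self, inputs):
        return (word for word in inputs if word.lower() == word)


class hide_vowels(Task):
    def process(self, inputs):
        return (vowels_re.sub('.', word) for word in inputs)


class limit(Task):
    def __init__(self, n):
        self.n = n

    def process(self, inputs):
        return islice(inputs, 0, self.n)


def puzzlify():
    # A reusable middle section: it has no source and no end of its own.
    return keep_long_words(6) >> exclude_names() >> hide_vowels()


stream = sample_words() >> puzzlify() >> limit(2)
result = list(stream)
print(result)  # ['b.n.n.', 'str..ms']
```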
Note that calling puzzlify() will return a Task instance (the one that
chains together the three sub tasks). Then, this task instance is further
chained into the larger example. Also note that puzzlify() itself is
a perpetual processor: there's no start or end defined by it. The context it's
used in defines the start and the end.
The chaining primitive allows you to craft complex data streams in an elegant
fashion.
The composition operation (>>) isn't the only task common with streams.
Another one is the split-and-join operation. In this scenario, you may want to
perform multiple operations on a single stream independent of each other. This
can be achieved using the & operator:
This will split the inputs and feed copies of the inputs to both processes.
After applying each task, the results get merged back, exhausting A first,
then B.
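The behaviour described above could be sketched with itertools.tee() for the split and itertools.chain() for the merge. The Fork helper and the numbers, evens, and big sample tasks are assumptions for illustration, as is the compact Task base class.

```python
from itertools import chain, tee


class Task:
    def process(self, inputs):
        raise NotImplementedError

    def __rshift__(self, other):
        return Chain(self, other)

    def __and__(self, other):
        # A & B: run both tasks on copies of the same inputs.
        return Fork(self, other)

    def __iter__(self):
        return iter(self.process(()))


class Chain(Task):
    def __init__(self, first, second):
        self.first, self.second = first, second

    def process(self, inputs):
        return self.second.process(self.first.process(inputs))


class Fork(Task):
    def __init__(self, left, right):
        self.left, self.right = left, right

    def process(self, inputs):
        # tee() gives each side its own copy of the input stream;
        # chain() exhausts the left results first, then the right ones.
        left_in, right_in = tee(inputs)
        return chain(self.left.process(left_in), self.right.process(right_in))


# Hypothetical sample tasks.
class numbers(Task):
    def process(self, inputs):
        return iter(range(1, 7))


class evens(Task):
    def process(self, inputs):
        return (x for x in inputs if x % 2 == 0)


class big(Task):
    def process(self, inputs):
        return (x for x in inputs if x > 4)


stream = numbers() >> (evens() & big())
result = list(stream)
print(result)  # [2, 4, 6, 5, 6]
```

Two consequences of this particular design: an item that passes both sides shows up twice (6 here), and tee() has to buffer the inputs for the right side while the left side is being exhausted, so it suits finite streams best.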
Suppose we would want to filter our dictionary for words that are anagrams or
contain the substring anana (or both). You could do it as follows:
stream = (all_words()
          >> (is_anagram() & contains_anana()))
for word in stream:
    print(word)
The real power of this split-and-join operator comes when you combine it to
perform different actions on each "side" of the split. For example, if you
want to uppercase all of the anagrams, but lowercase all of the words
containing the "anana" substring:
stream = (all_words()
          >> (is_anagram() >> uppercase()
              &
              contains_anana() >> lowercase()))
for word in stream:
    print(word)
Yesterday, a friend asked me how I would solve a certain problem he was facing.
He did have a working solution, but felt like he could make it more generally
applicable. Never one to shy away from a good challenge, I decided to take it and see
how I would solve it. In this blog post you can read about my solution.
Note: It's kept short for brevity: the actual document contained many
more items in each of the nested lists (so multiple groups, categories, and
points), but this snippet covers the overall structure.
This is an incredibly common task I'm sure any programmer with a sufficiently
long career has encountered in one shape or form.
The simple, ad hoc, solution to the problem above is:
def sort_points(d):
    for group in d['res']:
        for cat in group['catlist']:
            cat['points'] = sorted(cat['points'], reverse=True, key=lambda p: p['stop'])
Some downsides in arbitrary order:
Not reusable. This function can only deal with dicts of the exact same
structure. It's required to know the key names and the type of data living
at each nesting level. Any other dict will make it choke;
Brittle. The algorithm itself needs to be changed when the document gets
nested in another document, or when its structure should change. Nest it in
another dictionary, and you need an extra for loop in there. Pass it just
a category, and you need to take out a for loop;
Lacks abstraction. The algorithm mixes traversing the dict with
modifying it—these should be two separate things.
Changes the dict in-place. There's no need to rely on using mutable data
structures here. It should work for dicts you cannot change, too.
The data itself isn't interesting in the slightest. This is a problem of
structure only. Let's try to break out the pieces that are specific for this
problem and see if we can factor out a generic piece, taking the specific parts
as arguments.
The three key tasks we need to perform:
traverse the entire structure recursively;
determine if we've arrived at a given designated path;
change the nested structure at that location (using given function).
Note that these steps already show our function params. We need a way of
specifying the "path" to drill down into, and specify what transformation to
apply to those targeted elements.
First and foremost, we need a way of traversing arbitrarily nested structures.
Given that this is only JSON data (so limited to strings, numbers, lists and
dicts), we can start with a recursive function that will walk the structure and
produce a new output which will essentially be a deep copy of the input:
def traverse(obj):
    if isinstance(obj, dict):
        return {k: traverse(v) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [traverse(elem) for elem in obj]
    else:
        return obj  # no container, just values (str, int, float)
To specify the target path into the structure, we need to come up with a syntax
that can express those, a mini-language. Here's an example:
'res[].catlist[].points'
This notation clearly specifies the steps to take to drill down into the
structure to arrive at any nested element. Note that each step is explicit:
it's either a step into a dict key (the string), or into a list (the empty
list). This convenient string notation is of course just sugar for the
following:
['res', [], 'catlist', [], 'points']
How can we use this structure? Let's take the traverse() function and change
it to keep a record of the paths it's traversing along as it traverses:
def traverse(obj, path=None):
    if path is None:
        path = []
    if isinstance(obj, dict):
        return {k: traverse(v, path + [k]) for k, v in obj.items()}
    elif isinstance(obj, list):
        return [traverse(elem, path + [[]]) for elem in obj]
    else:
        return obj
But we're not doing anything with the path we're tracking this way yet.
Now we need to use that path and in the interesting case perform an action.
One way is to add that to the traverse() function, but it would do more than
just traversing that way. Let's update the traverse function with a callback
argument that will get called for every node in the structure (every leaf, dict
entry, or list item). This would fully decouple traversing the structure from
modifying it.
def traverse(obj, path=None, callback=None):
    if path is None:
        path = []

    if isinstance(obj, dict):
        value = {k: traverse(v, path + [k], callback)
                 for k, v in obj.items()}
    elif isinstance(obj, list):
        value = [traverse(elem, path + [[]], callback)
                 for elem in obj]
    else:
        value = obj

    # if a callback is provided, call it to get the new value
    if callback is None:
        return value
    else:
        return callback(path, value)
Now the traverse() function is really generic and can be used to replace any
node in the structure. We can now implement our traverse_modify() function
that will look for a specific node, and update it. In this example, the
transformer() function is our callback that will be invoked on every node in
the structure. If the current path matches the target path, it will perform
the action.
def traverse_modify(obj, target_path, action):
    target_path = to_path(target_path)  # converts 'foo.bar' to ['foo', 'bar']

    # This will get called for every path/value in the structure
    def transformer(path, value):
        if path == target_path:
            return action(value)
        else:
            return value

    return traverse(obj, callback=transformer)
There we have it: a generic, reusable function that abstracted out each
individual interaction: traversal, path matching, and action. To solve our
original problem with it, now simply call:
from operator import itemgetter

def sort_points(points):
    """Will sort a list of points."""
    return sorted(points, reverse=True, key=itemgetter('stop'))

# traverse_modify() returns a new structure; doc itself is left unchanged
result = traverse_modify(doc, 'res[].catlist[].points', sort_points)
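The to_path() helper is used above but never shown, so here is a sketch that puts all the pieces together in one runnable listing. The to_path() body is my assumption of what such a helper could look like (it only handles the key and [] steps described above), and the miniature doc is a made-up stand-in for the real document.

```python
import re
from operator import itemgetter


def to_path(spec):
    """Assumed helper: 'res[].catlist[].points' -> ['res', [], 'catlist', [], 'points']."""
    if isinstance(spec, list):
        return spec  # already in list form
    path = []
    for part in spec.split('.'):
        match = re.fullmatch(r'([^\[\]]+)((?:\[\])*)', part)
        if match is None:
            raise ValueError('invalid path segment: %r' % (part,))
        key, brackets = match.groups()
        path.append(key)
        # every '[]' suffix is one step into a list
        path.extend([] for _ in range(len(brackets) // 2))
    return path


def traverse(obj, path=None, callback=None):
    if path is None:
        path = []
    if isinstance(obj, dict):
        value = {k: traverse(v, path + [k], callback) for k, v in obj.items()}
    elif isinstance(obj, list):
        value = [traverse(elem, path + [[]], callback) for elem in obj]
    else:
        value = obj
    return value if callback is None else callback(path, value)


def traverse_modify(obj, target_path, action):
    target_path = to_path(target_path)

    def transformer(path, value):
        return action(value) if path == target_path else value

    return traverse(obj, callback=transformer)


def sort_points(points):
    return sorted(points, reverse=True, key=itemgetter('stop'))


# Hypothetical miniature document with the same nesting as the original problem
doc = {'res': [{'catlist': [{'points': [{'stop': 1}, {'stop': 3}, {'stop': 2}]}]}]}

result = traverse_modify(doc, 'res[].catlist[].points', sort_points)
stops = [p['stop'] for p in result['res'][0]['catlist'][0]['points']]
print(stops)  # [3, 2, 1]
```

Note that the input is never mutated: doc still holds its points in the original order afterwards.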
This blog post was meant to provide some insight into a typical software
engineering approach to problem solving. Breaking down a problem into pieces,
and abstracting out the essence is perhaps the most joyful thing to do as an
engineer.
The constructed traverse_modify() function could be written in many different
ways if you would like to give it more power. Examples are supporting the
following path specs:
qux[].(foo|bar)[] # match both foo or bar
results[0..4] # only apply action to first 4 results
foo.bar.* # match any key value
foo.bar.**.qux # match 0 or more levels of keys
...
Another extension would be to allow it to take multiple target-path/action
combinations and apply them all in a single traversal.
All of that is left as an exercise to the reader though, to keep this post
clear and concise.
The fantastic thing is that, picking a good abstraction for the traverse()
function opens a door to an entire world of possibilities to modify arbitrary
Python object structures. In the end, you get more than you initially set out
for.
If you build something cool based on this however, please let me know :)
One of my top favorite features of the Python programming language is
generators. They are so useful, yet I don't encounter them often enough when
reading open source code. In this post I hope to outline their simplest use
case and hope to encourage any readers to use them more often.
This post assumes you know what containers and iterators are. I've explained
these concepts in a previous blog post. In
a follow-up post, I elaborate on what can be achieved with
thinking in streams a bit more.
Why are iterators a good idea? Code using iterators can avoid intermediate
variables, is often shorter, runs lazily, consumes less memory, runs faster,
is composable, and is more beautiful. In short: it is more elegant.
"The moment you've made something iterable, you've done something magic with
your code. As soon as something's iterable, you can feed it to list(),
set(), sorted(), min(), max(), heapify(), sum(), …. Many of the
tools in Python consume iterators."
Recently, Clojure added transducers to the language, which is a concept pretty
similar to generators in Python. (I highly recommend watching Rich Hickey's
talk at Strange Loop 2014 where he introduces them.)
In the video, he talks about "pouring" one collection into another, which
I think is a verb that very intuitively describes the nature of iterators in
relationship to datastructures. I'm going to write about this idea in more
detail in a future blog post.
Composability is key here. Iterators are incredibly composable. In the
example, a list is built explicitly. What if the caller actually needs a set?
In practice, many people will either create a second, set-based version of the
same function, or simply wrap the call in a set(). Surely that works, but it
is a waste of resources. Imagine the large file again. First a list is built
from the entire file. Then it's passed to set() to build another collection
in memory. Then the original list is garbage collected.
With generators, the function just "emits" a stream of objects. The caller
gets to decide into what collection those objects get poured.
Want a set instead of a list?
uniq_lines = set(get_lines(f))
Want just the longest line from the file? The file will be read entirely, but
at most two lines are kept in memory at all times:
longest_line = max(get_lines(f), key=len)
Want just the first 10 lines from the file? No more than 10 lines will be read
from the file, no matter how large it is:
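The definition of get_lines() is not shown in this part of the post, so the sketch below assumes it is a simple generator over the lines of a file object. With that assumption, itertools.islice() is one way to take just the first 10 lines; an in-memory StringIO stands in for the large file here.

```python
import io
from itertools import islice


def get_lines(f):
    """Assumed shape of get_lines(): lazily yield each line without its newline."""
    for line in f:
        yield line.rstrip('\n')


# In-memory stand-in for a large file of 1000 lines
f = io.StringIO(''.join('line %d\n' % i for i in range(1000)))

first_ten = list(islice(get_lines(f), 10))
print(first_ten[0], first_ten[-1])  # line 0 line 9
```

Because islice() stops asking for values after the tenth one, the generator never pulls the remaining lines out of the file.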
Update: At PyCon 2013, Ned Batchelder gave a great talk that perfectly
reflects what I tried to explain in this blog post. You can watch it here,
I highly recommend it:
Don't collect data in a result variable. You can almost always avoid them.
You gain readability, speed, a smaller memory footprint, and composability in
return.
Containers are data structures holding elements, and that support membership
tests. They are data structures that live in memory, and typically hold all
their values in memory, too. In Python, some well known examples are:
list, deque, …
set, frozenset, …
dict, defaultdict, OrderedDict, Counter, …
tuple, namedtuple, …
str
Containers are easy to grasp, because you can think of them as real life
containers: a box, a cupboard, a house, a ship, etc.
Technically, an object is a container when it can be asked whether it
contains a certain element. You can perform such membership tests on
lists, sets, or tuples alike:
>>> assert 1 in [1, 2, 3] # lists
>>> assert 4 not in [1, 2, 3]
>>> assert 1 in {1, 2, 3} # sets
>>> assert 4 not in {1, 2, 3}
>>> assert 1 in (1, 2, 3) # tuples
>>> assert 4 not in (1, 2, 3)
Dict membership will check the keys:
>>> d = {1: 'foo', 2: 'bar', 3: 'qux'}
>>> assert 1 in d
>>> assert 4 not in d
>>> assert 'foo' not in d # 'foo' is not a _key_ in the dict
Finally you can ask a string if it "contains" a substring:
>>> s = 'foobar'
>>> assert 'b' in s
>>> assert 'x' not in s
>>> assert 'foo' in s # a string "contains" all its substrings
The last example is a bit strange, but it shows how the container interface
renders the object opaque. A string does not literally store copies of all of
its substrings in memory, but you can certainly use it that way.
NOTE:
Even though most containers provide a way to produce every element they
contain, that ability does not make them a container but an iterable (we'll
get there in a minute).
Not all containers are necessarily iterable. An example of this is a Bloom
filter. Probabilistic data structures like this can be asked whether
they contain a certain element, but they are unable to return their
individual elements.
As said, most containers are also iterable. But many more things are iterable
as well. Examples are open files, open sockets, etc. Where containers are
typically finite, an iterable may just as well represent an infinite source of
data.
An iterable is any object, not necessarily a data structure, that can
return an iterator (with the purpose of returning all of its elements).
That sounds a bit awkward, but there is an important difference between an
iterable and an iterator. Take a look at this example:
Here, x is the iterable, while y and z are two individual
instances of an iterator, producing values from the iterable x. Both
y and z hold state, as you can see from the example. In this example, x
is a data structure (a list), but that is not a requirement.
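A minimal sketch of the situation described here, using the same names x, y, and z (the concrete list values are my choice):

```python
x = [1, 2, 3]   # the iterable
y = iter(x)     # a first iterator over x
z = iter(x)     # a second, independent iterator over x

first_y = next(y)   # 1
second_y = next(y)  # 2
first_z = next(z)   # 1, because z keeps its own position

print(type(x).__name__, type(y).__name__)  # list list_iterator
```

Advancing y does not affect z, which shows that the state lives in the iterators and not in the list.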
NOTE:
Often, for pragmatic reasons, iterable classes will implement both
__iter__() and __next__() in the same class, and have __iter__() return
self, which makes the class both an iterable and its own iterator. It is
perfectly fine to return a different object as the iterator, though.
Finally, when you write:
x = [1, 2, 3]
for elem in x:
    ...
This is what actually happens:
When you disassemble this Python code, you can see the explicit call to
GET_ITER, which is essentially like invoking iter(x). The FOR_ITER is an
instruction that will do the equivalent of calling next() repeatedly to get
every element, but this does not show from the byte code instructions because
it's optimized for speed in the interpreter.
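If you want to see those instructions for yourself, the standard library's dis module can disassemble a small function. This is only a sketch: the loop() function is a throwaway example, and the exact listing differs between Python versions, so no output is shown here.

```python
import dis


def loop(x):
    for elem in x:
        pass


# Print the full disassembly of the function
dis.dis(loop)

# Or collect just the instruction names and look for the two of interest
opnames = [ins.opname for ins in dis.get_instructions(loop)]
print('GET_ITER' in opnames, 'FOR_ITER' in opnames)
```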
So what is an iterator then? It's a stateful helper object that will
produce the next value when you call next() on it. Any object that has
a __next__() method is therefore an iterator. How it produces a value is
irrelevant.
So an iterator is a value factory. Each time you ask it for "the next" value,
it knows how to compute it because it holds internal state.
There are countless examples of iterators. All of the itertools functions
return iterators. Some produce infinite sequences:
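As a sketch of such infinite iterators, here are itertools.count() and itertools.cycle() (the start value and the colors are arbitrary choices for the illustration):

```python
from itertools import count, cycle

counter = count(start=13)   # 13, 14, 15, ... without end
a = next(counter)
b = next(counter)
print(a, b)  # 13 14

colors = cycle(['red', 'white', 'blue'])   # repeats forever
first_four = [next(colors) for _ in range(4)]
print(first_four)  # ['red', 'white', 'blue', 'red']
```

Never feed these directly to list() or a plain for-loop without a stop condition, since they will not terminate on their own.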
Some produce finite sequences from infinite sequences:
>>> from itertools import cycle, islice
>>> colors = cycle(['red', 'white', 'blue'])  # infinite
>>> limited = islice(colors, 0, 4)            # finite
>>> for x in limited:                         # so safe to use for-loop on
...     print(x)
red
white
blue
red
To get a better sense of the internals of an iterator, let's build an iterator
producing the Fibonacci numbers:
Note that this class is both an iterable (because it sports an __iter__()
method), and its own iterator (because it has a __next__() method).
The state inside this iterator is fully kept inside the prev and curr
instance variables, and is used for subsequent calls to the iterator. Every
call to next() does two important things:
Modify its state for the next next() call;
Produce the result for the current call.
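A sketch of what such an iterator class could look like, using the prev and curr instance variables mentioned above (the exact body is my reconstruction, so take it as one possible implementation):

```python
from itertools import islice


class fib:
    def __init__(self):
        self.prev = 0
        self.curr = 1

    def __iter__(self):
        # The object is its own iterator
        return self

    def __next__(self):
        value = self.curr
        # Modify the state for the next next() call...
        self.curr += self.prev
        self.prev = value
        # ...and produce the result for the current call
        return value


f = fib()
first_ten = list(islice(f, 0, 10))
print(first_ten)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]
```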
Central idea: a lazy factory
From the outside, the iterator is like a lazy factory that is idle until you
ask it for a value, which is when it starts to buzz and produce a single
value, after which it turns idle again.
Finally, we've arrived at our destination! Generators are my absolute
favorite Python language feature. A generator is a special kind of
iterator—the elegant kind.
A generator allows you to write iterators much like the Fibonacci sequence
iterator example above, but in an elegant succinct syntax that avoids writing
classes with __iter__() and __next__() methods.
Let's be explicit:
Any generator also is an iterator (not vice versa!);
Any generator, therefore, is a factory that lazily produces values.
Here is the same Fibonacci sequence factory, but written as a generator:
Wow, isn't that elegant? Notice the magic keyword that's responsible for the
beauty:
yield
Let's break down what happened here: first of all, take note that fib is
defined as a normal Python function, nothing special. Notice, however, that
there's no return keyword inside the function body. The return value of the
function will be a generator (read: an iterator, a factory, a stateful helper
object).
Now when f = fib() is called, the generator (the factory) is instantiated and
returned. No code will be executed at this point: the generator starts in an
idle state initially. To be explicit: the line prev, curr = 0, 1 is not
executed yet.
Then, this generator instance is wrapped in an islice(). This is itself also
an iterator, so idle initially. Nothing happens, still.
Then, this iterator is wrapped in a list(), which will consume all of its
argument's values and build a list from them. To do so, it will start calling next()
on the islice() instance, which in turn will start calling next() on our
f instance.
But one step at a time. On the first invocation, the code will finally run
a bit: prev, curr = 0, 1 gets executed, the while True loop is entered, and
then it encounters the yield curr statement. It will produce the value
that's currently in the curr variable and become idle again.
This value is passed to the islice() wrapper, which will produce it (because
it's not past the 10th value yet), and list can add the value 1 to the list
now.
Then, it asks islice() for the next value, which will ask f for the next
value, which will "unpause" f from its previous state, resuming with the
statement prev, curr = curr, prev + curr. Then it re-enters the next
iteration of the while loop, and hits the yield curr statement, returning
the next value of curr.
This happens until the output list is 10 elements long and when list() asks
islice() for the 11th value, islice() will raise a StopIteration
exception, indicating that the end has been reached, and list will return the
result: a list of 10 items, containing the first 10 Fibonacci numbers. Notice
that the generator doesn't receive the 11th next() call. In fact, it will
not be used again, and will be garbage collected later.
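The walkthrough above can be reproduced with the following sketch. The statements are the ones quoted in the text (prev, curr = 0, 1, the while True loop, and yield curr); the final next(f) call is an addition of mine to show that the generator was paused right after the 10th value.

```python
from itertools import islice


def fib():
    prev, curr = 0, 1
    while True:
        yield curr
        prev, curr = curr, prev + curr


f = fib()                           # nothing runs yet
first_ten = list(islice(f, 0, 10))  # drives the generator 10 times
print(first_ten)  # [1, 1, 2, 3, 5, 8, 13, 21, 34, 55]

# islice() never asked for an 11th value, so the generator resumes from there
eleventh = next(f)
print(eleventh)  # 89
```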
There are two types of generators in Python: generator functions and
generator expressions. A generator function is any function in which the
keyword yield appears in its body. We just saw an example of that. The
appearance of the keyword yield is enough to make the function a generator
function.
The other type of generator is the generator equivalent of a list
comprehension. Its syntax is really elegant for a limited use case.
Suppose you use this syntax to build a list of squares:
>>> numbers = [1, 2, 3, 4, 5, 6]
>>> [x * x for x in numbers]
[1, 4, 9, 16, 25, 36]
You could do the same thing with a set comprehension:
>>> {x * x for x in numbers}
{1, 4, 36, 9, 16, 25}
Or a dict comprehension:
>>> {x: x * x for x in numbers}
{1: 1, 2: 4, 3: 9, 4: 16, 5: 25, 6: 36}
But you can also use a generator expression (note: this is not a tuple
comprehension):
>>> lazy_squares = (x * x for x in numbers)
>>> lazy_squares
<generator object <genexpr> at 0x10d1f5510>
>>> next(lazy_squares)
1
>>> list(lazy_squares)
[4, 9, 16, 25, 36]
Note that, because we read the first value from lazy_squares with next(),
its state is now at the "second" item, so when we consume it entirely by
calling list(), that will only return the partial list of squares. (This is
just to show the lazy behaviour.) This is as much a generator (and thus, an
iterator) as the other examples above.
Generators are an incredibly powerful programming construct. They allow you to
write streaming code with fewer intermediate variables and data structures.
Besides that, they are more memory and CPU efficient. Finally, they tend to
require fewer lines of code, too.
Tip to get started with generators: find places in your code where you do the
following:
def something():
    result = []
    for ... in ...:
        result.append(x)
    return result
And replace it by:
def iter_something():
    for ... in ...:
        yield x

# def something():  # Only if you really need a list structure
#     return list(iter_something())
I used to find writing command line tools tedious. Not so much the writing of
the core of the tool itself, but all the peripheral stuff you had to do to
actually finish one.
The first issue is to pick the language to implement it in: do I use Python,
which I'm intimately familiar with, or a Unix shell script? With shell
scripts, the syntax is pretty terrible, but the tool typically fits in a single
file and there's hardly any overhead running them. On the other hand, making
sure the tool works under all circumstances can be tricky. Shell scripts are
notorious for breaking when you feed them arguments with spaces. The burden of
making sure you properly quote all the variable interpolations in the script is
on the programmer. It's possible to do, just unnecessarily hard.
On the other hand, Python is so much more expressive. There are a ton of
libraries out there ready to use, and Python itself includes a lot of batteries
already in its standard library, of course.
Python comes with its own set of problems, though. Python runtime environments
are typically a mess, and I don't want to further pollute people's already
cluttered global Python environments. With Python, installing a package is
typically just a pip install <pkg> away, but it requires another tedious
step: writing a setup.py.
When it comes to distributing the script, a shell script may be much easier.
With shell scripts, it's a single file that needs to be copied somewhere,
either manually or via a make install command. The latter involves adding a
Makefile and dealing with subtle differences for each Unix platform, not to even
mention trying to run it on Windows machines.
Each script will at some stage require some options or arguments. How should
we do the argument parsing? Do I use getopt or getopts? Does it even
matter? Can it take --long-form-options? Or do I resign myself to poor
man's arg parsing again? The latter has too often become the default choice.
Lately, a few fantastic projects have taken away most of the tedious work
surrounding the building of command line tools, and almost make it trivial now.
Click is a Python library written by Armin Ronacher that
deals with all the handling of command line option and argument parsing and
comes with fantastic defaults. This project is a great step towards more
consistent and standard CLI interfaces. Besides solving the options and
argument parsing, it also has a ton of useful features packaged, like smart
colorized terminal output, file abstractions, subcommands, and rendering
progress bars.
Using pipsi (also by Armin!), users can install any Python command
line script into an isolated Python runtime environment, so it solves the
global cluttered Python environment problem entirely.
Cookiecutter (by the awesome Audrey Roy Greenfield)
is a project generator, based on a predefined project template. It will read
the template, ask the user a few questions to fill in the blanks, and generates
a new project for you.
cookiecutter-python-cli is one such Cookiecutter template I wrote
that uses all of the above: it sports a predefined setup.py, a package
structure that's extensible, and test cases and a test runner to get you
started.
Next, using pipsi, install Cookiecutter in its own isolated
runtime environment:
$ pipsi install cookiecutter
Now use Cookiecutter to create your brand new project, based on my CLI template:
$ cd ~/Desktop
$ cookiecutter https://github.com/nvie/cookiecutter-python-cli.git
Cloning into 'cookiecutter-python-cli'...
remote: Counting objects: 64, done.
remote: Total 64 (delta 0), reused 0 (delta 0)
Unpacking objects: 100% (64/64), done.
Checking connectivity... done.
full_name (default is "Vincent Driessen")?
email (default is "vincent@3rdcloud.com")?
github_username (default is "nvie")?
project_name (default is "My Tool")?
repo_name (default is "python-mytool")?
pypi_name (default is "mytool")?
script_name (default is "my-tool")?
package_name (default is "my_tool")?
project_short_description (default is "My Tool does one thing, and one thing well.")?
release_date (default is "2014-09-04")?
year (default is "2014")?
version (default is "0.1.0")?
When you're done, you'll have a project where you can run tox to run
your test suite on all important Python versions. If you don't need the test
cases, simply remove the tests/ directory.
You can edit the setup.py to your liking. The default provided version
should already work out of the box. When you're done implementing your tool,
you can either upload it to PyPI or just keep it to yourself locally:
> **Update** (Feb 2026): This article is
> really really old. You probably want to use `uv`
> today.
> **Update** (March 2019): The Python
> packaging landscape has changed significantly
> since I first wrote this post. Your choice
> today is mostly between using
> [pip-tools](https://github.com/jazzband/pip-tools)
> directly, using
> [Pipenv](https://docs.pipenv.org/) (which is
> a Swiss army knife kind of tool that
> internally relies on pip-tools), or newer
> tooling like
> [Poetry](https://poetry.eustace.io/). A good
> post to help you decide which is the tool to
> best fit your use case is
> [Python Application Dependency Management in 2018](https://hynek.me/articles/python-app-deps-2018/)
> by [Hynek Schlawack](https://twitter.com/hynek).
You are managing your Python packages using pip and requirements.txt spec
files already. Maybe, you are even pinning them
too—that’s awesome. But how do you keep your environments clean and fresh?
Here’s what I think can be improved about the state of package management in
Python.
Virtue 1: Declare only your top-level dependencies
Often, your project will only need a limited set of what I’ll call top-level
package dependencies. A typical example is that you’ll depend on Django or
Flask. But just putting those names in a requirements.txt file is inherently
dangerous and will bite you at some point. If you don’t see why, read this
post first.
So now you’re pinning them. If your app needs Flask, this will typically be in
your requirements.txt file:
Flask==0.9
Jinja2==2.6
Werkzeug==0.8.3
Jinja2 and Werkzeug are in there, because Flask needs them. And since you don’t
want fate to decide which versions of Jinja2 and Werkzeug you’ll get when
deploying, you’re wisely pinning them.
The problem with this is that over time your requirements.txt file will
accumulate all kinds of dependencies, and in reality, it’s not unusual that
you’ll lose sight of which packages are still used, and which have become
stale.
The following file is the result of depending on Flask and legit.
Looking at this, I’d have no clue what smmap is, and why it’s needed in
there.
Wouldn’t it be awesome to actually have a way of expressing only your top-level
dependencies in a file called requirements.in, like this?
Flask>=0.9 # we use 0.9 features
legit # any version will do for us
And “compiling” that to an actual requirements.txt:
async==0.6.1 # required by legit==0.1.1
clint==0.3.1 # required by legit==0.1.1
Flask==0.9
gitdb==0.5.4 # required by legit==0.1.1
GitPython==0.3.2.RC1 # required by legit==0.1.1
Jinja2==2.6 # required by Flask==0.9
legit==0.1.1
smmap==0.8.2 # required by legit==0.1.1
Werkzeug==0.8.3 # required by Flask==0.9
This tool exists, and is called pip-compile. Check it out on the future
branch of pip-tools.
(UPDATE This is now the master branch, available since 1.0.) I wrote this
together with Bruno Renié over the last few
months.
Let’s elaborate on this a bit. The .in file provides the file format that
you’d actually want to use and maintain as a developer, while the
result of compilation is the file that you want to use to build deterministic
(and thus predictable) envs.
Note that there’s a fundamental difference here between “compiling” these .in
files and compiling a file of source code: the result of the compilation itself
isn’t deterministic. This means that compiling your requirements may lead to
a different requirements.txt file depending on the moment you run it—because
in the meantime some packages might have gotten updates in PyPI.
The point is to freeze the specs. Exactly why you were pinning your
dependencies already.
As a consequence, you should put both files under version control. This plays
well with PAAS providers like Heroku as well. The .in file is only used for
your own maintenance convenience, while the .txt file is actually used to
install to your env. The difference is, it’s now generated for you.
A Quick Note on Complex Dependencies
We’ve created pip-compile to be smart with respect to resolving complex
dependency trees. For example, Flask 0.9 depends on Jinja2>=2.4. If another
package, say Foo, declared Jinja2<2.6, you’ll end up having Jinja2==2.5
in your compiled requirements. It can figure this out. (Obviously, conflicts
can occur, in which case compilation will fail.)
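To illustrate the idea (this is not how pip-compile is actually implemented), here is a small sketch that picks the highest version satisfying every constraint. The pick_version() and parse() helpers are hypothetical names, the list of available versions is made up for the example, and the version parsing only handles plain dotted numbers.

```python
import operator

# Comparison operators supported by this sketch
OPS = {
    '>=': operator.ge,
    '<=': operator.le,
    '==': operator.eq,
    '>': operator.gt,
    '<': operator.lt,
}


def parse(version):
    """Turn '2.5' into (2, 5) so versions compare numerically."""
    return tuple(int(part) for part in version.split('.'))


def pick_version(available, constraints):
    """Return the highest available version that satisfies all constraints."""
    candidates = [
        v for v in available
        if all(OPS[op](parse(v), parse(bound)) for op, bound in constraints)
    ]
    if not candidates:
        raise ValueError('conflicting constraints: %r' % (constraints,))
    return max(candidates, key=parse)


available = ['2.4', '2.5', '2.6']            # hypothetical releases of Jinja2
constraints = [('>=', '2.4'), ('<', '2.6')]  # from Flask 0.9 and from Foo

print(pick_version(available, constraints))  # 2.5
```

When no version satisfies all constraints, the sketch raises a ValueError, which mirrors the "compilation will fail" case.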
The next step, then, would be to rebuild your actual virtualenvs by having them
reflect exactly what’s in your (compiled) spec file. Let’s replay the example
above.
Recall that we have this in our requirements.in file:
Flask>=0.9
legit
Then we run pip-compile, and get:
async==0.6.1 # required by legit==0.1.1
clint==0.3.1 # required by legit==0.1.1
Flask==0.9
gitdb==0.5.4 # required by legit==0.1.1
GitPython==0.3.2.RC1 # required by legit==0.1.1
Jinja2==2.6 # required by Flask==0.9
legit==0.1.1
smmap==0.8.2 # required by legit==0.1.1
Werkzeug==0.8.3 # required by Flask==0.9
Now, to actually install that into our environment, we typically run:
$ pip install -r requirements.txt
But frankly, this isn’t enough. To actually reliably mimic the spec file, the
env might need to uninstall some packages as well. This can actually be very
important. Suppose you have a package that’s already installed in your env, say
requests. Your code is using it, but you forgot to add it to
requirements.txt. That way, running pip install -r requirements.txt will
work fine, but deploying this code will break due to an ImportError.
Meet pip-sync. This tool will install all required packages into your env,
but will additionally uninstall everything else in there. Combined with
pip-compile, this makes for package management nirvana. Say you don’t need
legit anymore, and want to remove it as a project dependency.
First, remove that top-level dependency from the .in file:
Flask
# legit # comment out, or remove
Then run pip-compile to update the compiled spec file:
Flask==0.9
Jinja2==2.6 # required by Flask==0.9
Werkzeug==0.8.3 # required by Flask==0.9
The unused dependencies are removed automatically. Now we need to sync that
back to our actual env:
This will now uninstall legit and all its dependencies from the virtualenv
(unless some other package would depend on them still). Your virtualenv is
crisp and clean.
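Conceptually, syncing boils down to comparing what is installed with what is pinned. The sketch below illustrates that comparison only; plan_sync() is a hypothetical helper and not pip-sync's actual code, which has to deal with more details than this.

```python
def plan_sync(installed, pinned):
    """Compare two {name: version} dicts; return (to_install, to_uninstall)."""
    to_uninstall = sorted(
        (name for name in installed if name not in pinned),
        key=str.lower,
    )
    to_install = sorted(
        ('%s==%s' % (name, version)
         for name, version in pinned.items()
         if installed.get(name) != version),
        key=str.lower,
    )
    return to_install, to_uninstall


# The env as it was built from the earlier compiled spec
installed = {
    'async': '0.6.1', 'clint': '0.3.1', 'Flask': '0.9', 'gitdb': '0.5.4',
    'GitPython': '0.3.2.RC1', 'Jinja2': '2.6', 'legit': '0.1.1',
    'smmap': '0.8.2', 'Werkzeug': '0.8.3',
}
# The new compiled spec, after dropping legit
pinned = {'Flask': '0.9', 'Jinja2': '2.6', 'Werkzeug': '0.8.3'}

to_install, to_uninstall = plan_sync(installed, pinned)
print(to_install)    # []
print(to_uninstall)  # ['async', 'clint', 'gitdb', 'GitPython', 'legit', 'smmap']
```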
I would propose that PAAS providers adopt pip-sync over pip install
-r requirements.txt, as environments are automatically cleaned up that way.
As said, over the last few months, Bruno Renié
and myself have been working on a better version of the pip-tools project—one
that would let us do exactly the above. We’ve not been very public about it,
but you might have noticed the future
branch. Basically, this would
replace the existing pip-dump command by something inherently more
manageable.
I do solicit feedback on all this, so feel free to get in touch.
Make your Python production deployments predictable and deterministic by
pinning your dependencies.
In building your Python application and its dependencies for production, you
want to make sure that your builds are predictable and deterministic.
Therefore, always pin your dependencies.
Update:
A newer blog post about the future of pip-tools is available too: Better
Package Management.
If you don’t, you can never know what you’ll get when you run pip install.
Even if you rebuild the env every time, you still can’t predict it. The outcome
relies on a) what’s currently installed, and b) what’s the current version on
PyPI.
Eventually, all of your environments, and those of your team members, will
drift out of sync. Worse, this cannot be fixed by rerunning pip install. It’s
just waiting for bad things to happen in production.
The only way to make your builds deterministic is to pin every single
package dependency (even the dependency’s dependencies).
WARNING:
Don’t pin by default when you’re building libraries! Only use pinning for end
products.
The biggest complaint from folks regarding explicit pinning is that you won’t
benefit from updates that way. Well, yes, you won’t. But think about it. It’s
impossible to tell a new release that fixes bugs from one that introduces
them. You are leaving it up to coincidence. There is only one way to
retake control: pin every dependency.
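One cheap way to enforce this is a small check over the requirements file. The sketch below is only an illustration: the unpinned helper is hypothetical, it treats any line without == as unpinned, and it does not understand pip options such as -r or -e.

```python
def unpinned(lines):
    """Return the requirement lines that are not pinned with '=='."""
    result = []
    for line in lines:
        # Drop trailing comments such as "# required by Flask==0.9".
        requirement = line.split("#", 1)[0].strip()
        if not requirement:
            continue  # blank line or comment-only line
        if "==" not in requirement:
            result.append(requirement)
    return result


sample = [
    "Flask==0.9",
    "Jinja2==2.6  # required by Flask==0.9",
    "requests>=0.13",
    "redis",
    "# just a comment",
]
offenders = unpinned(sample)
print(offenders)
```

A check like this could run in a pre-commit hook or CI step for end products. For libraries it would be the wrong thing to enforce.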
So: we want to pin packages, but don’t want to let them become outdated. The
solution: use a tool that can check for updates. This is exactly what I built
pip-tools for.
pip-tools is the collective name for two tools: pip-review and pip-dump.
pip-review checks for available updates of all packages currently installed in
your environment, and reports them when available:
$ pip-review
requests==0.14.0 available (you have 0.13.2)
redis==2.6.2 available (you have 2.4.9)
rq==0.3.2 available (you have 0.3.0)
You can also install them automatically:
$ pip-review --auto
...
or interactively decide whether you want to install each package:
$ pip-review --interactive
requests==0.14.0 available (you have 0.13.2)
Upgrade now? [Y]es, [N]o, [A]ll, [Q]uit y
...
redis==2.6.2 available (you have 2.4.9)
Upgrade now? [Y]es, [N]o, [A]ll, [Q]uit n
rq==0.3.2 available (you have 0.3.0)
Upgrade now? [Y]es, [N]o, [A]ll, [Q]uit y
...
It’s advisable to pick a fixed schedule to run pip-review. For example, every
Monday during a weekly standup meeting with your engineering team. Make it
a point on the agenda. You discuss pip-review’s output, inspect changelogs,
or just blindly upgrade the packages. The important part is that you do it explicitly.
You have the chance to run with the upgraded versions for a while in
a development environment, before pushing those versions to production.
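The comparison behind such a report can be sketched in a few lines. This is not pip-review's code: the available_updates and parse helpers are hypothetical, the "latest" versions are passed in by hand instead of being looked up on PyPI, and the naive numeric parsing only handles plain dotted versions (real version ordering follows PEP 440).

```python
def parse(version):
    """Turn '0.13.2' into (0, 13, 2) so versions compare numerically."""
    return tuple(int(part) for part in version.split("."))


def available_updates(installed, latest):
    """Return report lines for packages that have a newer release."""
    lines = []
    for name in sorted(installed):
        have = installed[name]
        newest = latest.get(name, have)
        if parse(newest) > parse(have):
            lines.append(f"{name}=={newest} available (you have {have})")
    return lines


installed = {"requests": "0.13.2", "redis": "2.4.9", "rq": "0.3.0", "Flask": "0.9"}
latest = {"requests": "0.14.0", "redis": "2.6.2", "rq": "0.3.2", "Flask": "0.9"}

report = available_updates(installed, latest)
for line in report:
    print(line)
```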
Whereas pip-review solves the problem of how to check for updates of pinned
packages, pip-dump focuses on the problem of how to dump those definitions
into requirements files, managed under version control.
Typically, in Python apps, you include a requirements.txt file in the root of
your project directory, and you run pip freeze > requirements.txt
periodically. While this works for simple projects, it doesn’t scale. Some
packages are installed for development or personal purposes only, and you
don’t want those included in requirements.txt, shipped to production, or
visible to your other team members.
pip-dump provides a smarter way to dump requirements. It understands the
convention of separating requirements into multiple files, following the naming
convention:
requirements.txt is the main (and default) requirements file;
dev-requirements.txt, or test-requirements.txt, or actually,
*requirements.txt, are secondary dependencies.
When you have a requirements.txt and dev-requirements.txt file in your
project, with the following content:
It keeps the files sorted for tidiness, and to reduce the chance of merge
conflicts in version control.
You can even put packages in an optional file called .pipignore. This is
useful if you want to keep some packages installed in your local environment,
but don’t want to have them reflected in your requirements files.
pip-tools 0.x is relied on by many already on a daily/weekly basis. It’s
worth noting that we’re working on Better Package Management too,
which will be the future of pip-tools. If you want to contribute, please
shout out.
Some thoughts on how I like to write libraries as "open source" projects.
Reflecting on how I build software lately, I noticed a pattern. I tend to write
libraries in absolute isolation, as if they were open sourced and the world
is watching along.
“The difference between theory and practice is that, in theory, there is
none.”
We have all been schooled to isolate units of software into reusable
components. Software engineering literature has called this separation of
concerns for decades. It reduces the big problem into smaller non-overlapping
problems.
We obviously try doing so, by putting related logic into modules, libraries and
whatnot. Yet, in practice, so many real world projects fail at their attempts
and end up evolving into something unnecessarily complicated.
The main problem with this is that it becomes increasingly hard to comprehend
and reason about your software, not to mention the “increased fun” maintaining
it.
Why is it so hard to actually achieve this in practice? Separating concerns
apparently is easier said than done.
The need for a new library often arises when solving a larger problem top-down.
In the quest of solving a larger problem, you need to create a smaller
component first that is required to get to a fully working solution. This is
what most of our work as engineers is about—while solving a larger problem, we
run into bumps along the road. When we do, we stop, fix the bump, and continue
on our journey to solving the large problem.
In our rush to arrive at our end destination, we want to fix any bumps as
quickly as possible. For many good reasons, mostly. We might have a deadline,
or we are afraid to get lost in details and lose focus on the bigger problem we
were actually solving.
In short, we tend to see those bumps as unsolicited chores that are blocking us
and we want to spend as little time as possible at overcoming them. From
a quality perspective, however, this may not be the best route to take. So the
least you can do is create that Technical Debt ticket, feel less guilty, and
move on :)
I’ve never seen a project work any more glorious than this.
Running into these bumps sucks. You’re frustrated that you’re held up and are
continuously thinking: dammit, I don’t want to deal with this now. The reality
often is that you don’t have a choice.
Instead, step back a few steps, take a deep breath, and accept that you’ll have
to spend more time on this problem than budgeted. This gives you the mental
calm to make a good engineering decision without too much frustration
involved.
What I like to do at this moment, is to start a new open source project.
Not necessarily a public one, but I do set it up as if it were, and actually
consider it to be, or eventually become, open. I start out with a README
describing the problem and the API I’d like it to have. And in the case of
Python, I also create a setup.py so integrating this into the original
project is only a pip install away.
Then, just start implementing it.
Let me try to highlight the benefits this approach provides.
The pressure you get from pretending (or knowing) that many others will read
your code, pushes you to do it right. I’d ask myself continuously: would this
be an API that I’d show off to the outside world and be proud of? Could I truly
explain this API in a README so that people would understand? If not, I don’t
implement it like that and push harder.
Many eyeballs make you feel more responsible. Writing stuff in private for
yourself, doesn’t.
A big difference between starting an actual new project, or developing it as
one-of-many internal libraries, is that it’s impossible to rely on other parts
of the end product. For example, that convenient project-specific helper
function you already wrote is easily included in a module, but not so much in
another project.
In an open source project, you simply cannot cheat on yourself this way and
you’re forced to come up with a better solution. This might feel inconvenient
at first, but remember that it’s easy to write complex software and it takes
more care and dedication to write simpler software.
As a way of visualising this approach: compare programming to electrical
engineering for a minute. Say you have to create a circuit board of some sort.
The chip on this board is analogous to an open source software project. Its
internals are nicely abstracted away, the pins of the chip form its API, it’s
probably well-tested, well-documented and can be reused immediately. It is
physically impossible to connect to any of the internals of it—which is exactly
the point of abstraction.
Looking at the circuit board, everything about this falls into place pretty
obviously.
As programmers, we often fool ourselves that we’re isolating logic into
modules/libraries, while in fact we’re merely organising it. Modules will
oftentimes still contain project-specific dependencies. (As a good litmus test,
move that module to an empty directory and use it. If it breaks, it wasn’t
truly isolated.)
The curse with programming is that it’s so easy to create these dependencies.
They are only one import statement away. Developers live in a world where
that temptation continuously lurks.
But by isolating code into a stand-alone project, you can remove this
temptation wholly, thereby reducing ways of cheating on yourself.
Another big benefit of truly isolating your library this way is that you are
forced to think about its API. It’s the only way of interacting with the
library after all. Doing this, you’ll naturally feel the urge to simplify.
Complex APIs lead to complicated documentation and complex tests. The opposite
applies, too, fortunately: simple APIs lead to simple documentation and tests.
A concrete example of this: When you are hacking in a web environment, you most
likely have “the request” or “the DB connection” at your disposal any time.
When you put your library in a module, it’s easy for these to become implicit
dependencies of your library. Your library may pretty well work outside of
a request context, however, and in fact, the only thing you actually need from
the request could be a User instance. When you build your library as
a separate project, these decisions fall into place effortlessly. In the end,
this makes your library more decoupled, more generic, and overall cleaner. And
as such, simpler.
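A tiny sketch of that request-versus-User decision, with made-up names throughout (User, FakeRequest, greeting_for and handle are all hypothetical and not tied to any framework):

```python
from dataclasses import dataclass


@dataclass
class User:
    name: str


@dataclass
class FakeRequest:
    """Stand-in for whatever request object the web framework provides."""
    user: User


# Library code: depends only on what it actually needs, a User.
def greeting_for(user: User) -> str:
    return f"Hello, {user.name}!"


# Web layer: the only place that knows about requests.
def handle(request: FakeRequest) -> str:
    return greeting_for(request.user)


from_web = handle(FakeRequest(user=User(name="Ada")))
from_script = greeting_for(User(name="Ada"))  # works without any request
print(from_web, from_script)
```

The library function can now be called from a cron job, a test, or a shell just as easily as from a request handler.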
True Isolation™ is the ultimate catalyst of simplicity.
Even if you’re only using this technique privately to produce better software
for yourself, this pays off already in a technical sense.
But open sourcing can also pay off in non-technical ways. When it fits your
company’s strategy, you now have the choice to actually publish your project at
any time, since it’s been written for the public from the beginning. If it
solves a common problem, others may like it and take interest in following it
or even contributing to it. This may open up a whole new world of users
providing feedback and improvements through code or documentation
contributions.
Your company may come across to talented people as an interesting place to
work. Your open source project can be your company’s banner. We’ve seen this
with companies like Joyent (of Node.js fame),
10gen (of MongoDB fame) and
Opscode (of Chef fame), just to name a few. Open
sourcing has had important marketing value for these companies and they have
attracted many talented folks through their high-quality work.
Just always remember that simpler projects have a much lower barrier for
contributors, so these are more likely to receive patches. Which by itself is
another good reason to simplify your libraries :)
Many of the things I created recently, I created this way. A few months
ago, I needed a super-simple solution to put work in the background. I was
working on a startup idea, which I was creating a proof of concept for. It was
a small Flask web app, and I used this
snippet initially to offload work to the
background. It did the work fine, but I soon needed it to do more, so I kept
tweaking it until it was no longer a snippet, but a library. Although it was
nicely organised in a directory, it was still tailored to the specific product
I was creating.
This is where I decided to step back and started building that library like an
open source project, which became RQ, of course.
While I used it privately for about four months, its API kept changing quite
a bit, but its use became more general over time. I started reusing it for
other projects I was working on, until I considered it stable enough.
I believed it could be of help to other Python engineers, so I decided to open
source RQ in March.
Eventually I dropped the original startup idea, but I still have RQ. Had I not
open sourced it, it would now be buried with the rest of that project’s code.
Open sourcing pays off. Even if you do it in private.
Today, I'm open sourcing a project that I've been working on for the last few
months. It is a Python library to put work in the background, that you'd
typically use in a web context. It is designed to be simple to set up and
use, and be of help in almost any modern Python web stack.
Of course, there already exist a few solutions to this problem.
Celery (by the excellent
@asksol) is by far the most popular Python
framework for working with asynchronous tasks. It is agnostic about the
underlying queueing implementation, which is quite powerful, but also poses
a learning curve and requires a fair amount of setup.
Don’t get me wrong—I think Celery is a great library. In fact, I’ve
contributed to
Celery myself in the past. My experience is, however, that as your Python web
project grows, there comes this moment where you want to start offloading small
pieces of code into the background. Setting up Celery for these cases is
a substantial effort that isn’t done swiftly and might be holding you back.
I wanted something simpler. Something that you’d use in all of your Python
web projects, not only the big and serious ones.
In many modern web stacks, chances are that you’re already using
Redis (by @antirez).
Besides being a kick-ass key value store, Redis also provides semantics to
build a perfect queue implementation. The commands
RPUSH,
LPOP and
BLPOP are all it takes.
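To illustrate how little is needed, here is a rough sketch of a FIFO queue built on just those list commands. It is not RQ's implementation. To keep it runnable without a Redis server it uses a small in-memory stand-in (InMemoryLists, a hypothetical class) that offers rpush and lpop with the same call shape as the redis-py client. The JSON job format and the enqueue/dequeue helpers are assumptions for illustration as well.

```python
import json
from collections import defaultdict, deque


class InMemoryLists:
    """Stand-in that mimics the two Redis list commands used below."""

    def __init__(self):
        self._lists = defaultdict(deque)

    def rpush(self, key, value):
        # Append to the tail of the list, like Redis RPUSH.
        self._lists[key].append(value)
        return len(self._lists[key])

    def lpop(self, key):
        # Remove and return the head of the list, like Redis LPOP.
        items = self._lists[key]
        return items.popleft() if items else None


def enqueue(conn, queue_name, func_name, *args):
    """Append a JSON-encoded job description to the tail of the queue."""
    conn.rpush(queue_name, json.dumps({"func": func_name, "args": list(args)}))


def dequeue(conn, queue_name):
    """Pop the oldest job from the head of the queue, or None if empty."""
    raw = conn.lpop(queue_name)
    return json.loads(raw) if raw is not None else None


conn = InMemoryLists()
enqueue(conn, "default", "myfunc", 318, 62)
enqueue(conn, "default", "myfunc", 6, 7)
first = dequeue(conn, "default")
second = dequeue(conn, "default")
print(first, second)
```

With a real Redis connection, a worker would typically use the blocking BLPOP instead of LPOP, so it sleeps until a job arrives rather than polling.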
Inspired by Resque (by
defunkt) and the simplicity of this Flask
snippet (by
@mitsuhiko), I’ve challenged myself to
imagine just how hard a job queue library really should be.
I wanted a solution that was lightweight, easy to adopt, and easy to grasp. So
I devised a simple queueing library for Python, and dubbed it RQ. In
a nutshell, you define a job like you would any normal Python function.
def myfunc(x, y):
    return x * y
Now, with RQ, it is ridiculously easy to put it in the background like this:
from rq import use_connection, Queue
# Connect to Redis
use_connection()
# Offload the "myfunc" invocation
q = Queue()
q.enqueue(myfunc, 318, 62)
This puts the equivalent of myfunc(318, 62) on the default queue. Now, in
another shell, run a separate worker process to perform the actual work:
$ rqworker
12:46:56:
12:46:56: *** Listening on default...
12:47:35: default: mymodule.myfunc(318, 62) (38d9c157-e997-40e2-8d20-574a97ec5a99)
12:47:35: Job OK, result = 19716
12:47:35:
12:47:35: *** Listening on default...
...
To poll for the asynchronous result in the web backend, you can use:
Although I must admit that polling for job results through the return_value
isn’t quite useful and probably won’t be a pattern that you’d use in your
day-to-day work. (I would certainly recommend against doing that, at least.)
RQ was designed to be as easy as possible to start using immediately inside
your Python web projects. You only need to pass it a Redis connection to use,
because I didn’t want it to create new connections implicitly.
To use the default Redis connection (to localhost:6379), you only have to do
this:
from rq import use_connection
use_connection()
You can reuse an existing Redis connection that you are already using and pass
it into RQ’s use_connection function:
There are more advanced ways of connection management available however, so
please pick your favorite. You
can safely mix your Redis data with RQ, as RQ prefixes all of its keys with
rq:.
RQ offers functionality to put work on queues. It provides FIFO-semantics per
queue, but how many queues you create is up to you. For the simplest cases,
simply using the default queue suffices already.
Suppose you create two queues, named high and low. Both queues are equally
important to RQ; neither has higher priority as far as RQ is concerned. But
when you start a worker, you are defining queue priority by the order of the
arguments:
$ rqworker high low
12:47:35:
12:47:35: *** Listening on high, low...
12:47:35: high: mymodule.myfunc(6, 7) (cc183988-a507-4623-b31a-f0338031b613)
12:47:35: Job OK, result = 42
12:47:35:
12:47:35: *** Listening on high, low...
12:47:35: low: mymodule.myfunc(2, 3) (95fe658e-b23d-4aff-9307-a55a0ee55650)
12:47:36: Job OK, result = 6
12:47:36:
12:47:36: *** Listening on high, low...
12:47:36: low: mymodule.myfunc(4, 5) (bfb89229-3ce4-463c-abf8-f19c2808cb7c)
12:47:36: Job OK, result = 20
...
First, all work on the high queue is done (with FIFO semantics), then low
is emptied. If meanwhile work is enqueued on high, that work takes precedence
over the low queue again after the currently running job is finished.
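The ordering rule can be sketched with plain in-memory queues. This only illustrates the rule described above and is not RQ's worker code. The next_job helper is hypothetical and the jobs are just argument tuples.

```python
from collections import deque


def next_job(queues, order):
    """Return (queue_name, job) from the first non-empty queue in `order`."""
    for name in order:
        if queues[name]:
            return name, queues[name].popleft()
    return None  # nothing to do on any queue


queues = {"high": deque([(6, 7)]), "low": deque([(2, 3), (4, 5)])}
order = ["high", "low"]  # same order as `rqworker high low`

processed = [next_job(queues, order), next_job(queues, order)]
queues["high"].append((8, 9))  # new high-priority work arrives meanwhile
processed.append(next_job(queues, order))
processed.append(next_job(queues, order))
print(processed)
```

The job that arrives on high while low is being drained is picked up before the remaining low job, because every pick starts again at the front of the queue order.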
By default, these monitoring commands autorefresh every 2.5 seconds, but you
can change the refresh interval if you want to. See the monitoring
docs for more info.
RQ does not try to solve all of your queueing needs. But its codebase is
relatively small and certainly not overly complex. Nonetheless, I think it will
be helpful for all of the most basic queueing needs that you’ll encounter
during Python web development.
Of course, with all this also come some limitations:
I’m using RQ for two and a half web projects I’ve worked on during the last few
months, and I am currently at the point where I’m satisfied enough to open the
curtains to the world. So you’re invited to play with it. I’m very curious to
hear your thoughts about this.
Just a quick post to let you know that I discarded my `vim-pep8` and
`vim-pyflakes` Vim plugins yesterday in favor of
[vim-flake8](https://github.com/nvie/vim-flake8).
As you may know, PyFlakes is a static analysis tool that lets you catch static
programming errors when you write them, not when you run into them at runtime.
And pep8 is a Python style checking tool that enforces
PEP8 guidelines on your code.
Flake8, though, seems to be a much better
option to use these days. It integrates both pep8 and PyFlakes and even
combines them with a cyclomatic complexity checker (which is irrelevant for the
Vim plugin, by the way). To install Flake8, simply use:
$ pip install flake8
After installing the plugin in Vim, you can add the following command to your
.vimrc file to have it executed after every save of a Python source file.
autocmd BufWritePost *.py call Flake8()
To keep warnings on a specific line from being reported, put a # noqa comment at
the end of that line.
Assuming you already use vim-pathogen
(which you really should), you can simply install the plugin by cloning the
repository into the ~/.vim/bundle
folder.
Lately I’ve been getting sick of working with datetimes and timezones in
Python. The standard library offers many different conversion routines, but
does not prescribe a best practice way to deal with them. Luckily, Armin
Ronacher did in his article [Dealing with Timezones in
Python](http://lucumr.pocoo.org/2011/7/15/eppur-si-muove/).
The summary is to never ever work with local datetimes. When a local datetime
is input, immediately convert it to universal time and only ever store or
calculate with those. Only when presenting datetimes to the end user, convert
them to local time again.
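Using only the standard library, that rule looks roughly like the sketch below. The helper names local_to_utc and utc_to_local are hypothetical. To stay self-contained, the example assumes a fixed UTC+1 offset for the Netherlands, which ignores daylight saving time. A real application would use a proper time zone database instead.

```python
from datetime import datetime, timedelta, timezone

# Assumption: fixed offset for Central European Time, no DST handling.
CET = timezone(timedelta(hours=1), "CET")


def local_to_utc(naive_local, tz):
    """Interpret a naive datetime as local time in `tz` and convert to UTC."""
    return naive_local.replace(tzinfo=tz).astimezone(timezone.utc)


def utc_to_local(utc_dt, tz):
    """Convert an aware UTC datetime to local time for presentation."""
    return utc_dt.astimezone(tz)


# Input: a user enters 9:30 am local time; convert immediately.
alarm_utc = local_to_utc(datetime(2012, 2, 1, 9, 30), CET)

# Store and calculate with alarm_utc only; convert back when displaying.
alarm_shown = utc_to_local(alarm_utc, CET)
print(alarm_utc.isoformat(), alarm_shown.isoformat())
```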
This seems simple enough, alright. But to actually do it in Python, you still
have to think about how to implement it correctly. Every. Single. Time.
pytz does help a bit here, but it still isn’t trivial. It should be.
Meet Times, a very small Python library to
deal with conversions from universal to local timezones and vice versa. It’s
focused on simplicity and opinionated about what is good practice.
Imagine you’re building a web app that allows your users to set an alarm. Say
that someone in the Netherlands sets an alarm to 9:30 am. You can use times
to simplify this:
I’ve added the ability to create universal times from two other sources: UNIX
timestamps and date strings. To use any of these, simply pass them to the
to_universal function, like so:
Note that UNIX timestamps must be in UTC (which the output of time.time()
is). Local UNIX timestamps are not accepted.
To create universal times from string representations, Times uses the
advanced parser from the python-dateutil library. Time zones are
automatically recognized if such info is encoded in the string representation.
In any other case, you are required to provide it explicitly. Two examples to
illustrate both variants:
I released my first iPad app to import/manage chords and lyrics.
It’s been quite a while since I took the time to update this blog. Many things
have happened in the meantime, though. The most important for me is
that I launched an iPad app and founded a company called 3rd Cloud last
week.
An annoying problem amateur musicians might be familiar with is that chords or
tablature websites all look very different, present their song data in various
formats, and oftentimes are just plain ugly. Oh, and they’re generally paved
with ads, too. So to scratch our own itches, I teamed up with
@jr00n from
StudioWolff to create Chords + Lyrics.
Chords + Lyrics is a simple music manager for your iPad that allows you to
easily import songs and lyrics from your favorite chords website (that means:
any website—it recognizes the chords semantically):
Once imported, the chords become editable objects in the form of bubbles, which
makes it easy for you to edit or finetune the imported songs. A carefully
selected choice of fonts is used to create a readable and uniform look and feel
for your song’s chords and lyrics:
Furthermore, once the editing is done, you can simply take Chords + Lyrics on
stage with your band or solo performance and leave your stack of sheet music at
home. When the device is rotated into landscape orientation, the user interface
transforms into a big music stand that arranges the songs of choice as a stack
of virtual music sheets in the order of your playlist:
You can get a sneak peek of the app by watching this video:
The app is sold at $5.99. Check us out in the App Store.
Jeroen and I met during iOS Dev Camp last March
and were immediately excited about the idea for Chords + Lyrics. We even
received the Most Likely to Succeed award from Dom
Sogolla and Mike Lee at the
end of those two days, which got us even more excited about the project.
We started hacking away at it in our spare time for the next few months. At
the same time, by some kind of lucky coincidence, the same Mike Lee started
building an awesome developer community for app developers:
Appsterdam.
Many of our thanks therefore go out to all of the Appsterdammers, in particular
Mike and Judy, for inspiring us at the moments
we got stuck and for the helpful pieces of advice we got from them. Without the
Appsterdam community, the project might never have seen the light of day, or
would have been much less awesome.
Mr. [Dave Bock](http://www.davebock.com/) of Code Sherpas put together a nice
screencast demonstrating a few of the most important git-flow features on their
[publications](http://www.codesherpas.com/portfolio/publications) page.
Where I lay out the recent changes I made to my Vim setup.
A few weeks ago, I felt inspired by articles from Jeff
Kreeftmeijer and Armin
Ronacher. I took some
time to configure and fine-tune my Vim environment. A lot of new stuff made it
into my .vimrc file and my .vim directory. This blog post is a summary
describing what I’ve added and how I use it in my daily work.
Before doing anything else, make sure you have the following line in your
.vimrc file:
" This must be first, because it changes other options as side effect
set nocompatible
Before you start configuring, it’s useful to install
pathogen. Plugins in
Vim are files that you drop in subdirectories of your .vim/ directory. Many
plugins consist of only a single file that should be dropped in .vim/plugin,
but some consist of multiple files. For example, they come with documentation, or
ship syntax files. In those cases, files need to be dropped into .vim/doc and
.vim/syntax. This makes it difficult to remove the plugin afterwards. After
installing pathogen, you can simply unzip a plugin distribution into
.vim/bundle/myplugin, under which the required subdirectories are created.
Removing the plugin, then, is as simple as removing the myplugin directory.
So, download pathogen.vim, move it into the .vim/autoload directory (create
it if necessary) and add the following lines to your .vimrc, to activate it:
" Use pathogen to easily modify the runtime path to include all
" plugins under the ~/.vim/bundle directory
call pathogen#helptags()
call pathogen#runtime_append_all_bundles()
Next, I’ve remapped the leader key to , (comma) instead of the default \
(backslash), just because I like it better. Since in Vim’s default
configuration, almost every key is already mapped to a command, there needs to
be some sort of standard “free” key under which you can place custom mappings.
This is called the “mapleader”, and can be defined like this:
" change the mapleader from \ to ,
let mapleader=","
Once that is done, this is a little tweak that is a time-saver while you’re
building up your .vimrc. Here, we start using the leader key:
One particularly useful setting is hidden. Its name isn’t too descriptive,
though. It hides buffers instead of closing them. This means that you can
have unwritten changes to a file and open a new file using :e, without being
forced to write or undo your changes first. Also, undo buffers and marks are
preserved while the buffer is open. This is an absolute must-have.
set hidden
These are some of the most basic settings that you probably want to enable,
too:
set nowrap " don't wrap lines
set tabstop=4 " a tab is four spaces
set backspace=indent,eol,start
" allow backspacing over everything in insert mode
set autoindent " always set autoindenting on
set copyindent " copy the previous indentation on autoindenting
set number " always show line numbers
set shiftwidth=4 " number of spaces to use for autoindenting
set shiftround " use multiple of shiftwidth when indenting with '<' and '>'
set showmatch " set show matching parenthesis
set ignorecase " ignore case when searching
set smartcase " ignore case if search pattern is all lowercase,
" case-sensitive otherwise
set smarttab " insert tabs on the start of a line according to
" shiftwidth, not tabstop
set hlsearch " highlight search terms
set incsearch " show search matches as you type
There is a lot more goodness in my
.vimrc file, which is put in
there with a lot of love. I’ve commented most of it, too. Feel free to poke
around in it.
Also, I like Vim to have a large undo buffer, a large history of commands,
ignore some file extensions when completing names by pressing Tab, and be
silent about invalid cursor moves and other errors.
set history=1000 " remember more commands and search history
set undolevels=1000 " use many muchos levels of undo
set wildignore=*.swp,*.bak,*.pyc,*.class
set title " change the terminal's title
set visualbell " don't beep
set noerrorbells " don't beep
I don't like Vim to ever write a backup file. I prefer more modern
ways of protecting against data loss.
set nobackup
set noswapfile
There have been some passionate responses about this in comments, so a warning
may be appropriate here. If you care about recovering after a Vim or terminal
emulator crash, or you often load huge files into memory, do not disable
the swapfile. I personally save/commit so
often
that the swap file adds nothing. Sometimes I consciously kill a terminal
forcefully, and I only find the swap file recovery process annoying.
Vim can detect file types (by their extension, or by peeking inside the file).
This enables Vim to load plugins, settings or key mappings that are only useful
in the context of specific file types. For example, a Python syntax checker
plugin only makes sense in a Python file. Finally, indenting intelligence is
enabled based on the syntax rules for the file type.
filetype plugin indent on
To set some file type specific settings, you can now use the following:
autocmd filetype python set expandtab
To remain compatible with older versions of Vim that do not have the autocmd
functions, always wrap those functions inside a block like this:
Somewhat related to the file type plugins is the syntax highlighting of
different types of source files. Vim uses syntax definitions to highlight
source code. Syntax definitions simply declare where a function name starts,
which pieces are commented out and what are keywords. To color them, Vim uses
colorschemes. You can load custom color schemes by placing them in
.vim/colors, then load them using the colorscheme command. You have to try
what you like most. I like
mustang
a lot.
if &t_Co >= 256 || has("gui_running")
colorscheme mustang
endif
if &t_Co > 2 || has("gui_running")
" switch syntax highlighting on, when the terminal has colors
syntax on
endif
In this case, mustang is only loaded if the terminal emulator Vim runs in
supports at least 256 colors (or if you use the GUI version of Vim).
Hint:
If you’re using a terminal emulator that can show 256 colors, try setting
TERM=xterm-256color in your terminal configuration or in your shell’s .rc
file.
When you write a lot of code, you probably want to obey certain style rules.
In some programming languages (like Python), whitespace is important, so you
may not just swap tabs for spaces and even the number of spaces is important.
Vim can highlight whitespaces for you in a convenient way:
set list
set listchars=tab:>.,trail:.,extends:#,nbsp:.
This line will make Vim set out tab characters, trailing whitespace and
invisible spaces visually, and additionally use the # sign at the end of
lines to mark lines that extend off-screen. For more info, see :h listchars.
In some files, like HTML and XML files, tabs are fine and showing them is
really annoying. You can disable that easily using an autocmd declaration:
autocmd filetype html,xml set listchars-=tab:>.
One caveat when setting listchars: if nothing happens, you have probably not
enabled list, so try :set list, too.
Every Vim user likes to enable auto-indenting of source code, so Vim can
intelligently position you cursor on the next line as you type. This has one
big ugly consequence however: when you paste text into your terminal-based Vim
with a right mouse click, Vim cannot know it is coming from a paste. To Vim, it
looks like text entered by someone who can type incredibly fast :) Since Vim
thinks these are regular keystrokes, it applies all auto-indenting and
auto-expansion of defined abbreviations to the input, resulting in often
cascading indents of paragraphs.
There is an easy option to prevent this, however. You can temporarily switch to
“paste mode”, simply by setting the following option:
set pastetoggle=<F2>
Then, when in insert mode, ready to paste, if you press <F2>, Vim will switch
to paste mode, disabling all kinds of smartness and just pasting a whole buffer
of text. Then, you can disable paste mode again with another press of <F2>.
Nice and simple. Compare paste mode disabled vs enabled:
Another great trick I read in a reddit
comment
is to use <C-r>+ to paste right from the OS paste board. Of course, this only
works when running Vim locally (i.e. not over an SSH connection).
While using the mouse is considered a deadly sin among Vim users, there are
a few features about the mouse that can really come to your advantage. Most
notably—scrolling. In fact, it’s the only thing I use it for.
Also, if you are a rookie Vim user, setting this value will make your Vim
experience definitely feel more natural.
To enable the mouse, use:
set mouse=a
However, this comes at one big disadvantage: when you run Vim inside
a terminal, the terminal itself cannot control your mouse anymore. Therefore,
you cannot select text anymore with the terminal (to copy it to the system
clipboard, for example).
To be able to have the best of both worlds, I wrote this simple Vim plugin:
vim-togglemouse. It maps <F12> to
toggle your mouse “focus” between Vim and the terminal.
Small plugins like these are really useful, yet have the additional benefit of
lowering the barrier of learning the Vim scripting language. At the core, this
plugin consists of only one simple function:
fun! s:ToggleMouse()
    if !exists("s:old_mouse")
        let s:old_mouse = "a"
    endif
    if &mouse == ""
        let &mouse = s:old_mouse
        echo "Mouse is for Vim (" . &mouse . ")"
    else
        let s:old_mouse = &mouse
        let &mouse=""
        echo "Mouse is for terminal"
    endif
endfunction
The following trick is a really small one, but a super-efficient one, since it
strips off two full keystrokes from almost every Vim command:
nnoremap ; :
For example, to save a file, you type :w normally, which means:
1. Press and hold Shift
2. Press ;
3. Release the Shift key
4. Press w
5. Press Return
This trick strips off steps 1 and 3 for each Vim command. It takes some
time for your muscle memory to get used to this new ;w command, but once you
use it, you don’t want to go back!
I also find this key binding very useful, since I like to reformat paragraph
text often. Just set your cursor inside a paragraph and press Q (or select
a visual block and press Q).
" Use Q for formatting the current paragraph (or selection)
vmap Q gq
nmap Q gqap
If you are still getting used to Vim and want to force yourself to stop using
the arrow keys, add this:
If you like long lines with line wrapping enabled, this solves the problem that
pressing down jumps your cursor “over” the current line to the next line. It
changes behaviour so that it jumps to the next row in the editor (much more
natural):
nnoremap j gj
nnoremap k gk
When you start to use Vim more professionally, you want to work with multiple
windows open. Navigating requires you to press C-w first, then a navigation
command (h, j, k, l). This makes it easier to navigate focus through windows:
Tired of clearing highlighted searches by searching for “ldsfhjkhgakjks”? Use
this:
nmap <silent> ,/ :nohlsearch<CR>
I used to have it mapped to :let @/="", but some users kindly pointed out
that it is better to use :nohlsearch, because it keeps the search history
intact.
It clears the search buffer when you press ,/
Finally, a trick by Steve
Losh for when
you forgot to sudo before editing a file that requires root privileges
(typically /etc/hosts). This lets you use w!! to do that after you
opened the file already:
In order not to make this article any longer than it already is, here’s
a list of other plugins that are really worth checking out (I use each of them
regularly):
localrc: lets you
load specific Vim settings for any file in the same directory (or
a subdirectory thereof). Comes in super handy for project-wide settings.
CtrlP: lets you open files or switch
buffers quickly using fuzzy search. I'd highly recommend it.
Note: This blog post was written in 2010, and a lot has changed since
then. I’m not using any of this anymore. The post is still available for
posterity, but do take this into account when reading this.
Finally, I’ve made the move to a static blog engine! I’m using
nanoc now (bye bye WordPress). nanoc is a very
flexible and customizable static site generator, written by Denis
Defreyne.
As with all static site generators, nanoc lets you write your source files in
a simple markup language. Out of the box, nanoc offers you the choice of using
Markdown, Textile, reStructuredText or plain HTML (with or without embedded
Ruby). In fact, nanoc is nothing more than a generator honoring a Rules-file
that tells it how to compile, layout and route the site’s items.
An “item” is a file on your website. It can be any kind of file, like a web
site page (HTML), an image, a JavaScript or CSS file or an RSS feed. During
the compile phase, you specify which sequential actions should be performed on
the content of that item. These actions are called filters. Some examples
of filters are an embedded Ruby filter, a Textile-to-HTML converter, a less
compiler, or a CSS minifier. Filters can be chained, for example:
compile '/static/css/*/' do
  # compress CSS :)
  filter :less
  filter :rainpress
end
Which turns .less-files into compressed CSS:
Any filter you can imagine, nanoc can handle. nanoc comes with a lot of filters
out of the box, but even writing your own filters really is a piece of cake.
After compiling (i.e. transforming content through filters) comes the routing
of the items. This is a means of assigning file names to compiled content.
nanoc calculates default file names from the input, but you can use this to
influence the default naming. A special case is where you set the route to
nil, which doesn’t write the file at all. I use this to test draft posts
locally, like this (oh, did I mention the Rules file is 100% Ruby?):
route '/posts/*/' do
  if $include_drafts or @item[:published] then
    '/posts/' + @item.slug + '/index.html'
  end
end
Finally, layouts are applied. Layouts are kind of templates that can be used to
“frame” the item’s contents. This is typically used for HTML files only, but
isn’t limited to it. For example, the blog posts are compiled into (partial)
HTML, and the layout rules put the site’s container HTML around it, adding CSS
styling, jQuery scripts, the header, sidebars and footer and Google Analytics
tracking (these go for each page). There’s a special extra layout rule for blog
post pages, which additionally adds Disqus comments.
In short, now that nanoc is fully configured to my wishes, I can simply focus on
writing blog content, without preparing image content (it is done
automatically), and without having to choose between either a “WYSIWYG” editor
or writing HTML manually. And I can do it in an offline fashion, too, which was
one of my main complaints about WordPress.
So I’m happy.
Oh, and since I have been converting my blog anyway, I also created a new look
and feel for it. I hope you like it. Feel free to comment.
Last week, I silently tagged gitflow 0.2. The most important changes
since 0.1 are:
Order of arguments changed to have a more “gitish” subcommand structure. For
example, you now say: git flow feature start myfeature
Better initializer. git flow init now prompts interactively to set up
a gitflow enabled repo.
Added a command to list all feature/release/hotfix/support branches, e.g.:
git flow feature list
Made all merge/rebase operations failsafe, providing a non-destructive
workflow in case of merge conflicts.
Easy diff’ing of all changes on a specific (or the current) feature branch:
git flow feature diff [feature]
Add support for feature branch rebasing: git flow feature rebase
Some subactions now take name prefixes as their arguments, for convenience.
For example, if you have feature branches called “experimental”,
“refactoring” and “feature-X”, you could say: git flow feature finish ref
and gitflow will know you mean the “refactoring” feature branch.
These actions are: finish, diff and rebase.
Much better overall sanity checking.
Better portability (POSIX compliant code)
Better (more portable) flag parsing using Kate Ward’s
shFlags.
Improved installer. To install git flow as a first-class Git subcommand,
simply type: sudo make install
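The name-prefix matching mentioned in the list above could work roughly like this. This is only a sketch of the idea, not gitflow's actual code: resolve_prefix is a hypothetical helper that accepts a prefix only when it matches exactly one branch name.

```shell
# resolve_prefix PREFIX NAME...
# Prints the single NAME that starts with PREFIX.
# Fails (non-zero exit) when no name or more than one name matches.
resolve_prefix() {
    prefix=$1
    shift
    match=
    count=0
    for name in "$@"; do
        case "$name" in
            "$prefix"*)
                match=$name
                count=$((count + 1))
                ;;
        esac
    done
    if [ "$count" -eq 1 ]; then
        printf '%s\n' "$match"
    else
        echo "'$prefix' matches $count branches" >&2
        return 1
    fi
}

resolve_prefix ref experimental refactoring feature-X   # prints: refactoring
```

Refusing ambiguous prefixes keeps a typo from finishing the wrong branch.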
After the overwhelming attention and feedback on the Git branching model
post, a general consensus was that this workflow would benefit from some
form of proper scriptability. This post proposes the initial version of
a tool I called git-flow.
After the overwhelming attention and feedback on the Git branching model
post, a general consensus was that
this workflow would benefit from some form of proper scriptability. The
workflow works seamlessly if you perform the steps involved manually, but hey…
manually is manually, really.
UPDATE 2/4/2010:
I recommend NOT USING this very early release, but to jump on the current
develop tip, which is much more mature. Release 0.2 is coming very
soon.
An assisting tool (dubbed gitflow) was therefore created to provide simple,
high-level commands to adopt the workflow into your own software development
process. It’s free and it’s open source. Feel free to contribute to it if you
like.
The gitflow script essentially features six subcommands: paired start/finish
commands for managing the different types of branches from the originating
article:
Feature branches:
gitflow start feature <myfeature>
gitflow finish feature <myfeature>
Release branches:
gitflow start release <version-id>
gitflow finish release <version-id>
Hotfix branches:
gitflow start hotfix <version-id>
gitflow finish hotfix <version-id>
Each of these scripts reports exactly what actions were taken and what
follow-up actions are required by the user. This output will be polished in
future versions to improve the UX. An example output:
$ gitflow finish feature foo
Branches 'develop' and 'origin/develop' have diverged.
And local branch 'develop' is ahead of 'origin/develop'.
Switched to branch "develop"
Your branch is ahead of 'origin/develop' by 12 commits.
Merge made by recursive.
README | 2 ++
1 files changed, 2 insertions(+), 0 deletions(-)
Deleted branch foo (cd3effb).
Summary of actions:
- The feature branch 'foo' was merged into 'develop'
- Feature branch 'foo' has been removed
- You are now on branch 'develop'
This model was conceived in 2010, now more than 10 years ago, and not very
long after Git itself came into being. In those 10 years, git-flow (the
branching model laid out in this article) has become hugely popular in many
a software team to the point where people have started treating it like
a standard of sorts — but unfortunately also as a dogma or panacea.
During those 10 years, Git itself has taken the world by storm, and the
most popular type of software that is being developed with Git is shifting
more towards web apps — at least in my filter bubble. Web apps are typically
continuously delivered, not rolled back, and you don't have to support
multiple versions of the software running in the wild.
This is not the class of software that I had in mind when I wrote the blog
post 10 years ago. If your team is doing continuous delivery of software,
I would suggest adopting a much simpler workflow (like GitHub flow) instead
of trying to shoehorn git-flow into your team.
If, however, you are building software that is explicitly versioned, or if
you need to support multiple versions of your software in the wild, then
git-flow may still be as good of a fit to your team as it has been to people
in the last 10 years. In that case, please read on.
To conclude, always remember that panaceas don't exist. Consider your own
context. Don't be hating. Decide for yourself.
In this post I present the development model that I’ve introduced for some of
my projects (both at work and private) about a year ago, and which has turned
out to be very successful. I’ve been meaning to write about it for a while now,
but I’ve never really found the time to do so thoroughly, until now. I won’t
talk about any of the projects’ details, merely about the branching strategy
and release management.
For a thorough discussion on the pros and cons of Git compared to centralized
source code control systems, see the web. There are plenty of flame wars going
on there. As a developer, I prefer Git above all other tools around
today. Git really changed the way developers think of merging and branching.
From the classic CVS/Subversion world I came from, merging/branching has always
been considered a bit scary (“beware of merge conflicts, they bite you!”) and
something you only do every once in a while.
But with Git, these actions are extremely cheap and simple, and they are
considered one of the core parts of your daily workflow, really. For example,
in CVS/Subversion books, branching and merging is first discussed in the later
chapters (for advanced users), while in every Git book, it’s already covered
in chapter 3 (basics).
As a consequence of its simplicity and repetitive nature, branching and merging
are no longer something to be afraid of. Version control tools are supposed to
assist in branching/merging more than anything else.
Enough about the tools, let’s head onto the development model. The model that
I’m going to present here is essentially no more than a set of procedures that
every team member has to follow in order to come to a managed software
development process.
The repository setup that we use and that works well with this branching model,
is that with a central “truth” repo. Note that this repo is only considered
to be the central one (since Git is a DVCS, there is no such thing as a central
repo at a technical level). We will refer to this repo as origin, since this
name is familiar to all Git users.
Each developer pulls and pushes to origin. But besides the centralized
push-pull relationships, each developer may also pull changes from other peers
to form sub teams. For example, this might be useful to work together with two
or more developers on a big new feature, before pushing the work in progress to
origin prematurely. In the figure above, there are subteams of Alice and Bob,
Alice and David, and Clair and David.
Technically, this means nothing more than that Alice has defined a Git remote,
named bob, pointing to Bob’s repository, and vice versa.
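As a rough illustration of such a peer remote, the following sketch sets up two throwaway repositories in a temporary directory. The paths, the branch name big-feature and the identities are made up for the example; in practice the remote URL would point at Bob's real repository.

```shell
set -e
work=$(mktemp -d)

# Bob's repository, with some work in progress on a feature branch.
git init -q "$work/bob"
cd "$work/bob"
git config user.email "bob@example.com"
git config user.name "Bob"
git commit -q --allow-empty -m "Initial commit"
git checkout -q -b big-feature
git commit -q --allow-empty -m "Start big feature"

# Alice's repository: register Bob's repo as a remote named "bob" and fetch.
git init -q "$work/alice"
cd "$work/alice"
git remote add bob "$work/bob"
git fetch -q bob

# Bob's work is now visible to Alice as the remote branch bob/big-feature.
git branch -r
```

Bob would do the same in the other direction, adding a remote named alice.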
At the core, the development model is greatly inspired by existing models out
there. The central repo holds two main branches with an infinite lifetime:
master
develop
The master branch at origin should be familiar to every Git user. Parallel
to the master branch, another branch exists called develop.
We consider origin/master to be the main branch where the source code of
HEAD always reflects a production-ready state.
We consider origin/develop to be the main branch where the source code of
HEAD always reflects a state with the latest delivered development changes
for the next release. Some would call this the “integration branch”. This is
where any automatic nightly builds are built from.
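A minimal sketch of getting to this two-branch setup, assuming a fresh throwaway repository (the identity values are placeholders):

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
# Name the initial branch "master", whatever the local default is.
git symbolic-ref HEAD refs/heads/master
git config user.email "dev@example.com"
git config user.name "Example Dev"
git commit -q --allow-empty -m "Initial commit"

# develop starts out at the same commit as master and lives next to it.
git branch develop master
git branch --list
```

With a real central repo, both branches would then be pushed to origin, for example with git push -u origin develop.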
When the source code in the develop branch reaches a stable point and is
ready to be released, all of the changes should be merged back into master
somehow and then tagged with a release number. How this is done in detail will
be discussed further on.
Therefore, each time when changes are merged back into master, this is a new
production release by definition. We tend to be very strict about this, so that
theoretically, we could use a Git hook script to automatically build and
roll out our software to our production servers every time there was a commit on
master.
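Such a hook could look roughly like the sketch below. It assumes a server-side post-receive hook, which Git feeds one "<old> <new> <ref>" line per updated ref on standard input. The function name deploy_on_master and the commented-out build-and-deploy.sh script are hypothetical.

```shell
# deploy_on_master: read post-receive style lines ("<old> <new> <ref>") from
# stdin and trigger a deployment for every update of the master branch.
deploy_on_master() {
    while read -r old new ref; do
        if [ "$ref" = "refs/heads/master" ]; then
            echo "Deploying $new to production"
            # ./build-and-deploy.sh "$new"   # hypothetical script, not shown here
        fi
    done
}

# In hooks/post-receive on the central repo, the whole hook body would be:
#   deploy_on_master
```

Updates to develop or any other branch are simply ignored, which matches the rule that only commits on master are releases.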
Next to the main branches master and develop, our development model uses
a variety of supporting branches to aid parallel development between team
members, ease tracking of features, prepare for production releases and to
assist in quickly fixing live production problems. Unlike the main branches,
these branches always have a limited lifetime, since they will be removed
eventually.
The different types of branches we may use are:
Feature branches
Release branches
Hotfix branches
Each of these branches has a specific purpose and is bound to strict rules as
to which branches may be their originating branch and which branches must be
their merge targets. We will walk through them in a minute.
By no means are these branches “special” from a technical perspective. The
branch types are categorized by how we use them. They are of course plain old
Git branches.
May branch off from:
develop
Must merge back into:
develop
Branch naming convention:
anything except master, develop, release-*, or hotfix-*
Feature branches (sometimes called topic branches) are used to develop new
features for the upcoming or a distant future release. When starting
development of a feature, the target release in which this feature will be
incorporated may well be unknown at that point. The essence of a feature branch
is that it exists as long as the feature is in development, but will eventually
be merged back into develop (to definitely add the new feature to the
upcoming release) or discarded (in case of a disappointing experiment).
Feature branches typically exist in developer repos only, not in origin.
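The naming convention for feature branches can be checked mechanically. Here is a small sketch; is_valid_feature_name is a hypothetical helper, not something Git or gitflow provides.

```shell
# is_valid_feature_name NAME
# Succeeds for any name except master, develop, release-* and hotfix-*.
is_valid_feature_name() {
    case "$1" in
        ""|master|develop|release-*|hotfix-*) return 1 ;;
        *) return 0 ;;
    esac
}

if is_valid_feature_name "myfeature"; then
    echo "myfeature is fine as a feature branch name"
fi
```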
The --no-ff flag causes the merge to always create a new commit object, even
if the merge could be performed with a fast-forward. This avoids losing
information about the historical existence of a feature branch and groups
together all commits that together added the feature. Compare:
In the latter case, it is impossible to see from the Git history which of the
commit objects together have implemented a feature—you would have to manually
read all the log messages. Reverting a whole feature (i.e. a group of commits),
is a true headache in the latter situation, whereas it is easily done if the
--no-ff flag was used.
Yes, it will create a few more (empty) commit objects, but the gain is much
bigger than the cost.
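To see the effect of --no-ff for yourself, here is a sketch that replays the situation in a throwaway repository. The branch name myfeature and the commit messages are just examples, and the commits are empty to keep things short.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/develop
git config user.email "dev@example.com"
git config user.name "Example Dev"
git commit -q --allow-empty -m "Initial commit"

# Develop a feature in two commits on its own branch.
git checkout -q -b myfeature develop
git commit -q --allow-empty -m "Add feature, part 1"
git commit -q --allow-empty -m "Add feature, part 2"

# Merge it back. A fast-forward would be possible here, but --no-ff forces
# a merge commit whose second parent is the tip of the feature branch.
git checkout -q develop
git merge -q --no-ff -m "Merge branch 'myfeature' into develop" myfeature

git log --graph --oneline
```

Because the feature's commits hang off one merge commit, reverting the whole feature comes down to reverting that single commit (git revert -m 1 on the merge).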
Release branches
May branch off from:
develop
Must merge back into:
develop and master
Branch naming convention:
release-*
Release branches support preparation of a new production release. They allow
for last-minute dotting of i’s and crossing t’s. Furthermore, they allow for
minor bug fixes and preparing meta-data for a release (version number, build
dates, etc.). By doing all of this work on a release branch, the develop
branch is cleared to receive features for the next big release.
The key moment to branch off a new release branch from develop is when
develop (almost) reflects the desired state of the new release. At least all
features that are targeted for the release-to-be-built must be merged in to
develop at this point in time. All features targeted at future releases may
not—they must wait until after the release branch is branched off.
It is exactly at the start of a release branch that the upcoming release gets
assigned a version number—not any earlier. Up until that moment, the develop
branch reflected changes for the “next release”, but it is unclear whether that
“next release” will eventually become 0.3 or 1.0, until the release branch is
started. That decision is made on the start of the release branch and is
carried out by the project’s rules on version number bumping.
Release branches are created from the develop branch. For example, say
version 1.1.5 is the current production release and we have a big release
coming up. The state of develop is ready for the “next release” and we have
decided that this will become version 1.2 (rather than 1.1.6 or 2.0). So we
branch off and give the release branch a name reflecting the new version
number:
$ git checkout -b release-1.2 develop
Switched to a new branch "release-1.2"
$ ./bump-version.sh 1.2
Files modified successfully, version bumped to 1.2.
$ git commit -a -m "Bumped version number to 1.2"
[release-1.2 74d9424] Bumped version number to 1.2
1 files changed, 1 insertions(+), 1 deletions(-)
After creating a new branch and switching to it, we bump the version number.
Here, bump-version.sh is a fictional shell script that changes some files in
the working copy to reflect the new version. (This can of course be a manual
change—the point being that some files change.) Then, the bumped version
number is committed.
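Since bump-version.sh is fictional, here is one possible shape for it, purely as an illustration. It assumes the version lives in a single plain-text file (VERSION by default); real projects usually have to touch more files than that.

```shell
# bump_version NEW_VERSION [FILE]
# Writes NEW_VERSION into FILE (default: VERSION) in the current directory.
bump_version() {
    version=$1
    file=${2:-VERSION}
    printf '%s\n' "$version" > "$file"
    echo "Files modified successfully, version bumped to $version."
}

# Usage, from the root of the working copy:
#   bump_version 1.2
```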
This new branch may exist there for a while, until the release may be rolled
out definitely. During that time, bug fixes may be applied in this branch
(rather than on the develop branch). Adding large new features here is
strictly prohibited. They must be merged into develop, and therefore, wait
for the next big release.
When the state of the release branch is ready to become a real release, some
actions need to be carried out. First, the release branch is merged into
master (since every commit on master is a new release by definition,
remember). Next, that commit on master must be tagged for easy future
reference to this historical version. Finally, the changes made on the release
branch need to be merged back into develop, so that future releases also
contain these bug fixes.
The first two steps in Git:
$ git checkout master
Switched to branch 'master'
$ git merge --no-ff release-1.2
Merge made by recursive.
(Summary of changes)
$ git tag -a 1.2
The release is now done, and tagged for future reference.
Edit: You may also want to use the -s or -u <key> flags to sign
your tag cryptographically.
To keep the changes made in the release branch, we need to merge those back
into develop, though. In Git:
$ git checkout develop
Switched to branch 'develop'
$ git merge --no-ff release-1.2
Merge made by recursive.
(Summary of changes)
This step may well lead to a merge conflict (probably even, since we have
changed the version number). If so, fix it and commit.
Now we are really done and the release branch may be removed, since we don’t
need it anymore:
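A sketch of the whole finishing sequence, replayed in a throwaway repository so it can be run safely. The empty commits and the tag message are placeholders; git branch -d is the standard way to delete a branch that has been fully merged.

```shell
set -e
repo=$(mktemp -d)
cd "$repo"
git init -q
git symbolic-ref HEAD refs/heads/master
git config user.email "dev@example.com"
git config user.name "Example Dev"
git commit -q --allow-empty -m "Initial commit"
git branch develop

# A release branch with one commit on it.
git checkout -q -b release-1.2 develop
git commit -q --allow-empty -m "Bumped version number to 1.2"

# 1. Merge into master and tag the release.
git checkout -q master
git merge -q --no-ff -m "Merge branch 'release-1.2'" release-1.2
git tag -a 1.2 -m "Release 1.2"

# 2. Merge the release changes back into develop.
git checkout -q develop
git merge -q --no-ff -m "Merge branch 'release-1.2' into develop" release-1.2

# 3. Remove the release branch, which is now fully merged.
git branch -d release-1.2
```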
May branch off from:
master
Must merge back into:
develop and master
Branch naming convention:
hotfix-*
Hotfix branches are very much like release branches in that they are also meant
to prepare for a new production release, albeit unplanned. They arise from the
necessity to act immediately upon an undesired state of a live production
version. When a critical bug in a production version must be resolved
immediately, a hotfix branch may be branched off from the corresponding tag on
the master branch that marks the production version.
The essence is that work of team members (on the develop branch) can
continue, while another person is preparing a quick production fix.
Hotfix branches are created from the master branch. For example, say version
1.2 is the current production release running live and causing troubles due to
a severe bug. But changes on develop are yet unstable. We may then branch off
a hotfix branch and start fixing the problem:
$ git checkout -b hotfix-1.2.1 master
Switched to a new branch "hotfix-1.2.1"
$ ./bump-version.sh 1.2.1
Files modified successfully, version bumped to 1.2.1.
$ git commit -a -m "Bumped version number to 1.2.1"
[hotfix-1.2.1 41e61bb] Bumped version number to 1.2.1
1 files changed, 1 insertions(+), 1 deletions(-)
Don’t forget to bump the version number after branching off!
Then, fix the bug and commit the fix in one or more separate commits.
$ git commit -m "Fixed severe production problem"
[hotfix-1.2.1 abbe5d6] Fixed severe production problem
5 files changed, 32 insertions(+), 17 deletions(-)
When finished, the bugfix needs to be merged back into master, but also needs
to be merged back into develop, in order to safeguard that the bugfix is
included in the next release as well. This is completely similar to how release
branches are finished.
First, update master and tag the release.
$ git checkout master
Switched to branch 'master'
$ git merge --no-ff hotfix-1.2.1
Merge made by recursive.
(Summary of changes)
$ git tag -a 1.2.1
Edit: You may also want to use the -s or -u <key> flags to sign
your tag cryptographically.
Next, include the bugfix in develop, too:
$ git checkout develop
Switched to branch 'develop'
$ git merge --no-ff hotfix-1.2.1
Merge made by recursive.
(Summary of changes)
The one exception to the rule here is that, when a release branch currently
exists, the hotfix changes need to be merged into that release branch, instead
of develop. Back-merging the bugfix into the release branch will eventually
result in the bugfix being merged into develop too, when the release branch
is finished. (If work in develop immediately requires this bugfix and cannot
wait for the release branch to be finished, you may safely merge the bugfix
into develop now already as well.)
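The exception can be expressed as a small decision helper. This is a sketch only: hotfix_merge_target is a hypothetical function, and it assumes at most one release-* branch exists at a time.

```shell
# hotfix_merge_target
# Prints the branch a finished hotfix should be merged into besides master:
# the current release branch if one exists, otherwise develop.
hotfix_merge_target() {
    release=$(git for-each-ref --format='%(refname:short)' 'refs/heads/release-*' | head -n 1)
    if [ -n "$release" ]; then
        printf '%s\n' "$release"
    else
        echo develop
    fi
}

# Usage, inside a repository, after merging the hotfix into master:
#   git checkout "$(hotfix_merge_target)"
#   git merge --no-ff hotfix-1.2.1
```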
While there is nothing really shockingly new to this branching model, the “big
picture” figure that this post began with has turned out to be tremendously
useful in our projects. It forms an elegant mental model that is easy to
comprehend and allows team members to develop a shared understanding of the
branching and releasing processes.
A high-quality PDF version of the figure is provided here. Go ahead and hang it
on the wall for quick reference at any time.
Update: And for anyone who requested it: here’s the
gitflow-model.src.key of the main diagram image (Apple Keynote).
A few months ago, I wrote about automatically generating classes for your Core
Data entities and how to automate Xcode using user scripts, such that,
when your model changed, you only needed to run your custom script again and
your intermediate model files would reflect the new situation.
Well, the guys from the mogenerator project have come up with
a far superior solution in the meantime. The newest version of mogenerator
comes with an Xcode plugin named Xmo’d, which monitors your *.xcdatamodel
file for changes and, as soon as it changes, regenerates all of the necessary
files.
This means that there is officially no more reason not to use mogenerator.
To set it up, download the installer package from their (improved) project
website and install it. (Before
installing, please read the important release note about the renamed method
+newInManagedObjectContext:.)
When installed, all you need to do is Command-click your *.xcdatamodel file,
click Get Info, switch to the Comments tab and add the string “xmod” to the
comment field. This is the trigger for Xmo’d to start (re)generating your
machine-classes (the underscored class files) when the data model changes.
Brilliant!
Oh, the default location at which the generated files will be emitted, is in
a folder named after your project, right next to where your *.xcdatamodel
already sits:
When designing a Core Data data model for your Xcode projects, you can choose
to create Objective-C object wrappers for your entities, so that you can profit
from type-safe code. The normal, tedious, workflow for this is that you select
each entity from the model designer, select all of its attributes and
relationships, Ctrl-click it and from the contextual menu first select “Copy
Obj-C 2.0 Method Declarations To Clipboard”, paste it into the appropriate
class header file, then do the same thing for the method implementations in the
class implementation file. Waaaaaay too much work. Not to mention the manual
copy-pastes are really hard to keep in sync once you start adding functionality
to these class files, since you don’t want to overwrite those additions, but
you want to keep replacing everything else.
Fortunately, there is a great way for automating this process, using
mogenerator. The tool can be downloaded as a DMG installer (Aral Balkan’s blog
mentions a workaround for older Xcode versions, but for Xcode 3.1.3 it worked
out of the box for me), or you can check out the sources from GitHub and build
it yourself.
The mogenerator command line tool eases this generation process by reading the
*.xcdatamodel file and generating both class files and intermediate class
files for each entity. The intermediate classes (called machine classes) are
continuously overwritten by subsequent regenerations, so you should never edit
the contents of these files. The actual model object classes (called human
classes) inherit from those intermediate classes with a default empty
implementation, allowing for all manual extensions.
For example, when you design a model with two entities Foo and Bar, mogenerator
can be invoked as follows:
$ mogenerator -m MyDocument.xcdatamodel -M Entities -H Model
The flag -m sets the input model file, while -M and -H specify the output
directories where the machine and human classes should be generated
respectively.
This does a few things:
In the Entities subdirectory, there will be generated header and
implementation files for NSManagedObject subclasses called _Foo and _Bar;
In the Model subdirectory, there will be generated classes called Foo and
Bar—respective subclasses of _Foo and _Bar. These are only created if
not available yet. Otherwise, they are left as is.
The trick of how mogenerator works is that you can run the script as often as
you want. After every change in your model, you’ll want to re-run the
generation again to update the machine classes. You could easily leave Xcode,
switch over to Terminal and issue the command above. But you’ll get quite tired
of that after a few times.
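As a first step towards automating this, the manual invocation could be wrapped in a small dry-run helper. This is a sketch under assumptions: print_mogenerator_commands is a hypothetical function that only prints the command lines (using the same -m, -M and -H flags as above) instead of running mogenerator.

```shell
# print_mogenerator_commands ROOT
# Prints one mogenerator command line per *.xcdatamodel found below ROOT.
print_mogenerator_commands() {
    find "$1" -name '*.xcdatamodel' | while IFS= read -r model; do
        printf 'mogenerator -m "%s" -M Entities -H Model\n' "$model"
    done
}

# Usage, from the project directory:
#   print_mogenerator_commands .
```

Printing first makes it easy to check which model files would be picked up before anything gets regenerated.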
Therefore, I’ve written a custom user script that can be added to Xcode (see
figure), which does the following:
You can configure the output directories in the first lines of the script.
There is no per-project configuration, so choose them as you would like to
use them with all your projects;
Mind that these generated files are not automatically included in your Xcode
project. Drag them there once and ideally put the machine generated classes
into a group under “Other resource”, so you never have to see them again.
Whenever you add a new class to your model, new files will be generated, so
again you must drag the new files to reference those, of course!
The script can be run with any file in the project opened. It starts out with
that file and walks up the directory tree to search for your Xcode project.
If found, it executes all the rest from your project directory. (Suggestions
are welcome, I could not find a better implementation since a variable like
%%%{PBXProjectPath}%%% does not seem to exist.)
It invokes mogenerator to generate all model classes for the project. It is
smart enough to detect whether you are using Brian Webster’s
BWOrderedManagedObject
in your project. If so, your generated machine classes will inherit from
BWOrderedManagedObject instead of NSManagedObject.
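The upward search for the project directory could also be wrapped in a function like the following. It is a sketch, not the script from this post: the name `find_project_dir` is invented here, and unlike the inline loop in the script it takes the starting directory as an argument, so it can be tried outside Xcode.

```shell
# Hypothetical helper: print the nearest ancestor directory (including the
# starting one) that contains an *.xcodeproj entry; return 1 if none is found.
find_project_dir() {
    dir=$(cd "$1" 2>/dev/null && pwd) || return 1
    while :; do
        # The glob stays unexpanded when nothing matches, hence the -e test.
        for candidate in "$dir"/*.xcodeproj; do
            if [ -e "$candidate" ]; then
                echo "$dir"
                return 0
            fi
        done
        # Stop once the filesystem root has been checked.
        [ "$dir" = "/" ] && return 1
        dir=$(dirname "$dir")
    done
}
```

For example, `find_project_dir "$HOME/Projects/MyApp/Classes"` would print `$HOME/Projects/MyApp` if that is where `MyApp.xcodeproj` lives (the paths are placeholders).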
To add this script to Xcode, open the menu Scripts (the icon) > Edit User
Scripts… Click the “+”-button on the bottom-left and select “New shell script”.
Set the values for Input, Directory, Output and Errors as in the screenshot
above, then copy-paste the script below into the code window. Add a nice
keyboard shortcut to this action to top it off :-) I’ve chosen ⌥⌘G for this.
Please feel free to leave any comments if this helped you.
#!/bin/sh
#
# Automatic (re)generation of model classes for all *.xcdatamodel files.
# Written by Vincent Driessen
#
# You are free to use this script in any way.
# The original blog post is http://nvie.com/archives/263
#
# Define output directories
MACHINE_DIR="Entities"
MODEL_DIR="Model"
# Look for the Xcode project directory for this file
cd `dirname "%%%{PBXFilePath}%%%"`
while [ `ls -d *.xcodeproj 2>/dev/null | wc -l` -eq 0 ]; do
    cd ..
    if [ "`pwd`" = "/" ]; then
        echo "No Xcode project found."
        exit 1
    fi
done
echo "Project directory is `pwd`"
#
# Check to see whether the base class is just a default (NSManagedObject) or
# maybe Brian Webster's excellent BWOrderedManagedObject.
# http://fatcatsoftware.com/blog/2008/per-object-ordered-relationships-using-core-data
#
# NOTE:
# The check really is quite arbitrary: if there exists a file called
# BWOrderedManagedObject.h somewhere below the project root directory, we
# assume that we want to use this as the base class for all generated classes.
#
EXTRA_FLAGS=
if [ `find . -name BWOrderedManagedObject.h | wc -l` -gt 0 ]; then
    # Plain assignment: the += operator is not available in every /bin/sh
    EXTRA_FLAGS="--base-class BWOrderedManagedObject"
fi
# Generate the model classes using mogenerator
for model in `find . -name '*.xcdatamodel'`; do
    # The output directories have to exist, so create them
    mkdir -p "${MACHINE_DIR}" "${MODEL_DIR}"
    mogenerator ${EXTRA_FLAGS} -m "${model}" -M "${MACHINE_DIR}" -H "${MODEL_DIR}"
done
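One possible variation, in case a model lives in a directory whose name contains spaces: the backtick `for` loop splits the output of `find` on whitespace, so such a path would reach mogenerator in pieces. The sketch below feeds the paths line by line instead. The `generate_models` function name is invented for this example, and only the mogenerator flags already used in the script appear in it.

```shell
# Sketch: the generation loop wrapped in a function and fed line by line, so a
# model path containing spaces is passed to mogenerator as a single argument.
# generate_models is a name invented for this example.
MACHINE_DIR="Entities"
MODEL_DIR="Model"
EXTRA_FLAGS=""   # e.g. "--base-class BWOrderedManagedObject"

generate_models() {
    # The output directories have to exist, so create them
    mkdir -p "${MACHINE_DIR}" "${MODEL_DIR}"
    find . -name '*.xcdatamodel' | while IFS= read -r model; do
        # EXTRA_FLAGS is left unquoted on purpose so it splits into words.
        # stdin is redirected so the command cannot swallow the path list.
        mogenerator ${EXTRA_FLAGS} -m "${model}" -M "${MACHINE_DIR}" -H "${MODEL_DIR}" </dev/null
    done
}
```

Run `generate_models` from the project directory, after the same directory lookup and base class check as in the script above.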
The Core Data framework rules, and its API is really really powerful. But
really, why does the Core Data API require us to write so much boilerplate
code? Simple things need to be simple.
Why is the deletion of a managed object from the NSManagedObjectContext so
easy:
Matt Gallagher has written an excellent
article about
how to further enhance NSManagedObject for adding simple, one-line fetch
support. Be sure to check it out.
Cocoa offers a nice visual editor for editing NSPredicate objects,
called NSPredicateEditor. The NSPredicateEditor can be set up using code or in
Interface Builder, which is preferable for simple use. The setup is fairly easy
once you know how to do it. In this tutorial, we’ll be building a simple
predicate editor example which shows the basic functionality of the predicate
editor.
Begin by creating a new Xcode project (⌘⇧N). Name your project wisely
and create a new class in the Classes group, called AppDelegate.
Switch to the header file and declare two IBOutlets for the main window and the
sheet on which we’re going to display the editor in a few minutes. Also, add
two IBActions called -openEditor: and -closeEditor:. Finally, add an ivar
that holds the NSPredicate we’re going to be editing.
Next, we’re going to fire up Interface Builder to build the UI. Double click on
the MainMenu.xib file under the Resources group.
Drag an NSObject object from the Library into the XIB and call it App Delegate.
Hit ⌘6 and make it a subclass of the AppDelegate class we just created.
Then, hook it up to the delegate property of the File’s Owner.
Drag a new NSWindow to the XIB-file and call it Sheet. Make sure the checkbox
“Visible At Launch” is deselected or the sheet will not display properly at
runtime. Open the main window and add an NSButton and an NSTextView to it. To the
sheet window, drag an NSPredicateEditor and an NSButton. They should look
somewhat like this now:
Now, we can hook up the outlets and actions as usual. Hook up the Edit
Predicate button on the main window to -openEditor: and the OK button on the
sheet window to -closeEditor:. Then hook up the mainWindow and sheet outlets of
the AppDelegate class to the respective NSWindow objects.
Once we have all of the connections between Xcode and Interface Builder set up,
we can continue to configure the predicate editor itself, which is actually
what this tutorial is all about. An NSPredicateEditor control uses a list of
NSPredicateEditorRowTemplate objects that can handle individual (simple)
NSPredicate objects. Combining these row templates enables the
NSPredicateEditor to edit compound predicates. There is no limitation to the
depth of nested compound predicates, although nesting too deep would not be
advisable from a usability perspective.
In the edit window, click a few times until the “name contains” row template is
selected. In this row template, you define which key paths are supported.
Supported here means two things:
matching—given an existing predicate with this key path in it on the
left-hand side, this row template can be used to alter the predicate;
generation—when using the editor to create new predicates, adding a new
rule for this key path will generate a predicate for this key path.
A small gotcha, at least one that initially wrong-footed me, is that there is
quite a difference between the rows that you see at design time in Interface
Builder and the rows that are available at run time. At design time, you
define the NSPredicateEditorRowTemplate objects, while at run time you see
instances of them. Hence, the number of rows at design time is the number of
different row templates available. At run time, however, the number of rows is
the number of predicates within the compound predicate (each of which has an
associated row template instance that handles it). Subtle difference.
In short, in Interface Builder, create a row template for each type of match
that you want to allow. Typically, this means for each data type that you want
to support. In our example, we have the following setup:
Row template #1 is for all string matches. Here, we have defined it for the
key paths “firstname”, “lastname”, “address.street” and “address.city”. By
definition, they share the same set of allowed operators. If we want a
different set of operators for a specific key path, we need to define a
separate row template for it.
Row template #2 is for date matches, i.e. our “birthdate” key path.
Row template #3 is for all integer matches, i.e. our “address.number” key
path.
The result looks like this:
Using bindings to connect the predicate to the UI
Next up, we simply connect both the text view from the main window and the
predicate editor from the sheet window to the predicate key path using Cocoa
bindings. In order to do so, select the NSPredicateEditor (first click the
control to select the scroll view, then click again to select the inner
NSPredicateEditor) and hit ⌘4. Then, unfold the “Value” binding and hook it up to
the App Delegate’s “predicate” key path.
Do the same for the text view in the main window, but this time hook it up to
the “predicate.description” key path (since only strings can be displayed in
a text view). When you do this, make sure that the text view is read-only,
since the description property of objects should never be set.
In the -init method, we initialize the AppDelegate by setting and retaining
a reference to a rather complex default predicate. When the XIB is loaded at
run time, the text view shows exactly this predicate and it can be edited by
invoking the edit sheet.
The actual implementations of the -openEditor: and -closeEditor: methods
aren’t too exciting.