Tuomas Kareinen's blog

Exclusive scheduled jobs using database locks

Using locks implemented as rows in an SQL database enables running scheduled background jobs in an application, providing a best-effort guarantee that only one application instance at a time runs a particular job. This is one possible solution, and it’s an appealing one because an SQL database usually serves as the primary database for the application – you don’t need any additional infrastructure services. The implementation achieves fault tolerance and is easy to understand and operate, but sacrifices load balancing as a trade-off. I describe how to do it and provide SQL operations as examples.

Before I go into the solution, it’s worth emphasizing that the characteristics and requirements of running background jobs should drive any design. There are surprisingly many aspects to think about. Among the considerations are how jobs get created (are they triggered by events or a schedule), whether the system should attempt to run a particular job only once, need for fault tolerance, load balancing, and scalability; followed by computing resource requirements, and whether it is acceptable to share the computing resources with the application’s primary workload. See the Background jobs section of Microsoft Azure Well-Architected Framework for a good overview.

Exclusive job implementation

The concept of an exclusive job models the permission to run a particular job on only one application instance. I represent the model as a record having the following fields:

  • job_id: Identifies a particular job. For example, DeleteOldTransactionIds or CheckRemoteServiceHealth.

  • job_instance_id: A static identifier of the application instance. An application instance can simply use a static UUID generated in memory at instance startup for all job_ids.

  • lock_expires_at: A timestamp in the future indicating when an acquired lock can be treated as expired. The value is calculated by the application. I’ll describe its meaning shortly.

A scheduler triggers all the application instances to compete for the permission to run a certain job. The application relies on the database to provide atomic operations so that only one instance may insert or update a row for a job_id value.

The exclusive_job table stores the rows. Here is the schema for PostgreSQL:

create table exclusive_job (
  job_id text primary key not null check (char_length(job_id) between 1 and 255),
  job_instance_id uuid not null,
  lock_expires_at timestamptz not null
);

The application defines the tryAcquireLock operation, which performs the following query to either acquire (insert row) or conditionally update the lock (update existing row). I use named parameters (such as $job_id) instead of positional parameters for easier reading (this is invalid SQL syntax as PostgreSQL allows positional parameters only):

insert into exclusive_job as ej (
  job_id,
  job_instance_id,
  lock_expires_at
)
values ($job_id, $job_instance_id, $lock_expires_at)
on conflict (job_id) do update
set
  job_instance_id = excluded.job_instance_id,
  lock_expires_at = excluded.lock_expires_at
where
  ej.lock_expires_at < now() or
  (ej.job_id = excluded.job_id and ej.job_instance_id = excluded.job_instance_id)

As said, a scheduler triggers the application code to run a job identified by job_id for each application instance at approximately the same time. The instances compete to acquire the lock for the job using the tryAcquireLock operation. The first instance to execute the query will win. The number of rows affected by the query signals either winning the lock (count > 0) or losing the lock (count = 0). The instance that has acquired the lock gets the permission to run the job; other instances back off.
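To make the competition concrete, here is a minimal runnable sketch of tryAcquireLock. It rests on several assumptions that are not part of the design above: SQLite (via Python's sqlite3 module) stands in for PostgreSQL, timestamps are stored as epoch seconds, and the current time is passed in as a parameter instead of calling now() in the database. The function name try_acquire_lock and its parameters are illustrative only.

```python
import sqlite3
import uuid
from datetime import datetime, timedelta, timezone

# Assumption: SQLite stands in for PostgreSQL; timestamps are epoch seconds.
SCHEMA = """
create table exclusive_job (
  job_id text primary key not null,
  job_instance_id text not null,
  lock_expires_at real not null
)
"""

# Same shape as the query above, but "now" is a parameter (an assumption).
TRY_ACQUIRE_LOCK = """
insert into exclusive_job (job_id, job_instance_id, lock_expires_at)
values (:job_id, :job_instance_id, :lock_expires_at)
on conflict (job_id) do update
set
  job_instance_id = excluded.job_instance_id,
  lock_expires_at = excluded.lock_expires_at
where
  exclusive_job.lock_expires_at < :now or
  exclusive_job.job_instance_id = excluded.job_instance_id
"""


def try_acquire_lock(conn, job_id, instance_id, now, ttl):
    """Return True if this instance won the lock or already holds it."""
    with conn:  # run the single statement in its own transaction
        cursor = conn.execute(
            TRY_ACQUIRE_LOCK,
            {
                "job_id": job_id,
                "job_instance_id": instance_id,
                "lock_expires_at": (now + ttl).timestamp(),
                "now": now.timestamp(),
            },
        )
        # Affected row count: 1 for insert or applied update, 0 otherwise.
        return cursor.rowcount > 0


conn = sqlite3.connect(":memory:")
conn.execute(SCHEMA)
instance_a, instance_b = str(uuid.uuid4()), str(uuid.uuid4())
start = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
ttl = timedelta(minutes=5)

# Two instances are triggered at the same moment; only the first should win.
print(try_acquire_lock(conn, "CheckRemoteServiceHealth", instance_a, start, ttl))
print(try_acquire_lock(conn, "CheckRemoteServiceHealth", instance_b, start, ttl))
```

Passing the clock in as a parameter is only a convenience for experimenting with expiry; the query above compares against the database's own now().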

The lock_expires_at column gives the ability to run the same job_id again in the next scheduled trigger. By relying on scheduled triggers and expiring locks, the design attains fault tolerance. The clocks of the application instances must be synchronized to make this work, but a small clock skew is tolerable.

A lock expiry value should be large enough that all application instances have a window to compete for the lock simultaneously, and that the winning instance has sufficient time to complete the job before expiration. On the other hand, the value should be small enough to allow the instances to compete again in the next scheduled trigger. Getting lock expiry right is the delicate part of this design. For example, if a job takes at most a minute to complete and you schedule running the job once per hour, then a value of 5 minutes might be suitable.
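The two bounds on the expiry value can be written down as a rough sanity check. This is only a sketch: the function name and the idea of validating the value in code are my assumptions, and it ignores clock skew and the spread of trigger times between instances.

```python
from datetime import timedelta


def lock_ttl_is_plausible(ttl, max_job_duration, trigger_interval):
    """Hypothetical check: the lock should outlive the job but expire before the next trigger."""
    return max_job_duration < ttl < trigger_interval


# The example from the text: a one-minute job, triggered hourly, with a 5-minute expiry.
print(
    lock_ttl_is_plausible(
        ttl=timedelta(minutes=5),
        max_job_duration=timedelta(minutes=1),
        trigger_interval=timedelta(hours=1),
    )
)
```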

The tryAcquireLock operation is reentrant, meaning the same application instance already holding the lock may acquire the lock again.

Use the updateLock operation to guard against overlapping executions of the same long-running job. Overlapping might happen when the initial expiry time is too small compared to the amount of work anticipated: usually the job might take a few minutes to complete, but sometimes there’s so much work that the job is still running when it’s time to trigger the next scheduled run. In that case, the application should split the job into parts and update the expiry time just before running each part. Here’s the SQL query for updateLock:

update exclusive_job
set lock_expires_at = $lock_expires_at
where job_id = $job_id and job_instance_id = $job_instance_id

The query allows only the application instance that has acquired the lock to update the expiry time. A positive affected-row count indicates that the update was applied. Note that the query allows the instance holding the lock to update the lock even if the lock is already expired.

If an application instance fails to apply the update, the lock has expired and another instance has acquired it. This might be a symptom of using too small a value for lock_expires_at. I recommend aborting the job in a fail-fast manner if that happens.
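A sketch of splitting a job into parts and extending the lock before each part might look like the following. The assumptions: SQLite stands in for PostgreSQL, timestamps are epoch seconds, instance identifiers are plain strings rather than UUIDs, and the names run_in_parts and LockLostError are hypothetical.

```python
import sqlite3
from datetime import datetime, timedelta, timezone

UPDATE_LOCK = """
update exclusive_job
set lock_expires_at = :lock_expires_at
where job_id = :job_id and job_instance_id = :job_instance_id
"""


class LockLostError(RuntimeError):
    """Hypothetical error: another instance now holds the lock."""


def update_lock(conn, job_id, instance_id, expires_at):
    """Return True if this instance still owns the lock and extended it."""
    with conn:  # dedicated transaction for the single statement
        cursor = conn.execute(
            UPDATE_LOCK,
            {
                "lock_expires_at": expires_at.timestamp(),
                "job_id": job_id,
                "job_instance_id": instance_id,
            },
        )
        return cursor.rowcount > 0


def run_in_parts(conn, job_id, instance_id, parts, ttl):
    """Extend the lock just before each part; fail fast if that is refused."""
    for part in parts:
        now = datetime.now(timezone.utc)
        if not update_lock(conn, job_id, instance_id, now + ttl):
            raise LockLostError(f"lost the lock for {job_id}")
        part()


# Demo with an in-memory SQLite table standing in for PostgreSQL.
conn = sqlite3.connect(":memory:")
conn.execute(
    "create table exclusive_job ("
    "job_id text primary key, job_instance_id text not null, "
    "lock_expires_at real not null)"
)
conn.execute(
    "insert into exclusive_job values ('DeleteOldTransactionIds', 'instance-a', 0)"
)
conn.commit()

done = []
parts = [lambda: done.append("part 1"), lambda: done.append("part 2")]
run_in_parts(conn, "DeleteOldTransactionIds", "instance-a", parts, timedelta(minutes=5))
print(done)
```

Raising an exception on a refused update is one way to get the fail-fast behavior; the remaining parts are simply never started.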

One could also define the releaseLock operation for the application instance holding the lock to release the lock explicitly:

delete from exclusive_job
where job_id = $job_id and job_instance_id = $job_instance_id

But using it would be safe only if you can guarantee that there cannot be other application instances still competing to acquire the lock for the same job_id in the same scheduled moment. An example scenario to avoid would be the following: instance A acquires a job’s lock, completes the job quickly, and releases the lock; this is followed by instance B acquiring the lock for the same job. Now instance B would run the same job again. Instead, it’s safer to just let the lock expire.

Each of the tryAcquireLock, updateLock, and releaseLock operations must be wrapped in a dedicated database transaction to obtain exclusive access to the guarded job. Don’t include other database queries inside those transactions.

Intended usage scenario

The implementation is designed for relatively lightweight jobs triggered by a scheduler. Each job should move affected state toward the desired state (eventual consistency and idempotent operations). The database manages state, making it easy to replicate the application to have many instances. You might run the application as a Deployment in Kubernetes, for example.

Design trade-offs

For design trade-offs, you lose the following:

  1. Cannot guarantee that exactly one application instance runs a job per trigger. After acquiring a lock, but just before executing the job, an instance may get paused long enough for the lock to expire and another instance to acquire the lock, resulting in running the job twice. A stop-the-world pause by a garbage collector is one example. See How to Do Distributed Locking by Martin Kleppmann for an example and more interesting details. This is acceptable because the design described here uses locking as an efficiency optimization, not for correctness.

  2. No load balancing. There’s no mechanism to distribute load intelligently among application instances. But you can distribute jobs randomly by delaying the call to tryAcquireLock with a small random duration. This also protects against the application instance having the greatest positive clock skew from always acquiring the lock.

  3. The application runs background jobs alongside its primary workload. This resource sharing might harm the availability of the application.

  4. You cannot scale resources for different background job types. The reason is the same as for not having load balancing.

  5. You need to synchronize the clocks for all application instances. This shouldn’t be a big problem in practice by using NTP.

  6. It won’t work with event triggers. Fault tolerance relies on periodically repeated triggers.
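The random delay mentioned in the load balancing item could be sketched as follows. The function names, the two-second default, and the injected callables are assumptions for illustration; try_acquire_lock and run_job stand for whatever the application already has.

```python
import random
import time


def jittered_delay(max_jitter_seconds, rng=random):
    """Pick a uniformly random delay between 0 and max_jitter_seconds."""
    return rng.uniform(0.0, max_jitter_seconds)


def on_scheduled_trigger(try_acquire_lock, run_job, max_jitter_seconds=2.0, sleep=time.sleep):
    """Wait a random moment, then compete for the lock and run the job if won."""
    sleep(jittered_delay(max_jitter_seconds))
    if try_acquire_lock():
        run_job()
        return True
    return False
```

The maximum delay has to stay well below the lock expiry time, otherwise a late instance could find the lock already expired and run the job a second time.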

You gain the following:

  1. Best-effort job exclusivity as an efficiency optimization. The tryAcquireLock operation ensures that only one application instance at a time may acquire a lock for a job. Connection pool usage does not affect lock handling.[1] Using a large enough lock expiry time, the instance holding the lock should have enough time to complete the job in the common case. The nature of the job should allow running it many times, possibly concurrently, in the worst case.

  2. Fault tolerance: if an application instance terminates ungracefully while running a job, the lock will expire eventually and another instance will retry the job the next time the job gets triggered.

  3. Easy to understand and to operate: the implementation uses one SQL table for storing state. If something goes wrong in operations, you can either delete the contents of the table or just let the locks expire.

Conclusion

I believe the exclusive jobs described here are interesting for simple background jobs that need to be repeated. Examples include cleanups, storing the health-check results of remote services in the database, and propagating state from one application to another in an eventually consistent fashion.

You get quite a lot for one SQL table and some application code around it. The accidental complexity of this approach is low.

I’m sure what I’ve presented is one variation of many similar tried-and-true solutions. My motivation was to document this particular variation, as I haven’t found any articles describing something like it.

Finally, I’ve talked about the required atomic operations in the context of SQL databases, but there’s actually nothing specific to SQL here. For example, the design can be implemented on a document-oriented database, such as MongoDB. Further, the update functionality of tryAcquireLock can be removed if the database supports deleting entries automatically upon lock expiry (see TTL indexes for MongoDB). So you have even more options for applying the design!

  1. Specifically, there’s no need for the advisory locks of PostgreSQL. Using an advisory lock on the session level can cause problems with a connection pool. For example, if the application got connection A from the pool to acquire the advisory lock, it wouldn’t be able to update the lock if it got connection B from the pool. 

https://tkareine.org/articles/exclusive-scheduled-jobs-using-db-locks
Suppressing duplicate requests in web services

Suppressing duplicate request processing prevents some forms of data corruption, such as data duplication and lost data. Achieving suppression together with client retries effectively establishes exactly-once request processing semantics in the system.[1] In this article, I present an imaginary web service built in the microservice style, inspect it and its clients together as a system, and define the duplicate request processing problem and the general solution to it. Finally, I’ll show one possible way to implement the solution.

The microservices involved use synchronous requests to pull and push data so that any part of the overall state is managed centrally in one place (the conventional approach to microservices communicating with the REST/RPC/GraphQL protocols).

Web service as a system

The imaginary web service manages education-related entities: students, employees, facilities, and so on.[2] Typical management operations are creating, reading, updating, and deleting entities. Here, we focus on employees, their employment contracts, and related information. Collectively, we’ll call those “staff” entities and dedicate an application named “Staff service” for processing them. There’s also an identity provider (IdP) service that is used to authenticate both students and employees. Because the IdP is provided as an external service, we have another application, called “Users service”, that maps our user identifiers to the IdP’s user identifiers. Finally, an API gateway node serves as the reverse HTTP proxy for all inbound traffic to APIs.

Here’s a diagram of the web service from the viewpoint of the Staff API:

[Diagram: An imaginary web service comprising the client, three internal services, and one external service]

Because we need to allow employees to log in to the service, the Staff service needs to associate a user entity with an employee. This happens by calling the API of the Users service, which hides the complexity of the IdP’s User management API.

Looking at the diagram, we can identify the following components and group them:

  1. The servers behind the public APIs: the API gateway, the IdP service, microservices, and databases.

  2. The clients accessing the public APIs: browsers running webapps and integration clients that synchronize education-related entities between this system and another.

  3. Network components: the internet where the clients connect from, the private network of the web service, and virtual networks within the hosts that run microservices inside containers.

The operations of the client-server APIs the microservices expose both internally and externally can be grouped into:

  1. Queries, which are for reading data. A query request does not inflict any externally visible change to the state of the server. Examples are the HTTP GET and HEAD methods and the query operations of GraphQL.

  2. Mutations, which are for writing data and triggering effects. A mutation request causes an externally visible change to the state of the server. Examples are the HTTP POST/PUT/PATCH/DELETE methods and the mutation operations of GraphQL.

Considering some of the typical technologies we use when building web services like this, we’ll likely use TCP as the connection-oriented protocol for transferring data between hosts. When two hosts have agreed to establish a TCP session, the protocol protects against packet duplication with sequence numbers and acknowledgements, and protects data integrity with checksums on TCP segments. But a TCP session protects data transfer only between two hosts. For instance, creating a new employee involves adding a new user entity to the IdP service. When looking at the communication path of that API operation, there will be four separate TCP sessions (the numbers in red circles in the previous diagram):

  1. Between the browser and the API gateway: call the public Staff API to create an employee

  2. Between the API gateway and the Staff service: forward the call to the microservice

  3. Between the Staff service and the Users service: call the Users API to create a new user entity to associate with the employee

  4. Between the Users service and the IdP service: call the User management API to create the user entity in the IdP in order to allow the employee to log in, and provide user identifier mapping between systems

Another technology in general use is database transactions, especially for SQL databases, which usually come with the ACID properties. A connection from the application server to the database usually sits on top of TCP, and the database server guarantees that if the transaction commits successfully, the app’s modifications to the data leave the database in a consistent state. It’s another safeguard against data corruption, but again between two components only. The creation of a new employee in our web service involves two SQL transactions (the letters in gray circles in the diagram above):

  1. Staff service: add a row for the new employee

  2. Users service: add a row for the new user

It turns out that any technology protecting only parts of the whole communications path is not sufficient to protect the whole path. Let’s look at some possible problems.

Examples of problems caused by not protecting the whole communications path

Broken data integrity: Even though a TCP session uses checksums to ensure two hosts transfer data unchanged over the communications channel, it does not protect the application server against malfunctioning hardware memory when the server reads received data or writes data to be sent. Data corruption can occur.

Broken data confidentiality: A client that serves to integrate an external system and our imaginary web service sends the login password of the employee along with the data in the request to create the employee to the Staff API. TLS does protect the communication channel between any two hosts with encryption, but it does not prevent the application server from reading the password in clear text. Any process in the server can read the password, actually.

Broken duplicate request processing suppression: A client requesting to create a new employee using the Staff API encounters either a timeout or receives a timeout related error response. What happens if the client attempts to send the request again? From the client’s perspective, any of the following might have happened to the original request:

  1. The API gateway received the request, but the gateway crashed and never sent the response to the client. The gateway might or might not have forwarded the request to the Staff service before crashing.

  2. The API gateway received the request, but the Staff service is down, not accepting connections. The client’s timeout for receiving data is shorter than the time the API gateway spends on its connection attempts to the Staff service, and so the client gives up on this request attempt.

  3. The Staff service received the request and used an SQL transaction to encompass sending its own request to the Users API for creating the associated user entity. The Staff service received the success response from the Users service, updated the employee row, and committed the SQL transaction. But the Staff service crashed just before responding back to the API gateway. Eventually, the gateway times out the connection to the Staff service and sends a 504 Gateway Timeout error response to the client.

  4. Like previously, but just after opening the encompassing SQL transaction, the Staff service enters a stop-the-world garbage collection phase, which effectively pauses the whole service. This makes the API gateway respond with 504 Gateway Timeout to the client. After the garbage collection phase is over, the service continues processing as if nothing had happened.

  5. Like cases 2, 3, or 4, but it was the Users service that failed.

All five situations above are different forms of timeouts. In cases 3 and 4, the request was processed completely, but the client does not get to know about it. If the client retries the original request, there could be 0, 1, or 2 employees in the system. Here we presume, for the sake of general argument, that the employee data the client sends does not contain data that has uniqueness constraints (the username attribute might be such, for example). It’s clear that TCP’s data correctness mechanisms alone cannot guarantee that a request traversing many hops is processed only once.

In case 5, the system was left in an illegal state: there’s a user entity in the IdP and a corresponding identifier mapping in the Users service, but no associated employee entity in the Staff service. This demonstrates that database transactions alone cannot guarantee that the overall system is left in a correct state.

Both TCP and database transactions helped to ensure data correctness between two components, but they didn’t guarantee that the overall system was left in a correct state.

Even though I’m focusing on the duplicate request processing suppression problem in this article, the general solution to all of them is the same.

The end-to-end argument

The end-to-end argument is a design principle that guides where to locate the implementation of a function for the benefit of a distributed system. The function in question can be duplicate message suppression, data integrity, or data confidentiality, for example. Saltzer, Reed, and Clark articulated the argument in 1981, and it goes as follows:

The function in question can completely and correctly be implemented only with the knowledge and help of the application standing at the end points of the communication system. Therefore, providing that questioned function as a feature of the communication system itself is not possible. (Sometimes an incomplete version of the function provided by the communication system may be useful as a performance enhancement.)

Put another way, correct implementation of the function requires that the client and the server at the ends of the communication path work together in achieving the function.

Going back to the earlier example problems, a way to guarantee data integrity is to make the client compute a hash over the request’s payload data and include the hash in the request. The application servers, upon receiving the request, compute the hash and compare it to the expected one in the request. If the computed hash equals the expected hash, the server may process the request.
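A minimal sketch of that hash check, assuming SHA-256 as the hash function and a JSON-encoded payload (both are my choices, as are the function names):

```python
import hashlib
import hmac
import json


def payload_digest(payload: bytes) -> str:
    """Hash the raw payload bytes; SHA-256 is an assumption of this sketch."""
    return hashlib.sha256(payload).hexdigest()


def build_request(data: dict) -> dict:
    """Client side: serialize the data and attach the expected hash."""
    body = json.dumps(data, sort_keys=True).encode("utf-8")
    return {"body": body, "digest": payload_digest(body)}


def verify_request(request: dict) -> bool:
    """Server side: recompute the hash and compare it to the expected one."""
    return hmac.compare_digest(payload_digest(request["body"]), request["digest"])


request = build_request({"firstName": "Albert", "lastName": "Wesker"})
print(verify_request(request))
```

A plain hash like this detects accidental corruption along the path. Detecting deliberate tampering would need a keyed construction instead, which is outside the scope here.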

Data confidentiality can be achieved by using end-to-end encryption.

There’s no established way to suppress duplicate request processing. In the Designing Data-Intensive Applications book, Martin Kleppmann describes one approach. The system must be designed so that it holds up exactly-once semantics for processing requests, and an effective way to achieve this is to make operations idempotent.[3]

Considering our earlier grouping of the operations of client-server APIs into queries and mutations, we can ensure that queries are idempotent by making sure they never change state in a way that is visible to the client (for instance, request logging would be permitted). Usually this is trivial to achieve with read database queries if returning the data based on the current database state is enough. This does forgo the ability for the client to request data about the state in earlier moments, however; solving that would require storing versioned data snapshots in the database.

For mutations, Kleppmann proposes to include an operation identifier in the request originating from the client. Upon receiving the request, the server can query its database to see if an operation with this identifier has been processed already. The server processes the request only if there’s no existing row having the identifier. When processing is about to finish, the server adds a row containing the identifier indicating that the request has been completed. The operation identifier can either be generated or derived from the input data, whichever is more convenient for the business logic.

Applying Kleppmann’s approach to suppress duplicate request processing, in the context of the imaginary web service presented earlier, is the last part of this article.

Applying duplicate request suppression

Let’s establish API design principles to support idempotency. I’ve chosen to use GraphQL as the application-level protocol here, but the principles are the same regardless of using another protocol, such as REST or RPC.

  1. GraphQL query operations return the data based on the current state of the server. It’s expected that a query with certain input requested over time may return different data as output, reflecting changes in the current state of the service by mutations.

  2. All GraphQL mutation operations must include an operation identifier as input, in a parameter called transactionId. Two requests with the same transactionId value indicate that the requests are duplicates. The client must generate the identifier as a random UUIDv4 value.

  3. The server must apply the operation only once for a particular transactionId value, the first time the server receives a request with a transactionId it hasn’t processed yet.

  4. The response to a GraphQL mutation operation with a particular transactionId value must always produce the same logical output. If the server processed the mutation successfully, the response must signal success for all requests having the same transactionId value. Similarly, if the server completed processing with a failure, all the responses to the same transactionId must signal that failure. In particular, a success response may contain output reflecting the current state of the data, but that output might be different when the client requests the same mutation again (another mutation may have changed the data).

  5. The same transactionId value must be passed as-is to dependent services.

The principles apply to both public and internal APIs alike.

I’ll go through the principles one by one, except for the first, which should be self-explanatory.

The 2nd principle makes it possible to distinguish between two requests and to tell whether they are for the same purpose, even if the input payload is otherwise the same. This allows creating different user entities sharing their name, for instance.

As an example, here’s the GraphQL mutation to create a new employee in the Staff API:

mutation {
  createEmployee(
    input: {
      transactionId: "addb372c-046f-43e8-c91f-1df1a30caaa1"
      data: {
        firstName: "Albert"
        lastName: "Wesker"
        employment: {
          startsAt: "2021-08-12"
          personnelTypeCodes: [MANAGEMENT]
          # etc…
        }
        # etc…
      }
    }
  ) {
    id
  }
}

The 3rd principle implements idempotency in the server logic, but it isn’t enough for the client to implement retries for timed out requests. That is covered by the 4th principle: it allows the client to retry the request until it gets to see the response.
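On the client side, the 2nd and 4th principles together boil down to generating the identifier once and reusing it for every attempt. Here is a minimal sketch; the send callable, the attempt limit, and the use of TimeoutError to signal a timed out request are assumptions.

```python
import uuid


def send_mutation_with_retries(send, data, max_attempts=5):
    """Retry a mutation on timeouts, reusing one transactionId for all attempts."""
    transaction_id = str(uuid.uuid4())  # generated once per logical operation
    request = {"transactionId": transaction_id, "data": data}
    for attempt in range(1, max_attempts + 1):
        try:
            return send(request)
        except TimeoutError:
            if attempt == max_attempts:
                raise  # give up; the outcome of the operation stays unknown
```

A real client would likely wait between attempts. What matters for duplicate suppression is only that the identifier does not change between them.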

I think supporting client retries is one of the main selling points of idempotency. It also explains why uniqueness constraints on entity attributes are not enough to support duplicate request suppression. A constraint on an attribute, such as username, does prevent clients from creating duplicate user entities, but client retries are broken. The following sequence diagram shows why:

[Diagram: A sequence diagram showing how the client can receive an error when retrying a mutation operation]

In the diagram, the client requests creating a new employee with a certain username. The service enforces that the username must be unique. The request propagates via the API gateway to the application service, and the service processes the request with success. But then the API gateway crashes before it forwards the response to the client. Eventually, the retry timeout in the client triggers and the client sends the same request again. This time the client receives the response, but it’s a failure: an employee with the supplied username exists already. This is unexpected from the client’s perspective.

An implementation of the 3rd and 4th principles in the server is an SQL table for storing the outcomes of processed mutations. The database schema could be like the following for PostgreSQL:

create type transaction_operation as enum (
  'CREATE_EMPLOYEE',
  'UPDATE_EMPLOYEE',
  'DELETE_EMPLOYEE'
);

create table transaction (
  id uuid not null primary key,
  operation transaction_operation not null,
  target jsonb,
  error_msg text,
  created_at timestamptz not null default now(),
  check (target is not null or error_msg is not null)
);

Here are some example rows to support further discussion:

id                                   | operation       | target                                   | error_msg                   | created_at
-------------------------------------+-----------------+------------------------------------------+-----------------------------+------------------------------
addb372c-046f-43e8-c91f-1df1a30caaa0 | CREATE_EMPLOYEE | ["549d9715-0949-4a57-b9fb-1c56eb8e5029"] |                             | 2021-12-03 11:59:30.085 +0200
abdb372c-026f-43e8-c91f-2df1b30d8aa1 | UPDATE_EMPLOYEE | ["549d9715-0949-4a57-b9fb-1c56eb8e5029"] |                             | 2021-12-03 12:00:14.290 +0200
abdb372c-026f-43e8-c91f-2df1b30d8aa2 | UPDATE_EMPLOYEE | ["549d9715-0949-4a57-b9fb-1c56eb8e5029"] | invalid email               | 2021-12-03 12:03:43.110 +0200
11d36de7-0e36-475a-ae01-baa634010aa3 | DELETE_EMPLOYEE | ["549d9715-0949-4a57-b9fb-1c56eb8e5029"] |                             | 2021-12-03 12:18:11.507 +0200
addb372c-046f-43e8-c91f-1df1a30caaa4 | CREATE_EMPLOYEE |                                          | duplicate employee username | 2021-12-03 13:52:52.067 +0200

The id column stores the transactionId of a processed mutation request. The operation and target columns together enable storing different kinds of completed mutations in this single table; the operation signifies the type of the mutation operation performed, and the target column stores the primary key of the target entity as a JSON array. For the employee entity, the primary key is just a single UUID value, but for another entity type, such as school position, the primary key might be the pair of employment id and school id. We can store any primary key tuple, regardless of their component data types, by encoding them as JSON arrays.

A null value in the error_msg column indicates that the operation was a success. A string value present means that the operation in question failed. For example, the third operation (the transactionId ending with a2) completed with a failure to update a particular employee entity, because email validation failed. The error_msg column makes it possible to resend the same error back to the client if the client retries the operation with the same transactionId value. An error_msg value can exist without an entity target id as well: the last operation (the transactionId ending with a4) was a failure to create a new employee. We may publish the identifier of a new entity only after succeeding in entity creation.
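Encoding a primary key tuple for the target column can be sketched like this. The helper names are hypothetical, and the second identifier in the example is made up to stand for a school id.

```python
import json


def encode_target(*primary_key_components):
    """Encode a primary key tuple as a JSON array for the target column."""
    return json.dumps(list(primary_key_components))


def decode_target(target):
    """Decode the JSON array back into a primary key tuple."""
    return tuple(json.loads(target))


# A single-column key (employee) and a hypothetical two-column key (school position).
employee_target = encode_target("549d9715-0949-4a57-b9fb-1c56eb8e5029")
position_target = encode_target("549d9715-0949-4a57-b9fb-1c56eb8e5029", 42)
print(employee_target)
print(decode_target(position_target))
```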

In general terms, the Staff service utilizes the transaction table as follows:

  1. Upon receiving a new mutation request, the service opens a database transaction.

  2. If the transaction table contains a row with the same transactionId value as in the request, the service knows that the request has been processed already, and now it only rebuilds the response for the original processing outcome and sends it back to the client. The response is either a success or a failure, depending on whether the error_msg column is populated:

    • If error_msg is present, the service builds a failure response with a description why the mutation failed.

    • If error_msg is not present, the service builds a success response. The response might include data about the entity after the mutation is completed. If so, the service includes data about the current state of the entity. Because a later mutation might have changed the entity before the service reconstructs the response for an older mutation, we settle for showing the current data available (which might be nothing if the entity is already deleted). This is what I meant earlier by the same logical output in the 4th API design principle.

  3. If the transaction table doesn’t contain a row with this transactionId value, the service knows that now is the first and only time to process the request. The service must execute any calls to remote services (it doesn’t matter who owns them) within the context of the open database transaction and expect errors. But rolling back the whole database transaction upon a remote call error is not the right way to do it either: the service must still be able to append a new row to the transaction table at the end of the database transaction.

    This is where the ROLLBACK TO SAVEPOINT SQL command is very useful. Mark a savepoint within the transaction just before the point of doing anything that you expect to raise an error. If an error does happen, handle it gracefully, roll back to the savepoint, and remember the error for the next step.

  4. Now the service has completed processing the mutation either with success or failure. The service appends a new row to the transaction table accordingly.

  5. The service commits the database transaction and responds to the client.
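The five steps above can be sketched as follows. This is a minimal in-memory illustration, not the Staff service’s actual code: processMutation, transactionTable, employees, and callRemoteService are hypothetical names, and a copied Map stands in for a real SAVEPOINT inside an SQL transaction.

```javascript
// In-memory stand-ins for the database (hypothetical); a real service would
// use an SQL transaction with SAVEPOINT and ROLLBACK TO SAVEPOINT instead.
const transactionTable = new Map(); // transactionId -> { errorMsg }
const employees = new Map(); // employeeId -> employee data

function processMutation(request, callRemoteService) {
  // Step 2: the request was processed already, so only rebuild the response.
  const seen = transactionTable.get(request.transactionId);
  if (seen) {
    return seen.errorMsg === null
      ? { ok: true, employee: employees.get(request.employeeId) ?? null }
      : { ok: false, error: seen.errorMsg };
  }

  // Step 3: first and only time to process the request.
  let errorMsg = null;
  const savepoint = new Map(employees); // plays the role of SAVEPOINT
  try {
    employees.set(request.employeeId, { ...request.data });
    callRemoteService(request);
  } catch (err) {
    // Plays the role of ROLLBACK TO SAVEPOINT: undo the entity changes only.
    employees.clear();
    for (const [id, employee] of savepoint) employees.set(id, employee);
    errorMsg = err.message;
  }

  // Step 4: record the outcome, whether it was a success or a failure.
  transactionTable.set(request.transactionId, { errorMsg });

  // Step 5: "commit" and respond to the client.
  return errorMsg === null
    ? { ok: true, employee: employees.get(request.employeeId) }
    : { ok: false, error: errorMsg };
}
```

Calling processMutation twice with the same transactionId invokes callRemoteService only once; the second call just rebuilds the response from the recorded outcome.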

The 5th API design principle concerns the ability to track the propagation of change across services. If the Users service, coming after the Staff service in the communication path of processing clients’ mutation requests, has completed a request with a certain transactionId, but the Staff service hasn’t, we know that the Staff service is malfunctioning.
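As a rough sketch of such tracking, assuming that both services can list the transactionId values they have completed, a check could look like this. The function name and the sample identifiers are hypothetical.

```javascript
// Returns the transactionIds that the downstream service has completed but
// the upstream service has not recorded.
function findUnpropagated(upstreamIds, downstreamIds) {
  const upstream = new Set(upstreamIds);
  return downstreamIds.filter((id) => !upstream.has(id));
}

const staffCompleted = ['t1', 't2'];
const usersCompleted = ['t1', 't2', 't3'];
const suspicious = findUnpropagated(staffCompleted, usersCompleted);
```

Here suspicious contains only 't3'. Requests that are still in flight would show up in the result too, so a real check would probably need a grace period before raising an alarm.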

Continuing the earlier example of creating a new employee in the Staff API (the createEmployee mutation), the Staff service might send a GraphQL mutation like this to the Users service in order to create a user entity to associate with the employee:

mutation {
  createUser(
    input: {
      transactionId: "addb372c-046f-43e8-c91f-1df1a30caaa1"
      data: {
        firstName: "Albert"
        lastName: "Wesker"
        # etc…
      }
    }
  ) {
    id
  }
}

This would be a call to a remote service in the 3rd step of the usage description of the transaction table we just went through. The service making remote calls should utilize retries for timed out connection attempts.

Now I can justify my choice of naming for the transactionId parameter. I think duplicate request processing suppression and database transactions share some of their goals. In particular, both aim to protect against data corruption by guaranteeing that processing takes effect at most once. But duplicate request suppression is not a form of distributed transactions either. For example, it’s possible that the Staff service might crash while processing the createEmployee mutation, just after the Users service has completed processing the createUser mutation received from the Staff service. In that situation, the Users service will have a row in its transaction table indicating completed request processing, but the same table in the Staff service won’t contain a corresponding row for the createEmployee mutation. The system will be left in an inconsistent state unless the client retries the request until receiving a response.4

Note that because the transactionId parameter is user input, its value must be treated as unsafe and potentially malicious. Clients might generate values that are not truly random, even though the values might conform to the UUID format. This is why services must enforce authorization for clients accessing their data.

Communicating with external services

End-to-end wise, the IdP service is the last service in the communication path of creating a new employee. It’s part of the system, but, being an external service, we cannot enforce our API design principles on it. Is there anything we can do to prevent duplicate request processing?

Uniqueness constraints on entity attributes enforced by the API of the external service do help, even if they don’t behave nicely with request retries. For example, the IdP service in my imaginary web service might enforce unique usernames for user entities. That effectively acts as a duplication suppressor for request retries when attempting to create a new entity. If you route all requests to the external service via your own service acting as a facade, you can anticipate username constraint errors on retries and check if the user was created successfully on an earlier attempt after all. In addition, you should have a mechanism to suppress duplicate request processing in the client-facing side of the facade service, especially if the service stores state about some of the data in the external service (entity identifier mapping, like in the Users service, for instance).
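A minimal sketch of the facade idea, assuming the external service reports a recognizable error for a taken username and offers a lookup by username. The fake IdP, the USERNAME_TAKEN error code, and the function names are all hypothetical.

```javascript
// A fake external IdP (hypothetical) that enforces unique usernames.
function createFakeIdp() {
  const users = new Map();
  return {
    createUser(username) {
      if (users.has(username)) {
        const err = new Error('username already exists');
        err.code = 'USERNAME_TAKEN';
        throw err;
      }
      const user = { id: `idp-${users.size + 1}`, username };
      users.set(username, user);
      return user;
    },
    findByUsername(username) {
      return users.get(username) ?? null;
    },
  };
}

// Facade operation: treat a uniqueness error as a sign that an earlier
// attempt may have succeeded, and look the user up instead of failing.
function ensureIdpUser(idp, username) {
  try {
    return idp.createUser(username);
  } catch (err) {
    if (err.code !== 'USERNAME_TAKEN') throw err;
    const existing = idp.findByUsername(username);
    if (existing === null) throw err;
    return existing;
  }
}
```

Retrying ensureIdpUser with the same username returns the user created by the first attempt. A real facade would also need to verify that the existing user belongs to the request being retried, because a taken username alone doesn’t prove that.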

Summary

The longer your web service operates and the more requests it handles, the more important suppressing duplicate request processing becomes. Faults can and eventually will happen in the components of your system. Some of those faults will cause your services to receive duplicated requests. Idempotent request processing means that requesting the same operation with the same input many times over applies the effect in the service only once. The trick is in the identification of the input data, and I’ve shown one way to implement it with the transactionId parameter.

There are many ways to go about this. In considering any approach, I’d inspect it from the viewpoint of the client: how can you ensure that it’s safe for the client at the start of the communications path to retry requests, and that the response, when it finally arrives, has the same content as the response to the first request that was actually processed?

  1. In exactly-once semantics, a system processes a message so that the final effect is the same as if no faults occurred, even if the operation was retried due to some fault. For web services, that theoretically requires the client to support an infinite number of retries, because the service might be unreachable when the client sends its request. In turn, the server must guarantee at-most-once semantics in processing received requests: the server must detect duplicate requests and process only the first of them either completely or not at all.

  2. An entity is an object that is defined primarily by its identifying attribute (such as UUID). Two different entities might have the same descriptive attributes (such as name). 

  3. An operation is idempotent if, given the same input, you apply it many times, and the effect is the same as if you applied it only once. 

  4. Request processing can be made more reliable between two services with a message broker: the source service publishes requests as messages to the broker, while the destination service consumes messages and acknowledges them after it has finished processing them. This is possible with Apache Kafka, for example.

https://tkareine.org/articles/suppressing-duplicate-requests-in-web-services
Standalone Git branch from subdirectory

Imagine you have a subdirectory of generated files and you want to store them into Git as a standalone (orphan) branch. For example, you have generated html files and associated CSS and JavaScript assets, with the intention of publishing them as a GitHub Pages site from the gh-pages Git branch of the project. Git’s plumbing commands allow automating storing the generated files into the gh-pages branch, recreating the branch each time you publish. Here’s a Bash oneliner to do that:

git add --force build \
  && git update-ref refs/heads/gh-pages "$(git commit-tree -m 'Generated build' "$(git write-tree --prefix=build/)")" \
  && git reset

For that Bash command list, I presume that

  • the generated files are already in the build subdirectory,
  • you include build in the .gitignore file of the project, and
  • the Git index (staging area) is clear currently.

Let’s go over the parts of the command list.

1. Stage the generated files into Git index
git add --force build

I’ll use the --force switch to allow Git to add ignored files.

2. Create a Git tree object from the current index
git write-tree --prefix=build/

A Git tree object groups Git objects and stores the paths of the objects. The tree object will be used to create a commit object in the next step.

The --prefix=build/ option makes Git treat the build directory as the root directory for the files within the directory. For example, a file with the build/dir/index.html path gets recorded with the dir/index.html path inside the tree object.

The command prints the name of the tree object to stdout (I’ll use the $tree shell variable for that in the next step).

3. Create a Git commit object from the tree object
git commit-tree -m 'Generated build' $tree

The command prints the commit object id to stdout (let’s put it into the $commit variable).

4. Set a branch to refer to the commit object
git update-ref refs/heads/gh-pages $commit

This overwrites the gh-pages branch, if it exists already.

5. Reset index to the current HEAD
git reset

This clears the index, unstaging the generated files.

Now the target branch, gh-pages, contains a single orphaned commit, using the build subdirectory as the root directory of the files.

If you want to, you can store the previous commit of the gh-pages branch as the parent of the next commit, but then there are edge cases to consider: you’ll need to detect if the target branch exists already and whether the new contents of the build directory differ between the next and the previous commit (it doesn’t make sense to create a new commit with an empty diff compared to the parent commit). Covering them would require elaborate scripting compared to the Bash oneliner I went through.
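One way to handle those edge cases is sketched below in JavaScript for Node.js rather than Bash. It assumes the generated files are already staged with git add --force build and that you run git reset afterwards, as in the one-liner. The commitBuild function and the injected git runner are hypothetical; the Git subcommands and options are standard ones.

```javascript
// `git` is an injected function (hypothetical) that runs a Git command with
// the given arguments and returns its trimmed stdout, or null if Git exits
// with a non-zero status.
function commitBuild(git, branch, message) {
  const ref = `refs/heads/${branch}`;
  const tree = git(['write-tree', '--prefix=build/']);

  // Detect whether the target branch exists already.
  const parent = git(['rev-parse', '--verify', '--quiet', ref]);

  if (parent !== null) {
    // Skip the commit when the build is identical to the previous one.
    const parentTree = git(['rev-parse', `${parent}^{tree}`]);
    if (parentTree === tree) return null;
  }

  const args = ['commit-tree', '-m', message];
  if (parent !== null) args.push('-p', parent);
  args.push(tree);

  const commit = git(args);
  git(['update-ref', ref, commit]);
  return commit;
}
```

The function returns the new commit id, or null when there is nothing new to commit. A runner for it could wrap execFileSync from Node’s child_process module, mapping a failed command to null.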

As a real example, the Hacker’s Tiny Slide Deck project uses this trick in storing the generated slides (an html file) and the JavaScript bundle of the project into the gh-pages branch of the project, from where GitHub Pages publishes them. The relevant Git commands are in package.json.

Here’s a screenshot of GitUp app’s map view of the Git repository of Hacker’s Tiny Slide Deck, showing what the standalone gh-pages branch looks like:

A map view from the GitUp app showing the master and gh-pages branches

The chapter titled Git Objects from the Pro Git book is a great resource for learning more about Git internals.

https://tkareine.org/articles/standalone-git-branch-from-subdirectory
Lightweight Node.js version switching

Recently, I’ve been paying attention to the time it takes my shell’s init script to complete. Bash is notoriously slow, but since it’s popular in scripting use, I keep using it. This leaves me to optimize my ~/.bashrc.

I program with Node.js frequently, so a Node.js version manager is an essential tool. Upon investigating the execution time of my .bashrc, I found that loading nvm takes a lot of time:

$ ls ~/.nvm/versions/node
v10.11.0	v8.12.0		v9.11.2

$ export NVM_DIR="$HOME/.nvm"

$ time source "$HOME/brew/opt/nvm/nvm.sh"

real	0m0.397s
user	0m0.270s
sys	0m0.134s

$ nvm --version
0.33.11

400 ms for sourcing nvm.sh is far too big a share of the time budget I’d like to allocate for starting Bash in interactive mode. It’s a pity, because nvm is quite a nice tool.

An alternative for nvm is nodenv:

$ ls ~/.nodenv/versions
10.11.0	8.12.0	9.11.2

$ time eval "$(~/brew/bin/nodenv init -)"

real	0m0.070s
user	0m0.034s
sys	0m0.034s

$ nodenv --version
nodenv 1.1.2

I can manage with 70 ms. This is the tool I chose to use as my Node.js version manager for a while. But because nodenv utilizes shims to wrap the executables of the selected Node.js version, a couple of problems arise. The first is that after installing a new executable from a global npm package, you must remember to run nodenv rehash to rebuild the shims. Otherwise you can’t run the executable. The second is that you lose access to the manual pages of the wrapped executables: a shim is an indirection for the actual executable, causing man’s manual page search to miss the page. A demonstration of the problems:

$ npm ls -g --depth=0
/Users/tkareine/.nodenv/versions/10.11.0/lib
`-- npm@6.4.1

$ npm install -g marked
/Users/tkareine/.nodenv/versions/10.11.0/bin/marked -> /Users/tkareine/.nodenv/versions/10.11.0/lib/node_modules/marked/bin/marked
+ marked@0.5.1
added 1 package from 1 contributor in 0.586s

$ command -v marked

$ nodenv rehash

$ command -v marked
/Users/tkareine/.nodenv/shims/marked

$ man -w node
No manual entry for node

$ man -w marked
No manual entry for marked

I keep forgetting to run nodenv rehash, and I would like to access the manual pages of the executables of the selected Node.js version.

nvm and nodenv have a lot of features. While they are useful in some scenarios, such as continuous integration setups, I’d be satisfied with less in my development environment. The ability to install specific Node.js versions and to switch between them easily, independently per shell session, would be enough.

In the Ruby community, ruby-install and chruby tools provide just these features, and nothing more. The former is for installing Rubies and the latter for switching between them. What’s great about this arrangement of separate tools is that the switcher, chruby, is very lightweight.

node-build, part of nodenv project, is a dedicated Node.js installer. It checks the digest of the downloaded Node.js package and allows you to unpack it to any directory. This is good and I’ll keep using it.

For the version switcher, I didn’t find anything I liked. sh-chnode is written in the same spirit as chruby, but includes some design decisions I didn’t like personally.

I ended up writing my own version switcher, even though there are already so many of them. But this one is fast to load, does one thing well, and is suitable for me. :) Naming is hard, so I just call it chnode. Let’s see it in action:

$ ls ~/.nodes
node-10.11.0	node-8.12.0	node-9.11.2

$ time source ~/brew/opt/chnode/share/chnode/chnode.sh

real	0m0.007s
user	0m0.004s
sys	0m0.003s

$ chnode node-10

$ chnode
 * node-10.11.0
   node-8.12.0
   node-9.11.2

$ npm ls -g --depth=0
/Users/tkareine/.nodes/node-10.11.0/lib
└── npm@6.4.1

$ command -v marked
/Users/tkareine/.nodes/node-10.11.0/bin/marked

$ man -w node
/Users/tkareine/.nodes/node-10.11.0/share/man/man1/node.1

$ man -w marked
/Users/tkareine/.nodes/node-10.11.0/share/man/man1/marked.1

$ chnode --version
chnode: 0.2.0

For me, chnode is the tool comparable to chruby for Node.js versions. Like chruby, the primary mechanism of chnode is to modify the PATH environment variable to include the path to the bin subdirectory of the selected Node.js version. But unlike chruby, chnode does not modify any Node.js specific environment variable (there’s no need).
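To illustrate the mechanism, here is a sketch of the PATH manipulation in JavaScript. It is not chnode’s implementation (chnode is a shell script); the switchNode function is hypothetical, and the sketch assumes POSIX-style paths with all versions installed under one directory.

```javascript
// Returns a new PATH value with the bin directory of the selected version
// first and the bin directories of other managed versions removed.
function switchNode(pathValue, nodesDir, version) {
  const isManagedBin = (entry) =>
    entry.startsWith(`${nodesDir}/`) && entry.endsWith('/bin');
  const others = pathValue.split(':').filter((entry) => !isManagedBin(entry));
  return [`${nodesDir}/${version}/bin`, ...others].join(':');
}
```

For example, switching from node-8.12.0 to node-10.11.0 replaces the old bin entry with the new one and leaves the rest of PATH intact. Because the selected bin directory is on PATH directly, command -v finds the real executables without shims.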

I didn’t implement auto-switching in chnode. The feature would switch the Node.js version to the one specified in the .node-version file if the current working directory, or its parent, contained the file. You might put such a file at a project’s root directory. chruby has the feature, but because I don’t use it, I dropped it.
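For the curious, the lookup part of such a feature could be sketched like this. chnode doesn’t do this; findNodeVersion is a hypothetical helper, and checking every ancestor directory up to the filesystem root is my assumption about how far the search should go.

```javascript
const fs = require('fs');
const path = require('path');

// Walks from startDir towards the filesystem root and returns the trimmed
// contents of the first .node-version file found, or null if there is none.
function findNodeVersion(startDir) {
  let dir = path.resolve(startDir);
  for (;;) {
    const candidate = path.join(dir, '.node-version');
    if (fs.existsSync(candidate)) {
      return fs.readFileSync(candidate, 'utf8').trim();
    }
    const parent = path.dirname(dir);
    if (parent === dir) return null; // reached the root
    dir = parent;
  }
}
```

A shell hook would call something like this whenever the working directory changes and then switch to the returned version.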

chnode supports GNU Bash and Zsh, has good test coverage, and allows you to display the selected Node.js version in the shell prompt with ease. It’s MIT licensed. See the README for more.

Finally, the total execution time of initializing my Bash setup in interactive mode, including selecting a Node.js version with chnode:

$ time bash -i -c true

real	0m0.337s
user	0m0.240s
sys	0m0.083s
https://tkareine.org/articles/lightweight-nodejs-version-switching
Programming with minimum number of dependencies

When you’re concentrating on the essentials of your current programming task, you’ll want to avoid sidetracks as much as possible. Encounter a tricky subtask and you’ll start searching the web for 3rd party code solving it for you. When you’re introducing additional dependencies without much further thought, you’re not considering their burden on maintenance.

How often do you check for outdated dependencies (and actually do upgrade, doing all the required client side changes)? Do you have local modifications to any of the dependencies (I hope not, but if you do, how do you track the modifications in order to reapply them)? Once you’ve updated the dependencies, how do you make sure your code still works as intended (how comprehensive are your tests)? And then there’s the hell of library incompatibilities.

You can think that dependencies are like loans you’ll have to take care of for the whole lifespan of the project. The interest rate varies for each dependency, so it pays off to justify having each of them.

Sometimes, you can go a surprisingly long way without additional dependencies. I take occasional delight in programming with as few library dependencies as possible.1 It’s fun for the challenge, and it makes you think about your design.

Recently, I needed a program to fix outdated identifiers in my customer’s MongoDB documents. These identifiers referred to documents in external service A. Each identifier was paired with another identifier for external service B, and luckily the latter ones were still correct. By querying the REST API of service A with service B identifier, I could find out the correct service A identifier and update the document in MongoDB.

Because this was for an occasional maintenance need, I decided not to include the program as part of the application’s code. A command line tool felt better. Observing that both MongoDB and service A’s REST API speak JSON, all I essentially needed was the ability to communicate and handle JSON. For the REST API, the communication happens over a socket connection. For MongoDB, you could use a driver for your programming language to talk with the database. But there was an alternative: because the query and insert operations the tool needed were simple, I could attach the tool to Mongo’s shell with a Unix pipe, evaluating database commands in JavaScript and reading the results as JSON via the pipe.

I went to write the tool in Ruby. It turned out that I didn’t need external libraries at all. Ruby’s standard library has a decent (though verbose to use) HTTP library, a JSON parser and generator, and a great set of tools to work with processes and IO (just take a look at how versatile Kernel.spawn is!) I embedded the small amount of JavaScript needed for database operations straight into the program. User input escaping within the JavaScript commands was easy: just encode the input into JSON.
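The escaping trick works in any language with a JSON generator. Here is a sketch of it in JavaScript instead of Ruby: the buildFindCommand function and the definitions collection name are hypothetical, while printjson and toArray are helpers available in the Mongo shell.

```javascript
// Builds a line of JavaScript for the Mongo shell that looks up definitions
// by term and prints them as JSON. Encoding the user input as JSON keeps it
// inside a string literal, whatever characters it contains.
function buildFindCommand(term) {
  const query = JSON.stringify({ term });
  return `printjson(db.definitions.find(${query}).toArray())`;
}
```

Even if the term contains quotes or shell commands, it ends up as data inside the query object and not as code that the shell would evaluate.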

Because there are no 3rd party libraries, there’s no need for dependency management. The tool is ready to use as long as you have Ruby installed.2

To demonstrate the implementation, I wrote a similar toy program. This one is for searching term definitions: if there are definitions matching the search term in MongoDB, the program shows them. Otherwise, the program searches the definition from DuckDuckGo’s Instant Answer API, stores the new definition to MongoDB, and shows it.

Healthy dependency management balances the risks and benefits. This article is not about doing it all by yourself, avoiding dependencies for the sake of it. Instead, you should consider each dependency, think about how you benefit from it, and how it fits into the whole project.

  1. jekyll-minibundle is an example of this. 

  2. Ruby belongs to the customer’s development environment already, so it’s not a new dependency by itself. 

https://tkareine.org/articles/programming-with-minimum-number-of-dependencies
Remote work calls for active participation

Remote work is sometimes a necessity. Be it for your circumstances or for whatever reason, you have weighed with your team that you doing remote work has more benefits than drawbacks. But working alone behind the wire easily steers you into isolation. You concentrate only on your tasks, forgetting what the rest of the team is doing.

That’s no teamwork. Without participating in the team, you are doomed to slow down your team, and, ultimately, your project.

Being a remote team member calls for active participation. Listen to the others, discuss, and help them as soon as possible. Do you understand your task thoroughly? If not, do not rely on assumptions – ask for more details. Let others know what your status is. It’s up to you to keep others informed, because nobody else knows what you are really doing. Be consistent and others will trust what you do.

Team practices gain another level of importance for remote work. I can’t imagine a team functioning without daily meetings. Or deciding what are the next most important tasks to do without story planning, or improving as a team without retrospectives. Having such practices on a regular basis ensures at least a minimum level of communication between team members.1

As Inayaili de León writes in Becoming better communicators, we are emotional creatures, and digital communication removes the emotional cues that mitigate worry. We have to be aware of this. The better you know your team member, the better chance you have of interpreting her feelings correctly.

Because we tend to misunderstand messages, you should use a communication medium that reduces emotional cues the least. Prefer video over voice, and voice over text. Use at least voice communications for daily meetings. I have found that dedicating a laptop for constant video connection to the rest of the team with Google Hangout is very beneficial.

But even video communication is not a replacement for meeting people in person. You gain tremendously more understanding by taking part in story planning sessions and retrospectives in person. Meet the team regularly and make it a habit. The team evolves their own vocabulary and fellowship, and you must share it in order to be part of the team.

Be active.

  1. Pair programming is great for sharing techniques and practices, and solving problems together. It is possible to do that remotely with screen sharing tools like TeamViewer.

https://tkareine.org/articles/remote-work-requires-active-participation
Asset bundling with Jekyll

How do you ship the stylesheets and JavaScript sources of your Jekyll-built site? Shipping them as is, source file for source file, works, but causes the browser to request each of them separately from the backend. You want to consider concatenating all the stylesheet sources specific to your site into one file and then minifying that file. This is called asset bundling. And you should apply the same for JavaScript sources, too. This reduces the number of requests the browser makes upon initial page load, shortening the time it takes to load the page.

Asset bundling has a related problem: caching. Generally, when the assets of your site change, you want the browser to fetch the latest version from the backend. The problem is in detecting when to use the version in the browser’s cache and when to refetch the latest version of the asset from the backend. This can be solved by setting the HTTP response headers so that html files are considered to be dynamic resources, refetched when changed. The asset files are regarded as static resources, having a long caching period. Whenever the contents of an asset change, we refer to a new static resource in the html file. The latter is called cache busting. There are two techniques to it: using a URL query parameter or a fingerprint in file path.1

It is up to you to solve both asset bundling and cache busting with Jekyll. Out of the box, Jekyll just copies each source as is to the generated site directory. This does not help asset bundling. And you handle the references to the assets in html files manually without any cache busting mechanism. Let’s see what we can do about these.

Jekyll with GitHub Pages

With GitHub Pages, you let the service generate the site from your sources. The tradeoff is that GitHub runs Jekyll with the --safe switch, disabling plugins. This means you have to make do with what Jekyll has by default.2

For bundling assets, there are two options. Either combine the assets manually, or use an external tool for it. I just put all JavaScript code in a single file when there’s not too much of it. The latter option is the one I prefer for stylesheets, because I don’t want to write CSS by hand, anyway. I use Compass to author stylesheets with Sass markup and direct Compass to compile the resulting CSS into a single compact file. The bad thing is that I have to add the compiled CSS files to git.

Update (9 March 2013): Another alternative is to concatenate assets with include tags, as shown here.

Then you have to solve cache busting. Here’s one way to do it. In your content file, add a query parameter to the URL of the asset:

<link href="{{ site.baseurl }}assets/styles/screen.css?bust={{ site.time | date: '%s' }}" rel="stylesheet" media="screen, projection">

When you generate the site, the bust parameter will have a timestamp from the moment of site generation.

<link href="/assets/styles/screen.css?bust=1419885211" rel="stylesheet" media="screen, projection">

Update (29 December 2014): Changed the example above to use plain Unix timestamp as the cache bust value.

The timestamp will change upon each site generation, likely updating too often compared to the frequency of changes you have for screen.css. Assets need timestamps to update only when the contents of the assets change. But at least timestamp generation is automatic, so the tradeoff might be okay. I guarantee you won’t remember to do it manually every time when needed.

This was a path of compromises. But for a small number of stylesheets and JavaScript sources, I don’t think it is all that bad.

Jekyll with jekyll-minibundle

In order to address the compromises discussed above, you have to use Jekyll with plugins. If you look at the Jekyll plugin page and search for “asset”, you will find many plugins written for handling asset bundling.

But for my own preferences, I found most of the existing plugins too complex to use. Nor did I like installing a lot of transitive gem dependencies. So, I decided to write my own: jekyll-minibundle. The plugin has no gem dependencies and it works with any minification tool supporting standard unix input and output.

Let’s go through bundling JavaScript sources.

First, you need to choose your minification tool. UglifyJS2 is a fast one. Install the tool of your choice and set the path to its executable in $JEKYLL_MINIBUNDLE_CMD_JS environment variable. For example:

$ export JEKYLL_MINIBUNDLE_CMD_JS='/usr/local/share/npm/bin/uglifyjs --'

Then, install jekyll-minibundle with

$ gem install jekyll-minibundle

and place the following line to _plugins/minibundle.rb:

require 'jekyll/minibundle'

Place your JavaScript sources to _assets/scripts directory in the site project.

In your content file where you want the <script> tag to appear, place a minibundle Liquid block:

{% minibundle js %}
source_dir: _assets/scripts
destination_path: assets/site
assets:
- scrolling_menu
- program_table
- some_sharing
{% endminibundle %}

Here we specify that the output will be a JavaScript bundle with scrolling_menu.js, program_table.js, and some_sharing.js as input sources from _assets/scripts directory. These will be fed to the minifier in the given order. The output will be stored to _site/assets/site-<md5digest>.js. The plugin will insert the MD5 digest over the contents of the bundle as the fingerprint to the filename:

<script type="text/javascript" src="assets/site-9a93bf1d8459c9a344a36af564b078a1.js"></script>

The plugin supports the same mechanism for stylesheets. However, I still like to use Compass for stylesheets, because it has so many other benefits. Because Compass can handle bundling, the plugin only needs to copy the file and add a fingerprint to the filename.

In order to do this, tell git to ignore _tmp directory, and configure Compass to place the output to _tmp/screen.css. Then, add this line to your content file for including the path to the bundle:

<link href="{% ministamp _tmp/screen.css assets/screen.css %}" rel="stylesheet" media="screen, projection">

The resulting filename will have the MD5 digest of the file as the fingerprint:

<link href="assets/screen-2ef6d65c7f031e021a59eb5c1916f2f2.css" rel="stylesheet" media="screen, projection">

This approach works with RequireJS optimizer, too!

Both the fingerprinting and asset bundling mechanisms work in Jekyll’s auto regeneration mode.
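The idea behind fingerprinting can be sketched in a few lines of JavaScript. This is not the plugin’s code (the plugin is written in Ruby): fingerprintBundle is a hypothetical function, and the sketch skips minification by just concatenating the sources.

```javascript
const crypto = require('crypto');

// Concatenates the sources in order and returns the bundle together with a
// destination path that embeds the MD5 digest of the bundle contents.
function fingerprintBundle(destinationPath, extension, sources) {
  const bundle = sources.join('\n');
  const digest = crypto.createHash('md5').update(bundle).digest('hex');
  return { bundle, path: `${destinationPath}-${digest}.${extension}` };
}
```

Because the digest is computed over the contents, the path stays the same across site generations until some source changes, which is exactly when the browser should refetch the asset.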

The plugin has one more trick up its sleeve. If you set the environment variable $JEKYLL_MINIBUNDLE_MODE to development, the plugin copies asset files as such to the destination directory, and omits fingerprinting. This is useful in development workflow, where you need the filenames and line numbers of the original asset sources.

I have shown how to automate asset bundling and fingerprinting for cache busting with the plugin. In addition, we have gotten rid of all the compromises we had when using vanilla Jekyll: there is no need to store generated bundle files in git, and asset fingerprints change only when the contents of the assets change.

You can read more about the plugin at its project page in GitHub. Also, you might be interested in a site that uses the plugin just like described above.

  1. Google recommends cache busting with fingerprinting over using a query parameter. Some old proxy caches do not cache static files at all if the URL contains query parameters. 

  2. However, you can work around this by generating your site locally and then pushing the generated files to GitHub. Then you’re not locked to Jekyll’s safe mode. 

https://tkareine.org/articles/asset-bundling-with-jekyll
Why JavaScript needs module definitions

My colleague Eero Anttila and I are working on a project where we are using Eero’s Continuous Calendar plugin for jQuery in the frontend. The plugin utilizes a set of date handling functions for formatting, parsing, and so on. The functions are grouped into objects (DateTime, DateFormat, DateRange, and Locale) which are injected into the global window object. A very useful aspect of the functions is that they don’t mutate the objects they operate on. For example, dateTimeObj.firstDateOfMonth() returns a new instance of DateTime.

We found out that we could benefit from these functions in the application generally, needing date handling also elsewhere than in the calendar component.

Our frontend loads with RequireJS, and we’ve been happy composing our application from small modules. Now, in order to get access to the date handling functions in our modules, we need either to ensure that Continuous Calendar gets loaded before our application’s modules, or we need to introduce optional AMD support for the date functions. Because it doesn’t make sense to load the whole Continuous Calendar just to get access to the functions, we decided to add AMD support to them.

The AMD community has devised common patterns for making a JavaScript module1 work simultaneously with AMD loaders, CommonJS, and traditional browser script loading. They are called Universal Module Definition (UMD) patterns. Essentially, we are talking about inserting bootstrap code at the beginning of a module’s source file.

Here’s an example of how the DateTime global object supports AMD loaders and traditional browser script loading:

DateTime.js
(function(root, factory) {
  if (typeof define === 'function' && define.amd) {
    // AMD loading: define module named "DateTime" with no dependencies
    // and build it
    define('DateTime', [], factory)
  } else {
    // traditional browser loading: build DateTime object without
    // dependencies and inject it into window object
    root.DateTime = factory()
  }
}(this, function() {
  // above, `this` refers to window, the second argument is the factory
  // function

  // build DateTime and return it
  var DateTime = {}
  return DateTime
}))

The DateTime factory executes without external dependencies. This is communicated in the code by the define call having an empty array as its second argument for the AMD case, and the factory function call having no arguments in the traditional browser loading case.

However, for building DateRange, we need jQuery, DateFormat, and DateTime:

DateRange.js
(function(root, factory) {
  if (typeof define === 'function' && define.amd) {
    // AMD loading: define module named "DateRange" with dependencies
    // and build it
    define('DateRange', ['jquery', 'DateFormat', 'DateTime'], factory)
  } else {
    // traditional browser loading: build DateRange object with
    // dependencies and inject it into window object
    root.DateRange = factory(root.jQuery, root.DateFormat, root.DateTime)
  }
}(this, function($, DateFormat, DateTime) {
  // above, `this` refers to window, the second argument is the factory
  // function with dependencies

  // build DateRange with the help of $, DateFormat, and DateTime, and
  // return it
  var DateRange = {}
  return DateRange
}))

What happens here? With an AMD loader, such as RequireJS, the if block of the bootstrap code executes. There we call define, specifying a module named DateRange (the first argument) that needs jQuery, DateFormat, and DateTime as its dependencies (the array as the second argument). Eventually, after loading all the specified dependencies, the AMD loader calls the factory function (the third argument) with the dependencies as the arguments to the function.

If we are not using an AMD loader, but loading the script in the browser traditionally with a <script> tag, the else block of the bootstrap applies. Before that, however, we have to ensure that we load modules in such an order that the dependencies of each module exist at the evaluation time of the module. That can be satisfied by careful organization of <script> tags or by bundling the modules in a single source file. In this case, jQuery, DateFormat.js, and DateTime.js must be loaded before loading DateRange.js. When the browser evaluates DateRange.js, it calls the factory function with dependencies fetched from the global window object.

I really like the factory function spelling out the dependencies as parameters to the function.[2] We get to know the dependencies just by looking at the function signature. In addition, we have located the change made to the global window object (if any) in one predefined place (the else block). If we’re using an AMD loader, we avoid polluting the global window object altogether!

The UMD pattern drives the module author to make at most one addition to the global window object. That’s a great guideline for organizing modules.

Of course, it is up to the module author to play by these rules. There’s nothing preventing the factory function from referring to the window object for other dependencies or polluting the global window object. But why would the author want to surprise the users of the module?

  1. Module meaning a JavaScript source file defining functionality that can be used elsewhere. 

  2. The factory function is an application of Module Pattern with import mixins

https://tkareine.org/articles/why-javascript-needs-module-definitions
Readable tests

When I go to explore unfamiliar code, I dig up its tests first. I hope the tests introduce me gently to the purpose of the code, covering the common use cases first, followed by edge conditions and more peculiar cases. I expect tests to reveal to me the general behavior and purpose of the code. I don’t expect other documentation.[1]

Later, when I change the code by refactoring and adding new features, I don’t expect to modify most of the tests. Finding the place for writing new tests for the added feature is intuitive, because the structure of the tests guides me to the proper location.

That’s what good tests are like. The implied characteristics are introduction, documentation, and stability under change. The fact that such tests protect you against regression bugs is almost an afterthought.

I think readability is a good term for covering these features. Here are a few guidelines for writing such tests.

Setup state, make claims about it

Say you have a class or webpage that needs to be tested in a certain state. It is important to clearly separate state setup code from test assertions. The former answers the question “where are we?”, while the latter answers “what is it like?” I use the term system-under-test to denote the state to be tested.

The actual tests are just claims about the state of the system-under-test. They cause no changes to the state (no side effects). I use the term test claim for that.

Together, a system-under-test and its test claims form a “test context”.

An example of such a test, written in JavaScript and using the Mocha test framework:

cart_page_spec.js
describe('shopping cart page', function() {
  describe('when page is loaded', function() {     // system-under-test
    before(function(done) {
      App.loadCart(done)
    })

    it('shows checkout button', function() {       // test claim
      expect($('.cart button.checkout')).to.be.visible
    })

    it('has no payment options', function() {
      expect($('.cart .payment .method')).to.be.empty
    })

    // more claims...
  })

  describe('when choosing to pay', function() {
    before(function(done) {
      App.loadCart(function() {
        TestHelpers.clickCheckoutButton(done)
      })
    })

    it('hides checkout button', function() {
      expect($('.cart button.checkout')).to.be.hidden
    })

    it('has payment options', function() {
      expect($('.cart .payment .method')).not.to.be.empty
    })

    // more claims...
  })
})

Prepare the state of each system-under-test (describe block) in its setup code (the before blocks in the example above). Make sure to reset everything the tests depend on. This ensures that each system-under-test gets a fresh start, avoiding state leaks from others.

For cleaning your tracks, you could have teardown code to be run after the tests of a specific test context. It is best to avoid teardowns, however, because they are easy to forget to write. It is better to write your setup code so that it ensures the world is in proper state for your tests to run.
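One way to make setup code own the whole state is to build the world from scratch in a single function that every setup block calls. The sketch below is only an illustration: createFreshCart and the shape of the cart object are hypothetical, and the plain top-level calls stand in for Mocha before blocks.

```javascript
// Hypothetical state factory: returns a brand-new cart on every call, so
// nothing from an earlier test context can leak into a later one.
function createFreshCart() {
  return { items: [], checkoutStarted: false }
}

// setup of test context A: start from a fresh cart and modify it
var cartA = createFreshCart()
cartA.items.push('book')
cartA.checkoutStarted = true

// setup of test context B: again a fresh cart; no teardown of A is needed,
// because B never looks at the state that A left behind
var cartB = createFreshCart()

console.log(cartB.items.length)    // 0
console.log(cartB.checkoutStarted) // false
```

In a Mocha test file, the call to the factory would live inside the before block of each describe block.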

Try to separate tests so that each assertion makes a specific claim. You can use multiple assertions for a specific claim, however. Custom matchers help you with this, especially if you test a specific thing more than once.

When you write your tests like this, you gain two benefits: you can run your tests in any order, and you get the choice to run the setup code for a system-under-test only once. For instance, Ruby’s MiniTest runs tests in random order, helping to catch tests that have side effects in their claims. Mocha has a before block for running setup code for a system-under-test only once (beforeEach runs the setup before each claim). This speeds up test execution.

In addition, prefer active clauses for describing a system-under-test and its claims. An active clause clearly identifies that the claim is about the system-under-test. Also, the words should and must are just noise: compare “it has payment options” against “it should have payment options”.

Test state transitions

Note that the states of the two test contexts in cart_page_spec.js (above) differ only by the clicking of the checkout button. Why didn’t I just take the state of the first test context and modify that for the purposes of the latter test context? I chose to reset the world between them, because it gives us orthogonality (state changes in test context A do not get reflected in test context B). After a few state transitions, it becomes hard to keep track of the state changes that have happened so far. Ideally, you want to see the whole state of the current system-under-test at a glance. You achieve that by initializing the whole state in the setup block of the system-under-test.

Now I can also reorder test contexts as I like. I can move the most common cases to the top of the test file and edge cases to the bottom.

But sometimes it is useful to have state transitions between test contexts. For example, such a case might occur for input validation before checkout confirmation:

cart_page_validation_spec.js
describe('cart page validation', function() {
  describe('when entering invalid credit card number', function() {
    before(function(done) {
      App.loadCart(function() {
        TestHelpers.clickCheckoutButton(function() {
          $('.cart .payment .creditcard .number').val('lolbal')
          done()
        })
      })
    })

    it('highlights credit card number as invalid', function() {
      expect($('.cart .payment .creditcard .number').hasClass('invalid')).to.equal(true)
    })

    it('disables confirmation', function() {
      expect($('.cart button.confirm')).to.be.disabled
    })

    describe('and then entering valid credit card number', function() {
      before(function(done) {
        $('.cart .payment .creditcard .number').val('4012888888881881')  // not mine, mind you
        done()
      })

      it('does not highlight credit card number as invalid', function() {
        expect($('.cart .payment .creditcard .number').hasClass('invalid')).to.equal(false)
      })

      it('enables confirmation', function() {
        expect($('.cart .payment button.confirm')).to.be.enabled
      })
    })
  })
})

Essentially, here you test that the validation mechanism handles the case of revalidating invalid input.

I prefer to nest test contexts that depend on earlier ones, because nesting communicates the dependence clearly. Keep the number of nesting levels in check, though: three or more levels make the test context difficult to read as a whole.

Group tests by semantics

If a set of tests is similar in semantics, you should group them together so that it is easy to see the difference between them:

date_format_spec.js
describe('date formatting', function() {
  _.each([
    { desc: 'non-date string', args: ['lolbal'] },
    { desc: 'empty object',    args: [{}] },
    { desc: 'number',          args: [1] }
  ], function(spec) {
    it('throws exception if given ' + spec.desc, function() {
      expect(function() { Format.date.apply(null, spec.args) }).to.throw(/^Invalid date: /)
    })
  })
})

Those tests were about input argument validation. I would separate them from testing the happy path:

date_format_spec.js (continued)
describe('date formatting', function() {
  _.each([
   // note: JavaScript months are zero-based, so 4 means May
   { desc: 'Date object, with long weekday',                  args: [new Date(2010, 4, 2), {weekday: 'long'}],  expected: 'Sunday May 2, 2010' },
   { desc: 'Date object, with short weekday',                 args: [new Date(2010, 4, 2), {weekday: 'short'}], expected: 'Sun May 2, 2010' },
   { desc: 'Date object, without weekday',                    args: [new Date(2010, 4, 2), {weekday: false}],   expected: 'May 2, 2010' },
   { desc: 'String presentation of date, with long weekday',  args: ['2010-05-02',         {weekday: 'long'}],  expected: 'Sunday May 2, 2010' },
   { desc: 'String presentation of date, with short weekday', args: ['2010-05-02',         {weekday: 'short'}], expected: 'Sun May 2, 2010' },
   { desc: 'String presentation of date, without weekday',    args: ['2010-05-02',         {weekday: false}],   expected: 'May 2, 2010' }
  ], function(spec) {
    it('formats ' + spec.desc, function() {
      expect(Format.date.apply(null, spec.args)).to.equal(spec.expected)
    })
  })

  // input argument validation tests are here
})

By putting the expected input and output of each test case on its own line, possibly with a short description of how the case differs from others, you can easily compare them and spot missing tests for edge conditions.

When you adhere to writing a test claim for each test case, it becomes easy to see which particular test fails when you run the test suite.

If your test framework of choice has expression syntax for test claim definition, you can avoid repeating the boilerplate code for each test claim. First, think of a group of test cases and see what is common to them. Then, put the varying parts of the cases into a collection. Lastly, iterate over the collection so that the body of the iteration becomes the test claim definition. This is what I did in the examples above.

I think this improves readability a lot, because now I can put each test case on its own line, without the boilerplate code between them. This is a manifestation of the Don’t Repeat Yourself (DRY) principle.

But don’t take DRY to the extreme. You should aim for making tests readable, not as short as possible. This is why I separated the group of happy path tests from the group of argument validation tests.

On test abstraction levels

Choosing the most suitable abstraction level for testing your code is hard. There are many characteristics at play, some of which are at odds with each other: coverage, simplicity, execution speed, and maintenance. For example, if you choose the application user interface as the abstraction level for all your tests, you gain easier test code maintenance (architectural refactorings do not cause changes to tests), but lose in execution speed (all the application components will be used).

Of course, it is about balance. Choose the characteristics that you desire most for testing a particular part of your application.

I’d write tests for a date formatting component at the unit level, like in date_format_spec.js. It makes no sense to launch the whole application in order to test that dates get formatted as expected: the user interface might change during development, and covering all the inputs makes the execution slow for such a low-level component.

On the other hand, if I had an application with a Model-View-Controller architecture, I wouldn’t write tests for controllers, models, and views in isolation. Writing tests for a specific controller only would require using dummy implementations of the associated models and views. Maintaining tests across refactorings would be laborious, because changes in the interfaces of controllers, models, or views would propagate to many tests. Instead, I would raise the abstraction level and write tests at the functional level. In cart_page_spec.js, the web page with the related behavior is the functional level.

You need tests to have confidence that everything works as expected. Isolate your tests from external interfaces whose output you cannot control. Otherwise, you lose that confidence. You can use fake or stub implementations for external interfaces. A fake implementation is easier to put in place if you first abstract the external interface behind your own component:

rest.js
define(['environment'], function(environment) {
  if (environment.production) {
    return createProductionAPI()
  } else {
    return createTestAPI()
  }

  function createProductionAPI() {
    return {
      postCheckout: function(callback) {
        // jqXHR's success() was removed in jQuery 3.0; done() is its replacement
        $.ajax(/*...*/).done(callback)
      }
    }
  }

  function createTestAPI() {
    return {
      postCheckout: function(callback) {
        callback(stubs.postCheckoutResponse)
      }
    }
  }
})

Here I have a component of the frontend part of a web application, abstracting the REST API of the backend part. All the REST API calls in the frontend go through this component. In the test environment, the component returns canned responses without actually sending requests. It is not a big leap to change the dummy response to fit a particular test’s needs, either.
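To illustrate that last point, here is a minimal sketch of a test API whose canned response can be replaced per test. The names createTestAPI and postCheckout follow rest.js above; the responses parameter, the default response, and its fields are assumptions made for this example only.

```javascript
// Hypothetical default canned response
var defaultResponses = {
  postCheckout: { status: 'ok', orderId: 1 }
}

// Builds a fake REST API; a test may pass its own responses to override
// the defaults for the calls it cares about.
function createTestAPI(responses) {
  responses = responses || {}
  return {
    postCheckout: function(callback) {
      var hasOverride = Object.prototype.hasOwnProperty.call(responses, 'postCheckout')
      callback(hasOverride ? responses.postCheckout : defaultResponses.postCheckout)
    }
  }
}

// a test that needs a failing checkout swaps in its own response
var api = createTestAPI({ postCheckout: { status: 'payment-declined' } })
api.postCheckout(function(response) {
  console.log(response.status) // payment-declined
})
```

Tests that don't care about the checkout response call createTestAPI without arguments and get the default.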

I dislike using mocks in tests and letting them guide code design. They have ended up being a maintenance burden everywhere I’ve worked with them.

Summary

Like writing good code, writing good tests is hard and takes many iterations. I use these guidelines to steer me when I write tests, but I wouldn’t hesitate to drop a particular guideline if doing so makes the end result more readable.

  1. The emphasis is on expectation. For example, a hack in the middle of self-documenting code is unexpected. Thus, you should document any unexpected code. You can even isolate the hack to its own function with a descriptive name. 

https://tkareine.org/articles/readable-tests