On Thursday June 7th 2018 a new customer emailed us reporting that they were unable to submit the work they'd done on our platform. The cause of this error was the deployment of an outdated version of our frontend. Other customer sites and customers who use our on-premise offering were not affected.
LightTag is a tool to make the work of labeling text easy. Being unable to submit work to our servers is unacceptable. In this postmortem we will examine what went wrong at both a technical and process level and explain the steps we've taken to prevent this from happening in the future.
We're very inspired by other postmortems we've seen, notably GitLab's postmortem of their database outage. We hope that by being transparent about our errors and the steps we are taking to mitigate them we can continuously improve and maintain the trust that our customers put in us and our systems.
Our frontend application is served from a container that runs on a Docker Swarm cluster. When we onboarded the customer who reported the issue we mistakenly deployed the frontend image tagged latest (instead of the image tagged with the highest semantic version) from our container repository.When deploying a stack, Docker Swarm will not pull an image onto a host if that image already exists on the host. Thus when we deployed our customers stack they ran an outdated frontend, since an image tagged latest already existed on the host.
The technical resolution for this issue is deploying images based on their semantic version tag. We knew this before the issue ocurred and so realize that this issue is caused by a gap in process, not in technology. Our deployment process did not have sufficient controls and monitoring over precisely which image is deployed.
LightTag provides a text annotation platform as either a hosted solution (SaaS) or on-premise installation. In order to keep complexity down our SaaS deployment is (almost) equivalent to our on-premise deployment. In particular the components that make up LightTag, such as a database, backend server and machine learning processes are each run inside of a container orchestrated by Docker Swarm.
In our hosted offering, each customer is allocated their own set of containers (a stack in the Docker Swarm parlance) which lets us offer our customers full isolation while remaining cost efficient. One of the components that runs in it's own container is our frontend ReactJS application which is statically built and served by our proxy server.
Previously we deployed each new customer in a manual process and we recently migrated to a fully automated self service on-boarding process, the relevant part of which defines and deploys a stack of containers into our Swarm cluster. The customer that reported the error was the first to go through this automated deployment process, hence why they were the only one to experience this issue.
We've been aware of Docker Swarm's behavior (not pulling an image if it exists on the host) for some time. During our days of manual deployment this was a frustrating point which entailed an extra step, updating a service and forcing it to pull after deployment. While frustrating, it was on our deployment checklist and so was not missed.
To generally improve our engineering practices and particularly to get more visibility into what we are deploying, we embraced semantic versioning. In particular we adopted the Semantic Release library as part of our CI/CD pipeline. This library calculates the latest SemVer based on commit messages, and we tag each new image with the calculated semantic version.
Our deployment service takes advantage of this, querying our container repository (example) and selecting the image:tag with the highest semantic version which it deploys to the customers stack. This is done for each of the containers in a customers stack.
Technically, this issue occurred because a bug in our deployment script did not check for the frontend's highest semantic version tag, instead it deployed the latest tag, a relic from when we prototyped the deployment service.
A better internal process would have detected this error. Here are the gaps in our process that we've recognized as facilitating this error
If you think there are additional measures we can take to prevent incidents like this please let us know