7 golden rules for Docker images
Creating a good Docker image is an art. There are no fixed rules that can be applied in every situation. Instead, we need to look at the pros and cons of every decision. We can, however, provide guidance.
Here are 7 golden rules for Docker images. By following these rules, you can improve the containers you build, making them more reusable, more efficient and more stable.
The prime requirement for all scalable containers is to never keep track of state. Every action should be executed in its own context, without the need to store long-term information anywhere inside the micro-service. This means that permanent storage, like databases, data files or caches, should not live inside the container. When no data lives inside the container, requests can be executed by any copy of the micro-service, so we can load-balance the requests or recycle malfunctioning containers.
Statelessness is hard to achieve. It requires that the software you try to encapsulate is written in a way that allows statelessness. When done properly, the software should allow you to push the state to an external resource, such as a database. As a hint, you could try to put your data files in central folders that can be mounted as external shared volumes (watch out for file locks and concurrent access to the files), make use of external, shared caches, and so on.
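As a minimal sketch of pushing state out of the container, the Dockerfile below keeps all data files in one folder that is declared as a volume, and takes the address of an external cache from an environment variable. The base image, the paths and the `CACHE_URL` variable are hypothetical, and the application must of course be written to honor them.

```dockerfile
# Hypothetical base image, paths and variable names, for illustration only.
FROM registry.example.com/base/java-runtime:17.0.10

# All data files live under one folder, so it can be mounted from outside.
VOLUME /var/lib/myservice/data

# Address of an external, shared cache; assumed to be read by the application.
ENV CACHE_URL=redis://cache.example.com:6379

COPY myservice.jar /opt/myservice/myservice.jar
CMD ["java", "-jar", "/opt/myservice/myservice.jar"]
```

At run time the folder could then be backed by a shared volume, for example with `docker run -v shared-data:/var/lib/myservice/data myservice:1.0.0` (the volume and image names are again made up).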
If the software doesn’t allow a stateless runtime, you might be able to use clustering features of the software. This is not advised, because it puts extra requirements on the network setup of your runtime, but it can allow you to run load-balanced.
My request should be handled by any container, independent of previous requests.
When your container travels through the development landscape towards production, it is downloaded and uploaded many times. Even when it has hit production, it will be copied and unpacked every time a new instance is started. Even though disk, memory and CPU might be cheap, the total can add up. Look at the following examples:
- When a micro-service fails, we need a replacement instantly, so any delay during unpacking should be avoided.
- After a disaster, all services might need to start at the same time, competing for the bandwidth needed to download images and for the resources needed to unpack them.
- Your artifact repository might hold hundreds of versions of your images. Even with good housekeeping rules, the disk space needed might grow beyond what is available.
What can you do about this?
- Clean up your package repository cache after using it to install. Package managers like yum and apt download a copy of the version information of all available packages. You won’t need that inside your running container, so clean up after installing.
- Make sure you clean up at the end of every instruction in your Dockerfile. Docker takes a filesystem snapshot after each instruction that changes the filesystem (such as RUN, COPY and ADD) and stores the diff as a layer. Cleaning up in a later instruction will not reduce the image size, because the files are still stored in the earlier layer; the later layer merely marks them as deleted.
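The two points above can be sketched for an apt-based image as follows. The base image tag and the installed package are only examples; the important part is that installing and cleaning up happen in the same RUN instruction.

```dockerfile
# Base image tag and package are examples, not a recommendation.
FROM debian:12.5-slim

# Install and clean up in ONE instruction, so the downloaded package lists
# never end up in a stored layer.
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl \
 && rm -rf /var/lib/apt/lists/*

# Anti-pattern, shown for contrast: the lists are already stored in the
# layer created by the first RUN; the second RUN only hides them.
# RUN apt-get update && apt-get install -y curl
# RUN rm -rf /var/lib/apt/lists/*
```

On a yum-based image the same idea would end the instruction with `yum clean all`.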
Load-balancing and high-availability depend on the ability to react to changing conditions in a timely manner. If a container fails, you wish to have it replaced now, not one hour later. When sales are peaking on your website, you wish to have extra capacity now, without any lead time, or you might miss some revenue.
Your micro-service should be able to start fast, and without dependencies on external systems. You don’t want your service to download extra packages from a repository at startup; instead, all the packages should be part of the Docker image. A service should not have to register itself on a central server; other services should be able to connect to it by using a well-known URL or name, such as an OpenShift service name. There should never be an external license server that needs to authorize your instance and that becomes a show-stopper when it is unavailable.
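A minimal sketch of "everything is already in the image", assuming a hypothetical Python service with a `requirements.txt` and a `service.py`: the dependencies are downloaded while the image is built, so starting a container only launches the process.

```dockerfile
# Hypothetical Python service; base tag and file names are examples.
FROM python:3.12.2-slim
WORKDIR /app

# Dependencies are downloaded once, at build time, and baked into the image.
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY service.py .

# Startup only launches the process; nothing is fetched from the network here.
CMD ["python", "service.py"]
```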
Configuration is all about being able to change behavior without rebuilding the image.
Think about how others will use your micro-service. What flexibility can we give them? Does your container need a server name and port to access an external database, or can we provide a full JNDI URL, which allows more fine-tuning? Try not to restrict the use of your container. Containers are for IT experts, not for end users, so give them a powerful interface. It will be used by competent people who are trying to make it work in a situation you might not have foreseen, so try to give them the tools.
There are multiple ways to inject configuration into your container.
- Environment variables are the easiest to use. They are clear, easy to find and well understood. They are, however, immutable once the container has started: the process that runs inside the container gets a copy at startup.
- Configuration files are a bit more complex. The format depends on your software, and they may be scattered all over your file system. However, good software packages are able to detect changes at runtime and can reload the file. Also, files can be mounted, so multiple containers can use the same central configuration file, which makes it easier to maintain the settings.
Other solutions, such as storing settings in a central database, are possible, but not advisable in a micro-service landscape. If you are running a large OpenShift or Kubernetes environment, you want configuration changes to be visible, and a database hides them.
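Both injection methods can be prepared in the Dockerfile. The sketch below is an assumption about how a service might be set up: the variable names `DB_URL` and `LOG_LEVEL`, the folder `/etc/myservice` and the image names are all hypothetical.

```dockerfile
# Hypothetical image, variable names and paths, for illustration only.
FROM registry.example.com/base/java-runtime:17.0.10

# Defaults that can be overridden at start time without rebuilding the image.
ENV DB_URL=jdbc:postgresql://db.example.com:5432/orders \
    LOG_LEVEL=info

# A mount point for configuration files supplied from outside the container.
VOLUME /etc/myservice

COPY myservice.jar /opt/myservice/myservice.jar
CMD ["java", "-jar", "/opt/myservice/myservice.jar"]
```

A user could then change the behavior at start time, for example with `docker run -e LOG_LEVEL=debug -v /srv/config/myservice:/etc/myservice:ro myservice:1.0.0`.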
Where configuration is about changing the runtime behavior of the container within the bounds of the software, extensibility is about being able to enhance the software’s behavior by building on top of the image.
Many container images on docker.io are built according to this principle. When you use an image of an application server such as GlassFish, you can choose to start the container and upload your application modules to the running instance. This is, however, a painful process that needs to be repeated every time you start the container. Instead, you can choose to build a new Docker image on top of the application server image with the packages pre-installed. When you do so, you have extended the original micro-service with your code, creating a new micro-service.
When you design your micro-service, think about how others can extend it. Can they add classes, libraries or other things that enhance the behavior? Maybe you should split your container into two parts: a reusable base image and a customized extension to that image for your single purpose.
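Extending an application server image could look roughly like the sketch below. The base image name, its tag and the deployment folder are hypothetical; the actual folder depends on the application server image you build on, so check its documentation.

```dockerfile
# Hypothetical base image and deployment folder, for illustration only.
FROM registry.example.com/appserver/glassfish:7.0.12

# Pre-install the application, so every new container starts with it deployed
# and nothing has to be uploaded to the running instance.
COPY target/orders.war /opt/glassfish/autodeploy/orders.war
```

The base image stays reusable for other applications, while this small extension carries only what is specific to one purpose.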
Docker images are built in layers. A layer is basically a diff: we take the previous layer and apply a number of changes to it, in order to arrive at a new situation. Each layer adds a piece of the final image, and each layer depends on the previous layer.
We already mentioned that we should avoid adding unnecessary data to the layers, but even when we avoid that, we still need to look critically at the layer structure. When you create a docker image, you are focused on the end result: making it work. This is your primary goal. Once you have it working, you should review your Dockerfile, and see if you need to make changes.
Docker tries to be smart about the layer structure. Whenever you use a Docker image, the layers are downloaded and cached locally. A cached layer can be reused when 1) the chain of layers from the root layer up to this layer is exactly the same and 2) this layer itself has not changed. As soon as you make a change to one layer, it invalidates that layer and all layers that come after it. All of these layers will need to be rebuilt and downloaded again, even though earlier copies were cached and the content of the later layers has not changed.
When you look at the order of the layers, you should follow the following guideline:
large before small, stable before volatile
When the order of two lines in your Dockerfile is not determined by any dependency, you should consider the above rule. Ideally, the line that produces the largest layer should come first. Also, lines that are not likely to change in the next version should come before lines that will change, such as lines referring to a specific version of a package. These two rules allow Docker to use its cache more effectively, reducing memory, disk space and bandwidth usage significantly. As a bonus, the build time for images is also reduced whenever you make a change to one of the volatile layers.
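Applied to a hypothetical Java service, "large before small, stable before volatile" might look like the sketch below. The image name, folder layout, version number and main class are assumptions made for the example.

```dockerfile
# Hypothetical image, paths and class name, for illustration only.
FROM registry.example.com/base/java-runtime:17.0.10

# Large and stable: third-party libraries that rarely change.
COPY lib/ /opt/myservice/lib/

# Small, changes occasionally: default configuration.
COPY config/ /opt/myservice/config/

# Small and volatile: the application itself, rebuilt for every release.
COPY myservice-1.4.2.jar /opt/myservice/myservice.jar

CMD ["java", "-cp", "/opt/myservice/myservice.jar:/opt/myservice/lib/*", "com.example.Main"]
```

With this order, a new release of the application invalidates only the last, small layer; the library layer stays cached on every host that already has it.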
You should use explicit versioning, always.
Dockerfiles are code, just like code in any other language. You put them in source control and use a compiler (docker) to build them into an executable (the container image). You want this process to be predictable and repeatable. You don’t want it to break suddenly when a dependency is updated. Imagine your container has been in production, happily running, for more than a year. A small change is requested and you agree to it. You take the Dockerfile out of source control, only to find out that it fails to build. Now you are left with an investigation that is preventing you from meeting your deadline.
What are the pieces you need to version?
- The Docker image in the FROM instruction. This is quite obvious.
- The software package you encapsulate in the container, still a no-brainer.
- The libraries and packages used by the software.
- And finally, the tools you install with apt, yum etc. to prepare the docker image.
This last item is often forgotten. If you use a packaging system from a distribution, these packages also change. The behavior or interface may change slightly. The newer versions might be incompatible with the old distribution you are using through your FROM image. Make sure you version the tools, and that the tools remain available for download, for example by copying them to an artifact repository under your control.
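A sketch of what explicit versioning of each piece could look like in an apt-based image. Every version number below is illustrative, and the download URL points to a hypothetical artifact repository; substitute the versions you have actually tested.

```dockerfile
# All version numbers are illustrative; use the ones you have tested.
# 1) The base image: an explicit tag instead of "latest".
FROM debian:12.5-slim

# 2) Tools from the distribution: pinned with the name=version syntax of apt.
RUN apt-get update \
 && apt-get install -y --no-install-recommends \
      curl=7.88.1-10+deb12u5 \
 && rm -rf /var/lib/apt/lists/*

# 3) The packaged software: a fixed version, fetched from a hypothetical
#    artifact repository under your own control.
ADD https://artifacts.example.com/myservice/1.4.2/myservice.jar /opt/myservice/myservice.jar
```

With yum, the comparable form is `yum install -y name-version`. Note that a distribution may remove old package versions from its mirrors, which is one more reason to keep copies in your own repository.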
There are many things to take into account when we create a container image. This makes the creation of a good image an art in its own right. Good design might not be apparent at first. If the program inside the container runs correctly, who will complain? Only when an image is used extensively will the flaws become visible. By following the rules in this article, you can steer clear of many of the pitfalls.