As always, this started out as a presumptuous tweet:
maintaining a million puppet modules has taught me one thing: not a single piece of software was written with automation in mind.
(The Wrath of me™, @hirojin, December 2, 2015)
Of course, this was met with a certain skepticism. So here are some lessons learned from ~9 years of putting applications into production, and being woken up by them falling over.
don’t invent your own configuration format
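To make that concrete, here is a minimal sketch, assuming an INI-style file is good enough for your application. The section name, keys, and the `load_config` helper are made up for illustration. The point is only that a format with an existing parser can be read (and written) by every automation tool without custom code.

```python
import configparser

# Hypothetical sample configuration in a well-known format (INI).
SAMPLE = """\
[server]
listen = 127.0.0.1
port = 8080
"""


def load_config(text: str) -> configparser.ConfigParser:
    """Parse configuration text with the standard library parser."""
    parser = configparser.ConfigParser()
    parser.read_string(text)
    return parser


if __name__ == "__main__":
    config = load_config(SAMPLE)
    # Typed accessors come for free with an established format.
    print(config.get("server", "listen"), config.getint("server", "port"))
```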
version numbers are a contract
This contract extends to the API, the ABI, your configuration, and your utilities. You can break the contract between major versions, but please, for the love of Knuth, don’t break the method of determining the version.
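One way to keep the method of determining the version stable is a `--version` flag whose output shape never changes. The sketch below assumes a hypothetical program called `exampled` and a `MAJOR.MINOR.PATCH` scheme; neither is prescribed by anything above. `parse_version` stands in for what the automation on the other side would do.

```python
import argparse

# Hypothetical version constant: a single source of truth for the contract.
__version__ = "2.4.1"


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(prog="exampled")
    # argparse's built-in "version" action prints the string and exits.
    parser.add_argument(
        "--version", action="version", version=f"%(prog)s {__version__}"
    )
    return parser


def parse_version(output: str) -> tuple:
    """Parse '<prog> MAJOR.MINOR.PATCH' the way an automation tool would."""
    _, version = output.strip().rsplit(" ", 1)
    major, minor, patch = (int(part) for part in version.split("."))
    return (major, minor, patch)


if __name__ == "__main__":
    build_parser().parse_args()
```

As long as the output keeps that shape across releases, the consumer never needs a special case per version.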
providing a tool to modify the configuration…
use the underlying systems’ facilities
Rather than bootstrapping a daemon and a watchdog, use (smf|systemd|etc…). Use the OS package manager, or your programming environment’s package manager (gem|pip|war|etc), or your container’s package manager. This makes installation really easy, and makes it an atomic transaction.
Do not reinvent a new package manager
If you accidentally did anyway, consider providing a way of finding out whether the package manager has already run successfully.
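A rough sketch of what such a check could look like, assuming your installer can write a small marker file once every step has succeeded. The state directory layout and the function names `record_success` and `installed_version` are hypothetical, not taken from any real tool.

```python
import json
import tempfile
from pathlib import Path
from typing import Optional


def record_success(state_dir: Path, package: str, version: str) -> None:
    """Write a marker file only after every install step has succeeded."""
    state_dir.mkdir(parents=True, exist_ok=True)
    marker = state_dir / f"{package}.json"
    tmp = state_dir / f"{package}.json.tmp"
    tmp.write_text(json.dumps({"package": package, "version": version}))
    # Rename into place so the marker is either absent or complete.
    tmp.replace(marker)


def installed_version(state_dir: Path, package: str) -> Optional[str]:
    """Return the recorded version, or None if the install never finished."""
    marker = state_dir / f"{package}.json"
    if not marker.exists():
        return None
    return json.loads(marker.read_text())["version"]


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmpdir:
        state = Path(tmpdir)
        print(installed_version(state, "exampled"))  # nothing recorded yet
        record_success(state, "exampled", "2.4.1")
        print(installed_version(state, "exampled"))
```

With something like this, configuration management can ask "is it done?" instead of rerunning the installer and hoping it is idempotent.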
Put packages into repositories
This makes installation actually easy, and dependency management possible within the above-mentioned atomic transaction.
Oh, your package repository should not be (only) GitHub.
If your software needs to scale, it’s easier when it’s stateless
If you need to replicate state, how will you do that to 3 nodes? How about 300 nodes?
These are the basics. Your software is now installed, configured and running! But is it running correctly?
Provide an easy way to get metrics from the software
This isn’t restricted to health metrics of the application; it can be liberally extended to the health of the business.
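As an illustration, here is a tiny in-process counter registry that renders one `name value` pair per line. The class, the metric names, and the line format are assumptions made for this sketch; a real application would more likely use whatever metrics library its monitoring stack expects.

```python
from collections import Counter


class Metrics:
    """Minimal counter registry with a line-oriented text rendering."""

    def __init__(self) -> None:
        self._counters = Counter()

    def incr(self, name: str, amount: int = 1) -> None:
        self._counters[name] += amount

    def render(self) -> str:
        # One "name value" pair per line, sorted for stable output.
        return "".join(
            f"{name} {value}\n" for name, value in sorted(self._counters.items())
        )


if __name__ == "__main__":
    metrics = Metrics()
    metrics.incr("http_requests_total")          # application health
    metrics.incr("orders_completed_total", 3)    # business health
    print(metrics.render(), end="")
```

Note that the business counter sits right next to the technical one: the same endpoint can answer both "is it up?" and "is it making money?".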
And finally, if you want to be really nice to your admin, not just your admin’s automation software:
Provide debuggable errors in the log
Can someone who didn’t write the software find out why it crashed from the logs? Can they fix it?
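For example, a log line can say what failed, what value caused it, and what to do about it. The sketch below assumes a hypothetical `port` setting and a logger named `exampled`; the wording of the messages is the part that matters.

```python
import logging

logger = logging.getLogger("exampled")


def load_port(raw: str) -> int:
    """Validate a hypothetical 'port' setting, logging actionable errors."""
    try:
        port = int(raw)
    except ValueError:
        # Say what was wrong, what was received, and how to fix it.
        logger.error(
            "config key 'port' must be an integer, got %r; "
            "correct it in the configuration and restart",
            raw,
        )
        raise
    if not 0 < port < 65536:
        logger.error(
            "config key 'port' must be between 1 and 65535, got %d; "
            "correct it in the configuration and restart",
            port,
        )
        raise ValueError(f"port out of range: {port}")
    return port


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    print(load_port("8080"))
```

Compare that with a bare stack trace ending in `ValueError`: the person on call at 3 a.m. can act on the first, but has to read your source code for the second.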
Document each of the above
How else do you think they’ll actually discover any of that functionality!?
That’s all folks!