Is email your monitoring and alerting mechanism?
There is no monitoring solution, and you wait for your users to tell you that your service is unresponsive
Does your infrastructure use a lot more self-signed certificates than you think?
For any N applications, at most N/2+1 use the same certificate bundle
You end up using shell for "complex stuff" because it's easier that way
Your /etc/hosts is cluttered with assorted custom entries
The only person who knows the script or procedure to resolve an issue is on vacation
You often hear - "We've always done it this way."
"Prod" is just another name for "staging".
Nobody knows what exactly it is you do.
If a post-mortem follow-up task is not picked up within a week, it's unlikely to be completed at all.
Your quarterly planning has no meaning when the next re-org rolls around.
Most of your actual work is not covered by your OKRs
Read more about OKRs here - https://www.netmeister.org/blog/okr-distractions.html
Management will always happily spend $$$ on outside consultants to tell them what you've been saying for years.
Management will much rather invest in inventing a new, square wheel than fixing an old round one.
Your restricted shells are not as restricted as you think.
Your network team has a way into the network that your security team doesn't know about.
Do you think "one in a million is next Tuesday"?
You can read about "One in a million is next Tuesday" here - https://docs.microsoft.com/en-us/archive/blogs/larryosterman/one-in-a-million-is-next-tuesday
Quite often, you see "human error" as the root cause
Your culture is - "If you break it, you own it - for now; if you fix it, you own it - forever."
You cannot turn things off permanently because there is no one who can approve it.
And so they keep lingering around
The most critical services are maintained by a handful of people, and nobody else dares to go near them
You're bound by the CAP theorem much more often than you may think. Halting Problem's a bitch, too.
Read more about the CAP theorem here https://en.wikipedia.org/wiki/CAP_theorem and about the halting problem here https://en.wikipedia.org/wiki/Halting_problem
Your runbook for fixing things says "Turn it off and on again" and promises that it will work afterwards.
The only documentation that exists is a README.md file
A document points to another document, which in turn takes you to yet another document
And this goes on until you give up
The document is marked obsolete but does not reference any replacement
And no current version of the document is referenced or available for you to use
Somewhere, somebody ran into this exact problem, but they never bothered to post a solution.
The source you are looking at is not the code running in the production environment
The condition of any backup is unknown until a restore is attempted
That completely automated solution you set up requires at least three manual steps you didn't document.
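One way to take the guesswork out of the backup item above is to rehearse the restore on a schedule. The following is a minimal sketch, not a definitive implementation: it assumes the backup is a tar archive of a single directory and that the live directory is still available for comparison. The names file_digests and verify_backup are hypothetical helpers invented for this illustration.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root: Path) -> dict[str, str]:
    """Map each file's path (relative to root) to its SHA-256 digest."""
    return {
        str(p.relative_to(root)): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def verify_backup(source_dir: Path, archive: Path) -> bool:
    """Restore the archive into a scratch directory and compare it to the source.

    Assumption: the archive was created with paths relative to source_dir,
    for example tar.add(source_dir, arcname=".").
    """
    with tempfile.TemporaryDirectory() as scratch:
        with tarfile.open(archive) as tar:
            # Only extract archives you created yourself; untrusted tar
            # files can contain unsafe paths.
            tar.extractall(scratch)
        return file_digests(Path(scratch)) == file_digests(source_dir)
```

In practice the source usually changes after the backup is taken, so a real check would more likely compare the restored files against a digest manifest recorded at backup time rather than against the live directory.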
Help Us Improve!
If you have any suggestions to improve this checklist, please let us know by filling out