Detailed checklist | Incorrect DevOps

A checklist to assess whether you got DevOps wrong. It helps you understand how far along you are in implementing a correct and scalable DevOps practice in your organization.

Tooling

Is email your monitoring and alerting mechanism?

There is no monitoring solution, and you wait for your users to tell you that your service is unresponsive.
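As a contrast to waiting for users to report an outage, here is a minimal sketch of an active health probe. The function name `is_healthy`, the `opener` parameter, and the idea of a health endpoint are assumptions made for illustration; they do not come from the checklist, and a real setup would normally rely on a dedicated monitoring system rather than a hand-rolled script.

```python
import urllib.error
import urllib.request


def is_healthy(url, timeout=5.0, opener=urllib.request.urlopen):
    """Return True if the endpoint answers with an HTTP 2xx status in time.

    `opener` is injectable so the check can be exercised without a network.
    """
    try:
        with opener(url, timeout=timeout) as response:
            return 200 <= response.status < 300
    except (urllib.error.URLError, OSError):
        # Covers connection refused, DNS failure, timeouts, and HTTP errors.
        return False
```

A scheduler could call `is_healthy` against a hypothetical health URL every minute and page the on-call engineer after a few consecutive failures, instead of relying on an inbox.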

Does your infrastructure use a lot more self-signed certificates than you think?

For any N applications, at most N/2+1 use the same certificate bundle
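One way to find out how fragmented your certificate bundles really are is to fingerprint them. The sketch below groups applications by the SHA-256 digest of their bundle contents. The function name `group_by_bundle` and the input shape (application name mapped to the raw bytes of its bundle file) are assumptions for illustration. The comparison is byte-for-byte, so two bundles holding the same certificates in a different order would count as different.

```python
import hashlib
from collections import defaultdict


def group_by_bundle(bundles):
    """Group application names by the SHA-256 digest of their bundle bytes.

    `bundles` maps an application name to the raw bytes of its bundle file.
    """
    groups = defaultdict(list)
    for app, bundle_bytes in bundles.items():
        digest = hashlib.sha256(bundle_bytes).hexdigest()
        groups[digest].append(app)
    return dict(groups)
```

The number of keys in the result is the number of distinct bundles in use; the longest list shows how many applications actually share one.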

You end up using shell for "complex stuff" because it's easier that way

Your /etc/hosts is cluttered with all sorts of ad hoc entries

Culture

The person who knows the script or procedure to resolve an issue is on vacation when you need them

You often hear - "We've always done it this way."

"Prod" is just another name for "staging".

Nobody knows what exactly it is you do.

Leadership

If a post-mortem follow-up task is not picked up within a week, it's unlikely to be completed at all.

Your quarterly planning has no meaning when the next re-org rolls around.

Most of your actual work is not covered by your OKRs
To read more about OKRs, see https://www.netmeister.org/blog/okr-distractions.html

Management will always happily spend $$$ on outside consultants to tell them what you've been saying for years.

Management will much rather invest in inventing a new, square wheel than fixing an old round one.

Security

Your restricted shells are not as restricted as you think

Your network team has a way into the network that your security team doesn't know about.

Confidence

Do you think "One in a million is next Tuesday"?
You can read about "One in a million is next Tuesday" here: https://docs.microsoft.com/en-us/archive/blogs/larryosterman/one-in-a-million-is-next-tuesday
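The arithmetic behind the saying is easy to check. Assuming independent events that each fail with probability p, the chance of seeing at least one failure in n events is 1 - (1 - p)^n. The function name and the volumes below are hypothetical, chosen only to illustrate the point.

```python
def chance_of_at_least_one(p, n):
    """Probability of at least one occurrence in n independent trials."""
    return 1.0 - (1.0 - p) ** n


# Hypothetical volumes: one million and ten million events.
print(round(chance_of_at_least_one(1e-6, 1_000_000), 3))   # about 0.632
print(round(chance_of_at_least_one(1e-6, 10_000_000), 5))  # about 0.99995
```

At a million events, a one-in-a-million failure is more likely than not to show up; at ten million, it is close to certain.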

Quite often, you see "human error" as the root cause

Ownership

Your culture is - "If you break it, you own it - for now; if you fix it, you own it - forever."

You cannot turn things off permanently because there is no one who can "approve" it.
And so they keep lingering around

The most critical services are maintained by a handful of people, and nobody else dares to go near them

You're bound by the CAP theorem much more often than you may think. Halting Problem's a bitch, too.
Read more about the CAP theorem at https://en.wikipedia.org/wiki/CAP_theorem and about the halting problem at https://en.wikipedia.org/wiki/Halting_problem

Documentation

Your runbook for fixing things says - "Turn it off and on again, and you should see it working."

The only documentation that exists is a README.md file

A document points to another document, which in turn takes you to yet another document
And this goes on until you give up

The document is marked obsolete, with no reference to a replacement
And the current document, if one exists, is nowhere to be found

Somewhere, somebody ran into this exact problem, but they never bothered to post a solution.

Code

The source you are looking at is not the code running in the production environment

The condition of any backup is unknown until a restore is attempted
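A restore drill makes the condition of a backup known before you need it. The sketch below restores a tar archive into a scratch directory and compares file digests against the source tree. The function names are hypothetical, and it assumes a trusted archive whose members are stored relative to the source root. Since live data changes after a backup is taken, a real drill would more likely compare against a digest manifest recorded at backup time.

```python
import hashlib
import tarfile
import tempfile
from pathlib import Path


def file_digests(root):
    """Map each file path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        str(path.relative_to(root)): hashlib.sha256(path.read_bytes()).hexdigest()
        for path in sorted(root.rglob("*"))
        if path.is_file()
    }


def restore_matches_source(archive_path, source_dir):
    """Restore the archive into a scratch directory and compare with source_dir."""
    with tempfile.TemporaryDirectory() as scratch:
        # Only extract archives you created yourself; tar members are trusted here.
        with tarfile.open(archive_path) as archive:
            archive.extractall(scratch)
        return file_digests(scratch) == file_digests(source_dir)
```

Running a check like this on a schedule turns "we have backups" into "we have restores".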

That completely automated solution you set up requires at least three manual steps you didn't document.

Sources

https://www.netmeister.org/blog/ops-lessons.html
