
Disappointed that one of the lessons was not “don’t deploy on Friday and then immediately run out the door.” I know most of you will say that that shouldn’t be an issue if you have proper tests, devops, etc., but this type of thing is the reason that Ops usually controls the releases. Yes, I know, Ops is obsolete and can go suck a lemon, but it’s stuff like this that shows the wisdom of the older ways.

So yeah, be nice to Ops too, because they actually have experience in stuff like this and one weekend of downtime is not an appropriate price to pay for every developer to learn a lesson.



Actually, what I learned is that being afraid to deploy on Friday means you're lacking in testing, verification, auto healing and rollback processes.

Also, if something goes down in a way that requires a human to work on the weekend, it should result in a postmortem, and all of the components in the deployment chain related to the failure should be evaluated, with new tasks to fix their causes. If it happens multiple times, all project work should stop until it's fixed.

This of course is balanced against how much failure your business can tolerate. If the service goes down and nobody loses money, do you really need your engineers working overtime to fix it?
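
Concretely, the rollback piece doesn't have to be elaborate. A minimal sketch of a deploy gate, assuming a health endpoint and treating the actual deploy/rollback steps as placeholders for whatever your platform provides:

    import time
    import urllib.request

    HEALTH_URL = "https://example.com/healthz"  # assumed health endpoint

    def healthy(retries=5, delay=10):
        # Poll the health endpoint; anything but HTTP 200 counts as unhealthy.
        for _ in range(retries):
            try:
                with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                    if resp.status == 200:
                        return True
            except OSError:
                pass
            time.sleep(delay)
        return False

    def deploy(version):
        ...  # placeholder: push the new build with your platform's deploy step

    def rollback(version):
        ...  # placeholder: redeploy the previous known-good build

    def release(new_version, previous_version):
        deploy(new_version)
        if not healthy():
            rollback(previous_version)
            raise RuntimeError("deploy failed health check; rolled back")

If that gate runs on every deploy, a Friday deploy isn't scarier than a Tuesday one.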


> Actually, what I learned is that being afraid to deploy on Friday means you're lacking in testing, verification, auto healing and rollback processes.

Or being afraid to deploy last thing on Fridays is an admission that maybe... just maybe... you're not infallible.


I always assumed that if a deploy was scheduled for Friday, it meant we were all working the weekend because no one had confidence it would go well.


My team deploys probably 5 times in a given day, including Friday. They are all small deploys, and none can happen at the end of the day. If shit hits the fan, we roll back, and maybe people figure out the root cause over the weekend, but they aren't sweating bullets.


I guess all the people commenting are much more competent than the people I've worked with.

Even with tests, I've seen startups double-charge accounts and whatnot after deployments, and it didn't show up until the next day. I've also seen ops people update OES in a way that made the storage service segfault a day later. How do DevOps and OES go together in one sentence, you ask? They don't, but it just means not all ops people are pure wisdom either. The guy caused others to waste 72 hours of compute resources because of it. So it's not limited to Dev. And yes, the first company did learn from the double-charging bug, but why learn on a Saturday?

So even if your DevOps practices are amazing and you have 70% test coverage on all your components, that doesn't mean you can't deploy faulty components where the deployment itself appears successful. Now what? Things aren't failing, they appear just fine. Someone has to go in and debug the problem, it may affect multiple components, it may be critical, and a simple rollback may not cut it.

Friday deployments are fine for certain components, but surely not as a general rule for everything? Friday deployments are like Monday morning or Friday meetings. You can do them, and most of the time they'll be fine, but maybe out of respect for your colleagues you shouldn't anyway.


I've worked in places that required a checklist worthy of NASA to deploy software, and also had blackout periods where no changes could be made to the system without an executive team sign-off[1]. The thought of any of them deploying 5 times a day is just beyond anything they would be allowed to do. I would expect most enterprises are closer to event-style deploys than to rapid, multiple deployments.

With that type of "event"-style deployment, weeknight deploys probably are safe, but anything scheduled for Friday is trouble.

[1] Common in agriculture during harvest season.


> Actually, what I learned is that being afraid to deploy on Friday means you're lacking in testing, verification, auto healing and rollback processes.

Our philosophy is that if nothing ever breaks in production, you are being too conservative with your controls and development. Or, to look at it another way: you can allocate resources toward stability or toward new features, and (near) 100% testing/verification/auto-healing/rollback coverage means too much of your resources go to stability and not enough to new features. Running a service with uptime too close to 100% also causes pathologies in downstream services, and if you never have to fix anything manually, the skills you need to fix things manually will atrophy.

Or, for our service,

- There should be a pager with 24 hour coverage, because our service is critical,

- That pager should receive some pages but not too many, so operations stays sharp but not burdened,

- Automation and service improvements should eliminate the sources of most pages, and new development should create entirely new problems to solve,

- If the service uptime is too high, it should be periodically taken down manually to simulate production failures, and development controls should be reevaluated to see if they are too restrictive.

Eliminating all the production errors takes a long time and a lot of effort. Yes, we are spending that effort, but the only way that this process will actually “finish” is if the product is dead and no more development is being done. The operations and development teams can then be disbanded and reallocated to more profitable work. A healthy product lifecycle, in general (though not in every case), should see production errors up until the team is downsized to just a couple of engineers doing maintenance.

Google calls this an "error budget". We have something similar where I work. https://landing.google.com/sre/book/chapters/embracing-risk....

You can phrase it as “afraid to deploy on Friday”, but I think “afraid to cause outages in production” indicates that the blast radius of your errors is too large or that you’re being too conservative.
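
To give a feel for the arithmetic, here's a rough sketch of an error budget check; the 99.9% SLO and 30-day window are made-up numbers for illustration, not our real targets:

    SLO = 0.999                      # assumed availability target
    WINDOW_MINUTES = 30 * 24 * 60    # 30-day rolling window

    def error_budget_remaining(downtime_minutes):
        # Budget is the downtime the SLO allows: ~43 minutes/month at 99.9%.
        budget_minutes = (1 - SLO) * WINDOW_MINUTES
        return 1 - downtime_minutes / budget_minutes

    def can_ship_risky_changes(downtime_minutes):
        # Spend the budget on velocity; freeze risky deploys once it's gone.
        return error_budget_remaining(downtime_minutes) > 0

    print(can_ship_risky_changes(20))   # True: budget left, keep shipping
    print(can_ship_risky_changes(50))   # False: freeze and work on stability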


Since that pager seems to prevent sleep (24-hour coverage receiving some pages but not too many), it mustn't be very popular among employees.

I prefer to keep my midnight emergencies to a minimum.


> Since that pager seems to prevent sleep…

The product has 24/7 pager coverage, but that does not mean that one person has the pager the whole time! At any given time the pager is covered by two or three people in different time zones. The way my team is structured, I will only get paged after midnight if someone else drops the page. And I only have a rotation for one week every couple months or so.

There are definitely employees who don’t enjoy having the pager, but we get compensated for holding the pager with comp time or cash (our choice). The comp time adds up to something like 3 weeks per year, and yes, there are people who take it all as vacation. No, these people are not passed over for promotions. No, this is not Europe.

So the trade off is that seven weeks a year you carry your laptop with you everywhere you go, maybe do one or two extra hours of work those weeks, and don’t go to movies or plays, and then you get three extra weeks off. Yes, it's popular. People like pager duty because they get to spend extra time with their families, because they like to go camping, or because they want the extra cash.

I have once been paged after midnight.


> People like pager duty because they get to spend extra time with their families, because they like to go camping, or because they want the extra cash.

Adequately compensating on-call is, of course, the right way to do it. All sorts of considerations that were otherwise problems, such as how to ensure a "fair" rotation, magically go away [1].

Unfortunately, it's vanishingly rare, at least among "silicon valley" startups (and maybe all tech companies). I suspect it's one of those pieces of Ops wisdom that's vanished from the startup ecosystem because Ops, in general, is viewed as obsolete, especially by CTOs who are really Chief Software Development Officers.

Insofar as it's a prerequisite to all your other suggestions, it makes them non-starters in such companies.

[1] Although I suppose if the compensation is too generous, there may still end up being complaining about unfairness in allocation


This is not how I usually expect a company to handle pager duty. Nice work.


Interesting philosophy and principles (e.g. don't let "manual fixing skills" atrophy). Definitely something to consider. Thanks for sharing


> Actually, what I learned is that being afraid to deploy on Friday means you're lacking in testing, verification, auto healing and rollback processes.

I worked in a shop like that -- they had such great testing policies that they did continuous deployment: code went from commit to production as soon as the tests passed.

Until the holiday weekend when two code changes had an unexpected interaction and ended up black-holing all new customer activity that weekend. (Existing customers were fine; they only lost data for new customers.)

They could have recovered the data from a log on the front-end servers, but one of the admins noticed an unusual amount of disk space used on the front ends Monday morning and just replaced them all (since their auto-healing allowed this without any interruption of service)... and since those logs were only used for debugging problems, they weren't persisted anywhere.

It turns out that tests aren't perfect - they only test what you think you need to test.

> If the service goes down and nobody loses money, do you really need your engineers working overtime to fix it?

Money is not the only way to value a service.... but if the service goes down and no one cares, why run the service at all?


Well ... there are different kinds of deploys with different kinds of risks.

I usually don’t like a blanket don’t deploy on Friday rule.

We can usually rollback with one command easily, have good monitoring and health checks so even though something makes it into prod, it’s super easy to go back.

Unless you have changes like you mentioned: weird side effects, db schema changes, config changes that affect machine configuration. Those are unknown-unknown changes. Good practice to hold them back.

As for web and asset changes, a tweak to a CSS file or a self-contained JS change should only redeploy the files that changed and is generally low risk.
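
For what it's worth, the one-command rollback doesn't need much machinery. One common approach is to keep the last few releases on disk and repoint a symlink; a rough sketch, with made-up paths:

    import os

    RELEASES_DIR = "/srv/app/releases"   # e.g. timestamped release directories
    CURRENT_LINK = "/srv/app/current"    # what the web server actually serves

    def activate(release):
        # Atomically repoint the 'current' symlink at the chosen release.
        tmp = CURRENT_LINK + ".tmp"
        os.symlink(os.path.join(RELEASES_DIR, release), tmp)
        os.replace(tmp, CURRENT_LINK)    # renaming over the old link is atomic on POSIX

    def rollback():
        # Point 'current' back at the newest release that isn't the active one.
        releases = sorted(os.listdir(RELEASES_DIR))
        active = os.path.basename(os.path.realpath(CURRENT_LINK))
        previous = [r for r in releases if r != active][-1]
        activate(previous)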


Please tell us about this mythical company whose processes are so airtight, that bad code is never deployed.


Oh no no, bad code would be deployed all the time =) We just didn't notice it, because auto healing brought back the working site so quickly. Occasionally something would break design parameters and that would require a manual fix, but then that manual fix was added into auto healing...

It takes a good deal of design work to get a high level of resiliency, but it's completely within the realm of possibility. Most shops just don't dedicate the effort to it, because they're more worried about shipping new features, and this is understandable. Just different priorities.
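
And the first version of that auto healing can be embarrassingly simple. A toy sketch of the loop (probe, restart, and page a human only if the restart didn't help); the restart and paging steps are placeholders:

    import time
    import urllib.request

    def is_healthy(instance):
        # Probe the instance's health endpoint (the path is an assumption).
        try:
            with urllib.request.urlopen(f"http://{instance}/healthz", timeout=5) as r:
                return r.status == 200
        except OSError:
            return False

    def restart(instance):
        ...  # placeholder: recreate the container/VM via your platform

    def page_oncall(message):
        ...  # placeholder: escalate to a human only when self-healing fails

    def heal_loop(instances, interval=30):
        while True:
            for inst in instances:
                if not is_healthy(inst):
                    restart(inst)
                    if not is_healthy(inst):
                        page_oncall(f"{inst} still unhealthy after restart")
            time.sleep(interval)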


+1. Blocking deploys on a Friday is a symptom of a lot of room to grow your operational maturity: the on-call person should be on call and able to handle most issues (in this case, probably a rollback), and that's if your tests didn't block the deploy in the first place.

Code freezes (and that's what blocking deploys are) are a great tool, but primarily for managing your on-call more effectively.


This is what I was thinking. If one person is able to break a system with a small mistake, then processes are at least as much to blame as the person. Mistakes will always happen because programmers are fallible, so a deployment process designed around infallibility is destined to fail.


The lesson was 'be kind'.


In an ideal world you'd never deploy on a Friday, but we've probably all done it.


We have a company policy to never deploy on Friday, and only under ideal circumstances on Thursday. This protects everyone from unneeded overtime and panicking clients during off hours.

It's one of the central policies for ensuring happy clients and smooth-running operations, and it gets regular review and questions from clients when they're in a hurry. But when it was implemented, stress levels across the board plummeted, and only a small amount of client education was needed before they agreed it was a good policy.

There are emergency circumstances that override this rule of course.


Deploying on a Friday is fine for many businesses - but it becomes riskier if you are leaving to a remote location with no internet access.


Even if your automation is great, there’s often a series of events that goes like this:

    Error discovered -> Call person responsible -> Roll back
It’s not just about whether you have the ability to fix the error, it’s about whether deploying on Friday is likely to disrupt people’s personal lives. Or, put another way, it’s not kind to deploy on Friday, it’s selfish. It looks good if you push out features quickly, and if it blows up someone else has to take time out of their weekend. On my team ops controls releases and if you miss the Thursday build you’re not getting anything in production until Monday.


A weekend of downtime?

If you deploy on Friday, run out the door, and soon find out that your contribution to a deployment caused an outage, wouldn't you immediately return to work to at least give the appearance of personal responsibility? (on any day of the week, even)

Also, wouldn't you just do a rollback to the last viable build?


I guess you didn’t read the post. “Deploy then run out the door and drive 3 hours out into the woods without your laptop.”




