Sunday, July 11, 2021

IT Accountability: Avoiding Murphy

Amongst technology experts, Murphy is someone we all try to avoid.  Murphy's Law states "Anything that can go wrong, will".  Every IT person has met Murphy, regardless of how hard they try not to meet him.  We try to avoid Murphy through many methods; some work and some don't.  Recently I've been doing a lot of research on a certain type of historical event that should give all IT professionals some tools to reduce their chances of running into the problematic Mr. Murphy.  Even if history does not directly relate to the IT field, we can still learn from it.  (I promise to keep the history lessons to a minimum!)

In the middle of the night on Saturday, April 26, 1986, a team was executing a planned testing event.  The plan was flawed, the test went wrong, and this led to the worst nuclear power accident in human history: the accident at the No. 4 reactor at the Chernobyl Nuclear Power Plant.  Years after the accident, once the mystery and misinformation around it had lifted, it became clear that many failures had occurred at the human and process levels, even setting aside the underlying nuclear technology issues.

Lesson 1: At the start, the test could not be conducted during the day by the trained day-shift experts due to outside political influence.  This meant the test was conducted by engineers who were not properly trained for it.  The "senior engineer" turned out to be in his 20s with limited experience.

Lesson learned: All IT shifts should be equally trained for all actions, or at a minimum fully trained for the actions they will be performing.

Lesson 2: The engineers on the night shift were told at the start of their shift that they would be running this test, with no prior notification.  When they looked at the playbook, they found that the steps were not clear, and at times they had to guess what the right thing to do was.

Lesson learned: Playbook accuracy is critical, as are dry runs through those playbooks.  Every IT change should have a playbook that is peer reviewed, as well as reviewed by the executors, before the change event.

Lesson 3: As the engineers executed their steps, things were not going to plan in a major way: the reactor was not behaving as predicted (this was the start of the loss of control).  Despite the unexpected results, the team (including the plant supervisor) kept pushing forward in the playbook.  When they did...well...it went BOOM, and the results are now known around the world as the Chernobyl disaster.

Lesson learned: If your playbook is not going as planned, stop.  The playbook should be detailed enough to show what success and failure look like at each step, along with a timeline that should be adhered to.  If something isn't going to plan, stop and regroup with the team to determine how to proceed.  Never be afraid to escalate.  Never be afraid to roll back.
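
To make that concrete, here is a minimal sketch of what a playbook step could look like when it carries explicit success criteria, a time budget, and a rollback action.  The structure, field names, and checks below are hypothetical illustrations, not any specific tool or standard:

# Hypothetical sketch: a playbook step that knows what success looks like,
# how long it should take, and how to undo itself.
from dataclasses import dataclass
from typing import Callable, List
import time


@dataclass
class PlaybookStep:
    name: str
    execute: Callable[[], None]   # the action itself
    verify: Callable[[], bool]    # True only if the step clearly succeeded
    rollback: Callable[[], None]  # how to undo the step
    time_budget_sec: float        # how long the step is allowed to take


def run_playbook(steps: List[PlaybookStep]) -> bool:
    completed: List[PlaybookStep] = []
    for step in steps:
        start = time.monotonic()
        step.execute()
        elapsed = time.monotonic() - start
        # If the result is not what the playbook predicted, or the step ran
        # long, stop pushing forward: roll back what was done and regroup.
        if not step.verify() or elapsed > step.time_budget_sec:
            print(f"Step '{step.name}' did not go to plan; stopping and rolling back.")
            for done in reversed(completed):
                done.rollback()
            return False  # escalate, regroup, reschedule
        completed.append(step)
    return True

The point is not the code itself; it is that "did this step succeed?" and "how do we back out?" are answered before the change window opens, not improvised in the middle of the night.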

Lesson 4: During and after the accident, people were sent to investigate what was going on.  When those people reported back, their reports were deemed "impossible" and were ignored.  As a result, the work being done either accomplished nothing or made the situation worse.

Lesson learned: During a change, implementation, etc., everyone matters and everyone should be listened to.  Different people have different expertise and experience.  If someone says "this doesn't look right to me", the group should evaluate the concern as a whole and either confirm that it is indeed a cause for concern or prove that it is not.

Lesson 5: During the accident investigation it was determined that one of the many contributors to the accident was a design flaw that was not shared with everyone (for political/secrecy reasons).  Had that flaw been known, the engineers could have acted differently and the accident might never have happened.

Lesson learned: Everything in IT is a team sport.  Everyone must share and communicate with everyone else; never keep a potential cause for concern to yourself.  Document and share every issue you find in any system.  You never know when a documented flaw may cause some random problem, and without that shared knowledge someone may spend hours or days digging to find a fix.

Lesson 6a: You can't escape the inevitable, so plan for it.  After the loss of control, engineers were scrambling to do virtually anything to regain control of the plant.  Some actions were futile because of how bad the situation was.  Nature had taken over and practically said "you didn't follow my laws, so now I'm running the show".  There was no plan for "what do we do when the reactor explodes" because such a thing was never conceived.  There were also no plans for any of the smaller issues that led up to the explosion.

Lesson learned: When assessing the risk of a change or action, always consider "what happens if this fails" for every step along the way, and consider it again at the end of the playbook.  Know exactly what to do, even if the answer is "stop and call for help".  Murphy is hiding around every corner of the IT world, and just like Nature, he demands payment.  It is our job to hold him at bay for as long as possible with control processes.  (to be continued in lesson 6b...)
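
One simple way to enforce that habit is a pre-flight check that refuses to start a change unless every step documents what to do when it fails.  The sketch below is purely illustrative; the step names and field names are made up for the example:

# Hypothetical pre-flight gate: no failure plan, no change window.
def preflight_check(steps):
    """Return a list of problems; an empty list means the change may start."""
    problems = []
    for step in steps:
        if not step.get("on_failure"):
            problems.append(f"Step '{step.get('name', '?')}' has no failure plan")
    return problems


change_steps = [
    {"name": "drain traffic from node", "on_failure": "re-enable node, stop and call for help"},
    {"name": "apply firmware update", "on_failure": ""},  # Murphy's favorite kind of step
]

issues = preflight_check(change_steps)
if issues:
    print("Change is not ready to execute:")
    for issue in issues:
        print(" -", issue)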

Fast forward to March 11, 2011.  At 14:46 a massive earthquake occurred off the coast of Honshu, Japan's main island, creating an enormous tsunami.  This natural event kicked off a chain of failures that led to the second-worst nuclear accident in history: the meltdown of multiple reactors at the Fukushima Daiichi Nuclear Power Plant.

Lesson 6b: The triggering event, a tsunami of that size, was "never conceived" as a potential threat (more on that next).  Murphy, or in this case Nature, struck again.  The tsunami destroyed the backup power systems that were critical for keeping the plant safe.  The designers thought they were covered, but they did not think outside the box, and Nature won.  There were not enough safeguards to keep the system safe in the event of a failure, and the result was the accident.

Lesson learned: Just like at Chernobyl, "what happens if this fails" was again not considered.  When we as IT professionals think about "what happens if this fails", we should also be thinking "what happens when this fails".  If we can plan for failure, then we can plan for avoiding failure.

Lesson 7: In the investigations that followed, it came to light that the plant had design flaws that had been called out by various review boards.  Once again, the flaws were not acted on due to political/secrecy reasons.  Had those reports been taken seriously, the accident might never have happened.  Here's the problem (and it was a problem with Chernobyl too): accountability.  The plant at Chernobyl was owned, operated, and reviewed by the USSR government, resulting in a large conflict of interest and no accountability.  The plant at Fukushima was owned and operated by a company that the Japanese government had a large stake in, and the government was the auditing organization; thus enters the conflict of interest.  In the end, with both of these accidents, the plants were really accountable only to Nature.

Lesson learned: This is why in IT we have change reviews and change boards, which should be led by a party independent of the change itself.  That group is the outside reviewer that provides oversight and the approval to execute for every change, ensuring that risk has been properly assessed, processes are being followed, and so on.  If we don't do change reviews, that's when Murphy starts getting hungry for blood.  Inevitably someone will skip an SOP, go off book, or not do a proper risk assessment.  That will lead to an accident that could have any number of bad results.
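
Even a trivial tooling check can encode that independence rule.  The sketch below is purely hypothetical (the field names and people are made up), but it captures the idea that approval cannot come from the people making the change:

# Hypothetical sanity check on a change record: approval must come from
# someone who is not also implementing the change.
def has_independent_approval(change):
    implementers = set(change["implementers"])
    approvers = set(change["approvers"])
    return bool(approvers) and approvers.isdisjoint(implementers)


change_request = {
    "summary": "replace core switch firmware",
    "implementers": ["alice", "bob"],
    "approvers": ["carol"],  # change board member, not part of the work itself
}

print("Approved to proceed:", has_independent_approval(change_request))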

The point of this article is not to show how safe or dangerous nuclear power is.  If that's what you took away from this, go back and re-read it.  My goal was to use the history of disasters to explain how we can minimize risk or avoid a mistake altogether.  As technologists we can and should learn not just from the history of IT (which has not been around for all that long), but also from the broader history of humanity and nature.  IT professionals have the capability to cause accidents so damaging that their employer could go out of business.  If we can delay our meeting with Murphy, or avoid him altogether through the use of risk-avoidance techniques, it makes life as an IT professional much better!

1 comment:

  1. Some good points and well put. Thank you for sharing. When I saw “Avoiding Murphy” my mind immediately went to scheduled maintenance, I’m surprised you didn’t touch on that. I’ve seen what happens when management is more concerned about budget cuts than why a certain interval is suggested for bulb or filter replacement. The failures are much more costly than the savings from not doing what was recommended.

