March 19, 2013
The Devil Is In The Details
by David Wagner
If highly trained NASA scientists and engineers can make costly mistakes, your project team can too. Avoid disaster with six project management lessons from a real catastrophe.
Movies about space missions that result in catastrophe can teach us a lot about how not to manage a project (the “successful failure” of Apollo 13 comes to mind). Yet there are actual space mission catastrophes — the loss of the 1999 Mars Climate Orbiter (MCO), for example — that also offer valuable lessons in preventing fundamental mistakes.
The MCO was the major part of a $328 million NASA project intended to study the Martian atmosphere and to act as a communications relay station for the Mars Polar Lander. Famously, after a nine-month journey to Mars, the MCO was lost on its attempt to enter planetary orbit. The spacecraft approached Mars on an incorrect trajectory and was believed either to have been destroyed or to have skipped off the atmosphere into space. The big question naturally was: What caused the loss of the spacecraft?
After months of investigation, the primary cause came down to the difference between the units of output from one software program and the units of input required by another. How, the media asked, could one part of the project produce output data in English measurements when the spacecraft navigation software was expecting to consume data in metric?
Those of us involved in expensive and high-risk projects would ask a similar question: How could this happen? What follows are a few findings from the Executive Summary of the Mars Climate Orbiter Mishap Investigation Board (MCO MIB), with lessons for us all.
- The root cause of the loss of the MCO spacecraft was the failure to use metric units in the coding of a ground software file used in trajectory models. Specifically, thruster performance data in English units were used instead of metric units in the software application code.
- An erroneous trajectory was subsequently computed using this incorrect data. This resulted in small errors being introduced in the trajectory estimate over the course of the nine-month journey.
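To make the finding concrete, here is a minimal sketch of one way software can refuse to mix units silently. It is an illustration only, not the MCO ground software: the Impulse class, the accumulate_impulse function, and the choice of pound-force seconds versus newton-seconds as the two units are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Iterable

# Newton-seconds per pound-force second (1 lbf = 4.4482216152605 N).
LBF_S_TO_N_S = 4.4482216152605


@dataclass(frozen=True)
class Impulse:
    """Hypothetical unit-tagged value; always stored in newton-seconds."""
    newton_seconds: float

    @classmethod
    def from_newton_seconds(cls, value: float) -> "Impulse":
        return cls(value)

    @classmethod
    def from_pound_force_seconds(cls, value: float) -> "Impulse":
        # Convert once, at the boundary where the English-unit data enters.
        return cls(value * LBF_S_TO_N_S)


def accumulate_impulse(events: Iterable[Impulse]) -> float:
    """Sum thruster impulses in newton-seconds, rejecting bare numbers."""
    total = 0.0
    for event in events:
        if not isinstance(event, Impulse):
            raise TypeError("expected a unit-tagged Impulse, got a bare value")
        total += event.newton_seconds
    return total
```

With this shape, a producer working in English units has to call from_pound_force_seconds explicitly, and a bare number passed to accumulate_impulse fails loudly instead of being treated as metric.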
That erroneous trajectory was the difference between a successful mission and failure. Lockheed Martin Astronautics, the prime contractor for the Mars craft, claimed some responsibility, stating that it was up to its engineers to ensure that the metric units used in one computer program were compatible with the English units used in another program. The simple conversion check was not done. “It was, frankly, just overlooked,” said the company’s spokesman.
Just overlooked? Those of us in project management know that large-scale projects require the ability to see not only the big picture — the goals and objectives of the project — but also the details.
While not as prominent as space exploration, insurance software development also has millions of dollars at stake. Insurance products can be very complex, and the interactions required in business systems along with the calculations involved are all critical to producing accurate results.
Errors in the way in which calculations are derived can produce problems ranging from failure to comply with the company’s obligations under its filings to loss of revenue. Even apparently simple matters such as whether to round up or down on a calculation can have profound impacts on a company’s bottom line.
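A small sketch shows how a rounding choice can add up. The premium amount, the policy count, and the function names below are hypothetical figures chosen for illustration. The only real library behavior relied on is Python's decimal module and its ROUND_HALF_UP and ROUND_DOWN modes.

```python
from decimal import Decimal, ROUND_DOWN, ROUND_HALF_UP

CENT = Decimal("0.01")


def round_to_cents(amount: Decimal, mode: str) -> Decimal:
    """Round a currency amount to two decimal places using the given mode."""
    return amount.quantize(CENT, rounding=mode)


def rounding_gap(amount: Decimal, policy_count: int) -> Decimal:
    """Total difference between rounding half-up and truncating, per billing cycle."""
    rounded_up = round_to_cents(amount, ROUND_HALF_UP)
    rounded_down = round_to_cents(amount, ROUND_DOWN)
    return (rounded_up - rounded_down) * policy_count


# Hypothetical example: a calculated premium of 104.165 across one million policies.
gap = rounding_gap(Decimal("104.165"), 1_000_000)
```

In this made-up case the two rounding rules differ by one cent per policy, which comes to 10,000.00 per billing cycle across the book. Which rule is correct depends on the company's filings, and that is exactly the kind of detail a specification should state outright.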
Although the failure to address the difference between English and metric measurements was identified as the root cause of the problem with the MCO, the real issue at hand is what caused that failure. How was it missed?
Taking a project management perspective requires asking the question, “Why?” Why was a key element overlooked? What led an experienced team to miss a crucial detail?
In the search for answers, it’s interesting to look deeper inside the report by the Mars Climate Orbiter Mishap Investigation Board (MCO MIB). In addition to the root cause of failure to use standard units of measurement across the entire project, the report found a series of other issues that also contributed to the catastrophe. The following are other lessons of the MCO mission and how they can be applied more widely to project management.
Lack of shared knowledge. The operations navigation team was not familiar enough with the attitude control systems on the spacecraft and did not fully understand the significance of errors in orbit determination. This made it more difficult for the team to diagnose the actual problem they were facing.
It is likewise common for insurance software projects to have mutually dependent complex areas — for example, between the policy administration system and the billing system. If one team does not fully understand the needs of the other, there can be costly gaps in understanding.
The MCO MIB recommended comprehensive training on the attitude systems, face-to-face meetings between the development and operations teams, and the addition of attitude control experts to the operations navigation team. Similarly, face-to-face meetings between the policy experts and the billing experts, and between the business side and the technology side, will go a long way toward a successful project. In the world of e-mail and instant messaging, I think all of us spend less time face to face. Nonverbal cues are often said to account for about 60% of our communication and are often very helpful; we get none of them when we rely on electronic communication.
Lack of contingency planning. The team did not take advantage of an existing Trajectory Correction Maneuver (TCM) that might have saved the spacecraft, since they were not prepared for it. The MCO MIB recommended that there be proper contingency planning for the use of the TCM, along with training on execution and specific criteria for making the decision to employ the TCM.
Contingencies matter in insurance software development too. Strong project management identifies project risks and plans a contingency for each. Contingency plans are important at every stage: during development, during implementation, and once the system is live. Issues must be dealt with rapidly and effectively, since they affect the entire business. Regular reviews of the contingency plans are also useful.
Inadequate handoffs between teams. Poor transition of the systems’ engineering process from development to operations meant that the navigation team was not fully aware of some important spacecraft design characteristics.
In complex insurance software projects, there are frequent handoffs to other teams, and the transition of knowledge is a critical piece of this process. These large, complex projects should have a whole team dedicated to ensuring knowledge transfer occurs. No matter how good the specifications, once again, it’s vital to get face to face.
Poor communication among project teams. The report stated there was poor communication across the entire project. This lack of communication between project elements included the isolation of the operations navigation team (including lack of peer review), insufficient knowledge transfer, and failures to adequately resolve problems using cross-team discussion. As the report further notes:
“When conflicts in the data were uncovered, the team relied on e-mail to solve problems instead of formal problem resolution processes. Failing to adequately employ the problem-tracking system contributed to this problem slipping through the cracks.”
This area had one of the largest sets of recommendations from the MCO MIB, including formal and informal face-to-face meetings, internal communication forums, independent peer review, elevation of issues, and a mission systems engineer (aka a really strong program or project manager) to bridge all key areas. Needless to say, this kind of communication is a critical part of any insurance software project, and these lessons are easily applied. Zealously hold project reviews (walk-throughs). Do them early and often. The time spent will pay you back with success.
The Operations Navigation Team was inadequately staffed. The project team was running three missions simultaneously — all of them part of the overall Mars project — and this diluted their attention to any specific part of the project. The result was an inability of the team to effectively monitor everything that required their attention.
Sound familiar? We just experienced this on a software implementation project where the software vendor outsold its capacity to be successful. Projects are expected to run lean because of cost considerations, but it’s always important to ensure that staff is not stretched to the point of compromising the project.
There was a failure in the verification and validation process, including a failure to apply the software standards that were supposed to govern the work. As the MCO MIB noted:
“The Software Interface Specification (SIS) was developed but not properly used in the small forces ground software development and testing. End-to-end testing to validate the small forces ground software performance and its applicability to the specification did not appear to be accomplished.”
Every project manager will recognize the need to stick to protocol and agreed-upon processes during a software project. Ensuring that project team members know the project/system specifications and standards is essential to successful project delivery.
And so, the devil is in the details. My career in and around insurance technology has spanned three decades now. While I have learned much, two things are abundantly clear:
- There is no substitute for really good project management.
- There is no substitute for great business analysts.
- There is no substitute for great communication.
Okay, make that three things! It’s bonus day.