FEB 1, 2003, 8:59 EST
– All 7 astronauts are killed
– $4 billion spacecraft is
destroyed
– Debris scattered over
2,000 square miles of Texas
– NASA grounds shuttle
fleet for 2-1/2 years
Space shuttle Columbia,
re-entering Earth’s
atmosphere at 10,000
mph, disintegrates
3.
Columbia – The Physical Cause
Insulating foam separates
from external tank 81
seconds after lift-off
Foam strikes underside of
left wing, breaches
thermal protection system
(TPS) tiles
Superheated air enters
wing during re-entry,
melting aluminum struts
Aerodynamic stresses
destroy weakened wing
4.
A Flawed Decision Process
Foam strike detected in
launch videos on Day 2
Engineers requested
inspection by crew or
remote photo imagery
to check for damage
Mission managers
discounted foam strike
significance
No actions were taken to
confirm shuttle integrity or
prepare contingency plans
5.
Seventeen Years Earlier…
On January 28, 1986, the
shuttle Challenger
explodes 73 seconds
into its launch, killing all
seven crew members
Investigation reveals
that a solid rocket
booster (SRB) joint
failed, allowing flames
to impinge on the
external fuel tank
6.
Challenger…
Liquid hydrogen tank explodes, ruptures liquid
oxygen tank
Resulting massive explosion destroys the shuttle
7.
The Legacy of Challenger
The Rogers Commission, which
investigated the incident, determined:
– The SRB joint failed when jet flames
burned through both o-rings in the joint
– NASA had long known about recurrent
damage to o-rings
– Increasing levels of o-ring damage had
been tolerated over time
• Based upon the rationale that
“nothing bad has happened yet”
8.
The Legacy… continued
The Commission also determined that:
– SRB experts had expressed concerns about the
safety of the Challenger launch
– NASA’s culture prevented these concerns from
reaching top decision-makers
– Past successes had created an environment of
over-confidence within NASA
– Extreme pressures to maintain launch schedules
may have prompted flawed decision-making
The Commission’s recommendations addressed a
number of organizational, communications, and safety
oversight issues
9.
Columbia – The Organizational Causes
“In our view, the NASA organizational
culture had as much to do with this
accident as the foam.”
CAIB Report, Vol. 1, p. 97
NASA had received painful
lessons about its culture from
the Challenger incident
CAIB found disturbing
parallels remaining at the time
of the Columbia incident…
these are the topic of this
presentation
10.
Columbia Key Issues
With little corroboration, management had become
convinced that a foam strike was not, and could
not be, a concern.
Why were serious concerns about the integrity of
the shuttle, raised by experts within one day after
the launch, not acted upon in the two weeks prior
to return?
Why had NASA not learned from the lessons of
Challenger?
11.
1. Maintain Sense of Vulnerability
2. Combat Normalization Of Deviance
3. Establish an Imperative for Safety
4. Perform Valid/Timely Hazard/Risk Assessments
5. Ensure Open and Frank Communications
6. Learn and Advance the Culture
Key Organizational Culture Findings
– What NASA Did Not Do
12.
Maintaining a Sense of Vulnerability
“Let me assure you that, as of
yesterday afternoon, the Shuttle was
in excellent shape, … there were no
major debris system problems
identified….”
NASA official on Day 8
“The Shuttle has become a mature
and reliable system … about as safe
as today’s technology will provide.”
NASA official in 1995
13.
Maintaining a Sense of Vulnerability
NASA’s successes (the Apollo program and others) had created
a “can do” attitude that minimized the consideration
of failure
Near-misses were regarded as successes of a robust
system rather than near-failures
– No disasters had resulted from prior foam strikes,
so strikes were no longer a safety-of-flight issue
– Challenger parallel… failure of the primary o-ring
demonstrated the adequacy of the secondary o-ring
to seal the joint
A weak sense of vulnerability can lead to taking future
success for granted… and to taking greater risks
14.
Combating Normalization of Deviance
“…No debris shall emanate
from the critical zone of the
External Tank on the launch
pad or during ascent…”
Ground System Specification Book –
Shuttle Design Requirements
After 113 shuttle missions,
foam shedding, debris
impacts, and TPS tile
damage came to be
regarded as only a routine
maintenance concern
15.
Combating Normalization of Deviance
“This history portrays an incremental
descent into poor judgment.”
Diane Vaughan,
The Challenger Launch Decision
Each successful mission reinforced the perception that
foam shedding was unavoidable…either unlikely to
jeopardize safety or an acceptable risk
– Foam shedding, which violated the shuttle design basis,
had been normalized
– Challenger parallel… tolerance of damage to the primary
o-ring… led to tolerance of failure of the primary o-
ring… which led to the tolerance of damage to the
secondary o-ring… which led to disaster
16.
Establish an Imperative for Safety
“When I ask for the budget to be cut,
I’m told it’s going to impact safety on
the Space Shuttle … I think that’s a
bunch of crap.”
Daniel S. Goldin,
NASA Administrator, 1994
The shuttle safety organization, funded by the programs it
was to oversee, was not positioned to provide
independent safety analysis
The technical staff for both Challenger and Columbia were
put in the position of having to prove that management’s
intentions were unsafe
– This reversed their normal role of having to prove
mission safety
17.
Establish an Imperative for Safety
“International Space Station deadline: 19 Feb 04”
Desktop screensaver at NASA
As with Challenger, future
NASA funding required
meeting an ambitious launch
schedule
– Conditions/checks, once
“critical,” were now waived
– A significant foam strike on
a recent mission was not
resolved prior to
Columbia’s launch
– Priorities conflicted… and
production won over safety
18.
Perform Valid/Timely
Hazard/Risk Assessments
“Any more activity today on the tile damage or are people just relegated to
crossing their fingers and hoping for the best?”
Email Exchange at NASA
“… hazard analysis processes are applied inconsistently across systems,
subsystems, assemblies, and components.”
CAIB Report, Vol. 1, p. 188
NASA lacked consistent, structured approaches for
identifying hazards and assessing risks
Many analyses were subjective, and many action items
from studies were not addressed
In lieu of proper risk assessments, many identified
concerns were simply labeled as “acceptable”
Invalid computer modeling of the foam strike was
conducted by “green” analysts
19.
Ensure Open and Frank Communications
I must emphasize (again) that severe enough
damage… could present potentially grave hazards…
Remember the NASA safety posters everywhere
around stating, “If it’s not safe, say so”? Yes, it’s that
serious.
Memo that was composed but never sent
Management adopted a uniform mindset that foam
strikes were not a concern and was not open to
contrary opinions.
The organizational culture
– Did not encourage “bad news”
– Encouraged 100% consensus
– Emphasized only “chain of command” communications
– Allowed rank and status to trump expertise
20.
Ensure Open and Frank Communications
Lateral communications between some NASA sites
were also dysfunctional
– Technical experts conducted considerable analysis
of the situation, sharing opinions within their own
groups, but this information was not shared
between organizations within NASA
– A similar point was addressed by the Rogers
Commission on the Challenger incident
Management pushback can discourage, even
intimidate, those seeking to share concerns.
21.
Learn and Advance the Culture
CAIB determined that NASA had not learned from the
lessons of Challenger
Communications problems still existed
– Experts with divergent opinions still had difficulty
getting heard
Normalization of deviance was still occurring
Schedules often still dominated over safety concerns
Hazard/risk assessments were still shallow
Abnormal events were not studied in sufficient detail,
or trended to maximize learnings
22.
… An Epilog
Shuttle Discovery was launched
on 7/26/05
NASA had formed an
independent Return To Flight
(RTF) panel to monitor its
preparations
7 of the 26 RTF panel members
issued a minority report prior to
the launch
– Expressing concerns about
NASA’s efforts
– Questioning if Columbia’s
lessons had been learned
23.
… An Epilog
During launch, a large piece of foam separated from the
external fuel tank, but fortunately did not strike the
shuttle, which landed safely 14 days later
The shuttle fleet was once again grounded, pending
resolution of the problem with the external fuel tank
insulating foam
25.
Piper Alpha
On 7/6/1988, a series of
explosions and fires
destroyed the Piper Alpha
oil platform
165 platform workers and
2 emergency responders
were killed
– 61 workers survived
by jumping into the
North Sea
26.
The Physical Cause
It is believed that a pump
had been returned to
service with its discharge
relief valve removed for
testing
The light hydrocarbon
(condensate) that was
released formed a vapor
cloud and ignited
The resulting vapor cloud
explosion ruptured oil
export lines and ignited
fires on the platform
27.
The Physical Cause
Other interconnected
platforms continued
production, feeding the
leaks on Piper Alpha
Ensuing fires breached
high pressure natural
gas inlet lines on the
platform
The enormity of the
resulting conflagration
prevented any organized
evacuation
28.
The Organizational Causes
The official investigation report, written by Lord Cullen,
faulted the company’s management of safety on Piper
Alpha
The confusion leading to restarting the condensate
pump resulted from failures to adhere to the permit to
work (PTW) system
– Daily monitoring and periodic audits had failed to
identify the continuing dysfunction of the system
29.
The Organizational Causes
Inadequate shift turnovers failed to communicate the
status of the pump to the oncoming shift
– Inadequate communications (and PTW system
problems) had contributed to a fatality, and a
civil conviction for the company, but remedial
action had not been taken
The diesel fire pumps were in manual and, after the
explosion, could not be reached by staff seeking to
start them
– A prior audit recommendation to stop this
practice had not been implemented
30.
The Organizational Causes
Even if fire water had been available, many deluge
nozzles were plugged
– The company had been trying to resolve this
problem for at least four years, but repairs were
behind schedule
One year earlier, an engineering study had concluded
that the gas risers were vulnerable and that a massive
gas release could prevent successful evacuation of
the platform
– Management had discounted the study results
31.
The Organizational Causes
Other problems that audits and management reviews had
failed to identify and/or resolve included:
– Emergency response training given to workers new to the
platform was cursory and often omitted. Some workers
had not been shown the location of their life boat.
– Platform managers had not been trained on how to
respond to emergencies on other platforms (e.g., when to
stop production)
– Evacuation and emergency shutdown drills on Piper Alpha
were not conducted according to schedule
32.
Parallels to NASA and Columbia
Each Piper Alpha
organizational cause can be
mapped to one or more of
the NASA lessons
– Maintain Sense Of
Vulnerability
– Combat Normalization Of
Deviance
– Establish an Imperative for
Safety
– Perform Valid/Timely
Hazard/Risk Assessments
– Ensure Open and Frank
Communications
– Learn and Advance the
Culture
33.
Flixborough
On 6/1/1974, a massive
vapor cloud explosion
(VCE) destroyed a UK
chemical plant
Consequences:
– 28 employees died
and 36 were injured
– Hundreds of off-site
injuries
– Approx. 1800 homes
and 170 businesses
damaged
34.
The Physical Cause
[Diagram: reactor train 1–6 with reactor 5 removed and a temporary 20-inch bypass installed in its place; operating pressure 125 psi]
Approx. 30 tons of boiling cyclohexane released from
reactor system
Most likely release cause was the failure of a
temporary piping modification
– Installed between two reactors
– Served as a bypass for a reactor that had been removed for repairs
35.
The Physical Cause
Bellows not designed
for 38-ton thrust
Design standards for
bellows ignored
Inadequate pressure
test of installation
Inadequate vertical and
lateral support for
jumper
36.
The Organizational Causes
No qualified mechanical
engineer on-site
Inadequate concern with
the cause of the reactor
failure
Jumper connection
considered a routine
plumbing job
– No detailed design
for jumper
37.
The Organizational Causes
“Hurry up”
attitude of management
– Overworked staff
did not take time to
properly analyze
their actions
38.
Parallels to NASA and Columbia
Each Flixborough
organizational cause
can be mapped to one
or more of the following
NASA lessons
– Maintain Sense Of
Vulnerability
– Establish an
Imperative for Safety
– Perform Valid/Timely
Hazard/Risk
Assessments
39.
Could this happen to us?
Optional: Paste
Company logo
here
Complacency due to our superior safety performance
Normalizing our safety critical requirements
Ineffective Risk Assessments of our systems
Reversing the Burden of Proof when evaluating safety
of operations
Employees Not Speaking Freely of their safety
concerns
Business Pressures at odds with safety priorities
Failure to Learn and apply learnings to improving our
culture
40.
Title for Relevant Company Event
Use this section to briefly
summarize key aspects
of the event
– Do not address
causes here
– Add additional slides
if required
Paste photo related to
event in space at right, if
desired
– JPG files at 300 dpi
provide adequate
resolution
– If photo is not
provided, drag right
border over to
expand this text box
Optional: Paste
Company logo
here
41.
The Physical Cause
Briefly describe the
factors that caused the
event
– Do not address
organizational
factors here
– Add additional
slides if required
Add photo to the right,
or expand the text box
as desired/needed
Optional: Paste
Company logo
here
42.
The Organizational Causes
Describe the organizational causes of the event
– Where feasible, lay a basis for parallels to the 6
NASA organizational culture findings
• Maintain Sense Of Vulnerability
• Combat Normalization Of Deviance
• Establish an Imperative for Safety
• Perform Valid/Timely Hazard/Risk
Assessments
• Ensure Open and Frank Communications
• Learn and Advance the Culture
Optional: Paste
Company logo
here
43.
Parallels to NASA and Columbia
If you feel that this
would add to the
emphasis of the
message, include one or
more slides that
emphasize how your
organizational causes
relate to the underlying
themes from Columbia
– Alternatively, you
may want to leave
this as an individual
or group exercise
for the audience
Optional: Paste
Company logo
here
45.
…NOT Maintaining a Sense of Vulnerability
Safety performance has been good… and you do not
recall the last time you asked “But what if…?”
You assume your safety systems are good enough
You treat critical alarms as operating indicators
You allow backlogs in preventative maintenance of
critical equipment
Actions are not taken when trends of similar
deficiencies are identified.
46.
…NOT Preventing Normalization of Deviance
You allow operations outside established safe operating
limits without detailed risk assessment
Willful, conscious, violation of an established procedure
is tolerated without investigation, or without
consequences for the persons involved
Staff cannot be counted on to strictly adhere to safety
policies and practices when supervision is not around
to monitor compliance
You are tolerating practices or conditions that would
have been deemed unacceptable a year or two ago
47.
…NOT Establishing an Imperative for Safety
Staff monitoring safety related decisions are not
technically qualified or sufficiently independent
Key process safety management positions have been
downgraded over time or left vacant
Recommendations for safety improvements are
resisted on the grounds of cost or schedule impact
No system is in place to ensure an independent review
of major safety-related decisions
Audits are weak, not conducted on schedule, or are
regarded as negative or punitive and, therefore, are
resisted
48.
…NOT Performing Valid/Timely Hazard/Risk Assessments
Availability of experienced resources for hazard or risk
assessments is limited
Assessments are not conducted according to schedule
Assessments are done in a perfunctory fashion, or seldom
find problems
Recommendations are not meaningful and/or are not
implemented in a timely manner
Bases for rejecting risk assessment recommendations are
mostly subjective judgments or are based upon previous
experience and observation.
49.
…NOT Ensuring Open and Frank Communications
The bearer of “bad news” is viewed as “not a team
player”
Safety-related questioning is “rewarded” by requiring the
questioner to prove he/she is correct
Communications get altered, with the message
softened, as they move up or down the management
chain
Safety-critical information is not moving laterally
between work groups
Employees cannot speak freely, to anyone, about
their honest safety concerns without fear of career
reprisals.
50.
…NOT Learning and Advancing the Culture
Recurrent problems are not investigated, trended, and
resolved
Investigations reveal the same causes recurring time and
again
Staff expresses concerns that standards of performance
are eroding
Concepts, once regarded as organizational values, are
now subject to expedient reconsideration
51.
“Engineering by Viewgraph”
“When engineering analyses and risk assessments are condensed to fit
on a standard form or overhead slide, information is inevitably lost…
the priority assigned to information can be easily misrepresented by its
placement on a chart and the language that is used.”
CAIB Report, Vol. 1, p. 191
The CAIB faulted shuttle project staff for trying to
summarize too much important information on too
few PowerPoint slides
We risk the same criticism here
This presentation introduces the concept of
organizational effectiveness and safety culture, as
exemplified by the case studies presented
This is only the beginning…
Editor's Notes
#1 The primary focus of this presentation is the circumstances leading up to the Columbia shuttle disaster. However, to emphasize that Columbia was not a one-of-a-kind event, and to illustrate the potential consequence of failing to acknowledge and address cultural weaknesses, the Challenger shuttle disaster (which occurred 17 years before) is described, and parallels are drawn.
Also, to provide relevance to our industry, the Piper Alpha and Flixborough disasters are described in a way that emphasizes the general applicability of the safety culture learnings from Columbia.
Most of the information in this presentation was taken from the publicly available investigation reports. The Process Safety Culture project team feels that the lessons and organizational root causes identified in these four investigations are applicable to all organizations that operate facilities handling hazardous materials or that engage in hazardous activities.
We have included some self-assessment questions at the end of the presentation that are intended to assist the presenter in determining whether these lessons are relevant to his or her organization.
#2 On February 1, 2003, the Space Shuttle Columbia disintegrated during re-entry into the Earth’s atmosphere, killing all seven crewmembers aboard.
Following the tragedy, a 13-member Columbia Accident Investigation Board (CAIB) investigated this incident and identified both its physical cause as well as the underlying organizational causes.
The board warned that, without sweeping changes to address the underlying causes, “the scene is set for another accident.” This incident and its findings offer many lessons that are likely to be applicable to any organization that is involved in high-risk activities.
[The photos on this slide are two views, taken from the ground, of Columbia as it was breaking apart.]
#3 The physical cause leading to the loss of Columbia and its crew was a breach in the Thermal Protection System (TPS) on the leading edge of the left wing, caused by a piece of insulating foam that separated from the external fuel tank and struck the wing at 81.7 seconds after launch. The circles and arrow in the figure show the point of origination, point of impact, and likely trajectory of the foam. While the foam is lightweight, it would have been moving at a speed of about 500 mph relative to the shuttle at impact.
(The external fuel tank assembly, which includes the cryogenic liquid hydrogen and liquid oxygen tanks, is insulated to prevent the formation of ice that could break free during launch and damage the shuttle. The foam broke away from the left bipod ramp section of the tank; i.e., a “strut” that anchors the fuel tank to the shuttle during launch).
During re-entry, the breach in the TPS allowed superheated air to penetrate the left wing and progressively melt the aluminum structure of the wing. This weakened the structure until increasing aerodynamic forces caused failure of the wing, loss of control, and break-up of the Shuttle. This breakup occurred in a flight regime during which, given the current design of the Shuttle, there was no possibility for the crew to survive (the shuttle’s speed would have been on the order of 10,000 mph at the time).
#4 During routine reviews of launch video on the 2nd day of the mission, it was observed that a piece of foam had broken free from the external tank and struck the shuttle. The circle in the photo shows the cloud of debris that was produced by the collision.
Mission technical experts almost immediately began to assess the significance of the foam strike. A special Damage Assessment Team (DAT) was formed to study the matter. A series of requests were made for either direct inspection of the wing by the astronauts (which would have required an unscheduled space walk) or photographic imagery from ground- or space-based cameras operated by US military or intelligence agencies.
Shuttle Program management either refused or overruled these requests (some of which had apparently been looked upon with disfavor since they had been submitted outside of the normal organizational channels).
Shuttle Program management quickly formed the opinion that the foam strike was not a safety-of-flight issue and was, at most, a maintenance issue that would complicate preparing Columbia for its next mission. This opinion was based, in part, on the knowledge that previous foam strikes had not caused serious safety issues.
While technical experts in the engineering ranks were concerned about the potential for damage to the shuttle, management did not see this as an issue, and no initiatives were implemented for inspecting the shuttle, or for planning any alternatives to its scheduled routine return to Earth.
#5 To understand the full significance of the Columbia disaster, it is necessary to recall the details of the Challenger disaster which occurred 17 years earlier.
The Challenger space shuttle was destroyed in a massive explosion approximately 73 seconds into its launch. Subsequent investigation revealed that a field assembly joint in one of the two solid rocket boosters (SRBs) had failed, allowing hot combustion gases to contact the external fuel tank.
The two illustrations show a cross-section of a SRB field joint, and the jet of flame issuing from the joint during take-off. The joint design incorporates two o-rings to seal the joint during the firing of the rocket.
Note that the SRBs, unlike the main engines on the shuttle, cannot be “turned off” after they are ignited.
#6 The external fuel tank carries the liquid hydrogen and liquid oxygen which fuel the main engines on the shuttle.
The flame from the field joint impacted the external fuel tank at a point where the liquid hydrogen tank was located. The liquid hydrogen tank failed and the resulting explosion caused the SRB to rotate away from the external tank in a manner that resulted in the nose of the SRB puncturing the liquid oxygen tank like a can opener.
The simultaneous release of the liquid hydrogen and liquid oxygen resulted in a massive explosion that destroyed the shuttle orbiter.
[The photo on the left shows the SRBs, still burning, rocketing away from the shuttle debris just after the explosion.]
#7 The subsequent investigation confirmed that a field joint had indeed failed when flames from the SRB burned through both the primary and secondary o-rings and the putty between them. Further, it was learned that SRB o-ring damage was not unique to this mission. A pattern of o-ring damage extending over several years had preceded this mission. Even though progressively more severe degrees of damage had been observed over time, Shuttle Program management had grown increasingly complacent to o-ring damage, since the o-ring damage had not yet seriously impacted a mission.
#8 After identifying the immediate, physical cause of the incident, the Commission investigated the organizational factors that had contributed to the event.
Even though technical experts had been studying the field joint problem, and even though they expressed grave reservations about the safety of this particular launch, the nature and intensity of their concerns did not reach Shuttle Program management. The organizational culture factors that stifled effective communications are discussed in more detail in the Challenger Case History included in this package.
At the same time, the NASA organization was riding on an excess of confidence, inspired by past successes – notably the Apollo lunar program. Even though the nature of the agency’s mission, and the agency itself, had changed significantly (e.g., NASA had experienced significant budget and staffing cuts), NASA continued to view, at least implicitly, its past performance as a predictor of future success.
In addition, NASA management believed that they were in a struggle to prevent the demise of the Shuttle program. NASA had to prove the reliability of the Shuttle in order to obtain the funding necessary to sustain the program. This required that NASA meet a sustained schedule of launches that was far more ambitious than any previously demonstrated. The Commission felt that the resulting pressures on Shuttle Program management distracted attention away from the consideration of matters that, if addressed, would likely have delayed the program.
The Commission report included a number of organizational and safety culture recommendations that were felt necessary to prevent future catastrophic incidents. Some, regrettably, were not implemented.
#9 With this background, let us return to Columbia… Once the physical cause that led to the disaster was determined, the CAIB turned its attention to organizational cultural factors behind the failure, just as the Rogers Commission had.
The CAIB was already familiar with the problems rooted in the Shuttle Program’s history and culture, including the original compromises that were required to gain approval for the Shuttle, subsequent years of resource constraints, fluctuating priorities, schedule pressures, mischaracterization of the Shuttle as an operational rather than developmental or experimental vehicle, and lack of an agreed upon national vision for human space flight.
As the next series of slides will show, the CAIB found a number of disconcerting similarities between the organizational causes of the Columbia accident and those of Challenger, 17 years before.
#10 Recognizing that Shuttle Program management had almost immediately made an a priori decision that the foam strike could not be a problem, the CAIB was seeking to understand two issues:
Why was it that serious concerns about the integrity of Columbia, raised by technical experts within one day after the launch, were not acted upon in the two weeks available between launch and return?
Were there cultural patterns emerging from the Columbia accident that were the same as those first identified after the Challenger tragedy (almost exactly 17 years earlier) and, if so, why were they still present?
The CAIB was interested in finding out how the organizational culture contributed to the event.
#11 We have identified six key organization culture themes within the Columbia report. While these themes may be self-explanatory … the following comments are included for your information.
Maintain Sense Of Vulnerability. Since catastrophic accidents involving hazardous materials or activities are not very common, most organizations never have the unfortunate opportunity to experience one. This can create a false sense of security and decreased operating discipline, which can dull management system effectiveness. Lapses in critical prevention systems can result.
Combat Normalization Of Deviance. When established engineering or operational constraints are consciously violated, without any resulting negative consequences, an organizational mindset is encouraged that more easily sanctions future violations. Such violations are more likely to lead to a serious accident.
Establish an Imperative for Safety. This addresses a range of considerations, from showing visible support for safety through management actions, statements and priorities to soliciting and welcoming differing opinions on critical safety issues.
Perform Valid/Timely Hazard/Risk Assessment. Without a complete understanding of risks, and the options available to mitigate them, management is hampered in making effective decisions. Perfunctory assessments lead to flawed decisions.
Ensure Open and Frank Communications. Information must effectively flow both up and down the organization, and laterally between functional groups within the organization. “Bad news filters,” emphasis on “chain of command” communications, and “silo” mentalities can stifle the exchange of safety-critical information.
Learn and Advance the Culture. We must be open to learning from our mistakes (and those of others), and to making the necessary corrections… or we will repeat those mistakes.
#12 Even after it was established that a significant piece of debris had struck the orbiter during its initial ascent, a senior NASA official was in a denial mode on the eighth day of the mission.
In the 17 years since the Challenger incident, the perception that similar catastrophic events could occur had been diminished by the NASA organizational culture.
Shuttle managers were relying heavily on recent past success as a justification for their actions or inactions. Specifically, new, unforeseen issues were not subject to thorough technical analysis.
A key observation made by the Rogers Commission after the Challenger incident was that the Shuttle was, and might always remain, an experimental vehicle. The second quote, made 9 years after Challenger, illustrates how the sense of vulnerability that should have been created by the Challenger incident had not persisted. Between 1986 and 1995, the shuttle had gone from being an experimental vehicle to a “mature and reliable system.”
[The photo shows debris, recovered after the Columbia break-up, laid out on a grid pattern approximating where the debris had originated from on the shuttle.]
#13 NASA’s “can do” attitude often made it hard for individuals (even groups) to step forward and say “this can’t be done.” The imperative of “we must succeed” had overwhelmed the consideration of “we could fail.”
The belief that foam strikes did not jeopardize the Shuttle arose from a limited observation that no disasters had resulted from previous foam impacts – so, therefore, no disasters were going to occur in the future. Even though foam strikes represented a failure to comply with the design basis for the shuttle, NASA had “turned the experience of failure into the memory of success.”
#14 Early in the Shuttle Program, foam loss had been considered a dangerous problem. Design engineers were extremely concerned about potential damage to the Orbiter and the fragile TPS. The assumption that only tiny pieces of debris would strike the Orbiter was also built into original design requirements.
Over the 113 Shuttle missions flown, foam shedding and debris impacts had come to be accepted as routine and were viewed by Shuttle Program managers to be only a maintenance concern.
Having lost the sense of vulnerability, the organization succumbed to accepting events that were proscribed by the original design basis.
#15 Only limited technical analyses were performed to determine the actual risks associated with this fundamental deviation from intended design. Each successful landing reinforced the organization’s belief to the point where foam shedding and debris strikes were “normalized.”
As new evidence emerged suggesting that the Columbia foam strike was larger, and possibly more threatening, than earlier foam strikes, this information was quickly discounted by management.
Once you let a standard slip, each successive violation becomes a bit easier.
While the concept of “normalization of deviance” had been much discussed in the aftermath of the Challenger incident, the NASA culture had not been “cured” of this crucial weakness.
#16 The impact of such a statement on an organizational culture is significant -- especially when coming from a top official. It can result in people at all levels feeling less compelled to bring up safety matters. Others, at lower levels, begin to mimic what they hear from above.
Over the years at NASA, the safety organization had degraded, and had ultimately been relegated to “rubber stamping” critical safety-related decisions, rather than providing an independent assessment and strong voice that would help ensure the management of risks.
Safety personnel were present during debris assessment meetings, but their presence was passive and they did not serve as a channel for voicing concerns or dissenting views.
Most importantly, technical staff with mission safety responsibilities had their normal roles reversed. The traditional approach to a potential safety problem would have been to assume that a problem existed, then seek the sound technical evidence and analysis necessary to prove (if possible) that the problem did not exist. Technical staff concerned about the foam strike wanted to obtain photographs that would have shown whether there was damage to the Orbiter. Shuttle Program management, in effect, required proof of the damage before they would authorize the imaging necessary to prove whether there was damage.
The burden of proof should have been placed on proving that reentry was safe. Instead, technical staff were faced with the reality that reentry would proceed as planned, unless they could prove that it was unsafe… and then they were denied the tools needed to do so.
#17 NASA employees were also subjected to deadline pressures to complete the preliminary phase of the International Space Station by 19 Feb 2004. The failure to meet this schedule was perceived by NASA management to threaten the viability of the Shuttle Program.
Consequently, this deadline was heavily promoted by the most senior NASA officials and it created an environment in which some NASA workers felt reluctant to raise safety issues for fear of delaying the timetable.
Even though several substantive technical problems were encountered with the Shuttle program after this deadline was established, NASA management was unwilling to revise it.
The upper figure shows a screen saver that was installed on NASA computers. The screen saver counted down the time until the 19 Feb 2004 deadline… in seconds. Thus, NASA employees received a continuous, real-time reemphasis on the importance of schedule… which potentially overwhelmed any periodic emphasis on program safety.
#18 The CAIB concluded that a lack of consistent, structured approaches for identifying hazards and assessing risks contributed to the faulty decision-making process at NASA.
Many of the analyses that were performed contained subjective and qualitative judgments, using words and phrases like “believed” and “based on experience from previous flights this hazard is an Accepted Risk.”
Further, many of the action items emerging from these studies were not addressed.
The failure of the risk assessment process is ultimately manifested in the Columbia incident. At the time of the Shuttle launch, there was still no clear technical, risk-based understanding of the significance of foam debris impacts to the spacecraft. A very significant foam strike on a prior, recent mission had not even been assessed yet. Management had no solid information upon which to base their decisions… yet they made the decision that the Columbia foam strike was not a concern.
#19 Management had already settled on a uniform mindset that foam strikes were not a concern. Any communications to the contrary were either directly or subtly discouraged.
An organizational culture had been established that did not encourage “bad news.” This was coupled with a NASA culture that emphasized “chain of command” communications. The overall effect was to either stifle communications completely or, when important issues were communicated, to soften the content and message as the reports and presentations were elevated through the management chain.
The organizational culture encouraged 100% consensus, further discouraging the expression of dissent. (The CAIB observed that a healthy safety organization is suspicious if there are no dissenting views). Participants felt intimidated.
#20 While a significant amount of e-mail traffic occurred between individuals, official communications between NASA sites and functional groups were limited and ineffective.
#21 The organizational dysfunctions that had been identified in the Challenger incident, and which persisted through the Columbia incident, strongly suggest that NASA had not learned from its mistakes… and had not stepped up to the challenge of maturing its safety culture.
Such lessons are painful and expensive. Ignoring them risks repeating them.
#22 NASA resumed manned space flight in July 2005, nearly 2-1/2 years after the Columbia incident. NASA had commissioned a panel of experts to monitor its compliance with the CAIB report recommendations, and its preparations for return to flight (RTF). The panel issued a formal report documenting their observations and opinions. 7 of the 26 panel members issued a minority report criticizing NASA’s efforts. Some quotes follow:
“…we believe that the leadership and management climate that governed NASA’s return-to-flight effort was weak in some important ways… “
“we believe these organizational and behavioral concerns are still pervasive throughout the human spaceflight programs… “
“NASA leadership … missed opportunities to address the enduring themes of dysfunctional organizational behavior that the CAIB and other external evaluators have repeatedly found. As a result, in our view, many fundamental concerns persist… “
“…we do not believe the risk management processes in place within the Space Shuttle Program are sufficiently robust… “
“…what our concerns … point to are a lack of focused, consistent, leadership and management…”
“… roles, positions, and strength of personality often determined critical outcomes more than facts and analysis… “
“…it appears to us that lessons that should have been learned have not been… “
“NASA’s leaders and managers must break this cycle of smugness substituting for knowledge.”
#23 During the launch of Discovery, a large piece of insulating foam broke off of the external fuel tank but, fortuitously, did not strike the Orbiter.
In response to the CAIB’s recommendations, NASA now has the capability to inspect the Orbiter on orbit. The shuttle astronauts were able to confirm that Discovery had not been damaged, and a safe return to Earth was later effected.
NASA subsequently grounded the shuttle fleet, once again, to allow further studies and, hopefully, an ultimate resolution of the problems associated with the shedding of insulating foam.
#24 One or more industry case histories are presented to illustrate the relevance of the general learnings from the Columbia investigation to our industries.
#25 The Piper Alpha oil platform was located in the North Sea approximately 110 miles from Aberdeen, Scotland. At the time of the incident it had 226 people on board, 165 of whom perished (in addition, two emergency response personnel died during a rescue attempt).
Many of the platform occupants retreated to the crew quarters to await evacuation instructions, which never came. They died there from smoke inhalation.
The platform was totally destroyed.
[The photo shows the Piper Alpha platform prior to the explosion.]
#26 While the investigation of the incident was hindered by a lack of physical evidence, based upon eyewitness accounts it was concluded that, most likely, a release of light hydrocarbon (condensate; i.e., propane, butane, and pentane) occurred when a pump was restarted after maintenance.
Personnel restarting the pump were not aware that a relief valve (RV) in the pump discharge had also been removed for service and that a blank had been loosely installed on the piping flange (which was not readily visible from the pump vicinity) in place of the valve.
Upon restart of the pump, this flange leaked, producing a flammable hydrocarbon cloud, which subsequently found an ignition source. The resulting explosion breached crude oil lines on the platform, resulting in large hydrocarbon fires on the platform.
[The photo shows the smoke from the oil fires shortly after the initial explosion.]
#27 Piper Alpha was at the hub of a network of platforms interconnected by oil and gas pipelines. Managers on the other platforms were aware that there was a problem on Piper Alpha, but did not know its severity. They assumed that they would be instructed to shut down their operations, if needed.
However, the initial explosion had interrupted communications from Piper Alpha and considerable intervals (from 30 to 60 minutes) passed before these other platforms were shut in. In the interim, oil from these other platforms continued to fuel the fires on Piper Alpha.
The heat from the oil fires ultimately weakened large diameter, high pressure natural gas lines on Piper Alpha. These lines ruptured in succession, resulting in catastrophic explosions.
Nearby ships attempted to reach Piper Alpha to effect rescues, but were hampered by the intensity of the flames.
[The photo shows the platform during one of the explosions occurring after the rupture of a gas riser.]
#28 The incident received an extensive government investigation. In addition to identifying the most probable physical causes, the investigation identified a number of organizational causes.
It was determined that chronic problems existed with the implementation of the permit to work (PTW) system. The fact that two separate permits had been written (one for the pump and one for the RV), and that at least one of them had not been properly administered, resulted in Operations’ lack of awareness that the RV was missing when the pump was restarted. Audits and reviews of the PTW system had failed to identify its weaknesses.
#29 Inadequate communications between maintenance, contractor, and operations personnel on two successive shifts also contributed to the problem. Shift turnovers were known to be problematic, but corrective actions had not been implemented.
There was no fire water available to fight the initial oil fires because the diesel fire pumps were in manual. The platform manager had a policy of putting the pumps in manual whenever divers were in the water around the platform… even if they were not working in the vicinity of the pump intakes. A prior fire protection audit recommendation to modify this policy had not been addressed. Personnel were unable to reach the pumps after the first explosion to start them.
#30 Had fire water been available, it might not have been effective since there was a known problem with salt water corrosion of the fire water lines. Many deluge heads were known to be plugged. Work to replace the supply lines and nozzles was in progress but was well behind schedule.
The structural steel on Piper Alpha had no fireproofing. The structure was overloaded and could not support the additional weight of such protection. Consequently, the platform was known to be susceptible to the effects of a massive hydrocarbon fire.
The natural gas risers were similarly unprotected from external fire exposure. An engineering study, presented a year before the incident, cautioned about the potential for riser failures, but the recommendations of the study were discounted by company management.
The study also cautioned that a massive riser leak would make a successful evacuation of the platform unlikely. One year after the report, this proved to be the case.
#31 The investigation also revealed that personal safety and emergency response training was haphazardly conducted. A visitor to the platform was to receive safety training, including training on the use and location of the lifeboats, if this was the first visit to Piper Alpha, or if more than 12 months had elapsed since the last refresher training. Some survivors recounted that such training was waived if they had ever previously been offshore anywhere else. Some were unaware of their lifeboat assignments or where to find them.
Evacuation and emergency shutdown drills were scheduled, but often deferred due to weather conditions or other conflicts.
Platform managers had not been trained/drilled on the proper responses to major emergencies on other platforms; hence the confusion as to whether they should have suspended production after learning that there were problems on Piper Alpha. To complicate this matter, the different platforms were owned by different companies.
The lack of effectiveness of the Piper Alpha platform manager during the emergency (he was one of the staff who evacuated to the crew quarters and suffocated there) led the inquiry to question the adequacy of his training.
#32 Slides have not been included to explicitly map the parallels between Piper Alpha and the NASA learnings. This was intentional, so that the mapping could be left as a potential group or workshop exercise. Some suggestions:
Management’s dismissal of the engineering study suggesting the catastrophic potential of a gas riser failure, and management tolerance of the delays in the repair of the fire protection systems suggest that they did not Maintain a Sense of Vulnerability.
Combat Normalization of Deviance. Problems with the PTW system and shift turnovers were known, at least by personnel on the platform, but no corrections were made.
The Inquiry was critical of corporate management’s overall performance with respect to the acceptance of some known problems, and failure to identify other obvious problems. Management’s actions did not communicate An Imperative for Safety.
The decision-making processes with respect to operation of the diesel fire pumps, the schedule for the fire water system repairs, and the vulnerability of the natural gas risers call into question the validity of any underlying Hazard/Risk Assessments.
Management’s failure to address employee concerns about the inadequacy of shift turnovers indicates several problems related to Open and Frank Communications.
PTW and shift turnover problems had been identified in an earlier incident that resulted in one fatality, yet corrections had not been made. In fact, corporate admissions of fault, made during the settlement of the resulting legal proceedings, had not been shared with safety staff. The organization did not evidence an inclination to Learn and Advance the Culture.
#33 The Flixborough Works of Nypro (UK) Limited manufactured cyclohexanone via the partial oxidation of cyclohexane (cyclohexanone is used to make caprolactam, a nylon intermediate). The process operated at approximately 125 psi and 155 °C.
The vapor cloud explosion (VCE) and subsequent fires destroyed the plant. Of the 28 fatalities, 18 were in the control room building, which was demolished. The fatality count would have been far higher had the explosion not occurred on a weekend.
There was a significant number of injuries, and considerable building damage, in the nearby villages.
[The photo shows a portion of the plant after the explosion.]
#34 One of six vessels in the reactor train was removed after a six ft long crack was detected in the vessel shell. The gap between the adjacent reactors was subsequently spanned by a 20 inch diameter jumper.
The investigation determined that the most likely cause of the explosion was the failure of this jumper, which released at least 30 tons of boiling cyclohexane, which formed a vapor cloud, found an ignition source, and exploded.
[The figure illustrates how the reactors were arranged, with each successive reactor lower in elevation, to permit gravity flow of the process stream from one reactor to the next. It also depicts the doglegged shape required for the jumper to make up the elevation differences between reactors 4 and 6.]
#35 Because the reactors were staggered in elevation, it was necessary to incorporate a dogleg into this piping jumper. This jumper was fabricated and installed, supported only by scaffolding.
There were no detailed design calculations made for the jumper (other than to confirm that it would withstand the normal operating pressure).
The installed jumper was leak checked at normal operating pressure, but the system was not pressure tested in a fashion appropriate for this sort of modification to a pressurized process system. Had the jumper been tested at 1.3X the reactor design pressure (as required by British standards), it would likely have failed during the test… thereby preventing the explosion that subsequently occurred.
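As a minimal sketch of the 1.3X proof-test rule mentioned above: the notes give only the ~125 psi operating pressure, so the design-pressure figure below is a hypothetical value chosen for illustration, not a figure from the investigation.

```python
# Illustrative sketch of the proof-test rule cited in the notes:
# a modification should be pressure tested at 1.3x the system's
# design pressure, not merely leak checked at operating pressure.

def required_test_pressure(design_pressure_psi: float, factor: float = 1.3) -> float:
    """Minimum proof-test pressure for a modified pressurized system."""
    return design_pressure_psi * factor

# Hypothetical design pressure, somewhat above the ~125 psi
# operating pressure stated in the notes (assumed for illustration):
design_p = 150.0  # psi

print(f"Leak check (operating): 125.0 psi")
print(f"Required proof test:    {required_test_pressure(design_p):.1f} psi")
```

The point of the sketch is the gap between the two numbers: the jumper saw only the operating-pressure leak check, never the substantially higher proof-test load that would have exposed its weakness on the ground rather than in service.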
It is believed that the unbalanced forces imposed on the bends in the piping, coupled with the flexibility introduced by the expansion bellows on each end of the jumper, allowed the inadequately supported and unconstrained jumper to oscillate. Ultimately, one of the bellows failed.
[The figure shows the installed jumper, supported on scaffolding. Surprisingly, perhaps, the jumper remained in service for two months before failing.]
#36 The works engineer had resigned earlier in the year. While there were chemical and electrical engineers on staff, there was no qualified mechanical engineer employed at the site.
In the perspective of the staff involved in planning and implementing this modification, this was a routine plumbing job. No detailed design was provided for the jumper; in fact, the “design drawing” consisted of a chalk sketch on the maintenance shop floor.
Even though reactor 5 had experienced a six-ft long crack, it was decided that there was no need to inspect the other five reactors, which were exposed to similar service conditions. The investigation team suggested that the time required to make this inspection might have provided staff an opportunity to reflect on, and reconsider, their hasty plans to jumper between the reactors.
#37 The investigation team did not fault the safety attitude of company and site management, but did imply that there was a lack of appreciation of the impact on plant staff resulting from the departure of the works engineer, and the emphasis on a prompt return to production.
Workloads were increased by the absence of the works engineer, and some staff were working in areas not consistent with their areas of competence.
The investigation team concluded that these factors prevented staff from taking the time necessary to properly evaluate their plans and actions.
#38 Slides have not been included to explicitly map the parallels between Flixborough and the NASA learnings. This was intentional, so that the mapping could be left as a potential group or workshop exercise.
#39 Subsequent slides will detail a company-specific case history. Before proceeding, ask the audience to share their current impressions with respect to these questions. Could this happen to you?
Do we see staff becoming complacent as a result of past superior safety performance?
Are we taking chances that would not have been tolerated before? Are we maintaining the same standards that were in effect 12, 24, or 36 months ago?
Is our risk assessment process effective? Are we addressing our risks on a timely basis?
If the safety of an activity is in question, where does the burden of proof lie? Do we require proof of safety to permit the activity to continue… or proof of danger to prevent the activity? Does it matter whether the activity, if conducted, is a profitable one?
Do our employees feel empowered to share their safety concerns freely with management? Do we “shoot the messenger” if we disagree with the message?
Do we frequently see business and customer pressures at odds with safety requirements?
Is our incident reporting system working well? Do we learn from our mistakes and apply those learnings? Can we point to evidence that our safety culture is strengthening with time?
#40 If you choose to include a company specific case history, refer to the guidance in Instructions for PowerPoint Presentation.
#41 If you choose to include a company specific case history, refer to the guidance in Instructions for PowerPoint Presentation.
#42 If you choose to include a company specific case history, refer to the guidance in Instructions for PowerPoint Presentation.
#43 If you choose to include a company specific case history, refer to the guidance in Instructions for PowerPoint Presentation, included in the package.
#44 The following slides provide a sampling of indicators that problems may exist within your organization with respect to each of the six organizational/safety culture themes. These are all self-explanatory, and no commentary is provided.
#51 The CAIB faulted NASA staff for attempting to condense technically complex topics, addressing matters of great import, onto a few (often only one) PowerPoint slides for presentation to management.
Such an approach almost guarantees that important details will be omitted or downplayed and, consequently, that management awareness of the issue will be lessened.
This presentation is a brief introduction to a very complex topic. It is meant only to create an awareness and stimulate an interest in your audience.
There are key organizational/safety culture issues that have not been touched on in this presentation… we sought only to discuss those which could be exemplified by the compelling details of the Columbia disaster. A broader perspective is provided in the document Safety Culture: What Is At Stake, included in this package.
But even this is only an introduction. Additional valuable resources on the topics of organizational effectiveness and safety culture are provided in the Bibliography.