The Future of Failure
What happens when a bridge collapses? Or a space mission ends in tragedy? You call in forensic engineers to investigate.
Host Dr Anna Ploszajski contemplates the nature of failure with Dr Sean Brady, who uses scientific and engineering principles to study structural collapses around the world, and who has acted as an expert witness numerous times.
Episode Transcript
SEAN BRADY
Imagine we have a table and we're dropping grains of sand on the table, landing at random positions on the table. We're going to slowly build up little hills, but those hills are going to be random in where they appear because we don't know where we're dropping the sand. And then eventually we're going to drop one grain of sand, we're going to hit one of these hills and we're going to get an avalanche. What causes the avalanche? Well, it's the grain of sand. If the grain of sand was not dropped, we wouldn't have had the avalanche. No, no, no, think about that for a minute. We've been dropping grains of sand on this table for a long time, but they didn't cause avalanches. It's the hill that created the situation where all it took was a single grain of sand.
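The picture Sean paints is close to the classic Bak-Tang-Wiesenfeld sandpile model from complexity science. Below is a minimal sketch of that idea; Sean doesn't name a model, so the specific rules (a square grid where any cell holding four grains topples and passes one grain to each neighbour) are an assumption for illustration only.

```python
# Sandpile sketch (assumed Bak-Tang-Wiesenfeld-style rules, for illustration):
# grains land at random cells; a cell with 4 or more grains topples, passing one
# grain to each neighbour, which can trigger further topples. The avalanche size
# depends on the state the pile has built up, not on the final grain itself.
import random

SIZE = 20          # the "table" is a SIZE x SIZE grid
THRESHOLD = 4      # a cell with this many grains topples
table = [[0] * SIZE for _ in range(SIZE)]

def drop_grain():
    """Drop one grain at a random cell and return the avalanche size."""
    r, c = random.randrange(SIZE), random.randrange(SIZE)
    table[r][c] += 1
    avalanche = 0
    unstable = [(r, c)]
    while unstable:
        i, j = unstable.pop()
        if table[i][j] < THRESHOLD:
            continue
        table[i][j] -= THRESHOLD
        avalanche += 1
        if table[i][j] >= THRESHOLD:      # still unstable after toppling once
            unstable.append((i, j))
        for ni, nj in ((i - 1, j), (i + 1, j), (i, j - 1), (i, j + 1)):
            if 0 <= ni < SIZE and 0 <= nj < SIZE:   # grains at the edge fall off the table
                table[ni][nj] += 1
                if table[ni][nj] >= THRESHOLD:
                    unstable.append((ni, nj))
    return avalanche

# Most drops do nothing; once the pile has built up, a single grain can
# occasionally set off a very large avalanche.
for n in range(50_000):
    size = drop_grain()
    if size > 100:
        print(f"grain {n}: avalanche toppled {size} cells")
```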
ANNA PLOSZAJSKI
Have you ever seen a news story about an unexpected disaster and wondered what on earth went wrong? A bridge collapses, did they use the wrong material? A nuclear power plant explodes, were the safety checks not followed? I'm Anna Ploszajski, and in this episode of Create The Future I'm going to be asking and answering these questions with forensic engineer Sean Brady.
SEAN BRADY
Bad things don’t happen because people gamble and lose. Bad things happen because the bad thing that's about to happen cannot even be imagined by those involved.
ANNA PLOSZAJSKI
Sean's been working in this field for nearly 20 years and today he runs Brady Heywood, a consultancy that helps organisations stay safe. He's also an all-round failure fanatic with an encyclopaedic knowledge of failures from across the world. Let's get into it. So Sean, it'd be great if you could introduce yourself in your own words. Who are you and what do you do?
SEAN BRADY
My name is Dr Sean Brady. I'm a forensic engineer and I'm interested in many things but really interested in why things fail from a technical and organisational perspective and how you go about preventing those sorts of failures.
ANNA PLOSZAJSKI
And can you tell me a bit more about Brady Heywood, your business?
SEAN BRADY
Yeah, we started Brady Heywood about 15 years ago and that was purely to look at forensic engineering and perform forensic engineering, particularly in the structural engineering space. It's since then broadened to really cover engineering failures in general. And we also do a lot of work in health and safety in the high hazard industries as well, which is sort of the flip side of failure, where you're trying to prevent it.
ANNA PLOSZAJSKI
I did a degree in materials science and a huge amount of that was dedicated to how things fail. A worrying amount, some would say. I know you started out as a structural engineer. How did you make the pivot to forensic engineering?
SEAN BRADY
So it started out as structural. I was an academic for a very, very short period of time, came out to Australia from Ireland and started to work on bridges. But it wasn't new design. It was all around how do old bridges actually work? Because, of course, these old bridges don't comply with modern codes. What you end up having to do is really understand how the bridge is actually performing, how it's actually working, as opposed to how we design it to work. And those two things can be quite different. So we spent a lot of time doing real measurements on bridges, measuring the deflections, the strains, all that sort of stuff, and then building models of the bridges, finite element models, to basically explain that behaviour and understand that behaviour. And when you get to forensics, what people really want to know is what was the structure doing right before it fell down or collapsed. So you need that same skill set to be able to come in and answer those questions as you do to understand the behaviour of older structures.
ANNA PLOSZAJSKI
In my head, you're a bit like, have you ever seen the film Legally Blonde?
SEAN BRADY
In passing.
ANNA PLOSZAJSKI
It reminds me of the final scene where Elle sort of solves the case through her knowledge of perms and hair. It feels like that's you but for bridges and stress-strain curves. Is that accurate, would you say?
SEAN BRADY
I don't know the movie well enough to be able to comment. But yeah, what's extraordinary about it is we collapse stuff and knock down stuff in the same way technically as we have for decades. We really don't have new ways of knocking things down.
ANNA PLOSZAJSKI
Oh, interesting. Can you tell me a bit more about that? What do you mean?
SEAN BRADY
If you look at why structures - let's stay with structures for the moment - why structures fall down? Because we get something technically wrong. Okay. What do we get technically wrong? Well, we find we get something technically wrong that we maybe got wrong 20 years ago or 30 years ago and had another failure. And it's quite interesting. You can go to a modern failure and you can almost certainly go back and find a similar failure in the last 250 years where the same issues were actually present. It's all the organisational factors that are allowing us to repeat the same technical mistakes. And I'm sure we'll get into that discussion.
ANNA PLOSZAJSKI
Okay, yeah. So that's what you mentioned at the beginning, that it's not just the structural stuff, it's also all the human factors that contribute to a failure as well.
SEAN BRADY
Yeah, I would say the organisational factors are the key thing there, because "human" is just a little bit limited. It's really how we set up our organisations and how we run our organisations, or our construction projects, for example; they play a key role in allowing a technical issue to get to the point where it culminates in an actual failure.
ANNA PLOSZAJSKI
And is that where your role as a forensic engineer really differs from a structural one, that you look at all of those different angles?
SEAN BRADY
Yeah, so there's a couple of layers to it. There's one which is just the forensic piece, which is the investigation. Then we would separate out the technical skills across all the disciplines. And you really need different people for each of those, depending on what you're investigating. And then the other layer is the organisational investigation, to understand how this situation came about and why the failure wasn't prevented.
ANNA PLOSZAJSKI
Right, so you need forensics experts as well as structural experts, because you've got to look at the whole holistic picture.
SEAN BRADY
I would say you need forensic experts and you need design experts or construction experts. It seems like the most obvious thing: someone who's designed bridges for 30 years, they're the ideal person to investigate why a bridge collapses. But I joke, and it's a terrible joke, but I joke that it's a bit like suggesting that the only way you could investigate a murder is to hire ex-murderers because they know what to do. Well, they know how to do their murder. But that's different, because of course the process of investigating a murder is developing hypotheses and collecting evidence and testing that evidence against the hypotheses and doing all those things. So it's a very different skill set you need to produce a forensically sound piece of work.
ANNA PLOSZAJSKI
It's a dark example, but it does help to clarify things. So thank you. So we've been talking about bridges and structures. Are you able just to give a really brief overview of the scope of failures that you would look at?
SEAN BRADY
Basically everything. So this is everything from full-blown collapse to any defect that is either costing a lot of money to manage or rectify. And then it spans all the different types of engineering as well. So it could be anything from mechanical to electrical, right across the board.
ANNA PLOSZAJSKI
Yeah. A couple of years ago, I tried to write a mystery novel. And I did quite a lot of research on forensic engineering because I wanted materials and components to be part of the mystery. And one really interesting thing I read was that with the old-style incandescent light bulbs they used to have in cars, forensic engineers would be able to ascertain what happened in a car accident by looking at the shape of the filament. It freezes in a certain configuration, and that configuration can tell you whether it was a side impact or a front impact. So yeah, that always stayed in my mind as something that was just, yeah, like an engineering superpower.
SEAN BRADY
Yeah. And it's interesting that in the different disciplines, people do have particular things to check. Usually in structural, we're looking for what we call the initiating event; we're looking for the piece that broke first, because usually once you break something, particularly if it's a bridge, you get a progressive failure after that. So actually identifying what broke first becomes really important to ascertaining what caused the failure. Often you'll say, well, it was a weld or it was a bolt or whatever it was. And then you get to the metallurgical level, the material level, and you go, well, why did the bolt break? What was happening there? And that can bring you to a whole new level of causation and compliance. On the compliance side, you could now be with a welding expert talking about the way you do welds and what's a compliant weld and what's not. So you can get right down to that level. You never really know when you start where you're going to end up at the end, but you do get quite pointy on very specific things to try and understand what ultimately caused the failure.
ANNA PLOSZAJSKI
I'd like to dig a bit deeper now into some, forgive the pun, concrete examples. And I wondered if you could tell us the story of a famous example, which was the Florida International University bridge collapse.
SEAN BRADY
Yeah, so this is a bridge that was being built at Florida International University. It was just a single-span bridge. So it's simply supported, which means it just spans from one side to the other. And it's a pedestrian bridge. It's built off site, fabricated off site. One night they move it out over an eight-lane highway. And then one day they are post-tensioning. So they're applying some tension to a cable in one of the members. We don't have to get into the details of that. But the bridge fails. The whole bridge collapses in about point four of a second. There's workers on top of it. One is killed, and the cars were backed up underneath. So eight cars were crushed and five people in the cars were killed. Now, the interesting thing about the failure was that when they poured the bridge off site and they removed the formwork, so this was the first time the bridge had to carry its own weight, they hear cracking sounds. They go looking. They find cracks in the bridge. The acceptable maximum crack width for this bridge was point four of a millimetre, so the thickness of the nail on your little finger. Any bigger than that, you're in trouble. But these cracks were bigger. And then over the next three weeks, these cracks grew and grew and grew. Google the photos if you haven't seen this; the cracks are astonishing. The work continued. Despite all these cracks, people seemed to continuously tell themselves that this wasn't a problem. But one of the National Transportation Safety Board investigators said this was a bridge screaming for help. The second thing that's really interesting was that the failure ultimately happened because incredibly gross design errors were made. So the whole system that we put in place as engineers to catch these errors, that's what failed. Sure, there were technical mistakes made, but the real learning from the failure is about the systems we put in place to make sure everything was okay. That's where the learnings need to come from, and we need to revisit those systems in the future.
ANNA PLOSZAJSKI
And what's the psychology behind that? I guess, you know, thinking of the people that were working on that project, is it a kind of misplaced trust? Is it wilful ignorance? Or sort of … why do these failures in systems happen?
SEAN BRADY
So this could take us a while. So let me hit on a few high points and then we can dig into the ones you're interested in, Anna. First of all, how we set up the system is critical. Who reports to whom? Who has power? How are people rewarded and penalised for how they act within that system? And what I mean by that really is: how do you get paid? I mean, one of the things we can say about construction projects is that subcontractors get rewarded all the time for hiding problems for as long as possible. And then people go, oh, this problem is a big problem now, but you didn't bring it to us earlier. What we definitely seem to see here is that nobody was really taking responsibility for the cracks. Nobody seemed to want to make the call that we need to stop work. And I think, you know, with all these big failures, we see that the warning signs are telling them there's something wrong. But to actually get your head around the idea that that's going to cause a bridge collapse and kill six people, that's so much further to go in your mind. The immediate issue, the immediate risk, of stopping the work, that's very tangible. You know, that's usually got a dollar amount attached to it. That's a much more immediate incentive. We definitely see things like expertise bias: if people believe someone knows what's going on, they'll defer to that person. There's a very famous example in Canada, the Quebec Bridge collapse, which killed over 70 people. They hired one of the world's best engineers, Theodore Cooper, to keep an eye on everything. And you'd imagine that that adds safety to your job. But actually what you find is that everyone goes, well, if Theodore is happy, then everyone is happy. And Theodore was happy, but had got major things wrong. And that's common to all industries. The inability to speak up is a very, very common problem in these failures as well.
ANNA PLOSZAJSKI
And the other thing I was thinking of as well is, you know, you were mentioning that these are the bridges and we have built them in this same way for tens, hundreds of years. And so that must be an element of it, too. You know, we've built 20 of these before. None of them have ever collapsed.
SEAN BRADY
Yeah, and there's a really funny thing with bridges. We call it the 30-year failure cycle, and nobody knows whether it's real or not. OK, every 30 years we have a major bridge problem. And what seems to happen is we find a particular bridge design and we perfect it. So it could be a cantilever bridge, could be a suspension bridge. And, you know, someone does the hard yards in the first design to really make sure this is going to be OK. And then what happens is we all copy it. The problem is we all make it a little bit longer, a little bit longer, a little bit longer, a little bit longer. We had the Quebec Bridge in the 1910s, and that was a cantilever bridge, still the longest cantilever bridge in the world right now, because people didn't build a longer one after that, the collapse was so bad. Then they start building suspension bridges. They were building them anyway, but they really go for it. Then we get Tacoma Narrows about 30 years after that, where we have the famous rocking bridge that tears itself apart. You go 30 years beyond that, they're building box girder bridges. We have the Milford Haven Bridge in the UK. We've got the West Gate Bridge here in Australia, in Melbourne. There were multiple box girder bridge collapses within a two-year period. Again, we just pushed it too far. So then we all start building cable-stayed bridges after that. And everyone expected around the year 2000 we'd have a cable-stayed bridge issue. And we were getting the warning signs. We always get warning signs that we're heading for trouble, but it never happened. Instead, what we got in 2000 was the Millennium Bridge with the horizontal vibrations from people walking on it. And what characterises these failures, which is really fascinating, is that someone does the hard yards, they find out what matters, the primary things you need to think of, and then there's a bunch of secondary issues. And then what happens is, as we make the bridges longer and more slender, because we want to make them look prettier, this secondary issue suddenly creeps up and becomes a primary issue.
ANNA PLOSZAJSKI
And I'm sure people will be thinking, if we're on a 30 year failure cycle, are we expecting a failure now around the year 2030? Or do you reckon we'll get away with it?
SEAN BRADY
Yeah, I mean, who knows whether this is real or not. But the really interesting thing is, yeah, the cable-stayed bridges are still tipped to be the problem because they're the biggest ones we're building in the world right now. They're the ones that are pushing the boundaries in bridge engineering.
ANNA PLOSZAJSKI
Talk to me about the NASA example then, I guess diving a bit more into what we were talking about earlier, the psychology behind why these failures are left to happen.
SEAN BRADY
Yeah, so these are wonderful examples for anyone who believes that technical failures are all we need to understand. So in 1986, we have the Challenger disaster. Many people would be familiar with the details of it, but fundamentally, you've got these two boosters on the side of the main fuel tank on the shuttle. And you've got O-ring seals. Now these are like O-rings many people would be familiar with, but they're a huge diameter because they have to go right round the outside of these boosters, and there's two of them because it's belts and braces, but you shouldn't need two. I mean, one O-ring should keep the hot gases inside the booster and not let them out the side. If the hot gases get out through the side, they hit the main fuel tank or the shuttle and you end up with loss of the vehicle. The O-rings failed and that's why Challenger was lost. The really interesting thing was they had evidence of O-ring failure almost from the very first shuttle flights. And they knew this because they'd go and pull these boosters out of the sea, they'd bring them back, and you would see scorch damage on the secondary O-ring. And that was never meant to happen. So hold that thought. Then we go to the Columbia disaster many years later. The shuttle is re-entering the Earth's atmosphere. It burns up. Loss of shuttle, loss of crew. Why does it burn up? It burns up because there was damage to the heat shield. They can see on the footage that about a briefcase-sized piece of foam fell off and struck the underside of the wing on the shuttle, damaged the thermal protection system, essentially just weakened the inside of the wing, and you tear the shuttle apart. Now, if you go pure technical engineering, there's no connection between those two failures at all. But what you see is something that was coined by Diane Vaughan in her wonderful book, The Challenger Launch Decision, which she calls normalisation of deviance. What normalisation of deviance is, is that you get a deviation between your expected behaviour and your actual behaviour, and then over time you normalise the deviance. You stop treating it as a warning sign. So, what was happening with Columbia was there was an assumption on the designers' part that nothing could ever strike the heat shield. If you struck the heat shield, it could damage it. But from the very first shuttle launches, pieces of foam were falling off and hitting the heat shield, and the shuttles were coming back with these pock marks on them. And so now you've got this interesting connection between Challenger and Columbia, because in Challenger, from the beginning, they've had blow-by on the O-rings. And of course, what happens in both of these cases is the organisation says, oh, this is a concern. But meanwhile, the shuttle keeps flying and it goes up and it comes back down again, and you go look at the O-rings and there's more damage, and you go look at the heat shield, there's more pock marks. And what happens after a while is this deviance, as we said, gets normalised. People say, well, you know what? The fact that it's gone up and come down okay means we don't have a problem anymore. So what happens is, rather than analysis being the basis for deciding it's not a problem, the fact that you've got away with it multiple times tells you it's not a problem. So you get this split between, essentially, people in the organisation who believe it's not a problem purely based on performance versus people who are concerned.
And what happens is the evidence of it not being an issue becomes overwhelming, and with normalisation of deviance, not only do people conclude that the failure hasn't happened, they conclude that it never will happen.
ANNA PLOSZAJSKI
And it's a really striking example. And one thing that I noticed in it is that it was the same organisation, NASA, but what, 17 years apart? There was probably quite a turnover of personnel during that time. And if I think about coming into a project and saying, oh, why do you use two O-rings on this system? And they say, oh, it's sort of how it's always been done. We know that it's safer with two. That sort of becomes accepted and maybe unquestioned. Is there an element of that where there's sort of turnover of people and things get a bit lost in the ether if it's a long-term project?
SEAN BRADY
I don't know whether that specifically happened here in NASA, but we absolutely see that in other projects. What you definitely can say, and what they do say with respect to Columbia and Challenger, was that NASA as an organisation hadn't changed sufficiently. In other words, the organisational problems that contributed to Challenger had not been fixed. And then they just re-manifested as a different technical result. The key thing you want is: how well does bad news flow up through your organisation, and how effectively is that acted upon? That's the difference between organisations who really prevent failures and organisations who are prone to failures. That's the fundamental difference. It's not that they've got better rules and more compliance and all that sort of stuff. It's have they got the feedback loops to understand where the system is not working as they believe it should?
ANNA PLOSZAJSKI
And what seems almost ironic to me is that systems, feedback loops, you know, these are concepts that engineers are very familiar with in machines or, you know, sort of complex electronics or something, and yet they don't apply the same principles to an organisation, or they may not.
SEAN BRADY
This is where it gets really interesting, because I would say developing an understanding of complex systems is probably a key thing we as a profession need to do. I think we engineers love simple systems. So simple systems are systems that we can break down and understand by understanding the components or agents in the system. Complex systems are different because we can't understand them by breaking them down into their components or agents. So a classic example is you want to go and understand how an ant colony works. That's a classic complex system, capable of very sophisticated organised behaviour. If you go and you catch an ant, you will learn a lot about an ant, but you learn absolutely nothing about the system. And the reason is complex systems are defined essentially by their components or agents and how they interact with one another. A really good example, as we've said, is a plane: a plane is a complicated system, but put pilots in it, put air traffic control in it, and now it's a complex system.
ANNA PLOSZAJSKI
Can you tell us the story of the Boeing 737 as an example of this system approach, this system failure?
SEAN BRADY
So this goes back to the 737 MAX 8 when it was introduced. They have a plane crash. It was a Lion Air flight in Indonesia; it crashes, everyone on board is killed. The airline is blamed, not the plane. And within, you know, sort of six months after that, we have a second plane crash, and everyone on board is killed there as well. And then suddenly attention starts to turn to the plane. If we wanted to get really technical, we'd blame one sensor on the plane. So on the front of these planes, there's two sensors called angle of attack sensors. They basically measure the angle the plane is flying at relative to the air. That's important because if that gets too steep, you actually lose your aerodynamic ability and the plane can fall out of the sky. So having a sensor to measure how steep you are is really important. They basically discovered that on the Lion Air flight they had a misaligned sensor, it wasn't calibrated properly, and there was an electrical fault on the one on the Ethiopian flight. Then you go, OK, but how do you crash a plane with one sensor? I mean, aviation is all about layers and layers and layers of redundancy. That's what we've learned over the years. So how do we get to a point where one sensor not working has the ability to crash a plane? And when you get into this story, you get into the history of Boeing. Boeing starts off in the early 1900s, and it's all about building quality aircraft. Then they have a merger with McDonnell Douglas, who was one of their rivals, in 1997. McDonnell Douglas was a totally different type of company, very, very focused on cost, shareholder investment, return on shareholder investment, all that sort of stuff. So what they decide is, let's take the old 737 and let's basically put new engines on it. Technically sounds great, but then they go, ooh, if we can make the 737 MAX, as it would become known, fly and handle and perform in the same way as a 737, we could actually say to airlines, hey, you can buy a 737 MAX and you won't have to put your existing 737 pilots in the simulator.
ANNA PLOSZAJSKI
So effectively they don't have to retrain the pilots, is that what you're saying?
SEAN BRADY
They don't have to retrain. As long as they can make that plane handle the same, anyone who's trained on a 737 just has to do a course on an iPad to fly the plane. They don't have to get back in the simulator. So big competitive advantage. And then by … I can't remember exactly what time it was … but they had actually done a deal with Southwest Airlines where they said, if we're not able to pull this off, if you need to put your people in simulators, we'll give you a million dollars a plane to do that. What they basically found, long story short, is that the new engines caused the nose of the plane to kick up at certain times of the flight, which changes the angle of attack and could stall the aircraft. And if the aircraft handled differently, they were going to have to train the pilots, going to have to put them in the simulator. So in the end, what they did was they took a piece of software called MCAS and they put it in the 737 MAX, and MCAS was programmed to be able to take over the plane, essentially: change the angle of the little tail at the back, the horizontal stabiliser, and by doing that, push the nose back down again.
ANNA PLOSZAJSKI
Okay, so the pilots don't need to learn how to handle this new strange behaviour of the plane. That's just a bit of software that will do it automatically.
SEAN BRADY
Absolutely. That was Boeing's rationale, to the point where the manual, if you were a pilot, didn't even mention MCAS. You were going to be able to do your iPad course and step into one of these planes if you were an existing pilot, but you were unaware that on board was a piece of software that could take control of the plane from you. On the first flight, the Lion Air flight, the pilots have no idea what's happening, because what happens is the angle of attack sensor is reading too high. MCAS thinks the nose is too high up, even though it's not. And it keeps pushing the nose back down; over an eight-minute period it pushes the nose down 21 times. And the pilot has to keep pressing this trim switch to bring the nose back up again, but it's just this constant battle until they lose it and they crash. So you see how it's not one sensor that crashes the plane. It's the decision making that took place. And I would say it's very important, the decision making over time, as one decision layers on top of another decision, till we get to a point where all it takes is one sensor with erroneous data to crash the plane. So this is what matters. You know, and I would say if you had to transfer that to construction projects, if people want to go to construction, look at your contract structure. Look at how people get paid. Because if you say to someone in that system, hey, if you're ever worried about this project and want to stop it for safety reasons, please do, you can say that. But if there's a clause in the contract that means they could be liable for delay costs from doing that, what do you think is going to happen?
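A toy way to see the "constant battle" Sean describes: the sketch below assumes, purely for illustration, that one biased angle-of-attack reading repeatedly triggers a nose-down trim command that outpaces the pilot's nose-up corrections. The numbers and the logic are invented for this sketch; they are not Boeing's actual MCAS implementation or real flight dynamics.

```python
# Toy illustration (invented numbers, not real MCAS logic or flight dynamics) of
# single-sensor dependency: the automation trusts one angle-of-attack reading,
# so one faulty sensor drives repeated nose-down commands the pilot must fight.

TRUE_AOA = 5.0          # actual angle of attack, degrees (perfectly fine)
SENSOR_BIAS = 20.0      # the misaligned sensor reads far too high
AOA_LIMIT = 15.0        # reading above which the software intervenes
MCAS_NOSE_DOWN = 2.5    # trim applied by the software per activation
PILOT_NOSE_UP = 1.5     # trim the pilot manages to wind back each cycle

trim = 0.0              # net stabiliser trim; more negative = more nose-down
for cycle in range(1, 22):                 # transcript: roughly 21 activations
    sensed_aoa = TRUE_AOA + SENSOR_BIAS    # only the one faulty sensor is consulted
    if sensed_aoa > AOA_LIMIT:             # software "thinks" the nose is too high
        trim -= MCAS_NOSE_DOWN
    trim += PILOT_NOSE_UP                  # pilot fights back with the trim switch
    print(f"cycle {cycle:2d}: net trim {trim:+.1f}")

# Because the faulty reading never clears, the automation wins the battle:
# net trim drifts steadily nose-down even though the real angle of attack is fine.
```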
ANNA PLOSZAJSKI
We've nearly come to the end of our conversation, Sean. Thank you so much for giving us all those really interesting examples. I'm aware that we've talked mostly about scary things, aircraft coming out of the sky, bridges collapsing. And I'd like to leave our listeners with some good news stories, some places where things aren't failing, where organisations are doing it right. Can you give us an example that you've seen that is really working well?
SEAN BRADY
Yeah, the one that is a really interesting one is commercial air travel. And if you think about it, it's the craziest thing in the world to climb into these tubes with 200-odd people and go flying up in the air with all these other tubes full of 200 people flying around as well. We've made flying so safe we forget how completely hazardous it actually is. The hazards are massive, and they manage it because of all the things we've talked about. They have incredible feedback loops. If there's some sort of problem between a pilot and an air traffic controller, a communication issue or whatever, they have a dual reporting system. And the penalties for not reporting are worse than what could happen to you if you did report. Commercial air travel, air traffic control, nuclear really are at the top of the tree in terms of making things really, really safe. And one of the interesting things about that is we think making things safe must be about having more and more rules. And what all the theory tells us is that that's not the case. And the reason is you can never design a perfect system to keep people safe. You depend on the feedback loop telling you where that system is weak. That's probably the biggest takeaway from those industries.
ANNA PLOSZAJSKI
Sean, if our listeners want to find out more about forensic engineering, where can they go?
SEAN BRADY
Oh, there's a few great books. 15 years ago it was quite a rare thing, but now it's much better. There's a forensic engineering journal from the Institution of Civil Engineers. We do a lot of podcasting. So if you're interested in any of the safety stuff that we talked about, we have a podcast called Rethinking Safety. If you're interested in learning from engineering failures, we've a podcast called the Brady Heywood podcast, and that gets into the technical and the organisational causes. And then the one we have out at the moment is called Simplifying Complexity; if complexity science is your thing, then that's definitely the one to go to.
ANNA PLOSZAJSKI
Wonderful. Thank you, Sean. Thanks for coming on Create the Future. It was really fascinating to meet you and talk to you. Going into this conversation, I really did think that failure was mostly just about stress and strain in materials, and I had no idea how much communication and organisational systems mattered too. What surprised me the most is how much we humans contribute to the complexity in engineering systems, and how we engineers spend years studying these systems but rarely consider this overall context. My biggest takeaway is that being fearful of failure, being defensive towards negative results, and having our heads in the sand are perhaps the biggest mistakes that we can make. You've been listening to Create The Future, a podcast from the Queen Elizabeth Prize for Engineering and Peanut & Crumb. This episode was presented by me, Anna Ploszajski, and was produced by Jude Shapiro. To find out more, follow QEPrize on Twitter, Instagram, and Facebook.