Single Fault Tolerance

I think Crossbo’s beloved Aunt Julie jinxed him this week. After passing along the information that we had a hunt today to make up for Wednesday’s weather cancellation, she asked if “her boy” was shod. I somewhat indignantly replied that of course, he was. Brian had been here just a few days earlier to replace Arthur’s missing shoe, and he tightened one of Crossbo’s and checked all the others, and all systems were set for launch. This morning, I walked out in the pasture to catch the big clown, looked at his hind feet, and said “You sonuvabitch!” Somehow he’d managed to lose a hind shoe.

This wasn’t the first time my plans had been skewered by a lost shoe. There’s small comfort in the knowledge that it’s not just me; there’s even a poem about the loss of a horseshoe nail causing the loss of a kingdom. The equestrian world seems to be ruled by Murphy’s Law. In fact, it’s been said that Murphy obviously wasn’t a horseman, because he was too optimistic. I’d be willing to bet that whoever said that didn’t know who Murphy really was, but it’s true. He wasn’t a horseman; he was an Air Force engineer.

And, speaking of horsemen and engineers and lost horseshoe nails, a friend who is a NASA engineer (yes, she really is a blonde rocket scientist), and an equestrian, recently mentioned “single fault tolerance in hunting horses”. Not thinking like an engineer, my first reaction was “Single fault tolerance? I think I’ve tolerated a lot more than a single fault in a few horses, and they’ve mostly been pretty good about tolerating my multiple faults.” Then I realized what she meant.

In geekspeak, “fault tolerance” refers to a system’s ability to continue functioning despite the failure of a component (i.e., the system can “tolerate a fault”). A “single fault tolerant” system is one that will not be rendered inoperable by a single fault. In some systems, where the cost of failure is high, designers strive for double or triple fault tolerance. With something as complex as the space shuttle, it would be nice to bring the crew home alive even if two or three things break.

With Arthur on standby, my hunting support system is single fault tolerant. Under some circumstances, we might recover from a double fault. In the event of two failures in a single equine unit, we could still accomplish our mission. But, with one failure in each unit, we would be forced to abort, so we are not truly double fault tolerant. At this point, striving for double fault tolerance would mean returning Shadowfax to active service. In spite of the fact that he appears physically sound and is a year younger than Arthur, that’s not a tempting prospect. I think I’ll just declare myself happy with “single fault tolerance in hunting horses”.

So this morning, rather than disgustedly walking back to the barn with an empty lead rope, as I have done in years past, I looked at my backup equine unit and said “Arthur, you da man!” And, indeed, he was da man today.

In deference to his age and lack of conditioning (last Friday was the first time he was ridden since November), I opted to take it easy. It’s not unusual for some fault-tolerant systems to provide reduced performance or capacity when operating in failure mode, and the reduction is often considered an acceptable alternative compared to the cost of full redundancy.

Last week’s backup mode was not actually the result of a system failure. Both equine units were in peak operational condition. But, it’s essential for any backup system to be exercised occasionally, to ensure that it will actually be able to perform in the event of a real failure. And, as it turned out, last week’s test run of the backup unit was a good preparation for this week’s recovery mode.

Last week, in order not to overload the backup unit with an abrupt return to full capacity after an extended idle period, we started out in second field. Later in the day, when it became obvious that the pace was not going to be extreme, we moved up to allow the unit’s jumping function to be tested. All systems were go.

Today, encouraged by last week’s success, we started in front, acknowledging that we might not have the capacity for a full mission if things got hectic. Blastoff was fairly smooth. The first jump, with very little warmup, was encouraging, maybe too encouraging.

I don’t know if it was cockiness or complacency from our original success, or some other problem, but we managed to botch the next jump. We got over, on the first try, but we clobbered the jump pretty hard.

After the pilot was shaken back to a state of heightened alert by that scare, things seemed much smoother. Unfortunately, after a few more flawless jumps, some exterior damage to the craft was noticed, probably as a result of the undesirable contact with the second jump. A hind leg was bleeding, although not profusely.

After some minor damage control, it was determined that the craft’s performance was not impaired, and continued operation posed no significant risk. We continued with the mission until our next rendezvous with the second field.

At that time, the decision was made to avoid any risk of stress by dropping back to the second field. The superficial damage to the craft was not really a concern, but the overall condition of the equine unit had to be taken into consideration. Although he was showing no signs of tiring, I felt that it would be wiser to slow the pace at a point where the option was easily available, rather than risking discovering I had no horse left in the middle of a screaming run with no easy exit strategy. I was happy with the knowledge that the backup unit was still capable of normal function, at least for a short period of time, and didn’t really want to discover how much more it would take to exhaust him.
