

Famous Bugs courtesy of Dave Curry (ecn.davy@purdue).

Originally From John Shore (shore@nrl-css)

Some time ago, I sent out a request for documented reports on "famous bugs", promising to summarize the results for all who contributed. This is the promised summary.

I judge the effort as a failure. People encounter bugs and fix bugs and talk about bugs, but they rarely document bugs. Those responding to my request were well meaning and helpful, but I was left feeling a little like my hair was being done by a gossip-laden ex-hacker.

I am, of course, no different. I didn't have documentation for most of the famous bugs I knew about, and I don't have sufficient time, energy, and interest to follow all the leads. I should have known better.

One strong conclusion that I've reached is that many computer system horror stories become known as bugs when in fact they are not -- they're system problems that result from hardware failures, operator mistakes, and the like. Let me mention a few examples.

In his book Software Reliability, Glenford Myers mentioned a number of classical software errors. For example, he mentioned that a software error in the onboard computer of the Apollo 8 spacecraft erased part of the computer's memory. I happen to have a copy of a memo by Margaret Hamilton that summarized the conclusions of a detailed study of every Apollo bug (see below), so I looked under Apollo 8. The best I could find was this:

"16. Illegal P01. P01 was selected during midcourse, illegally by the astronaut. This action destroyed W-matrix (several erasables in AGC)."

The software did erase the memory, but it did so because the astronaut did something illegal, and not because the programmer goofed. This example is characteristic of the Apollo errors (see below), most of which show the need for better exception handling as part of the software specification. But weak specifications are not the same thing as bugs.

Here's another example, also from Apollo.
It starts with a note to me from Kaeler.pa at Xerox (via Butler Lampson):

I heard about the MIT summer student at NASA whose Apollo program filled up memory (with logging information) and flashed the red abort light a few seconds before the first moon landing. The student was in the control room, and after a little thought, said, "Go ahead, I know what it is, I will take responsibility". Luckily, he was right. He was awarded all sorts of medals for being willing to take the responsibility.

You should get this story from the horse's mouth before distributing it. I heard it from Neil Jacobstein (212-454-0212, in New York). He heard it from his advisor at the Johnson Space Center in Houston a few years ago. It might be interesting to trace this back to someone who really knows about it.

I called Jacobstein, and after some discussion he decided that the "bug" was probably the famous "1201 descent alarm" that I mention below. Again, this was caused by an astronaut error (a radar was turned on when it should have been off).

Lots of people mentioned to me various NORAD "bugs" that caused alerts. I got a copy of a Senate Armed Services Committee Report of 9 October, 1980, "Recent False Alerts From the Nation's Missile Attack Warning System." It deals primarily with the June 1980 alerts, but it contains the following summary:

Oct. 3, 1979 -- An SLBM radar (Mt. Hebro) picked up a low orbit rocket body that was close to decay and generated a false launch and impact report.

November 9, 1979 -- False indications of a mass raid caused by inadvertent introduction of simulated data into the NORAD Computer System.

March 15, 1980 -- Four SS-N-6 SLBMs were launched from the Kuril Islands as part of Soviet troop training. One of the launches generated an unusual threat fan.
June 3, 1980 -- False indications caused by a bad chip in a communications processor computer.

According to Borning@Washington (who by the way is studying computer problems in missile warning systems), the cause of the Nov. 1979 problem was as follows:

To test the warning system, false attack data was intermixed with data from actual satellite observations, put on tape, and run through the system. On November 9, the test tape of this sort was accidentally left mounted on a secondary backup computer. This machine was left connected to the system in use. When the primary computer failed, a backup computer was activated, which also failed. Then the secondary computer came into play, causing the alert.

All of these missile alerts were caused by real flying objects, hardware failures, or human error. I'm not saying that bugs didn't cause any missile alerts, just that the ones that are reputed to have been caused by bugs in fact were not.

Perhaps computer software -- as opposed to the combination of hardware, software, and people -- is more reliable than folklore has it. I would be interested in hearing your comments on this proposition.

Despite the foregoing problems, the assembly of responses makes interesting reading. In the following, I'll mention a few well-documented bugs and then append various extracts from what I received. Thanks to all who responded. In most cases, I eliminated duplicates. Special thanks to Peter Neumann (SRI), who seems to be keeping better track of these problems than anyone else. Many of these don't qualify as bugs by anyone's definition, but they're interesting stories so I've included some of them anyway.

-------------

Space-Shuttle Bug

This may be the most famous of all -- the one that delayed at the last minute the launch of the first space shuttle. The cause was a bug that interfered with the communication between concurrent processes -- an area of programming that is among the least well in hand. John R. Garman wrote about the problem in detail in ACM SIGSOFT Software Engineering Notes (SEN), vol. 6, No. 5, pp 3-10.

--------------

The First Bug

Worth mentioning for the trivia folks, it was a moth that was beaten to death by a relay in the Mark II. It was discovered by the Mark II staff, which included Grace Hopper. (Reports on this are a bit confusing. Many attribute the bug to her; in her own published account she refers to "we". I called and asked her. She said the machine operator actually pulled the moth out, and that it was found by the combined efforts of the staff.) Lots of people mentioned this bug to me; my favorite report attributed the bug to "some little old lady who works for the Navy. She's the one who came up with Cobol, I think."

In fact, this is one of the better-documented bugs. You can even see its picture in the Annals of the History of Computing (vol. 3, July 1981, page 285). It's ironic that "modern" bugs have practically nothing in common with the first one (an exception being Dijkstra's well known remark about testing).

--------------

ARPANET Gridlock

In October 1980 the net was unusable for a period of several hours. It turned out that the routing processes in all the IMPs were consuming practically all resources as the result of processing three inconsistent routing updates. The inconsistency, in turn, arose from dropped bits in a single IMP. Whether you choose to call this a bug or not, clearly it demonstrated a design failure. The details are reported well by Eric Rosen in SEN, January 1981.

---------------------

APOLLO flight experiences

When Margaret Hamilton was working at the Charles Stark Draper Laboratory in the early 1970s, she documented and analyzed in some detail the various "software anomalies" that occurred during several APOLLO flights. Apparently she did this at the request of Marty Shooman.
I don't think that she ever published the results, but some years back she gave us a copy of "Shuttle Management Note #14" (23 October 1972), which summarized her analysis. It makes interesting reading.

One of her strongest conclusions was that 73% of the problems were caused by "real-time human error". Translated roughly into 1983 computer-speak, this means that the APOLLO software wasn't user friendly. (I guess they didn't have icons or something.) Apparently, there was much debate about this during the design, but the software types were told that astronauts have the right stuff or something so there was no need to make the software robust.

One example is quite widely known, as it occurred during the APOLLO 11 landing on the moon. In what was referred to as "1201-1202 Descent Alarms", the software kept restarting as the result of overloading. It turned out the radar switch was in the wrong position and used up 13% more computer time than had been anticipated.

Hamilton states that "pure software errors" were not a problem on APOLLO flights. I guess she means that the software met its specifications, which is quite an accomplishment. But the specifications apparently did not say much about error detection and recovery. Hamilton states that "all potentially catastrophic problems would have been prevented by a better and/or known philosophy of providing error detection and recovery via some mechanism."

________________________

Nuclear Reactor Design Program

I don't know when the bug was first introduced, but it transpired in 1979. From Jim Horning (Horning.pa@parc-maxc):

A belatedly-discovered bug in a stress analysis program (converting a vector into a magnitude by summing components--rather than summing absolute values; module written by a summer student) caused a number of nuclear reactors to be closed down for checks and reinforcement about three years ago (not long after TMI).
This was fairly widely discussed in the press at the time, but I never did see how the lawsuits came out (if, indeed, they have been completed).

From br10@cmu-10a came the following newswire stories:

a023 0026 16 Mar 79
PM-Topic, Advisory,
Managing Editors:
Wire Editors:

It all started in a tiny part of a computer program used by an engineering firm designing nuclear reactors. It ended with the shutdown of five nuclear power plants at a time when President Carter is pushing oil conservation and the world oil market is in turmoil.

The computer miscalculated some safety precautions required by law. The power from the closed plants now will have to be replaced by electricity generated with oil or coal. This may cost utility customers money and throw a curve at Carter's conservation program.

In Today's Topic: The Little Computer and the Big Problem, AP writer Evans Witt traces this glitch in the system, from the obscure computer to its possible effect on the nation's energy problems.

The story, illustrated by Laserphoto NY7, is upcoming next.

The AP
ap-ny-03-16 0328EST

***************

a024 0044 16 Mar 79
PM-Topic-Glitch, Bjt,950
TODAY'S TOPIC: The Little Computer and the Big Problem
Laserphoto NY7
By EVANS WITT
Associated Press Writer

WASHINGTON (AP) - Something just didn't add up.

And the result is: five nuclear power plants are shut down; millions of Americans may pay higher utility bills; and a sizable blow may have been struck to President Carter's efforts to reduce the use of imported oil and to control inflation.

The immediate source of all this is part of the federal bureaucracy - the Nuclear Regulatory Commission which ordered the shutdowns.

But in one sense, the ultimate culprit was ''Shock II,'' a tiny part of a computer program used by a private firm to design the power plants' reactors.

Shock II was wrong and that means parts of the five reactors might not survive a massive earthquake. Shock II was the weak link that could have allowed the chain to snap.
In between Shock II and the shutdowns were a public utility, a private engineering firm and the NRC staff. It was really the judgments of the dozens of scientists and engineers, not elected or appointed officials, that led to the shutdowns.

Perhaps as a result, the decision's impact on the nation's energy situation was not even considered until the very last moment - when the commission itself was faced with the final decision.

And at that point, the NRC said, it had no choice. It said the law was clear: serious questions about the reactors had been raised and the reactors had to be turned off until answers were found.

The specific questions are arcane engineering issues, but the explanation is straightforward: Will some of the systems designed to protect the reactor survive an earthquake - or will they fail, and possibly allow radioactive death to spew into the air?

The regulations say the reactors must be able to withstand a quake equal to the strongest ever recorded in their area. The regulations don't allow any consideration of the likelihood of a major quake. All four states where the reactors are located - New York, Pennsylvania, Maine and Virginia - have had minor quakes in this decade and damaging quakes at least once in this century.

The only way to test them - short of having a massive earthquake - is to test a model of the reactor. The ''model'' is actually a set of mathematical formulas in a computer that reflect how the reactor and its parts will behave in a quake.

The model used for the five reactors came from Stone and Webster, the large Boston engineering and architectural firm that designed the plants. The Stone and Webster model indicated how strong and well supported pipes had to be and how strong valves had to be.

The problem apparently cropped up after Stone and Webster suggested within the last few months more pipe supports in the secondary cooling system of the reactor at Shippingport, Pa., operated by Duquesne Light Co. in Pittsburgh.
But why were the supports needed? ''This was not clear to us, looking at the calculations done by the models,'' said Gilbert W. Moore, Duquesne's general superintendent of power stations.

So Duquesne - and Stone and Webster - sent the computer models through their paces again, having them calculate and recalculate what would happen to the pipes in an earthquake.

''We came out with some numbers which were not in the range we would like,'' Moore said.

That made the problem clear - the model now said the pipes might break in an earthquake. The previous analysis indicated an adequate safety margin in the pipes, and Stone and Webster's explanation was: ''One subroutine may not give uniformly conservative results.''

The problem was in a ''subroutine,'' a small part of the computer model, called ''Shock II,'' said Victor Stello, director of NRC's division of reactor operations.

''The facts were that the computer code they were using was in error,'' said Stello. ''Some of the computer runs were showing things are okay. In some cases, the piping systems were not okay.

''We didn't know the magnitude of the error or how many plants might be affected,'' he said.

It was on March 1 that Duquesne told the NRC of the problem by telephone and asked for a meeting to discuss it. The same day, Energy Secretary James R. Schlesinger was telling Congress that unleaded gas might cost $1 a gallon within a year and service stations might be ordered shut down on Sundays because of oil shortages.

The meeting took place on Thursday, March 8, in Washington with NRC staff, Stone and Webster engineers and Duquesne Light people on hand.

Through the weekend, Stello said, engineers from NRC, Duquesne and Stone and Webster worked at the private firm's Boston office, analyzing the severity of the problem.

''By the middle of Sunday (March 10) we begin to get a pretty good idea of what it meant for the systems,'' Stello said.
''Monday, we got the latest information from our people at the Stone and Webster offices. It became clear that there would be a number of the safety systems that would have stresses in excess of allowable limits. The magnitude of the excess was considerable.''

Tuesday, members of the NRC were briefed by their staff of engineers and scientists. They asked for an analysis of the economic impact of the decision, and then ordered the plants closed within 48 hours.

And the five reactors shut down: Duquesne Light Co.'s Beaver Valley plant at Shippingport, Pa.; Maine Yankee in Wiscasset, Maine; the Power Authority of New York's James Fitzpatrick plant at Scriba, N.Y.; and two Virginia Electric and Power Co. reactors at Surry, Va.

It may take months to finish the analysis of the potential problems and even longer to make changes to take care of the situation.

Until the reactors start generating again, the utilities will have to turn to plants using oil or coal. This may cost more, and that cost may be borne by the millions of utility customers.

To replace the power from these nuclear plants could require 100,000 barrels of oil a day or more. And this at a time when President Carter has promised to cut U.S. oil consumption by 5 percent - about 1 million barrels a day - and when the world's oil markets are in turmoil because of recent upheavals in Iran.

-------------------------------

Summary of various problems from NEUMANN@SRI-AI

Review of Computer Problems -- Catastrophes and Otherwise

As a warmup for an appearance on a SOFTFAIR panel on computers and human safety (28 July 1983, Crystal City, VA), and for a new editorial on the need for high-quality systems, I decided to look back over previous issues of the ACM SIGSOFT SOFTWARE ENGINEERING NOTES [SEN] and itemize some of the most interesting computer problems recorded. The list of what I found, plus a few others from the top of the head, may be of interest to many of you.
Except for the Garman and Rosen articles, most of the references to SEN [given in the form (SEN Vol No)] are to my editorials.

SYSTEM --
  SF Bay Area Rapid Transit (BART) disaster [Oct 72]
  Three Mile Island (SEN 4 2)
  SAC: 50 false alerts in 1979 (SEN 5 3); simulated attack triggered a live scramble [9 Nov 79] (SEN 5 3); WWMCCS false alarms triggered scrambles [3-6 Jun 80] (SEN 5 3)
  Microwave therapy killed arthritic patient by racing pacemaker (SEN 5 1)
  Credit/debit card copying despite encryption (Metro, BART, etc.)
  Remote (portable) phones (lots of free calls)

SOFTWARE --
  First Space Shuttle launch: backup computer synchronization (SEN 6 5 [Garman])
  Second Space Shuttle operational simulation: tight loop on cancellation of early abort required manual intervention (SEN 7 1)
  F16 simulation: plane flipped over crossing equator (SEN 5 2)
  Mariner 18: abort due to missing NOT (SEN 5 2)
  F18: crash due to missing exception condition (SEN 6 2)
  El Dorado: brake computer bug causing recall (SEN 4 4)
  Nuclear reactor design: bug in Shock II model/program (SEN 4 2)
  Various system intrusions ...

HARDWARE/SOFTWARE --
  ARPAnet: collapse [27 Oct 1980] (SEN 6 5 [Rosen], 6 1)
  FAA Air Traffic Control: many outages (e.g., SEN 5 3)
  SF Muni Metro: Ghost Train (SEN 8 3)

COMPUTER AS CATALYST --
  Air New Zealand: crash; pilots not told of new course data (SEN 6 3 & 6 5)
  Human frailties: Embezzlements, e.g., Muhammed Ali swindle [$23.2 Million], Security Pacific [$10.2 Million], City National, Beverly Hills CA [$1.1 Million, 23 Mar 1979]
  Wizards altering software or critical data (various cases)

SEE ALSO A COLLECTION OF COMPUTER ANECDOTES SUBMITTED FOR the 7th SOSP (SEN 5 1 and SEN 7 1) for some of your favorite operating system and other problems...

[Muni Metro Ghosts]

The San Francisco Muni Metro under Market Street has been plagued with problems since its inauguration.
From a software engineering point of view, the most interesting is the Ghost Train problem, in which the signalling system insisted that there was a train outside the Embarcadero Station that was blocking a switch. Although in reality there was obviously no such train, operations had to be carried on manually, resulting in increasing delays, and finally passengers were advised to stay above ground. This situation lasted for almost two hours during morning rush hour on 23 May 1983, at which point the nonexistent train vanished as mysteriously as it had appeared in the first place. (The usual collection of mechanical problems also has arisen, including brakes locking, sundry coupling problems, and sticky switches. There is also one particular switch that chronically causes troubles, and it unfortunately is a weakest-link single point of failure that prevents crossover at the end of the line.)

---------------------

Problems mentioned in the book Software Reliability, Glen Myers

Myers mentions a variety of problems. One famous one (lots of people seem to have heard about it) is the behavior of an early version of the ballistic missile early warning system in identifying the rising moon as an incoming missile. Myers points out that, by many definitions, this isn't a software error -- a problem I discussed a bit at the beginning of this message.

Other problems mentioned by Myers include various Apollo errors I've already mentioned, a 1963 NORAD exercise that was incapacitated because "a software error caused the incorrect routing of radar information", and the loss of the first American Venus probe (mentioned below in more detail). Mentioned with citations were an Air Force command system that was averaging one software failure per day after 12 years in operation, deaths due to errors in medical software, and a crash-causing error in an aircraft design program. I was not able to follow up on any of the citations.
__________________________________
__________________________________

The rest of this message contains excerpts from things I received from all over, in most cases presented without comments.

__________________________________
__________________________________

faulk@nrl-css (who got it elsewhere)

Today, I heard a good story that is a perfect example of the problems that can arise when the assumptions that one module makes about another are not properly documented. (The story is from a system engineer here whose father got it from the Smithsonian Space people.)

Apparently, the Jupiter(?) probe to Mars could have programs beamed to it which it would load in internal memory. The system engineers used this property to make mission changes and/or corrections. After the probe had been on Mars for a while, memory started getting tight. One of the engineers had the realization that they no longer needed the module that controlled the landing so the space could be used for something else. The probe was sent a new program that overwrote the landing module. As soon as this was accomplished, all contact was lost to the probe.

Looking back into the code to find what had gone wrong, the programmers discovered that because the landing module had to have information on celestial navigation, some or all of the celestial navigation functions were included in the landing module. Unfortunately, the antenna pointing module also required celestial navigation information to keep the antenna pointed at earth. To do this, it used the navigation functions in the landing module. Overlaying the module has left the antenna pointing in some unknown direction and all contact with the craft has been lost forever.

Fortunately, all of the mission requirements had been fulfilled so it was no great loss. It can live on as a great example of bad design.

-------------------------

mo@LBL-CSAM

The folklore tells of a bug discovered during the fateful flight of Apollo 13.
It seems that the orbital mechanics trajectory calculation program had a path which had never been exercised because of the smooth, gentle orbital changes characteristic of a nominal Apollo flight. However, when the flight dynamics team was trying ways to get them home with the aid of much more creative maneuvers, the program promptly crashed with a dump (running on IBM equipment, I believe). The story goes that the fix was simple - something on the order of a missing decimal, or a zero-oh reversal, (Divide by zero!!!!!) but there was much consternation and tearing of hair when this critical program bought the farm in the heat of the moment.

This was related to me by an ex-NASA employee, but I have heard it through other paths too. I guess the NASA flight investigation summary would be one place to try and verify the details.

----------------------------

jr@bbncd

One cute one was when the Multics swapper-out process swapped out the swapper-in process. (Recall that all of the Multics OS was swappable.)

------------------------------------

dan@BBN-UNIX

Here in Massachusetts we've recently begun testing cars for emissions. All car inspections are carried out at gas stations, which in order to participate in the program had to buy a spiffy new emissions analyzer which not only told you what your emissions were, but passed judgement on you as well, and kept a record on mag tape which was sent to the Registry of Motor Vehicles so that they could monitor compliance.

Well, on June 1 the owners of the cheaper ($8K) of the two acceptable analyzers discovered that their machines could not be used; they didn't like the month of June! The company which built them, Hamilton, had to apply a quick fix which told the machines that it was actually December (!?). Lots of people were inconvenienced.

Unfortunately all I know about this at the moment is what the Boston Globe had to say, so I don't know what the actual problem was.
The article said that the quick cure involved replacing the "June" chip with the "December" chip; I don't know what that means, if anything. Electronic News or Computerworld ought to have more accurate information.

Don't forget about the rocket launch early in the space program which had to be aborted because the Fortran program controlling it believed that the number of seconds in a day was 86400 (rather than the sidereal time figure).

The recent issue of Science News with the cover story on when computers make mistakes mentioned a story about a graduate student who almost didn't get his thesis due to inaccuracies in the university computer's floating point software. Not really earthshaking, except to him, I suppose.

-----------------------------

STERNLIGHT@USC-ECL

I don't have the data, but there were at least two "almost launches" of missiles. The rising moon was only one. You might try contacting Gus Weiss at the National Security Council-- he will be able to tell you quite a bit. Mention my name if you like.

[I called Weiss, who didn't have much to say. He kept repeating that the problems were fixed--js]

-------

mark@umcp-cs

The following is probably not famous except with me, but deserves to be.

True story: Once upon a time I managed to write a check so that my bank balance went exactly to zero. This was not so unusual an occurrence, as my checking account had an automatic loan feature in case of overdraft, and I used this feature occasionally. Negative and positive balances were therefore well known in this account. Not so zero balances.

Soon after writing this check I attempted to withdraw some funds using my money machine card. Unsuccessful. I attempted to DEPOSIT money via the machine. Unsuccessful. I talked to a person: they had no record of my account ever having existed.
After several trips to the bank, each time up one more level in the management hierarchy, I, the bank managers and me, discovered the following: The bank's computer had been programmed so that the way to delete an account was to set the balance to zero. When I wrote my fatal zeroing check the computer promptly forgot all about me. Only my passion for paper records, and the bank's paper redundancy, enabled the true story to emerge and my account to be restored.

Interestingly, no funds were ever in danger, since the account was closed with NO MONEY in it. Nonetheless, the inconvenience was considerable. Once the situation became straightened out I immediately transferred my account to another bank, writing a letter to the first bank explaining my reasons for doing so.

--------------------------------

craig@umcp-cs

The most famous bug I've ever heard of was in the program which calculated the orbit for an early Mariner flight to Venus. Someone changed a + to a - in a Fortran program, and the spacecraft went so wildly off course that it had to be destroyed.

--------------------------------

fred@umcp-cs

Some examples of bugs I've heard about but for which I don't have documentation: (a) bug forced a Mercury astronaut to fly a manual re-entry; . . .

There was something about this on the Unix-Wizards mailing list a while back. The way I understand it, a programmer forgot that the duration of the Mercury Capsule's orbit had been calculated in sidereal time, and left out the appropriate conversion to take into account the rotation of the Earth beneath the capsule. By the end of the mission the Earth had moved several hundred miles from where it ``should'' have been according to the program in question. Sorry I can't give you any definite references to this.

---------------------

KROVETZ@NLM-MCS

I've heard of two bugs that I think are relatively famous:

1. A bug in a FORTRAN program that controlled one of the inner planet fly-bys (I think it was a fly-by of Mercury).
The bug was caused because the programmer inadvertently said DO 10 I=1.5 instead of DO 10 I=1,5. FORTRAN interprets the former as "assign a value of 1.5 to the variable DO10I". I heard that as a result the probe went off course and never did the fly-by! A good case for using variable declarations.

2. I'm not sure where this error cropped up, but in one of the earlier versions of FORTRAN a programmer passed a number as an actual argument (e.g. CALL MYPROC(2)) and within the procedure changed the formal argument. Since FORTRAN passes arguments by reference this had the result of changing the constant "2" to something else! Later versions of FORTRAN included a check for changing an argument when the actual is an expression.

-------------------------------------

uvicctr!dparnas

From jhart Thu Jun 9 13:30:48 1983
To: parnas

San Francisco Bay Area Rapid Transit, reported in Spectrum about two years ago. "Ghost trains", trains switched to nonexistent lines, and best of all the rainy day syndrome.

-------------------------------------

uvicctr!uw-beaver!allegra!watmath!watarts!geo

One of my pals told me this story. One morning, when they booted, years ago, the operators on the Math faculty's time-sharing system set the date at December 7th, 1941 (ie Pearl Harbor). Well the spouse of the director of the MFCF (ie Math Faculty Computing Facility) signed on, was annoyed by this, and changed the date to the actual date. Everyone who was signed on while this was done was charged for thirty-something years of connect time. I wouldn't know how to document this story.

Oh yes, didn't Donn Parker, self-proclaimed computer sleuth, call the fuss made over UNIX and intelligent terminals some outrageous phrase, like 'The bug of the century'? I am referring to the fuss made over the fact that some of the terminals that Berkeley had bought were sufficiently intelligent that they would do things on the command of the central system.
The danger was that if someone was signed on to one of these terminals as root, an interloper could write something to this terminal causing the terminal to silently transmit a string back to UNIX. Potentially, this string could contain a command line giving the interloper permissions to which they were not entitled.

Cordially, Geo Swan, Integrated Studies, University of Waterloo
allegra!watmath!watarts!geo

----------------------------

smith@umcp-cs

John, another bug for your file. Unfortunately it is a rumor that I haven't tried to verify. Recall that FORTRAN was developed on the IBM 704. One of the 704's unusual features was that core storage used signed magnitude, the arithmetic unit used 2's complement, and the index registers used 1's complement. When FORTRAN was implemented on the IBM product that replaced the 704, 7094 etc. series, the 3-way branching IF went to the wrong place when testing negative zero. (It branched negative, as opposed to branching to zero.) I heard this rumor from Pat Eberlein (eberlein@buffalo). Supposedly, the bug wasn't fixed (or discovered) for two years.

----------------------------

VES@KESTREL

1. In the mid 70's in a lumber processing plant in Oregon a program was controlling the cutting of logs into boards and beams. The program included an algorithm for deciding the most efficient way to cut the log (in terms of utilizing most of the wood), but also controlled the speed with which the log was advancing.

Once, the speed of a log increased to dangerous levels. All personnel scattered and were chased out of the building, and the log jumped off the track; fortunately there were no casualties. This was caused by a software bug. A reference to the event would be the former director of the Computer Center at Oregon State University (prior to 1976), who at the time I heard the story (Spring 1977) was President of the company which developed the software.

2. Another rather amusing incident dates back to the mid 60's.
It was not caused by a software bug but is indicative of the vulnerability of software systems, particularly in those early days. It involved the Denver office of Arizona airlines. Their reservation system was periodically getting garbage input. Software experts were dispatched but failed to identify the cause. Finally the software manager of the company which developed the system went to study the problem on the spot. After spending a week at the site he managed to identify a pattern in the generation of garbage input: it was happening only during the shifts of a particular operator and only when coffee was served to her. Shortly afterwards the cause was pinpointed. The operator was a voluminous lady with a large belly. The coffee pot was placed behind the terminal and when she would reach for it her belly would rest on the keyboard. Unfortunately, I don't have more exact references to that event.

----------------------------
RWK at SCRC-TENEX

There's the famous phase-of-the-moon bug which struck (I believe) Gerry Sussman and Guy Steele, then both of MIT. It turned out to be due to code which wrote a comment into a file of LISP forms, that included the phase of the moon as part of the text. At certain times of the month, it would fail, due to the comment line being longer than the "page width"; they had failed to turn off automatic newlines that were being generated by Maclisp when the page width was exceeded. Thus the last part of the line would be broken onto a new line, not preceded by a ";" (the comment character). When reading the file back in, an error would result.

--------------------------------
gwyn@brl-vld

An early U.S. Venus probe (Mariner?) missed its target immensely due to a Fortran coding error of the following type:

    DO 10 I=1.100

Which should have been

    DO 10 I=1,100

The first form is completely legal; it default-allocates a REAL variable DO10I and assigns 1.1 to it!
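
The DO 10 I=1.100 story turns on the fact that fixed-form Fortran ignores blanks, so the statement is read as if it were written DO10I=1.100. The following is a minimal Python sketch of that effect, not a real Fortran parser: the function name classify_statement and the two regular expressions are made up for illustration and cover only these two statement shapes.

```python
import re

# Toy patterns, applied after blanks are removed (as fixed-form Fortran does).
# A DO loop needs a label, a variable, "=", and at least one comma.
DO_LOOP = re.compile(r"DO(\d+)([A-Z][A-Z0-9]*)=([^,]+),(.+)")
ASSIGNMENT = re.compile(r"([A-Z][A-Z0-9]*)=(.+)")


def classify_statement(line):
    """Classify a statement as a DO loop or an assignment (toy model only)."""
    squeezed = line.replace(" ", "").upper()
    m = DO_LOOP.fullmatch(squeezed)
    if m:
        label, var, start, end = m.groups()
        return ("do_loop", label, var, start, end)
    m = ASSIGNMENT.fullmatch(squeezed)
    if m:
        var, value = m.groups()
        return ("assignment", var, value)
    return ("unknown", squeezed)


if __name__ == "__main__":
    print(classify_statement("DO 10 I=1,100"))  # comma: parsed as a loop
    print(classify_statement("DO 10 I=1.100"))  # period: parsed as an assignment
```

With the comma the toy classifier sees a loop over I; with the period it sees an assignment to a variable named DO10I, which is the behaviour the message describes.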
-------------------------------------
Horning.pa@PARC-MAXC

I have heard from various sources (but never seen in print) the story that the problem with the wings of the Lockheed Electras (that caused several fatal crashes) slipped past the stress analysis program because of an undetected overflow. This one would probably be next to impossible to document.

One of my favorite bugs isn't all that famous, but is instructive. In about 1961, one of my classmates (Milton Barber) discovered that the standard integer binary-to-decimal routine provided by Bendix for the G-15D computer wasn't always exact, due to accumulated error from short multiplications. This only affected about one number in 26,000, but integer output OUGHT to be exact. The trick was to fix the problem without using any additional drum locations or drum revolutions. This occupied him for some time, but he finally accomplished it. His new routine was strictly smaller and faster.

But was it accurate? Milton convinced himself by numerical analysis that it would provide the correct answer for any number of up to seven digits (one word of BCD). Just to be safe, he decided to test it exhaustively. So he wrote a loop that counted in binary and in BCD, converted the binary to BCD, and compared the results. On the G-15D this ran at something like 10 numbers per second. For several weeks, Milton took any otherwise idle time on the college machine, until his loop had gone from 0 to 10**7-1 without failure.

Then he proudly submitted his routine to Bendix, which duly distributed it to all users. Soon thereafter, he got a telephone call: "Are you aware that your binary to decimal conversion routine drops the sign on negative numbers?" This is the most exhaustive program test that I've ever seen, yet the program failed on half its range!

-----------
Horning.pa

[Excerpts from a trip report by Dr. T. Anderson of the University of Newcastle upon Tyne.]
The purpose of my trip was to attend a subworking group meeting on the production of reliable software, sponsored by NASA, chaired by John Knight (University of Virginia), and organized and hosted by the Research Triangle Institute [Nov. 3-4, 1981]. Essentially, NASA would like to know how on earth software can be produced which will conform to the FAA reliability standards of 10^-9 failures/hour. Sadly, no one knew.

FRANK DONAGHE (IBM FEDERAL SYSTEMS DIVISION): PRODUCING RELIABLE SOFTWARE FOR THE SPACE SHUTTLE

Software for the Space Shuttle consists of about 1/2 million lines of code, produced by a team which at its largest had about 400 members. Costs were high at about $400 per line. . . . Between the first and second flight 80% of the modules were changed, such that about 20% of the code was replaced. Three weeks prior to the second flight a bug in the flight software was detected which tied up the four primary computers in a tight (two instruction) loop. .

-----------
Laws@SRI-AI

Don't forget the bug that sank the Sheffield in the Argentine war. The shipboard computer had been programmed to ignore Exocet missiles as "friendly." I might be able to dig up an IEEE Spectrum reference, but it is not a particularly good source to cite. The bug has been widely reported in the news media, and I assume that Time and Newsweek must have mentioned it.

I'll try to find the reference, but I must issue a disclaimer: some of my SRI associates who monitor such things more closely than I (but still without inside information) are very suspicious of the "bug" explanation. "Computer error" is a very easy way to take the heat off, and to cover what may have been a tactical error (e.g., turning off the radar to permit some other communication device to function) or a more serious flaw in the ship's defensive capability.

-------
PARK@SRI-AI

From International Defense Review and New Scientist after the Falklands war ...
The radar system on the Sheffield that didn't report the incoming Exocet missile because it wasn't on the list of missiles that it expected a Russian ship to use.

The missiles fired over the heads of British troops on the Falklands beaches at the Argentinians, that could have gone off if they had detected enough metal below them (probably not really software).

The missile that was guided by a person watching a tv picture of the missile from a shipboard camera. A flare on the tail of the missile intended to make the missile more visible to the camera tended to obscure the target. A hasty software mod made the missile fly 20 feet higher (lower?) so that the operator could see the target.

A more general consideration is that in the event of an electromagnetic pulse's deactivating large numbers of electronic systems, one would prefer that systems like missiles in the air fail safe.

--------------------------------
Laws@SRI-AI

I have scanned Spectrum's letters column since the original Exocet mention in Oct. '82, but have not found any mention of the bug. Perhaps I read the item on the AP newswire or saw a newspaper column posted on a bulletin board here at SRI. Possibly Garvey@SRI-AI could give you a reference. Sorry.

--------------------
Garvey@SRI-AI

Subject: Re: Falkland Islands Bug

The original source (as far as I know) was an article in New Scientist (a British magazine) on 10 Feb 83. It suggested that the Exocet was detected by the Sheffield's ESM gear, but catalogued as a friendly (!) missile, so no action was taken.
I have two, strictly personal (i.e., totally unsubstantiated by any facts or information whatsoever) thoughts about this: 1) I suspect the article is a bit of disinformation to cover up other failings in the overall system; from bits and pieces and rumors, I would give top billing to the possibility of poor spectrum management as the culprit; 2) I wouldn't care if the missile had the Union Jack on the nose and were humming Hail Britannia, if it were headed for me, I would classify it as hostile!

______________
PALMER@SCRC-TENEX

I was working for a major large computer manufacturer {not the one that employs me today}. One of the projects I handled was an RPG compiler that was targeted to supporting customers that were used to IBM System III systems. There had been complaints about speed problems from the field on RPG programs that used a table lookup instruction. The computer supporting the compiler had excellent microcode features: we decided to take advantage of the feature by microcoding the capability into the basic machine.

The feature was pretty obvious: it would search for things in ordered or unordered tables and do various things depending on whether the key was in the table. We made what we considered to be the "obvious" optimization in the case of ordered tables - we performed a binary search. Nothing could be faster given the way the tables were organized. We wrote the code, tested it on our own test cases and some field examples and got performance improvements exceeding sales requirements. It was an important fix...

Unfortunately, it was wrong. It isn't clear what "it" was in this case - Us or IBM. People loaded their tables in the machine with each run of an RPG program and they often wouldn't bother to keep their ordered tables ordered. IBM didn't care - it ignored the specification {most of the time}. Our code would break when people gave us bad data in ways that IBM's wouldn't. We had to fix ours.
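
The failure mode in the table-lookup story is easy to reproduce: a binary search silently misses keys when a table declared "ordered" is not actually in order, while a plain sequential scan still finds them. The following Python sketch illustrates the general effect only; the function names and sample tables are invented here, and nothing is known from the message about how the original microcode or IBM's implementation actually worked.

```python
from bisect import bisect_left


def binary_lookup(table, key):
    """Binary search; correct only if `table` really is sorted ascending."""
    i = bisect_left(table, key)
    if i < len(table) and table[i] == key:
        return i
    return -1  # reported as "not in table"


def linear_lookup(table, key):
    """Sequential scan; slower, but indifferent to ordering."""
    for i, item in enumerate(table):
        if item == key:
            return i
    return -1


if __name__ == "__main__":
    ordered = [10, 20, 30, 40, 50]
    unordered = [10, 40, 20, 50, 30]  # declared "ordered", but is not

    print(binary_lookup(ordered, 20))    # found
    print(binary_lookup(unordered, 20))  # missed, although 20 is present
    print(linear_lookup(unordered, 20))  # found by the slower scan
```

A lookup that tolerates bad data has to either verify the ordering first or fall back to the sequential scan, which gives up the speed the optimization was meant to buy.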
---------------------------------
Olmstread.PA@PARC-MAXC

I can't supply any documentation, but I was told when I was in school with Ron Lachman (you might want to check with him at LAI -- laidback!ron, I think) that IBM had a bug in its program IEFBR14. This program's sole job was to return (via BR 14, a branch through register 14, the return register); it was used by JCL (shudder!) procedures which allocated file space and needed a program, any program, to run. It was a one-line program with a bug: it failed to clear the return code register (R 15, I think). I submit you won't find any programs with a higher bugs-per-instruction percentage.

----------------------
hoey@NRL-AIC

I got this at MIT....

From: ihnp4!zehntel!zinfandel!berry@ucb-vax

In the April 1980 issue of ACM SIGSOFT Software Engineering Notes, editor Peter G. Neumann (NEUMANN@SRI-KL at that time) relays information that Earl Boebert got from Mark Groves (OSD R&E) regarding bugs in the software of the F-16 fighter. Apparently a problem in the navigation software inverted the aircraft whenever it crossed the equator. Luckily it was caught early in simulation testing and promptly fixed.

In the July issue, J.N. Frisina at Singer-Kearfott wrote to Mr. Neumann, "concerned that readers might have mistakenly believed there was a bug in the flight software, which was of course not the case." [At least they fixed THAT one. Wasn't it Hoare who said that acceptance testing is just an unsuccessful attempt to find bugs?] Mr. Frisina wrote: "In the current search for reliable software, the F16 Navigation software is an example of the high degree of reliability and quality that can be obtained with the application of proper design verification and testing methodologies. All primary mission functions were software correct."
In the April '81 issue it is revealed that the F18 range of control travel limits imposed by the F18 software are based on assumptions about the inability of the aircraft to get into certain attitudes. Well, some of these 'forbidden' attitudes are in fact attainable. Apparently so much effort had gone into design and testing of the software that it is now preferable to modify the aircraft to fit the software, rather than vice-versa!

-------
Cantone@nrl-aic

I've heard from Prof. Martin Davis, a logician at NYU, that Turing's Ph.D. thesis was just filled with bugs. His thesis was a theoretical description of his Turing machine that included sample computer programs for it. It was these programs that were filled with bugs. Without computers there was no way to check them. (Those programs could have worked with only minor fixes.)

[NOTE: I called Davis, who gave as a reference a paper by Emil Post on recursive unsolvability that appeared in 1947-8 in the Journal of Symbolic Logic -- js]

----------------------
David.Smith@CMU-CS-IUS

In simulation tests between the first and second Shuttle flights, a bug was found in the onboard computer software, which could have resulted in the premature jettison of ONE of the SRB's. That would have been the world's most expensive pinwheel! I read this in Aviation Week, but that leaves a lot of issues to scan.

------------------------------
butkiewi@nrl-css

In response to your request for info on bugs, here's one. We work with some collection system software that was initially deployed in 1974. Part of the system calculated the times of upcoming events. In 1976, on February 29th, some parts of the system thought it was a different julian day and it basically broke the whole system. Several subroutines needed leap year fixes. One software engineer was called in from home and worked all night on that one. How many sound Software Engineering Principles were violated?
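
The message does not say how the 1976 system computed its day numbers, so the following Python sketch only illustrates the general class of bug: a day-of-year ("julian day") routine built on a fixed month table that forgets leap years. The table and both function names are assumptions made up for this example.

```python
import calendar
from datetime import date

# Days elapsed before the first of each month in a NON-leap year.
DAYS_BEFORE_MONTH = (0, 31, 59, 90, 120, 151, 181, 212, 243, 273, 304, 334)


def day_of_year_naive(year, month, day):
    """Buggy version: ignores leap years entirely."""
    return DAYS_BEFORE_MONTH[month - 1] + day


def day_of_year(year, month, day):
    """Corrected version: adds one day after February in leap years."""
    n = DAYS_BEFORE_MONTH[month - 1] + day
    if month > 2 and calendar.isleap(year):
        n += 1
    return n


if __name__ == "__main__":
    # In 1976 the naive routine gives Feb 29 and Mar 1 the same day number.
    print(day_of_year_naive(1976, 2, 29), day_of_year_naive(1976, 3, 1))
    print(day_of_year(1976, 2, 29), day_of_year(1976, 3, 1))
    # Cross-check the corrected routine against the standard library.
    print(date(1976, 3, 1).timetuple().tm_yday)
```

If some subroutines used a corrected calculation and others a naive one, they would disagree about the current day from the end of February onward, which matches the "some parts of the system thought it was a different julian day" symptom.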
----------------------------
Stachour.CSCswtec@HI-MULTICS

While I cannot cite any published documentation, and this could hardly qualify as a famous bug, a mail-system I once worked on (which ran with privilege to write into anyone's mailboxes) was discovered to have an incorrect message-length check for a message coming from a file. The result was that a 'specially prepared msg' could arrange to overlay the test of the mail-system, and especially restore the system 'change-access' program on top of the mail-system, which then gave the caller power to change access controls on any file of the system. This was for a university-built mail-system for a Honeywell GCOS3 circa 1976.

--------------------
Sibert@MIT-MULTICS

I imagine you've heard about this already, and, if not, I can't provide any documentation, but anyway: it is said that the first Mariner space probe, Mariner 1, ended up in the Atlantic instead of around Venus because someone omitted a comma in a guidance program.

--------------------------
Kyle.wbst@PARC-MAXC

TRW made a satellite in the late '50's or early '60's with the feature that it could power down into a stand-by mode to conserve electrical consumption. On the first pass over the cape (after successful orbital check out of all systems), the ground crew transmitted the command to power down. On the next pass, they transmitted the command to power up and nothing happened because the software/hardware system on board the satellite shut EVERYTHING down (including the ground command radio receiver).

----------------------------
Hoffman.es@PARC-MAXC

From the AP story carried on Page 1 of today's Los Angeles Times:

"Jet Engine Failure Tied to Computer: It's Too Efficient

The efficiency of a computer aboard a United Airlines 767 jet may have led to the failure of both of the plane's engines, forcing the aircraft into a four-minute powerless glide on its approach to Denver, federal officials said Tuesday. . . .
[The National Transportation Safety Board's] investigation has disclosed that the overheating problem stemmed from the accumulation of ice on the engines. . . . [I]t is believed that the ice built up because the onboard computer had the aircraft operating so efficiently during the gradual descent that the engines were not running fast enough to keep the ice from forming. . . .

The incident raised questions among aviation safety experts about the operation of the highly computerized new generation of jetliners that are extremely fuel-efficient because of their design and computerized systems. "The question is at what point should you override the computer," one source close to the inquiry said. . . .

[T]he engines normally would have been running fast enough to keep the ice from forming. In the case of Flight 310, investigators believe, the computer slowed the engine to a rate that conserved the maximum amount of fuel but was too slow to prevent icing. A key question, one source said, is whether the computer-controlled descent might have kept the flight crew from recognizing the potential icing problem. Airline pilots for some time have complained that the highly computerized cockpits on the new jets -- such as the 767, Boeing's 757 and the Airbus 310 -- may make pilots less attentive. . . .

__________________
Kaehler.pa@parc-maxc
Info-Kermit@COLUMBIA-20

On Wednesday, August 24, at 11:53:51-EDT, KERMIT-20 stopped working on many TOPS-20 systems. The symptom was that after a certain number of seconds (KERMIT-20's timeout interval), the retry count would start climbing rapidly, and then the transfer would hang. The problem turns out to be a "time bomb" in TOPS-20. Under certain conditions (i.e. on certain dates, provided the system has been up more than a certain number of hours), the timer interrupt facility stops working properly. If KERMIT-20 has stopped working on your system, just reload the system and the problem will go away.
Meanwhile, the systems people at Columbia have developed a fix for the offending code in the monitor and have distributed it to the TOPS-20 managers on the ARPAnet. The problem is also apparent in any other TOPS-20 facility that uses timers, such as the Exec autologout code. The time bomb next goes off on October 27, 1985, at 19:34:06-EST.

-----------
-----------

Craig.Everhart@CMU-CS-A has put together a long file of stories that were gathered as the result of a request for interesting, humorous, and socially relevant system problems. They make fun reading, but most of them aren't really bugs, let alone well-documented ones, so I decided not to include them. The file, however, can be obtained from Craig.

(end-of-message)

-Bob (Krovetz@NLM-MCS)

