These excerpts are from the article Leveson published in 1993 in IEEE Computer. You can find a more current version of the article at her web site at http://sunnyday.mit.edu/papers/therac.pdf. We selected these excerpts because we felt they described the most critical issues in the case. They go into much more detail on the software problems, the design of the machine and software, and the interface on the VT100 terminal.
Therac History. An overview of the history and physical design of the Therac-25.
The TurnTable. A close look at how the turntable was constructed and how its position was monitored. The turntable position was a critical issue in all the overdoses.
Software Design. An overview of the design of the software in Therac-25. Particular attention is given to how real-time issues resulted in race conditions.
Safety Analysis. An overview of the several different safety analyses that were done on the Therac-25 system.
Interface. A description of the operator interface on the VT100 operator console. Particular attention is given to the difficulties with error messages and with editing.
Tyler Software Problem. A description of the software problem that resulted in overdoses at Tyler, TX.
Yakima Software Problem. A description of the software problem that resulted in overdoses at Yakima, and possibly Hamilton, Ontario.
The material on this page is reprinted from N.G. Leveson, & C.S. Turner. "An Investigation of the Therac-25 Accidents." Computer, Vol. 26, No. 7, July 1993, pp. 18-41. Copyright © 1993 Institute of Electrical and Electronics Engineers. This material is posted here with permission of IEEE. Such permission of the IEEE does not in any way imply IEEE endorsement of any of St. Olaf College's products or services. Internal or personal use of this material is permitted. However, permission to reprint/republish this material for advertising or promotional purposes or for creating new collective works for resale or redistribution must be obtained from the IEEE by sending a blank email message to firstname.lastname@example.org. By choosing to view this document, you agree to all provisions of the copyright laws protecting it.
Medical linear accelerators (linacs) accelerate electrons to create high-energy beams that can destroy tumors with minimal impact on the surrounding healthy tissue. Relatively shallow tissue is treated with the accelerated electrons; to reach deeper tissue, the electron beam is converted into X-ray photons.
In the early 1970's, Canadian Medical Corporation (CMC) and a French company called CGR collaborated to build linear accelerators. (CMC is an arms-length entity, called a crown corporation, of the Canadian Government.) Since the time of the incidents related in this article, CMC Medical, a division of CMC is in the process of being privatized and is now called Therapeutic Accelerators Limited. Currently CMC's primary business is the design and installation of nuclear reactors.) The products of CMC and CGR's cooperation were (1) the Therac-6, a 6 million electron volt (MeV) accelerator capable of producing X rays only and, later, (2) the Therac-20, a 20 Me V dual-mode (X rays or electrons) accelerator. Both were versions of older CGR machines, the Neptune and Sagittaire, respectively, which were augmented with computer control using a DEC PDP 11 minicomputer.
Software functionality was limited in both machines: The computer merely added convenience to the existing hardware, which was capable of standing alone. Industry-standard hardware safety features and interlocks in the underlying machines were retained. We know that some old Therac-6 software routines were used in the Therac-20 and that CGR developed the initial software.
The business relationship between CMC and CGR faltered after the Therac-20 effort. Citing competitive pressures, the two companies did not renew their cooperative agreement when scheduled in 1981. In the mid-1970's, CMC developed a radical new "double-pass" concept for electron acceleration. A double-pass accelerator needs much less space to develop comparable energy levels because it folds the long physical mechanism required to accelerate the electrons, and it is more economic to produce (since it uses a magnetron rather than a klystron as the energy source).
Using this double-pass concept, CMC designed the Therac-25, a dual-mode linear accelerator that can deliver either photons at 25 Me V or electrons at various energy levels (see Figure 1). Compared with the Therac-20, the Therac-25 is notably more compact, more versatile, and arguably easier to use. The higher energy takes advantage of the phenomenon of "depth dose": As the energy increases, the depth in the body at which maximum dose buildup occurs also increases, sparing the tissue above the target area. Economic advantages also come into play for the customer, since only one machine is required for both treatment modalities (electrons and photons).
Several features of the Therac-25 are important in understanding the accidents. First, like the Therac-6 and the Therac-20, the Therac-25 is controlled by a PDP 11. However, CMC designed the Therac-25 to take advantage of computer control from the outset; CMC did not build on a stand-alone machine. The Therac-6 and Therac-20 had been designed around machines that already had histories of clinical use without computer control.
In addition, the Therac-25 software has more responsibility for maintaining safety than the software in the previous machines. The Therac-20 has independent protective circuits for monitoring electron-beam scanning, plus mechanical interlocks for policing the machine and ensuring safe operation. The Therac-25 relies more on software for these functions. CMC took advantage of the computer's abilities to control and monitor the hardware and decided not to duplicate all the existing hardware safety mechanisms and interlocks. This approach is becoming more common as companies decide that hardware interlocks and backups are not worth the expense, or they put more faith (perhaps misplaced) on software than on hardware reliability.
Finally some software for the machines was interrelated or reused. In a letter to a Therac-25 user, the CMC quality assurance manager said, "The same Therac-6 package was used by the CMC software people when they started the Therac-25 software. The Therac-20 and Therac-25 software programs were done independently, starting from a common base." Reuse of Therac-6 design features or modules may explain some of the problematic aspects of the Therac-25 software development and design. The quality assurance manager was apparently unaware that some Therac-20 routines were also used in the Therac-25; this was discovered after a bug related to one of the Therac-25 accidents was found in the Therac-20 software.
CMC produced the first hardwired prototype of the Therac-25 in 1976, and the completely computerized commercial version was available in late 1982. In March 1983, CMC performed a safety analysis on the Therac-25. This analysis was in the form of a fault tree and apparently excluded the software. According the final report, the analysis made several assumptions:
The fault tree resulting from this analysis does appear to include computer failure, although apparently judging from these assumptions, it considers only hardware failures. For example, in one OR gate leading to the event of getting the wrong energy, a box contains "Computer selects wrong energy" and a probability of 10-11 is assigned to this event. For "Computer selects wrong mode," a probability of 4 x 10-9 is given. The report provides no justification of either number.
The Therac-25 turntable design is important in understanding the accidents. The upper turntable (see Figure B) is a rotating table, as the name implies. The turntable rotates accessory equipment into the beam path to produce two therapeutic modes: electron mode and photon mode. A third position (called the field-light position) involves no beam at all; it facilitates correct positioning of the patient.
Proper operation of the Therac-25 is heavily dependent on the turntable position; the accessories appropriate to each mode are physically attached to the turntable. The turntable position is monitored by three microswitches corresponding to the three cardinal turntable positions: electron beam, X ray, and field light. These microswitches are attached to the turntable and are engaged by hardware stops at the appropriate positions. The position of the turntable, sent to the computer as a 3-bit binary signal, is based on which of the three microswitches are depressed by the hardware stops.
The raw, highly concentrated accelerator beam is dangerous to living tissue. In electron therapy, the computer controls the beam energy (from 5 to 25 MeV) and current while scanning magnets spread the beam to a safe, therapeutic concentration. These scanning magnets are mounted on the turntable and moved into proper position by the computer. Similarly, an ion chamber to measure electrons is mounted on the turntable and also moved into position by the computer. In addition, operator-mounted electron trimmers can be used to shape the beam if necessary.
For X-ray therapy, only one energy level is available: 25 MeV. Much greater electron-beam current is required for photon mode (some 100 times greater than that for electron therapy)[Rawlinson] to produce comparable output. Such a high dose-rate capability is required because a "beam flattener" is used to produce a uniform treatment field. This flattener, which resembles an inverted ice-cream cone, is a very efficient attenuator. To get a reasonable treatment dose rate out, a very high input dose rate is required. If the machine produces a photon beam with the beam flattener not in position, a high output dose rate results. This is the basic hazard of dual-mode machines: If the turntable is in the wrong position, the beam flattener will not be in place.
In the Therac-25, the computer is responsible for positioning the turntable (and for checking turntable position) so that a target, flattening filter, and X-ray ion chamber are directly in the beam path. With the target in the beam path, electron bombardment produces X-rays. The X-ray beam is shaped by the flattening filter and measured by the X-ray ion chamber.
No accelerator beam is expected in the field-light position. A stainless steel mirror is placed in the beam path and a light simulates the beam. This lets the operator see precisely where the beam will strike the patient and make necessary adjustments before treatment starts. There is no ion chamber in place at this turntable position, since no beam is expected.
Traditionally, electromechanical interlocks have been used on these types of equipment to ensure safety — in this case, to ensure that the turntable and attached equipment are in the correct position when treatment is started. In theTherac-25, software checks were substituted for many traditional hardware interlocks.
Reference: J.A. Rawlinson, "Report on the Therac-25," OCTRF/OCI Physicists Meeting, Kingston, Ont., May 7, 1987.
We know that the software for the Therac-25 was developed by a single person, using PDP-11 assembly language, over a period of several years. The software "evolved" from the Therac-6 software, which was started in 1972. According to a letter from CMC to the FDA, the "program structure and certain subroutines were carried over to the Therac-25 around 1976." Apparently, very little software documentation was produced during development. In a 1986 internal FDA memo, a reviewer lamented, "Unfortunately, the CMC response also seems to point out an apparent lack of documentation on software specifications and a software test plan."
The manufacturer said that the hardware and software were "tested and exercised separately or together over many years." In his deposition for one of the lawsuits, the quality assurance manager explained that testing was done in two parts. A "small amount" of software testing was done on a simulator, but most testing was done as a system. It appears that unit and software testing was minimal, with most effort directed at the integrated system test. At a Therac-25 user group meeting, the same quality assurance manager said that the Therac-25 software was tested for 2,700 hours. Under questioning by the users, he clarified this as meaning "2,700 hours of use."
The programmer left CMC in 1986. In a lawsuit connected with one of the accidents, the lawyers were unable to obtain information about the programmer from CMC. In the depositions connected with that case, none of the CMC employees questioned could provide any information about his educational background or experience. Although an attempt was made to obtain a deposition from the programmer, the lawsuit was settled before this was accomplished. We have been unable to learn anything about his background.
CMC claims proprietary rights to its software design. However, from voluminous documentation regarding the accidents, the repairs, and the eventual design changes, we can build a rough picture of it.
The software is responsible for monitoring the machine status, accepting input about the treatment desired, and setting the machine up for this treatment. It turns the beam on in response to an operator command (assuming that certain operational checks on the status of the physical machine are satisfied) and also turns the beam off when treatment is completed, when an operator commands it, or when a malfunction is detected. The operator can print out hard-copy versions of the CRT display or machine setup parameters.
The treatment unit has an interlock system designed to remove power to the unit when there is a hardware malfunction. The computer monitors this interlock system and provides diagnostic messages. Depending on the fault, the computer either prevents a treatment from being started or, if the treatment is in progress, creates a pause or a suspension of the treatment.
The manufacturer describes the Therac-25 software as having a stand-alone, real-time treatment operating system. The system is not built using a standard operating system or executive. Rather, the real-time executive was written especially for the Therac-25 and runs on a 32K PDP 11/23. A preemptive scheduler allocates cycles to the critical and noncritical tasks.
The software, written in PDP 11 assembly language, has four major components; stored data, a scheduler, a set of critical and noncritical tasks, and interrupt services. The stored data includes calibration parameters for the accelerator setup as well as patient-treatment data. The interrupt routines include:
The scheduler controls the sequences of all noninterrupt events and coordinates all concurrent processes. Tasks are initiated every 0.1 second, with the critical tasks executed first and the noncritical tasks executed in any remaining cycle time. Critical tasks include the following:
Noncritical tasks include
It is clear from the CMC documentation on the modifications that the software allows concurrent access to shared memory, that there is no real synchronization aside from data stored in shared variables, and that the "test" and "set" for such variables are not indivisible operations. Race conditions resulting from this implementation of multitasking played an important part in the accidents.
The Therac-25 safety included (1) failure mode and effect analysis, (2) fault-tree analysis, and (3) software examination.
Failure mode and effect analysis. An FMEA describes the associated system response to all failure modes of the individual system components, considered one by one. When software was involved, CMC made no assessment of the "how and why" of software faults and took any combination of software faults as a single event. The latter means that if the software was the initiating event, then no credit was given for the software mitigating the effects. This seems like a reasonable and conservative approach to handling software faults.
Fault-tree analysis. An FMEA identifies single failures leading to Class I hazards. To identify multiple failures and quantify the results, CMC used fault-tree analysis. An FTA starts with a postulated hazard-- for example, two of the top events for the Therac-25 are high dose per pulse and illegal gantry motion. The immediate causes for the event are then generated in an AND/OR tree format, using a basic understanding of the machine operation to determine the causes. The tree generation continues until all branches end in the "basic events." Operationally, a basic event is sometimes defined as an event that can be quantified (for example, a resistor fails open).
CMC used a "generic failure rate" of 10-4 per hour for software events. The company justified this number as based on the historical performance of the Therac-25 software. The final report on the safety analysis said that many fault trees for the Therac-25 have a computer malfunction as a causative event, and the outcome of quantification is therefore dependent on the failure rate chosen for software.
Leaving aside the general question of whether such failure rates are meaningful or measurable for software in general, it seems rather difficult to justify a single figure of this sort for every type of software error or software behavior. It would be equivalent to assigning the same failure rate to every type of failure of a car, no matter what particular failure is considered.
The authors of the safety study did note that despite the uncertainty that software introduces into quantification, fault-tree analysis provides valuable information in showing single and multiple failure paths and the relative importance of different failure mechanisms. This is certainly true.
Software examination. Because of the difficulty of quantifying software behavior, CMC contracted for a detailed code inspection to "obtain more information on which to base decisions." The software functions selected for examination were those related to the Class I software hazards identified in the FMEA: electron-beam scanning, energy selection, beam shutoff, and dose calibration.
The outside consultant who performed the inspection included a detailed examination of each function's implementation, a search for coding errors, and a qualitative assessment of its reliability. The consultant recommended program changes to correct shortcomings, improve reliability, or improve the software package in a general sense. The final safety report gives no information about whether any particular methodology or tools were used in the software inspection or whether someone just read the code looking for errors.
Conclusions of the safety analysis. The final report summarizes the conclusions of the safety analysis:
The conclusions of the analysis call for 10 changes to Therac-25 hardware; the most significant of these are interlocks to back up software control of both electron scanning and beam energy selection.
Although it is not considered necessary or advisable to rewrite the entire Therac-25 software package, considerable effort is being expended to update it. The changes recommended have several distinct objectives: improve the protection it provides against hardware failures; provide additional reliability via cross-checking; and provide a more maintainable source package. Two or three software releases are anticipated before these changes are completed.
The implementation of these improvements including design and testing for both hardware and software is well under way. All hardware modifications should be completed and installed by mid 1989, with final software updates extended into late 1989 or early 1990.
The recommended hardware changes appear to add protection against software errors, to add extra protection against hardware failures, or to increase safety margins. The software conclusions included the following:
The software code for Beam Shut-Off, Symmetry Control, and Dose Calibration was found to be straight-forward and no execution path could be found which would cause them to perform incorrectly. A few improvements are being incorporated, but no additional hardware interlocks are required.
Inspection of the Scanning and Energy Selection functions, which are under software control, showed no improper execution paths; however, software inspection was unable to provide a high level of confidence in their reliability. This was due to the complex nature of the code, the extensive use of variables, and the time limitations of the inspection process. Due to these factors and the possible clinical consequences of malfunction, computer-independent interlocks are being retrofitted for these two cases.
Given the complex nature of this software design and the basic multitasking design, it is difficult to understand how any part of the code could be labeled "straightforward" or how confidence could be achieved that "no execution paths" exist for particular types of software behavior. However, it does appear that a conservative approach-- including computer-independent interlocks-- was taken in most cases. Furthermore, few examples of such safety analyses of software exist in the literature. One such software analysis was performed in 1989 on the shutdown software of a nuclear power plant, which was written by a different division of CMC.1 Much still needs to be learned about how to perform a software-safety analysis.
1. W. C. Bowman et al. "An Application of Fault Tree Analysis to Safety-Critical Software at Ontario Hydro," Conf. Probabilistic Safety Assessment and Management, 1991.
The Therac-25 operator controls the machine with a DEC VT100 terminal. In the general case, the operator positions the patient on the treatment table, manually sets the treatment field sizes and gantry rotation, and attaches accessories to the machine. Leaving the treatment room, the operator returns to the VT100 console to enter the patient identification, treatment prescription (including mode, energy level, dose, dose rate, and time), field sizing, gantry rotation, and accessory data. The system then compares the manually set values with those entered at the console. If they match, a "verified" message is displayed and treatment is permitted. If they do not match, treatment is not allowed to proceed until the mismatch is corrected. Figure A. shows the screen layout.
Figure A. Operator interface screen layout
When the system was first built, operators complained that it took too long to enter the treatment plan. In response, the manufacturer modified the software before the first unit was installed so that, instead of reentering the data at the keyboard, operators could use a carriage return to merely copy the treatment site data [Miller]. A quick series of carriage returns would thus complete data entry. This interface modification was to figure in several accidents.
The Therac-25 could shut down in two ways after it detected an error condition. One was a treatment suspend, which required a complete machine reset to restart the machine. If a treatment pause occurred, the operator could press the "P" key to "proceed" and resume treatment quickly and conveniently. The previous treatment parameters remained in effect, and no reset was required. This convenient and simple feature could be invoked a maximum of five times before the machine automatically suspended treatment and required the operator to perform a system reset.
Error messages provided to the operator were cryptic, and some merely consisted of the word "malfunction" followed by a number from 1 to 64 denoting an analog/digital channel number. According to an FDA memorandum written after one accident:
The operator's manual supplied with the machine does not explain nor even address the malfunction codes. The [Maintenance] Manual lists the various malfunction numbers but gives no explanation. The materials provided give no indication that these malfunctions could place a patient at risk.
The program does not advise the operator if a situation exists wherein the ion chambers used to monitor the patient are saturated, thus are beyond the measurement limits of the instrument. This software package does not appear to contain a safety system to prevent parameters being entered and intermixed that would result in excessive radiation being delivered to the patient under treatment.
An operator involved in an overdose accident testified that she had become insensitive to machine malfunctions. Malfunction messages were commonplace — most did not involve patient safety. Service technicians would fix the problems or the hospital physicist would realign the machine and make it operable again. She said, "It was not out of the ordinary for something to stop the machine…It would often give a low dose rate in which you would turn the machine back on…They would give messages of low dose rate, V-tilt, H-tilt, and other things; I can't remember all the reasons it would stop, but there [were] a lot of them." The operator further testified that during instruction she had been taught that there were "so many safety mechanisms" that she understood it was virtually impossible to overdose a patient.
A radiation therapist at another clinic reported an average of 40 dose-rate malfunction, attributed to underdoses, occurred on some days.
Reference: E. Miller, "The Therac-25 Experience," Proc. Conf. State Radiation Control Program Directors, 1987.
A lesson to be learned from the Therac-25 story is that focusing on particular software bugs is not the way to make a safe system. Virtually all complex software can be made to behave in an unexpected fashion under certain conditions. The basic mistakes here involved poor software-engineering practices and building a machine that relies on the software for safe operation.
Furthermore, the particular coding error is not as important as the general unsafe design of the software overall. Examining the part of the code blamed for the Tyler accidents is instructive, however, in showing the overall software design flaws. The following explanation of the problem is from the description CMC provided for the FDA, although we have tried to clarify it somewhat. The description leaves some unanswered questions, but it is the best we can do with the information we have.
As described in the sidebar on Therac-25 development and design, the treatment monitor task (Treat) controls the various phases of treatment by executing its eight subroutines (see Figure 2). The treatment phase indicator variable (Tphase) is used to determine which subroutine should be executed. Following the execution of a particular subroutine, Treat reschedules itself.
One of Treat’s subroutines, called Datent (data entry), communicates with the keyboard handler task (a task that runs concurrently with Treat) via a shared variable (Data-entry completion flag) to determine whether the prescription data has been entered. The keyboard handler recognizes the completion of data entry and changes the Data-entry completion variable to denote this. Once the Data-entry completion variable is set, the Datent subroutine detects the variable’s change in status and changes the value of Tphase from 1 (Data Entry) to 3 (Set-Up Test).
In this case, the Datent subroutine exits back to the Treat subroutine, which will reschedule itself and begin execution of the Set-Up Test subroutine. If the Data-entry completion variable has not been set, Datent leaves the value of Tphase unchanged and exits back to Treat’s main line. Treat will then reschedule itself, essentially rescheduling the Datent subroutine.
The command line at the lower right corner of the screen is the cursor’s normal position when the operator has completed all necessary changes to the prescription. Prescription editing is signified by cursor movement off the command line. As the program was originally designed, the Data-entry completion variable by itself is not sufficient since it does not ensure that the cursor is located on the command line. Under the right circumstances, the data-entry phase can be exited before all edit changes are made on the screen.
The keyboard handler parses the mode and energy level specified by the operator and places an encoded result in another shared variable, the 2-byte mode/energy offset (MEOS) variable. The low-order byte of this variable is used by another task (Hand) to set the collimator/turntable to the proper position for the selected mode/energy. The high-order byte of the MEOS variable is used by Datent to set several operating parameters.
Initially, the data-entry process forces the operator to enter the mode and energy, except when the operator selects the photon mode, in which case the energy defaults to 25 MeV. The operator can later edit the mode and energy separately. If the keyboard handler sets the data-entry completion variable before the operator changes the data in MEOS, Datent will not detect the changes in MEOS since it has already exited and will not be reentered again. The upper collimator, on the other hand, is set to the position dictated by the low-order byte of MEOS by another concurrently running task (Hand) and can therefore be inconsistent with the parameters set in accordance with the information in the high-order byte of MEOS. The software appears to include no checks to detect such an incompatibility.
Figure 3. Datent, Magnet, and Ptime subroutines
The first thing that Datent does when it is entered is to check whether the mode/energy has been set in MEOS. If so, it uses the high-order byte to index into a table of preset operating parameters and places them in the digital-to-analog output table. The contents of this output table are transferred to the digital-analog converter during the next clock cycle. Once the parameters are all set, Datent calls the subroutine Magnet, which sets the bending magnets. Figure 3 is a simplified pseudocode description of relevant parts of the software.
Setting the bending magnets takes about 8 seconds. Magnet calls a subroutine called Ptime to introduce a time delay. Since several magnets need to be set, Ptime is entered and exited several times. A flag to indicate that bending magnets are being set is initialized upon entry to the Magnet subroutine and cleared at the end of Ptime. Furthermore, Ptime checks a shared variable, set by the keyboard handler, that indicates the presence of any editing requests. If there are edits, then Ptime clears the bending magnet variable and exits to Magnet, which then exits to Datent. But the edit change variable is checked by Ptime only if the bending magnet flag is set. Since Ptime clears it during its first execution, any edits performed during each succeeding pass through Ptime will not be recognized. Thus, an edit change of the mode or energy, although reflected on the operator’s screen and the mode/energy offset variable, will not be sensed by Datent so it can index the appropriate calibration tables for the machine parameters.
Recall that the Tyler error occurred when the operator made an entry indicating the mode/energy, went to the command line, then moved the cursor up to change the mode/energy, and returned to the command line all within 8 seconds. Since the magnet setting takes about 8 seconds and Magnet does not recognize edits after the first execution of Ptime, the editing had been completed by the return to Datent, which never detected that it had occurred. Part of the problem was fixed after the accident by clearing the bending-magnet variable at the end of Magnet (after all the magnets have been set) instead of at the end of Ptime.
But this was not the only problem. Upon exit from the Magnet subroutine, the data-entry subroutine (Datent) checks the data-entry completion variable. If it indicates that data entry is complete, Datent sets Tphase to 3 and Datent is not entered again. If it is not set, Datent leaves Tphase unchanged, which means it will eventually be rescheduled. But the data-entry completion variable only indicates that the cursor has been down to the command line, not that it is still there. A potential race condition is set up. To fix this, CMC introduced another shared variable controlled by the keyboard handler task that indicates the cursor is not positioned on the command line. If this variable is set, then prescription entry is still in progress and the value of Tphase is left unchanged.
The software problem for the second Yakima accident is fairly well established and different from that implicated in the Tyler accidents. There is no way to determine what particular software design errors were related to the Kennestone, Hamilton, and first Yakima accidents. Given the unsafe programming practices exhibited in the code, it is possible that unknown race conditions or errors could have been responsible. There is speculation, however, that the Hamilton accident was the same as this second Yakima overdose. In a report of a conference call on January 26, 1987, between the CMC quality assurance manager and Ed Miller of the FDA discussing the Yakima accident, Miller notes:
This situation probably occurred in the Hamilton, Ontario, accident a couple of years ago. It was not discovered at that time and the cause was attributed to intermittent interlock failure. The subsequent recall of the multiple microswitch logic network did not really solve the problem.
The second Yakima accident was again attributed to a type of race condition in the software, this one allowed the device to be activated in an error setting (a "failure" of a software interlock). The Tyler accidents were related to problems in the data-entry routines that allowed the code to proceed to Set-Up Test before the full prescription had been entered and acted upon. The Yakima accident involves problems encountered later in the logic after the treatment monitor Treat reaches Set-Up Test.
The Therac-25’s field-light feature permits very precise positioning of the patient for treatment. The operator can control the Therac-25 right at the treatment site using a small hand control offering certain limited functions for patient setup, including setting gantry, collimator, and table motions.
Normally, the operator enters all the prescription data at the console (outside the treatment room) before the final setup of all machine parameters is completed in the treatment room. This gives rise to an "unverified" condition at the console. The operator then completes the patient setup in the treatment room, and all relevant parameters now "verify". The console displays the message "Press set button" while the turntable is in the field-light position. The operator now presses the set button on the hand control or types "set" at the console. That should set the collimator to the proper position for treatment.
Figure 4. Yakima software flaw
In the software, after the prescription is entered and verified by the Datent routine, the control variable Tphase is changed so that the Set-Up Test routine is entered (see Figure 4). Every pass through the Set-Up Test routine increments the upper collimator position check, a shared variable called Class3. If Class3 is nonzero, there is an inconsistency and treatment should not proceed. A zero value for Class3 indicates that the relevant parameters are consistent with treatment, and the beam is not inhibited.
After setting the Class 3 variable, Set-Up Test next checks for any malfunctions in the system by checking another shared variable (set by a routine that actually handles the interlock checking) called F$mal to see if it has a nonzero value. A nonzero value in F$mal indicates that the machine is not ready for treatment, and the Set-Up Test subroutine is rescheduled. When F$mal is zero (indicating that everything is ready for treatment), the Set-Up Test subroutine sets the Tphase variable equal to 2, which results in next scheduling the Set-Up Done subroutine, and the treatment is allowed to continue.
The actual interlock checking is performed by a concurrent Housekeeper task (Hkeper). The upper collimator position check is performed by a subroutine of Hkeper called Lmtchk (analog/digital limit checking). Lmtchk first checks the Class3 variable. If Class3 contains a nonzero value, Lmtchk calls the Check Collimator (Chkcol) subroutine. If Class3 contains zero, Chkcol is bypassed and the upper collimator position check is not performed. The Chkcol subroutine sets or resets bit 9 of the F$mal shared variable, depending on the position of the upper collimator (which in turn is checked by the Set-Up Test subroutine of Datent so it can decided whether to reschedule itself or proceed to Set-Up Done).
During machine setup, Set-Up Test will be executed several hundred times since it reschedules itself waiting for other events to occur. In the code, the Class3 variable is incremented by one in each pass through Set-Up test. Since the class3 variable is 1 byte, it can only contain a maximum value of 255 decimal. Thus, on every 256th pass through the Set-Up Test code, the variable overflows and has a zero value. That means that on every 356th pass through Set-Up Test, the upper collimator fault will not be detected.
The overexposure occurred when the operator hit the "set" button at the precise moment that Class3 rolled over to zero. Thus Chkcol was not executed, and F$mal was not set to indicate the upper collimator was still in field-light position. The software turned on the full 25 MeV without the target in place and without scanning. A highly concentrated electron beam resulted, which was scattered and deflected by the stainless steel mirror that was in the path.
CMC described the technical "fix" implemented for this software flaw as simple: The program is changed so that the Class3 variable is set to some fixed nonzero value each time through Set-Up Test instead of being incremented.