In-Flight Observations of Multiple-Bit Upset in DRAMs
Gary M. Swift and Steven M. Guertin
Jet Propulsion Laboratory
California Institute of Technology
Pasadena, California

Abstract

In-flight data is examined showing a very high incidence of multiple-bit errors in a solid-state recorder that incorporates error detection and correction. The high MBU rate is shown to be a consequence of the physical selection of memory cell locations in the recorder architecture.

I. INTRODUCTION

The Cassini spacecraft, launched in October 1997 on the way to Saturn’s moon Titan, carries two solid-state recorders that replace magnetic tape recorders for storing science data until it can be relayed to Earth. TRW built the two identical flight recorders, each of which contains 2.5 gigabits of memory. The memory consists of arrays of 4-Mb OKI DRAMs, organized in a 1-Mb x 4 device architecture. DRAMs are highly sensitive to single-event upset [1, 2] and these are no exception [3]. Error detection and correction (EDAC) was used to lower the visible upset rate, and additional shielding was placed around the memory cards to reduce the number of protons and low-energy ions from solar events that actually strike the sensitive DRAMs.

Extensive radiation tests of the 4-Mb DRAM were done during the design and qualification of this critical subsystem. Those results were used along with the theoretical improvement provided by the EDAC to calculate the error rate expected during flight, but no tests were done on the entire subsystem.

This paper discusses the observed error rate from the solid-state recorder, which exhibited a much higher number of uncorrected multiple-bit errors than expected from the architecture of the recorder and EDAC.

II. DRAM TEST DATA AND ERROR CORRECTION

A. Heavy-Ion Test Data

Samples of 4-Mb OKI DRAMs from the flight lot were tested at Brookhaven National Laboratory. All bits and the overall functional operation of the devices were monitored during these tests. Half of the bits in the data pattern were in the upsettable state and half were not, in order to simulate expected flight use.

Figure 1. Heavy-ion cross section of the OKI DRAMs.

The cross sections obtained for the DRAM samples are shown in Figure 1, counting each individual error in each cell location. During the extensive testing that was done on this device, we observed that tilting the device with respect to the incident ion beam did not increase the effective LET; in other words, the cross section is essentially isotropic and the “cosine law” does not apply. The threshold LET (defined here as the lowest LET at which errors are actually observed) is very low, approximately 1 MeV-cm²/mg, which is typical of DRAM technology. The upset cross section increases rapidly with increasing LET and begins to flatten at an LET of about 15 MeV-cm²/mg. It does not saturate completely, however: at high LET each ion causes, on average, more than one error, and the cross section continues to increase because the low-resistivity substrate of the DRAM allows extensive charge collection by diffusion.

B. Proton Test Data

Because of the low LET threshold, these DRAMs can easily be upset with protons. The proton-induced upset rate can dominate the heavy ion rate during a solar flare, and Cassini’s lengthy mission (12 years) practically guarantees that it will experience significant flares. Therefore, proton upset testing was performed at Harvard as part of the radiation lot acceptance testing. These test results are shown in Figure 2.

†The research in this paper was carried out by the Jet Propulsion Laboratory, California Institute of Technology, under contract with the National Aeronautics and Space Administration, Code AE, under the NASA Electronic Parts and Packaging Program (NEPP).

Figure 2. Proton cross section of the OKI DRAMs.

C. Error Correction Algorithm

The memory subsystem uses words that are 39 bits in length. Thirty-two of these bits store data, and the other 7 are used to store the Hamming code, which provides the ability to detect and correct single-bit errors. Double-bit errors can also be detected with this architecture, but not corrected. Triple-bit and higher errors are either “corrected” erroneously or not detected. A similar EDAC system has been flown with static RAM-based SSRs with no uncorrectable errors being reported [4].
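The behavior described above (single errors corrected, double errors detected, triple errors possibly "corrected" erroneously) can be illustrated with a minimal sketch of a (39,32) single-error-correcting, double-error-detecting code. The bit layout here is an assumption chosen for illustration: six Hamming check bits at the power-of-two positions plus one overall parity bit at position 0. The actual check-bit assignment used in the flight recorder is not given in this paper, and the function names are hypothetical.

```python
# Illustrative (39,32) SEC-DED code: 6 Hamming check bits + 1 overall parity.
# ASSUMPTION: this layout is for illustration only; the recorder's real
# check-bit matrix is not described in the paper.

PARITY_POS = (1, 2, 4, 8, 16, 32)  # Hamming check-bit positions
DATA_POS = [p for p in range(1, 39) if p not in PARITY_POS]  # 32 data positions


def syndrome_of(word):
    """XOR of the positions (1..38) of all set bits."""
    s = 0
    for p in range(1, 39):
        if word[p]:
            s ^= p
    return s


def encode(data):
    """Encode a 32-bit integer as a list of 39 bits (index 0 = overall parity)."""
    word = [0] * 39
    for i, p in enumerate(DATA_POS):
        word[p] = (data >> i) & 1
    s = syndrome_of(word)
    for p in PARITY_POS:
        word[p] = 1 if s & p else 0  # forces the syndrome of the word to zero
    word[0] = sum(word) & 1          # makes the total number of ones even
    return word


def decode(word):
    """Return (data, status); status is 'ok', 'corrected', or 'uncorrectable'."""
    word = list(word)
    s = syndrome_of(word)
    odd = sum(word) & 1
    if s == 0 and not odd:
        status = "ok"
    elif odd and s <= 38:
        word[s] ^= 1                 # odd parity: treat as a single-bit error
        status = "corrected"         # (wrong if three bits actually flipped)
    else:
        status = "uncorrectable"     # even parity with nonzero syndrome
    data = sum(word[p] << i for i, p in enumerate(DATA_POS))
    return data, status


def flip(word, positions):
    """Return a copy of word with the given bit positions inverted."""
    out = list(word)
    for p in positions:
        out[p] ^= 1
    return out
```

With this layout, `decode(flip(encode(d), [7]))` recovers `d`, any two flipped bits are reported as uncorrectable, and a triple flip such as positions 3, 5, and 6 is reported as "corrected" while returning wrong data.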

The memory subsystem counts the number of single-bit errors that are detected during operation, as well as the number of double-bit (uncorrectable) errors. These error counts are transmitted in the engineering telemetry along with other flight data, providing a history of the overall performance of the EDAC system and a record of the radiation environment during flight.

III. IN-FLIGHT RESULTS

The Cassini spacecraft was launched during a very quiet period in the solar cycle, and little flare activity has been experienced during the first 2½ years of flight. However, shortly after launch a sharp increase in the single-bit error rate was noted in the active solid-state recorder. Figure 3a shows the single-bit errors that were detected and corrected during successive 24-hour intervals around the day of the event. The background level is nearly constant at about 280 errors per day. On November 6, 1997, the number of errors increased by about a factor of four, as shown in the figure. Because Cassini was still in the vicinity of Earth, and because the increase coincided in time with a small solar proton event detected by the Earth-orbiting GOES-9 satellite, the increase in SSR errors can be ascribed to the proton event.

Figure 3a. Daily single-bit error rate of the SSR showing an increase during a small solar flare.

Figure 3b. Hourly single-bit error rate of the SSR during the day of the small flare.

The double-bit error rate observed during the same time period showed only a small, statistically insignificant increase. This is consistent with the energy spectrum of the proton event being dominated by low energy protons, which are less effective in causing multiple-bit upsets (MBUs) than protons with higher energy or heavy ions.

For galactic cosmic rays (the dominant background contribution), the induced daily error rate was predicted to be 180 during solar maximum and 890 during solar minimum. The observed error rate of ~280 errors per day is between these limits, in reasonable agreement with the predictions. Note that the ratio of single-bit to uncorrectable errors is nearly constant, approximately 140 (detectable double-bit errors arise at the rate of ~2 per day). However, this double-bit error rate is far higher than one would expect given the singles rate. If the architecture performed as expected, the number of multiple-bit errors should average less than one per year. This anomalous performance spurred additional proton and heavy-ion testing to investigate the nature of single-event multiple-bit upsets in the DRAMs.
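To see why independent single-bit upsets alone should produce so few double errors, consider the following rough estimate. It is only a sketch under stated assumptions, none of which come from the paper: the 2.5 gigabits are treated as 32-bit data words, upsets land uniformly at random, and every word is scrubbed (read and corrected) once per hypothetical scrub interval. A double error then occurs when a new upset lands in a word that still holds an uncorrected one.

```python
# Rough estimate of "accidental" double errors from independent single upsets.
# ASSUMPTIONS (not from the paper): uniform random upsets, 32-bit data words,
# and a periodic scrub that clears accumulated single errors.

def accidental_doubles_per_year(singles_per_day, n_words, scrub_interval_days):
    # Average number of uncorrected single errors waiting to be scrubbed.
    outstanding = singles_per_day * scrub_interval_days / 2.0
    # Each new upset hits an already-upset word with probability outstanding / n_words.
    doubles_per_day = singles_per_day * outstanding / n_words
    return doubles_per_day * 365.25


N_WORDS = int(2.5e9 / 32)      # about 78 million words, assuming 32 data bits each
SINGLES_PER_DAY = 280          # observed background rate
OBSERVED_DOUBLES_PER_DAY = 2   # observed uncorrectable rate

# Even with a very slow, hypothetical one-day scrub interval:
estimate = accidental_doubles_per_year(SINGLES_PER_DAY, N_WORDS, 1.0)
observed = OBSERVED_DOUBLES_PER_DAY * 365.25
```

Under these assumptions the estimate stays well below one accidental double per year, whereas the observed rate corresponds to roughly 730 per year, which is why the doubles cannot be explained by coincident single upsets.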

IV. ANOMALY INVESTIGATION

A. Recorder Architecture

The recorder architecture was examined more closely to determine how the EDAC was implemented. Data from five different chips were used within the 39-bit EDAC word. Two successive passes were used. The first pass obtained 20 bits of data, four bits from each of the five DRAM chips. The addresses of the four bits within each chip were widely separated. During the second pass to obtain the last 19 bits, the address for the bank of five DRAMs was incremented by one. Consequently the data from each bit in the first pass was physically very close to the data from the equivalent bit in the second pass. Figure 4 shows a physical diagram of this process.
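The word assembly described above can be sketched as follows. This is an illustrative model, not the flight design: the `Cell` type and function names are hypothetical, the first pass is assumed to use an even address (consistent with the test analysis later in this section), and the choice of which second-pass bit is left unused is arbitrary.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    chip: int      # 0..4, one of the five DRAMs in the bank
    io: int        # 0..3, one of the chip's four data lines
    address: int   # address presented to the bank


def word_cells(word_index):
    """Cells holding one 39-bit EDAC word: 20 bits from pass 1, 19 from pass 2."""
    base = 2 * word_index                  # ASSUMPTION: first pass uses an even address
    cells = []
    for pass_no in (0, 1):                 # second pass increments the address by one
        for chip in range(5):
            for io in range(4):
                cells.append(Cell(chip, io, base + pass_no))
    return cells[:39]                      # ASSUMPTION: the last second-pass bit is unused


def vulnerable_pairs(cells):
    """Pairs of bits in the same word that sit at consecutive addresses."""
    present = set(cells)
    pairs = []
    for c in cells:
        neighbor = c._replace(address=c.address + 1)
        if neighbor in present:
            pairs.append((c, neighbor))
    return pairs
```

In this model every word contains 19 pairs of bits at consecutive addresses on the same chip and data line, so a single ion that upsets two neighboring cells can defeat the single-error correction.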

Although it is likely that successive addresses in DRAMs are located close together, that is not necessarily the case. The actual locations of the bits within the DRAM were mapped to the addresses by doing a series of tests with Californium-252 fission fragments, which induced multiple-bit upsets in adjacent bits. A similar approach has been used by Buchner, et al., using a pulsed laser [6], but it is sometimes impossible to use a laser with DRAMs because of the extensive coverage of the DRAM chip with metal. Information from the manufacturer confirmed that the correct mapping of logical addresses to physical locations was obtained.

B. MBU Characterization of DRAMs

To investigate this anomaly, additional tests were done on individual DRAMs from the flight lot. Devices were irradiated at a very low flux and scanned at a rapid rate, so that raw data on single-event multiple-bit upsets could be obtained. This information was examined to count the number of multiples that spanned an even address and the next address, as these would be double errors in a single EDAC word. All the rest of the upsets, including those spanning an odd address and the next one, were added to the single-bit error tally. The ratio of doubles to singles as a function of LET is shown in Figure 5, and the ratio observed in flight, about 0.7 percent, is also noted.
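The counting rule just described can be written as a short sketch. It assumes each event is the set of upset addresses on one data line of one chip attributed to a single ion strike; the function names and the event representation are hypothetical, not taken from the test software.

```python
def tally_event(addresses):
    """Classify one multiple-bit event into (singles, doubles) as seen by the EDAC.

    A pair counts as a double only when it spans an even address and the next
    address, because those two bits share an EDAC word. Everything else,
    including pairs spanning an odd address and the next one, counts as singles.
    """
    hit = set(addresses)
    doubles = sum(1 for a in hit if a % 2 == 0 and (a + 1) in hit)
    singles = len(hit) - 2 * doubles
    return singles, doubles


def double_to_single_percent(events):
    """Ratio of uncorrectable doubles to singles over many events, in percent."""
    singles = doubles = 0
    for event in events:
        s, d = tally_event(event)
        singles += s
        doubles += d
    return 100.0 * doubles / singles if singles else float("inf")
```

For example, an event upsetting addresses 4 and 5 yields one double, while an event upsetting addresses 5 and 6 yields two singles because the two bits fall in different EDAC words.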

Figure 5. Percent of multiple bits vs. LET for the OKI 4-Mb DRAMs.

This figure shows that the expected number of multiple-bit errors increases rapidly with increasing LET. Therefore, the architectural flaw of not assembling EDAC words from widely separated bits, combined with the device sensitivity to MBUs, explains the anomalously high number of in-flight uncorrectable errors.

Tests were also conducted at the Indiana University Cyclotron Facility to determine the device susceptibility to proton-induced MBUs. The results are summarized in Table I, again accounting for the way the EDAC words are assembled. In addition to being important for the flares that Cassini is expected to encounter, these results are particularly important to AXAF, which is flying the Cassini spare recorders without a fix for the EDAC architectural flaw. AXAF orbits the Earth, regularly encountering the South Atlantic Anomaly.

<table>
<thead>
<tr>
<th>Proton energy (MeV)</th>
<th>Ratio of uncorrectable errors to singles (percent)</th>
</tr>
</thead>
<tbody>
<tr>
<td>50</td>
<td>1.5</td>
</tr>
<tr>
<td>99</td>
<td>2.6</td>
</tr>
<tr>
<td>106</td>
<td>2.9</td>
</tr>
<tr>
<td>153</td>
<td>4.5</td>
</tr>
<tr>
<td>192</td>
<td>4.5</td>
</tr>
</tbody>
</table>

Table I. Proton Results
Ratio of Uncorrectable Errors to Singles

V. DISCUSSION

The results in this paper show that although DRAMs are very sensitive to single-event upset, they are still viable choices for space designs that include properly designed EDAC to eliminate single-bit errors. More advanced error correction can also be used to correct for double errors (or even higher numbers of bit errors) at the expense of additional check bits for the same amount of data.

In this case, an unfortunate oversight in the physical distribution of bits within a single word causes the uncorrectable multiple-bit upset rate to be orders of magnitude higher than it should be. It is particularly unfortunate because the problem could easily have been fixed in the design by swapping the least significant address line with any other.
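The effect of swapping the least significant address line with another can be shown with a small sketch. The function name and the choice of address bit 10 are hypothetical, and the sketch only demonstrates the change in address separation; whether a given separation is physically far enough apart depends on the chip's internal address-to-location mapping.

```python
def swap_address_bits(address, i, j):
    """Return the address with bit i and bit j exchanged."""
    bit_i = (address >> i) & 1
    bit_j = (address >> j) & 1
    if bit_i != bit_j:
        address ^= (1 << i) | (1 << j)   # flip both bits to exchange them
    return address


def chip_address_separation(logical_even_address, swap_with=None):
    """Separation at the chip between the two passes of one EDAC word."""
    first = logical_even_address          # first pass (even logical address)
    second = logical_even_address + 1     # second pass (address incremented by one)
    if swap_with is not None:
        first = swap_address_bits(first, 0, swap_with)
        second = swap_address_bits(second, 0, swap_with)
    return abs(second - first)
```

As built, the two passes of a word are always one address apart at the chip. With the least significant line swapped with, say, bit 10, they are 1024 addresses apart, so two cells at consecutive chip addresses no longer belong to the same EDAC word.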

Fortunately, the uncorrectable error rate seen so far in the Cassini recorders is so low that it is merely a nuisance, mainly because no large solar flare has yet been encountered. Now that the spacecraft is headed away from the sun, the danger lessens as time passes. Further, extra shielding was placed around the DRAM cards to reduce the number of protons and low-energy ions. The weak solar flare seen in the first month after launch suggests that very few multiple-bit errors occur when low-energy protons strike the DRAMs.

The main lesson from this experience is that the architecture of DRAM-based designs must be scrutinized carefully if unpleasant surprises are to be avoided. A similar experience with IBM DRAMs on the Hubble Space Telescope [7] reinforces this conclusion. Finally, it is noted that unit-level ground testing of the design would have caught the architectural flaw while it could still be fixed.

REFERENCES