Retrolental Fibroplasia: A Modern Parable – Chapter 13

Chapter 13
The Experimental Method in Clinical Studies of Children

Lawyers and judges have denounced the Cooperative Study of RLF during some of the malpractice court trials in which I have testified. There is moral outrage because of the procedure of assigning infants to oxygen treatments by lot in the formal study which brought the epidemic to a swift end. But no word of criticism is expressed against the 12 years of informal experimentation in which physician-prescribed treatments led to the blinding of 10,000 children. The controlled trial of chloramphenicol, which ended the “gray syndrome” epidemic (p 82), was attacked in the United States Senate for the same reason (assignments by lot). The situation is bizarre. As I have shown earlier, a doctor may (with impunity) prescribe a “fashionable” untested treatment because of the advice of an authority or a colleague who is a personal friend; because he has read about it in the newspapers, in advertisements, or in medical journals; or simply because the treatment “makes good physiologic sense.” On the other hand, if he should decide that there is sufficient uncertainty to warrant caution, and he chooses to undertake a planned test, his action is subject to criticism. “I need permission to give a new drug to half my patients, but not to give it to them all,” said Smithells, commenting on the absurdity of this situation.

I agree with Smithells; it is difficult to understand this state of affairs. It seems to have something to do with intent. If the physician causes death and maiming as the result of well-intentioned guessing, this is tolerated because it is not perceived as “experimentation.” And theologically oriented bioethicists are firm about the decisive importance of “intention”: expiation is sanctified by good intentions. However, from the patient’s point of view, this semantic myopia is hardly comforting. The uninterpretable results of crude trial-and-error studies seem to be tolerated by society, but the patients are seldom heard from. I discussed the opposing stratagem (formal experiment) with RLF-blinded young adults. They had no difficulty understanding the need for the controlled-trial approach in 1953-54 (including assignments by lot) given the dilemmas of that time.

In connection with drug experimentation and the public conscience, Edmund Cahn has commented on a pious fraud. He calls it the “Pompey Syndrome” (named for Sextus Pompey who appears in Shakespeare’s Antony and Cleopatra). Cahn relates:

Pompey, whose navy has won control of the seas around Italy, comes to negotiate peace with the Roman triumvirs … and they meet in a roistering party on Pompey’s ship. As they carouse, one of Pompey’s lieutenants draws him aside and whispers that he can become the lord of all the world if he will only grant the lieutenant leave to cut first the mooring cable and then the throats of the triumvirs. Pompey pauses, then replies in these words:
“Ah, this thou shouldst have done,
And not have spoke on’t! In me ’tis villany;
In thee’t had been good service. Thou must know
‘Tis not my profit that does lead mine honour;
Mine honour, it. Repent that e’er thy tongue
Hath so betrayed thine act; being done unknown
I should have found it afterwards well done,
But must condemn it now. Desist and drink.”

Cahn continues, very elegantly:

. . . here we have the most pervasive of moral syndromes, the one most characteristic of so-called respectable man in civilized society. To possess the end and yet not be responsible for the means, to grasp the fruit while disavowing the tree, to escape being told the cost until someone else has paid it irrevocably; this is the “Pompey Syndrome” and the chief hypocrisy of our time. In the days of the outcry against thalidomide, how much of popular indignation might be attributed to this same syndrome; how many were furious because their own lack of scruple had been exposed?

The point is well taken. We cannot evade responsibility by shutting our eyes to the means used for gathering “intelligence” in medical warfare. Physicians, lawyers and judges are well versed on the subject of “evidence.” No one in these professions can plead ignorance about the logical basis of the experimental method (or the special form of experimentation — the controlled trial — which has been available to contain the dimensions of treatment disasters). On the other hand, persons-on-the-street should not be accused of hypocrisy because the public has been so poorly informed about the risk-limiting alternatives to the primitive trial-and-error methods which are used so frequently to evaluate innovations. Accordingly, I would like to turn now to some of the ideas which underlie the experimental method and to some of the conflicts which account for resistance to this approach in studies involving children.

Informal, yet accurate, ongoing personal observations of the natural world have always been accepted as fundamental to progress; however, organized observations and enumeration were later refinements of the observational method of inquiry. Their introduction into medicine was slow and acceptance was uneven. Formal rules to guard against the problems of guessing and personal bias in arriving at conclusions in studies at the bedside have been developed only in the recent past. The format of the controlled clinical trial was perfected after World War II (within the “life-history” of the RLF incident).

The London Bills of Mortality, which date from 1603 onward, marked the beginning of the numerical approach to description of medical events. However, for many years the death rolls in parishes (and the fatalities of infants contributed disproportionately to the totals) were used merely to warn the Sovereign of the need to move into Clean Air. Greenwood reviewed the early period in the history of medical statistics, and in the papers and correspondence of a skeptical English physician, William Petty, he came upon some novel questions for the Bills to answer:

Whether of 1000 patients to the best physicians, aged of any decade, there do not die as many as out of the inhabitants of places where there dwell no physicians. Whether of 100 sick of acute diseases who use physicians, as many die and in misery, as where no art is used, or only chance.

Although these particular analyses were never carried out, the seed of the idea of formal comparisons was planted, and Greenwood suggests that out of the casual correspondence between Petty and his friend, John Graunt, in the mid-17th century, a new method of scientific investigation germinated and grew slowly. It is an irony of history that the idea developed more rapidly in other fields of biology than in medicine. Agricultural scientists, for example, took up the approach of systematic comparisons because of the practical difficulties in their work. They encountered the kinds of problems found in medicine, which differ from those found in the physical sciences. For instance, physicists developed a method of observation and experiment which studied one factor at a time. To learn how pressure changes the volume of a gas, the temperature was kept constant. The physical experiment was limited, and isolated from its environment; but this kind of isolation was rarely possible in working with living things. Agricultural experimenters who wished to study the effects of fertilizers had to contend with innumerable and varying environmental influences. They could not eliminate them, yet their variations threatened to mask the singular effect of fertilizer. The techniques of numerical comparison of observations, which I will discuss shortly, permitted the experimenters to work with evidence as they found it, and to measure an effect against the background of variation. They did not try to idealize an experiment, and instead accepted reality. The move from observations to a kind of experimentation which made it possible to approach the problem of uncontrollable confounding factors in a real-world setting was a major step in the biologic sciences.

The great French physiologist, Claude Bernard, proposed an operational approach to the acquisition of new information in medicine, by making a clear distinction between “observation” and “experiment.” In his famous monograph, An Introduction to the Study of Experimental Medicine (1865), he noted that

In a philosophic sense, observation shows, and experiment teaches … in all experimental knowledge there are three phases: an observation made, a comparison established, and a judgment rendered.

He classified “observation” into two classes:

… A spontaneous or passive observation which the physician makes by chance and without being led to it by any preconceived idea … [secondly] an active observation … made with a preconceived idea, with intention to verify the accuracy of a mental conception.

The next step, “a comparison made”, is the essence of the experimental method, but it has always been the most difficult to carry out in clinical medicine. The problems, well-exemplified in the history of the RLF epidemic, are of two principal kinds: psychologic difficulties and intellectual blocks.

The psychologic problems in experimental studies which involve children are generated by the nature of the interpersonal relationship between physician and parents. The authoritarian model bequeathed to medicine from the past assigns an all-knowing role to the physician. When the doctor confesses his uncertainty (the only stance possible in medical investigation), he threatens the very core of the relationship. It is pointless to argue about whose emotional needs are served by maintaining the fiction of physician omniscience, since both parties are thereby calmed. When a physician considers saying to a parent, “Your baby is going blind with RLF and I don’t know how to prevent it,” he comes face to face with his own need to grasp at straws in order to keep the authoritarian model afloat. Freidson examined this aspect of the professional role; he pointed out that client-oriented physicians (those who depend, primarily, on patients for approval) have more difficulty in unmasking than do their colleague-oriented coevals (those who depend on other physicians for approval). The subject deserves more careful study than it has received. The emotional entanglement of physician and parent has been a major factor in distracting both from the very real need of the infant patient for low-risk evaluation strategies in the contemporary world of powerful treatments, where the frightful consequences of guessing wrong have been magnified immeasurably.

The intellectual obstructions to the proper application of the experimental method in studies which involve children lie in the realm of statistics. There is widespread lack of understanding of the rationale of statistical techniques. This is the nub of a problem which seems to improve very little because of stubborn resistance, if not outright antipathy, to a revolutionary thought in modern science. Bronowski said of the statistical method, “It replaces the concept of inevitable effect by that of the probable trend.”

Physicians have been reluctant to give up the idea of inevitable effect. Indeed, there is a long history of positivism in medicine which is difficult to overcome. The ideas of Galen (A.D. 138-201), which endured without challenge in the Western world for 16 centuries, typified the dogmatic approach. He codified a system of medicine based on his vast experience and on dissections of animals. The general approach was teleologic: Nature acts with perfect wisdom, he held, and from this credo he erected an authoritarian schema of diagnosis and treatment. A revealing aphorism ascribed to Galen reads:

All who drink of this remedy recover in a short time, except those whom it does not help, who all die. Therefore, it is obvious that it fails only in incurable cases.

This type of positivist reasoning has persisted into the modern era. It played a role in delaying the solution to every medical problem I have described in this book. The view is reflected best in the 1953 report advising detergent-mist (“. . . this is almost an infallible weapon . . .,” Chapter 10) and in the 1964 New York Times report of epsom-salt enemas (“It is not yet certain that the theory is correct, but it is certain that the treatment works,” Chapter 10). When we administered ACTH to infants with early RLF (Chapter 3), [we] were strongly tempted to explain away (Galen-like) the “failures” of treatment. Even after we demonstrated that the initial impression of good results could not be substantiated by a comparative trial, some physicians continued to use this “remedy” because their results were good. Such dogged convictions based on personal experience have been a major stumbling block in modern medicine. They stand in the way of objective evaluation of new proposals once they have been applied widely. The attitudes are similar to those found in the South Pacific. At the time of a solar eclipse, islanders blow whistles, shout, and beat drums to frighten the moon into disgorging the sun. When this intervention is challenged, the “therapists” reply, in effect, “Why change, it works.”

As I said earlier, statistical methodology (the probable trend approach) was taken up early by the biologic sciences because it provided a key to the puzzle of interpreting effects which tend to be obscured by variation. The central question in the studies of living things is how to decide whether an observed event is to be attributed to the meaningless play of chance on the one hand, or to causation or planned design on the other. There would be no need to invoke the logic of chance if we had to make inferences only about inevitable effects (nonvarying events). If all babies exposed to supplemental oxygen became blind, I would feel just as certain about one baby as I would about all infants so treated. Whatever the value of the generalization, it is neither greater nor less than that of the particular statement: this baby exposed . . . became blind. The deductive process of inferring from the general to the particular in the class of nonvarying events, Venn taught 100 years ago, is not accompanied by the slightest diminution of certainty. He emphasized: if one of the “immediate inferences” is justified at all, it will be equally right in every case. But, unfortunately, this one-and-all characteristic applies to few events in the natural world, and it is rare (Galen-type declarations to the contrary notwithstanding) in medicine. The inferences we must make about medical events have a very different quality: as they increase in particularity, they diminish in certainty. Let me explain. Since some babies exposed to oxygen become blind, I am very hesitant to infer from this that any particular oxygen-exposed infant is afflicted. However, if I examine many oxygen-treated babies, I feel relatively sure that some of them are blind. My assurance increases with the number of observations about which I must form an opinion.
In this class of events, there is uncertainty as to individuals, but as I include larger numbers in my assertions I attach greater weight to my inferences. It is with such classes of events (in which there are variations) and such inferences (in which there are uncertainties) that the science of probability is concerned.

It is useful to think of the kind of happening which we commonly encounter in medicine as a series. But it is a particular kind of series: one which combines individual irregularity and aggregate regularity. To return to my example, some infants exposed to oxygen become blind. If this statement is regarded simply as an indefinite proposition, the notion of a series does not seem obvious. It makes a statement about a certain unknown proportion of the whole, nothing more. However, the laws of probability are concerned not with indefinite propositions, but with numerical statements; they refer to a given proportion of the whole. And, with this latter conception it is difficult to avoid the idea of a series. What, for instance, is the meaning of the statement, One out of 20 infants exposed to oxygen becomes blind? It does not declare that in any given group of 20 oxygen-exposed infants, there will be one who becomes blind. The assertion incorporates the notion of results obtained in an examination of a long succession of oxygen-treated infants. And it implies that in this series, there will be a numerical proportion who are blind, not fixed and accurate at first, but which tends in the long run to approach 1 out of 20. This is the central idea of the probable-trend concept. It is necessary to envision a large number of observations or, as Venn emphasized, a series of them, if we are to use the power of the laws of probability to help with the interpretation of events in every-day experience. Inescapably, we come to a confrontation with the seeming capriciousness of the workings of chance. For, it turns out, the reasoning of the gambler (not as devil-may-care as he would have us believe) leads the way to a useful approach to the kinds of questions which bedevil physicians.
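The distinction drawn above, between a long-run proportion and a promise about any particular group of 20, can be checked with a little arithmetic. The sketch below is an illustration only. It assumes that each infant's outcome is independent of the others and that the long-run proportion is exactly 1 in 20 (simplifications which the text itself treats with caution), and the function name `prob_exactly` is invented for the example.

```python
from math import comb


def prob_exactly(k: int, n: int = 20, p: float = 1 / 20) -> float:
    """Binomial probability of exactly k affected infants in a group of n.

    Assumes independent outcomes with the same probability p for every infant.
    """
    return comb(n, k) * p**k * (1 - p) ** (n - k)


p_none = prob_exactly(0)               # no blind infant in the group of 20
p_one = prob_exactly(1)                # exactly one blind infant
p_two_or_more = 1 - p_none - p_one     # two or more blind infants

print(f"none: {p_none:.3f}  exactly one: {p_one:.3f}  two or more: {p_two_or_more:.3f}")
```

Under these assumptions, a group of 20 contains exactly one blind infant only about 38 percent of the time; about 36 percent of such groups contain none, and the remainder contain two or more. The statement "1 out of 20" describes the series, not each group.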

The simple coin-tossing game of “heads” and “tails” provides some clear illustrations of the fundamental principles of the theory of probability. The first thing to notice is that when a coin is thrown many times, the results of the successive tosses form a series. The separate throws (like the singly observed events in medicine) seem to occur chaotically, and the disorder gives rise to uncertainty. As long as we confine our attention to a few throws at a time, the series seems to be utterly irregular. But when we consider the overall results of a long succession, an order emerges. Finally, the pattern of chance is distinct and quite striking. In the game, there are runs of consecutive “heads” and of “tails,” but the longer the play continues the less their relative proportion to the whole amounts involved; in the case of hundreds of throws of a coin the ratio of “heads” to “tails” will be very close to one-to-one. And, in a very large experience, runs of successive “heads” and “tails” also will approach fixed proportions. The point here is that in examining things and events in the natural world (those which occur in medicine and in coin-tossing) many of their qualities are variable. But their occurrence is quite predictable in the aggregate (see Gauss’ Law in chapter notes). As a quality or attribute is noted in a long series, the proportion of occurrence is gradually subject to less variation and approaches some fixed value. Order gradually emerges out of disorder.
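The emergence of order in a long series of tosses can be watched directly in a small simulation. What follows is a minimal sketch, not a proof: it assumes a fair coin imitated by a pseudo-random number generator, and the function name `running_proportion_of_heads` and the seed are arbitrary choices made for the illustration.

```python
import random


def running_proportion_of_heads(n_tosses: int, seed: int = 0) -> list:
    """Return the proportion of heads observed after each successive toss."""
    rng = random.Random(seed)      # fixed seed so the series can be reproduced
    heads = 0
    proportions = []
    for toss_number in range(1, n_tosses + 1):
        if rng.random() < 0.5:     # a fair coin: "heads" with probability one-half
            heads += 1
        proportions.append(heads / toss_number)
    return proportions


props = running_proportion_of_heads(100_000)
for n in (10, 100, 1_000, 10_000, 100_000):
    # The early proportions wander; the later ones settle near one-half.
    print(f"after {n:>7} tosses: proportion of heads = {props[n - 1]:.4f}")
```

The first few lines of output typically stray well away from one-half, while the last lines lie close to it. The individual tosses remain as unpredictable as ever; only the aggregate becomes regular.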

Obviously, there is an enormous difference between complex events in medicine and the straightforward occurrences in coin-tossing, but some similarities cannot be wished away. I must be quick to point out one difference which has practical significance. The gambler can calculate the probabilities of outcomes before making any real-world observations. He makes some reasonable assumptions:

There are an endless number of physical forces which may influence the outcome of each toss, but these are “indifferent,” they do not align themselves in favor of either heads or tails (indeed, these are the motors of chance which operate haphazardly and account for the variation from toss to toss).
The outcome of each toss is an independent event (the coin has no “memory”).
All of the possible and equally likely outcomes are obvious by inspecting the coin.

If these assumptions are correct, it is perfectly safe to declare that the probability for the outcome of “heads” (or “tails”) at each toss is one-half. And a serious gambler can develop a betting strategy on the basis of calculations which predict the proportion of runs of successive “heads” (or “tails”) which will be approached in an upcoming game. (The a priori computations — see chapter notes — indicate that in game-sets of 100 tosses, he can expect to find about 6 runs of two-heads-in-a-row, 3 runs of three-heads-in-a-row, and 1 run of four-heads-in-a-row; the expected frequencies of longer and rarer runs also can be calculated before the coin tossing commences.) When the results of a long series of actual tosses are inspected, the difference between the number of runs of heads expected by calculation and the number observed in fact provides the gambler with useful information. If the coin seems to defy the prediction consistently, this may lead to a betting scheme which takes advantage of the bias which is found.
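The gambler's a priori figures for runs can be approximated with a short calculation. The sketch below is one way of doing the arithmetic and is not necessarily the computation given in the chapter notes. It assumes a fair coin and independent tosses, counts a "run" as a stretch of exactly k heads bounded by tails (or by the start or end of the game), and uses function names invented for the illustration.

```python
import random
from fractions import Fraction


def expected_runs(n: int, k: int) -> Fraction:
    """Expected number of runs of exactly k heads in n fair tosses (k < n)."""
    half = Fraction(1, 2)
    # A run inside the game needs a tail, then k heads, then a tail.
    interior = (n - k - 1) * half ** (k + 2)
    # A run touching the first or the last toss needs only one bounding tail.
    edges = 2 * half ** (k + 1)
    return interior + edges


def count_runs(tosses: str, k: int) -> int:
    """Count maximal runs of exactly k 'H' characters in a string of 'H'/'T'."""
    return sum(1 for run in tosses.split("T") if len(run) == k)


def simulate_mean_runs(n: int, k: int, games: int, seed: int = 0) -> float:
    """Average number of runs of exactly k heads over many simulated games."""
    rng = random.Random(seed)
    total = 0
    for _ in range(games):
        game = "".join(rng.choices("HT", k=n))
        total += count_runs(game, k)
    return total / games


for k in (2, 3, 4):
    calculated = float(expected_runs(100, k))
    observed = simulate_mean_runs(100, k, games=20_000, seed=1)
    print(f"runs of {k} heads: calculated {calculated:.2f}, simulated {observed:.2f}")
```

With these assumptions the calculation gives roughly 6.3, 3.1, and 1.5 runs per game of 100 tosses, in line with the figures of about 6, 3, and 1 quoted above. A real coin whose simulated and calculated counts disagreed persistently would be the gambler's cue to suspect bias.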

The doctor is unable to calculate the “expected” proportions of occurrences of medical events in advance. Because of uncertainties about selection of observations and doubts about the independence of occurrences (for example, the concurrence of RLF in twins, see Chapter 6), it is difficult to support the first two simplifying assumptions made by the gambler. And the third precondition is plainly impossible. In medicine, we cannot even imagine what an analysis of “all of the possible and equally likely outcomes” would mean, although we see very clearly what they mean in tossing a coin. In complex situations, we are obliged to substitute statistical probabilities, determined by experience. We note the variation in the occurrence of blindness and the “fixed proportion” which is approached in a long series of observations in infants who are treated with oxygen. After this information is in hand, we can make predictions with the same confidence as the coin-tossing gambler. The “regularities in the aggregate” make it possible to make inferences from “proportional propositions” (Venn’s term). To return to my example again: Given that 1 infant in 20 treated with oxygen becomes blind, what can be inferred about the prospect in any particular infant? The reply can be couched in the same language used after calculating the a priori probabilities in coin-tossing: the odds for occurrence are close to those of obtaining runs of two-heads-in-a-row in a long series of tosses. Uncomfortable as it is (and far-fetched though it may seem to dwell on the analogy) here the resemblance between the gambler and the physician cannot be denied. Most importantly, the statistical probability of 1 in 20 found by experience may be compared with the proportion found in future series to search for the same kind of information which the gambler finds useful. 
The gambler’s hopes for “doctored” coins which will defy the a priori calculations of outcome are exactly like the hopes of physicians for favorable treatments. Both dream of winning, but both are forced to test their fantasies in the real world of experience.

When the principles of the laws of chance were first taken up by agricultural scientists, the statistical techniques were not well suited to the needs of every-day research. Long series of observations were needed to estimate the frequency of occurrence of chance variations. For example, a plot of ground might be prepared with a new “treatment” and the subsequent yield from this plot found to be 10-percent higher than in an untreated field. The question would then arise as to how much confidence could be invested in the significance of a 10-percent difference in yields. Using the available statistical methods, it could be calculated, for instance, that 500 years’ experience would be required to provide firm support for a distinction between the observed difference of 10 percent and the variations which occurred from year to year when the fields were treated uniformly. R. A. Fisher, who had been working for several years in the early 1920s with the laboratory staff at Rothamsted Experimental Station for Agricultural Research in Harpenden, England, became aware of the practical difficulties. What was required, he thought, was some sort of test of whether an apparent effect of treatment might be expected to occur reasonably often simply by chance; a test, furthermore, which did not require hopelessly large numbers of observations. He explored the mathematical implications of using small numbers of observations and he developed a practical plan for estimating the magnitude of variations in experiments that might be expected to occur by chance. Fisher proposed a design for field trials of treatments in which a plot of land was subdivided into blocks and within each block there were to be treated and untreated strips arranged in random order. The important point was that the results would now be governed entirely by the laws of chance. Each strip had an equal opportunity of treatment or no-treatment and each block was in fact a replicated trial.
The replication now provided the estimate of chance variability (replacing the old direct-test-of-experience approach which frustrated researchers because of the need for a long series of annual yields). Additionally, the process of randomization secured the validity of the estimate of variations: “assignment-by-lot” ensured that the estimate was not biased (“loaded” would be the equivalent term in gambling). These principles of experimental design introduced by Fisher in the mid-1920s enabled experimenters to be free of the previous stringencies of large samples. The approach revolutionized the techniques of agricultural research throughout the world. Fisher’s mathematical tests for small-sample problems (to estimate how often an observed result might be expected to occur by chance) were responsible for improvement in the efficiency of studies in many other fields of applied science, including medicine. Medical research workers began to use these new ideas, experimental designs, and mathematical tests in their own work, and some saw the relevance of the strategies to all medical studies. For many years, however, the new movement was more evident in the laboratory than it was on the hospital ward.
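The layout Fisher proposed, blocks subdivided into strips with treatment assigned by lot within each block, can be sketched in a few lines. This is a simplified illustration under stated assumptions, not a reconstruction of the Rothamsted procedure: it assumes an equal split of treated and untreated strips within each block, and the function name `randomized_block_layout` and its parameters are invented for the example.

```python
import random


def randomized_block_layout(n_blocks: int, strips_per_block: int = 4, seed=None) -> list:
    """Assign 'treated' / 'untreated' to strips in random order within each block.

    Every block holds the same number of treated and untreated strips
    (strips_per_block is assumed to be even), so each block is a small
    replicate of the whole comparison.
    """
    rng = random.Random(seed)
    layout = []
    for _ in range(n_blocks):
        half = strips_per_block // 2
        strips = ["treated"] * half + ["untreated"] * (strips_per_block - half)
        rng.shuffle(strips)        # assignment by lot within the block
        layout.append(strips)
    return layout


layout = randomized_block_layout(n_blocks=6, strips_per_block=4, seed=1)
for block_number, strips in enumerate(layout, start=1):
    print(f"block {block_number}: {strips}")
```

Because the order within each block is decided by the shuffle and not by the experimenter, no strip is favored by soil, drainage, or the investigator's hopes, and the differences among the six blocks supply the estimate of chance variability.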

The writings of A. Bradford Hill in England and Donald Mainland in Canada (and later in the United States) played a role in slowly overcoming the aversion of physicians to apply numerical methods in the interpretation of observations. Hill wrote a series of articles, prepared at the request of the editors of The Lancet in 1936, which described statistical methods that could be useful to physicians. The papers were published in book form a year later and the small volume quickly became a classic. A climate of awareness was gradually created which paved the way for the 1946 debut of a distinctive form of bedside research: the randomized clinical trial.

The episode was triggered by the discovery that a new drug, streptomycin, was effective in the treatment of experimental tuberculosis in guinea pigs. Shortly after, in 1945 and 1946, the drug was used in human tuberculosis in the United States; the results were encouraging, but inconclusive. Only a small supply of streptomycin was available in England and the British Medical Research Council was faced with the problem of how to proceed with the scant amount allocated to it for research purposes. (Most of the country’s supply was taken up for two rapidly fatal forms of disease: miliary and meningeal tuberculosis.) The Council decided that its cache would best be employed in a rigorously planned investigation with concurrent controls. (I find it quite interesting that limited resources played a role in initiating this model of caution and safety from the patient’s point of view, and of moral responsibility and scientific excellence from the community’s perspective.) After considerable planning beginning in September 1946, a committee (including A. Bradford Hill) decided to limit the trial to tuberculous patients with closely defined features: acute progressive infection, diagnosis proven by bacteriologic test, status unsuitable for collapse treatment (injection of air into the pleural cavity to collapse and, thus, “rest” the lung), and age 15-25 years. Up to the time of the proposed trial, bed rest was considered to be the only suitable form of treatment for a patient with these characteristics. After a detailed protocol of procedures was drawn up, patients were recruited from the London area and beyond. The first patients to be accepted (by a panel of physicians) were admitted to designated centers in January 1947. By September of that year, 109 persons were enrolled: 2 patients died within a preliminary week of observation, leaving 107 in the trial.
At the end of a week of observation for each patient, assignment to treatment was made by opening a sealed envelope drawn from a set provided for each center (the sequence of assignments in the envelopes had been prearranged in an unpredictable order determined from tables of random numbers); 55 were allotted to the streptomycin group and 52 to the bed-rest group. Patients in both groups remained in bed for at least 6 months and the outcomes were assessed at the end of that period. Fifty-one of the 55 patients treated with streptomycin and 38 of 52 who were treated with bed rest were alive at the end of 6 months. The difference in outcome between the two groups was declared “statistically significant”: a variation in survival of this size would be expected to occur purely-by-chance less often than one time in a hundred.
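The question of how often a difference of this size would arise by chance alone can be put to the survival figures just quoted. The sketch below applies Fisher's exact test, which is my choice for the illustration; the text does not say which test the committee used, and the exact figure depends on the test chosen. The calculation assumes that the 18 deaths would have occurred whatever the treatment and asks how often random allotment would place 4 or fewer of them among the 55 streptomycin patients. The function name `hypergeom_pmf` is invented for the example.

```python
from math import comb

# Figures from the trial as reported above.
strep_alive, strep_dead = 51, 4      # 55 patients allotted to streptomycin
rest_alive, rest_dead = 38, 14       # 52 patients allotted to bed rest alone

deaths = strep_dead + rest_dead            # 18 deaths in all
survivors = strep_alive + rest_alive       # 89 survivors in all
strep_size = strep_alive + strep_dead      # 55


def hypergeom_pmf(x: int) -> float:
    """Chance that exactly x of the deaths fall in the streptomycin group
    if allotment by lot were the only thing at work."""
    return (comb(deaths, x) * comb(survivors, strep_size - x)
            / comb(deaths + survivors, strep_size))


# One-sided: 4 or fewer deaths among the streptomycin patients.
p_one_sided = sum(hypergeom_pmf(x) for x in range(0, strep_dead + 1))

# Two-sided: every split of the deaths that is no more probable than the one observed.
p_observed = hypergeom_pmf(strep_dead)
p_two_sided = sum(
    hypergeom_pmf(x)
    for x in range(0, deaths + 1)
    if hypergeom_pmf(x) <= p_observed * (1 + 1e-9)   # small tolerance for rounding
)

print(f"one-sided p = {p_one_sided:.4f}, two-sided p = {p_two_sided:.4f}")
```

The printed values need not match the committee's own figure, since a different test may have been applied, but they express the same idea in the same language: a small probability means that chance alone would rarely deal so lopsided a hand.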

It is worth examining the unstated assumptions which underlie the language used in the concluding statement of this pioneering study. The declaration envisioned that in a very large series of tuberculous patients who received the standard treatment, bed rest, the proportion of survivors at the end of a 6-month observation period would be subject to a certain variability as the result of a number of unknown factors (the play of chance). In order to estimate the size of these variations, the committee modelled the design of the trial on Fisher’s strategy of random assignment to treatments. The variability in outcome expected in bed-rest-treated patients was compared with the observed outcome in streptomycin-treated patients, and the question which was posed had the same form as Fisher’s questions in the 1920s: What reason is there to think that the relatively small group of 55 patients who received streptomycin might not have experienced a higher rate of survival even without this treatment? The committee applied the mathematical techniques of analysis which had been worked out for small-sample problems in agricultural research to determine if the observed difference in the trial was more than would be likely to arise merely as a fluke. And the declaration of “statistical significance” was couched in the grammar of chance. The terms were familiar to any gambler: streptomycin treatment was a fairly good bet. (In repeated randomized trials involving 107 patients, one could expect to win 99 percent of the time; this is the equivalent of betting against four heads in a row in repeated games of coin tossing.) It is important to consider the limits of the conclusions which were made at the end of the streptomycin trial. Notice that the committee made no claim that the efficacy of streptomycin treatment had been proven by the experience of treating 55 patients.
Also, there was very little assurance that the observed survival rate (93 percent) would hold up in future experience. In fact, there was every reason to expect that the survival rate with streptomycin treatment would be subject to the influence of at least as many unknown factors as in bed-rest treatment. A large number of observations of outcomes among future groups of patients treated with the new drug was needed to estimate the “fixed proportion,” an indication of the efficiency of the new treatment.

The conservative tone of the statements which were made by the Research Council’s committee stands in quiet contrast to the extravagant claims made in some of the reports which I reviewed in Chapter 10. And it is also worth noting that the doctors who carried out the carefully planned study were conscious of their responsibility to obtain as much information as possible from the rare experience of a formal comparison, which might never be repeated in the management of pulmonary tuberculosis. In addition to the enumeration of survivors, systematic comparisons were made of every facet of the course of the illness in the two groups. The study provided a wealth of descriptive information, not only about the beneficial effects of streptomycin, but also its limitations (for instance, drug treatment did not close large cavities in the lung) and toxicity (detrimental effects on the vestibular apparatus of the ear). The planning committee noted at the end of the trial that the need for a control group was underscored by the finding of impressive improvement in some patients treated by bed rest alone. The streptomycin trial was the first controlled clinical investigation of its kind (which led to a positive result — see chapter notes). It was followed by a long series of pathfinding clinical trials conducted by the Medical Research Council which evaluated a wide variety of proposed treatments.

The format of present-day trials using the technique of randomized controls evolved from the British experiences. Until recently, the mathematical strategies for statistical analysis tended to receive more attention than was given to the basic logical concepts of the design of trials. In the past few years, however, Feinstein has examined, in minute detail, the conceptual “architecture” of studies involving free-willed physicians and their free-living patients. From his analyses, and those of Sackett, the methodologic discipline has advanced considerably (Appendix D). In planning bedside studies, unlike laboratory experiments, a host of real-world (potentially biasing) influences must be taken into account.

Valid comparison is the sine qua non of a trial of a new treatment. If past events are to be used as a standard of comparison, it must be presumed that everything except the new treatment has remained uniform with the passage of time. Under such circumstances it is unnecessary to use the ponderous machinery of the experimental method since the observational approach provides interpretable information. This was exemplified by the experience with tuberculous infection of the meningeal membranes enveloping the brain: prior to 1946, it had been uniformly fatal. Biologic and environmental factors had no known effect on the outcome of this form of tuberculosis. The results of prior treatments did not vary; patients of all ages, either sex, with or without other complications, etc. all succumbed.

When streptomycin was first used to treat a series of patients with this disorder, it was unnecessary to go beyond the step of Claude Bernard’s “active observations” (“. . . made with a preconceived idea, with intention to verify the accuracy of a mental conception.”). And, the results of the consecutive-treatments approach were secure (since the premise, “all untreated patients die,” was true). A single instance of survival was clear evidence of the effectiveness of treatment. However, a large series was required for an estimate of the variability in survival rate (now the efficiency of treatment was influenced by innumerable biologic and environmental factors). And, in studies to determine whether a new treatment was superior to streptomycin, all of the problems with variability would emerge. As I have emphasized repeatedly, inevitable effects in medicine are rare. Most bedside experiences are of the kind I have described in this book. They are far from invariant, and they are beset with interpretive difficulties. It was this background of experience which accounted for the general uncertainty about the reports from Melbourne and Birmingham (Figs. 4-1 and 4-2) concerning the role of supplemental oxygen in RLF. The skepticism which greeted the first suggestions was a responsible reaction on the part of the medical community as a whole and of individual physicians who were concerned about the well-being of their patients. What could be made of the finding that the frequency of RLF in a nursery fell from 3 out of 13 in 1944-50 (with “high oxygen”) to 0 out of 6 in 1951 (with “low oxygen”)? Here, the guidance concerning a hierarchy of evidence provided by Claude Bernard was helpful. 
For, it is clear that the association found by this “active” observation was a higher order of evidence than the association noted in the previous “passive” observation (the Boston observations concerning RLF, iron, and vitamins had been made by sifting through past records; the observers were not led to the observation by a preconceived idea). Moreover, the pitifully small numbers did not detract from the qualitative strength of the English evidence. The point is that this was in essence a single “active” observation: it advanced the state of knowledge by one notch. The report of the Texas air-lock experience (Chapter 10) quoted very large denominators (the death rate fell from 1.9 percent among 6324 births in 1949 to 1.5 percent among 1372 births in 1950 in association with use of the device), but it also advanced the state of knowledge by one notch. The Houston experience was, in essence, a single active observation. Now, I do not wish to denigrate the potential importance of the Birmingham report, the air-lock report, and all of the accounts of “active observations” which I have cited in this volume (including, I must emphasize, the announcement that 28 infants improved dramatically following epsom-salt enemas, Chapter 10). For, in each instance the observed associations were made with a preconceived idea, and they provided leads to the solutions of very important problems. Moreover, each lead made it possible to frame a question in the form of a “proportional proposition.” Instead of, “Can low oxygen treatment reduce the risk of RLF?,” after the Birmingham observation the question was quantified: “Can low oxygen reduce the risk of RLF from about 20 percent to almost zero?” After the Houston report, the question was “Can use of the air-lock reduce mortality from ca. 2 percent to ca. 
1.5 percent?” And so it was in all of the reports of “active observations.” What needs to be understood is that the numbers provided in each of these reports served the important function of converting the indefinite propositions into a form which could be dealt with by the resolving power of the laws of probability. But the numerical differences provided no quantitative information about how often such differences could be expected purely by chance. The “how often” information could only be obtained safely and reliably by an additional step-up in the hierarchical process which seeks to protect patients and improve understanding. A formal test was required of the question: How often would the apparent association between low oxygen and a fall in RLF frequency from ca. 20 percent to almost zero be expected simply by chance? The direct test of experience would entail a long series of observations in groups of identically treated infants (preferably in different hospitals) to obtain an estimate of expected variations; this was the backdrop against which a 20 percent-to-zero change would have to be viewed. (The need for evidence from different institutions was quite important. The population of patients in any one hospital was so highly selected that there could be little confidence that a specific experience would give a reliable estimate. Recall that results observed after lowering oxygen in Birmingham and Oxford in England, Melbourne, and Paris were widely discrepant.) I cannot emphasize this point enough: the magnitude of the interpretive difficulties in the direct-experience test is much greater in bedside medicine than is found in most other fields of inquiry. 
Hopelessly large numbers of observations are required because the “background” fluctuations are usually large and the outcome differences are relatively small (in medicine we are usually trying to decide about a small change, i.e., from 73 to 93 percent survival, as in the pulmonary tuberculosis trial in England, rather than the striking change in the tuberculous meningitis experience, i.e., from no survivors without streptomycin to 81-percent survival with treatment).
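A rough sense of the “background” fluctuations can be had from a deliberately idealized calculation. The Python sketch below assumes that every infant independently carries the same fixed risk of RLF of about 20 percent. That assumption is exactly what the differences among hospitals call into question, so the figures illustrate the logic only. The function names and the 1-in-100 threshold are hypothetical choices made for the example.

```python
def chance_of_no_cases(risk, group_size):
    """Probability that a group of `group_size` infants shows no case at all
    when every infant independently carries the same `risk`."""
    return (1.0 - risk) ** group_size


def smallest_convincing_group(risk, tolerated_chance):
    """Smallest group in which a run of zero cases would arise by chance
    less often than `tolerated_chance`."""
    size = 1
    while chance_of_no_cases(risk, size) >= tolerated_chance:
        size += 1
    return size


ASSUMED_RISK = 0.20  # "about 20 percent", the figure used in the text

fluke = chance_of_no_cases(ASSUMED_RISK, 6)
needed = smallest_convincing_group(ASSUMED_RISK, 0.01)
print(f"Chance of 0 cases among 6 infants with unchanged risk: {fluke:.3f}")
print(f"Group size at which 0 cases falls below a 1-in-100 fluke: {needed}")
```

Under these assumptions a nursery of six infants would show no RLF at all about one time in four with no change in treatment whatever, which is why the 0-out-of-6 experience could serve as a lead but not as a test.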

The important contribution of R. A. Fisher to the development of scientific methodology was the novel design of experiments; he did not perform feats of mathematical legerdemain. The statistical techniques which he invented (and earlier mathematical tools which originated with Karl Pearson) were not meant to be used like recipes out of a cookbook in the hope of extracting meaning from any and all sets of numbers derived from unplanned observations. Tests of statistical significance, Mainland stressed, are not magical maneuvers for determining whether there is some hidden bias in an experiment. The conduct of an investigation must be so designed as to minimize the likelihood of systematic “loading” of extraneous factors which will influence the outcomes. If the precautions have not been taken in advance, the mathematical tests are simply inapplicable. For instance, it is misleading to perform statistical arithmetic on the numbers obtained in the Melbourne observations (Table 13-1). A declaration of “statistical significance” is meaningless because infants in each of the three hospitals did not have an equal opportunity of treatment with “high” or with “moderate” oxygen. (The key requirement of Fisherian design was not fulfilled: random assignment of treatments to ensure that the results would now be governed entirely by the laws of chance.) The same problem arose in interpreting the initial results in RLF outcome after ACTH treatments (Table 13-1). Again, only assignments-by-lot within each hospital could satisfy the equal opportunity assumption needed for use of the statistical tests (and, as this experience demonstrated, this is a basic requirement for the protection of patients). The 1953-54 Cooperative Study of RLF subsequently confirmed the “lead” which was indicated by the relatively small difference in RLF frequency observed in Melbourne hospitals. 
On the other hand, our randomized controlled trial of ACTH failed to confirm the striking difference in results found in the two New York hospitals. Before the controlled trials were conducted there was simply no way to know which of the initial favorable leads in the two countries was false.

Table 13-1. Two “Leads” Concerning RLF (1950-1951)

Site	“Active Observations”	RLF Occurrence	Difference
Melbourne	One “high-oxygen” hospital (123 infants) vs two “moderate-oxygen” hospitals (58 infants)	19% vs 7%	-12%*
New York	No treatment, Lincoln Hospital (7 infants) vs ACTH, Babies Hospital (31 infants)	86% vs 19%	-67%**

* Supported by subsequent controlled trial.
** Refuted by subsequent controlled trial.

I do not wish to minimize the practical difficulties of conducting clinical studies with concurrent controls. But hardship alone cannot account for the fact that the basic requisites of design are met in a very small proportion of present-day studies which assess the efficacy (and the risks!) of new treatments. There is a persistent “straw-man” issue: the multiplicity of clinical variables. Many physicians believe that comparisons of treatments using concurrent controls are impractical in clinical medicine because it is almost impossible to assemble two groups that are matched exactly in every clinical detail. I find it hard to understand how this belief supports the validity of using past events as a standard of comparison. But this aside, the argument reveals a fundamental lack of understanding about the role of randomization. Fisher explained that it is pointless to insist that all the conditions in compared groups must be exactly alike. This is an impossible requirement in all biologic experimentation because the list of possible factors which might influence the outcome can never be exhausted: the number is unknown. It is random assignment of treatments which serves as the fundamental safeguard under these conditions of uncertainty about risk variables (an inescapable condition of medical studies). And, to repeat, scattering-by-chance guarantees the validity of the test of significance by which the results of the trial are judged.
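The safeguard which randomization provides against unknown risk variables can be shown with a small simulation. The Python sketch below uses invented data: 200 hypothetical patients, each given a hidden “risk score” which no one conducting the study can see. It compares repeated assignment by lot with a caricature of selective prescribing in which the lowest-risk half receives the new treatment. All names and numbers are assumptions made for illustration.

```python
import random

rng = random.Random(1954)  # fixed seed so the sketch is repeatable

# Hypothetical patients, each with a hidden risk score the doctors cannot see.
hidden_risk = [rng.random() for _ in range(200)]


def gap_in_hidden_risk(treated, controls):
    """Mean hidden risk of the treated group minus that of the controls."""
    return sum(treated) / len(treated) - sum(controls) / len(controls)


def assign_by_lot(risks, rng):
    """Shuffle the patients and give the first half the new treatment."""
    shuffled = risks[:]
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def assign_by_judgment(risks):
    """Caricature of selective prescribing: the lowest-risk half is treated."""
    ranked = sorted(risks)
    half = len(ranked) // 2
    return ranked[:half], ranked[half:]


# Repeat the lottery many times and average the imbalance it produces.
lot_gaps = [
    gap_in_hidden_risk(*assign_by_lot(hidden_risk, rng)) for _ in range(2000)
]
average_lot_gap = sum(lot_gaps) / len(lot_gaps)
judgment_gap = gap_in_hidden_risk(*assign_by_judgment(hidden_risk))

print(f"Average imbalance with assignment by lot: {average_lot_gap:+.4f}")
print(f"Imbalance with selective prescribing:     {judgment_gap:+.4f}")
```

Any single lottery leaves some imbalance, but it leans neither way on average, and its size is what the test of significance allows for. The selective assignment builds in a systematic difference which no arithmetic performed afterward can remove.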

I would be unfair if I suggested that the resistance to lottery-like proceedings is due entirely to lack of awareness of the logical basis for the precautions. Once more, there are some emotional issues to consider. The thought of random allocation to treatments in which blindness or life is at stake is, at first blush, a repugnant one. I believe the revulsion is an all-too-human denial of reality: chance is in control of our lives to an extent which is too uncomfortable to dwell upon. Bronowski pointed out that the early readers of The Origin of Species were outraged, in their religious and their moral convictions, by the central place of chance in Darwin’s theory of evolution. I suspect it is religion-based morality which stands behind the self-righteous statements I have heard in courtrooms. But, I hasten to add, the unease is deep-seated. Despite all evidence indicating the enormous risks of guessing in medicine, we are all prone to feel that a well-meaning guess is somehow not as cold and unfeeling as the flip of a coin. I can recall very vividly, when we were conducting the formal evaluation of ACTH treatment, that one of my colleagues refused to allow his patient to be enrolled in the trial. He was convinced that the treatment was effective and he proceeded to administer ACTH to his patient who had very early signs of RLF. I can also recall, sadly, the fatal infection which occurred as a complication of that treatment. Another example of the unwillingness of physicians to consider that their attentions may be dangerous was related by Cochrane. It occurred in England during an attempt to evaluate intensive-care units for patients with coronary artery heart disease. He found a considerable vested interest in the results of a randomized controlled trial to compare the outcome in coronary-care units versus care at home. The first report of the trial showed a slightly higher death rate in hospitalized patients than among patients treated at home. 
Someone reversed the figures and showed them to a coronary-care-unit enthusiast. He immediately declared that the trial was unethical and must be stopped at once. However, when he was shown the correct results, he could not be persuaded to declare the hospital units unethical!

I find it completely understandable that compassionate physicians, struggling for solutions to unsolved medical problems, form emotional attachments to leads which develop during the search. And I do not scoff at these feelings; but I do argue that they should not be hidden. When personal predilections are openly expressed, it is easier to design a test plan which will keep these from entering into treatment decisions. The issue of “experimenter’s bias” was brought home to me in an early (fruitless) trial of the effect of artificial light on the occurrence of RLF. Assignment to “light” or “no-light” was made on the basis of blue and white marbles in a box. One day, I noticed that our head nurse reached into the box for a marble and then replaced it because it wasn’t the color that corresponded to her belief about the best treatment for her babies. I became convinced that we had to shift to sealed envelopes, as used in the British streptomycin trial. When the first sealed envelope was drawn, the resident physician held it up to the light to see the assignment inside! I took the envelopes home and my wife and I wrapped each assignment-sticker in black paper and resealed the envelopes.

What is the alternative to assignments-by-lot in formal testing? Unfortunately, formats which depend on physician-prescribed assignments are not satisfactory. For, if doctors were sufficiently prescient to choose correctly among unknowns, the sad record of the past would not be there to haunt us. Nonetheless, the resistance to randomization as a method of allocating patients to treatment has given rise to proposals for “adaptive” designs (using information obtained during the course of the trial to determine the treatment assignment for the next patient). Some of these adaptive procedures have been given colorful names (e.g., “play the winner”, “two-arm bandit strategy”) and they have attracted considerable interest. But the outcomes of treatments are often not evident until some time has passed; in the history of perinatal treatments, short-term “winners” have sometimes become long-term “losers.” This point is illustrated by the Cooperative Study of RLF; it was one of the first attempts to use an adaptive approach. Only one-third of the infants were assigned to routine unrestricted oxygen in the initial three months of the study to evaluate survival differences under the oxygen regimens. The favorable short-term results (no apparent difference in mortality) of this first phase of the study (involving 212 infants) indicated that it was safe to proceed with curtailed oxygen for the remaining nine months (an additional 574 infants were enrolled). The long-term results of curtailment of oxygen in this very large number of infants were never evaluated. But, as I have indicated (Chapter 9) there is reason to suspect that there were unfavorable late effects. A review of recent methodologic alternatives to randomized trials concluded that the latter type of clinical investigation is very complex, expensive, and time-consuming; but the format remains as the most useful tool which has been devised for comparisons of treatments.
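The adaptive idea can be made concrete with a sketch of one common form of the “play the winner” rule: a patient who does well on a treatment causes the next patient to receive the same treatment, and a failure causes a switch to the other. The Python sketch below is an illustration under stated assumptions, not a description of any actual trial. The two treatments, their short-term success rates, and the function name are all invented.

```python
import random


def play_the_winner(success_chance, n_patients, rng):
    """Assign patients one at a time: stay with a treatment after a success,
    switch to the other after a failure. Returns patients per treatment."""
    arms = list(success_chance)
    current = rng.choice(arms)
    assigned = {arm: 0 for arm in arms}
    for _ in range(n_patients):
        assigned[current] += 1
        # Only the immediately visible outcome is available to the rule.
        succeeded = rng.random() < success_chance[current]
        if not succeeded:
            current = arms[1] if current == arms[0] else arms[0]
    return assigned


# Hypothetical short-term success rates, invented for illustration.
short_term = {"treatment A": 0.9, "treatment B": 0.5}
tally = play_the_winner(short_term, 2000, random.Random(7))
print(tally)
```

The rule steers most patients toward whichever treatment looks better on the outcome it is shown. It has no way of knowing whether the short-term “winner” will prove a long-term “loser,” which is the weakness the Cooperative Study experience illustrates.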

I wish to comment on the myth that patients enrolled in a randomized trial are called upon to take unwarranted risks for the sake of others. The issue becomes inflamed when babies are involved. Despite all evidence from past experience which demonstrates that unsuspected risks of innovation can be minimized by the controlled-trial strategy, the distorted notion persists. Chalmers objected to the assumption that there was greater interest in future patients than in enrollees. An equal case can be made, he noted, for randomization to result in a better chance that a patient will receive the proper treatment. Our experience with sulfisoxazole (Chapter 10) was a convincing illustration of his view. Additionally, a task force of the Department of Health, Education and Welfare undertook a systematic attempt in 1976 to estimate the nature and magnitude of the risk for human subjects who participate in research projects. A survey of 538 medical researchers, involving about 39,000 patients, indicated that the risks in therapeutic trials were no greater than those of treatment in other settings.

Another aspect of this “unjustified-risk” argument arises when it is suggested that controlled trials of promising treatments are unwarranted in conditions of very high mortality; severe hyaline-membrane disease has been used as an example. This is completely logical if the arguments are confined to the category of inevitably fatal conditions (as I noted above, in tuberculous meningitis). But the situation is rarely this simple. A Boston group observed that patients with severe and fatal disease merely represent the “tip of an iceberg.” There are almost always many more with milder forms of the same disease, and increased interest in diagnosis takes place when a new treatment is introduced. This invariably leads to the recognition of less severe examples which were previously overlooked. Consequently, outcome in currently treated patients appears improved, even when the treatment is without effect or actually deleterious. The incident involving epsom-salt enemas to treat hyaline-membrane disease was a tragic example of this phenomenon.

Legal issues surrounding the matter of random assignment of treatments were reviewed by Fried in 1974. He concluded that the law is incomplete on most of the difficult dilemmas posed (e.g., informing patients of the fact of randomization and whether or not randomization does violence to the duty a doctor owes his patient). He also discussed the matter of values (“. . . the question of what is right in principle”) in considerable detail and he made a plea for increased candor, education and participation in the planning and conduct of randomized clinical trials. I believe there is an enormous gulf which separates the law and medicine in these matters. One indication of the semantic distance between the disciplines is seen in use of the word “experimentation”; in the courts the term connotes malpractice (procedures which vary from accepted practice).

A series of important questions about the issue of properly designed studies to evaluate proposed treatments was posed by Kabat:

“Is it ethical to do a study on human subjects with a design such that one may come up with the wrong answer? Is it fair to the participants? Is it fair to those who subsequently become the recipients or victims of what becomes a prescribed but useless prophylactic or therapeutic measure? Once a large-scale field trial has yielded a conclusion, right or wrong, and becomes official and sacrosanct, how many potentially better studies will not be carried out because of it? How much of a false sense of security will it give patients, their families and society? Suppose the conclusion was reached because of inadequate controlled experimentation?”

Viewed with the hindsight of almost a quarter of a century, I see the Cooperative Study of RLF as an example of the very situation envisioned by Kabat. The trial did not provide a wrong answer, but it came up with an incomplete answer. And once the results were announced, it became unthinkable to test the unexpected leads which were found. (Recall that the Cooperative Study had not been designed to test the relationships between varying durations of oxygen exposure, varying concentrations of supplemental oxygen and RLF. The associations described in the final report had been disclosed by “data-dredging”: the mass of information collected in the study was inspected in a search for correlates. The finding that duration rather than concentration of oxygen correlated with RLF-risk constituted an untested hypothesis. And the provisional quality of this ex post facto evidence was the same as that found in so many of the previous analytic surveys: associations which had failed to pass a critical test.) However, when it became possible to measure the state of oxygenation of blood in the 1960s, the unresolved issues were approachable. Now a question of trade-off of risks could be posed: Can the risk of brain damage and death be reduced substantially, at the cost of a minimum increase in the risk of RLF, if oxygen is administered in amounts to maintain oxygen in the blood at the upper part of the so-called “normal” range rather than low “normal” (see Chapter 9)? I mentioned earlier that a Cooperative Study was undertaken in 1969 to explore the relationship between blood oxygen and RLF risk; however, the design of the study doomed it from the start. Measurements were made of oxygen in the blood of infants who received supportive treatment according to individual-physician prescription (the “active” observations approach) rather than by random assignment to prescribed conditions of blood oxygen (the experimental approach). 
At the end of 8 years of effort (3 years of observations and 5 years of analysis of the results!), there were no interpretable findings. To this day, when oxygen is administered to premature infants, they are exposed to the intertwined risks of brain damage, death and RLF with nothing more than authoritative guessing as protection.

It is painful to hear some of the questions raised in malpractice suits against conscientious physicians who treated infants prior to September 1954. One question in particular reveals the gulf of misunderstanding between medicine and community; it is: On what day was the truth concerning the association between oxygen treatment and the risk of RLF established? The accusatory climate created by this kind of absolutist thinking has had the ruinous effect of downgrading the role of doubt in medicine. And, the combined resistance of social, political and ethical forces to the use of the experimental method in studies involving children has encouraged a return to the hazards of Galenist reasoning. Somehow, the community must come to the realization that excellence in medical research is to be fostered as a public safety measure. Even a society made wary of science, because of the misapplication of technical developments, must know that the underlying logical machinery of the scientific method is in the public interest. Science, Popper noted, is one of the very few human activities in which errors are systematically criticized, and fairly often, in time, corrected.


Last Updated on 02/28/24