# forum 9: week of 12 March: Fisher and the design of experiments

I encourage all participants to use this space for communicating ANY difficulties that arise during your reading of Fisher BEFORE next Tuesday (March 12), and I will be sure to accommodate your requests in my presentation!

Endorsing what Nicole said: let's get really clear about the idea here, as epistemology as well as take-it-or-leave-it scientific method. So let's have a variety of questions, problems, puzzles. One issue that I think is central is this:

When we do an experiment, the results are interesting if they are surprising. Take 'surprising' as 'improbable'. Improbable given what, given which assumptions about probabilities? (You're testing a coin for bias, so you do a long series of tosses, and they're mostly heads. This is not surprising if we assume that the coin is biased, but it is if we assume that it is fair. But we don't want to assume either: that's just what we want to find out.) Different attitudes to this separate different philosophies of testing.
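The coin example can be made numerical: the very same observation is improbable under one assumption and unremarkable under the other. A quick sketch in Python (the 20 tosses, the 15-head cutoff, and the 0.8 bias are my own illustrative numbers, not from the post):

```python
from math import comb

def tail_prob(n, k, p):
    """P(at least k heads in n tosses of a coin whose heads-probability is p)."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

# The same observation, judged under each assumption:
p_fair = tail_prob(20, 15, 0.5)    # improbable if the coin is fair
p_biased = tail_prob(20, 15, 0.8)  # unremarkable if the coin is biased
print(f"P(>=15 heads | fair coin)   = {p_fair:.4f}")
print(f"P(>=15 heads | biased coin) = {p_biased:.4f}")
```

Under the fair-coin assumption the result has probability about 0.02, while under the biased-coin assumption it is the expected sort of outcome, which is exactly why "surprising" cannot be judged without saying which assumption is in force.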

In response to Dr. Morton's question: from the Fisher reading, as well as some rudimentary knowledge of statistics (most of which is also based on Fisher's work, obviously), it follows that the results of experiments are interesting if they are statistically significant, i.e., if it is improbable that they occurred purely by chance. Given this, we are operating under the assumption that we, as experimenters, have considered every possible outcome before the experiment is conducted, and that it would be highly unlikely for data that falsify the null hypothesis to occur by accident. This also depends on the sample size used in the experiment, and Fisher stresses its importance by stating, *"The odds could be made much higher by enlarging the experiment, while if the experiment were much smaller, even the greatest possible success would give odds so low that the result might, with considerable probability, be ascribed to chance."*
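Fisher's point about enlarging the experiment can be made concrete with the tea-tasting design from this chapter, where the taster must identify which n of 2n cups had the milk poured first. A quick sketch of how the odds of a perfect score by pure guessing shrink as the experiment grows:

```python
from math import comb

def chance_of_perfect_guess(n):
    """P(identifying all n 'milk-first' cups out of 2n by pure guessing)."""
    return 1 / comb(2 * n, n)

for n in (3, 4, 6):
    print(f"{2 * n} cups: 1/{comb(2 * n, n)} = {chance_of_perfect_guess(n):.4f}")
```

With 6 cups even a perfect score has a 1-in-20 chance of being luck; Fisher's 8-cup design brings that to 1 in 70, and 12 cups to 1 in 924.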
Nicole, I found this reading to be a lot more accessible than last week's (and I wouldn't be surprised if others felt the same way); some basic knowledge of statistics primed my understanding of Fisher this time, so I haven't encountered any difficulties.

What I got from reading about experimental design here, along with some previous knowledge from taking research methods, is that there is always a possibility that your results are due to chance or to other factors not accounted for. You can set a stricter significance level, make your experiment larger, or even increase reliability with test-retest methods, and though the possibility of the results being due to chance will get smaller and smaller, you can never know for certain that your results indicate an actual effect. Another factor, besides the possibility of the results being due to chance, is that no matter how hard you try to control the noise and other third variables that could be contributing to your results, you could still have failed to take something into account without being aware of it, and your results could have been the effect of that. That is why in science we take all these precautions, yet you can never KNOW for certain that your hypothesis is true; that is why you always say "based on the data we can conclude so and so" and never say that it IS so for certain. I think the experimental method yet again reveals an interesting aspect of knowledge: we can never fully, 100 percent, know anything. We think we know, but we may come to find out later that we hit that one-percent chance, or that there was another variable we didn't account for, or some other factor that distorted our knowledge.

I have a lot of respect for experimental design and the amount of knowledge we can gain from it. My biggest uncertainty (I realize this would have been a lot more useful had I posted it prior to Nicole's presentation) is: what made researchers settle on the standard 5% level of significance? And although this is the most common significance level, what circumstances lead researchers to sometimes use a 1% level instead?

Thanks for your question, Andrea. Both the 5% and 1% significance levels are common. Those who think 5% is *too high* tend to go with 1%, because a lower significance level is (supposedly) more desirable. Why 5% is too high a significance level is incredibly subjective, and more often than not relies on *factors specific to the discipline* under investigation that I am unable to speak to here. In any case, this is the most articulated view on significance levels among researchers. The 5% significance level has become known as a *standard* because of how frequently it has been used, *and* because some journals have made it a rule for publication. That is, *if* results don't reach the 5% significance level, then those studies are *generally not accepted* in those journals. I am quite skeptical myself of the merits of treating the 5% significance level as *the standard*, and support a view on significance levels that does *not agree* with the most articulated view I mentioned earlier. Thus, I am not quite the right person to ask what circumstances lead researchers to sometimes use a 1% significance level. Needless to say, the debate on how to *correctly* interpret significance levels is nowhere near reconciliation, though one of the goals of my term paper for PHIL 440 is to shed light on this topic. Also, I found one article on the internet that tries to address this question of why 5% is a common significance level: http://www.jerrydallal.com/LHSP/p05.htm. I hope this helps. Let me know if you have any more questions on this topic.

In your presentation, could you discuss the first sentence on page 12 of Sir Ronald A. Fisher's book, under heading 6, Interpretation and its Reasoned Basis: "In considering the appropriateness of any proposed experimental design, it is always needful to forecast all possible results of the experiment, and to have decided without ambiguity what interpretation shall be placed upon each one of them." How does this statement accommodate experimentation meant to discover new developmental results, or new theory? Also of interest would be your discussion of the concept of the null hypothesis, and its purpose.

Could you also include in your presentation the Boeing Advanced Quality System Tools manual:

http://www.boeingsuppliers.com/supplier/d1-9000-1.pdf

Your discussion of Section 1.17, pages 208 to 214 [pages 211 to 218 in the pdf copy], on Statistically Designed Experiments, and any other Boeing content you would like to relate to Sir Ronald Fisher's *The Design of Experiments*, will be appreciated.

Thank you to those who have contributed to this forum so far. As I prepare for the presentation tomorrow, I should note that I will *not* be accepting comments/questions *after 9:30 pm tonight*. If anyone has any burning questions, it's best to ask them before then. Remember, the more comments I receive, the more likely I am to satisfy the needs of the audience; the presentation will go much more smoothly if everyone's considerations are taken into account.

Thank you to everyone for all the questions yesterday - it is great to see that some of the participants have a real interest in experiments and their design! Also, any feedback pertaining to my presentation would be greatly appreciated.

Now that my presentation is finished, I will accept any additional questions anyone may have pertaining to Fisher's work on significance tests and the design of experiments! I will be sure to answer your questions to the best of my ability.

Nicole,

Could you comment on Fisher's statements on page 7, under heading 4, The Logic of the Laboratory: "Inductive inference is the only process known to us by which essentially new knowledge comes into the world." And on page 8: "Experimental observations are only experience carefully planned in advance, and designed to form a secure basis of new knowledge; that is, they are systematically related to the body of knowledge already acquired, and the results are deliberately observed, and put on record accurately."

In reply to the concern over the need to know in advance all possibilities in order to learn something from an experiment: I think the problem might lie in the fact that in the paper there is no distinction between 'learning' and 'contributing to scientific knowledge'. We may well learn that under certain experimental conditions a possibility that we hadn't foreseen does in fact obtain, and use this result as a basis for further investigation. But for the purposes of gleaning some legitimate scientific knowledge, those results are irrelevant because they don't substantiate either of the hypotheses in the experiment.

Thomas, in the 2011 NOVA film series The Fabric of the Cosmos, physicist Dr. Leonard Susskind argues from the perspective that there are 10^500 different string theories. He claims this is exactly what cosmologists are looking for. This fits with the idea of a multiverse: a huge number of universes, each different. In his 2006 book The Cosmic Landscape, Dr. Susskind, on page 381, draws a distinction between the words he uses in the book, landscape and megaverse. In the film he used multiverse in place of megaverse. On the megaverse [multiverse] he wrote, "The megaverse [multiverse], by contrast, is quite real. The pocket universes that fill it are actual existing places, not hypothetical possibilities."

I think any testing so devised challenges Dr. Fisher's requirement that we forecast all possible results.

One individual asked me after class about "the black swan problem", and whether the Fisherian way of testing hypotheses would relate to that. Before I respond to this question, I should clarify that "the black swan problem" is about falsification of hypotheses (http://en.wikipedia.org/wiki/Falsifiability#Inductive_categorical_inference) as a 'solution' to the problem of induction - at least that is how I take it. Now that we know somewhat what the black swan problem refers to, my *short* answer to the question is that Fisher's significance tests *do NOT* provide a means to falsify hypotheses. Yes, Fisher says that there's a chance at disproving the null hypothesis (and that the null hypothesis could never be proved), but this *does NOT* (necessarily) mean that the primary objective of significance tests is to falsify the null hypothesis! My *longer* answer follows, if anyone cares to read it. Deborah Mayo would agree with me that significance tests *should NOT be used* to falsify hypotheses in the way Popper describes falsification. In fact, I quote verbatim an excerpt from Mayo's 1996 book, *Error and the Growth of Experimental Knowledge* (page 2):

For Popper, learning is a matter of deductive falsification. In a nutshell, hypothesis *H* is deductively falsified if *H* entails experimental outcome *O*, while in fact the outcome is ~*O*. What is learned is that *H* is false. ... We cannot know, however, which of several auxiliary hypotheses is to blame, which needs altering. Often *H* entails, not a specific observation, but a claim about the probability of an outcome. With such a statistical hypothesis *H*, the nonoccurrence of an outcome does not contradict *H*, even if there are no problems with the auxiliaries or the observation. As such, for a Popperian falsification to get off the ground, additional information is needed to determine (1) what counts as observational, (2) whether auxiliary hypotheses are acceptable and alternatives are ruled out, and (3) when to reject statistical hypotheses. Only with (1) and (2) does an anomalous observation *O* falsify hypothesis *H*, and only with (3) can statistical hypotheses be falsifiable. Because each determination is fallible, Popper and, later, Imre Lakatos regard their acceptance as decisions, driven more by conventions than by experimental evidence.

Mayo later states in the same book, "A genuine account of learning from error shows where and how to justify Popper's 'risky decisions.' The result, let me be clear, is *not a filling-in* of the Popperian (or the Lakatosian) framework, but a wholly different picture of learning from error, and with it a different program for explaining the growth of scientific knowledge" (page 4, emphasis mine). In other words, Popperian falsification is *NOT* the right way to think about hypothesis tests! Hypothesis tests, whether Fisherian or from the Neyman-Pearson methodology, are *NOT* about falsifying statistical claims. Hence, the Fisherian way of testing hypotheses *does NOT apply* to the black swan problem. I quoted Deborah Mayo's view on Popper's falsification to show that my view against Popper's falsification came from her work. My term paper for PHIL 440 will address how to *correctly* interpret Fisher's significance tests. I hope this helps. Are there any questions about my answer to this person's question?


Nicole, does the above chart copy, and the associated text copied below from the experiment described on page 248 of the Boeing Advanced Quality System Tools document, satisfy your concept of significance test methods, and the significance test methods of Deborah Mayo and Sir Ronald A. Fisher?

Robust design: testing process parameters Parts in a heat-treat process were experiencing unpredictable growth, causing some parts to grow outside of the specification limits and be rejected as scrap. It was surmised by the engineering team that irregular growth was due to the orientation of the part in the oven and the part’s location in the oven. Since it was desirable to heat treat a maximum number of parts in each oven load, it was important to be able to determine a set of heat-treat processing conditions that would result in minimum growth for heat-treated parts in both a horizontal and vertical orientation, and at both the top and bottom locations in the oven.

Four process factors were identified: hold temperature, dwell time, gas flow rate, and temperature at removal. The team defined two settings for each of the process factors. The experiment used eight runs of the oven, as shown in figure 2.7 (a fractional factorial design, that is, a particular selection of half of the 16 possibilities defined by all combinations of the process factors at two settings). For each oven run, parts were placed at both the top and the bottom of the oven and in both orientations.

The experimental results indicated an unsuspected effect due to oven location, with parts in the bottom of the oven experiencing less growth than those in the top of the oven. The analysis indicated that a particular combination of hold temperature and dwell time would result in part growth that is insensitive (or robust) to part orientation and part location. Furthermore, the experiment indicated that temperature at removal did not affect part growth, leading to the conclusion that parts could be removed from the oven at a higher temperature; thus resulting in savings in run time.
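The eight-run design described above can be sketched in code. This is a 2^(4-1) fractional factorial; the factor names follow the text, but the specific aliasing (setting the fourth factor to the product of the first three) is an assumption on my part, since figure 2.7 itself is not reproduced here:

```python
from itertools import product

# Factor names follow the text; the two settings are coded -1/+1.
factors = ["hold_temp", "dwell_time", "gas_flow_rate", "removal_temp"]

def half_fraction():
    """Eight runs: the fourth factor is set to the product of the first
    three, selecting half of the 16 full-factorial combinations."""
    return [(a, b, c, a * b * c) for a, b, c in product((-1, 1), repeat=3)]

for run in half_fraction():
    print(dict(zip(factors, run)))
```

The payoff of the fractional design is that eight oven runs, rather than sixteen, still let each factor be varied against every other.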

Unless I have access to the analysis of variance (ANOVA) table, I *cannot* comment on the last paragraph (pertaining to what the results indicated, or the conclusion that was drawn from the experiment). Also, I should mention that ANOVA (e.g., see http://www.stat.columbia.edu/~gelman/research/unpublished/econanova.pdf) is a technique in itself, *distinct* from the type of significance tests that Fisher introduced in Chapter 2 of his book (i.e., the reading for this past week). To answer your question, page 248 of the Boeing Advanced Quality System Tools document *does not* satisfy my concept of significance test methods, insofar as it does *not relate* at all to the significance test introduced in my presentation earlier this week. I hope this helps. If you have any more questions pertaining to the example from the Boeing Advanced Quality System Tools document, then I suggest we communicate *not* on this forum but in other ways that will not disturb the focus of the discussions in this course.
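To make the distinction concrete, here is a minimal one-way ANOVA F-statistic computed from scratch. The data are toy numbers of my own invention (not the Boeing data), meant only to show what the technique compares: variation between groups against variation within them.

```python
def one_way_anova_f(groups):
    """F = between-group mean square / within-group mean square."""
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    ss_between = sum(len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups)
    ss_within = sum(sum((x - sum(g) / len(g)) ** 2 for x in g) for g in groups)
    return (ss_between / (k - 1)) / (ss_within / (n - k))

# Toy data: part growth (arbitrary units) at the top vs. bottom of the oven.
top = [5.1, 5.3, 5.0, 5.4]
bottom = [4.2, 4.0, 4.3, 4.1]
print(f"F = {one_way_anova_f([top, bottom]):.2f}")
```

A large F means the group means differ by far more than the scatter within each group would lead one to expect by chance.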

This comment might be a little late, but I just wanted to rephrase my question because I am a notorious mumbler. The significance level is the threshold for how probable it would be to obtain your results by chance alone (a coin landing heads 5 times in a row). The p-value is the probability of obtaining results at least as extreme as yours, assuming the null hypothesis (this coin is fair) is true. Your results are significant if the p-value is less than the significance level. That would then mean that if the probability of your 5 heads occurring by chance under a fair coin were smaller than the significance level, your results would be significant in disproving the null hypothesis (that it was a fair coin). ...Well, now that I've written it out, it makes sense to me. I'll post this on the forum anyway in case anyone else had trouble. I was confused because I didn't understand that the goal was to disprove the null hypothesis and show that the results were not due to chance alone.
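Working the 5-heads example through numerically may help anyone else who had the same trouble. A short sketch (the 5% level is the conventional choice discussed earlier in the thread):

```python
def p_value_all_heads(n_tosses):
    """P(n heads in n tosses | fair coin): the chance of a run at least
    this extreme occurring purely by chance under the null hypothesis."""
    return 0.5 ** n_tosses

alpha = 0.05                 # the conventional significance level
p = p_value_all_heads(5)     # 1/32 = 0.03125
print(f"p = {p}; significant at the {alpha} level? {p < alpha}")
```

Note that four heads in a row gives p = 0.0625, which does not reach the 5% level, while five heads does; this echoes Fisher's remark that too small an experiment cannot yield a significant result even with the greatest possible success.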

After the discussion last class on the difficulty of attaining knowledge of causation, I've been wondering: why do we place such emphasis on causation anyway? Since the dawn of philosophy in Ancient Greece, philosophers have been searching for the arche, the cause, of things. But causation is incredibly difficult - if not outright impossible - to know for certain. I could observe two events and find their correlation. This is a pretty plain-vanilla observation of the physical world.

But causation is a special form of correlation. And getting there takes a quantum leap in effort, whereas the marginal utility gained is relatively small. For our pragmatic interests, the two aren't that different. Suppose we know there's a correlation between greenhouse gas emissions and global warming such that if we decrease our emissions, the temperature will stabilize or go down. Great. Now we have a powerful tool for action and policy. It's not really necessary (for our practical concerns) to know whether the decrease in temperature is really brought about by the decrease in greenhouse gas emissions, or by some unknown and unconsidered third factor.
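The "unconsidered third factor" worry can be demonstrated with a toy simulation of my own (purely synthetic data): two variables that never influence each other come out strongly correlated because a third factor drives both.

```python
import random

def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5

random.seed(0)
z = [random.gauss(0, 1) for _ in range(1000)]   # the unconsidered third factor
x = [zi + random.gauss(0, 0.3) for zi in z]     # driven by z, not by y
y = [zi + random.gauss(0, 0.3) for zi in z]     # driven by z, not by x
print(f"r(x, y) = {pearson_r(x, y):.2f}")       # strong correlation, no causal link
```

The observed correlation here is around 0.9, yet by construction neither x nor y has any effect on the other, which is exactly why correlation alone cannot settle causal questions.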

If it's not our practical concerns that are driving us, then could our search for causal chains and causes be theoretical in nature? That presents a problem as well. You can never get inside causes, examine them, and say for sure. The system is simply too complex to allow you to draw that conclusion most of the time. Besides, there's also the theoretical worry that causes can never be known for sure. Causes operate in the physical world; there is no cause in theories and formulae. I can't say "1" is caused by "1+1", or that the Earth is caused by the Sun just by looking at general relativity. I can't remember where I read it, but someone pointed out that no matter how long you observe a watch from the outside, you can never tell for sure how it works. You can only guess, make predictions, see if those predictions materialize, and if not, refine your theory and repeat. The universe is rather like a large watch whose inside you will never see.

So if it's not practical or theoretical, why are we so obsessed with causes?

Edward, I agree with you to some degree on questioning why we (scientists, philosophers of science) are obsessed with causes. While the idea of causation has been around for a long time, the idea of correlation representing causation has *not* been around for that long! Karl Pearson insisted that correlation is fundamental to science, and that correlation is to replace causation. Hence, the idea of causation as a special form of correlation was driven primarily by Karl Pearson in the *(late 19th)/(early 20th) century*. From my (limited) exposure to the history of statistics, it seems that Pearson's argument about correlation being fundamental to science has *gotten out of hand*, with numerous scientists not using the concept properly and almost always misinterpreting the inference from observed correlation. In other words, Pearson presented his argument for correlation replacing causation, and many scientists have since been *misguided* in their use of Pearson's correlation coefficient, thinking that statements of causation are *the norm* (or *de facto standard*) when inferring from correlations. I say that they (scientists) have been (and still are) misguided because that's what Pearson taught them, yet no one questioned the grounds for inferring those causal statements from correlations *until* much later (e.g., since Nancy Cartwright came along in the *1980s*)! Even though I am unable to (fully) answer your question, I hope that I have shed some light on the issue. I am able to give insight into this question, even though it is not *really* related to the reading on Fisher, because of my research interests in (probabilistic) causal inference. Yes, causal inference is a *whole separate topic in itself*, distinct from the design and analysis of experiments! I hope that everyone can see by now that there are *several* sub-fields within the domain of statistics as a discipline.
There's so much work to be done in accounting for the philosophical issues surrounding statistical techniques, it's not even funny!

Not sure if this reply is counted too, so here I am, replyin'

It seems data are more verifiable, or convincing, when the results reach a smaller significance level within an experiment. As discussed in class, the likelihood of work being accepted or examined by other scholars in a given area seems higher when the results do not appear 'surprising', or to have arisen by chance (we accept probability that is small and incremental over results that are too 'perfect' or 'surprising'). Albeit, perhaps at the same time this represents an inexactitude in the progress of the research: perhaps it only points in a direction, and reveals the quality of the research (a substantial sample and randomization). Significant gaps exist within scientific experiments, though the peer-reviewed structure determines both the design of the experiment and its acceptance or validity. It seems the strongest structures are ones with small probabilities and small significance levels, with modest results, that aim to clarify an area of research.