The Basics of Experimental Design [A Quick and Non-Technical Guide]

Sid Sytsma

Website Administrator's Note: I have always considered Sid Sytsma's short article on experimental design one of the best short pieces on the subject I have ever seen, and provided a link to it from my Lutherie Information Website. Professor Sytsma retired and no longer felt the need to retain his site, and when this happened I asked if I could please host this article on my own site. Recently I have been informed that a number of other folks have seen the value of Professor Sytsma's article and have provided links to it, but unfortunately a number of these links attribute this work to me. Please, if you do link to this page, give credit where credit is due. This wonderful article is by Sid Sytsma. Thanks.

- R.M. Mottola 2009

What is Experimental Design All About?

Experimental design is a planned interference in the natural order of events by the researcher. He does something more than carefully observe what is occurring. This emphasis on experiment reflects the higher regard generally given to information so derived. There is good rationale for this. Much of the substantial gain in knowledge in all sciences has come from actively manipulating or interfering with the stream of events. There is more than just observation or measurement of a natural event. A selected condition or a change (treatment) is introduced. Observations or measurements are planned to illuminate the effect of any change in conditions.

The importance of experimental design also stems from the quest for inference about causes or relationships as opposed to simply description. Researchers are rarely satisfied to simply describe the events they observe. They want to make inferences about what produced, contributed to, or caused events. To gain such information without ambiguity, some form of experimental design is ordinarily required. As a consequence, the need for using rather elaborate designs ensues from the possibility of alternative relationships, consequences or causes. The purpose of the design is to rule out these alternative causes, leaving only the actual factor that is the real cause.

For example, Treatment A may have caused observed Consequences O, but possibly the consequence may have derived from Event E instead of the treatment or from Event E combined with the treatment. It is this pursuit of clear and unambiguous relationships that leads to the need for carefully planned designs.

The kinds of planned manipulation and observation called experimental design often seem to become a bit complicated. This is unfortunate but necessary, if we wish to pursue the potentially available information so the relationships investigated are clear and unambiguous.

The plan that we choose to call a design is an essential part of research strategies. The design itself entails:

By convention, the problems of design do not ordinarily include details of sampling, selection of measurement instruments, selection of the research problem or any other nuts and bolts of procedure required to actually do the study.

Considerations in Design Selection

The selection of a specific type of design depends primarily on both the nature and the extent of the information we want to obtain. Complex designs, usually involving a number of "control groups," offer more information than a simple group design. If "more information per project" were the sole criterion for selection of a design, we would be led to more and more complex designs. However, not all of the relevant information that may be needed can be derived from any given design. Part of the information will be piggy-backed into the study by assumptions, some of which are explicit. Other information derives from a network of knowledge surrounding the project in question. Theories, accepted concepts, hypotheses, principles and empirical evidence from related studies contribute. To the extent that this knowledge is already available, the task of extracting the exact information needed to solve any research problem is circumscribed.

Collecting information is costly. The money and staff resources available have some limits. Subjects are usually found in finite quantities only. Time is a major constraint. The information to be gained has to be weighed against some estimate of the cost of collection. This points out two ways of checking potential designs:

  1. What questions will this design answer? To do this, we must also be able to specify many of the questions the design won't answer as well as the ones it will answer. This should lead to a more realistic approach to experimental design than is usually given. Some simple and useful designs have been labeled as "poor" because they are relatively simple and will not answer some questions. Yet, they may provide clear and economical answers to the major questions of interest. Complex designs are not as useful for some purposes.
  2. What is the relative information gain/cost picture? There is no specific formula or strategy for deriving some cut-off point in this regard. The major point here is that the researcher must take a close look at the probable cost before selecting a design.

Experimental Design Terminology

The group in an experiment which receives the specified treatment is called the Treatment Group or the experimental group. The term Control Group, by contrast, refers to another group assigned to the experiment, but not for the purpose of being exposed to the treatment. The performance of the control group usually serves as a baseline against which to measure the effect of the full treatment on the treatment group.

A variable refers to almost anything under the sun. There are only two kinds of stuff in the world for researchers: variables and constants. As a result, almost any concept, or thing, or event they are interested in, that varies or can be made to vary, and that is related to their research can be called a variable. Researchers pay particular attention to variables that may influence the results (this is of MUCH concern to researchers).

Extraneous variables (external to the experiment) are variables that may influence or affect the results of the treatment on the subject.

A variable of specific experimental interest is sometimes referred to as a factor. Ordinarily, the term is used when an experiment involves more than one variable. These variables are often identified as factors and are labeled "Factor A" and "Factor B," etc. Level refers to the degree or intensity of a factor. Any factor may be presented in one or more of several levels, including a zero level.

Randomness refers to the property of completely chance events that are not predictable (except in the sense that they are random). If they are truly random, examining past instances of occurrence should give the researcher no clues as to future occurrences. Thus, if we were to predict outcome from perfect pairs of dice rolled in an unbiased way (which are random events), previous rolls give no clue. Randomness becomes important in the design of the experiments primarily in the assignment of subjects to groups. Researchers feel more secure about the results of their studies if subjects have been randomly assigned to groups. Random assignment of subjects to groups tends to spread out differences between subjects in unsystematic (random) ways so that there is no tendency to give an edge to any group.

Randomization, or random assignment, refers to a technique of assignment or ordering such that no consistent or systematic effect in the assignment is tied in with the method. Elimination of such systematic influence upon assignment or selection allows for chance assignment. Approved ways of generating chance assignments involve tables of random numbers or the use of computer software with random number generators. In practice, however, researchers frequently resort to simple counting off, flipping a coin, and other short cuts.
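As a rough illustration of the software route mentioned above, the following Python sketch shuffles a list of subjects and deals them out to groups. The function name, the subject IDs, and the seed value are all hypothetical and chosen only for this example. It is one simple way to produce a chance assignment, not a prescribed procedure.

```python
import random

def assign_randomly(subjects, group_names, seed=None):
    """Shuffle the subjects, then deal them out to the groups in turn."""
    rng = random.Random(seed)      # a fixed seed makes the assignment reproducible
    shuffled = list(subjects)      # copy so the caller's list is left untouched
    rng.shuffle(shuffled)
    groups = {name: [] for name in group_names}
    for position, subject in enumerate(shuffled):
        groups[group_names[position % len(group_names)]].append(subject)
    return groups

# Hypothetical subject IDs S01..S20
subjects = [f"S{n:02d}" for n in range(1, 21)]
groups = assign_randomly(subjects, ["treatment", "control"], seed=2009)
for name, members in groups.items():
    print(name, sorted(members))
```

Dealing the shuffled list out in turn keeps the group sizes equal (or within one of each other), which plain coin flipping does not guarantee.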

Another way of selecting subjects is simply to use intact groups: such as all the students in a given classroom, or all of the patients in a hospital. Researchers are usually worried whether the students were assigned to the classroom in a non-random way, or whether certain patients self-selected a hospital for a particular reason. The problem is whether some subtle factors were operating to exert a bias of selection factors in the assignment to groups.

Ex post facto refers to causal inferences drawn "after the fact." For in the ex post facto study, the causal event of interest has already happened. These are known as non-experimental studies and are often contrasted with experimental studies. A typical example of this type of research would be to compare two groups of patients in a hospital, one treated with Drug A and the other treated with Drug B, and then to try to infer a difference in the performance of the two drugs.

Variance refers to the variability of any event. If one uses a fine enough measuring device, one can find differences between any two objects or events.

The inside logic of an experiment is referred to as internal validity. Primarily, it asks the question: Does it seem reasonable to assume that the treatment has really produced the measured effect? Extraneous variables which might have produced the effect with or without the treatment are often called "threats to validity."

External validity, on the other hand, refers to the proposed interpretation of the results of the study. It asks the question: With what other groups could we reasonably expect to get the same results if we used the same treatment? If Treatment X resulted in lowered blood pressure in middle-aged men, could you logically claim that it will produce the same effect in older women?

Blocks usually refers to categories of subjects within a treatment group. For example, we might divide the group into older, middle-aged, and younger patients and further divide the groups into a group treated with Drug A and another treated with Drug B. The advantage is to enable us to discover how the treatment affects each of the age groups. For example, we might find that overall, Drug B outperforms Drug A, except for older patients, where Drug A outperforms Drug B. This phenomenon is known as an interaction between treatment (the Drug) and subject characteristics (age).
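The drug-by-age example can be made concrete with a small Python sketch. The scores below are invented purely for illustration, and the function name is hypothetical. The sketch computes the difference between the two drugs within each age block. A difference that changes sign from one block to another is the pattern described above as an interaction.

```python
from statistics import mean

# Hypothetical improvement scores, keyed by (age block, drug)
scores = {
    ("younger", "A"): [4, 5, 6],
    ("younger", "B"): [7, 8, 9],
    ("middle", "A"): [4, 6, 5],
    ("middle", "B"): [8, 7, 9],
    ("older", "A"): [8, 9, 7],
    ("older", "B"): [5, 4, 6],
}

def drug_difference_by_block(scores):
    """Return mean(Drug B) - mean(Drug A) within each block."""
    blocks = sorted({block for block, _ in scores})
    return {b: mean(scores[(b, "B")]) - mean(scores[(b, "A")]) for b in blocks}

differences = drug_difference_by_block(scores)
for block, diff in differences.items():
    print(f"{block}: B - A = {diff:+.1f}")
```

With these made-up numbers Drug B comes out ahead in the younger and middle blocks and behind in the older block, a reversal that pooling all ages together would hide.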

Interaction refers to variables in the treatment which may interact with each other. It may make a difference whether a variable is used by itself, with another, or with different levels or degrees of another. Higher order interactions are possible. One factor may depend on the presence or absence of two other factors; termed a second-order interaction.

The Hawthorne Effect refers to the behavior of interest being caused by the subject being in the center of the experimental stage, e.g., having a great deal of attention focused on them. This usually manifests itself as a spurt or elevation in performance or physical phenomenon measured. Although the Hawthorne Effect is much more frequently seen in behavioral research, it is also present in medical research when human subjects are present. This problem is handled by having a control group that is subject to the same conditions as the treatment groups, then administering a placebo to the control group. The study is termed a blind experiment when the subject does not know whether he or she is receiving the treatment or a placebo. The study is termed double blind when neither the subject nor the person administering the treatment/placebo knows what is being administered.

There are six major classes of information with which an experimental designer must cope. They include:

Types of Data Yield

[P1] Post-Treatment Behavior or Physical Measurement

In a typical experiment, this is the data, the class of information of primary interest. What was the physical measurement or behavior of the subject after treatment? All designs shed some light on this class of information. Usually only immediate or short-range results are obtained. More complicated kinds of information derive from, and concern questions of comparing post-treatment behavior between groups who have had various kinds, levels, or even absences of treatment. Five categories of post-treatment behavior or physical measurement can be identified:

  1. P1-1: behavior or measurement immediately after treatment
  2. P1-2: a comparison of post-treatment behavior between experimental and control groups
  3. P1-3: a comparison of the post-treatment behavior between experimental groups or blocks
  4. P1-4: long-term effects with continuing treatment and periodic observations
  5. P1-5: long-term effects without continuing treatment but with observation(s)

[P2] Pre-Treatment Behavior or Physical Measurement

Information concerning pre-treatment behavior or condition requires some observation, a test, or measurement, to be administered before the experimental manipulation. Without such observations, the design itself will not answer any questions about the subjects before the experimental conditions have been introduced. Such information, however, may be accrued from general knowledge or other studies. Direct acquisition of this information adds to the cost of an experiment. Furthermore, it may have a confounding effect, that is, sometimes the pre-treatment observation or measurement influences the subsequent behavior of the subject. When it is over, it may not be clear whether the behavior was due to the treatment, the pre-treatment observation or measurement, or both. Several classes of pre-treatment information can be acquired:

  1. P2-1: behavior or measurement immediately before treatment
  2. P2-2: comparing pre-treatment to post-treatment behavior or measurement
  3. P2-3: a comparison of pre-treatment behavior or measurement between different pairs of subjects
  4. P2-4: a comparison of the differences between pre-treatment and post-treatment behavior among groups of subjects
  5. P2-5: the effect of the pre-treatment observation or measurement on subsequent behavior or measurement of the subject

[I] Internal Threats to Validity

This class of information refers to some rival hypothesis that threatens clear interpretation of the experiment. A common group of rivals threatens most experiments, particularly those using human subjects. Typically, the rival hypothesis asserts that something outside of the experiment proper produced the behavior or measurement of interest. To discover whether or not such rival events exert an influence, the designer must usually provide for one or more control groups. Typically, internal threats to validity include:

  1. I-1: the subjects exhibited behavior because of some event other than the treatment
  2. I-2: the subject could or would perform the behavior, or would have exhibited the measurement without the treatment

[C] Comparable Groups

This class of information, available only when two or more experimental units or groups of subjects are used, deals with whether the subjects in the different units were about the same in relevant attributes before the treatment, and during the treatment, except for the treatment condition itself. If the experimental designer cannot provide information as to the comparability of groups, he/she must be prepared to admit the possibility that the groups differed in some essential aspect which produced the results observed. Equating the groups by some pre-test or measurement or random assignment are the two major techniques of providing this information. Thus, there are two types of comparability information:

  1. C-1: were the groups (either experimental or control) comparable before the treatment?
  2. C-2: did the groups receive a comparable degree of experiences during the time of the study (except for differences in treatment)?

[E] Experiment Errors

Experiment error refers to some unwanted side effect of the experiment itself which may be producing the effect rather than the treatment. The Hawthorne Effect alluded to earlier is a continuing source of experimental error in both behavioral and medical research. Two types of strategies exist to deal with the Hawthorne Effect.

  1. E-1: provide for a placebo treatment group which gets the attention, but not the "real" treatment and use blind and double blind strategies as needed
  2. E-2: continue the treatment over a longer period of time; research shows that the Hawthorne effect tends to be short-lived

[R] Relationship to Treatment

This class of information deals with the possible interaction of the treatment effects with: different kinds of subjects, other treatments, different factors within a complicated treatment, different degrees of intensity, repeated applications or continuation of the treatment, and different sequences or orders of the treatment or several treatments. Typically, information of this type is acquired from blocking, from factorial designs, and various repeated measures designs.

  1. R-1: did the treatment interact with subject characteristics so that subjects with different characteristics behaved or reacted differently?
  2. R-2: how does the treatment interact when combined with other sorts of treatment?
  3. R-3: does the treatment contain different factors which may operate differentially on the subjects?
  4. R-4: what is the effect of different levels or degrees of the treatment?
  5. R-5: what is the effect of different orders or sequences of various treatments?

Describing Experimental Designs

The following letters will be used to describe the various experimental design activities:



  1. Selection of the group or experimental unit
  2. Random assignment to a group
  3. Blocking subjects, or other variables, into sets
  4. Administering a treatment to a group
  5. Observing (measuring) results



Basic Experimental Designs

Eleven commonly used experimental designs will be described. They include:

  1. One-Shot
  2. One-Group, Pre-Post
  3. Static Group
  4. Random Group
  5. Pre-Post Randomized Group
  6. Solomon Four Group
  7. Randomized Block
  8. Factorial
  9. One-Shot Repeated Measures
  10. Randomized Groups Repeated Measures
  11. Latin Square

Obviously, in a treatment of experimental design of fewer than 15 pages, not all possible designs are covered. Some of those not treated include incomplete block designs, Youden square designs, lattice square designs, Taguchi designs, fractional factorial designs, Graeco-Latin square designs, split-plot designs, covariance designs, and time-series designs. The reader is referred to the references at the end of this paper, many of which treat these more specialized designs.


One-Shot

The One-Shot is a design in which a group of subjects is administered a treatment and then measured (or observed). In experimental research, an experimental treatment should be given to the subjects, and then the measurement or observation made. Usually, with this design, an intact group of subjects is given the treatment and then measured or observed. No attempt is made to randomly assign subjects to the groups, nor does the design provide for any additional groups as comparisons. Thus, one group will be given one treatment and one "observation." This design is diagrammed as follows:


The One-Shot Design is highly useful as an inexpensive measure of a new treatment of the group in question. If there is some question as to whether any expected effects will result from the treatment, then a one-shot may be an economical route. In cases where other studies, or the cumulative knowledge in the field, provide information about either pre-treatment baseline measurements or behavior, the effects of other kinds of treatments, etc., the experimenter might sensibly decide that it is not necessary to undertake a more extensive design. Simplicity, ease, and low cost represent strong potential advantages in the oft-despised one-shot.

This design answers only one question and that is in reference to post-treatment behavior, P1-1. It will describe the information about the behavior of the subjects shortly after treatment.

One-Group, Pre-Post

In this design, one group is given a pre-treatment measurement or observation, the experimental treatment, and a post-treatment measurement or observation. The post-treatment measures are compared with their pre-treatment measures. This design is diagrammed as follows:


The usefulness of this design is similar to that of the one-shot, except that an additional class of information is provided, i.e., pre-treatment condition or behavior. This design is frequently used in clinical and education research to determine if changes occurred. It is typically analyzed with a matched pairs t-test.

This design answers the same question as the one-shot design, P1-1, on the post-treatment behavior of the subjects. It also answers some questions on pre-treatment condition or behavior, namely P2-1 and P2-2.
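The matched pairs t-test mentioned for this design can be sketched in a few lines of Python using only the standard library. The measurements below are invented for illustration, and the function name is hypothetical. The sketch computes the t statistic from the per-subject differences. Judging significance still requires comparing the result with a t table or a statistics package.

```python
from math import sqrt
from statistics import mean, stdev

def paired_t(pre, post):
    """Return (t statistic, degrees of freedom) for matched pre/post scores."""
    if len(pre) != len(post):
        raise ValueError("each subject needs both a pre and a post measurement")
    diffs = [after - before for before, after in zip(pre, post)]
    n = len(diffs)
    # mean difference divided by its standard error
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical measurements for six subjects
pre = [140, 152, 138, 145, 160, 149]
post = [132, 147, 135, 138, 150, 146]
t, df = paired_t(pre, post)
print(f"t = {t:.2f} on {df} degrees of freedom")
```

Because each subject serves as his or her own comparison, the test works on the differences rather than on the two sets of scores separately.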

Static Group

In this design, two intact groups are used, but only one of them is given the experimental treatment. At the end of the treatment, both groups are observed or measured to see if there is a difference between them as a result of the treatment. The design is diagrammed as follows:



This design may provide information on some rival hypotheses. Whether it does or not depends on the initial comparability of the two groups and whether their experience during the experiment differs in relevant ways only by the treatment itself.

Whether the groups were comparable or not is crucial in determining the extent of information yielded by this design. The design could be used, for example, to compare the value of a drug. If the designer cannot, on the basis of information outside the experiment itself, assume the comparability of the groups, the design will yield only information regarding P1-1 and P1-2. However, IF additional information is available to equate the two groups initially, it may handle some of the class I questions. Without additional information, on the basis of the design alone, it cannot.

Random Group

This design is similar to the Static Group design except that an attempt is made to ensure similarity of the groups before treatment begins. Since it is difficult to have exactly similar subjects in each of two groups (unless you separate identical twins), the design works toward a guarantee of comparability between groups by assigning subjects to groups at random. If the researcher does this there is likely to be reasonable comparability between the two groups. This design can be diagrammed as:



This design is a real workhorse. It was developed by the great statistician, Sir Ronald Fisher as part of his work in agricultural statistics and has been a primary experimental design for more than 100 years. It is economical. It provides fairly clear-cut information as to the relationship between treatment and post-treatment measurement or behaviors. Since this is often the sole reason for the research, the randomized group design is frequently the appropriate selection.

Good as it is, it does not provide information about pre-treatment behavior. This design will answer questions P1-1 and P1-2. Since the groups are randomized, the design will cope with the internal threats to validity. Subject changes due to other causes should affect the control group as well, so that a comparison of the post-treatment behavior should reveal any differential effects of the treatment.

Class C, comparable group questions C-1 and C-2 are also answered in the randomization provided that there are no probable differences between the groups entering the experiment except for the treatment. The design will not cope with questions P1-3, P1-4, or P1-5, the class P2 questions of pre-treatment behavior, class E questions, or class R questions. It is frequently analyzed with a two-sample t-test assuming equal variances of the groups.
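A minimal Python sketch of the equal-variance two-sample t-test follows, using only the standard library. The scores and the function name are hypothetical. The sketch returns the t statistic and its degrees of freedom, which would then be compared with a t table or passed to a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def pooled_t(group_a, group_b):
    """Return (t statistic, degrees of freedom), assuming equal group variances."""
    n_a, n_b = len(group_a), len(group_b)
    df = n_a + n_b - 2
    # pool the two sample variances, weighting each by its degrees of freedom
    pooled_var = ((n_a - 1) * variance(group_a) + (n_b - 1) * variance(group_b)) / df
    standard_error = sqrt(pooled_var * (1 / n_a + 1 / n_b))
    return (mean(group_a) - mean(group_b)) / standard_error, df

# Hypothetical post-treatment scores
treatment = [12, 15, 14, 10, 13, 14]
control = [9, 11, 10, 8, 12, 10]
t, df = pooled_t(treatment, control)
print(f"t = {t:.2f} on {df} degrees of freedom")
```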

Pre-Post Randomized Group

This design adds a pre-test to the previous design as a check on the degree of comparability of the control and experimental groups before the treatment is given. This experimental design could be diagrammed as:



This yields information on P1-1 and P1-2, that is, post-treatment behavior and a comparison of post-treatment behavior between groups. It also answers most P2 questions on pre-treatment behavior, namely questions P2-1 through P2-4. It answers most of the class I questions, that is, threats to internal validity. It handles class C questions: the groups are comparable because they are randomized.

The design does not answer P1-3, P1-4, or P1-5; nor the class E questions relating to experiment errors. It does not answer the class R questions regarding the relationship of the treatment nor P2-5, the effect of the pre-treatment observations on the subsequent behavior or measurements of the subjects.

Solomon Four Group

The Solomon Four Group design attempts to control for the possible "sensitizing" effects of the pre-test or measurement by adding two groups who have not been a part of the pre-test or pre-measurement process. This design can be diagrammed as:





Although this design is not frequently used in clinical studies, it is frequently used in both behavioral and educational research and in medical studies involving the physical activities of patients (physical therapy, for example, where the pre-measurement involves some sort of physical activity or testing). The additional cost of this design must be justified by the need for information regarding the possible effects of the pre-treatment measurement.

The Solomon Design answers the P1 questions, P1-1, P1-2, and P1-3. It answers the P2 pre-treatment questions, including the effect of the pre-treatment measurement process. It handles threats to internal validity, the class I questions. It also handles the class C questions because the groups are randomized. It does not handle P1-4, P1-5, the class E questions nor the class R questions.
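One way to see how the extra groups expose the effect of the pre-treatment measurement is to compare post-treatment means across the four groups. The Python sketch below assumes the usual four-group layout (pretested and treated, pretested and untreated, not pretested and treated, not pretested and untreated). The means and the function name are invented for illustration only.

```python
# Hypothetical post-treatment means for the four groups
posttest_means = {
    ("pretested", "treated"): 18.0,
    ("pretested", "untreated"): 12.0,
    ("not pretested", "treated"): 15.0,
    ("not pretested", "untreated"): 11.0,
}

def solomon_contrasts(m):
    """Compare the four post-treatment means to separate treatment and pretest effects."""
    with_pretest = m[("pretested", "treated")] - m[("pretested", "untreated")]
    without_pretest = m[("not pretested", "treated")] - m[("not pretested", "untreated")]
    pretest_alone = m[("pretested", "untreated")] - m[("not pretested", "untreated")]
    return {
        "treatment effect, pretested groups": with_pretest,
        "treatment effect, unpretested groups": without_pretest,
        "pretest effect without treatment": pretest_alone,
        # how much the pretest changed the apparent treatment effect
        "pretest sensitization": with_pretest - without_pretest,
    }

contrasts = solomon_contrasts(posttest_means)
for name, value in contrasts.items():
    print(f"{name}: {value:+.1f}")
```

If the sensitization figure were near zero, the pre-treatment measurement would appear not to have altered the response to the treatment.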

Randomized Block

This design is of particular value when the experimenter wishes to determine the effect of a treatment on different types of subjects within a group. This design can be diagrammed as:







Typically, this design refers to blocking or grouping of subjects with similar characteristics into treatment subgroups. The group to be used in an experiment is usually given some pre-treatment measure, or previous records are examined, and the entire group is blocked or sorted into categories. Then equal numbers from each category are assigned to the various treatment and/or control groups.

While blocking according to subject characteristics is most typical of this design, blocking could be based on other relevant attributes. For example, if subjects are to be treated during different times of the day, such as morning and afternoon, we might block a morning and an afternoon group within each treatment condition.

The importance of this design lies in the probability that the variable upon which the blocking is based may interact with the treatment. Frequently, no overall treatment effect is observed because subjects with different characteristics react differentially to treatment. If they were blocked on the appropriate attributes, differential treatment effects would be revealed.

The randomized block design will handle the P1-1, P1-2 and P1-3 questions. It does handle the class I questions and the class C questions of comparable groups. It does not handle the P1-4 and P1-5 questions or the P2 questions of pre-treatment behavior. It does not handle class E questions and handles only one of the class R questions, R-1, the relationship of the treatment to subject characteristics, by virtue of blocking.
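The blocking procedure described above (sort subjects into categories, then assign equal numbers from each category to the conditions) might be sketched in Python as follows. The subject IDs, block labels, condition names, seed, and function name are all hypothetical.

```python
import random
from collections import defaultdict

def blocked_assignment(subject_blocks, conditions, seed=None):
    """subject_blocks maps subject -> block label; returns subject -> condition."""
    rng = random.Random(seed)
    by_block = defaultdict(list)
    for subject, block in subject_blocks.items():
        by_block[block].append(subject)
    assignment = {}
    for block in sorted(by_block):
        members = sorted(by_block[block])
        rng.shuffle(members)                       # randomize within the block only
        for position, subject in enumerate(members):
            assignment[subject] = conditions[position % len(conditions)]
    return assignment

# Hypothetical subjects: four in each of three age blocks
labels = ["younger"] * 4 + ["middle"] * 4 + ["older"] * 4
subject_blocks = {f"S{n:02d}": block for n, block in enumerate(labels, start=1)}
assignment = blocked_assignment(subject_blocks, ["Drug A", "Drug B"], seed=7)
for subject in sorted(assignment):
    print(subject, subject_blocks[subject], assignment[subject])
```

Shuffling within each block, rather than across the whole group, is what guarantees that every block contributes equally to every condition.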


Factorial

As you saw above in the blocking design, the subjects were assigned to different groups on the basis of some of their own characteristics such as age, weight, or some other physical characteristic. Sometimes we wish to assign different variations of the treatment as well, and the procedure is similar. For example, we may wish to try two kinds of treatments varied in two ways (called a 2x2 factorial design). Some factorial designs include both assignment of subjects (blocking) and several types of experimental treatment in the same experiment. When this is done it is considered to be a factorial design. A diagram of a 2x2 factorial design would look like:

A1 B1


A1 B2


A2 B1


A2 B2

The factorial design as we are describing is really a complete factorial design, rather than an incomplete factorial, of which there are several variations. The factorial is used when we wish information concerning the effects of different kinds or intensities of treatments. The factorial provides relatively economical information not only about the effects of each treatment, level or kind, but also about interaction effects of the treatment. In a single 2x2 factorial design similar to the one diagrammed above, information can be gained about the effects of each of the two treatments and the effect of the two levels within each treatment, and the interaction of the treatments. If all these are questions of interest, the factorial design is much more economical than running separate experiments.

The factorial handles some of the same classes of information as the previously described randomized group design: P1-1, P1-3 and class C questions. It provides little support for class I questions, but some weak inferences might be drawn. The design also answers class R questions R-2, R-3 and R-4 concerning relationships among treatments, factors, or levels. It does not answer questions P1-2, P1-4, P1-5, or P2 questions, nor does it answer class E questions or questions R-1 and R-5. Factorials can run to enormous and uninterpretable proportions. 2x3x4 designs are not uncommon, particularly in agricultural experimentation.
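To illustrate what a 2x2 factorial yields, the Python sketch below computes the two main effects and the interaction from four cell means, and then counts the cells in a 2x3x4 design. The cell means and the function name are invented. The interaction is expressed here as a simple difference of differences, which is one of several conventions in use.

```python
from itertools import product
from statistics import mean

def effects_2x2(cell_means):
    """cell_means maps (A level, B level) -> mean outcome; levels are 1 and 2."""
    a1b1, a1b2 = cell_means[(1, 1)], cell_means[(1, 2)]
    a2b1, a2b2 = cell_means[(2, 1)], cell_means[(2, 2)]
    main_a = mean([a2b1, a2b2]) - mean([a1b1, a1b2])   # A2 versus A1, averaged over B
    main_b = mean([a1b2, a2b2]) - mean([a1b1, a2b1])   # B2 versus B1, averaged over A
    interaction = (a2b2 - a2b1) - (a1b2 - a1b1)        # does B's effect depend on A?
    return {"A": main_a, "B": main_b, "AxB": interaction}

# Hypothetical cell means for cells A1 B1, A1 B2, A2 B1, A2 B2
cell_means = {(1, 1): 10, (1, 2): 14, (2, 1): 12, (2, 2): 22}
effects = effects_2x2(cell_means)
print(effects)

# Every combination of levels in a 2x3x4 design needs its own group of subjects
cells = list(product(range(1, 3), range(1, 4), range(1, 5)))
print(len(cells), "cells")
```

The cell count grows as the product of the numbers of levels, which is why larger factorials quickly become expensive and hard to interpret.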

One-Shot Repeated Measures

This design, or variations of it, is used to assess the effects of a treatment with the same group or the same individual over a period of time. A measure, or observation is made more than once to assess the effects of the treatment. This design can be diagrammed as:


This design is an extension of the simple one-shot and adds only information regarding the effects of repeated or continued treatment. Often an economical trial balloon, the design can acquire high yield when other extra-design sources of knowledge can be related to it.

This design handles only the questions related to class P1 or post-treatment behavior, questions P1-1 and P1-4. It answers questions about behavior or effect shortly after treatment and the longer-term effects related to subsequent treatments. It might handle class R question R-5, in that the effect of the repeated treatments may be observed. It will not answer questions P1-2, P1-3, P1-5 nor any class P2, I, C, E, or R-1 through R-4 questions.

Randomized Groups Repeated Measures

The Randomized Groups Repeated Measures design is a variant of the previous design in which two or more experimental methods are compared and repeatedly measured or observed. Although we are diagramming but two groups, more groups may be used.



This design answers P1-1, P1-3, and P1-4 questions. It handles class C questions. The groups are comparable because of random assignment. Some light is shed upon the relationship of repeated treatments, question R-5.

It fails to handle P1-2, P1-5, or class P2 pre-treatment behavior or measurement questions. It does not handle class I questions on internal threats to validity unless a non-treatment group is added to yield class I information. It further fails to handle class E and most class R questions except for R-5.

Latin Square

A researcher may wish to use several different treatments in the same experiment, for example the relative effects of an assortment of perhaps three or more drugs in combination in which the sequence of administration may produce different results. A diagram of a three treatment Latin Square design is:




If all possible sequencing permutations had to be addressed, there would be 3x2=6 possible arrangements--a doubling of cost. The original 3-treatment design would, however, answer the question whether sequencing made a difference without testing all possible sequences.
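The contrast between a three-sequence Latin Square and the full set of six orderings can be sketched in Python. The treatment labels A, B, and C are placeholders, the function names are hypothetical, and the cyclic construction shown is only one simple way to build a Latin Square.

```python
from itertools import permutations

def cyclic_latin_square(treatments):
    """Build a Latin square by rotating the treatment list one step per row."""
    n = len(treatments)
    return [[treatments[(row + col) % n] for col in range(n)] for row in range(n)]

def is_latin_square(square):
    """Check that every treatment appears exactly once in each row and column."""
    n = len(square)
    symbols = set(square[0])
    rows_ok = all(len(row) == n and set(row) == symbols for row in square)
    cols_ok = all({row[c] for row in square} == symbols for c in range(n))
    return len(symbols) == n and rows_ok and cols_ok

treatments = ["A", "B", "C"]               # placeholder treatment labels
square = cyclic_latin_square(treatments)
for sequence in square:
    print(" ".join(sequence))              # each row is one sequence of administration

all_orders = list(permutations(treatments))
print(len(square), "sequences used out of", len(all_orders), "possible")
```

Each treatment appears once in every position across the three sequences, so position effects are balanced even though only half of the possible orderings are run.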

The Latin Square design answers questions P1-1, P1-3, C-1, C-2, and R-5. It will not provide information on P1-2, P1-4, P1-5 nor any questions in class I or class E. Questions R-1 through R-4 are not addressed either.

Sometimes in psychological, physiological or medical research, individual subjects form the rows, time the columns, and the cells form the treatments, with an observation taken at each treatment cell. When this is done, and the measurements between each cell are to be compared, clear interpretation may be difficult because of possible carry-over effects of previous treatments (e.g., earlier treatments affecting later ones).

The Question of External Validity

Questions of a different sort than we have faced arise from our need to generalize from a limited set of observations. No one is interested in observations that in no way extend beyond this particular restricted set of data. Generalizability depends on whether the observed behavior measurement [O] is representative of the people, the surrounding conditions and the treatments to which we now wish to extend it. Classes of questions include:

The important thing is to clarify where the results of your observations may be legitimately extended and where they cannot yet be legitimately extended. Helpful in this regard is a comprehensive description of the demographic characteristics of the subjects of the research and a complete and comprehensive description of the methodology used so that the reader of the research can judge for himself or herself whether the results can be generalized to his or her situation.


References

Brownlee, K. A. Statistical theory and methodology in science and engineering. New York: Wiley, 1960.

Campbell, D. and Stanley, J. Experimental and quasi-experimental designs for research on teaching. In Gage (Ed.), Handbook of research on teaching. Chicago: Rand McNally & Co., 1963.

Cornfield, J. and Tukey, J. W. Average values of mean squares in factorials. Annals of Mathematical Statistics, 1956, 27, 907-949.

Cox, D. R. Planning of experiments. New York: Wiley, 1958.

Fisher, R. A. The design of experiments. (1st ed.) London: Oliver & Boyd, 1935.

Winer, B. J. Statistical principles in experimental design. New York: McGraw-Hill, 1962.