Stats #52: Scientific Validity, Statistics, and IRB Review
Content: When the Institutional Review Board (IRB) reviews your research, they will evaluate (among other things) the scientific validity of your study. Consulting with a statistician prior to IRB submission helps. Not all aspects of scientific validity touch on Statistics, but some do. In particular, a statistician can provide help with the selection of your sample, the size of your sample, and the plan for data analysis.
Teaching strategies: Didactic lectures and small group exercises.
Objectives: In this class you will learn how to:
- assess how restrictions on your sample can hamper generalizability;
- recognize factors that influence the sample size of a study;
- identify the important components of a data analysis plan; and
- explain the rationale for IRB review of scientific validity.
This class should qualify for one (1) hour of IRB Education Credits (IRBECs).
Contents
Overview of the STATS web pages (January 21, 2000)
What are the STATS web pages?
The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.
Where can I find STATS?
If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,
which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.
Some of the fun stuff you can find on the STATS web pages.
Ask Professor Mean. For the tough Statistics questions that Dear Abby won't touch.
Planning Your Research Study. Things you need to plan for before you start collecting your data.
Selecting An Appropriate Sample Size. How much data do you really need?
Managing Your Research Data. Everything you want to know before you step to the keyboard.
Steps In a Typical Data Analysis. I have my data on the computer. Now what?
How to Read a Medical Journal Article. Reading a journal is hard work. Here's some help.
Professor Mean's Library. Good books and good web sites about Statistics.
... and even more good stuff!!!
This webpage was written by Steve Simon, edited by Linda Foland, and was last modified on 07/08/2008. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Website details
For CMH employees only: Statistical Consulting Services.
You can get free statistical consulting if you work for Children's Mercy Hospital. Steve Simon and Ashley Sherman provide a wide range of statistical consulting services to help you with your research projects. This help can start as early as the initial planning of your research. We also help with the analysis of your data, using SPSS or other statistical software. We can also provide assistance with the preparation of your presentations and publications.
Here are some examples of the services that we have provided:
- setting up your research hypothesis,
- selecting and justifying your sample size,
- writing the statistical methods section for your grant,
- preparing randomization tables for your study,
- reviewing your surveys for content and quality,
- developing a system for entering your data,
- choosing an appropriate statistical model for your data,
- establishing validity and/or reliability for your measurement scales,
- checking for violations of statistical assumptions in your data,
- producing graphs and tables for your research publication, and
- providing references for new and unusual statistical methods.
Specific statistical advice has been outlined on a series of web pages which can be found at http://www.childrensmercy.org/stats/. The pages provide advice about planning your research, selecting an appropriate sample size, managing your research data, performing a variety of data analyses, presenting research data, and writing research papers.
How to get in touch with a statistician
If you would like to meet with Steve Simon or Ashley Sherman, you can set up an appointment by emailing or calling Judy Champion (jmchampion (at) cmh (dot) edu or 816-983-6784). If you have a very simple question, send an email directly to us (ssimon (at) cmh (dot) edu and aksherman (at) cmh (dot) edu).
This webpage was written by Steve Simon on 2003-04-30, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details
Directions to my new office (April 25, 2008).
I have moved to a new office. It is a modular building just north of Children's Mercy Hospital. It is between 23rd and 22nd street, just off of Kenwood Avenue (Kenwood is a small north/south street just west of Holmes). If you need to get from your office to mine, here are some directions written by my Administrative Assistant, Judy Champion.
- Take the elevator of the research tower down to the yellow level. Exit the employee parking garage on 23rd Street, walk to Kenwood and cross 23rd Street. Your destination is Building M 3 which is the building closest to 22nd Street. However, the entrance to our building faces Building M 2. It’s best to walk into the parking area that is just north of Building M 1 and follow the sidewalk around the west side of building M 2 in order to get to our building’s entrance on its south side. Another route would be to exit the Hospital Hill Center Building on Holmes and then walk ½ block north to 23rd Street, cross 23rd Street, walk west to Kenwood then north to building M 3 address 2220 Kenwood.
This webpage was written by Steve Simon and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details
Getting IRB approval for your research (October 9, 2002)
Dear Professor Mean: I am submitting a proposal to our Institutional Review Board. Is there anything you can do to help me get IRB approval? --Terrified Terri
Dear Terrified:
Why not bring a freshly baked batch of chocolate chip cookies to the IRB meeting? I'd be glad to sample the batch first to make sure it tastes okay.
Disclaimer
In a perfect world, everyone would listen when Professor Mean talks and they would decide things exactly the way he would. Alas, it's not a perfect world. Our IRB here at Children's Mercy Hospital uses criteria that differ from the guidance I give below, and your IRB probably does also. I'm working with our IRB to better understand the criteria they use and when I get a better understanding, I'll update these web pages accordingly.
But don't try the PMSS defense: You should approve this protocol because Professor Mean Said So. Sadly, it does not work.
By the way, if you serve on an IRB, I'd love some feedback from you on how your IRB assesses scientific validity.
Short answer
The IRB does look at a variety of issues, but the one with particular relevance to statistics is whether the study has scientific validity. It is unethical to expose research subjects to any risks, discomforts, or inconveniences if the study has dubious validity. The Declaration of Helsinki states
Medical research involving human subjects should only be conducted if the importance of the objective outweighs the inherent risks and burdens to the subject. www.wma.net/e/policy/17-c_e.html
Justification for scientific validity also appears in the Nuremberg Code.
The experiment should be such as to yield fruitful results for the good of society, unprocurable by other methods or means of study, and not random and unnecessary in nature. ohsr.od.nih.gov/nuremberg.php3
Good statistical design can touch on several aspects of scientific validity:
- Is your sample chosen appropriately?
- Is your sample size large enough?
- Are you measuring things well?
- Do you have a good plan for analysis of the data?
Make sure that you provide enough documentation in your proposal to convince the IRB that the answer is YES! to all these questions.
Is your sample chosen appropriately?
Who you choose to participate in your research study will say a lot about how easily you can generalize your results to the real world. No sample is perfect, and even just the process of asking for informed consent can hurt generalizability.
If you randomly select subjects and/or randomly assign them to treatment and control, that's good. But more important is the pool of subjects that you are drawing your sample from. Ideally, your pool of subjects should include the full spectrum of the rainbow. In practice, logistical constraints make this ideal impossible.
Watch out when you select subjects only when your research coordinator is on the clock, or only from a tertiary care center. In these examples, you may not have much success in extrapolating your findings to a more general group of patients. You can't generalize to all fruit when your sample is restricted to apples.
Sometimes there are hidden restrictions on your sample. Some studies may implicitly exclude patients if they:
- speak English poorly,
- move around a lot, or
- lack a primary care physician.
The logistics of your research and limitations on your time and trouble may also place restrictions on your sample by excluding patients who arrive on weekends and evenings.
Sometimes these restrictions are trivial and sometimes not. It's best to acknowledge these implicit restrictions and be honest about the extent to which they hurt your ability to generalize.
Also, you need to be very careful about selecting your control group. The control group needs to be identical to the treatment group, except for the therapy or exposure being studied. If the control group differs on other factors, especially factors that affect prognosis, then you have problems. You need to control for these other factors, through randomization, matching, or covariate adjustment.
Is your sample size large enough?
The size of your sample plays a vital role in scientific validity. You can't ignore this issue. Every single research study, no matter what the type, should have an explicit justification of the sample size. Virtually every research area has identified and documented problems with inappropriate sample sizes. Failure to consider sample size represents one of the biggest problems with research today.
With a small sample size, you may not have enough precision to make any useful statements about your research data. This is a waste of research dollars, but it is also unethical. An inadequate sample size needlessly puts subjects at risk without any benefit to society.
The opposite problem can also occur. Some research studies include too many research subjects, but this problem is rarer. Including too many research subjects is also a waste of research money and it is also unethical. You are exposing more patients to the risks, discomforts, and inconveniences of the research study than you need to make precise statements about your data.
The justification of your sample size could take the form of a power calculation, if you have a formal research hypothesis. If your study will produce some simple descriptive statistics, then you should show that the confidence limits about these statistics will be reasonably narrow. Even if your study has a non-quantitative objective, you should still justify your sample size, possibly using a non-quantitative criterion.
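To make "reasonably narrow" concrete for a descriptive study, here is a minimal sketch that shows how the half-width of a confidence interval for a proportion shrinks as the sample grows. It assumes the simple normal-approximation (Wald) interval and a guessed proportion of 30%; both are assumptions for illustration only, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def proportion_ci_half_width(p, n, confidence=0.95):
    """Half-width of the normal-approximation (Wald) interval for a proportion.

    p is the proportion you expect to see (a guess made before the study),
    n is the planned sample size.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    return z * math.sqrt(p * (1 - p) / n)

# Hypothetical planning exercise: expected proportion of 30%.
for n in (25, 100, 400):
    half_width = proportion_ci_half_width(0.30, n)
    print(f"n = {n:4d}: 0.30 plus or minus {half_width:.3f}")
```

Quadrupling the sample size only cuts the half-width in half, which is worth knowing before you promise the IRB a very narrow interval.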
There are many complex formulas for determining sample size; here is some general advice.
First you need to think about the size of the difference you are trying to detect and compare that to the standard deviation of your outcome measurement. If you are trying to detect differences that are small relative to your standard deviation, then you need a very large sample size. Detecting a difference that is about one fifth of a standard deviation, for example, might require a sample size in the hundreds.
If you are trying to detect a difference that is very large relative to your standard deviation, then you can get by with a smaller sample size. Detecting a difference that is about the same size as a standard deviation would only require a few dozen subjects.
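The two examples above can be checked with a rough calculation. This is a sketch only: it assumes a comparison of two equal-sized groups on a continuous outcome, a two-sided alpha of 5%, 80% power, and the usual normal approximation. None of those settings come from the text above, and the function name is hypothetical.

```python
import math
from statistics import NormalDist

def n_per_group(effect_size, alpha=0.05, power=0.80):
    """Approximate subjects per group for a two-group comparison of means.

    effect_size is the difference you want to detect divided by the
    standard deviation of the outcome. Uses the normal approximation
    n = 2 * (z_alpha + z_power)^2 / effect_size^2, rounded up.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # two-sided test
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(2 * (z_alpha + z_power) ** 2 / effect_size ** 2)

for d in (0.2, 0.5, 1.0):
    print(f"difference of {d} SD: about {n_per_group(d)} per group")
```

Under these assumptions a difference of one fifth of a standard deviation needs 393 subjects per group, while a difference of a full standard deviation needs only 16 per group. An exact calculation based on the t distribution would give slightly larger numbers.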
Be careful! You might be tempted to say that you are only looking for differences that are large relative to the standard deviation, but you may end up painting yourself into a corner. If you suspect that your control group is a full standard deviation or more away from the treatment group, then the difference would be large enough to be plainly visible.
For example, Jacob Cohen points out that 13 year old girls and 18 year old girls differ in average height by about 0.8 standard deviations. He also mentions that Ph.D. holders and college freshmen differ in average IQ by about the same amount. Do you really believe that your study will show such a large difference?
Second, when you are counting discrete events like deaths, it is the number of these events, not the total number of subjects studied, that determines the precision of your results. When the events are very rare, this means that you have to sample a large number of patients in order to accumulate enough events.
As a very rough guide, you should strive for at least 25 to 50 events per group. If your event occurs only 1% of the time, that means that you might need as many as 5,000 patients per group. If an event occurs one fourth of the time, you might be able to get by with one or two hundred patients per group.
Event rate    Recommended sample size
25%           100 to 200
5%            500 to 1,000
1%            2,500 to 5,000
0.2%          12,500 to 25,000
Finally, if the sample size you need is unattainable--you don't have the budget, perhaps, or the study would take too long--then consider redesigning your experiment. Find a way to reduce the variability of your outcome measure. A cross-over design, for example, will usually have much less variability because each patient serves as his/her own control. Sometimes intermediate measurements (often called surrogate measurements) will improve your sensitivity enough so you can attain a reasonable amount of precision with a limited sample size.
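Returning to the rough guide of 25 to 50 events per group: the recommended sample sizes are simply the target number of events divided by the event rate. Here is a minimal sketch of that arithmetic. The function name is hypothetical, and the rate is passed as a percentage so that exact fractions avoid floating-point rounding.

```python
import math
from fractions import Fraction

def patients_per_group(event_rate_percent, events_needed):
    """Patients per group needed to expect a given number of events.

    event_rate_percent is the event rate as a percentage (1 means 1%).
    The division is done with exact fractions, then rounded up.
    """
    rate = Fraction(str(event_rate_percent)) / 100
    return math.ceil(events_needed / rate)

for rate in (25, 5, 1, 0.2):
    low = patients_per_group(rate, 25)
    high = patients_per_group(rate, 50)
    print(f"{rate}% event rate: {low:,} to {high:,} patients per group")
```

This only tells you how many patients you need to expect that many events. It is not a substitute for a formal power calculation.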
Sometimes research will have a qualitative rather than quantitative goal. We might be interested, for example, in the issues that children with sickle cell disease face, or teenagers' reasons for starting to smoke cigarettes. For qualitative studies, there is no mathematical formula that you can apply to justify your sample size.
The sample size needs to be large enough to ensure a rich and complete set of responses. Look for a sample size large enough to ensure that both ends of the spectrum (and the middle) are represented. If the population you are studying is very homogeneous, then as few as a dozen patients may be enough. You may also wish to depart from random sampling and use a purposive sample instead. You can also justify a small sample size if you use purposive sampling. A purposive sample deliberately looks for patients with certain characteristics and can ensure that you have included all relevant viewpoints and perspectives in your study.
Another way to assess the sample size is by saturation. Saturation occurs when the same themes get repeated over and over and no new ideas are generated.
Are you measuring things well?
There are a lot of scientific issues that I can't answer here. Is arterial distensibility a good marker of heart disease? What is the best way to determine gestational age? Should you measure blood pressure in the left arm or the right arm?
I can, however, ask some questions that will help you determine whether your measures are clinically relevant.
Is your measure valid and reliable?
Every discipline has slightly different definitions and standards for validity and reliability. As a general rule, the issues of validity and reliability become most important when you are measuring something abstract, like stress, or something subjective, like quality of life.
The easiest way to ensure validity and reliability is to use measures that have already been established in the peer reviewed literature. You can also hedge your bets by including several measures of the same outcome.
If you have concerns about validity and reliability, you might reserve a fraction of your sample (from 5% to 20% is a good starting point) for more thorough analysis. These patients might receive additional tests to verify that your simple outcome measure actually works well. Or you might have these patients evaluated by two different people and measure the level of agreement.
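One common way to measure the level of agreement between two evaluators is Cohen's kappa, which corrects the raw percent agreement for the agreement you would expect by chance. The text above does not name a particular statistic, so treat kappa as one reasonable choice rather than the required one. The ratings below are made up for illustration.

```python
from collections import Counter

def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters classifying the same patients.

    Returns (observed - expected) / (1 - expected), where expected is the
    chance agreement implied by each rater's marginal frequencies.
    Undefined (division by zero) if both raters use a single category.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("need two non-empty rating lists of equal length")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a = Counter(rater_a)
    counts_b = Counter(rater_b)
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / n ** 2
    return (observed - expected) / (1 - expected)

# Hypothetical ratings of ten patients by two evaluators.
rater_1 = ["yes", "yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no"]
rater_2 = ["yes", "yes", "yes", "yes", "yes", "no", "no", "no", "no", "yes"]
print(f"kappa = {cohen_kappa(rater_1, rater_2):.2f}")
```

Here the two evaluators agree on 8 of 10 patients, but because about half of that agreement could occur by chance, kappa is noticeably lower than 0.8.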
Be cautious about sources of information that are known to be imperfect. For example, in a study of 295 deaths from child maltreatment, only half were identified as such on the death certificates. The gender of the child, whether the perpetrator was a parent, and whether the child died in a rural or urban county, had a differential impact on ascertainment.
Do you define all your terms objectively?
Research must be repeatable, so you need to use terms that are defined well enough so that another expert could reproduce your work and come up with roughly comparable findings.
You need to provide operational definitions for any events that are subject to differing interpretations. For example, the Scottish Intercollegiate Guidelines Network defines life threatening asthma as:
"Features of life threatening asthma include agitation, altered level of consciousness, fatigue, exhaustion, cyanosis, and bradycardia. Air entry is often greatly reduced, which may lead to a 'silent chest'. The peak flow, if recordable, is usually less than 33% of best or predicted." www.sign.ac.uk/guidelines/fulltext/38/section2.html
Up to 1992, the National Center for Health Statistics defined current and former smokers by asking the following two questions:
"Have you ever smoked 100 cigarettes in your lifetime?"
"Do you smoke now?" www.cdc.gov/nchs/datawh/nchsdefs/currentsmoker.htm
The Social Security Administration defines blindness as:
"when your vision cannot be corrected to better than 20/200 in your better eye, or if your visual field is 20 degrees or less, even with corrective lens. Many people who meet the legal definition of blindness still have some sight and may be able to read large print and get around without a cane or guide dog." www.rcep7.org/socialsecurity/faq/blind/default.html
Is your outcome important to your patients?
Patients are usually interested in one of three things: morbidity (will I develop diabetes?), mortality (will I die?), or quality of life (will I be able to lift and carry a bag of groceries?). Ideally, you should try to measure one or more of these things directly. If you can't measure them directly, then does your indirect measurement (sometimes called a surrogate measurement) have a strong link with morbidity, mortality, or quality of life?
Also, are you focusing on a short term outcome because of your convenience, when your patients are most interested in long term outcomes? It is easy to get someone to quit smoking for a week, but it is much harder to get them to persist through a full year.
Do you have a good plan for analysis of the data?
It is important to have a plan. If you don't tell the IRB what you expect to do with your data, they won't be able to decide if the goal of your research is worth the risks, discomforts, and inconveniences of the patients in the study.
This does not have to be very detailed. If all you want to do is a descriptive study where you estimate a few means and proportions, then that's all you need to say. A lot of very valuable research does nothing more than this. Here's an example:
In this research study, we will study children with severe hearing loss in order to estimate the proportion who lose a hearing aid, and the average expense associated with these losses.
It's a myth that all research requires a hypothesis specified prior to the collection of the data. Most (but not all) qualitative research lacks a formal hypothesis. A descriptive study like the one described above does not have a research hypothesis. Some other examples of research without a formal hypothesis include:
- pilot testing of a questionnaire,
- studies assessing validity or reliability, and
- exploratory or hypothesis generating research.
You can sometimes artificially contrive a hypothesis in these situations, but it is usually better to explicitly state that you don't have a research hypothesis. Instead identify the alternative goal you are trying to achieve or the question you are trying to answer. For example:
There is no research hypothesis for this pilot study. Our goal instead is to identify ambiguous language, missing categories, and other problems with the patient satisfaction questionnaire.
If you are testing a hypothesis, you need to specify that hypothesis as well as how you will test that hypothesis. This may appear difficult to you, but if you don't muck this up too badly, the IRB will probably give you a pass. You need to show enough detail so you don't appear totally incompetent.
If your data analysis plan is bad, it can still be fixed after the data are collected. In contrast, if you have a lousy control group or your sample size is grossly inadequate, you need to do something before you start collecting data.
So don't worry about the details too much. If you specify a Mann-Whitney test and you really needed to use a Kruskal-Wallis test instead, the IRB will probably still approve your study contingent on fixing that detail. Still, there are some statistical details that you need to worry about.
- If your data are paired or matched, you must use a statistical approach that acknowledges this.
- If some of your outcome variables are categorical and some of them are continuous, you have to use a different statistical model for each of these data types.
- If you plan to remove outliers or possibly stop your study early, you need to be explicit about the rules and conditions for these actions.
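To see why the first point matters, here is a small sketch that computes a t statistic two ways on the same made-up before-and-after measurements: once respecting the pairing, and once (wrongly) treating the two columns as independent groups. The data and function names are hypothetical, and the unpaired version uses the unequal-variance (Welch) standard error.

```python
import math
from statistics import mean, stdev

def paired_t(before, after):
    """t statistic based on within-patient differences."""
    diffs = [b - a for b, a in zip(before, after)]
    return mean(diffs) / (stdev(diffs) / math.sqrt(len(diffs)))

def unpaired_t(x, y):
    """t statistic that ignores pairing (Welch standard error)."""
    se = math.sqrt(stdev(x) ** 2 / len(x) + stdev(y) ** 2 / len(y))
    return (mean(x) - mean(y)) / se

# Hypothetical measurements on six patients, before and after treatment.
before = [140, 152, 138, 160, 147, 155]
after = [135, 149, 136, 154, 143, 150]

print(f"paired t   = {paired_t(before, after):.2f}")
print(f"unpaired t = {unpaired_t(before, after):.2f}")
```

Every patient improves by a few units, so the paired statistic is large. The unpaired statistic is small because the differences between patients swamp the consistent change within each patient.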
Specify what your alpha level is (usually 5%) and whether your hypothesis is one-sided or two-sided. A one-sided test looks at changes in a single direction. Changes in the opposite direction are considered either impossible or irrelevant. One-sided tests are often used when changes in the opposite direction would have the same implications as a null finding. For example, we might find that a new drug is equivalent to a placebo, or that it performs worse than a placebo. We would refuse to adopt the drug in either situation. So comparisons to a placebo are usually one-sided.
Contrast this with comparing a standard drug to a new drug. If the new drug performs worse, we would never use it, but if it is equivalent, then we would use it part of the time based on other factors like cost, convenience, and patient preference. Comparisons of two active drugs are usually two-sided. This might change, however, if the side effect profile of one drug is so harsh that you would only prescribe it when it is superior.
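The choice between one-sided and two-sided has to be made in advance because it can change your conclusion. As a minimal sketch, assume your analysis produces a z statistic (an assumption for illustration; many tests use other reference distributions). The same statistic then gives two different p-values.

```python
from statistics import NormalDist

def p_values(z):
    """Return (one-sided, two-sided) p-values for a z statistic.

    The one-sided value tests for an effect in the positive direction only.
    """
    standard_normal = NormalDist()
    one_sided = 1 - standard_normal.cdf(z)
    two_sided = 2 * (1 - standard_normal.cdf(abs(z)))
    return one_sided, two_sided

# Hypothetical result: the new drug beats placebo with z = 1.8.
one_sided, two_sided = p_values(1.8)
print(f"one-sided p = {one_sided:.3f}, two-sided p = {two_sided:.3f}")
```

With z = 1.8 the one-sided p-value falls below an alpha of 5% and the two-sided p-value does not. That is exactly why the IRB wants to see the choice written down before the data come in.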
Further reading
- Assert: A standard for the review and monitoring of randomized clinical trials. Howard Mann. (Accessed on October 14, 2002). http://www.assert-statement.org/ Excerpt: "The ASSERT statement is the articulation of A Standard for the Scientific and Ethical Review of Trials. It proposes a structured approach whereby research ethics committees review proposals for, and monitor the conduct of, randomized controlled clinical trials. In order to ensure the ethical conduct of research involving human subjects, the ASSERT checklist comprises items that need to be addressed by investigators applying for approval to conduct a clinical trial. These items are chosen to enable fulfillment of certain universally applicable requirements for the ethical conduct of research: social and scientific value; scientific validity; fair subject selection; favorable risk-benefit ratio; and respect for potential and enrolled subjects."
- Content and quality of 2000 controlled trials in schizophrenia over 50 years. Thornley B and Adams C. British Medical Journal 1998:317(7167);1181-1184.
- Underascertainment of Child Maltreatment Fatalities by Death Certificates, 1990-1998. Crume TL, DiGuiseppi C, Byers T, Sirotnak AP and Garrett CJ. Pediatrics 2002:110(2);e18.
Very bad joke: How many IRB members does it take to screw in a light bulb?
As documented in 45 CFR 46.107(a), this review board must consist of five (5) or more members, and at least one of these members must possess a background in Electrical Engineering. In addition, at least one of the members must come from a home without any electricity. Any member of the IRB who owns stock in an electrical utility or who regularly pays bills to an electrical utility should recuse themselves from participation in the review of this research.
If the bulb should burn too brightly, burn too dimly, or flicker, then an adverse event report should be sent to the IRB (21 CFR 312.32). If the light bulb is dropped, then a serious adverse event report should be sent to the FDA by telephone or by facsimile transmission no later than seven (7) calendar days after the sponsor's initial receipt of the information.
If this is a multi-center light bulb trial, then a data and safety monitoring board (DSMB) may be needed (NIH Policy for Data and Safety Monitoring, June 10, 1998, http://grants.nih.gov/grants/guide/notice-files/not98-084.html, accessed on October 9, 2002). The DSMB should review any adverse event reports and interim results. If the clinical equipoise of the light bulb is lost, then the DSMB should terminate the study and provide all previously recruited light bulbs with the best available light bulb socket.
In order to maintain scientific integrity, the use of a placebo socket may be necessary. The placebo socket should have the same taste, appearance, and smell of a regular socket and the fact that this socket has no electricity should be hidden from the light bulb and from the person screwing in the light bulb. According to the 2000 revision of the Declaration of Helsinki, paragraph 29, the use of placebo sockets is acceptable where no proven prophylactic, diagnostic, or therapeutic socket exists.
A systematic review of all previous research into light bulbs must be presented so that the IRB can determine, per 45 CFR 46.111(a)(2), that the risks to the light bulb are reasonable in relation to anticipated benefits. The IRB should also ensure that the selection of light bulbs is equitable (45 CFR 46.111(a)(3)). If the light bulb has less than 18 watts of power, then additional requirements (45 CFR 46.401 through 409) apply.
The IRB must ensure that an informed consent document be prepared in language that the light bulb understands (45 CFR 46.116). This document should explain the expected duration of the light bulb's participation in the research, any reasonably foreseeable risks, and the extent to which the confidentiality of the light bulb will be maintained. This document should also emphasize that participation is voluntary and the light bulb can withdraw itself from the socket at any time without any penalty or loss of benefits.
The clipart on this page was courtesy of the clipsahoy web site: http://www.clipsahoy.com/index2.html. The remainder of the material is licensed under a Creative Commons license. This page was last modified on 07/08/08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Ask Professor Mean
Three things you need for a power calculation (November 8, 2001) Category: Ask Professor Mean, Category: Sample size justification
Dear Professor Mean, I want to do research. Is forty subjects enough, or do I need more? Didn't I hear you mention something about three things you need for a power calculation? -- Eager Edward
Dear Eager,
That reminds me of a cute joke. How many research subjects does it take to screw in a light bulb? At least 300 if you want the bulb to have adequate power.
Sorry, I was digressing. Is forty subjects an adequate sample size? That depends on a lot of factors. The basic idea, though, is to select a sample size which ensures that your study has adequate power. Power is the probability that your research study will successfully detect a difference, assuming that the treatment or exposure you are examining actually can cause an important difference. If you don't care whether your experiment is successful or not, then you can use just about any sample size.
Short answer
Power is to a research design like sensitivity is to a diagnostic test. A diagnostic test with good sensitivity is normally able to detect a disease when the disease is present. A research study with good power is normally able to detect a change when your treatment is indeed effective.
The actual calculation of power requires three pieces of information:
- your research hypothesis,
- the variability of your outcome measure, and
- your estimate of the clinically relevant difference.
Calculating power is sometimes difficult and it may require you to go to the time and expense of running a pilot study. But you should NEVER start a research project without knowing what your power is. That would be like using a diagnostic test with unknown sensitivity.
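To show how the three pieces fit together, here is a rough sketch for Eager Edward's forty subjects. It assumes a hypothesis comparing two equal groups of twenty on a continuous outcome, a standard deviation of 10 units, and a two-sided alpha of 5%. All of the numbers are invented for illustration, the function name is hypothetical, and the calculation uses a normal approximation rather than an exact method.

```python
import math
from statistics import NormalDist

def power_two_groups(n_per_group, sd, difference, alpha=0.05):
    """Approximate power to detect a difference in means between two groups.

    Normal approximation for a two-sided test. The tiny chance of a
    significant result in the wrong direction is ignored.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    standard_error = sd * math.sqrt(2 / n_per_group)
    return NormalDist().cdf(difference / standard_error - z_alpha)

# Hypothetical inputs: 20 subjects per group, standard deviation of 10.
for difference in (5, 10):
    power = power_two_groups(20, sd=10, difference=difference)
    print(f"clinically relevant difference of {difference}: power = {power:.0%}")
```

Under these made-up inputs, forty subjects give only about 35% power if the clinically relevant difference is half a standard deviation, but almost 90% power if it is a full standard deviation. The answer to "is forty enough?" depends entirely on the other two pieces of information.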
Research hypothesis
A research hypothesis will provide specific information that will determine what type of analysis is needed. A common structure for a research hypothesis is specification of the subject group you are testing, the treatment or exposure that this group will receive, the outcome measure, and the comparison or control group.
Some exploratory studies may not have a research hypothesis, of course, and for those studies you determine an appropriate sample size in a different way (for example, by ensuring that the estimates from this exploratory study have adequate precision).
Variability of your outcome measure
You also need to have an estimate of the variability of your outcome measure. I'm assuming here that your outcome measure is a continuous variable like birth weight or cholesterol level. If you are using a categorical outcome measure like mortality or cancer remission, then you need some estimate of the rate of mortality or remission in your control group.
Your literature review (you did do a literature review before you started this research, I hope), will usually provide you with an estimate of variability. Select a study that is reasonably similar to what you plan to do, and find out what that study reported for the standard deviation for your outcome measure.
Although I prefer a standard deviation, other estimates of variability are also acceptable. If the paper reports a variance, a standard error, a confidence interval, or a coefficient of variation, then there are simple formulas for converting these into standard deviations. If the study provides a range, then you can divide the range by four to get a good approximation for the standard deviation.
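The conversions mentioned above can be sketched as follows. The function names are hypothetical. The confidence interval conversion assumes the paper reports a normal-based interval for a single mean; a paper with a small sample probably used a t multiplier, in which case this version will slightly overstate the standard deviation.

```python
import math
from statistics import NormalDist

def sd_from_variance(variance):
    return math.sqrt(variance)

def sd_from_standard_error(standard_error, n):
    # The standard error of a mean is sd / sqrt(n).
    return standard_error * math.sqrt(n)

def sd_from_ci_of_mean(lower, upper, n, confidence=0.95):
    # Assumes the interval is mean +/- z * sd / sqrt(n).
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    standard_error = (upper - lower) / (2 * z)
    return standard_error * math.sqrt(n)

def sd_from_cv(cv_percent, mean_value):
    # The coefficient of variation is 100 * sd / mean.
    return cv_percent / 100 * mean_value

def sd_from_range(minimum, maximum):
    # Rough rule from the text: range divided by four.
    return (maximum - minimum) / 4

# Hypothetical published summaries that all point to an sd near 10.
print(sd_from_variance(100))
print(sd_from_standard_error(2, n=25))
print(round(sd_from_ci_of_mean(46.08, 53.92, n=25), 2))
print(sd_from_cv(20, mean_value=50))
print(sd_from_range(30, 70))
```

Whichever conversion you use, note in your protocol where the estimate came from so a reviewer can check it.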
Many of the people I see have a difficult time providing any estimate of variability. This area hasn't been studied before, so no one knows what the variability will be. But don't give up too easily.
First keep in mind that you only need a crude estimate of variability. Power calculations are capable of determining if you are "in the right ball park." They are good at specifying your sample size down to an order of magnitude perhaps but not much more than that. In other words, they might tell you whether you need dozens of subjects instead of hundreds of subjects, or possibly if you need thousands of subjects.
Second, although most research is innovative and therefore unique, this innovation is often in the treatment and not in the outcome measure. So look for studies that used the same outcome measure, even if the treatment is quite different than yours.
Third, try to characterize variability in your control group and we can try to extrapolate what the variability will be in the treatment group. A retrospective chart review, for example, will provide a rough estimate of variability of your outcome measure under the current standard of care.
Fourth, you may have to use a clearly flawed estimate, but a flawed estimate of variability may still be better than no estimate at all. An estimate of variability in adults, for example, may not be an ideal estimate for a pediatric study, but at least it tells you if your study will have adequate power assuming that the variation in a pediatric population is comparable to variation in an adult population. That's still better than having no idea whether your study has adequate power.
If you've tried and you still can't come up with an estimate of variability, then don't despair. A pilot study can provide you with an estimate of variability when all else fails. Usually 20 to 30 subjects produce a reasonably stable estimate of variability. A pilot study is also helpful for finding out how quickly you can recruit subjects. Furthermore, a pilot study will also identify any weaknesses in the logistics of your research. Finally, if the protocol remains substantially unchanged after the pilot study, you can usually include those pilot subjects in the final analysis.
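To get a feel for how stable a pilot estimate is, here is a small simulation sketch in Python. It rests on assumptions that are mine and not part of the advice above: the outcome is normally distributed, the true standard deviation is 1.5 (an arbitrary choice), and the function names are made up for illustration.

```python
import random
import statistics

def simulate_pilot_sds(n, true_sd=1.5, replicates=2000, seed=1):
    # Draw many hypothetical pilot studies of size n and record each sample SD.
    rng = random.Random(seed)
    sds = []
    for _ in range(replicates):
        sample = [rng.gauss(0.0, true_sd) for _ in range(n)]
        sds.append(statistics.stdev(sample))
    return sorted(sds)

def middle_90_percent(sorted_values):
    # Approximate 5th and 95th percentiles of the simulated SDs.
    k = len(sorted_values)
    return sorted_values[int(0.05 * k)], sorted_values[int(0.95 * k)]

if __name__ == "__main__":
    for n in (5, 25, 100):
        low, high = middle_90_percent(simulate_pilot_sds(n))
        print(f"n = {n:3d}: 90% of pilot SDs fall between {low:.2f} and {high:.2f}")
```

Under these assumptions, the spread of the estimates narrows considerably between 5 and 25 subjects, which is consistent with the "ball park" precision that a power calculation needs.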
Clinically relevant difference
Wow, that was exhausting! You're not done, though, until you can tell me what a clinically relevant difference would be for your outcome measure. This is a difference that is large enough to be considered important by a practicing clinician.
For just about every type of study, some differences are so small as to be clinically meaningless. From a theoretical viewpoint, perhaps, changes of any size might be interesting. But theory and practice are very different. If a six month diet program produces an average weight loss of three pounds, a fever medicine reduces average temperature by half a degree Fahrenheit, or a smoking cessation program helps an additional two percent to quit, who cares what the theoretical implications might be?
It's not easy but this is something that you have to do for yourself. The clinically relevant difference is determined by medical experts and not by statisticians. Hey, I'm still trying to understand the difference between good and bad cholesterol; I wouldn't even be able to start thinking about how much of a change in cholesterol is considered clinically relevant. You might start by asking yourself "How much of an improvement would I have to see before I would adopt a new treatment?" Also, try talking with some of your colleagues. And look at the size of improvements for other successful treatments.
Still, there are some general guidelines that might help. Try looking at the resolution of your measuring device, thinking in terms of relative changes, or specifying changes with respect to your standard deviation.
Average changes that are smaller than the resolution of your measuring instrument are probably not clinically relevant. For example, Apgar scores can take on any whole number between 0 and 10. Gestational age can only be measured accurately to within a week. In these contexts, it is clear that average changes should probably be greater than one unit in order to achieve relevance.
Still this is not a perfect rule. We can measure weights to within a gram, but changes in birth weight would have to be in the hundreds of grams or more to be meaningful. And while no family can have a fractional number of children, decreasing the average family size by 0.2 children can have a profound effect on society.
It also may help to think in terms of relative changes. If you can change something by 25 percent or 50 percent, that is considered relevant in most contexts. It becomes harder to argue clinical relevance for changes of less than 10 percent. Again, this is not a perfect rule.
Finally, you might find it easier to specify changes with respect to your standard deviation. This type of change is called an effect size. A common classification is that 0.2 standard deviations is considered a small effect size, 0.5 standard deviations is considered a medium effect size, and 0.8 standard deviations is considered a large effect size.
An effect size of 0.2 is small enough that there is no obvious visible separation between the two groups. The difference in average heights between 15 and 16 year old girls is 0.2 standard deviations. An effect size of 0.8 is clearly visible. The difference in average heights between 14 and 18 year old girls is 0.8 standard deviations.
It may be unrealistic to look for changes much smaller than 0.2 standard deviations because the sample sizes become prohibitively large. It may also be unrealistic to expect to see changes much larger than 0.8 standard deviations since this size change does not seem to occur too often in the published literature.
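To see why small effect sizes become expensive, here is a rough Python sketch. It assumes a comparison of two independent groups of equal size with equal standard deviations, a two-sided alpha of .05, a power of .80, and the usual normal approximation; the function name is mine.

```python
import math
from statistics import NormalDist

def n_per_group_for_effect_size(effect_size, alpha=0.05, power=0.80):
    # effect_size = clinically relevant difference / standard deviation.
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = .05
    z_beta = z.inv_cdf(power)           # about 0.84 for power = .80
    # Both groups share the same SD, hence the factor of 2.
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

if __name__ == "__main__":
    for d in (0.1, 0.2, 0.5, 0.8, 1.0):
        print(f"effect size {d}: {n_per_group_for_effect_size(d)} per group")
```

Because the sample size grows with the inverse square of the effect size, cutting the effect size in half roughly quadruples the number of subjects you need.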
Like the other two rules, this rule is also not perfect. In some animal experiments, for example, the similarity in the gene pool can often reduce variation to such an extent that changes of more than a full standard deviation are quite realistic. If you are trying to specify a clinically relevant difference, there is no substitute for a good understanding of the context of your research.
But I can't do it.
A lot of people tell me that they can't do this. They can't provide an estimate of variability or they can't determine what a clinically relevant difference is, even after I explain all of the above suggestions.
But you have to do it.
The CONSORT Guidelines require you to have an a priori justification of sample size for publication. If you don't do this now, you won't be able to publish the data in any journal that uses these guidelines. What's the point of doing the research if you can't publish it?
If your research requires an ethical review (e.g., through an IRB), they will require the same a priori justification. If the research involves animals, the appropriate animal care and use committee will require this justification.
The bottom line is that if you know so little about this avenue of research that you can't even come up with a preliminary estimate of the variability of your outcome variable, then you shouldn't be doing the research. You need instead to:
- do a more thorough literature review,
- collect some pilot data, or
- switch to an outcome measure whose variability is known to some extent.
But do something, because your ability to perform the research and to publish your research depends on your justification of the sample size.
Example
In a study of two different skin barriers for burn patients, we are interested in three outcome measures: pain, healing time, and cost. We will randomly assign half of the patients to one skin barrier and half to the other.
For pediatric patients we usually measure pain with the Oucher, a five point scale that has been validated for children. A review of previous studies using the Oucher has shown that it has a standard deviation of about 1.5 units. We would be interested in seeing how large a sample size is needed to show a change of 1 unit, the smallest individual change attainable on the Oucher. We want to have a power of .80, or equivalently, a probability of a Type II error of .20.
The formulas for sample size vary from problem to problem. The sample size needed in each group for a comparison of two independent groups is

$$n = \frac{(z_{1-\alpha/2} + z_{1-\beta})^2 \, (\sigma_1^2 + \sigma_2^2)}{D^2}$$

We use the letter "z" to represent a percentile of the standard normal distribution. Alpha represents the probability of a Type I error (usually .05). Beta represents the probability of a Type II error (we usually want this to be somewhere between .05 and .20). Sigma represents the standard deviation, and this formula allows for the possibility of different standard deviations in group 1 and group 2. Don't forget that the formula requires you to square these standard deviations. Finally, D is the clinically relevant difference. In our example,

$$n = \frac{(1.96 + 0.84)^2 \, (1.5^2 + 1.5^2)}{1^2} = 35.28$$
We round up. So in order to achieve 80% power for detecting a one unit difference in the Oucher score, which has a reported standard deviation of 1.5, we would need to sample 36 patients in each group.
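The Oucher calculation can be checked with a short Python script. This is a sketch, assuming the usual normal-approximation formula for two independent groups, n = (z[1-alpha/2] + z[1-beta])^2 (sigma1^2 + sigma2^2) / D^2, with a two-sided alpha of .05. The function name and argument names are mine.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sd1, sd2, difference, alpha=0.05, power=0.80):
    # Normal-approximation sample size for comparing two independent means.
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # Type I error, two-sided
    z_beta = z.inv_cdf(power)           # power = 1 - probability of Type II error
    n = (z_alpha + z_beta) ** 2 * (sd1 ** 2 + sd2 ** 2) / difference ** 2
    return math.ceil(n)                 # always round up

if __name__ == "__main__":
    # Oucher: standard deviation 1.5 in both groups, relevant difference 1 unit.
    print(sample_size_per_group(1.5, 1.5, 1.0))
```

The function takes two standard deviations so that you can try unequal variability in the two groups, and changing the power argument shows how quickly the sample size climbs as you ask for more than 80% power.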
Healing time is a more difficult endpoint to assess. Medical textbooks cite that the healing time for second degree burns has a range of 4 days (minimum 10, maximum 14). A study of healing times for a glove made from one of the skin barriers showed a healing time range of 6 (minimum 2 and maximum 8 days).
A rule of thumb is that the standard deviation is about one fourth to one sixth the size of the range. So we could have a standard deviation as small as 0.67 or as large as 1.5. An average change of one day in healing time would be considered clinically relevant.
If we use the largest possible estimate of standard deviation, we would get (coincidentally) the exact same sample size of 36 per group. If we used the smallest estimate of the standard deviation, we would need only 7 subjects per group.
For one type of skin barrier, a study of costs showed a range of $4.00 ($5.50 to $9.50). We would like to be able to detect a difference as small as $0.50 in costs.
Using the same rule of thumb, we get an estimate of the standard deviation of either 0.67 or 1.0. Using the smaller estimate of standard deviation, we would need 29 subjects per group. Using the larger estimate, we would need 63 subjects per group.
A sample size of 63 is untenable, so we decide that we can live with a study that could only detect a $1.00 change in costs. For this size difference, we would need 16 subjects per group using the larger standard deviation.
In summary, to achieve adequate power for all three endpoints, we would need 36 patients per group. This is larger than we need for the healing time endpoint. It is also larger than what we need for the cost endpoint, unless we wanted to detect a $0.50 change in costs. To detect such a small difference, we need a sample size of 63 subjects per group.
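The whole example can be laid out as a small Python script that applies the range rule of thumb and then the two-group formula to each endpoint. Treat it as a sketch: it assumes equal standard deviations in the two groups, a two-sided alpha of .05, and a power of .80, and the names are mine. For the smaller cost estimate it uses the rounded value 0.67 quoted above.

```python
import math
from statistics import NormalDist

def sample_size_per_group(sd, difference, alpha=0.05, power=0.80):
    # Two independent groups with a common SD (normal approximation).
    z = NormalDist()
    z_sum = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    return math.ceil(z_sum ** 2 * 2 * sd ** 2 / difference ** 2)

# (endpoint, standard deviation, clinically relevant difference)
scenarios = [
    ("pain (Oucher), SD 1.5, 1 unit", 1.5, 1.0),
    ("healing time, SD = range 6 / 4, 1 day", 6 / 4, 1.0),
    ("healing time, SD = range 4 / 6, 1 day", 4 / 6, 1.0),
    ("cost, SD 0.67, $0.50", 0.67, 0.50),  # rounded SD, as quoted in the text
    ("cost, SD 1.0, $0.50", 1.0, 0.50),
    ("cost, SD 1.0, $1.00", 1.0, 1.00),
]

results = [sample_size_per_group(sd, diff) for _, sd, diff in scenarios]

if __name__ == "__main__":
    for (label, _, _), n in zip(scenarios, results):
        print(f"{label}: {n} per group")
```

Taking the largest sample size among the endpoints you are committed to gives the number of patients per group that the study needs.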
Summary
Eager Edgar wants to know if forty subjects is enough to conduct a research study. Professor Mean explains that it is impossible to determine whether forty is an appropriate sample size without having these three things:
- a research hypothesis,
- a standard deviation for your outcome measure, and
- an estimate of the clinically relevant difference for this outcome measure.
Further reading
Jacob Cohen has an excellent discussion of effect sizes in Chapter 2 of his book, and the example of girls' heights comes directly from this book. Bernard Rosner incorporates a discussion of power and sample size issues into every section on statistical testing. Russ Lenth's PiFace software will provide more accurate power calculations than those presented here (or in Rosner's book), which is especially important when you are estimating power for small sample sizes. The range method for estimating standard deviations gives a more precise rule for converting a range into a standard deviation.
- Power and sample size page.
Russell V. Lenth (Accessed on January 1, 2002).
http://www.stat.uiowa.edu/~rlenth/Power/
- Range method for estimating standard deviation.
(Accessed on October 2, 2000)
http://www.uop.edu/cop/psychology/Statistics/range_method.html
- Statistical Power Analysis for the Behavioral Sciences, Revised Edition.
Cohen J.
New York NY: Academic Press (1977).
ISBN: 0-12-179060-6.
- Fundamentals of Biostatistics, Third Edition.
Rosner B.
Belmont CA: Duxbury Press (1990).
ISBN: 0-534-91973-1.
This page was written by Steve Simon and was last modified on 07/14/2008.
Please fill out an evaluation form. Your input is important. These evaluation forms also ensure that we can offer Continuing Medical Education credits for this class.



There are many complex formulas for determining sample size; here is some general advice.
There are a lot of scientific issues that I can't answer here. Is arterial distensibility a good marker of heart disease? What is the best way to determine gestational age? Should you measure blood pressure in the left arm or the right arm?


