Children's Mercy Hospital

Stats #22: What Do All These Numbers Mean? Confidence Intervals and P-Values

Content:  This two-hour training class will teach you how to interpret confidence intervals and p-values.

Objectives:  In this class you will learn how to:

  • distinguish between statistical significance and clinical significance;
  • define and interpret p-values; and
  • explain the ethical issues associated with inadequate sample sizes.

Teaching strategies:  Didactic lectures and small group exercises.

IRB Education Credits:  This class qualifies for two hours of IRB Education Credits (IRBECs).

Outline:

  • Overview of the STATS web pages
  • Consulting services that I provide
  • Confidence Intervals
  • Type I Error
  • Type II Error
  • P-Values
  • Please fill out an evaluation form

Overview of the STATS web pages (January 21, 2000)

What are the STATS web pages?

The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.

Where can I find STATS?

If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,

http://www.childrensmercy.org/stats

which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.

Some of the fun stuff you can find on the STATS web pages.

Ask Professor Mean.  For the tough Statistics questions that Dear Abby won't touch.

Planning Your Research Study.  Things you need to plan for before you start collecting your data.

Selecting An Appropriate Sample Size.  How much data do you really need?

Managing Your Research Data.  Everything you want to know before you step to the keyboard.

Steps In a Typical Data Analysis.  I have my data on the computer. Now what?

How to Read a Medical Journal Article.  Reading a journal is hard work. Here's some help.

Professor Mean's Library.  Good books and good web sites about Statistics.

... and even more good stuff!!!

This webpage was written by Steve Simon, edited by Linda Foland, and was last modified on 07/08/2008. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Website details


For CMH employees only: Statistical Consulting Services.

You can get free statistical consulting if you work for Children's Mercy Hospital. Steve Simon and Ashley Sherman provide a wide range of statistical consulting services to help you with your research projects. This help can start as early as the initial planning of your research. We also help with the analysis of your data, using SPSS or other statistical software, and with the preparation of your presentations and publications.

Here are some examples of the services that we have provided:

  • setting up your research hypothesis,
  • selecting and justifying your sample size,
  • writing the statistical methods section for your grant,
  • preparing randomization tables for your study,
  • reviewing your surveys for content and quality,
  • developing a system for entering your data,
  • choosing an appropriate statistical model for your data,
  • establishing validity and/or reliability for your measurement scales,
  • checking for violations of statistical assumptions in your data,
  • producing graphs and tables for your research publication, and
  • providing references for new and unusual statistical methods.

Specific statistical advice has been outlined on a series of web pages which can be found at http://www.childrensmercy.org/stats/. The pages provide advice about planning your research, selecting an appropriate sample size, managing your research data, performing a variety of data analyses, presenting research data, and writing research papers.

How to get in touch with a statistician

If you would like to meet with Steve Simon or Ashley Sherman, you can set up an appointment by emailing or calling Judy Champion (jmchampion (at) cmh (dot) edu or 816-983-6784). If you have a very simple question, send an email directly to us (ssimon (at) cmh (dot) edu and aksherman (at) cmh (dot) edu).

This webpage was written by Steve Simon on 2003-04-30, edited by Steve Simon, and was last modified on 2008-07-08. Category: Professional details


Directions to my new office (April 25, 2008).

I have moved to a new office. It is a modular building just north of Children's Mercy Hospital. It is between 23rd and 22nd street, just off of Kenwood Avenue (Kenwood is a small north/south street just west of Holmes). If you need to get from your office to mine, here are some directions written by my Administrative Assistant, Judy Champion.

  • Take the elevator in the research tower down to the yellow level. Exit the employee parking garage on 23rd Street, walk to Kenwood, and cross 23rd Street. Your destination is Building M 3, which is the building closest to 22nd Street. However, the entrance to our building faces Building M 2. It’s best to walk into the parking area that is just north of Building M 1 and follow the sidewalk around the west side of Building M 2 in order to get to our building’s entrance on its south side. Another route is to exit the Hospital Hill Center Building on Holmes, walk ½ block north to 23rd Street, cross 23rd Street, then walk west to Kenwood and north to Building M 3 (address: 2220 Kenwood).

This webpage was written by Steve Simon and was last modified on 2008-07-14. Category: Professional details


Stats #22: Practice Exercises

Review the abstracts listed below. Interpret each of the confidence intervals and/or p-values presented in the abstract.

Note that as a general rule, it is dangerous to review a journal article on the basis of the abstract alone. To save space and paper, I have not included the full articles, but you can find the full free text of each of these articles on the web if you wish to explore further.
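A useful first pass through any of these intervals is to check whether it contains the "no effect" value: 0 for a difference, 1 for a ratio such as an odds ratio or hazard ratio. The sketch below is only an illustration of that check. The function name `excludes_null` is made up for this example, and the intervals are copied from abstracts 1, 3, and 7. The check speaks only to statistical significance; whether an effect is large enough to matter clinically is a separate judgment.

```python
def excludes_null(lower, upper, null_value):
    """Return True when the interval lies entirely on one side of the null value."""
    return not (lower <= null_value <= upper)


# Differences use a null value of 0; ratios (odds ratios, hazard ratios) use 1.
# Each tuple: (label, lower limit, upper limit, null value).
examples = [
    ("Abstract 1, total score difference", -1.30, -0.43, 0.0),
    ("Abstract 3, IL6 median WBC difference", -0.41, 0.77, 0.0),
    ("Abstract 7, forceps delivery adjOR", 0.96, 3.13, 1.0),
    ("Abstract 7, Caesarean section adjOR", 0.14, 0.50, 1.0),
]

for label, lower, upper, null_value in examples:
    verdict = "excludes" if excludes_null(lower, upper, null_value) else "includes"
    print(f"{label}: ({lower}, {upper}) {verdict} {null_value}")
```

An interval that includes the null value does not show that there is no effect. Look at how wide the interval is before drawing that conclusion.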

1. The Impact of Incident Vertebral and Non-Vertebral Fractures on Health-Related Quality of Life in Postmenopausal Women. Adachi JD, Ioannidis G, Olszynski WP, Brown JP, Hanley DA, Sebaldt RJ, Petrie A, Tenenhouse A, Stephenson GF, Papaioannou A, Guyatt GH, Goldsmith CH. BMC Musculoskeletal Disorders 2002, 3:11 (22 April 2002) Background: Little empirical research has examined the multiple consequences of osteoporosis on quality of life. Methods: Health related quality of life (HRQL) was examined in relationship to incident fractures in 2009 postmenopausal women 50 years and older who were seen in consultation at our tertiary care, university teaching hospital-affiliated office and who were registered in the Canadian Database of Osteoporosis and Osteopenia (CANDOO) patients. Patients were divided into three study groups according to incident fracture status: vertebral fractures, non-vertebral fractures and no fractures. Baseline assessments of anthropometric data, medical history, therapeutic drug use, and prevalent fracture status were obtained from all participants. The disease-targeted mini-Osteoporosis Quality of Life Questionnaire (mini-OQLQ) was used to measure HRQL. Results: Multiple regression analyses revealed that subjects who had experienced an incident vertebral fracture had lower HRQL difference scores as compared with non-fractured participants in total score (-0.86; 95% confidence intervals (CI): -1.30, -0.43) and the symptoms (-0.76; 95% CI: -1.23, -0.30), physical functioning (-1.12; 95% CI: -1.57, -0.67), emotional functioning (-1.06; 95% CI: -1.44, -0.68), activities of daily living (-1.47; 95% CI: -1.97, -0.96), and leisure (-0.92; 95% CI: -1.37, -0.47) domains of the mini-OQLQ. 
Patients who experienced an incident non-vertebral fracture had lower HRQL difference scores as compared with non-fractured participants in total score (-0.47; 95% CI: -0.70, -0.25), and the symptoms (-0.25; 95% CI: -0.49, -0.01), physical functioning (-0.39; 95% CI: -0.65, -0.14), emotional functioning (-0.97; 95% CI: -1.20, -0.75) and the activities of daily living (-0.47; 95% CI: -0.73, -0.21) domains. Conclusion: Quality of life decreased in patients who sustained incident vertebral and non-vertebral fractures.

2. A Retrospective Population Based Trend Analysis on Hospital Admissions for Lower Respiratory Illness Among Swedish Children From 1987 to 2000. Björ O, Bråbäck L. BMC Public Health 2003, 3:22 (11 July 2003) Background: Data relating to hospital admissions of very young children for wheezing illness have been conflicting. Our primary aim was to assess whether a previous increase in hospital admissions for lower respiratory illness had continued in young Swedish children. We have included re-admissions in our analyses in order to evaluate the burden of lower respiratory illness in very young children. We have also assessed whether changes in the labelling of symptoms have affected the time trend. Methods: A retrospective, population based study was conducted to assess the time trend in admissions and re-admissions for lower respiratory illness. Data were obtained from the Swedish Hospital Discharge Register for all children with a first hospital admission before nine years of age, a total of 109,176 children. The register covers more than 98% of all hospital admissions in Sweden. The coding of diagnoses was based on ICD-9 from 1987 to 1996 and ICD-10 from 1997. Results: The first admission rates declined significantly in children with a first admission after two years of age. However, an increasing admission trend was observed in children aged less than one year and 35% of first admissions occurred in this age group. The annual increase was 3.8% (95% CI 1.3–6.3) in boys and 5.0% (95% CI 2.4–7.6) in girls. A diagnostic shift appeared to occur when ICD-10 was introduced in 1997. The asthma and pneumonia admission rate in children aged less than one year levelled off, whereas the increase in admissions for bronchitis continued. The re-admission rates for asthma decreased and the probability of re-admission was higher in boys. 
National drug statistics demonstrated a substantial increase in the delivery of inhaled steroids to all age groups but most prescriptions occurred to children aged one year or more. Conclusion: Hospital admissions for lower respiratory illness are still increasing in children aged <1 year. Our findings are in line with other recent studies suggesting a change in the responsiveness to viral infections in very young children, but changes in admission criteria cannot be excluded. An increased use of inhaled steroids may have contributed to decreasing re-admission rates.

3. Elevated White Cell Count in Acute Coronary Syndromes: Relationship to Variants in Inflammatory and Thrombotic Genes. Byrne CE, Fitzgerald A, Cannon CP, Fitzgerald DJ, Shields DC. BMC Medical Genetics 2004, 5:13 (1 June 2004) Background: Elevated white blood cell counts (WBC) in acute coronary syndromes (ACS) increase the risk of recurrent events, but it is not known if this is exacerbated by pro-inflammatory factors. We sought to identify whether pro-inflammatory genetic variants contributed to alterations in WBC and C-reactive protein (CRP) in an ACS population. Methods: WBC and genotype of interleukin 6 (IL-6 G-174C) and of interleukin-1 receptor antagonist (IL1RN intronic repeat polymorphism) were investigated in 732 Caucasian patients with ACS in the OPUS-TIMI-16 trial. Samples for measurement of WBC and inflammatory factors were taken at baseline, i.e. within 72 hours of an acute myocardial infarction or an unstable angina event. Results: An increased white blood cell count (WBC) was associated with an increased C-reactive protein (r = 0.23, p < 0.001) and there was also a positive correlation between levels of β-fibrinogen and C-reactive protein (r = 0.42, p < 0.0001). IL1RN and IL6 genotypes had no significant impact upon WBC. The difference in median WBC between the two homozygote IL6 genotypes was 0.21/mm3 (95% CI = -0.41, 0.77), and -0.03/mm3 (95% CI = -0.55, 0.86) for IL1RN. Moreover, the composite endpoint was not significantly affected by an interaction between WBC and the IL1 (p = 0.61) or IL6 (p = 0.48) genotype. Conclusions: Cytokine pro-inflammatory genetic variants do not influence the increased inflammatory profile of ACS patients.

4. Effect of Paper Quality on the Response Rate to a Postal Survey: A Randomised Controlled Trial. [ISRCTN32032031]. Clark TJ, Khan KS, Gupta JK. BMC Medical Research Methodology 2001, 1:12 (17 December 2001) Background: Response rates to surveys are declining and this threatens the validity and generalisability of their findings. We wanted to determine whether paper quality influences the response rate to postal surveys. Methods: A postal questionnaire was sent to all members of the British Society of Gynaecological Endoscopy (BSGE). Recipients were randomised to receiving the questionnaire printed on standard quality paper or high quality paper. Results: The response rate for the recipients of high quality paper was 43/195 (22%) and 57/194 (29%) for standard quality paper (relative rate of response 0.75, 95% CI 0.33–1.05, p = 0.1). Conclusion: The use of high quality paper did not increase response rates to a questionnaire survey of gynaecologists affiliated to an endoscopic society.
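When an abstract reports the raw counts, as this one does, you can recompute the ratio and an approximate interval yourself as a sanity check. The sketch below uses the common large-sample interval on the log scale. That method is an assumption, since the abstract does not say how the authors computed their interval, and the function name `relative_rate_ci` is made up for this example.

```python
import math


def relative_rate_ci(events_a, total_a, events_b, total_b, z=1.96):
    """Ratio of two proportions with an approximate large-sample interval.

    The interval is built on the log scale and then back-transformed;
    z = 1.96 gives an approximate 95% interval.
    """
    ratio = (events_a / total_a) / (events_b / total_b)
    # Standard error of log(ratio) for two independent proportions.
    se_log = math.sqrt(1 / events_a - 1 / total_a + 1 / events_b - 1 / total_b)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Counts from the abstract: 43/195 (high quality paper), 57/194 (standard paper).
ratio, lower, upper = relative_rate_ci(43, 195, 57, 194)
print(f"relative rate = {ratio:.2f}, approximate 95% CI = {lower:.2f} to {upper:.2f}")
```

With these counts the sketch gives a ratio of about 0.75 and limits of roughly 0.53 and 1.06. A recomputed interval need not match the published one exactly, but a large gap in either limit is worth checking against the full article.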

5. Is There a Clinically Significant Gender Bias in Post-Myocardial Infarction Pharmacological Management in the Older (>60) Population of a Primary Care Practice? Di Cecco R, Patel U, Upshur REG. BMC Family Practice 2002, 3:8 (3 May 2002) Background: Differences in the management of coronary artery disease between men and women have been reported in the literature. There are few studies of potential inequalities of treatment that arise from a primary care context. This study investigated the existence of such inequalities in the medical management of post myocardial infarction in older patients. Methods: A comprehensive chart audit was conducted of 142 men and 81 women in an academic primary care practice. Variables were extracted on demographic variables, cardiovascular risk factors, medical and non-medical management of myocardial infarction. Results: Women were older than men. The groups were comparable in terms of cardiac risk factors. A statistically significant difference (14.6%: 95% CI 0.048–28.7 p = 0.047) was found between men and women for the prescription of lipid lowering medications. 25.3% (p = 0.0005, CI 11.45, 39.65) more men than women had undergone angiography, and 14.4 % (p = 0.029, CI 2.2, 26.6) more men than women had undergone coronary artery bypass graft surgery. Conclusion: Women are less likely than men to receive lipid-lowering medication which may indicate less aggressive secondary prevention in the primary care setting.
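A confidence interval and a p-value computed from the same estimate carry closely related information, and you can roughly recover one from the other. The sketch below backs out a standard error from the width of a 95% interval and converts the estimate to a two-sided p-value using a normal approximation. The approximation is an assumption, since the abstract does not say which test the authors used, and the function name `p_from_ci` is made up for this example.

```python
import math


def p_from_ci(estimate, lower, upper, z=1.96):
    """Approximate two-sided p-value implied by an estimate and its 95% interval.

    Assumes the interval is roughly symmetric and based on a normal
    approximation, so that its width is 2 * z * standard error.
    """
    se = (upper - lower) / (2 * z)
    z_stat = estimate / se
    # Two-sided tail area of the standard normal distribution.
    p_value = math.erfc(abs(z_stat) / math.sqrt(2))
    return se, z_stat, p_value


# Figures from the abstract, in percentage points.
for label, est, lo, hi in [
    ("lipid lowering medication", 14.6, 0.048, 28.7),
    ("angiography", 25.3, 11.45, 39.65),
]:
    se, z_stat, p_value = p_from_ci(est, lo, hi)
    print(f"{label}: SE = {se:.2f}, z = {z_stat:.2f}, approximate p = {p_value:.4f}")
```

The recovered values should land near the published p-values but will not match them exactly, because of rounding and because the authors may have used a different test.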

6. Effect of Paracetamol (Acetaminophen) and Ibuprofen on Body Temperature in Acute Ischemic Stroke PISA, A Phase II Double-Blind, Randomized, Placebo-Controlled Trial [ISRCTN98608690]. Dippel DWJ, van Breda EJ, van der Worp HB, van Gemert HMA, Meijer RJ, Kappelle LJ, Koudstaal PJ, the PISA-investigators. BMC Cardiovascular Disorders 2003, 3:2 (6 February 2003) Background: Body temperature is a strong predictor of outcome in acute stroke. In a previous randomized trial we observed that treatment with high-dose acetaminophen (paracetamol) led to a reduction of body temperature in patients with acute ischemic stroke, even when they had no fever. The purpose of the present trial was to study whether this effect of acetaminophen could be reproduced, and whether ibuprofen would have a similar, or even stronger effect. Methods: Seventy-five patients with acute ischemic stroke confined to the anterior circulation were randomized to treatment with either 1000 mg acetaminophen, 400 mg ibuprofen, or placebo, given 6 times daily during 5 days. Treatment was started within 24 hours from the onset of symptoms. Body temperatures were measured at 2-hour intervals during the first 24 hours, and at 6-hour intervals thereafter. Results: No difference in body temperature at 24 hours was observed between the three treatment groups. However, treatment with high-dose acetaminophen resulted in a 0.3°C larger reduction in body temperature from baseline than placebo treatment (95% CI: 0.0 to 0.6 °C). Acetaminophen had no significant effect on body temperature during the subsequent four days compared to placebo, and ibuprofen had no statistically significant effect on body temperature during the entire study period. Conclusions: Treatment with a daily dose of 6000 mg acetaminophen results in a small, but potentially worthwhile decrease in body temperature after acute ischemic stroke, even in normothermic and subfebrile patients. 
Further large randomized clinical trials are needed to study whether early reduction of body temperature leads to improved outcome.

7. Effects of Carrying a Pregnancy and of Method of Delivery on Urinary Incontinence: A Prospective Cohort Study. Eason E, Labrecque M, Marcoux S, Mondor M. BMC Pregnancy and Childbirth 2004, 4:4 (19 February 2004) Background: This study was carried out to identify risk factors associated with urinary incontinence in women three months after giving birth. Methods: Urinary incontinence before and during pregnancy was assessed at study enrolment early in the third trimester. Incontinence was re-assessed three months postpartum. Logistic regression analysis was used to assess the role of maternal and obstetric factors in causing postpartum urinary incontinence. This prospective cohort study in 949 pregnant women in Quebec, Canada was nested within a randomised controlled trial of prenatal perineal massage. Results: Postpartum urinary incontinence was increased with prepregnancy incontinence (adjusted odds ratio [adjOR] 6.44, 95% CI 4.15, 9.98), incontinence beginning during pregnancy (adjOR 1.93, 95% CI 1.32, 2.83), and higher prepregnancy body mass index (adjOR 1.07/unit of BMI, 95% CI 1.03, 1.11). Caesarean section was highly protective (adjOR 0.27, 95% CI 0.14, 0.50). While there was a trend towards increasing incontinence with forceps delivery (adjOR 1.73, 95% CI 0.96, 3.13) this was not statistically significant. The weight of the baby, episiotomy, the length of the second stage of labour, and epidural analgesia were not predictive of urinary incontinence. Nor was prenatal perineal massage, the randomised controlled trial intervention. When the analysis was limited to women having their first vaginal birth, the same risk factors were important, with similar adjusted odds ratios. Conclusions: Urinary incontinence during pregnancy is extremely common, affecting over half of pregnant women. 
Urinary incontinence beginning during pregnancy roughly doubles the likelihood of urinary incontinence at 3 months postpartum, regardless whether delivery is vaginal or by Caesarean section.

8. Breastfeeding Practices in a Cohort of Inner-City Women: The Role of Contraindications. England L, Brenner R, Bhaskar B, Simons-Morton B, Das A, Revenis M, Mehta N, Clemens J. BMC Public Health 2003, 3:28 (20 August 2003) Background: Little is known about the role of breastfeeding contraindications in breastfeeding practices. Our objectives were to 1) identify predictors of breastfeeding initiation and duration among a cohort of predominately low-income, inner-city women, and 2) evaluate the contribution of breastfeeding contraindications to breastfeeding practices. Methods: Mother-infant dyads were systematically selected from 3 District of Columbia hospitals between 1995 and 1996. Breastfeeding contraindications and potential predictors of breastfeeding practices were identified through medical record reviews and interviews conducted after delivery (baseline). Interviews were conducted at 3–7 months postpartum and again at 7–12 months postpartum to determine breastfeeding initiation rates and duration. Multivariable logistic regression analysis was used to identify baseline factors associated with initiation of breastfeeding. Cox proportional hazards models were generated to identify baseline factors associated with duration of breastfeeding. Results: Of 393 study participants, 201 (51%) initiated breastfeeding. A total of 61 women (16%) had at least one documented contraindication to breastfeeding; 94% of these had a history of HIV infection and/or cocaine use. Of the 332 women with no documented contraindications, 58% initiated breastfeeding, vs. 13% of women with a contraindication. In adjusted analysis, factors most strongly associated with breastfeeding initiation were presence of a contraindication (adjusted odds ratio [AOR], 0.19; 95% confidence interval [CI], 0.08–0.47), and mother foreign-born (AOR, 4.90; 95% CI, 2.38–10.10). 
Twenty-five percent of study participants who did not initiate breastfeeding cited concern about passing dangerous things to their infants through breast milk. Factors associated with discontinuation of breastfeeding (all protective) included mother foreign-born (hazard ratio [HR], 0.55; 95% CI 0.39–0.77) increasing maternal age (HR for 5-year increments, 0.80; 95% CI, 0.69–0.92), and infant birth weight ≥ 2500 grams (HR, 0.45; 95% CI, 0.26–0.80). Conclusions: Breastfeeding initiation rates and duration were suboptimal in this inner-city population. Many women who did not breastfeed had contraindications and/or were concerned about passing dangerous things to their infants through breast milk. It is important to consider the prevalence of contraindications to breastfeeding when evaluating breastfeeding practices in high-risk communities.

9. Randomised Controlled Trial of a Theoretically Grounded Tailored Intervention to Diffuse Evidence-Based Public Health Practice [ISRCTN23257060]. Forsetlund L, Bradley P, Forsen L, Nordheim L, Jamtvedt G, Bjørndal A. BMC Medical Education 2003, 3:2 (13 March 2003) Background: Previous studies have shown that Norwegian public health physicians do not systematically and explicitly use scientific evidence in their practice. They work in an environment that does not encourage the integration of this information in decision-making. In this study we investigate whether a theoretically grounded tailored intervention to diffuse evidence-based public health practice increases the physicians' use of research information. Methods: 148 self-selected public health physicians were randomised to an intervention group (n = 73) and a control group (n = 75). The intervention group received a multifaceted intervention while the control group received a letter declaring that they had access to library services. Baseline assessments before the intervention and post-testing immediately at the end of a 1.5-year intervention period were conducted. The intervention was theoretically based and consisted of a workshop in evidence-based public health, a newsletter, access to a specially designed information service, to relevant databases, and to an electronic discussion list. The main outcome measure was behaviour as measured by the use of research in different documents. Results: The intervention did not demonstrate any evidence of effects on the objective behaviour outcomes. We found, however, a statistical significant difference between the two groups for both knowledge scores: Mean difference of 0.4 (95% CI: 0.2–0.6) in the score for knowledge about EBM-resources and mean difference of 0.2 (95% CI: 0.0–0.3) in the score for conceptual knowledge of importance for critical appraisal. 
There were no statistical significant differences in attitude-, self-efficacy-, decision-to-adopt- or job-satisfaction scales. There were no significant differences in Cochrane library searching after controlling for baseline values and characteristics. Conclusion: Though demonstrating effect on knowledge the study failed to provide support for the hypothesis that a theory-based multifaceted intervention targeted at identified barriers will change professional behaviour.

10. Family Structure and Risk Factors for Schizophrenia: Case-Sibling Study. Haukka JK, Suvisaari J, Lonnqvist J. BMC Psychiatry 2004, 4:41 (27 November 2004). Background: Several family structure-related factors, such as birth order, family size, parental age, and age differences to siblings, have been suggested as risk factors for schizophrenia. We examined how family-structure-related variables modified the risk of schizophrenia in Finnish families with at least one child with schizophrenia born from 1950 to 1976. Methods: We used case-sibling design, a variant of the matched case-control design in the analysis. Patients hospitalized for schizophrenia between 1969 and 1996 were identified from the Finnish Hospital Discharge Register, and their families from the Population Register Center. Only families with at least two children (7914 sibships and 21059 individuals) were included in the analysis. Conditional logistic regression with sex, birth cohort, maternal schizophrenia status, and several family-related variables as explanatory variables was used in the case-sibling design. The effect of variables with the same value in each sibship was analyzed using ordinary logistic regression. Results: Having a sibling who was less than five years older (OR 1.46, 95% CI 1.29-1.66), or being the firstborn (first born vs. second born 1.62, 1.87-1.4) predicted an elevated risk, but having siblings who were more than ten years older predicted a lower risk (0.66, 0.56-0.79). Conclusions: Several family-structure-related variables were identified as risk factors for schizophrenia. The underlying causative mechanisms are likely to be variable.

11. Overweight, Obesity, and Colorectal Cancer Screening: Disparity Between Men and Women. Heo M, Allison DB, Fontaine KR. BMC Public Health 2004, 4:53 (8 November 2004) Background: To estimate the association between body-mass index (BMI: kg/m2) and colorectal cancer (CRC) screening among US adults aged ≥ 50 years. Methods: Population-based data from the 2001 Behavioral Risk Factor Surveillance Survey. Adults (N = 84,284) aged ≥ 50 years were classified by BMI as normal weight (18.5–<25), overweight (25–<30), obesity class I (30–<35), obesity class II (35–<40), and obesity class III (≥ 40). Interval since most recent screening fecal occult blood test (FOBT): (0 = >1 year since last screening vs. 1 = screened within the past year), and screening sigmoidoscopy (SIG): (0 = > 5 years since last screening vs. 1 = within the past 5 years) were the outcomes. Results: Results differed between men and women. After adjusting for age, health insurance, race, and smoking, we found that, compared to normal weight men, men in the overweight (odds ratio [OR] 1.25, 95% CI = 1.05–1.51) and obesity class I (OR = 1.21, 95% CI = 1.03–1.75) categories were more likely to have obtained a screening SIG within the previous 5 years, while women in the obesity class I (OR = 0.86, 95%CI = 0.78–0.94) and II (OR = 0.88, 95%CI = 0.79–0.99) categories were less likely to have obtained a screening SIG compared to normal weight women. BMI was not associated with FOBT. Conclusion: Weight may be a correlate of CRC screening behavior but in a different way between men and women.

12. A national survey on the patterns of treatment of inflammatory bowel disease in Canada. Hilsden RJ, Verhoef MJ, Best A, Pocobelli G. BMC Gastroenterology 2003, 3:10 (5 June 2003) Background: There is a general lack of information on the care of inflammatory bowel disease (IBD) in a broad, geographically diverse, non-clinic population. The purposes of this study were (1) to compare a sample drawn from the membership of a national Crohn's and Colitis Foundation to published clinic-based and population-based IBD samples, (2) to describe current patterns of health care use, and (3) to determine if unexpected variations exist in how and by whom IBD is treated. Methods: Mailed survey of 4453 members of the Crohn's and Colitis Foundation of Canada. The questionnaire, in members' stated language of preference, included items on demographic and disease characteristics, general health behaviors and current and past IBD treatment. Each member received an initial and one reminder mailing. Results: Questionnaires were returned by 1787, 913, and 128 people with Crohn's disease, ulcerative colitis and indeterminate colitis, respectively. At least one operation had been performed on 1159 Crohn's disease patients, with risk increasing with duration of disease. Regional variation in surgical rates in ulcerative colitis patients was identified. 6-Mercaptopurine/Azathioprine was used by 24% of patients with Crohn's disease and 12% of patients with ulcerative colitis (95% CI for the difference: 8.9% – 15%). In patients with Crohn's disease, use was not associated with gender, income or region of residence but was associated with age and markers of disease activity. Infliximab was used by 112 respondents (4%), the majority of whom had Crohn's disease. Variations in infliximab use based on region of residence and income were not seen. Sixty-eight percent of respondents indicated that they depended most on a gastroenterologist for their IBD care. 
There was significant regional variation in this. However, satisfaction with primary physician did not depend on physician type (for example, gastroenterologist versus general practitioner). Conclusion: This study achieved the goal of obtaining a large, geographically diverse sample that is more representative of the general IBD population than a clinic sample would have been. We could find no evidence of significant regional variation in medical treatments due to gender, region of residence or income level. Differences were noted between different age groups, which deserves further attention.
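The interval for the difference in 6-Mercaptopurine/Azathioprine use can be approximated from the figures in the abstract. The sketch below uses the usual large-sample interval for a difference between two independent proportions. It assumes that all 1787 Crohn's disease and 913 ulcerative colitis respondents answered the question and that the rounded percentages (24% and 12%) are close enough to the true values. Neither assumption is confirmed by the abstract, and the function name `diff_proportions_ci` is made up for this example.

```python
import math


def diff_proportions_ci(p1, n1, p2, n2, z=1.96):
    """Difference between two independent proportions with an approximate 95% interval."""
    diff = p1 - p2
    # Standard error of the difference (large-sample, unpooled).
    se = math.sqrt(p1 * (1 - p1) / n1 + p2 * (1 - p2) / n2)
    return diff, diff - z * se, diff + z * se


# Assumed inputs: rounded proportions and respondent counts from the abstract.
diff, lower, upper = diff_proportions_ci(0.24, 1787, 0.12, 913)
print(f"difference = {diff:.1%}, approximate 95% CI = {lower:.1%} to {upper:.1%}")
```

The result should be close to the published 8.9% to 15%. The narrow interval reflects the large sample, which is why the estimate of the difference is so precise.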

13. Do English and Chinese EQ-5D versions demonstrate measurement equivalence? an exploratory study. Luo N, Chew LH, Fong KY, Koh DR, Ng SC, Yoon KH, Vasoo S, Li SC, Thumboo J. Health and Quality of Life Outcomes 2003, 1:7 (17 April 2003) Background: Although multiple language versions of health-related quality of life instruments are often used interchangeably in clinical research, the measurement equivalence of these versions (especially using alphabet vs pictogram-based languages) has rarely been assessed. We therefore investigated the measurement equivalence of English and Chinese versions of the EQ-5D, a widely used utility-based outcome instrument. Methods: In a cross-sectional study, either EQ-5D version was administered to consecutive outpatients with rheumatic diseases. Measurement equivalence of EQ-5D item responses and utility and visual analog scale (EQ-VAS) scores between these versions was assessed using multiple regression models (with and without adjusting for potential confounding variables), by comparing the 95% confidence interval (95%CI) of score differences between these versions with pre-defined equivalence margins. An equivalence margin defined a magnitude of score differences (10% and 5% of entire score ranges for item responses and utility/EQ-VAS scores, respectively) which was felt to be clinically unimportant. Results: Sixty-six subjects completed the English and 48 subjects the Chinese EQ-5D. The 95%CI of the score differences between these versions overlapped with but did not fall completely within pre-defined equivalence margins for 4 EQ-5D items, utility and EQ-VAS scores. For example, the 95%CI of the adjusted score difference between these EQ-5D versions was -0.14 to +0.03 points for utility scores and -11.6 to +3.3 points for EQ-VAS scores (equivalence margins of -0.05 to +0.05 and -5.0 to +5.0 respectively). Conclusion: These data provide promising evidence for the measurement equivalence of English and Chinese EQ-5D versions.
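An equivalence study turns the usual comparison around: instead of asking whether the interval excludes zero, you ask whether the whole interval sits inside a range of differences small enough to be clinically unimportant. The sketch below compares an interval with its margins and reports one of three outcomes. The function name `classify_equivalence` and the wording of the three labels are made up for this example. The intervals and margins are taken from the abstract.

```python
def classify_equivalence(ci_lower, ci_upper, margin_lower, margin_upper):
    """Compare a confidence interval with pre-defined equivalence margins."""
    if margin_lower <= ci_lower and ci_upper <= margin_upper:
        return "entirely within the margins"
    if ci_upper < margin_lower or ci_lower > margin_upper:
        return "entirely outside the margins"
    return "overlaps the margins without falling within them"


# Intervals and margins reported in the abstract.
print("utility scores:", classify_equivalence(-0.14, 0.03, -0.05, 0.05))
print("EQ-VAS scores:", classify_equivalence(-11.6, 3.3, -5.0, 5.0))
```

Only the first outcome supports a claim of equivalence. Consider which outcome applies to each interval here, and whether the abstract's conclusion follows from it.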

14. Long term benzodiazepine use for insomnia in patients over the age of 60: discordance of patient and physician perceptions. Mah L, Upshur REG. BMC Family Practice 2002, 3:9 (8 May 2002) Background The aim of this study was to determine and compare patients' and physicians' perceptions of benefits and risks of long term benzodiazepine use for insomnia in the elderly. Methods A cross-sectional study (written survey) was conducted in an academic primary care group practice in Toronto, Canada. The participants were 93 patients over 60 years of age using a benzodiazepine for insomnia and 25 physicians comprising sleep specialists, family physicians, and family medicine residents. The main outcome measure was perception of benefit and risk scores calculated from the mean of responses (on a Likert scale of 1 to 5) to various items on the survey. Results The mean perception of benefit score was significantly higher in patients than physicians (3.85 vs. 2.84, p < 0.001, 95% CI 0.69, 1.32). The mean perception of risk score was significantly lower in patients than physicians (2.21 vs. 3.63, p < 0.001, 95% CI 1.07, 1.77). Conclusions There is a significant discordance between older patients and their physicians regarding the perceptions of benefits and risks of using benzodiazepines for insomnia on a long term basis. The challenge is to openly discuss these perceptions in the context of the available evidence to make collaborative and informed decisions.

15. Inter-rater agreement in the scoring of abstracts submitted to a primary care research conference. Montgomery AA, Graham A, Evans PH, Fahey T BMC Health Services Research 2002, 2:8 (26 March 2002) Background Checklists for peer review aim to guide referees when assessing the quality of papers, but little evidence exists on the extent to which referees agree when evaluating the same paper. The aim of this study was to investigate agreement on dimensions of a checklist between two referees when evaluating abstracts submitted for a primary care conference. Methods Anonymised abstracts were scored using a structured assessment comprising seven categories. Between one (poor) and four (excellent) marks were awarded for each category, giving a maximum possible score of 28 marks. Every abstract was assessed independently by two referees and agreement measured using intraclass correlation coefficients. Mean total scores of abstracts accepted and rejected for the meeting were compared using an unpaired t test. Results Of 52 abstracts, agreement between reviewers was greater for three components relating to study design (adjusted intraclass correlation coefficients 0.40 to 0.45) compared to four components relating to more subjective elements such as the importance of the study and likelihood of provoking discussion (0.01 to 0.25). Mean score for accepted abstracts was significantly greater than those that were rejected (17.4 versus 14.6, 95% CI for difference 1.3 to 4.1, p = 0.0003). Conclusions The findings suggest that inclusion of subjective components in a review checklist may result in greater disagreement between reviewers. However in terms of overall quality scores, abstracts accepted for the meeting were rated significantly higher than those that were rejected.

16. Effect of prize draw incentive on the response rate to a postal survey of obstetricians and gynaecologists: A randomised controlled trial. [ISRCTN32823119] Moses SH, Clark TJ. BMC Health Services Research 2004, 4:14 (28 June 2004) Background Response rates to postal questionnaires are falling and this threatens the external validity of survey findings. We wanted to establish whether the incentive of being entered into a prize draw to win a personal digital assistant (PDA) would increase the response rate for a national survey of consultant obstetricians and gynaecologists. Methods A randomised controlled trial was conducted. This involved sending a postal questionnaire to all Consultant Obstetricians and Gynaecologists in the United Kingdom. Recipients were randomised to receiving a questionnaire offering a prize draw incentive (on response) or no such incentive. Results The response rate for recipients offered the prize incentive was 64% (461/716) and 62% (429/694) in the no incentive group (relative rate of response 1.04, 95% CI 0.96 – 1.13) Conclusion The offer of a prize draw incentive to win a PDA did not significantly increase response rates to a national questionnaire survey of consultant obstetricians and gynaecologists.

17. Predicting gender differences as latent variables: summed scores, and individual item responses: a methods case study. Pietrobon R, Taylor M, Guller U, Higgins LD, Jacobs DO, Carey T. Health and Quality of Life Outcomes 2004, 2:59 (25 October 2004) Background Modeling latent variables such as physical disability is challenging since its measurement is performed through proxies. This poses significant methodological challenges. The objective of this article is to present three different methods to predict latent variables based on classical summed scores, individual item responses, and latent variable models. Methods This is a review of the literature and data analysis using "layers of information". Data was collected from the North Carolina Back Pain Project, using a modified version of the Roland Questionnaire. Results The three models are compared in relation to their goals and underlying concepts, previous clinical applications, data requirements, statistical theory, and practical applications. Initial linear regression models demonstrated a difference in disability between genders of 1.32 points (95% CI 0.65, 2.00) on a scale from 0–23. Subsequent item analysis found contradictory results across items, with no clear pattern. Finally, IRT models demonstrated three items were demonstrated to present differential item functioning. After these items were removed, the difference between genders was reduced to 0.78 points (95% CI, -0.99, 1.23). These results were shown to be robust with re-sampling methods. Conclusions Purported differences in the levels of a latent variable should be tested using different models to verify whether these differences are real or simply distorted by model assumptions.

19. The Outcome of Extubation Failure in a Community Hospital Intensive Care Unit: A Cohort Study. Seymour CW, Martinez A, Christie JD, Fuchs BD. Critical Care 2004, 8:R322-R327 (20 July 2004) Introduction: Extubation failure has been associated with poor intensive care unit (ICU) and hospital outcomes in tertiary care medical centers. Given the large proportion of critical care delivered in the community setting, our purpose was to determine the impact of extubation failure on patient outcomes in a community hospital ICU. Methods: A retrospective cohort study was performed using data gathered in a 16-bed medical/surgical ICU in a community hospital. During 30 months, all patients with acute respiratory failure admitted to the ICU were included in the source population if they were mechanically ventilated by endotracheal tube for more than 12 hours. Extubation failure was defined as reinstitution of mechanical ventilation within 72 hours (n = 60), and the control cohort included patients who were successfully extubated at 72 hours (n = 93). Results: The primary outcome was total ICU length of stay after the initial extubation. Secondary outcomes were total hospital length of stay after the initial extubation, ICU mortality, hospital mortality, and total hospital cost. Patient groups were similar in terms of age, sex, and severity of illness, as assessed using admission Acute Physiology and Chronic Health Evaluation II score (P > 0.05). Both ICU (1.0 versus 10 days; P < 0.01) and hospital length of stay (6.0 versus 17 days; P < 0.01) after initial extubation were significantly longer in reintubated patients. ICU mortality was significantly higher in patients who failed extubation (odds ratio = 12.2, 95% confidence interval [CI] = 1.5–101; P < 0.05), but there was no significant difference in hospital mortality (odds ratio = 2.1, 95% CI = 0.8–5.4; P < 0.15). 
Total hospital costs (estimated from direct and indirect charges) were significantly increased by a mean of US$33,926 (95% CI = US$22,573–45,280; P < 0.01). Conclusion: Extubation failure in a community hospital is univariately associated with prolonged inpatient care and significantly increased cost. Corroborating data from tertiary care centers, these adverse outcomes highlight the importance of accurate predictors of extubation outcome.

20. Effects of Isoflavones (soy phyto-estrogens) on Serum Lipids: A Meta-Analysis of Randomized Controlled Trials. Yeung J, Yu T. Nutrition Journal 2003, 2:15 (19 November 2003) Objectives: To determine the effects of isoflavones (soy phyto-estrogens) on serum total cholesterol (TC), low density lipoprotein cholesterol (LDL), high density lipoprotein cholesterol (HDL) and triglyceride (TG). Methods: We searched electronic databases and included randomized trials with isoflavones interventions in the forms of tablets, isolated soy protein or soy diets. Review Manager 4.2 was used to calculate the pooled risk differences with fixed effects model. Results: Seventeen studies (21 comparisons) with 853 subjects were included in this meta-analysis. Isoflavones tablets had insignificant effects on serum TC, 0.01 mmol/L (95% CI: -0.17 to 0.18, heterogeneity p = 1.0); LDL, 0.00 mmol/L (95% CI: -0.14 to 0.15, heterogeneity p = 0.9); HDL, 0.01 mmol/L (95% CI: -0.05 to 0.06, heterogeneity p = 1.0); and triglyceride, 0.03 mmol/L (95% CI: -0.06 to 0.12, heterogeneity p = 0.9). Isoflavones interventions in the forms of isolated soy protein (ISP), soy diets or soy protein capsule were heterogeneous to combine. Conclusions: Isoflavones tablets, isolated or mixtures with up to 150 mg per day, seemed to have no overall statistical and clinical benefits on serum lipids. Isoflavones interventions in the forms of soy proteins may need further investigations to resolve whether synergistic effects are necessary with other soy components.

All of the abstracts listed above are from Bio-Med Central, the Open Access Publisher. With open access,

Anyone is free:

  • to copy, distribute, and display the work;
  • to make derivative works;
  • to make commercial use of the work;

Under the following conditions: Attribution

  • the original author must be given credit;
  • for any reuse or distribution, it must be made clear to others what the license terms of this work are;
  • any of these conditions can be waived if the author gives permission.

Statutory fair use and other rights are in no way affected by the above.

Bio-Med Central's Open Access Charter: http://www.biomedcentral.com/info/about/charter
Access to Bio-Med Central journals: http://www.biomedcentral.com/info/about/access


What is a population?

A collection of items of interest in research. The population represents a group that you wish to generalize your research to. Populations are often defined in terms of demography, geography, occupation, time, care requirements, diagnosis, or some combination of the above. Contrast this with a definition of a sample. An example of a population would be:

  • all infants born in the state of Missouri during the 1995 calendar year who have one or more visits to the Emergency room during their first year of life.

This webpage was written by Steve Simon on 2002-10-11, edited by Steve Simon, and was last modified on 2008-07-08. This page needs minor revisions. Category: Definitions, Category: Hypothesis testing.


What is a sample?

A subset of a population. A random sample is a subset where every item in the population has the same probability of being in the sample. Usually, the size of the sample is much less than the size of the population. The primary goal of much research is to use information collected from a sample to try to characterize a certain population. As such, you should pay a lot of attention to how representative the sample is of the population. If there are problems with representativeness, consider redefining your population a bit more narrowly. For example, a sample of 85 smokers between the ages of 13 and 18 in Rochester, Minnesota who respond to an advertisement about participation in a smoking cessation program might not be considered representative of the population of all teenage smokers, because the participants selected themselves. The sample might be more representative if we restrict our population to those teenage smokers who want to quit.


 


What is a Type I Error?

In your research, you specify a null hypothesis (typically labeled H0) and an alternative hypothesis (typically labeled Ha, or sometimes H1). By tradition, the null hypothesis corresponds to no change.

When you are using Statistics to decide between these two hypotheses, you have to allow for the possibility of error. Actually, if you are using any other procedure, you should still allow for the possibility of error, but we statisticians are the only ones honest enough to admit this.

A Type I error is rejecting the null hypothesis when the null hypothesis is true.

You should always remember that it is impossible to prove a negative. Some statisticians will emphasize this fact by using the phrase "fail to reject the null hypothesis" in place of "accept the null hypothesis." The former phrase always strikes me as semantic overkill.

Example

Consider a new drug that we will put on the market if we can show that it is better than a placebo. In this context, H0 would represent the hypothesis that the average improvement (or perhaps the probability of improvement) among all patients taking the new drug is equal to the average improvement (probability of improvement) among all patients taking the placebo.

  • A Type I error would be allowing an ineffective drug onto the market.

Suppose we are comparing two groups of patients, one with a possibly dangerous exposure (e.g., non-ionizing radiation), and the other unexposed. In this context, H0 would represent the hypothesis that the average level of harm (or perhaps the probability of harm) among those with exposure is equal to the average level (probability) of harm among those without the exposure.

  • A Type I error would be condemning an exposure that actually is safe.
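The drug example can be made concrete with a small simulation. The sketch below is illustrative only: it assumes normally distributed outcomes with a known standard deviation of 1, 30 patients per group, and a two-sided z-test at the 0.05 level. The function names `reject_null` and `type_i_error_rate` are hypothetical. Because drug and placebo are given the same true mean, every rejection in the simulation is a Type I error.

```python
import random
import statistics
from math import sqrt


def reject_null(group_a, group_b, sigma=1.0, z_crit=1.96):
    """Two-sided z-test for a difference in means (assumes a known, common sigma)."""
    se = sigma * sqrt(1 / len(group_a) + 1 / len(group_b))
    z = (statistics.fmean(group_a) - statistics.fmean(group_b)) / se
    return abs(z) > z_crit


def type_i_error_rate(n_per_group=30, n_trials=5000, seed=1):
    """Fraction of simulated studies that reject H0 even though H0 is true."""
    rng = random.Random(seed)
    rejections = 0
    for _ in range(n_trials):
        # Assumption: drug and placebo share the same true mean, so H0 holds.
        drug = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        placebo = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        rejections += reject_null(drug, placebo)
    return rejections / n_trials


# The simulated rate should land close to the 0.05 cut-off.
print(type_i_error_rate())
```

In this setup roughly one simulated study in twenty "finds" an effect that does not exist, which is what a 0.05 cut-off is designed to allow.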

This webpage was written by Steve Simon on 2007-04-05, edited by Steve Simon, and was last modified on 2008-07-08. Category: Definitions, Category: Hypothesis testing.


What is a Type II Error?

In your research, you specify a null hypothesis (typically labeled H0) and an alternative hypothesis (typically labeled Ha, or sometimes H1). By tradition, the null hypothesis corresponds to no change.

When you are using Statistics to decide between these two hypotheses, you have to allow for the possibility of error. Actually, if you are using any other procedure, you should still allow for the possibility of error, but we statisticians are the only ones honest enough to admit this.

  • A Type II error is accepting the null hypothesis when the null hypothesis is false.

Many studies have small sample sizes that make it difficult to reject the null hypothesis, even when there is a big change in the data. In these situations, a Type II error might be a possible explanation for the negative study results.

Example

Consider a new drug that we will put on the market if we can show that it is better than a placebo. In this context, H0 would represent the hypothesis that the average improvement (or perhaps the probability of improvement) among all patients taking the new drug is equal to the average improvement (probability of improvement) among all patients taking the placebo.

  • A Type II error would be keeping an effective drug off the market.

Suppose we are comparing two groups of patients, one with a possibly dangerous exposure (e.g., non-ionizing radiation), and the other unexposed. In this context, H0 would represent the hypothesis that the average level of harm (or perhaps the probability of harm) among those with exposure is equal to the average level (probability) of harm among those without the exposure.

  • A Type II error would be absolving an exposure that actually does harm.
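The link between small samples and Type II errors can be sketched with a direct calculation. This is a rough illustration, not a full power analysis: it assumes a two-sided z-test with a known common standard deviation and equal group sizes, and the function name `type_ii_error_probability` and the numbers fed to it are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def type_ii_error_probability(true_diff, sigma, n_per_group, alpha=0.05):
    """Probability of failing to reject H0 when the true difference is true_diff."""
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha / 2)
    se = sigma * sqrt(2 / n_per_group)  # SE of a difference between two means
    shift = true_diff / se
    # The z statistic is Normal(shift, 1); H0 is kept when it lands in [-z_crit, z_crit].
    return std_normal.cdf(z_crit - shift) - std_normal.cdf(-z_crit - shift)


# Assumed scenario: a true difference of half a standard deviation.
for n in (10, 30, 100):
    print(n, round(type_ii_error_probability(0.5, 1.0, n), 2))
```

Under these assumptions a study with 10 patients per group misses the real effect most of the time, while a study with 100 per group rarely does.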

This webpage was written by Steve Simon on 2007-04-05, edited by Steve Simon, and was last modified on 2008-07-08. This page needs minor revisions. Category: Definitions, Category: Hypothesis testing.


What is a p-value?

A p-value is a measure of how much evidence we have against the null hypothesis. The null hypothesis, traditionally represented by the symbol H0, represents the hypothesis of no change or no effect.

The smaller the p-value, the more evidence we have against H0. It is also a measure of how likely we are to get a certain sample result or a result “more extreme,” assuming H0 is true. The type of hypothesis (right tailed, left tailed or two tailed) will determine what “more extreme” means.

Much research involves making a hypothesis and then collecting data to test that hypothesis. In particular, researchers will set up a null hypothesis, a hypothesis that presumes no change or no effect of a treatment. Then these researchers will collect data and measure the consistency of this data with the null hypothesis.

The p-value measures consistency by calculating the probability of observing the results from your sample of data or a sample with results more extreme, assuming the null hypothesis is true. The smaller the p-value, the greater the inconsistency.

Traditionally, researchers will reject a hypothesis if the p-value is less than 0.05. Sometimes, though, researchers will use a stricter cut-off (e.g., 0.01) or a more liberal cut-off (e.g., 0.10). The general rule is that a small p-value is evidence against the null hypothesis while a large p-value means little or no evidence against the null hypothesis. Please note that little or no evidence against the null hypothesis is not the same as a lot of evidence for the null hypothesis.

It is easiest to understand the p-value in a data set that is already at an extreme. Suppose that a drug company alleges that only 50% of all patients who take a certain drug will have an adverse event of some kind. You believe that the adverse event rate is much higher. In a sample of 12 patients, all twelve have an adverse event.

The data supports your belief because it is inconsistent with the assumption of a 50% adverse event rate. It would be like flipping a coin 12 times and getting heads each time.

The p-value, the probability of getting a sample result of 12 adverse events in 12 patients assuming that the adverse event rate is 50%, is a measure of this inconsistency. The p-value, 0.000244, is small enough that we would reject the hypothesis that the adverse event rate was only 50%.
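The calculation behind this p-value is a binomial tail probability. The sketch below reproduces it under the stated assumption of a 50% adverse event rate; the helper name `binomial_upper_tail` is hypothetical.

```python
from math import comb


def binomial_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p): the one-sided p-value for 'k or more events'."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# 12 adverse events in 12 patients, assuming H0: the event rate is 50%.
p_value = binomial_upper_tail(12, 12, 0.5)
print(f"{p_value:.6f}")  # 0.000244
```

Here "more extreme" has nothing left to add, since 12 out of 12 is already the most extreme possible result. With 11 events out of 12, the same function would add the probabilities of 11 and 12 events together, giving 13/4096, or about 0.0032.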

A large p-value should not automatically be construed as evidence in support of the null hypothesis. Perhaps the failure to reject the null hypothesis was caused by an inadequate sample size. When you see a large p-value in a research study, you should also look for one of two things:

  1. a power calculation that confirms that the sample size in that study was adequate for detecting a clinically relevant difference; and/or
  2. a confidence interval that lies entirely within the range of clinical indifference.

You should also be cautious about a small p-value, but for different reasons. In some situations, the sample size is so large that even differences that are trivial from a medical perspective can still achieve statistical significance.

As a statistician, I am not in a good position to advise you on whether a difference is trivial or not. As a medical expert, you need to balance the cost and side effects of a treatment against the benefits that the therapy provides.

The authors of the research paper should inform you what size of difference is clinically relevant and what size is trivial. But if they don't, you should. Ask yourself how much of a difference would be large enough to cause you to change your practice. Then compare this to the confidence interval in the research paper. If both limits of the confidence interval are smaller than a clinically relevant difference, then you should not change your practice, no matter what the p-value tells you.

You should not interpret the p-value as the probability that the null hypothesis is true. Such an interpretation is problematic because a hypothesis is not a random event that can have a probability.

Bayesian statistics provides an alternative framework that allows you to assign probabilities to hypotheses and to modify these probabilities on the basis of the data that you collect.

Example

A large number of p-values appear in a publication by Churchill et al 2000:

  • Consultation Patterns and Provision of Contraception in General Practice Before Teenage Pregnancy: Case-Control Study. Churchill D, Allen J, Pringle M, Hippisley-Cox J, Ebdon D, Macpherson M, Bradley S. British Medical Journal 2000: 321(7259); 486-9.

This was a study of consultation practices among teenagers who become pregnant. The researchers selected 240 patients (cases) with a recorded conception before the age of 20. Three controls were selected for each case and were matched on age and practice.

The not too surprising finding is that the cases were more likely to have consulted certain health professionals in the year before conception and were more likely to request contraceptive protection. This demonstrates that teenagers are not reluctant to seek advice about contraception.

For example, 91% of the cases (219/240) sought the advice of a general practitioner in the year before conception compared to 82% of the controls (586/719) during a similar time frame. This is a large difference. The odds ratio is 2.37. The p-value is 0.001, which indicates that this ratio is statistically significantly different from 1.0. The 95% confidence interval for the odds ratio is 1.45 to 3.86.

In contrast, 23% of the cases (56/240) sought advice from a practice nurse while 24% of the controls (170/719) sought advice. This is a small difference and the odds ratio is 0.98. The p-value is 0.905, which indicates that this odds ratio does not differ significantly from 1. As with any negative finding, you should be concerned about whether the result is due to an inadequate sample size. The confidence interval, however, is 0.69 to 1.39. This indicates that the research study had a good amount of precision and that the sample size was reasonable.

This webpage was written by Steve Simon on 2007-04-05, edited by Steve Simon, and was last modified on 2008-07-14. This page needs minor revisions. Category: Definitions, Category: Hypothesis testing, Category: P-values.


Confidence Intervals.

Dear Professor Mean:  Can you give me a simple explanation of what a confidence interval is?

We statisticians have a habit of hedging our bets. We always insert qualifiers into our reports, warn about all sorts of assumptions, and never admit to anything more extreme than probable. There's a famous saying: "Statistics means never having to say you're certain."

We qualify our statements, of course, because we are always dealing with imperfect information. In particular, we are often asked to make statements about a population (a large group of subjects) using information from a sample (a small, but carefully selected subset of this population). No matter how carefully this sample is selected to be a fair and unbiased representation of the population, relying on information from a sample will always lead to some level of uncertainty.

Short Explanation

A confidence interval is a range of values that tries to quantify this uncertainty. Consider it as a range of plausible values. A narrow confidence interval implies high precision; we can specify plausible values to within a tiny range. A wide interval implies poor precision; we can only specify plausible values to a broad and uninformative range.

Consider a recent study of homoeopathic treatment of pain and swelling after oral surgery (Lokken 1995). When examining swelling 3 days after the operation, they showed that homoeopathy led to 1 mm less swelling on average. The 95% confidence interval, however, ranged from -5.5 to 7.5 mm. From what little I know about oral surgery, this appears to be a very wide interval. This interval implies that neither a large improvement due to homoeopathy nor a large decrement could be ruled out.

Generally when a confidence interval is very wide like this one, it is an indication of an inadequate sample size, an issue that the authors mention in the discussion section of this paper.

How to Interpret a Confidence Interval

When you see a confidence interval in a published medical report, you should look for two things. First, does the interval contain a value that implies no change or no effect? For example, with a confidence interval for a difference look to see whether that interval includes zero. With a confidence interval for a ratio, look to see whether that interval contains one.

Here's an example of a confidence interval that contains the null value. The interval shown below implies no statistically significant change.

Figure 2.1

Here's an example of a confidence interval that excludes the null value. If we assume that larger implies better, then the interval shown below would imply a statistically significant improvement.

Figure 2.2

Here's a different example of a confidence interval that excludes the null value. The interval shown below implies a statistically significant decline.

Figure 2.3

Practical Significance

You should also see whether the confidence interval lies partly or entirely within a range of clinical indifference. Clinical indifference represents values of such a trivial size that you would not want to change your current practice. For example, you would not recommend a special diet that showed a one year weight loss of only five pounds. You would not order a diagnostic test that had a predictive value of less than 50%.

Clinical indifference is a medical judgement, and not a statistical judgement. It depends on your knowledge of the range of possible treatments, their costs, and their side effects. As a statistician, I can only speculate on what a range of clinical indifference is. I do want to emphasize, however, that if a confidence interval is contained entirely within your range of clinical indifference, then you have clear and convincing evidence to keep doing things the same way (see below).

Figure 2.4

On the other hand, if part of the confidence interval lies outside the range of clinical indifference, then you should consider the possibility that the sample size is too small (see below).

Figure 2.5

Some studies have sample sizes that are so large that even trivial differences are declared statistically significant. If your confidence interval excludes the null value but still lies entirely within the range of clinical indifference, then you have a result with statistical significance, but no practical significance (see below).

Figure 2.6

Finally, if your confidence interval excludes the null value and lies outside the range of clinical indifference, then you have both statistical and practical significance (see below).

Figure 2.7
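The cases described in this section can be summarized as a small decision rule. The sketch below is one possible encoding, not a standard procedure: the function name `interpret_interval` and its wording are hypothetical, and the range of clinical indifference must come from your own medical judgement. One case the text does not spell out, an interval that excludes the null value but only partly overlaps the range of indifference, is labeled here as uncertain; that label is an assumption.

```python
def interpret_interval(lower, upper, null_value, indiff_low, indiff_high):
    """Describe a confidence interval relative to the null value and a
    range of clinical indifference [indiff_low, indiff_high]."""
    significant = not (lower <= null_value <= upper)
    inside = indiff_low <= lower and upper <= indiff_high
    outside = upper < indiff_low or lower > indiff_high
    if inside and not significant:
        return "no meaningful difference: keep current practice"
    if inside and significant:
        return "statistically significant, but not practically significant"
    if outside and significant:
        return "statistically and practically significant"
    if significant:
        # Assumption: partial overlap with the indifference range is left undecided.
        return "statistically significant; practical importance uncertain"
    return "inconclusive: sample size may be too small"


# Homoeopathy example (-5.5 to 7.5 mm), with an assumed indifference range of +/- 2 mm.
print(interpret_interval(-5.5, 7.5, 0.0, -2.0, 2.0))
```

With the assumed range of plus or minus 2 mm, the homoeopathy interval is reported as inconclusive, which matches the earlier remark that neither a large improvement nor a large decrement could be ruled out.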

The Standard Error

In many situations, the width of a confidence interval is proportional to the standard error. The standard error is defined as the variability of a statistical estimate. You can compute a crude confidence interval by taking the estimate plus or minus twice the standard error.

Confidence Interval for a Simple Average

There are lots of different formulas for the confidence interval and the standard error, depending on the context of the problem. The simplest formula appears when you estimate an average from a single sample. In this situation, the standard error would be

$$\frac{\sigma}{\sqrt{n}}$$

where sigma represents the variability of the original data and n represents the size of the sample. The crude confidence interval would be the sample mean plus or minus two standard errors.

The width of your confidence interval goes down as the sample size goes up, since you are placing a larger value in the denominator. This is a classic and intuitive relationship in statistics: larger sample sizes provide greater precision (that is, narrower confidence intervals).

One way of planning a sample size for your study is to try to make sure your confidence interval has an adequate amount of precision. Although larger sample sizes mean narrower confidence intervals, there is usually a point of diminishing returns. This occurs when further shrinking of the interval is not worth the cost of additional subjects.

An often overlooked strategy for gaining precision is by finding a way to shrink sigma, the variability in your original data set. For example, use of calibration and quality control checks in a laboratory can often provide substantially smaller values for sigma.
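The relationships in this section can be checked numerically. The sketch below uses the crude "plus or minus two standard errors" rule; the function names and the values of sigma and n are hypothetical and chosen only to make the arithmetic easy to follow.

```python
from math import sqrt


def crude_interval_for_mean(mean, sigma, n):
    """Crude interval: mean plus or minus two standard errors, SE = sigma / sqrt(n)."""
    se = sigma / sqrt(n)
    return mean - 2 * se, mean + 2 * se


def interval_width(sigma, n):
    """Width of the crude interval; it does not depend on the mean itself."""
    low, high = crude_interval_for_mean(0.0, sigma, n)
    return high - low


# Assumed sigma of 10: each fourfold increase in n only halves the width.
for n in (25, 100, 400):
    print(n, interval_width(sigma=10.0, n=n))

# Halving sigma at a fixed n has the same effect as quadrupling n.
print(interval_width(sigma=5.0, n=25))
```

Because the sample size sits under a square root, each further halving of the width costs four times as many subjects. This is where the point of diminishing returns comes from, and why shrinking sigma can be the cheaper route to precision.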

Confidence Interval for a Difference Between Two Averages

If we were interested in estimating the difference in averages between two independent samples of data, the standard error of the estimated difference would be

$$\sqrt{\frac{\sigma_1^2}{n_1} + \frac{\sigma_2^2}{n_2}}$$

where the subscripts 1 and 2 indicate whether the values come from the first or the second group. Notice that the standard error and hence the width of the confidence interval goes down as either or both sample sizes go up.

When you are planning a research study comparing two groups, it is often helpful to consider different allocations of samples to the two groups. For example, if your first group is much more variable than the second group, you might be better off trying for a larger sample size in that group, rather than trying to get equal numbers in each group.
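To see why unequal allocation can help, here is a minimal sketch that tries every split of a fixed total sample between the two groups. The standard deviations (20 and 10) and the total of 90 subjects are made-up values, and the function names are hypothetical.

```python
from math import sqrt


def se_difference(sigma1, n1, sigma2, n2):
    """Standard error of the difference between two independent sample means."""
    return sqrt(sigma1**2 / n1 + sigma2**2 / n2)


def best_allocation(sigma1, sigma2, total_n):
    """Try every split of total_n subjects; return (n1, n2, se) with the smallest SE."""
    best = None
    for n1 in range(1, total_n):
        n2 = total_n - n1
        se = se_difference(sigma1, n1, sigma2, n2)
        if best is None or se < best[2]:
            best = (n1, n2, se)
    return best


# Assumed values: group 1 is twice as variable as group 2, 90 subjects in total.
print(best_allocation(20.0, 10.0, 90))
print(se_difference(20.0, 45, 10.0, 45))  # equal allocation, for comparison
```

With these assumed values the search favors putting about twice as many subjects in the more variable group, and the resulting standard error is somewhat smaller than with an equal split.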

Confidence Interval for a Proportion

If we compute a proportion, p, from a sample, the standard error of that proportion would be

$$\sqrt{\frac{p(1-p)}{n}}$$

Just like the previous examples, larger sample sizes lead to smaller standard errors and narrower confidence intervals.

Did you notice in this formula that the width of the confidence interval is related to the estimate itself? A bit of work with calculus will show you that, assuming the sample size stays the same, the widest confidence interval occurs when p=0.5. Both rarer and more frequent events than 50% will produce narrower intervals.
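If you would rather skip the calculus, a quick numerical check shows the same pattern. The sketch below holds the sample size fixed at an arbitrary 100 and varies p; the function name `se_proportion` is hypothetical.

```python
from math import sqrt


def se_proportion(p, n):
    """Standard error of a sample proportion: sqrt(p * (1 - p) / n)."""
    return sqrt(p * (1 - p) / n)


# Fixed n = 100 (an arbitrary choice); the standard error peaks at p = 0.5.
for p in (0.1, 0.3, 0.5, 0.7, 0.9):
    print(p, round(se_proportion(p, 100), 3))
```

The standard errors are symmetric around 0.5, so a proportion of 0.1 is estimated with the same precision as a proportion of 0.9.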

Confidence Interval for an Odds Ratio

The final example involves computing an odds ratio. We often use the odds ratio to summarize data in a two by two table. The rows of the table might represent disease status (healthy/diseased) and the columns might represent exposure status (exposed/unexposed). In this case, the odds ratio would represent the relative change in the odds of disease between exposed and unexposed patients.

Or possibly the rows might represent treatment status (active drug/placebo) and the columns might represent health outcome (improvement/no improvement). Here, the odds ratio represents the relative change in the odds of improvement between drug and placebo.

If we let the letters a, b, c, and d represent the frequency counts in a two by two table (see below)

    a   b
    c   d

then the odds ratio would be ad/bc. The odds ratio is skewed, so we cannot easily compute a standard error for the odds ratio itself. We can, however, find a standard error for the natural logarithm of the odds ratio. It is simply

$$\sqrt{\frac{1}{a} + \frac{1}{b} + \frac{1}{c} + \frac{1}{d}}$$

We see that as any or all of the counts in the two by two table increase, the confidence interval for the log odds ratio shrinks. Also, it turns out that the smallest count in the two by two table plays the largest role in determining the size of the standard error.

Example of a Confidence Interval For a Mean

In a study of immunotherapy in children with asthma, 61 patients showed an average improvement of 2.5% peak expiratory flow rate with a standard deviation of 11%. We divide the standard deviation by the square root of 61 to get a standard error of 1.4. A crude confidence interval would be 2.5% plus or minus 2.8% which equals -0.3% to 5.3%. I'm not an expert on asthma, but if we defined a range of clinical indifference to be an improvement of less than 5%, then this confidence interval lies almost entirely within the range of clinical indifference; only the upper limit slightly exceeds 5%.
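The arithmetic in this example can be checked with a few lines of code. This is only a sketch of the crude method used here (the estimate plus or minus two standard errors); the function name and the multiplier argument are my own choices for illustration.

```python
import math

def crude_ci_mean(mean, sd, n, multiplier=2.0):
    """Crude confidence interval: mean plus or minus `multiplier` standard errors."""
    se = sd / math.sqrt(n)
    return mean - multiplier * se, mean + multiplier * se

# Figures quoted in the asthma example: mean 2.5%, SD 11%, n = 61.
lower, upper = crude_ci_mean(2.5, 11.0, 61)
print(f"Crude interval: {lower:.1f}% to {upper:.1f}%")
```

A more careful interval would replace the multiplier of 2 with a percentile from the t distribution, but with 61 patients the difference is small.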

Example of a Confidence Interval for An Odds Ratio

In the same study, the author noted that 15 out of 53 immunotherapy patients showed partial remission on their need for medication. This sample size is smaller because of a small number of dropouts. In the placebo group, 12 out of 57 showed partial remission. The two by two table for these data looks like

                 Partial remission   No partial remission   Total
Immunotherapy           15                    38              53
Placebo                 12                    45              57

The odds ratio is (15*45)/(38*12) = 1.5, which suggests that the immunotherapy treatment increases the odds of partial remission by about 50%. The natural log of the odds ratio is 0.4. For this calculation, be sure that you use a natural logarithm and not a base 10 logarithm.

The standard error of the log odds ratio is

sqrt(1/15 + 1/38 + 1/12 + 1/45) = 0.45

So a crude confidence interval for the log odds ratio is 0.4 plus or minus 0.9 which equals -0.5 to 1.3. We can exponentiate (use the exp button on your scientific calculator) to convert back to the original measurement scale. Working from the unrounded limits, this gives us a confidence interval of 0.6 to 3.6 for the odds ratio itself. Even though this interval contains 1, we still have to allow for the possibility that the improvement might be as large as two-fold or three-fold.
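If you want to reproduce these steps without a calculator, here is a short sketch. It assumes the crude plus or minus two standard errors rule used in this example, with the cell counts taken from the figures quoted above (15 of 53 on immunotherapy and 12 of 57 on placebo, so 38 and 45 without partial remission). The function name is hypothetical.

```python
import math

def crude_ci_odds_ratio(a, b, c, d, multiplier=2.0):
    """Odds ratio ad/bc with a crude interval computed on the natural log scale."""
    odds_ratio = (a * d) / (b * c)
    log_or = math.log(odds_ratio)  # math.log is the natural logarithm
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    # Build the interval on the log scale, then exponentiate back.
    lower = math.exp(log_or - multiplier * se)
    upper = math.exp(log_or + multiplier * se)
    return odds_ratio, log_or, se, lower, upper

# a, b = immunotherapy with / without partial remission
# c, d = placebo with / without partial remission
odds_ratio, log_or, se, lower, upper = crude_ci_odds_ratio(15, 38, 12, 45)
print(f"odds ratio = {odds_ratio:.1f}, log odds ratio = {log_or:.1f}, SE = {se:.2f}")
print(f"crude interval for the odds ratio: {lower:.1f} to {upper:.1f}")
```

Notice that the interval is not symmetric around the odds ratio on the original scale. That is a consequence of doing the calculation on the log scale and converting back.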

Summary

A confidence interval is a range of plausible values that accounts for uncertainty in a statistical estimate. A narrow confidence interval implies high precision; a wide interval implies poor precision.

When you see a confidence interval in a published medical report, you should look for two things.

  1. Does the interval contain a value that implies no change or no effect?
  2. Does the confidence interval lie partly or entirely within a range of clinical indifference?
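
These two questions can be written down as a simple check. This is only an illustrative sketch: the function name is hypothetical, and the range of clinical indifference is something you have to supply from clinical judgment, not something the data can tell you.

```python
def review_interval(lower, upper, null_value, indiff_low, indiff_high):
    """Answer the two questions for a confidence interval (lower, upper).

    null_value is the value implying no effect (0 for a difference,
    1 for an odds ratio). indiff_low and indiff_high bound the range
    of clinical indifference, which is an assumption supplied by the reader.
    """
    contains_null = lower <= null_value <= upper
    if indiff_low <= lower and upper <= indiff_high:
        overlap = "entirely"
    elif upper < indiff_low or lower > indiff_high:
        overlap = "not at all"
    else:
        overlap = "partly"
    return contains_null, overlap

# Hypothetical difference in means: interval 1.0 to 4.0, no effect = 0,
# changes smaller than 5 in either direction judged clinically unimportant.
print(review_interval(1.0, 4.0, 0.0, -5.0, 5.0))

# Hypothetical odds ratio: interval 0.6 to 3.6, no effect = 1,
# odds ratios between 0.8 and 1.25 judged clinically unimportant.
print(review_interval(0.6, 3.6, 1.0, 0.8, 1.25))
```

The first interval excludes zero but sits entirely inside the range of clinical indifference. The second contains 1 but extends well beyond the range, so an important effect cannot be ruled out.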

This webpage was written by Steve Simon on (unknown date), edited by Steve Simon and Linda Foland, and was last modified on 2008-07-14. Category: Confidence intervals, Category: Statistical evidence


Please fill out an evaluation form. Your input is important. These evaluation forms also ensure that we can offer Continuing Medical Education credits for this class.