Children's Mercy Hospital
Find a Doctor | Press Room | Careers | Directions & Locations

About Us | Contact Us | Giving to Children's Mercy
For Patients and Families   Your Child's Health   Clinical Services   |   For Health Care Professionals   Medical Education   Medical Research

Stats #25: What Do All These Numbers Mean? Regression Coefficients

Content:  This two hour training class will teach you to interpret results from linear and logistic regression models. This class is useful for anyone who is involved with producing or interpreting medical research.

Objectives:  In this class you will learn how to:

  • Interpret the slope coefficient in a linear regression model,
  • Identify problems with extrapolation in linear regression, and
  • Explain why the logistic regression model uses log odds units.

Teaching strategies:  Didactic lectures and small group exercises.

IRB Education Credits:  This class does not qualify for IRB Education Credits (IRBECs).

Outline:


Overview of the STATS web pages (January 21, 2000)

What are the STATS web pages?

The STATS pages are a collection of handouts that I use in my job as a statistical consultant. The web provides a nice home for these handouts, because as I update my material, the newest version is immediately available to anyone who is interested.

Where can I find STATS?

If you have a web browser, like Internet Explorer or Netscape Navigator, you can surf on over to my site,

http://www.childrensmercy.org/stats

which is also found at http://internet1/stats, if you are attached to the Children's Mercy Hospital network. There are two obsolete sites: http://www.cmh.edu/stats and http://simon/stats. Do not use either of these sites.

Some of the fun stuff you can find on the STATS web pages.

Ask Professor Mean.  For the tough Statistics questions that Dear Abby won't touch.

Planning Your Research Study.  Things you need to plan for before you start collecting your data.

Selecting An Appropriate Sample Size.  How much data do you really need?

Managing Your Research Data.  Everything you want to know before you step to the keyboard.

Steps In a Typical Data Analysis.  I have my data on the computer. Now what?

How to Read a Medical Journal Article.  Reading a journal is hard work. Here's some help.

Professor Mean's Library.  Good books and good web sites about Statistics.

... and even more good stuff!!!

This webpage was written by Steve Simon, edited by Linda Foland, and was last modified on 07/08/2008. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Website details


For CMH employees only: Statistical Consulting Services.

You can get free statistical consulting if you work for Children's Mercy Hospital. Steve Simon and Ashley Sherman provide a wide range of statistical consulting services to help you with your research projects. This help can start as early as the initial planning of your research. I also help with the analysis of your data, using SPSS or other statistical software. We can also provide assistance with the preparation of your presentations and publications.

Here area some examples of the services that we have provided:

  • setting up your research hypothesis,
  • selecting and justifying your sample size,
  • writing the statistical methods section for your grant,
  • preparing randomization tables for your study,
  • reviewing your surveys for content and quality,
  • developing a system for entering your data,
  • choosing an appropriate statistical model for your data,
  • establishing validity and/or reliability for your measurement scales,
  • checking for violations of statistical assumptions in your data,
  • producing graphs and tables for your research publication, and
  • providing references for new and unusual statistical methods.

Specific statistical advice has been outlined on a series of web pages which can be found at http://www.childrensmercy.org/stats/. The pages provide advice about planning your research, selecting an appropriate sample size, managing your research data, performing a variety of data analyses, presenting research data, and writing research papers.

How to get in touch with a statistician

If you would like to meet with Steve Simon or Ashley Sherman, you can set up an appointment by emailing or calling Judy Champion (jmchampion (at) cmh (dot) edu or 816-983-6784). If you have a very simple question, send an email directly to us (ssimon (at) cmh (dot) edu and aksherman (at) cmh (dot) edu).

This webpage was written by Steve Simon on 2003-04-30, edited by Steve Simon, and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Directions to my new office (April 25, 2008).

I have moved to a new office. It is a modular building just north of Children's Mercy Hospital. It is between 23rd and 22nd street, just off of Kenwood Avenue (Kenwood is a small north/south street just west of Holmes). If you need to get from your office to mine, here are some directions written by my Administrative Assistant, Judy Champion.

  • Take the elevator of the research tower down to the yellow level. Exit the employee parking garage on 23rd Street, walk to Kenwood and cross 23rd Street. Your destination is Building M 3 which is the building closest to 22nd Street. However, the entrance to our building faces Building M 2. It’s best to walk into the parking area that is just north of Building M 1 and follow the sidewalk around the west side of building M 2 in order to get to our building’s entrance on its south side. Another route would be to exit the Hospital Hill Center Building on Holmes and then walk ½ block north to 23rd Street, cross 23rd Street, walk west to Kenwood then north to building M 3 address 2220 Kenwood.

This webpage was written by Steve Simon and was last modified on 2008-07-14. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Professional details


Interpreting coefficients in a linear regression model

When I ask most people to remember their high school algebra class, I get a mixture of reactions. Most recoil in horror. About one in every four people say they liked that class. Personally, I thought that algebra, and all the other math classes I took were great because they didn't require writing a term paper.

One formula in algebra that most people can recall is the formula for a straight line. Actually, there are several different formulas, but the one that most people cite is

Y = m X + b

where m represents the slope, and b represents the y-intercept (we'll call it just the intercept here). They can also sometimes remember the formula for the slope:

m = Δy / Δx

In English, we would say that this is the change in y divided by the change in x.

In linear regression, we use a straight linear to estimate a trend in data. We can't always draw a straight line that passes through every data point, but we can find a line that "comes close" to most of the data. This line is an estimate, and we interpret the slope and the intercept of this line as follows:

The slope represents the estimated average change in Y when X increases by one unit.

The intercept represents the estimated average value of Y when X equals zero.

Be cautious with your interpretation of the intercept. Sometimes the value X=0 is impossible, implausible, or represents a dangerous extrapolation outside the range of the data.

The graph shown below represents the relationship between mother's age and the duration of breast feeding in a research study on breast feeding in pre-term infants.

The regression coefficients are shown below. The intercept, 6, is represented the estimated average duration of breast feeding for a mother that is zero years old. This is an impossible value, so the interpretation is not useful. What is useful, is the interpretation of the slope, approximately 0.4. The estimated average duration of breast feeding increases by 0.4 weeks for every extra year in the mother's age.

When X is categorical, the interpretation changes somewhat. Let's look at the simplest situation, a binary variable. A binary variable can have only two possible categories. Some examples are live/dead, treatment/control, diseased/healthy, male/female. We need to assign number codes to the categories. Most people assign the codes 1 and 2, but it is actually better to assign the codes 0 and 1.

In a study of breast feeding, we have a treatment group and a control group. Let us label the treatment group as 1 and the control group as 0. The outcome variable is the age when breast feeding stopped.

The control group had a mean duration of breast feeding just a bit larger than 13. The mean for the treatment group is just a bit larger than 20. Notice that the regression line shown above connects the two means.

In this situation, the intercept, 13, represents the average duration for the control group. The slope is 7, which is the change in the average duration when we move from the control group to the treatment group.

We could have just as easily labeled the treatment group as 0 and the control group as 1. If we did that, we would get a graph that looks like the following:

Here, the intercept, 20, represents the mean of the treatment group. The slope, -7, represents the change in average duration as we move from the treatment group to the control group. It is actually this reverse coding that SPSS chooses as a default.

Neither coding is correct or incorrect. Just make sure that you understand the difference. If you get a slope that is in the opposite direction of what you expected, perhaps it is because your software is using a different coding than what you expected.

When we represent a binary variable using 0-1 coding, the slope represents the estimated average change in Y when you switch from one group to the other.

The intercept represents the estimated average value of Y for the group coded as zero.

The interpretation of the regression coefficient for a categorical variable with more than two values is a bit trickier, and we will discuss it in a separate handout.

Florida election results example

There has been much discussion of the unusual ballot format in Palm Beach County, Florida and how it may have led several thousand voters to cast their ballots for Buchanan rather than Gore. Greg Adams and Chris Fastnow (http://madison.hss.cmu.edu/) have performed a regression analysis of the county by county returns in Florida. A portion of their analyses is recomputed below using SPSS. My model is slightly different than the model of Adams and Fastnow, but it reaches the same general conclusion: the vote for Buchanan in Palm Beach County, Florida is a couple of thousand votes higher than you would expect, if it behaved similar to other Florida Counties.

The story behind the Palm Beach County vote is quite controversial. I discuss it, not to revive the controversy, but to illustrate some concepts in linear regression. Some background information is at:

The ballot in Palm Beach county was (according to some people) confusing, because it used a staggered two page format. The layout would lead some people who thought they were voting for Al Gore to instead cast a ballot for Patrick Buchanan instead. Exactly how confusing the ballot was and is still open to debate. Several statisticians offered analyses that tried to estimate how many votes for Buchanan might actually be votes for Gore. These analyses are far more detailed than what I present here. My analysis is more useful for helping you to understand concepts in regression than for resolving this voting controversy. 

The scatter plot shown above shows the number of votes for George Bush in each Florida County versus the number of votes for Patrick Buchanan in the same county.

The regression equation is:

Votes (Buchanan) = 45.3 +0.0049 * Votes (Bush)

Notice that 0.0049 is roughly 1/500. How do we interpret these numbers? The intercept is 45, which means that the estimated average number of votes for Buchanan would be 45 in a county with zero votes for Bush. This is an extrapolation, as every single county in Florida had thousands of votes for Bush. The slope is 1/200, which means that the estimated average number of votes for Buchanan increases by 1/200 for each additional vote for Bush. In other words, each additional 200 votes for Bush is associated with an increase of 1 vote for Buchanan.

We can compute a predicted number of votes for Buchanan for each county by using the above equation. Palm Beach County had 152,846 votes for Bush. So the regression model would predict that Buchanan should get:

45 + 0.0049 * 152,846 =  797.

Thus, if the relationship observed across the entire state held exactly in Palm Beach County, then we would estimate the vote count for Buchanan to be 797.

There were actually 3,407 votes recorded for Buchanan, which is quite a discrepancy from what we predicted. The residual, the difference between what we observed and what would be predicted by the regression model is:

3,407 - 797 = 2,610.

One possible interpretation is that this discrepancy represents an estimate of the number of people who voted incorrectly for Buchanan. Such an interpretation would have to consider other possibilities, though. Is there something unique about Palm Beach County that would cause that county to vote in disproportionate numbers for Buchanan? Buchanan does indeed have some relatives in the area, and although they do not number in the thousands, perhaps they exerted some influence on their community.

Other information might tend to corroborate that a large number of votes were cast erroneously for Buchanan. Some of the highest vote counts for Buchanan were in precincts that were most heavily Democratic. There were also a large number of complaints received by the election board prior to anyone knowing how close the vote count in Florida would be.

There are other models that have been considered for the Palm Beach County vote, and most of them show a similar size discrepancy between the observed vote and the vote that would be predicted the regression model. It would set a dangerous precedent, of course, to use a statistical model to adjust vote counts, so this example is more for understanding what might have gone wrong and the magnitude of the error made.

This webpage was written by Steve Simon on 2002-06-24, edited by Linda Foland and Steve Simon and was last modified on 2008-07-08. Send feedback to ssimon at cmh dot edu or click on the email link at the top of the page. Category: Linear regression


The concepts behind the logistic regression model (July 23, 2002) Category: Logistic regression

The logistic regression model is a model that uses a binary (two possible values) outcome variable. Examples of a binary variable are mortality (live/dead), and morbidity (healthy/diseased). Sometimes you might take a continuous outcome and convert it into a binary outcome. For example, you might be interested in the length of stay in the hospital for mothers during an unremarkable delivery. A binary outcome might compare mothers who were discharged within 48 hours versus mothers discharged more than 48 hours.

The covariates in a logistic regression model represent variables that might be associated with the outcome variable. Covariates can be either continuous or categorical variables.

For binary outcomes, you might find it helpful to code the variable using indicator variables. An indicator variable equals either zero or one. Use the value of one to represent the presence of a condition and zero to represent absence of that condition. As an example, let 1=diseased, 0=healthy.

Indicator variables have many nice mathematical properties. One simple property is that the average of an indicator variable equals the observed probability in your data of the specific condition for that variable.

A logistic regression model examines the relationship between one or more independent variable and the log odds of your binary outcome variable. Log odds seem like a complex way to describe your data, but when you are dealing with probabilities, this approach leads to the simplest description of your data that is consistent with the rules of probability.

Let's consider an artificial data example where we collect data on the gestational age of infants (GA), which is a continuous variable, and the probability that these infants will be breast feeding at discharge from the hospital (BF), which is a binary variable. We expect an increasing trend in the probability of BF as GA increases. Premature infants are usually sicker and they have to stay in the hospital longer. Both of these present obstacles to BF.

A linear model for probability

A linear model would presume that the probability of BF increases as a linear function of GA. You can represent a linear function algebraically as

prob BF = a + b*GA

This means that each unit increase in GA would add b percentage points to the probability of BF. The table shown below gives an example of a linear function.

This table represents the linear function

prob BF = 4 + 2*GA

which means that you can get the probability of BF by doubling GA and adding 4. So an infant with a gestational age of 30 would have a probability of 4+2*30 = 64.

A simple interpretation of this model is that each additional week of GA adds an extra 2% to the probability of BF. We could call this an additive probability model.

I'm not an expert on BF; what little experience I've had with the topic occurred over 40 years ago. But I do know that an additive probability model tends to have problems when you get probabilities close to 0% or 100%. Let's change the linear model slightly to the following:

prob BF = 4 + 3*GA

This model would produce the following table of probabilities.

You may find it difficult to explain what a probability of 106% means. This is a reason to avoid using a additive model for estimating probabilities. In particular, try to avoid using an additive model unless you have good reason to expect that all of your estimated probabilities will be between 20% and 80%.

A multiplicative model for probability

It's worthwhile to consider a different model here, a multiplicative model for probability, even though it suffers from the same problems as the additive model.

In a multiplicative model, you change the probabilities by multiplying rather than adding. Here's a simple example.

In this example, each extra week of GA produces a tripling in the probability of BF. Contrast this to the linear models shown above, where each extra week of GA adds 2% or 3% to the probability of BF.

A multiplicative model can't produce any probabilities less than 0%, but it's pretty easy to get a probability bigger than 100%. A multiplicative model for probability is actually quite attractive, as long as you have good reason to expect that all of the probabilities are small, say less than 20%.

The relationship between odds and probability

Another approach is to try to model the odds rather than the probability of BF. You see odds mentioned quite frequently in gambling contexts. If the odds are three to one in favor of your favorite football team, that means you would expect a win to occur about three times as often as a loss. If the odds are four to one against your team, you would expect a loss to occur about four times as often as a win.

You need to be careful with odds. Sometimes the odds represent the odds in favor of winning and sometimes they represent the odds against winning. Usually it is pretty clear from the context. When you are told that your odds of winning the lottery are a million to one, you know that this means that you would expect to having a losing ticket about a million times more often than you would expect to hit the jackpot.

It's easy to convert odds into probabilities and vice versa. With odds of three to one in favor, you would expect to see roughly three wins and only one loss out of every four attempts. In other words, your probability for winning is 0.75.

If you expect the probability of winning to be 20%, you would expect to see roughly one win and four losses out of every five attempts. In other words, your odds are 4 to 1 against.

The formulas for conversion are

odds = prob / (1-prob)

and

prob = odds / (1+odds).

In medicine and epidemiology, when an event is less likely to happen and more likely not to happen, we represent the odds as a value less than one. So odds of four to one against an event would be represented by the fraction 1/4 or 0.25. When an event is more likely to happen than not, we represent the odds as a value greater than one. So odds of three to one in favor of an event would be represented simply as an odds of 3. With this convention, odds are bounded below by zero, but have no upper bound.

A log odds model for probability

Let's consider a multiplicative model for the odds (not the probability) of BF.

This model implies that each additional week of GA triples the odds of BF. A multiplicative model for odds is nice because it can't produce any meaningless estimates.

It's interesting to look at how the logarithm of the odds behave.

Notice that an extra week of GA adds 1.1 units to the log odds. So you can describe this model as linear (additive) in the log odds. When you run a logistic regression model in SPSS or other statistical software, it uses a model just like this, a model that is linear on the log odds scale. This may not seem too important now, but when you look at the output, you need to remember that SPSS presents all of the results in terms of log odds. If you want to see results in terms of probabilities instead of logs, you have to transform your results.

Let's look at how the probabilities behave in this model.

Notice that even when the odds get as large as 27 to 1, the probability still stays below 100%. Also notice that the probabilities change in neither an additive nor a multiplicative fashion.

A graph shows what is happening.

The probabilities follow an S-shaped curve that is characteristic of all logistic regression models. The curve levels off at zero on one side and at one on the other side. This curve ensures that the estimated probabilities are always between 0% and 100%.

An example of a log odds model with real data

There are other approaches that also work well for this type of data, such as a probit model, that I won't discuss here. But I did want to show you what the data relating GA and BF really looks like.

I've simplified this data set by removing some of the extreme gestational ages. The estimated logistic regression model is

log odds = -16.72 + 0.577*GA

The table below shows the predicted log odds, and the calculations needed to transform this estimate back into predicted probabilities.

Let's examine these calculations for GA = 30. The predicted log odds would be

log odds = -16.72 + 0.577*30 = 0.59

Convert from log odds to odds by exponentiating.

odds = exp(0.59) = 1.80

And finally, convert from odds back into probability.

prob = 1.80 / (1+1.80) = 0.643

The predicted probability of 64.3% is reasonably close to the true probability (77.8%).

You might also want to take note of the predicted odds. Notice that the ratio of any odds to the odds in the next row is 1.78. For example,

3.20/1.80 = 1.78

5.70/3.20 = 1.78

It's not a coincidence that you get the same value when you exponentiate the slope term in the log odds equation.

exp(0.59) = 1.78

This is a general property of the logistic model. The slope term in a logistic regression model represents the log of the odds ratio representing the increase (decrease) in risk as the independent variable increases by one unit.

Categorical variables in a logistic regression model

You treat categorical variables in much the same way as you would in a linear regression model. Let's start with some data that listed survival outcomes on the Titanic. That ship was struck by an iceberg and 863 passengers died out of a total of 1,313. This happened during an era where there was a strong belief in "women and children" first.

You can see this in the crosstabulation shown above. Among females, the odds of dying were 2-1 against, because the number of survivors (308) was twice as big as the number who died (154). Among males, the odds of dying were almost 5 to 1 in favor (actually 4.993 to 1), since the number who survived (142) was about one-fifth the number who died (709).

The odds ratio is 0.1, and we are very confident that this odds ratio is less than one, because the confidence interval goes up to only 0.13. Let's analyze this data by creating an indicator variable for sex.

In SPSS, you would do this by selecting TRANSFORM | RECODE from the menu

Then click on the OLD AND NEW VALUES button.

Here, I use the codes of 0 for female and 1 for male. To run a logistic regression in SPSS, select ANALYZE | REGRESSION | BINARY LOGISTIC from the menu.

Click on the OPTIONS button.

Select the CI for exp(B) option, then click on the CONTINUE button and then on the OK button. Here is what the output looks like:

Let's start with the CONSTANT row of the data. This has an interpretation similar to the intercept in the linear regression model. the B column represents the estimated log odds when SexMale=0. Above, you saw that the odds for dying were 2 to 1 against for females, and the natural logarithm of 2 is 0.693. The last column, EXP(B) represents the odds, or 2.000. You need to be careful with this interpretation, because sometimes SPSS will report the odds in favor of an event and sometimes it will report the odds against an event. You have to look at the crosstabulation to be sure which it is.

The SexMale row has an interpretation similar to the slope term in a linear regression model. The B column represents the estimated change in the log odds when SexMale increases by one unit. This is effectively the log odds ratio. We computed the odds ratio above, and -2.301 is the natural logarithm of 0.1. The last column, EXP(B) provides you with the odds ratio (0.100).

Coding is very important here. Suppose you had chosen the coding for SexFemale where1=female and 0=male.

Then the output would look quite different.

The log odds is now -1.608 which represents the log odds for males. The log odds ratio is now 2.301 and the odds ratio is 9.986 (which you might want to round to 10).

SPSS will create an indicator variable for you if you click on the CATEGORICAL button in the logistic regression dialog box.

If you select LAST as the reference category, SPSS will use the code 0=male, 1=female (last means last alphabetically). If you select FIRST as the reference category, SPSS will use the code 0=female, 1=male.

How would SPSS handle a variable like Passenger Class, which has three levels

  • 1st,
  • 2nd,
  • 3rd?

Here's a crosstabulation of survival versus passenger class.

Notice that the odds of dying are 0.67 to 1 in 1st class, 1.35 to 1 in 2nd class, and 4.15 to 1 in 3rd class. These are odds in favor of dying. The odds against dying are 1.50 to 1, 0.74 to 1, and 0.24 to 1, respectively.

The odds ratio for the pclass(1) row is 6.212, which is equal to 1.50 / 0.24. You should interpret this as the odds against dying are 6 times better in first class compared to third class. The odds ratio for the pclass(2) row is 3.069, which equals 0.74 / 0.24. This tells you that the odds against dying are about 3 times better in second class compared to third class. The Constant row tells you that the odds are 0.241 to 1 in third class.

If you prefer to do the analysis with each of the other classes being compared back to first class, then select FIRST for reference category.

 This produces the following output:

Here the pclass(1) row provides an odds ratio of 0.494 which equals 0.74 / 1.50. The odds against dying are about half in second class versus first class. The pclass(2) provides an odds ratio of 0.161 (approximately 1/6) which equals 0.24 / 1.50. The odds against dying are 1/6 in third class compared to first class. The Constant row provides an odds of 1.496 to 1 against dying for first class.

Notice that the numbers in parentheses (pclass(1) and pclass(2)) do not necessarily correspond to first and second classes. It depends on how SPSS chooses the indicator variables. How did I know how to interpret the indicator variables and the odds ratios? I wouldn't have known how to do this if I hadn't computed a crosstabulation earlier. It is very important to do a few simple crosstabulations before you run a logistic regression model, because it helps you orient yourself to the data.

Looking at a variable both continuously and categorically.

Suppose you wanted to further explore the concept of "women and children first" by looking at how age affects the odds against survival. Here is a simple model that includes age as a continuous variable:

The odds ratio for age is 0.991. This tells you that the odds against dying drop by about 1% for each year of life. In other words, the older you are the more likely you are to die. Let's re-interpret this in terms of decades. A full decade change in age would lead to a -0.009 * 10 unit change in the log odds ratio. This leads to an odds ratio of 0.91 per decade, which says the odds against dying decline by about 9% for each extra decade of life.

I have a problem with this model, because it assumes that the effect of one year of age is the same for children, teenagers, adults, and very old people. I suspect that perhaps the effect of age is very strong for young people but it levels off in adults. The chances of dying for a 30 year old would probably not be too much different than for a 40 year old. Let's split the data into different age groups and see how each of them compares.

To do this in SPSS, select TRANSFORM | RANK CASES from the menu.

Click on the RANK TYPES button.

Select Ntiles and choose 10 for the number of groups. This is a somewhat arbitrary choice. You want enough groups to allow for subtle changes across the age range, but if there are too many groups, you lose too much precision.

Here's what you get

Notice that the first group has an average age of 6.6. We'll call those the young children. The next group has an average age of 17.9 which looks more like older teenagers. The remaining groups represent adults of various ages. We have about 70-80 people in each age group. Here is a crosstabulation of age groups by survival.

Notice that the chances for survival are best for the young children. Survival also seems to be high for older teenagers and some of the older adult categories. The worst survival rates are in age groups 3, 4, 5, 6, and 8. Perhaps there is some tendency in this data for "women, children, and old people" first. Here's what a logistic regression model looks like for age groups.

All age groups have a decreased chance of surviving relative to the young children group. You might want to investigate this further looking at combining some or all of the adult groups and splitting the young children group into further subcategories.

This webpage was written by Steve Simon and was last modified on 07/08/2008.


Interactions in logistic regression (April 8, 2004)

Someone asked me how to compute interactions in binary logistic regression. You need to be careful, since interactions are tricky to interpret. I need to update my pages to offer a detailed description of what an interaction is and how to interpret it properly.

I did find the time to run a simple example of interaction using the Titanic mortality data set. In that data set, women had much better survival probabilities (this was during an era when people believed in the phrase "Women and children first"). The effect, however, is not as strong in 3rd class passengers as it is in first and second class passengers.

When you calculate a separate odds ratio for each passenger class, you get a value of 30 for first class, 42 for second class, and 4.6 for third class.

This disparity is the classic example of interaction--the effect of one variable becomes stronger or weaker depending on the level of the other variable.

When you fit a binary logistic model with an interaction, the estimates look like this:

The last passenger class (third class) becomes the reference level. The odds ratio for sex, 4.6, is the odds ratio within the third passenger class. The odds ratio for the interaction between pclass(1) and sex is 6.6, which is shows how much larger the odds ratio would be in first class compared to third class. It is literally a ratio of odds ratios (30/4.6). The odds ratio for the interaction between pclass(2) and sex is 9.3, which again shows how much larger the odds ratio is in second class compared to third class (9.3 = 43/4.6).

If you detect a significant interaction in your data set, the simplest thing to do is to estimate odds ratios separately for one of the two factors involved in the interaction. For example, someone wrote in asking about an interaction between gender (male/female) and alcohol use (yes/no) in a logistic regression model predicting dating violence. The simplest thing is to look at the odds ratio relating drinking and dating violence first for males only and then for females only. That seems like the logical choice, but you could also look at the odds ratio for gender and dating violence, first for the drinkers and then for the teetotalers.

This webpage was written by Steve Simon and was last modified on 07/08/2008. Category: Logistic regression


Please fill out an evaluation form. Your input is important. These evaluation forms also ensure that we can offer Continuing Medical Education credits for this class.