Working With Dummy Variables

Regression analysis is designed for numerical variables. Results have a valid interpretation only if it makes sense to assume that a value of 2 on some variable does indeed mean having twice as much of something as a value of 1, and that a value of 50 means having 50 times as much as a value of 1.

However, social scientists often need to work with categorical variables in which the different values have no real numerical relationship with each other. Examples include variables for race, political affiliation, or marital status. If you have a variable for political affiliation with possible responses including Democrat, Independent, and Republican, it obviously doesn't make sense to assign values of 1 through 3 and interpret that as meaning that a Republican is somehow three times as politically affiliated as a Democrat.

The solution is to use dummy variables, which are variables with only two values, zero and one. It does make sense to create a variable called "Republican" and interpret it as meaning that someone assigned a 1 on this variable is Republican and someone assigned a 0 is not.
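As a minimal sketch of this idea, the Python snippet below builds a "Republican" dummy variable from a list of survey responses. The responses and the variable names are made up for illustration and do not come from any real dataset.

```python
# Hypothetical survey responses (made-up values for illustration).
affiliations = ["Democrat", "Republican", "Independent", "Republican"]

# The dummy variable is 1 if the respondent is Republican and 0 otherwise.
republican = [1 if a == "Republican" else 0 for a in affiliations]

print(republican)  # [0, 1, 0, 1]
```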

Nominal variables with multiple levels

If you have a nominal variable that has more than two levels, you need to create multiple dummy variables to "take the place of" the original nominal variable. For example, imagine that you wanted to predict depression from year in school: freshman, sophomore, junior, or senior. Obviously, "year in school" has more than two levels.

What you need to do is to recode "year in school" into a set of dummy variables, each of which has two levels. The first step in this process is to decide the number of dummy variables. This is easy; it's simply k-1, where k is the number of levels of the original variable.

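Alternatively, you could create dummy variables for all k levels of the original variable and simply leave one out of each analysis. One has to be left out because, in a model with an intercept, the full set of k dummies always sums to 1 for every case and is therefore perfectly collinear with the intercept.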
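The sketch below illustrates this alternative for "year in school", assuming a small made-up sample. It creates one dummy per level and then leaves one out for a given analysis. The helper name drop_level is hypothetical. The row sums show that the full set of dummies always adds up to 1 for each case.

```python
LEVELS = ["freshman", "sophomore", "junior", "senior"]

# Hypothetical sample of students (made-up values for illustration).
years = ["junior", "freshman", "senior", "sophomore", "junior"]

# One dummy per level: k = 4 dummies for each student.
full = [{level: int(year == level) for level in LEVELS} for year in years]

# Each student belongs to exactly one level, so every row sums to 1.
row_sums = [sum(row.values()) for row in full]

def drop_level(rows, omitted):
    """Return the dummy rows with the omitted level's dummy left out."""
    return [{k: v for k, v in row.items() if k != omitted} for row in rows]

# Different analyses can leave out different levels.
without_freshman = drop_level(full, "freshman")
without_senior = drop_level(full, "senior")

print(row_sums)             # [1, 1, 1, 1, 1]
print(without_freshman[0])  # {'sophomore': 0, 'junior': 1, 'senior': 0}
```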

In this instance, we would need to create 4-1=3 dummy variables. In order to create these variables, we are going to take 3 of the levels of "year in school" and create a variable corresponding to each level, which will have the value of yes or no (i.e., 1 or 0). In this instance, we can create variables called "sophomore," "junior," and "senior." Each instance of "year in school" would then be recoded into a value for "sophomore," "junior," and "senior." If a person were a junior, then "sophomore" would be equal to 0, "junior" would be equal to 1, and "senior" would be equal to 0. A freshman would have a 0 on all three variables.
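The recoding described above can be sketched in Python as follows. The function name dummy_code and the constant names are hypothetical, chosen only for this illustration. "Freshman" is treated as the level that is not coded.

```python
LEVELS = ["freshman", "sophomore", "junior", "senior"]
NOT_CODED = "freshman"  # the level that gets no dummy variable

def dummy_code(year, levels=LEVELS, not_coded=NOT_CODED):
    """Recode one "year in school" value into k-1 dummy variables."""
    if year not in levels:
        raise ValueError(f"unknown level: {year!r}")
    return {level: int(year == level) for level in levels if level != not_coded}

print(dummy_code("junior"))    # {'sophomore': 0, 'junior': 1, 'senior': 0}
print(dummy_code("freshman"))  # {'sophomore': 0, 'junior': 0, 'senior': 0}
```

Note that a freshman is identified by having a 0 on all three dummies, which is why no fourth variable is needed.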

Interpreting results

The decision as to which level is not coded is often arbitrary. The level which is not coded is the category to which all other categories will be compared. For this reason, the biggest group is often chosen as the not-coded category. For example, "Caucasian" will often be the not-coded group if that is the race of the majority of participants in the sample. In that case, if you have a variable called "Asian", the coefficient on the "Asian" variable in your regression will show the effect that being Asian rather than Caucasian has on your dependent variable.
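If you follow the convention of leaving the biggest group uncoded, one way to pick it is to count how often each level occurs. The sample below is made up for illustration, and the variable names are hypothetical.

```python
from collections import Counter

# Hypothetical sample (made-up values for illustration).
races = ["Caucasian", "Asian", "Caucasian", "Black", "Caucasian", "Asian"]

# The most frequent level becomes the not-coded (comparison) category.
counts = Counter(races)
not_coded = counts.most_common(1)[0][0]

# Every other level gets its own dummy variable.
coded = [level for level in counts if level != not_coded]

print(not_coded)  # Caucasian
print(coded)      # ['Asian', 'Black']
```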

In our example, "freshman" was not coded so that we could determine if being a sophomore, junior, or senior predicts a different level of depression than being a freshman. Consequently, if the variable "junior" was significant in our regression, with a positive beta coefficient, this would mean that juniors are significantly more depressed than freshmen. Alternatively, we could have decided to not code "senior" if we thought that being a senior is qualitatively different from being in any other year.
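To make the comparison with the not-coded level concrete, the sketch below fits an ordinary least squares regression of depression on the three dummies by solving the normal equations directly. The depression scores are made up, and the small solve helper is a hypothetical stand-in for what a statistics package would do. When the dummies are the only predictors, the intercept equals the freshman mean and each coefficient equals that group's mean minus the freshman mean. Significance testing is not shown here.

```python
from fractions import Fraction

# Hypothetical depression scores by year in school (made-up numbers).
scores = {
    "freshman": [10, 12, 14],
    "sophomore": [11, 13],
    "junior": [15, 17, 19],
    "senior": [9, 11],
}
not_coded = "freshman"
dummies = [level for level in scores if level != not_coded]

# Each design row is [intercept, sophomore, junior, senior]; y is the outcome.
X, y = [], []
for level, values in scores.items():
    for value in values:
        X.append([Fraction(1)] + [Fraction(int(level == d)) for d in dummies])
        y.append(Fraction(value))

def solve(A, b):
    """Solve A x = b by Gauss-Jordan elimination (A square, nonsingular)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for col in range(n):
        pivot = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[pivot] = M[pivot], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                factor = M[r][col]
                M[r] = [a - factor * c for a, c in zip(M[r], M[col])]
    return [M[i][n] for i in range(n)]

# Normal equations: (X'X) beta = X'y
p = len(X[0])
XtX = [[sum(row[i] * row[j] for row in X) for j in range(p)] for i in range(p)]
Xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(p)]
beta = solve(XtX, Xty)

coefficients = dict(zip(["intercept"] + dummies, (float(b) for b in beta)))
print(coefficients)
# {'intercept': 12.0, 'sophomore': 0.0, 'junior': 5.0, 'senior': -2.0}
```

In this made-up sample the coefficient on "junior" is 5, meaning juniors score 5 points higher on average than freshmen, while the negative coefficient on "senior" means seniors score 2 points lower than freshmen.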

© 2007 The Trustees of Princeton University. All rights reserved. data@princeton.edu; NOTE: Information is for Princeton University. Feel free to use the documentation but we can not answer questions outside of Princeton