Item Response Analysis

How to execute IRT analysis in R

In this section, we demonstrate an item response theory (IRT) analysis in the statistical software R by analyzing a dichotomous test. We will simulate data and fit two models to them: the Rasch model and the 2PL model. We will then compare the two models, check the assumptions, and examine the quality of the items.

First of all we must install the “ltm” package, which allows us to do IRT analysis on our data:
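A minimal sketch of this step, assuming a working internet connection and the CRAN cloud mirror; the `requireNamespace` guard is an addition that skips the download when the package is already present:

```r
# Install ltm from CRAN only if it is not available yet, then load it
if (!requireNamespace("ltm", quietly = TRUE)) {
  install.packages("ltm", repos = "https://cloud.r-project.org")
}
library(ltm)
```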


Simulating data

To show how to analyse data with IRT, we first need data. We simulate them so that readers can reproduce the analysis in full. Note that the data are simulated under a 2PL model. We have 50 items, but will only look at the first 11 items here:
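One way to simulate such data in base R is sketched below. The seed, the sample size of 1000 persons and the parameter ranges are illustrative assumptions, not the values used for the results discussed on this page, so the numbers you obtain will differ from those in the text.

```r
set.seed(123)
n_persons <- 1000
n_items <- 50

theta <- rnorm(n_persons)            # person abilities
a <- runif(n_items, 0.5, 2.5)        # item discriminations (assumed range)
b <- rnorm(n_items)                  # item difficulties

# 2PL model: P(X = 1) = plogis(a * (theta - b)); one column per item
dat <- sapply(1:n_items, function(j) {
  rbinom(n_persons, 1, plogis(a[j] * (theta - b[j])))
})
colnames(dat) <- paste0("item", 1:n_items)

# Responses of the first persons to the first 11 items
head(dat[, 1:11])
```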


Fitting the models
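The Rasch model can be fitted with `rasch()` from the ltm package. The sketch below first re-creates simulated data under assumed settings (seed, sample size and parameter ranges are illustrative), so its estimates will not match the ones discussed on this page. By default, `rasch()` estimates one discrimination parameter that is shared by all items.

```r
library(ltm)

# Simulated 2PL data (illustrative assumptions, see lead-in)
set.seed(123)
theta <- rnorm(1000)          # person abilities
a <- runif(50, 0.5, 2.5)      # item discriminations
b <- rnorm(50)                # item difficulties
dat <- sapply(1:50, function(j) rbinom(1000, 1, plogis(a[j] * (theta - b[j]))))
colnames(dat) <- paste0("item", 1:50)

# Rasch model: item-specific difficulties, one common discrimination
fit_rasch <- rasch(dat)

# Estimated parameters of the first 11 items
coef(fit_rasch)[1:11, ]
```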


Here we can see the estimated item parameters when the Rasch model is fitted to the data. The estimated difficulties differ between items, while the estimated discrimination parameter is the same for all items. Again, we are only looking at the first 11 items.
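The 2PL model can be fitted with `ltm()`, where `z1` in the formula denotes the single latent trait. As before, this is a sketch on re-created data under assumed simulation settings, so the estimates are only illustrative.

```r
library(ltm)

# Simulated 2PL data (illustrative assumptions, see lead-in)
set.seed(123)
theta <- rnorm(1000)          # person abilities
a <- runif(50, 0.5, 2.5)      # item discriminations
b <- rnorm(50)                # item difficulties
dat <- sapply(1:50, function(j) rbinom(1000, 1, plogis(a[j] * (theta - b[j]))))
colnames(dat) <- paste0("item", 1:50)

# 2PL model: item-specific difficulties and discriminations
fit_2pl <- ltm(dat ~ z1)

# Estimated parameters of the first 11 items
coef(fit_2pl)[1:11, ]
```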


As you can see, for the 2PL model both the estimated difficulties and the estimated discriminations differ between items.

Comparing the models

To find out which model, the 2PL or the Rasch model, fits our data best, we will do a likelihood-ratio test (LRT). For this to work, the simpler model (the Rasch model in this case) must be entered before the more complex model (the 2PL model). If the LRT is significant, the more complex model fits significantly better than the simpler model.
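With ltm, the LRT is available through `anova()`, with the Rasch fit as the first argument and the 2PL fit as the second. The sketch below again uses data simulated under assumed settings, so the printed statistics are only illustrative.

```r
library(ltm)

# Simulated 2PL data (illustrative assumptions, see lead-in)
set.seed(123)
theta <- rnorm(1000)          # person abilities
a <- runif(50, 0.5, 2.5)      # item discriminations
b <- rnorm(50)                # item difficulties
dat <- sapply(1:50, function(j) rbinom(1000, 1, plogis(a[j] * (theta - b[j]))))
colnames(dat) <- paste0("item", 1:50)

fit_rasch <- rasch(dat)       # simpler model
fit_2pl <- ltm(dat ~ z1)      # more complex model

# Likelihood-ratio test: simpler model first, more complex model second
lrt <- anova(fit_rasch, fit_2pl)
print(lrt)
```

The printed table also lists AIC and BIC for both models, which can serve as an additional comparison.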


To see whether the test is significant, we look at the p value. If the p value is lower than 0.05, the test is significant.

In our case, the LRT is significant (p < 0.001), which means that the more complex model, the 2PL model, fits the data significantly better. This makes sense, since we simulated our data under the 2PL model. If these were real data, we would conclude that our items differ in their discrimination capacities, and continue our analysis using only the 2PL model.

We can also try to simulate our data based on the Rasch model, to check whether the LRT favours the Rasch model, as it should. Let’s find out:
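A sketch of this check, assuming the same illustrative sample size and number of items but with all discriminations set to 1. Because the p-value is approximately uniformly distributed when the simpler model is true, roughly 5% of random seeds will still produce a significant result.

```r
library(ltm)

# Simulate data under the Rasch model: every item has discrimination 1
# (seed and sample size are illustrative assumptions)
set.seed(456)
theta <- rnorm(1000)          # person abilities
b <- rnorm(50)                # item difficulties
dat_rasch <- sapply(1:50, function(j) rbinom(1000, 1, plogis(theta - b[j])))
colnames(dat_rasch) <- paste0("item", 1:50)

fit_rasch_r <- rasch(dat_rasch)
fit_2pl_r <- ltm(dat_rasch ~ z1)

# Likelihood-ratio test on the Rasch-simulated data
print(anova(fit_rasch_r, fit_2pl_r))
```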


As we can see, the p-value is not significant, and therefore the conclusion is that the 2PL model does not fit the data significantly better than the Rasch model.

Checking for the assumptions


One way to check for unidimensionality (see the Item response theory page) is by using a scree plot (see Factor Analysis). We make a scree plot by computing the eigenvalues of the correlation matrix of our data and plotting them. If the biggest drop in the plot occurs after the first eigenvalue, a single dominant dimension underlies the items and we can treat the assumption as not seriously violated.

First off, use the following function to make a correlation matrix from your data:


Next, we extract the eigenvalues of our data:


And plot them:
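The three steps above can be sketched in base R as follows. The data are re-created under assumed simulation settings, and Pearson correlations between the 0/1 item scores are used; whether the original analysis used these or, for example, tetrachoric correlations is not stated here.

```r
# Simulated 2PL data (illustrative assumptions, see lead-in)
set.seed(123)
theta <- rnorm(1000)          # person abilities
a <- runif(50, 0.5, 2.5)      # item discriminations
b <- rnorm(50)                # item difficulties
dat <- sapply(1:50, function(j) rbinom(1000, 1, plogis(a[j] * (theta - b[j]))))
colnames(dat) <- paste0("item", 1:50)

# Step 1: correlation matrix of the item scores
R <- cor(dat)

# Step 2: eigenvalues of the correlation matrix, in decreasing order
ev <- eigen(R)$values

# Step 3: scree plot
plot(ev, type = "b", xlab = "Component number", ylab = "Eigenvalue",
     main = "Scree plot")
```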


As can be seen above, the biggest drop occurs after the first eigenvalue, so we can assume that the assumption of unidimensionality is not seriously violated.

Local independence

To check local independence, we will look at conditional correlations, that is, the correlations between the item scores conditional on the latent trait estimate. If local independence holds, only about 5% of these correlations should be significant at the 0.05 level by chance, and their p-values should be approximately uniformly distributed.
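One possible way to obtain such conditional correlations is sketched below: each item is residualised on the estimated ability with a linear regression, and the residuals are then correlated for every item pair. This is an assumption about the procedure, not necessarily the one used for the results on this page, and the data are again re-created under illustrative simulation settings.

```r
library(ltm)

# Simulated 2PL data (illustrative assumptions, see lead-in)
set.seed(123)
theta <- rnorm(1000)          # person abilities
a <- runif(50, 0.5, 2.5)      # item discriminations
b <- rnorm(50)                # item difficulties
dat <- sapply(1:50, function(j) rbinom(1000, 1, plogis(a[j] * (theta - b[j]))))
colnames(dat) <- paste0("item", 1:50)

# Ability estimate for every person under the 2PL model
fit_2pl <- ltm(dat ~ z1)
theta_hat <- factor.scores(fit_2pl, resp.patterns = dat)$score.dat$z1

# Remove the (linear) effect of the ability estimate from every item
res <- apply(dat, 2, function(x) resid(lm(x ~ theta_hat)))

# p-value of the residual correlation for every pair of items
item_pairs <- combn(ncol(dat), 2)
p_vals <- apply(item_pairs, 2, function(ij) {
  cor.test(res[, ij[1]], res[, ij[2]])$p.value
})

# Proportion of significant conditional correlations
mean(p_vals < 0.05)

# Distribution of the p-values; it should look roughly flat
hist(p_vals, breaks = 20, xlab = "p-value",
     main = "p-values of conditional correlations")
```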



Here, just over 9% of the p-values are significant, which is somewhat more than the 5% expected by chance, but the p-values are fairly uniformly distributed. We therefore assume that local independence is not grossly violated, and will continue our analysis.

Ability scores and sumscores

Now we will compare the scores obtained from our model to the more commonly used (and more easily applied) sum score.

We will make sum scores, and check their distribution compared to the theta estimates, and plot their correlation.
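A sketch of this comparison, assuming the 2PL fit on re-created data with illustrative simulation settings. Passing `resp.patterns = dat` to `factor.scores()` returns one ability estimate per person rather than one per unique response pattern.

```r
library(ltm)

# Simulated 2PL data (illustrative assumptions, see lead-in)
set.seed(123)
theta <- rnorm(1000)          # person abilities
a <- runif(50, 0.5, 2.5)      # item discriminations
b <- rnorm(50)                # item difficulties
dat <- sapply(1:50, function(j) rbinom(1000, 1, plogis(a[j] * (theta - b[j]))))
colnames(dat) <- paste0("item", 1:50)

fit_2pl <- ltm(dat ~ z1)

# Sum score and model-based ability estimate for every person
sum_score <- rowSums(dat)
theta_hat <- factor.scores(fit_2pl, resp.patterns = dat)$score.dat$z1

# Distributions of both scores and their relation
par(mfrow = c(1, 3))
hist(sum_score, main = "Sum scores", xlab = "Sum score")
hist(theta_hat, main = "Ability estimates", xlab = "Theta")
plot(sum_score, theta_hat, xlab = "Sum score", ylab = "Theta")
par(mfrow = c(1, 1))

# Correlation between sum score and ability estimate
cor(sum_score, theta_hat)
```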




The sum score and the ability score correlate 0.98, which is very high. Therefore, we can assume that using the sum score instead of the ability estimate for scoring the test would not, in this case, lead to very different results.

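If the data had been simulated from the Rasch model, the correlation should be (nearly) perfect, because under the Rasch model the ability estimate depends on the responses only through the sum score. Let’s look at this with the data we simulated based on the Rasch model: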
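A sketch of this check on Rasch-simulated data (seed and sample size are illustrative assumptions). Since the ability estimate is a monotone but not necessarily exactly linear function of the sum score, the rank correlation is reported next to the Pearson correlation.

```r
library(ltm)

# Simulate data under the Rasch model: every item has discrimination 1
set.seed(456)
theta <- rnorm(1000)          # person abilities
b <- rnorm(50)                # item difficulties
dat_rasch <- sapply(1:50, function(j) rbinom(1000, 1, plogis(theta - b[j])))
colnames(dat_rasch) <- paste0("item", 1:50)

fit_rasch_r <- rasch(dat_rasch)

# Sum score and Rasch-based ability estimate for every person
sum_score_r <- rowSums(dat_rasch)
theta_hat_r <- factor.scores(fit_rasch_r,
                             resp.patterns = dat_rasch)$score.dat$z1

cor(sum_score_r, theta_hat_r)                       # Pearson correlation
cor(sum_score_r, theta_hat_r, method = "spearman")  # rank correlation
```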


As expected, the sum scores are perfectly correlated with the theta estimates for the Rasch model.

Item characteristic curves

The item characteristic curves (see: Item response theory) show both where each item is located in terms of difficulty and how well the item discriminates between people of different abilities. The slope of an item curve shows its discrimination: the steeper the slope, the better the item discriminates.

Since we have 50 items, plotting them all at once would make the figure too crowded to see which item provides what information. It is therefore better to plot only a few items at a time. For the purpose of this tutorial, we will plot only the first 10 items. To choose which items are shown, change the “items” argument.
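With ltm, the curves can be drawn with the `plot()` method for fitted models. The sketch below uses re-created data under assumed simulation settings, so the curves and colours will not match the figure discussed on this page.

```r
library(ltm)

# Simulated 2PL data (illustrative assumptions, see lead-in)
set.seed(123)
theta <- rnorm(1000)          # person abilities
a <- runif(50, 0.5, 2.5)      # item discriminations
b <- rnorm(50)                # item difficulties
dat <- sapply(1:50, function(j) rbinom(1000, 1, plogis(a[j] * (theta - b[j]))))
colnames(dat) <- paste0("item", 1:50)

fit_2pl <- ltm(dat ~ z1)

# Item characteristic curves of the first 10 items;
# change 'items' to inspect other items, e.g. items = 11:20
plot(fit_2pl, type = "ICC", items = 1:10, legend = TRUE)
```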


As the plot above shows, our items vary in discrimination (they have varying slopes). It is particularly noteworthy that item 4 (the dark blue curve) has a very low discrimination compared to the other items. This means that the item is only weakly related to the underlying construct being measured. If these were actual data, this would warrant an investigation of item 4: rewording the item to make it clearer might be necessary, or the item might simply not belong in this particular test.

It is worth mentioning that items with very steep slopes are not necessarily better than items with average slopes: they discriminate very well, but only within a narrow range of abilities. It is good to have both items with steep curves and items with average steepness. We can get more information about the quality of the items from the item information curves.

Test and item information curves

Test information (see: Item response theory) is similar to reliability in Classical test theory. Unlike a single reliability coefficient, however, it shows how much information the test provides at each ability level. We can also see the information that each item provides separately.

Let’s look at the test information curve:
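In ltm, the test information curve is obtained by requesting item information curves with `items = 0`. The sketch below uses re-created data under assumed simulation settings, so the shape of the curve will differ from the one discussed on this page.

```r
library(ltm)

# Simulated 2PL data (illustrative assumptions, see lead-in)
set.seed(123)
theta <- rnorm(1000)          # person abilities
a <- runif(50, 0.5, 2.5)      # item discriminations
b <- rnorm(50)                # item difficulties
dat <- sapply(1:50, function(j) rbinom(1000, 1, plogis(a[j] * (theta - b[j]))))
colnames(dat) <- paste0("item", 1:50)

fit_2pl <- ltm(dat ~ z1)

# items = 0 plots the information of the whole test
plot(fit_2pl, type = "IIC", items = 0)
```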


This curve shows that our test provides quite a lot of information for people around the mean, and that it distinguishes especially well between people who are just above the mean.

Next, we can plot the information curves of the items (see: Item response theory), to see how much information each item provides about the measured construct. Let's look at the first 10 items.
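The item information curves use the same `plot()` call with a vector of item numbers. As before, this is a sketch on re-created data under assumed simulation settings, so the colours and heights of the curves will not match the figure discussed on this page.

```r
library(ltm)

# Simulated 2PL data (illustrative assumptions, see lead-in)
set.seed(123)
theta <- rnorm(1000)          # person abilities
a <- runif(50, 0.5, 2.5)      # item discriminations
b <- rnorm(50)                # item difficulties
dat <- sapply(1:50, function(j) rbinom(1000, 1, plogis(a[j] * (theta - b[j]))))
colnames(dat) <- paste0("item", 1:50)

fit_2pl <- ltm(dat ~ z1)

# Item information curves of the first 10 items
plot(fit_2pl, type = "IIC", items = 1:10, legend = TRUE)
```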


Here, the gray, purple and yellow items seem to give the most information, the red and the black items give a medium amount of information, and the remaining items have a relatively low information value. The three most informative items are concentrated around the mean and give less information about people with extreme abilities. Therefore, items such as the black one add a lot to the test, even though they have a lower information value.

Be careful when interpreting these curves: the scale of the y-axis depends on the curves that are plotted, so information values cannot be compared by eyeballing different plots. Instead, read the values from the y-axis.


IRT can be used for different purposes, and the interpretation of the analysis varies with the purpose. In our case, we used the analysis to assess the quality of our test and its items. We found that our items have differing discrimination capacities, and that item 4 performs particularly poorly and should be reconsidered. We could also, for example, decide to reduce the number of items in our test by removing the items that add no information beyond what the other items provide. However, we only looked at the first 10 items in this tutorial, and of course all items should be carefully considered before making a decision.