The Item Analysis Report provides detailed statistics on individual exam items, allowing users to evaluate item performance, quality, and reliability.
Overview
This article describes the automated Item Analysis Report. At the end are some guidelines for interpreting these statistics, but you may wish to consult a psychometrician or a textbook on psychometric item analysis. The main objective of item analysis is to obtain item-level statistics (e.g., item difficulty, item-test correlation) and to identify effective items and poorly performing items.
The Item Analysis Report automatically analyzes the exam response data. This involves four steps:
- Specify the test data to be used for item analysis
- Select the criteria to exclude invalid exam results
- View the item analysis results
- Apply any filter, choose the columns to show, and (if desired) export the item analysis data
Specify test data
Specify the data you would like to use for item analysis through the drop-down lists at the top of the page. Click “RUN REPORT” to start analyzing the data.
- Organization: The organization which owns the exam.
- Exam: The name of the exam.
- Forms (Assessments): The test form(s) of an exam. Check the box in front of one or more forms under the same exam to include them in the analysis.
- Date Range: The calendar date range of administration (in UTC). Select one of the following:
- Last Month: The previous calendar month.
- Last Quarter: The previous three-month period. A year is typically divided into four quarters – Q1 (January to March), Q2 (April to June), Q3 (July to September), and Q4 (October to December).
- Last Year: The previous calendar year.
- Past 30 Days: The time period that spans the previous 30 consecutive days, including today.
- Past 12 Months: The time period that spans the previous 12 consecutive months, including this month.
- Other: A custom time period. If "Other" is selected, choose the "Start Date" and "End Date" by typing the date (in mm/dd/yyyy format) or selecting it from the drop-down calendar.
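To make the relative ranges concrete, here is a minimal sketch of how they resolve to UTC calendar dates (an illustrative Python example based on the definitions above, not Certiverse's internal code):

```python
from datetime import date, timedelta

today = date(2024, 5, 15)  # a pretend "today" in UTC

# Past 30 Days: the previous 30 consecutive days, including today.
past_30_days = (today - timedelta(days=29), today)

# Last Month: the previous calendar month.
first_of_this_month = today.replace(day=1)
last_of_prev_month = first_of_this_month - timedelta(days=1)
last_month = (last_of_prev_month.replace(day=1), last_of_prev_month)

# Last Year: the previous calendar year.
last_year = (date(today.year - 1, 1, 1), date(today.year - 1, 12, 31))

print(past_30_days)  # April 16, 2024 through May 15, 2024
print(last_month)    # April 1, 2024 through April 30, 2024
print(last_year)     # January 1, 2023 through December 31, 2023
```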
Tips:
- The dates are based on Coordinated Universal Time (UTC), which is several hours ahead of all US time zones. You may therefore need to select a start or end date one day earlier or later than the corresponding date in your time zone. For example, to capture all US-based candidates who took an exam on February 2, you may need to select February 1 to February 3. If you need more fine-grained control, you can download the responses and select individual records.
- If multiple forms are selected, the item statistics will be calculated over all responses to an item across forms, regardless of its scoring status. For example, suppose item001 was administered as a pretest item on Form A with a sample size of 50 and as a scored item on Form B with a sample size of 100. If both forms are selected for analysis, the statistics for item001 will be calculated across the 50 + 100 = 150 candidates who took either form.
- When selecting the date range, you can exclude some results that are outside the testing window by selecting Other and setting a custom date range based on your testing window.
Exclude invalid exam results
The system allows you to exclude candidates from the analysis when the pattern of their responses indicates inattentiveness or low response quality. It is a best practice to remove such candidates before interpreting the item analysis.
To use this feature, first run the item analysis without any data cleaning. The number of candidates affected by each criterion will appear to the right of that criterion. Check the box in front of one or more data cleaning criteria to remove those candidates from the item analysis, then click "RUN REPORT" to see the results with the specified criteria applied.
You can apply one or more of the following three criteria:
- Incomplete Data: A large number of missing responses may indicate an atypical candidate who should not be included in the analysis. Applying this criterion removes any result with 30% or more omitted or not-reached responses.
- Quick Responders: An extremely short administration time usually indicates that the candidate's exam ended early or that the candidate rushed through the exam without attending to it. Applying this criterion removes any result whose Administration Time was less than 15 minutes.
- Chance Responders: A test total score below the chance level for four-option multiple-choice items (25% correct) indicates an unusual result (e.g., inattentiveness or random responding). Applying this criterion removes any result with fewer than 25% of the items answered correctly. If your form includes many item types other than four-option multiple choice, you may not wish to use this filter.
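If you prefer to export the response data and perform the cleaning manually (see the tips below), the three criteria reduce to simple row filters. Here is a minimal sketch, assuming a hypothetical results table; the column names are illustrative, not Certiverse export fields:

```python
import pandas as pd

# One row per candidate; hypothetical column names.
results = pd.DataFrame({
    "candidate_id":  [1, 2, 3, 4],
    "pct_omitted":   [0.00, 0.45, 0.02, 0.10],  # omit/not-reached proportion
    "admin_minutes": [62, 8, 75, 40],           # total administration time
    "pct_correct":   [0.71, 0.22, 0.18, 0.66],  # proportion answered correctly
})

incomplete = results["pct_omitted"] >= 0.30    # Incomplete Data: >= 30% omitted
quick      = results["admin_minutes"] < 15     # Quick Responders: < 15 minutes
chance     = results["pct_correct"] < 0.25     # Chance Responders: < 25% correct

clean = results[~(incomplete | quick | chance)]
print(clean["candidate_id"].tolist())  # [1, 4] -- candidates retained
```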
Tips:
- The default data cleaning criteria have proved effective for most exams. The system does not support customizing the criteria, but you can export the response data and conduct the data cleaning manually.
- Cases are only temporarily excluded from the analysis (they are never deleted from the dataset). You can check or uncheck the boxes for the data cleaning criteria and click "RUN REPORT" to regenerate the analysis.
Item and option analyses
Once the test data is selected and any optional data cleaning criteria are applied, the item and option analysis results are shown onscreen. These results include the following columns:
Item meta data:
- Item Id: (hidden by default) The unique numeric identifier for the item.
- Item Name: The unique string identifier for the item.
- Is Scored: The scoring status of an item. "✓" indicates the item is scored on the selected form(s); "X" indicates the item is unscored (pretest) on the selected form(s); "both" appears only when multiple forms are selected and indicates the item is scored on one or more selected forms and unscored on one or more others.
- Domain Id: The content domain identifier indicating the content area which the item assesses.
- Domain: The content area which the item assesses.
- Item Type: The format of the item or question. "MCX" indicates a multiple-choice single-selection item with X options (e.g., "MC4" has four options); "MCRX-Y" indicates a multiple-choice multiple-selection item with X options, Y of which are correct (e.g., "MCR5-2" has five options with two keys).
- Key: The correct option identifier(s) of the item.
Item analysis results:
- Candidate Count: The number of test-takers on which the item analysis is based.
- Mean: The average score on the item across test-takers or the proportion of test-takers who correctly answered the item. This is a measure of item difficulty, and higher values indicate easier items.
- Standard Deviation: The amount of dispersion in the scores on the item. Higher values indicate a more dispersed item score distribution.
- Corrected Item Total Correlation: The association between the score on the item and the score on the test without this item. This measures how closely the performance on the item is related to performance on the entire test. Higher values indicate the item has a stronger association with the test.
- Item Total Correlation: (hidden by default) The association between the score on the item and the score on the test (including this item). Higher values indicate the item has a stronger association with the test. This is similar to the Corrected Item Total Correlation except that the item itself is not excluded from the total score, so it is always higher than or equal to the Corrected Item Total Correlation.
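As a concrete reference for these definitions, the sketch below computes each statistic from a small, hypothetical 0/1 scored response matrix. It illustrates the formulas only; for instance, the report may use a different standard deviation convention (population vs. sample) than the one assumed here.

```python
import numpy as np

scores = np.array([  # rows = candidates, columns = items; 1 = correct
    [1, 0, 1, 1],
    [1, 1, 0, 1],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 1, 1, 1],
])
total = scores.sum(axis=1)  # each candidate's total test score

for i in range(scores.shape[1]):
    item = scores[:, i]
    mean = item.mean()                            # difficulty: proportion correct
    sd = item.std(ddof=1)                         # sample standard deviation
    itc = np.corrcoef(item, total)[0, 1]          # Item Total Correlation
    citc = np.corrcoef(item, total - item)[0, 1]  # Corrected Item Total Correlation
    print(f"item {i + 1}: mean={mean:.2f} sd={sd:.2f} itc={itc:.2f} citc={citc:.2f}")
```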
Option analysis results:
For each option, two statistics are reported:
- Test Correlation: The corrected option total correlation, which assesses the association between endorsing an option and the test total score excluding the item. Higher values indicate that people endorsing the option tend to earn higher scores on the test.
- Proportion: The proportion of test-takers endorsing the option.
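The option statistics follow the same pattern, with each option scored as endorsed (1) or not (0). A minimal sketch, assuming hypothetical raw choices for one item and each candidate's test score with the item removed:

```python
import numpy as np

choices = np.array(["D", "A", "D", "C", "D", "B", "D", "D"])  # one letter per candidate
rest_score = np.array([42, 18, 39, 22, 45, 27, 36, 40])       # test score without this item

for option in ["A", "B", "C", "D"]:
    endorsed = (choices == option).astype(float)              # 1 if the option was selected
    proportion = endorsed.mean()                              # Proportion
    cotc = np.corrcoef(endorsed, rest_score)[0, 1]            # Test Correlation (COTC)
    print(f"option {option}: proportion={proportion:.2f} cotc={cotc:.2f}")
```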
Tips:
- Click on “Columns” on the far right to enable or disable columns of the report (see below). The export of this report includes the currently selected (shown) columns.
- By default, the system shows a limited number of rows. For most exams, you will need to increase the number of rows at the lower right or page through the results to see all items. If you export this report, the export always includes all rows meeting the chosen filters.
Other functions
- Filter function:
Enter a value in the “Filter” textbox below the header row to show items with the specified values. Click on the filter button under each header row to set more complex filters using that column. To set a filter using multiple columns, click on “FILTERS” on the right edge of the page.
- Manage columns:
To manage the columns, click on “COLUMNS” on the right edge of the page. Check the box in front of a column to make it appear and uncheck the box to hide it.
- Export Item and Option analyses results:
To export the analysis results, click the download button at the top right corner of the page. The exported file shows exactly what you see on the page: only visible columns, with any filters applied (but all rows).
Interpreting the item statistics
This section provides some generic guidelines for interpreting these statistics, but you may wish to consult a psychometrician or a textbook on psychometric item analysis to help you apply general principles to your specific exam context.
Mean:
- The mean (or average) shows the proportion of candidates who answered the item correctly. For example, a mean of 0.518 indicates that about 52% of candidates answered the item correctly.
- There is no single proper level of difficulty, although individual exam programs may have guidelines. Item variability is maximized when 50% of the sample responds correctly, but items that difficult are too hard for most exams. Typically, items are answered correctly by 60% to 85% of the sample; items with a mean below 0.60 are hard, and items with a mean above 0.85 are easy.
- When most (e.g., >= 95%) candidates answer an item correctly, the Corrected Item Total Correlation becomes less meaningful. See the explanation in the section about CITC.
Corrected Item Total Correlation (CITC):
- The CITC measures the association between the score on a single item and the sum of the scores on the rest of the items on a test. It provides evidence of how well the item discriminates between high and low performers on the test. A high CITC means that test-takers who correctly answer the item also score high on the rest of the test, indicating that the item is good and aligns well with the overall test. Conversely, a negative CITC suggests that people who got the item right tend to score low on the test, which may imply that the item is poorly written, confusing, or misleading to test-takers.
- In practice, there is no single set of rules for interpreting the size of the CITC. For most programs administered through Certiverse, we apply the following rules:
- CITCs lower than -0.15: The item has a negative association with the test and should be avoided unconditionally.
- CITCs between -0.15 and -0.05: The item has a small negative association with the test and should be avoided in most situations.
- CITCs between -0.05 and 0.05: The item has near zero association with the test and may be considered in operational use.
- CITCs between 0.05 and 0.15: The item’s association with the test is small but OK.
- CITCs greater than 0.15: The item’s association with the test is good.
- When most (e.g., >= 95%) candidates answer an item correctly, the Corrected Item Total Correlation becomes less meaningful: nearly all respondents answer the item correctly, so the CITC is largely determined by the performance of the very small number of people who answer it incorrectly, and any statistic based on such a small group involves a large degree of error. For example, suppose an item on a 50-item test was administered to 100 people, 99 of whom answered it correctly, while the one person who answered incorrectly achieved the highest total score on the test. The item is extremely easy but has a large negative CITC. However, that negative CITC is the result of a single person and would be hard to replicate if the test were administered to another sample. We would therefore conclude that the item is extremely easy and not interpret its CITC.
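For readers who script their item review, the bands above can be encoded directly. A small sketch; how values falling exactly on a boundary (e.g., -0.05) are classified is an assumption here, since the prose leaves it unspecified:

```python
def interpret_citc(citc: float) -> str:
    """Map a CITC value to the interpretation bands described above."""
    if citc < -0.15:
        return "negative association; avoid unconditionally"
    if citc < -0.05:
        return "small negative association; avoid in most situations"
    if citc < 0.05:
        return "near-zero association"
    if citc < 0.15:
        return "small but acceptable association"
    return "good association with the test"

print(interpret_citc(0.22))  # good association with the test
```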
Tips:
- Statistics are less useful when the candidate count is small. Do not rely on values calculated from tiny samples (fewer than 20 candidates) or unrepresentative ones (e.g., samples of candidates who differ from the eventual candidate population). Be aware that even estimates from large, representative samples include some degree of sampling error.
- While there is no "correct" mean level, a mean value close to or below chance responding (e.g., 0.25 for a four-option multiple-choice item) is unreasonably hard and may indicate a problem with the item. A mean value close to 1.0 indicates a very easy item that contributes little or nothing to reliable measurement.
Option Analysis
Option Analysis in the Certiverse system reports two statistics for each option: Proportion and Test Correlation. Proportion is the proportion of test-takers endorsing the option; a higher value means the option was endorsed by more people. For multiple-choice single-selection items, the Proportion for the keyed option equals the item Mean, and the proportions for all options plus the proportion of missing responses sum to 1. Test Correlation is the corrected option-total correlation (COTC), which measures the association between endorsing an option and the total score on the test excluding the item. Higher positive COTC values indicate that people selecting the option tend to earn higher scores on the exam.
Option analyses are primarily used when an item is functioning poorly, to better understand which response options are popular and how the choice of each option relates to the overall exam score. These details can help reviewers diagnose and correct a poorly performing item. In a typical well-performing item, the keyed response is popular, with a large proportion of people endorsing it; the keyed option should have a positive Test Correlation, while all incorrect options should have non-positive Test Correlations. The table below shows option analysis results for two example items. The interpretation of these results is described in detail, and option analysis results for your exam project can be interpreted in the same fashion.
| Item001 | Which of the following is a mammal? | Proportion | Test Correlation |
| --- | --- | --- | --- |
| Option A | A snake. | 0.04 | -0.21 |
| Option B | A shark. | 0.03 | 0.02 |
| Option C | A penguin. | 0.12 | -0.25 |
| Option D | A dolphin.* | 0.81 | 0.45 |
| Response missing | | 0.00 | 0.00 |

| Item002 | Which of the following is an insect? | Proportion | Test Correlation |
| --- | --- | --- | --- |
| Option A | A crab. | 0.05 | -0.06 |
| Option B | A spider.* | 0.42 | -0.24 |
| Option C | A bee. | 0.32 | 0.23 |
| Option D | A centipede. | 0.20 | 0.07 |
| Response missing | | 0.01 | -0.01 |
Note. * denotes the keyed option.
In example Item001, the keyed option D ("A dolphin.") was endorsed by 81% of respondents, with a COTC of 0.45. The three other options were each endorsed by a small proportion of the sample and had trivial or negative correlations with the total score. These results suggest that most test-takers recognized the keyed option as correct and that people selecting it also scored high on the test; the item therefore discriminates well between high and low scorers.
Item002 shows another example from the same test, where the option analysis results indicate the item is potentially miskeyed. The keyed option B ("A spider.") was selected by 42% of test-takers, but it had a negative COTC, indicating that people selecting this option tended to have low scores on the test. By contrast, option C ("A bee.") was endorsed by 32% of test-takers and had a positive COTC. This pattern suggests that people who scored high on the test tended to select option C instead of option B. SMEs reviewing this item can then use subject matter knowledge to determine that option C should be the key and option B a distractor.
Item Quality Stoplight Model
There are different possible criteria for determining the quality of an item based on the results of the item and option analyses. In Certiverse, we use item difficulty (i.e., Proportion Correct) and CITC to classify items into three stoplight categories: "red" (avoid), "yellow" (avoid if possible), and "green" (ready to use). The table below shows how item difficulty and CITC are combined to determine the quality flag for multiple-choice single-selection items with four options. As shown in the table, items are flagged "red" if they have an extremely low CITC, an extremely low proportion correct, or a combination of low CITC and low proportion correct. Items are "green" if they have a reasonably high CITC and a proportion correct between 50% and 95%. The remaining items are flagged "yellow".
| Proportion Correct (PC) | -1 <= CITC < -0.05 | -0.05 <= CITC < 0.05 | 0.05 <= CITC < 0.15 | 0.15 <= CITC <= 1 |
| --- | --- | --- | --- | --- |
| 0 <= PC < 25% | red | red | red | red |
| 25% <= PC < 50% | red | red | yellow | yellow |
| 50% <= PC < 95% | red | yellow | yellow | green |
| 95% <= PC <= 100% | red | yellow | yellow | yellow |
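A compact way to apply the table is a single decision function. The sketch below encodes the boundaries as reconstructed above; it is an illustrative example, not the Certiverse implementation:

```python
def stoplight(pc: float, citc: float) -> str:
    """Classify an item: "red" (avoid), "yellow" (avoid if possible), "green" (ready)."""
    if citc < -0.05 or pc < 0.25:
        return "red"                               # first PC row and first CITC column
    if pc < 0.50:
        return "red" if citc < 0.05 else "yellow"  # hard items need a solid CITC
    if pc < 0.95:
        return "green" if citc >= 0.15 else "yellow"
    return "yellow"                                # very easy items are at best yellow

print(stoplight(pc=0.72, citc=0.21))  # green
print(stoplight(pc=0.97, citc=0.30))  # yellow
```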
Tips:
- The stoplight model shown here is only one example application of item analysis results. You may want to build your own model or modify this one to better fit your exam program. For example, if your exam consists of mostly difficult items, applying the model above may cause too many items with low Proportion Correct to be flagged "yellow" or "red". In that case, you might adjust the Proportion Correct thresholds or relabel some categories (e.g., label items with 25% <= PC < 50% and 0.15 <= CITC <= 1 as "green" instead of "yellow").
- See here for a step-by-step video tutorial on how to create an Item Analysis Report.
Contact Us
If you have any questions or need additional assistance, please contact us by either emailing support@certiverse.com or by submitting a ticket from this article.