Accurate Medical Coding, Part 3: Evaluating Medical Coding Reliability with IRR

Jun 30, 2023 | Risk Adjustment, Policy

Introduction

In Part 2 of this blog series, we discussed the importance of consistent coding practices in the healthcare industry and identified potential sources and risks associated with variations in medical coding. Measuring variation in medical coding and developing strategies to mitigate its causes are essential functions of any healthcare organization. In this blog, we will explore how to measure coding quality and variability using a metric called inter-rater reliability (IRR). This is generally the first step in addressing unwanted variation in coding. IRR can be used to establish accurate and reliable coding practices, helping organizations consistently gauge patient outcomes and promote financial stability.

What is Inter-Rater Reliability (IRR)?

Medical coding is an essential part of our healthcare system that involves assigning standardized codes to describe medical diagnoses and procedures for billing and record-keeping purposes (see Part 1 for more details on different types of medical coding and their use). However, medical coding is prone to subjectivity and individual biases. Within any healthcare organization, it is important to measure how reliable these codes are (i.e., whether the codes are complete, accurate, and applied consistently), despite the intrinsic variation built into this process.

An intuitive way to evaluate the quality of medical coding is to see how consistently multiple coders identify the same code within a set of medical records. The more frequently a code appears across multiple coder reviews, the more likely it is that the code is correct. Inter-rater reliability (IRR) measures the consistency and rate of agreement between two or more coders given the same set of medical records. A high IRR value indicates more agreement and consistency between multiple coders, suggesting less error in the coding process.

How is the IRR used?

IRRs are typically reported for quality control and assurance across multiple sectors in the healthcare industry. The accuracy and consistency of medical coding can impact patient care, billing and reimbursement, and compliance with coding and billing guidelines. Below, we discuss how healthcare organizations can use the IRR to evaluate the overall quality of medical coding.

Assessing Coding Consistency

The IRR is used to assess coding consistency between medical coders coding the same set of records. Healthcare organizations can use this IRR value to identify areas of consistency or discrepancy and take measures to address any issues. However, it is crucial to watch for cases where coders are making consistent systematic errors: relying solely on IRR may not be adequate for identifying and rectifying such errors, and a thorough examination of their root causes is imperative. We discuss the limitations of the IRR in more detail in the next section.

Identifying Areas for Training

The IRR can also be used to identify areas where coders may need additional training or support. For example, a healthcare plan may serve a county with a high prevalence of chronic obstructive pulmonary disease (COPD), a progressive lung disease associated with multiple comorbidities. If IRR values for COPD-related codes are low, the plan may consider providing targeted training in COPD coding to improve coder proficiency.

Evaluating Coder Performance

Healthcare organizations can use the IRR as a standardized and objective evaluation of coder performance. While there may be multiple causes for a low IRR, if discrepancies consistently involve a particular coder, that may flag an issue with that coder’s performance. By examining the agreement among coders, an organization can gauge the consistency and accuracy of its coding practice. This approach promotes accountability and encourages high-quality work standards. Generally, when coders are evaluated, comparisons are made to a senior coder and/or multiple other coders.

We should note that while the IRR is a valuable metric for evaluating coder performance, it should not be the sole method used. Other performance indicators, such as accuracy from audit results, should be taken into consideration to provide a comprehensive view of coder performance.

Ensuring the Integrity of Healthcare Payments

Ensuring that medical coding is consistent and reliable is important, especially for healthcare plans that receive risk-adjusted payments, such as managed care plans participating in the Medicare Advantage (MA) program. Under MA, the Centers for Medicare & Medicaid Services (CMS) audits MA plans through its Risk Adjustment Data Validation (RADV) program to ensure they are submitting reliable billing data with the associated medical records. This helps promote the integrity and accuracy of payments made by the government to healthcare plans. The RADV audit process can involve multiple comparisons of codes on billing and other medical records across plan and government coders, with the aim of achieving high IRR measures.

Coding and Billing

It is essential for healthcare plans to confirm that they are following proper coding and billing procedures to ensure reliability, accuracy, and compliance with billing regulations. IRR measures can be used to inform more efficient and lean billing and reimbursement processes. Additionally, organizations can use inconsistencies or discrepancies in medical codes to identify potentially fraudulent billing practices, such as upcoding or unbundling services. This proactive approach to detecting billing irregularities helps organizations maintain compliance with regulatory requirements and safeguard against financial losses and damage to their reputation.

Resource Allocation

Healthcare organizations can also use IRR to evaluate the reliability of the billing and cost data behind their healthcare services. By ensuring there is consistency and accuracy in coding decisions, organizations can better monitor areas where costs can be reduced or where investment is needed to improve the quality of patient care.

IRR Metrics: Strengths and Pitfalls

There are different ways to calculate an IRR metric, each with its own assumptions and merits depending on the context in which it is used. Below, we discuss the strengths, weaknesses, and practical applications of the most common option and of a more robust IRR approach.

Percent Agreement: Intuition and Pitfalls

Percent agreement is a commonly used metric to measure IRR in medical coding. It is calculated as the percentage of times that two or more coders assign the same code to a medical record. For example, if two coders assign the same code to a medical record 80 times out of 100, the percent agreement would be 80%.
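To make the calculation concrete, here is a minimal sketch in Python. The coder assignments (and the ICD-10-style codes in them) are hypothetical, chosen only to illustrate the arithmetic:

```python
# Hypothetical code assignments for five records reviewed by two coders.
coder_a = ["E11.9", "J44.1", "E11.9", "I10", "J44.1"]
coder_b = ["E11.9", "J44.0", "E11.9", "I10", "J44.1"]

def percent_agreement(codes_a, codes_b):
    """Share of records on which both coders assigned the same code."""
    matches = sum(a == b for a, b in zip(codes_a, codes_b))
    return matches / len(codes_a)

print(f"Percent agreement: {percent_agreement(coder_a, coder_b):.0%}")  # 80%
```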

While percent agreement is a simple and easy-to-understand measurement, there are issues with using it as an IRR metric. One issue with percent agreement is that it does not account for the possibility of agreement occurring by coincidence, rather than through an accurate interpretation of the medical record.

To better illustrate this concept, let’s consider an example. Suppose we are trying to determine the severity level for type II diabetes in a population of patients. Coders are required to assign the appropriate code based on specific criteria, such as the presence of complications or the need for medication management. However, due to the complexity or ambiguity of the guidelines, some coders may struggle with distinguishing the correct severity level consistently. As a result, their coding decisions may vary.

Using percent agreement, the IRR measurement may show occasional alignment or agreement between coders on the severity level for diabetes. However, this alignment may not necessarily reflect a true consensus or shared understanding; rather, it could be the result of occasional coincidental agreement. That is, coders who struggle with severity level determination may occasionally agree by chance on a specific case, even though their overall coding decisions do not consistently match. Therefore, percent agreement tends to overestimate the level of agreement between coders.

Another issue with using percent agreement is that it does not consider the potential for systematic errors in coding. For example, two coders may consistently assign the same incorrect code to a medical record, leading to a high percent agreement but low accuracy. Therefore, percent agreement alone may not provide a complete picture of the reliability of medical coding.

Despite these limitations, percent agreement can still be a useful measure of IRR when used in conjunction with other statistical measures. For example, percent agreement can be used as the preliminary assessment tool when evaluating multiple coders, as it will often provide the upper bound of coder agreement. If the percent agreement is initially low, immediate actions to increase coder agreement, such as training a junior coder, should likely be prioritized over optimizing this metric for reporting purposes.

Further, using multiple metrics to assess IRR provides complementary views of the same coding data. Other IRR metrics make different assumptions about the factors that contribute to coder agreement, such as chance agreement. By using multiple measures of IRR, healthcare providers can get a more complete and accurate assessment of the reliability of their medical coding practices and take more informed actions to address problematic discrepancies.

Factoring Chance Agreement Using Cohen’s Kappa

Now, let us consider how to provide a more comprehensive view of the IRR using an additional metric called Cohen’s Kappa. Consider a scenario where two coders agree on a code for Disease X 50% of the time. One coder is experienced and instantly recognizes the code for Disease X, while the other coder knows little about the disease and chooses a code based on their best guess. While the two coders agreed half of the time, this is not a true reflection of either coder’s expertise. We would expect that the experienced coder likely applied the code correctly each time, while the inexperienced coder was essentially flipping a coin.

Cohen’s Kappa addresses this by measuring the agreement between two coders after accounting for chance agreement. As with percent agreement, the higher the value of Cohen’s Kappa, the more agreement there is between the coders. In the example above, Cohen’s Kappa would factor in the inexperience of the coder who knows little about Disease X. If that coder has a 50% chance of coding Disease X correctly based on random guessing, Cohen’s Kappa would be zero (0). This contrasts with the simpler percent agreement measure, which would show 50%. In essence, Cohen’s Kappa subtracts out the probability of random agreement to better reflect the true performance of the coders.
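For concreteness, here is a minimal sketch of the calculation for two coders, using the standard formula Kappa = (observed agreement - expected chance agreement) / (1 - expected chance agreement). The code assignments are hypothetical and mirror the Disease X scenario:

```python
from collections import Counter

def cohens_kappa(codes_a, codes_b):
    """Cohen's Kappa for two coders' assignments on the same set of records.

    Kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement and
    p_e is the agreement expected by chance, based on how often each coder
    uses each code.
    """
    n = len(codes_a)
    p_o = sum(a == b for a, b in zip(codes_a, codes_b)) / n
    freq_a, freq_b = Counter(codes_a), Counter(codes_b)
    p_e = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a.keys() | freq_b.keys())
    return (p_o - p_e) / (1 - p_e)

# The coders agree on half of the records, but that is exactly the level of
# agreement chance would produce, so Kappa is 0 even though percent agreement is 50%.
coder_a = ["Disease X", "Disease X", "Other", "Other"]
coder_b = ["Disease X", "Other", "Disease X", "Other"]
print(cohens_kappa(coder_a, coder_b))  # 0.0
```

In practice, established libraries can perform this calculation as well; scikit-learn, for example, provides a cohen_kappa_score function that computes the same statistic.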

The example above also highlights how Cohen’s Kappa provides a more conservative estimate of IRR and how percent agreement tends to overestimate agreement between multiple coders. However, while Cohen’s Kappa is more reliable in assessing IRR than percent agreement, it also has some limitations. One is that it is sensitive to the prevalence of codes in a dataset: when one code is far more common than the others, the expected chance agreement changes substantially, and the resulting Kappa value can be distorted and difficult to interpret. Additionally, Cohen’s Kappa assumes that coders are independent and have no influence on each other’s coding practices, which may not always be the case. For instance, multiple coders may work together to assign codes or labels to a set of data, which can lead to mutual influence or coding that conforms to the majority opinion.

Summary

In summary, consistent and accurate medical coding is crucial for proper billing and reimbursement, compliance with coding and billing guidelines, and ultimately, positive patient outcomes. Inter-rater reliability (IRR) measures are used to gauge the consistency and agreement between multiple coders when coding the same medical record.

Looking to improve your medical coding practices and avoid costly errors? Our team at RaLytics can offer expert data analysis and consulting services to ensure that your coding processes are accurate, compliant, and efficient. Contact us today at info@ralytics.com to learn more about how we can help you optimize your medical coding practices and maximize reimbursements.