The third worst thing in the world is not having any labeled data. Worse than that is having badly labeled data. But worst of all is having badly labeled data and not knowing about it.

Inter-annotator agreement is useful for quickly finding out whether you have quality problems in your data, so that you can catch badly labeled data and turn it into well-labeled data.

The Agreement Matrix

LightTag will show you an agreement matrix. Each row and column correspond to an annotator who participated in a task, and a cell has the agreement data for that pair.

The diagonal cells are always perfectly green, because everyone always agrees with themselves.

Subtle But Important Point

You might notice that the matrix in the picture isn't perfectly symmetrical. That's because agreement itself isn't symmetrical.

For example, if I say "Dog, Cat" and you say "Dog", then I agreed with you 100% of the time, but you agreed with me only 50% of the time.

In LightTag's Agreement view, the percentage is always relative to the row.

The first column of the second row says "Melanie Williams agreed with Charlene Wells 37% of the time."

The second column of the first row says "Charlene Wells agreed with Melanie Williams 35% of the time."
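The Dog/Cat example above can be sketched in a few lines of code. This is a hypothetical illustration, not LightTag's actual implementation; it assumes "X agreed with Y" means the fraction of Y's labels that X also applied, which is what makes the matrix asymmetric:

```python
def agreed_with(a_labels, b_labels):
    """Fraction of b's labels that annotator a also applied."""
    if not b_labels:
        return 0.0
    return len(a_labels & b_labels) / len(b_labels)

# The example from the text: I said "Dog, Cat", you said "Dog".
annotations = {
    "me": {"Dog", "Cat"},
    "you": {"Dog"},
}

# Build the pairwise agreement matrix: rows agree with columns.
matrix = {
    a: {b: agreed_with(annotations[a], annotations[b]) for b in annotations}
    for a in annotations
}

# I agreed with every label you made (100%), but you only agreed
# with half of mine (50%) -- so the matrix is not symmetric.
# The diagonal is always 1.0: everyone agrees with themselves.
```

Note that the denominator changes depending on which annotator is the reference, which is exactly why the two off-diagonal cells for a pair can differ.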

How Does This Help Me?

The nice thing about agreement scores is that they are simply there: you don't need to do anything else to get a sense of what is happening. Here are a few things to look for.

Individual annotator Performance (One bad row)

In a more realistic scenario you might get a matrix like this one, which shows that one annotator is performing particularly poorly (in fact it's me, and that's why I write documentation instead of labeling data).

In a case like that, you'd know you need to review that annotator's particular work and give them the training they need.

Team Confusion

Another, unfortunately common, scenario is when your matrix has the color of dying grass in the summer.

If this is what your matrix looks like, you should stop and figure out what's going on. Any model you train on such data will be confused and will underperform, and the data is too inconsistent to reasonably evaluate against.

We don't like it when you stop labeling, but we want you to put the best models you possibly can into production. So if this is what you're seeing, stop labeling, review the data, and give your team better guidelines and feedback. Talk to us and we'll help you.
