# ALMa: Active Learning (data) Manager

Active Learning is a popular technique to reduce annotation costs by using AI to decide what to label next.
The subtle bookkeeping involved in keeping track of what has been labeled is tedious and error prone. We need to train our learner on **L**abeled data, but sample new examples to label from **U**nlabeled data. As we label data, it moves between the two subsets of our **D**ataset, and managing that bookkeeping is a chore that should be abstracted away.

Today we're happy to open-source ALMa, the **A**ctive **L**earning **Ma**nager, which abstracts away the bookkeeping. Whereas most implementations modify the data array in place, ALMa maintains views of the **L**abeled and **U**nlabeled subsets of the original **D**ataset.

## The Problem ALMa Solves

In a typical active learning setup we start with an array **D**ataset of examples, which naturally divides into two disjoint subsets: **U**nlabeled data and **L**abeled data. When we work with our **U**nlabeled data as an array, its indices don't correspond to the indices of our **D**ataset, and they change every time we add new labeled data.
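To make the index drift concrete, here is a minimal sketch using a made-up five-element dataset (the names `dataset` and `pool` are illustrative and not part of ALMa). Removing a single example from the pool is enough to make pool indices and dataset indices disagree.

```python
import numpy as np

# Hypothetical toy dataset: five examples, one value each.
dataset = np.array([10, 11, 12, 13, 14])
pool = dataset.copy()  # the unlabeled pool starts out as the whole dataset

# Label the example at pool index 1 (value 11) and remove it from the pool.
pool = np.delete(pool, 1)

# Pool index 1 now refers to a different example than dataset index 1.
print(pool[1], dataset[1])  # 12 11
```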

### Active Learning without ALMa involves confusing Bookkeeping

In the example below, taken from the ModAL library, the original dataset is constantly modified with hard-to-read numpy code. While it is short, it is difficult to grok and harder still to verify.

```
for index in range(N_QUERIES):
    query_index, query_instance = learner.query(X_pool)
    # What is this reshape stuff doing?
    X, y = X_pool[query_index].reshape(1, -1), y_pool[query_index].reshape(1, )
    learner.teach(X=X, y=y)
    # Confusing np.delete
    X_pool, y_pool = np.delete(X_pool, query_index, axis=0), np.delete(y_pool, query_index)
```

### Active Learning with ALMa is easy

Here's the same code using ALMa. It has a few more lines, but the bookkeeping has effectively been abstracted away.

```
for index in range(N_QUERIES):
    index_to_label, query_instance = learner.query(manager.unlabeld)
    original_ix = manager.get_original_index_from_unlabeled_index(index_to_label)
    y = original_labels_train[original_ix]
    # A single label is an (index_in_unlabeled, annotation) tuple
    label = (index_to_label, y)
    manager.add_labels(label)
    learner.teach(X=manager.labeled, y=manager.labels)
```

## ALMa's solution: Simpler Bookkeeping with Views and Offsets

ALMa uses numpy's fancy indexing to maintain views of the **D**ataset, which minimizes and simplifies the bookkeeping that needs to be done. It relies on two numpy features: indexing with *mask index arrays* to create views of the data, and the *nonzero* method to calculate index offsets for newly labeled data (which arrives indexed relative to the **U**nlabeled subset).

### Maintaining Views of the Labeled and Unlabeled data

ALMa uses numpy's mask index arrays to create "views" of **D**ata that correspond to our **L**abeled and **U**nlabeled data.

When we initialize an ActiveLearningManager, it creates a boolean array whose values are True if the corresponding example has been labeled, and False otherwise.

```
# Create a boolean array with the same length as features
self.labeled_mask = np.zeros(self.features.shape[0], dtype=bool)
```

This makes getting the **Unlabeled** indices simple:

```
@property  # accessed elsewhere as self.unlabeled_mask, without a call
def unlabeled_mask(self):
    return np.logical_not(self.labeled_mask)
```

We can then expose the views on the ActiveLearningManager as follows:

```
@property
def labeled(self):
    return self.features[self.labeled_mask]

@property
def unlabeld(self):
    return self.features[self.unlabeled_mask]
```
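As a standalone illustration of mask indexing, the sketch below builds a small hypothetical `features` array and selects the two subsets with a boolean mask. The array contents and the choice of labeled rows are made up for the example. Note that numpy's boolean indexing returns a new array rather than a memory-sharing view, so "view" here means a freshly computed selection of the original data.

```python
import numpy as np

# Hypothetical data: six examples with two features each.
features = np.arange(12).reshape(6, 2)

labeled_mask = np.zeros(features.shape[0], dtype=bool)
labeled_mask[[1, 4]] = True  # pretend examples 1 and 4 have been labeled

labeled = features[labeled_mask]
unlabeled = features[np.logical_not(labeled_mask)]
print(labeled.shape, unlabeled.shape)  # (2, 2) (4, 2)

# Boolean indexing copies: writing to the selection leaves `features` untouched.
labeled[0, 0] = -1
print(features[1, 0])  # 2
```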

### Adding New Labels

With our views in place our active learning process boils down to:

- Sample some data from the **U**nlabeled subset
- Have the annotator label them and update ALMa
- Train the learner on the updated **L**abeled subset
- Repeat

But when we sample from the **U**nlabeled subset, the examples are not indexed relative to the original dataset, so we need a way to recover the correct indices.

This would be easily solved if we had an array whose indices were the same as our **U**nlabeled data and values were the corresponding indices in the **D**ataset.

#### Mapping offsets with Numpy's nonzero method

This is actually easier done than said: numpy provides a nonzero() method that gives us exactly that.

```
import numpy as np
# Use the builtin bool: the np.bool alias was removed in NumPy 1.24
a = np.zeros(10, dtype=bool)
a.nonzero()[0]
```

`array([], dtype=int64) #Empty array`

```
a[3] = True
a.nonzero()[0]
```

`array([3]) # Maps the first True value to its index in the original array`

```
a[7] = True
a.nonzero()[0]
```

`array([3, 7]) # Maps both True values to their correct place in the original array`

#### Adding Labels With Numpy's nonzero method

ALMa holds a boolean array *labeled_mask* whose values are True when we already have a label for the example at that index. We calculate *unlabeled_mask* by taking the *logical_not* of *labeled_mask*. So calling *nonzero()* on our *unlabeled_mask* gives us a new array whose indices are the indices of our **U**nlabeled data and whose values are the indices of those examples in our **D**ataset.

Armed with that, calculating the correct offsets is simple:

```
def _offset_new_labels(self, labels_for_unlabeled_dataset: LabelList):
    if len(self._labels) == 0:
        # Nothing to correct in this case
        return labels_for_unlabeled_dataset
    labels_for_dataset: LabelList = []
    unlabeled_indices_map = self.unlabeled_mask.nonzero()[0]
    for label in labels_for_unlabeled_dataset:
        index_in_unlabeled, annotation = label
        index_in_dataset = unlabeled_indices_map[index_in_unlabeled]
        new_label: Label = (index_in_dataset, annotation)
        labels_for_dataset.append(new_label)
    return labels_for_dataset
```
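A small worked example of the same offset calculation, outside of any class, may help. The mask contents and the `"spam"` annotation are made up for illustration; only `logical_not` and `nonzero` are taken from the approach described above.

```python
import numpy as np

# Hypothetical state: six examples, of which 0 and 2 are already labeled.
labeled_mask = np.zeros(6, dtype=bool)
labeled_mask[[0, 2]] = True

# Position i in this array holds the dataset index of the i-th unlabeled example.
unlabeled_indices_map = np.logical_not(labeled_mask).nonzero()[0]
print(unlabeled_indices_map)  # [1 3 4 5]

# An annotation reported for position 1 of the unlabeled subset...
index_in_unlabeled, annotation = 1, "spam"
# ...belongs to dataset index 3.
index_in_dataset = int(unlabeled_indices_map[index_in_unlabeled])
print(index_in_dataset)  # 3
```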

And when we add one or more new labels, ALMa does:

```
def add_labels(self, labels: LabelList, offset_to_unlabeled=True):
    if isinstance(labels, tuple):  # if this is a single example
        labels: LabelList = [labels]
    elif isinstance(labels, list):
        pass
    else:
        raise Exception(
            "Malformed input. Please add either a tuple (ix,label) or a list [(ix,label),..]"
        )
    if offset_to_unlabeled:
        # Translate indices relative to the Unlabeled subset into Dataset indices
        labels = self._offset_new_labels(labels)
    self._update_masks(labels)
    for label in labels:
        self._labels[label[0]] = label[1]
```
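To see the pieces working together, here is a self-contained toy version of the idea. `TinyManager` is a hypothetical stand-in written for this example, not ALMa's actual class, and it omits the input validation and helper methods shown above. It only demonstrates that indices given relative to the unlabeled subset land on the right rows of the dataset across successive calls.

```python
import numpy as np


class TinyManager:
    """Toy stand-in for illustration only; not ALMa's actual implementation."""

    def __init__(self, features):
        self.features = features
        self.labeled_mask = np.zeros(features.shape[0], dtype=bool)
        self._labels = {}

    @property
    def unlabeled(self):
        return self.features[~self.labeled_mask]

    @property
    def labeled(self):
        return self.features[self.labeled_mask]

    def add_labels(self, labels):
        # Resolve every index against the mask as it was before this call.
        index_map = (~self.labeled_mask).nonzero()[0]
        for index_in_unlabeled, annotation in labels:
            index_in_dataset = int(index_map[index_in_unlabeled])
            self.labeled_mask[index_in_dataset] = True
            self._labels[index_in_dataset] = annotation


manager = TinyManager(np.array([[0.0], [1.0], [2.0], [3.0]]))
manager.add_labels([(1, "a")])  # unlabeled index 1 is dataset index 1
manager.add_labels([(1, "b")])  # unlabeled is now rows 0, 2, 3, so index 1 is dataset index 2
print(sorted(manager._labels.items()))  # [(1, 'a'), (2, 'b')]
print(manager.unlabeled.ravel())        # [0. 3.]
```

The dataset array itself is never modified; only the boolean mask and the label dictionary change.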

## Final Thoughts

Managing state is generally difficult and error prone, and this is true for active learning as well. By minimizing the state being mutated and working with views of the data, we can simplify the end user's experience. We hope that using ALMa will help you focus on your research or production models by freeing you up from bookkeeping. Clone ALMa here