Redaction Image Classifier: NLP Edition
I train an NLP model to see how well it does at predicting whether an OCRed text contains a redaction or not. I run into a bunch of issues when training, leading me to conclude that training NLP models is more complicated than I'd at first suspected.
I've previously
written
about my use of fastai's vision_learner
to create a classification model that
was pretty good (> 95% accuracy) at detecting whether an image contained
redactions or not.
This week in the course we switched domains and got to know HuggingFace's
transformers
library as a
pathway into NLP (natural language processing) which is all about text inputs. I
struggled quite a bit trying to think of interesting yet self-contained / small
uses of NLP that I could try out this week. A lot of the common uses for simple
NLP modelling seem to be in the area of things like 'sentiment analysis' where I
couldn't really see something I could build. Also, there are a lot of NLP use cases which feel unethical or creepy (perhaps more so than in computer vision, it felt to me).
I emerged at the end of this thought process with the idea to try to pit image classification and text classification against one another: could I train an NLP model that would outperform my image classifier in detecting whether a specific document or page contains a redaction or not?
Of course, the first thing I had to do was to OCR all the pages in my image dataset and convert it all into a text dataset. When it comes to OCR tools, there are a number of different options available, and I'd luckily already experimented with them. (A pretty useful overview of three leading options can be found in this blogpost by Francesco Pochetti.) I went with Tesseract as I knew it had pretty good performance and accuracy for English-language documents.
My process for converting the documents wasn't particularly inspired.
Essentially I just loop over the image files one by one, run the OCR engine over
them to extract the text and then create a new .txt
file with the extracted
text. At the end, I had two folders with my data, one containing texts whose
corresponding images I knew had contained redactions, and one where there were
no redactions.
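For anyone curious, here's a minimal sketch of what that loop could look like using pytesseract (the Python wrapper around Tesseract). The image folder name and file extension are illustrative assumptions rather than my exact setup:
from pathlib import Path

import pytesseract
from PIL import Image

image_dir = Path("redaction_images")  # illustrative folder name
text_dir = Path("redaction_texts")

for image_path in image_dir.glob("**/*.jpg"):
    # Run Tesseract over the image and get back a plain-text string
    text = pytesseract.image_to_string(Image.open(image_path))
    # Mirror the redacted / not-redacted folder structure in the output
    out_path = text_dir / image_path.parent.name / f"{image_path.stem}.txt"
    out_path.parent.mkdir(parents=True, exist_ok=True)
    out_path.write_text(text)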
I had two hunches that I hoped would help my NLP model.
- I hoped that the redactions would maybe create some kind of noise in the extracted text that the training process could leverage to learn to distinguish redacted from unredacted.
- I knew that certain kinds of subjects were more likely to warrant redaction than others, so perhaps even the noise of the OCR trying to deal with a missing chunk of the image wouldn't be as important as just grasping the contents of the document.
What follows is my attempt to follow steps initially outlined in Jeremy Howard's Kaggle notebook that the course reviewed this week in the live lesson. My code doesn't depart from the original notebook much.
!pip install datasets transformers tokenizers -Uqq
from pathlib import Path
import numpy as np
import pandas as pd
I save my .txt
files on the machine and I get a list of all the paths of those files.
path = Path("redaction_texts")
p = path.glob("**/*.txt")
files = [x for x in p if x.is_file()]
I iterate through all the paths, making a list of all the extracted texts as strings.
texts = []
for file_path in files:
    with open(file_path) as file:
        texts.append(file.read())
!ls {path}
def is_redacted(path):
    "Extracts the label for a specific filepath"
    if str(path.parent).split("/")[-1] == "redacted":
        return float(1)
    else:
        return float(0)
is_redacted(files[1])
Converting a Python dict
into a Pandas DataFrame is pretty simple as long as
you provide the data in the right formats. I had to play around with this a
little when I was getting this to work.
data = {
    "input": texts,
    "labels": [is_redacted(path) for path in files],
}
df = pd.DataFrame(columns=["input", "labels"], data=data)
# df
df.describe(include='object')
We now have a DataFrame containing 3886 rows of data. You can see here that 35 rows have no visible text. Potentially something went wrong with the OCR extraction, or the redaction covered the entire image. I didn't really know what had gone wrong and didn't want to fiddle around with it too much, so I left those rows in.
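If you want to verify that count yourself, one quick way (not something from Jeremy's notebook, just a sanity check) is to count the rows whose text is empty once whitespace is stripped:
# Number of rows where the OCR produced no visible text at all
(df["input"].str.strip() == "").sum()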
We create a Dataset
object from our DataFrame. It requires that our targets
have the column name labels
.
from datasets import Dataset, DatasetDict
ds = Dataset.from_pandas(df)
ds
We're finetuning a pre-trained model here, so I start with the small version of DeBERTa, which will allow me (I hope!) to iterate quickly and come up with an initial baseline and a sense of whether this is even a viable approach to solving the problem.
model_nm = 'microsoft/deberta-v3-small'
Before we finetune our model, we have to do two things to our text data so that it works within our gradient-descent-powered training process:
- we have to tokenise our text data
- we have to turn those tokens into numbers so they can be crunched on our GPU.
Tokenisation is the process of splitting our words into shorter stubs of text --
there are varying schools of thought and use cases on the extent to which you
break the words down. We have to use the same tokenisation process that was used
by our pretrained model, so we let transformers
grab the original tokeniser
that was used with deberta-v3-small
.
from transformers import AutoModelForSequenceClassification, AutoTokenizer
tokz = AutoTokenizer.from_pretrained(model_nm)
def tok_func(x): return tokz(x["input"])
tok_ds = ds.map(tok_func, batched=True)
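Just to get a feel for what the tokeniser actually does, it's worth running it on a short string. The example sentence here is my own, and the exact sub-word pieces and IDs you get back depend on the DeBERTa vocabulary:
# Show how the tokeniser splits a sentence into sub-word pieces, and the
# integer IDs those pieces get mapped to in the pretrained vocabulary
print(tokz.tokenize("The document contains several redactions."))
print(tokz("The document contains several redactions.")["input_ids"])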
We split our data into training and validation subsets as per usual so that we know how our model is doing while training.
dds = tok_ds.train_test_split(0.25, seed=42)
dds
We define our metric as Pearson's r
AKA the Pearson correlation
coefficient, a
metric I don't have an immense instinctual understanding of, but for this blogpost it suffices to know that a higher value (up to a maximum of 1) is better.
def corr(x, y):
    return np.corrcoef(x, y)[0][1]

def corr_d(eval_pred):
    return {"pearson": corr(*eval_pred)}
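As a quick sanity check on the metric (not part of the original notebook): feeding corr two identical arrays should return 1.0, and two perfectly opposed arrays should return -1.0.
# Perfectly correlated inputs give a Pearson value of 1.0 ...
corr(np.array([0.0, 0.5, 1.0]), np.array([0.0, 0.5, 1.0]))
# ... and perfectly anti-correlated inputs give -1.0
corr(np.array([0.0, 0.5, 1.0]), np.array([1.0, 0.5, 0.0]))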
from transformers import TrainingArguments, Trainer
Here we define our batch size, the number of epochs we want to train for, and the learning rate. The defaults in Jeremy's NLP notebook were far higher than what you see here: his batch size was 128. When I first ran the cells that follow with those defaults, I hit the infamous "CUDA out of memory" error more or less immediately. I was running on a machine with a GPU that has 16GB of memory, but that apparently wasn't enough and the batch size was far too large. I had to reduce it down to 4, as you can see, in order to even be able to train the model. There are tradeoffs to this in terms of how well the model learns, but without spending lots of money on fancy machines this was the compromise I had to make.
bs = 4
epochs = 5
lr = 1e-4
args = TrainingArguments(
    "outputs",
    learning_rate=lr,
    warmup_ratio=0.1,
    lr_scheduler_type="cosine",
    fp16=True,
    evaluation_strategy="epoch",
    per_device_train_batch_size=bs,
    per_device_eval_batch_size=bs * 2,
    num_train_epochs=epochs,
    weight_decay=0.01,
    report_to="none",
)
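One option I didn't explore here, but which is worth flagging given the memory problems above: TrainingArguments also accepts a gradient_accumulation_steps parameter, which accumulates gradients over several small batches before each optimiser step, giving a larger effective batch size without holding more samples on the GPU at once. A hypothetical sketch (not the configuration I actually ran) might look like this:
# Hypothetical alternative: accumulate gradients over 32 batches of size 4,
# for an effective batch size of 128 (as in Jeremy's notebook) while only
# ever holding 4 samples on the GPU at a time.
alt_args = TrainingArguments(
    "outputs",
    learning_rate=lr,
    per_device_train_batch_size=bs,
    gradient_accumulation_steps=32,
    num_train_epochs=epochs,
    fp16=True,
    report_to="none",
)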
model = AutoModelForSequenceClassification.from_pretrained(
    model_nm, num_labels=1
)
trainer = Trainer(
    model,
    args,
    train_dataset=dds["train"],
    eval_dataset=dds["test"],
    tokenizer=tokz,
    compute_metrics=corr_d,
)
trainer.train();