Structured Data Extraction for ISAF Press Releases with Instructor
I used Instructor to get a sense of how good LLMs are at extracting data from the ISAF Press Releases dataset. They did pretty well, but not across the board.
nlp
afghanistan
datalabelling
isafpr
llms
miniproject
Author
Alex Strick van Linschoten
Published
June 2, 2024
I’m currently participating in the Maven LLM course / conference. Originally focused on finetuning LLMs, it’s since expanded to encompass a wide range of LLM-related topics. I thought I’d try to work through a small project alongside the course to get some practical experience with fine-tuning LLMs.
I previously published a dataset from my time working in Afghanistan: the ISAF Press Releases dataset. (See here for a blogpost I wrote describing the dataset in more detail.) Even though it was not really intended as a dataset to be used for any kind of model training, I thought it might serve well to finetune an LLM on top of it. The dataset is made up of press releases from the International Security Assistance Force (ISAF) and I had previously annotated them, extracting out metadata of interest.
Here’s an example:
“2013-02-21 KABUL, Afghanistan, February 21, 2013 - An Afghan and coalition security force arrested six insurgents during an operation in search of an Islamic Movement of Uzbekistan leader in Kunduz district, Kunduz province, today. The leader is allegedly instrumental in manufacturing, procuring and distributing improvised explosive devices for use in attacks against Afghan and coalition forces in the province. The security force also seized a shotgun as a result of the operation.”
From this I extracted the following metadata:
the date
the type of event (detention)
the province where the event took place (Kunduz)
the city or district where the event took place (Kunduz)
the group that was targeted in the raid (IMU)
the minimum number of people killed (0)
the minimum number of people captured (6)
the phrase used to characterise the number captured (“six”)
whether anyone was killed or not (no)
whether anyone was captured or not (yes)
whether an airstrike was conducted or not (no)
whether any leaders were arrested or not (no)
There were a few other pieces of metadata captured, but you probably get the idea. Check out the dataset card, which gives full details.
There are 4822 such events in the dataset and as you might imagine it took quite a long time to manually annotate all this data. An example like the one above is fairly straightforward but it’s sometimes unclear exactly how many people were involved. Take this press release:
“2012-12-S-025 KABUL, Afghanistan (Dec. 26, 2012) An Afghan-led security force of more than 1,000 Afghan National Security Force soldiers and policemen, concluded a five day coalition-supported operation in Baraki Barak district, Logar province, yesterday. The operation was conducted by the Provincial Response Company Laghman, along with elements of the Afghan Local Police, the Afghan Uniformed Police, and the Afghan National Army. During the operation, the Afghan-led force killed multiple insurgents and detained dozens of suspected insurgents. The security force also seized improvised explosive device materials, suicide vests, weapons, ammunition, and a quantity of illicit drugs.”
The number of people killed is specified as “multiple” and the number of those captured is specified as “dozens”. So that means a minimum of 3 killed and a minimum of 24 captured. But you have to be reading fairly closely to pick all of this up, and it gets even more complicated when they refer to multiple events in the same press release (and so on).
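To make that convention concrete, here's a tiny sketch of the kind of phrase-to-minimum mapping involved. Only the 'multiple' → 3 and 'dozens' → 24 entries come from the text above; the other entries and the helper itself are illustrative assumptions, not the exact rules I used.

# Hypothetical mapping from characterisation phrases to minimum counts.
# Only "multiple" -> 3 and "dozens" -> 24 are stated above; the other
# entries are illustrative guesses.
MIN_COUNTS = {
    "a couple of": 2,
    "several": 3,
    "multiple": 3,
    "dozens": 24,
}

def min_count(phrase: str) -> int:
    # Default to 0 when the phrase isn't recognised.
    return MIN_COUNTS.get(phrase.lower().strip(), 0)

print(min_count("multiple"))  # 3
print(min_count("dozens"))    # 24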
It occurred to me recently that it might make for an interesting test of an LLM’s ability to extract this data out of the raw text in a structured format. So ideally I’d input the press release and I’d get out a JSON object (or Pydantic or whatever) which populates all the various fields.
I’m lucky in that I’ve already lovingly labeled such a large dataset, so I can be confident in its quality, which lets me focus on the task of finetuning an LLM to do this work.
So my learning goals from this project are to:
get more familiar with the practical flow of finetuning an LLM, particularly given hardware constraints and seeing how much can be done on my local machine vs needing cloud hardware.
understand the best use cases for finetuning LLMs
get some experience with deploying the finetuned LLMs (and seeing how that compares to the popular LLMs available via API (Anthropic / OpenAI etc))
understand the limits of structured data extraction using LLMs
get a sense of the effort needed to finetune LLMs for a focused task like this
see how much can be done with local and open-source LLMs
For the project itself, I keep reading (and watching) claims that you can get GPT-4-level performance on specific, focused tasks by finetuning LLMs, and I wanted to see how much can be done with limited resources (or just how cherry-picked those public examples actually are).
I have a ‘complete’ dataset which includes press releases published after I finished working on my report, so ideally I’ll be able to use the model to label the remaining data (if it’s good enough in terms of accuracy). I’d also like to see how the speed of a finetuned model compares to using something like GPT-4.
Let’s try this out with a simple prompt and a single example to see how it performs out of the box! We load the dataset first:
# get data from datasets
from datasets import load_dataset
import pandas as pd
from rich import print
import tqdm as notebook_tqdm

# Load the dataset
dataset = load_dataset("strickvl/isafpressreleases", split="train")

# Convert the dataset to a pandas DataFrame
df = pd.DataFrame(dataset)

# Print the first few rows of the DataFrame
print(df.head())
If we look at the options for the ‘eventtype’ column, we see some single-word event types at the top, but also some semicolon-separated combinations. We’ll need to handle those a bit differently when we get to finetuning, but let’s ignore that for now.
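For reference, a quick way to see those options (the output itself isn't reproduced here):

# Peek at the distribution of event types: single-word types dominate,
# but some rows contain semicolon-separated combinations.
print(df["eventtype"].value_counts())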
We have everything we need to set up the task and the data structures that will be filled by our LLM. Let’s start with the event type, where we just create an enum to store the options.
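The full enum isn't reproduced here; as a rough sketch (with members drawn only from event types and provinces mentioned in this post, so treat the exact lists and string values as illustrative), it looks something like this. The Province enum used below follows the same pattern.

from enum import Enum

class EventType(str, Enum):
    # Illustrative subset; the real enum covers every value that appears
    # in the dataset's eventtype column.
    airstrike = "airstrike"
    detention = "detention"
    captureandkill = "captureandkill"
    insurgentskilled = "insurgentskilled"

class Province(str, Enum):
    # Likewise illustrative; the real enum lists all 34 Afghan provinces.
    kabul = "kabul"
    kunduz = "kunduz"
    khost = "khost"
    logar = "logar"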
Finally, we can create an IsafEvent, a Pydantic model where we’ll store all the various pieces of data we’re interested in. We include descriptions of the different fields, which will help our LLM understand the data it’s working with.
from datetime import date
from pydantic import BaseModel, Field

class IsafEvent(BaseModel):
    name: str = Field(description="A title or name for the event")
    start_date: date = Field(
        description="The start date of the event in YYYY-MM-DD format"
    )
    end_date: date = Field(
        description="The end date of the event in YYYY-MM-DD format"
    )
    event_type: EventType = Field(description="The event type")
    province: Province = Field(description="The province in which the event occurred")
    target_group: str = Field(
        description="The group that was targetted during the event."
    )
    min_killed: int = Field(
        description="The minimum number of people killed during the event"
    )
    min_captured: int = Field(
        description="The minimum number of people captured during the event"
    )
    killq: bool = Field(
        description="Whether someone was killed or not during the event"
    )
    captureq: bool = Field(
        description="Whether someone was captured or not during the event"
    )
    killcaptureraid: bool = Field(
        description="Whether the event was a so-called 'kill-capture raid'."
    )
    airstrike: bool = Field(
        description="Whether an airstrike was used during the event"
    )
    noshotsfired: bool = Field(
        description="Whether no shots were fired during the event"
    )
    min_leaders_killed: int = Field(
        description="The minimum number of leaders killed during the event"
    )
    min_leaders_captured: int = Field(
        description="The minimum number of leaders captured during the event"
    )

    class Config:
        arbitrary_types_allowed = True
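To have something to test against, we pull a single press release out of the DataFrame. A minimal sketch, assuming the press-release body lives in a column called text and picking an arbitrary index (the exact index used in the original notebook isn't shown here):

# Grab one press release to test with. The column name and index are
# assumptions for illustration, not the exact ones from the original notebook.
article_id = 0
article_text = df["text"][article_id]
print(article_text)

Printing it gives the press release we'll work with: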
Dec. 11: Haqqani Facilitator Detained in Khowst; Security Discussed in Farah

NEWS RELEASE ISAF Joint Command - Afghanistan 2009-12-CA-065 For Immediate Release KABUL, Afghanistan (Dec. 11) - An Afghan-international security force detained a couple of militants in Khowst province today, one of whom was a sought-after Haqqani facilitator. The facilitator is responsible for the shipment and distribution of weapons to other militant elements in the area.

The joint security force searched a compound near the village of Badal Kalay in the Nader Shakhot district where intelligence sources indicated the facilitator was located. The facilitator identified himself and surrendered without incident. No shots were fired and no one was injured.
Now we can construct a simple prompt to help guide the LLM to extract the right data:
query = f"""The following is a press release issued by ISAF (formerly operating in Afghanistan):

{article_text}

Please extract the following information from the press release:
- The name of the event
- The start date of the event
- The end date of the event
- The event type
- The province in which the event occurred
- The target group of the event
- The minimum number of people killed during the event
- The minimum number of people captured during the event
- Whether someone was killed or not during the event
- Whether someone was captured or not during the event
- Whether the event was a so-called 'kill-capture raid'
- Whether an airstrike was used during the event
- Whether no shots were fired during the event
- The minimum number of leaders killed during the event
- The minimum number of leaders captured during the event"""
Structured data extraction with Instructor
Now we can use Instructor to help ensure that our data is extracted according to our Pydantic model, starting with GPT-3.5 to see how it performs:
import instructor
from openai import OpenAI

# patch the client to add `response_model` to the `create` method
client = instructor.patch(OpenAI(), mode=instructor.Mode.MD_JSON)

openai_resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": query,
        },
    ],
    response_model=IsafEvent,
)
print(openai_resp)
As you can see, GPT-3.5 did pretty well! It was able to extract all the data more or less as I’d have hoped. It worked out that “Khowst” province in the article is “Khost” province in the schema, and it correctly determined that two individuals were detained. The only place where I would have done things differently is that it stated a minimum of one leader was captured. In this project I didn’t consider a ‘facilitator’ to be a leader, so in the original dataset this value would have been a 0:
df["minleaderscaptured"][article_id]
'0'
It’s hard for the LLM to have known that we’re taking that approach without having specified it in the prompt, but we can try again, updating the prompt with that rule:
updated_query = f"""The following is a press release issued by ISAF (formerly operating in Afghanistan):

{article_text}

Please extract the following information from the press release:
- The name of the event
- The start date of the event
- The end date of the event
- The event type
- The province in which the event occurred
- The target group of the event
- The minimum number of people killed during the event
- The minimum number of people captured during the event
- Whether someone was killed or not during the event
- Whether someone was captured or not during the event
- Whether the event was a so-called 'kill-capture raid'
- Whether an airstrike was used during the event
- Whether no shots were fired during the event
- The minimum number of leaders killed during the event
- The minimum number of leaders captured during the event

So-called 'facilitators' aren't considered leaders in this task."""
import instructor
from openai import OpenAI

# patch the client to add `response_model` to the `create` method
client = instructor.patch(OpenAI(), mode=instructor.Mode.MD_JSON)

openai_resp = client.chat.completions.create(
    model="gpt-3.5-turbo",
    messages=[
        {
            "role": "user",
            "content": updated_query,
        },
    ],
    response_model=IsafEvent,
)
print(openai_resp)
Unfortunately we’re still getting the same response. I could fiddle around with the prompt a bit more to get the result I wanted, but you can see that this approach of encoding all these edge cases into the prompt isn’t going to scale very well. It’s also going to overfit a little to the training data and really what we want is a model that can do well at any new articles we pass into it.
Even when we try Claude’s Opus model we get the same result, which tells us that we’d definitely have to amend the prompt to fix this problem.
from anthropic import Anthropic
import instructor

client = instructor.from_anthropic(Anthropic())

# note that client.chat.completions.create will also work
claude_opus_resp = client.messages.create(
    model="claude-3-opus-20240229",
    messages=[
        {
            "role": "user",
            "content": query,
        },
    ],
    max_tokens=4096,
    response_model=IsafEvent,
)
print(claude_opus_resp)
If there’s one thing the Maven conference has done well, it’s to emphasise the importance of getting a solid sense of an initial baseline from which you can (measurably) improve. So what I’d like to do next is put together a simple evaluation of the performance of the different models on this task.
Coming up with a score will be interesting as there are multiple pieces of information to compare. Numbers to numbers is an easy comparison, but what happens when it gets the wrong category? Do I subtract a mark from the score, or do I keep scores for all the different attributes? It isn’t clear to me how best to construct this score.
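One possible framing, sketched below, is to keep a separate score per field and also report an average across the fields. To be clear, this is just an illustration of the idea (and it assumes the gold annotations have been mapped onto the same field names as the Pydantic model); it's not the evaluation design I've settled on.

# Per-field scoring sketch: compare a predicted IsafEvent against a gold
# annotation field by field, then average. Assumes the gold record uses the
# same field names as the Pydantic model; illustrative only.
def score_prediction(pred: IsafEvent, gold: dict) -> dict:
    fields = [
        "event_type",
        "province",
        "min_killed",
        "min_captured",
        "killq",
        "captureq",
        "airstrike",
        "min_leaders_captured",
    ]
    # 1.0 for an exact match with the annotation, 0.0 otherwise
    scores = {field: float(getattr(pred, field) == gold[field]) for field in fields}
    scores["overall"] = sum(scores[field] for field in fields) / len(fields)
    return scores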
It also occurred to me while writing the above code that when I was doing the annotation I did so in ‘passes’. I’d read the article looking only for the numbers of killed and captured individuals and nothing else. Then I’d reread it to extract the date and location, and so on over multiple passes. This kept me focused on small details: if I tried to note everything in a single pass I’d certainly miss things or get things wrong. So extrapolating out to LLMs (which, to be clear, don’t work like humans when they read text), I’m wondering whether it might make sense to finetune multiple specialised LLMs, each really best in class at extracting one or two data points, instead of expecting a single model to do what I was unable to do (i.e. extract everything in one pass). Obviously it’d be preferable to have a single model, but I wonder whether some of the longer or more complex articles might push the limits of what one LLM can do. We’ll have to keep that in mind going forward.
The one thing I’ll have to make sure to do before I run the evaluation is to ensure that my data structures match the original dataset. You’ll remember that above we ignored the fact that the eventtype field can contain multiple types separated by a semicolon, so I’ll have to make sure my data structures handle that (see the sketch below). I’ll also improve the descriptions of the fields where possible and try to make the prompt a bit more performant.
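As a rough illustration of what that might look like (the helper name and the switch to a list-valued field are my assumptions, not the final schema), the semicolon-separated values could be split into a list, with the model then asked for a list of event types rather than a single one:

# Hypothetical sketch: split semicolon-separated event types into a list,
# which the schema could then model as `list[EventType]` instead of a single
# EventType. Not the final design.
def parse_event_types(raw: str) -> list:
    return [part.strip() for part in raw.split(";") if part.strip()]

print(parse_event_types("detention;airstrike"))  # ['detention', 'airstrike']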
In the next blog I’ll show the results of evaluating some of our baseline LLMs (proprietary and open-source) to see how they perform for this task. Once we have that baseline then we can continue to the task of actually finetuning an LLM.