Aiimi Engineering on… Named Entity Recognition.
In this blog, we’ll discuss Named Entity Recognition, a powerful approach to extracting data from free text and a key step in the Aiimi Insight Engine enrichment pipelines.
At Aiimi, our mission is to connect people to insight, and often in organisations, that insight may be locked away in free text fields or documents, rather than sitting in nicely structured databases. We regularly use Named Entity Recognition (NER) techniques to add value to documents, surface information that was previously invisible, and draw connections between datasets.
What is Named Entity Recognition?
NER is a way to find and classify useful information (entities) within unstructured text – entities such as people, organisations, and locations. NER is a sequence tagging task, used to label parts of a word sequence (usually a sentence) with the location and type of entity. It’s worth noting that entities are commonly one-word long, but they can also be made up of multiple words.
How does Named Entity Recognition work?
As with many natural language problems, NER is something that Large Language Models (LLMs) are very good at. However, research into NER dates back to the 1990s, and many other techniques are available. Typically, simpler NER techniques are both cheaper to run than LLMs and provide more trustworthy answers for business purposes.
Dictionary-based NER
This method compares each word in a text to a pre-existing dictionary of entities. If a word or series of words matches one of the items in the dictionary, it’s classified as an entity. This approach is limited by the quality of your master data and is not appropriate for entity sets that may regularly change. We might use dictionary-based NER for an asset management company with a fixed list of sites with defined names that do not regularly change.
Rule-based NER
This approach is based on the observation that certain types of entities have a predictable structure. To extract these from text, we apply a series of rules that correspond to each type of entity we’re looking for – also known as pattern-matching.
This method of entity extraction is more generalisable than the dictionary-based method – we might use it to identify site names or product codes that conform to known patterns. The Aiimi Insight Engine supports several standard patterns out-of-the-box, like email addresses, phone numbers, and postcodes.
Statistical NER (Machine Learning)
The Machine Learning approach creates a probabilistic model of where the entities are likely to appear within natural language, based on previous experience. Trained with many labelled examples, this model then uses those learnings to identify entities in an unseen context.
The benefit of this approach is that providing the model is trained over a diverse set of contexts, it should be able to work on a wide range of unseen examples. The Aiimi Insight Engine comes packaged with pretrained models from the open-source community to identify people, places, and organisations using statistical models.
While it is possible to train our own models – for example, we might want to identify phone numbers belonging to customers based on context, rather than matching any 10-digit number – in practice, producing sufficient training sets can prove difficult.
Extractive AI Models
Extractive AI models work in a similar way to generative AI (LLMs) in that they require a free text prompt and are trained to understand natural language. However, unlike generative AI, extractive AI models do not produce new content; rather, they highlight sections of existing text that answer a question. As a result, extractive AI is much cheaper to run than LLMs, too. The benefit of extractive AI is that it can be used to extract any entity with the appropriate prompt. It is excellent for identifying entities that do not conform to a specific pattern and where pretrained statistical models do not exist. For example, we might use an extractive AI model to identify root cause reasons from incident reports using the prompt “What was the cause of failure?”
Large Language Models
LLMs (generative AI) represent the cutting edge in natural language processing, and while they are not specifically designed for NER, they are able to turn their hand to it. LLMs build on the capabilities of extractive AI by being able to comprehend short-hand and synonyms, rephrasing entities where required and able to understand more complex contexts. While LLMs are the most expensive NER solution and suffer from inaccuracies due to hallucinations, both their performance and cost are rapidly improving. Aiimi’s experiments with smaller LLMs (Small Language Models) show that they are particularly good at extractive tasks.
Why is Named Entity Recognition so useful?
NER can provide us with many business benefits. It’s one of the key techniques used by the Aiimi Insight Engine to extract value from information and classify documents, such as identifying sensitive files for GDPR. By associating pieces of data with documents (e.g. invoices with purchase orders), Named Entity Recognition enables users to easily navigate information without relying on search. It also allows us to extract information like people's names, geopolitical data, and organisations. Combined with technologies such as phrases and topics, it enables us to establish links between these entities and events like fraud or terrorism.
We’ve covered several types of NER in this blog: dictionary-based, rule-based, statistical, and AI. Each has its own speciality and appropriate use cases. The Aiimi Insight Engine supports all the methods discussed in this blog, and in practice, we find ourselves using a combination of all these different approaches for different entities with our customers.
Named Entity Recognition has excellent synergies with other technologies, such as phrase and topic extraction, and text classification. It also helps with:
- Document summarisation – Named entities provide additional, useful context that allows a user to quickly access the key points within a document.
- Automated pseudonymisation and anonymisation – Named entities allow us to automate this whole process and remove time spent redacting information.
- Synonym detection – We can use NER to find master data that a business didn’t know about, like suppliers, assets, or pieces of equipment.
What’s Aiimi working on in this space?
Aiimi’s current research is focused on how we can get the most out of extractive and generative AI technologies. Aiimi Insight Engine provides a standard interface for connecting to and using any open source or commercial AI model, and within that we aim to:
- Dynamically choose the most appropriate model for the task at hand
- Improve the admin experience of configuring AI models to perform NER and other tasks
- Provide tooling for our customers to finetune language models to better understand their corporate data.
We’re also exploring ways to derive greater value from extracted entities by:
- Using entities to more accurately classify documents and improve search relevancy
- Dynamically linking entities to identify synonyms
- Linking shared entities across a graph database (such as our partners Neo4J) to derive complex links between documents – for example, to help better identify fraud.
Find out more about how the Aiimi Insight Engine helps businesses classify, enrich, protect, and control data, or take a deep dive into our latest insights on Machine Learning and neural networks.
Stay in the know with updates, articles, and events from Aiimi.
Discover more from Aiimi - we’ll keep you updated with our latest thought leadership, product news, and research reports, direct to your inbox.
You may unsubscribe from these communications at any time. By submitting this form you consent to us processing and storing the information you provide in accordance with our Privacy Policy.
Enjoyed this insight? Share the post with your network.
How data hacks are driving innovation in the water industry
Love lessons: 3 tips and tricks for Databricks users