
Enrichment: Accelerating classification with Document Clustering.

by Paul Maker

20 Dec 2019

If you recall my blog on Day 5 of this series, Sustainable Document Classification, I talked about some of the challenges with document classification, most of which have to do with the length of time it takes to train an accurate model.

How can Clustering help with training a model?

To train a model, we need a pre-labelled set of training data for each document class (i.e. document type – for example, invoice). Pre-labelling a set of training data means marshalling a whole lot of users to find the documents and then label them – and therein lies the problem. Users tend to have better things to do, like their day job, so asking them to undertake a task like this is unlikely to yield much success!

What if we could use the power of machine learning to have a first go at labelling the documents and take this pain away from the business and the user?

There's a branch of machine learning called unsupervised machine learning that requires no pre-labelling of data or prebuilt models. Specifically, a process called cluster analysis, or 'clustering', can take a large set of documents or data and sort them into groups, or clusters, of similar items. We can then use these clusters to create an initial document classification model and further refine it with user-driven reinforcement and feedback.

How Clustering works using TF-IDF

Logically, the primary way to differentiate documents is to look at their content, specifically their text content. The key insight we want from a document’s text is how it differs from other documents and which terms make it unique. For example, you would expect invoice documents to have a high occurrence of the word “invoice”.

This is achieved using a method called “Term Frequency – Inverse Document Frequency” or “TF-IDF” for short, which identifies how significant the occurrence of a word is in one document compared to all other documents. With this method, words that occur frequently across all documents get a low score, while words that occur frequently in a document but appear in only a small subset of documents get a high score.
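To make the scoring concrete, here is a minimal sketch of TF-IDF in plain Python. It assumes the simplest textbook form (term frequency multiplied by the log of total documents over documents containing the term) and a naive whitespace tokenizer. The function names and the four sample documents are made up for illustration; production libraries usually add smoothing and normalisation on top of this.

```python
import math
from collections import Counter


def tokenize(text):
    # Naive tokenizer (an assumption): lowercase and split on whitespace.
    return text.lower().split()


def tf_idf(documents):
    """Return one {term: score} dict per document."""
    tokenized = [tokenize(doc) for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each term appear?
    doc_freq = Counter()
    for tokens in tokenized:
        doc_freq.update(set(tokens))

    scores = []
    for tokens in tokenized:
        counts = Counter(tokens)
        total = len(tokens)
        scores.append({
            # term frequency * inverse document frequency
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return scores


# Hypothetical sample documents.
docs = [
    "the invoice total is due the invoice",
    "the invoice payment terms",
    "the meeting agenda for the project",
    "the project meeting minutes",
]
scores = tf_idf(docs)
print(scores[0])
```

In this sample, "the" appears in every document, so its inverse document frequency is log(1) and its score is zero, while "invoice" appears in only half of the documents and scores above zero wherever it occurs.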

Once we have obtained this “TF-IDF” score for each word in each document, we can run a clustering model which will group documents that have high levels of similarity between TF-IDF scores. Following the previous example, if the word “invoice” is scored highly in two documents, these two documents are more likely to be clustered together.
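As a rough illustration of that grouping step, the sketch below clusters a handful of TF-IDF-style vectors with a small k-means loop that uses cosine similarity. The choice of k-means, the seeding strategy, the function names and the example scores are all assumptions made for this sketch; the post does not say which clustering model is used in practice.

```python
import math


def cosine(a, b):
    # Cosine similarity between two sparse {term: weight} vectors.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def mean_vector(vectors):
    # Average the weights of each term across a group of vectors.
    total = {}
    for vec in vectors:
        for term, weight in vec.items():
            total[term] = total.get(term, 0.0) + weight
    return {t: w / len(vectors) for t, w in total.items()}


def cluster(vectors, k, iterations=10):
    """Assign each vector a cluster id in range(k)."""
    # Deterministic seeding: start from the first document, then keep
    # adding the document least similar to the centroids chosen so far.
    centroids = [vectors[0]]
    while len(centroids) < k:
        farthest = min(
            vectors, key=lambda v: max(cosine(v, c) for c in centroids)
        )
        centroids.append(farthest)

    labels = []
    for _ in range(iterations):
        # Assign every document to its most similar centroid.
        labels = [
            max(range(k), key=lambda i: cosine(v, centroids[i]))
            for v in vectors
        ]
        # Move each centroid to the mean of its current members.
        for i in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == i]
            if members:
                centroids[i] = mean_vector(members)
    return labels


# Hypothetical TF-IDF scores for four documents.
vectors = [
    {"invoice": 0.20, "total": 0.10},
    {"invoice": 0.17, "payment": 0.35},
    {"meeting": 0.12, "agenda": 0.23},
    {"meeting": 0.17, "minutes": 0.35},
]
labels = cluster(vectors, k=2)
print(labels)
```

The two documents that score highly for "invoice" end up with the same cluster id, and the two "meeting" documents share the other one.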

Making it easier to train a classification model

Now that we have these groups, or 'clusters' as we call them, we can use them to train a supervised classification model, which can then be used to classify documents. Before we start this training process, we may want to inspect the clusters and make any minor corrections that we see fit. To help with this, we have been working on some innovative user interfaces that simplify this process.
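Here is one way the hand-off from clusters to a supervised model could look. This is a sketch only: it assumes cluster ids have already been reviewed and given human-readable names, and it uses a tiny multinomial naive Bayes classifier as a stand-in for whatever model you would actually train. The cluster names, documents and function names are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_naive_bayes(texts, labels):
    # Count words per class and documents per class.
    word_counts = defaultdict(Counter)
    doc_counts = Counter(labels)
    vocabulary = set()
    for text, label in zip(texts, labels):
        tokens = text.lower().split()
        word_counts[label].update(tokens)
        vocabulary.update(tokens)
    return {
        "word_counts": word_counts,
        "doc_counts": doc_counts,
        "vocabulary": vocabulary,
    }


def classify(model, text):
    total_docs = sum(model["doc_counts"].values())
    vocab_size = len(model["vocabulary"])
    best_label, best_score = None, -math.inf
    for label, n_docs in model["doc_counts"].items():
        counts = model["word_counts"][label]
        total_words = sum(counts.values())
        # Log prior plus log likelihood with add-one smoothing.
        score = math.log(n_docs / total_docs)
        for token in text.lower().split():
            if token in model["vocabulary"]:  # skip unseen words
                score += math.log(
                    (counts[token] + 1) / (total_words + vocab_size)
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Cluster ids from the clustering step, named after a manual review
# (both the ids and the names are hypothetical).
cluster_names = {0: "invoice", 1: "meeting notes"}
texts = [
    "the invoice total is due the invoice",
    "the invoice payment terms",
    "the meeting agenda for the project",
    "the project meeting minutes",
]
cluster_ids = [0, 0, 1, 1]
training_labels = [cluster_names[c] for c in cluster_ids]

model = train_naive_bayes(texts, training_labels)
print(classify(model, "invoice payment due"))
print(classify(model, "project meeting notes"))
```

The point of the sketch is the data flow: the clusters supply the training labels that users would otherwise have to produce by hand.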

Another cool thing we are working on at Aiimi is how we reinforce a classification model and make it more accurate over time. For this, we are looking at how we can crowdsource classification corrections through the InsightMaker user interface. We can then take these corrections and automatically retrain the classification model in the background.
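To show the general idea of folding corrections back into the training data, here is a small sketch. It is purely illustrative and does not reflect how InsightMaker works: the document ids, the majority-vote rule and the min_votes threshold are all assumptions. A corrected label only replaces the current one when enough users agree, and the updated labels would then feed the next retraining run.

```python
from collections import Counter, defaultdict


def apply_corrections(labels, corrections, min_votes=2):
    """Return a copy of labels with well-supported corrections applied.

    labels:      {doc_id: current_label}
    corrections: iterable of (doc_id, suggested_label) pairs from users
    min_votes:   hypothetical threshold of agreeing users required
    """
    votes = defaultdict(Counter)
    for doc_id, suggested in corrections:
        votes[doc_id][suggested] += 1

    updated = dict(labels)
    for doc_id, counter in votes.items():
        label, count = counter.most_common(1)[0]
        if count >= min_votes:
            updated[doc_id] = label
    return updated


# Hypothetical current labels and user feedback.
labels = {
    "doc-1": "invoice",
    "doc-2": "invoice",
    "doc-3": "meeting notes",
}
corrections = [
    ("doc-2", "purchase order"),
    ("doc-2", "purchase order"),
    ("doc-3", "invoice"),  # a single vote, below the threshold
]
updated = apply_corrections(labels, corrections)
print(updated)
```

Requiring more than one vote is a design choice that trades speed of correction for protection against a single mistaken click.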

Through our research, we have discovered that we can dramatically reduce the time it takes to build classification models, something that has always been the Achilles heel of document classification.

I hope you found today's post interesting - perhaps it's inspired you to take the plunge and look into document classification for your organisation. If you'd like to find out more about the value of Document Classification for businesses, don't forget to check out Day 5's post - linked below.

Cheers and speak soon, Paul

If you missed my blogs in the 12 Days of Information Enrichment series, you can catch up here.

Day 1 - What is enrichment? Creating wealth from information
Day 2 - Starting at the beginning with Text Extraction
Day 3 - Structuring the unstructured with Business Entity Extraction
Day 4 - Solving the GDPR, PII and PCI problem
Day 5 - Sustainable Document Classification
Day 6 - Image Enrichment: Giving your business vision
Day 7 - Advanced Entity Extraction with Natural Language Processing
Day 8 - Understanding customers with Speech to Text translation
Day 9 - Accelerating classification with Document Clustering
Day 10 - Giving users what they need, when they need it
Day 11 - Understanding documents with Dynamic Topics
Day 12 - The power of Enrichment
