Auto-tagging: AI for publishing beyond the hype curve

Ever since the advent of the internet, publishers and broadcast media have had their business models repeatedly upended by the pace of change. Some, such as the BBC, have acted quickly to adapt to the digital world, but many are struggling to keep up with the flux of digital publishing processes, disruption of the advertising market, platform fragmentation, and the emergence of new competitors with disruptive business models. 

They may understand the challenges facing them, and the value of further technology enhancements, but for many publishers, the question remains: how can they unlock the value of their data and content?

Many are looking to AI and data science for answers. One area which is already using AI successfully is content classification, which applies metadata – descriptive data that identifies assets and information across your business with common terms – to content automatically. 

At its best, metadata enables accurate business intelligence to support decisions on when and what to publish. It actively creates new opportunities in targeted advertising and content recommendations for users, and it improves SEO. It also enables innovation agility, because it becomes easier and faster to repurpose your content to new ends when it is described and organised consistently, and it reduces duplication of work as content moves through the publishing funnel. As Barack Moffitt, EVP Content Strategy and Operations, Universal Music, said: “We should call metadata sex, so everyone knows they need it”. The music industry has already clocked that it needs better metadata, with a metadata gap in royalties alone estimated at $2.5bn every year.

The heart of the metadata opportunity for publishers today is AI content classification. At a high level, content classification software lets publishers categorise any piece of content according to a list of terms that they use. Once the platform has been provided with enough best practice examples, it learns how to mimic that behaviour, both in terms of accuracy and style. Then, the process can be completely automated. This speeds up the publishing cycle, replacing a tedious and repetitive job for time-poor journalists. In comparison to manual tagging by the author of each piece of content, automation vastly reduces human error – no more missing out key tags to the detriment of SEO – and increases consistency, while freeing up resource to focus on your core business proposition.
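To make the learn-from-examples idea concrete, here is a deliberately tiny sketch in Python. It is not how any particular product works: the example articles, the tag names and the functions `train` and `suggest_tags` are all hypothetical, and the "learning" is nothing more than noting which words appear only in articles that carry a given tag. A production system would use far richer models, but the shape is the same: tagged examples go in, suggested tags come out.

```python
import re
from collections import defaultdict

# Hypothetical, hand-tagged "best practice" examples: (article text, tags).
EXAMPLES = [
    ("Central bank raises interest rates to curb inflation", ["economy"]),
    ("Inflation slows as interest rates bite", ["economy"]),
    ("Striker scores twice as the team wins the cup final", ["sport"]),
    ("Team manager praises striker after cup win", ["sport"]),
]


def tokenize(text):
    """Lower-case the text and return its set of distinct words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def train(examples):
    """Learn, for each tag, the words seen only in articles carrying that tag."""
    all_tags = {tag for _, tags in examples for tag in tags}
    seen_with = defaultdict(set)     # tag -> words from articles with the tag
    seen_without = defaultdict(set)  # tag -> words from articles without it
    for text, tags in examples:
        words = tokenize(text)
        for tag in all_tags:
            if tag in tags:
                seen_with[tag] |= words
            else:
                seen_without[tag] |= words
    # Keep only the words that distinguish a tag from the others.
    return {tag: seen_with[tag] - seen_without[tag] for tag in all_tags}


def suggest_tags(text, model, min_hits=2):
    """Suggest every tag with at least `min_hits` cue words in the text."""
    words = tokenize(text)
    return sorted(tag for tag, cues in model.items()
                  if len(words & cues) >= min_hits)


model = train(EXAMPLES)
print(suggest_tags("Interest rates hold steady as inflation eases", model))
print(suggest_tags("Striker injured before cup final", model))
print(suggest_tags("New virus spreads across Europe", model))
```

The first two headlines share vocabulary with the tagged examples, so they pick up a tag. The third uses words the sketch has never seen against any tag, so it gets no suggestion at all.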

Content classification systems that learn from pre-existing examples have been available for several years. However, because of the volume of best-practice examples they have required, they have not been successful for publishers in the breaking news space. On a day when big news breaks, such as when COVID-19 hit the mainstream media, an outlet might produce two fact-based articles and one, maybe two, pieces of analysis from the editorial team. Four pieces of content, even assuming that all were perfectly tagged, were not enough to teach the software how to integrate these new terms going forward. This is where AI rides to the rescue again.

Content classification systems that themselves utilise artificial intelligence, adapted for use in the newsroom, are more appropriate to dynamic domains where the focus might change quickly. In a document archive you might be happy for the classification process to take longer to learn if it enables the software to be more specific. Publishers and broadcasters, however, need automated tagging to happen at the same speed as breaking news. The role of AI is often overstated, but in this case it can really add value by doing the heavy lifting. 

For these reasons and more, it isn’t a question of ‘if’ traditional media will begin to leverage AI, but of ‘when’ and ‘how’. Now that it’s available, the answer to ‘when’ is most likely ‘soon’. The ‘how’ requires a little more thought, but like most technical deployments it comes down to three options: build, buy, or partner.

Some large publishers and broadcasters already employ data scientists and have tasked them with building their own content classification systems. This tends to be a role happily accepted by data scientists, as it has an illusion of simplicity and many resources already exist online. In reality, though, building a robust and scalable AI-driven content tagging platform is a complex project that requires extensive software engineering, a separate skill to data science, and familiarity with APIs. This often sends the total cost of ownership rocketing: data scientists on large salaries spend much more time on engineering systems than scheduled, and the project becomes a time-intensive and unwieldy digital transformation piece. This impacts other areas of publishing businesses too – if data scientists are deployed building bespoke versions of products available off the shelf, they aren’t working on products unique to your proposition or creating new revenue streams.

Although partnerships are often considered a good compromise in such situations, this approach requires the end product to be brand differentiating – which, fundamentally, AI-driven content classification systems are not.

In the financially stretched publishing and broadcast sectors, total cost of ownership and product differentiation are more important than ever. As such, publishers and broadcasters should ensure they are deploying their data scientists where they can add the most value – and not spending heavily to find out that they have reinvented the wheel.

Matt Shearer
Director of Product Innovation, Data Language

About: Based in the UK, Data Language is a software product company that specialises in using data science and knowledge graphs to solve business-critical challenges. Its Tagmatic content classification tool uses AI and machine learning to reduce the cost, time and effort associated with tagging and organising written content, speeding up behind-the-scenes processes and ensuring consistency.