Smart archives: Google and NYT take archival content digitization to the next level with AI

It was a Saturday morning when Benjamin Havrilak, a news assistant with The New York Times, went down to the “morgue” – an area in the basement of the company’s Times Square office that houses its vast photo collection.

Havrilak had gone there to pull some photos and, to his horror, came across “a river of water”. The area was flooding fast. It was the beginning of the weekend, and had it not been for that “fortunate” accident, the flooding might have gone undiscovered for at least two days, causing much greater loss.

Although the damage was minor, it raised significant alarm and concern about how the company’s most precious assets and intellectual property could be stored safely.

The need to preserve pre-digital age content

The New York Times has around 5 to 7 million photos in its basement, jammed into hundreds of filing cabinets and drawers. They go back as far as the late 19th century. Many of them have tremendous historical value and cannot be found anywhere else in the world.

“[It’s] a treasure trove of perishable documents. A priceless chronicle of not just The Times’s history, but of nearly more than a century of global events that have shaped our modern world,” says the NYT’s chief technology officer Nick Rockwell.

Opportunities waiting to be discovered

But many of these priceless photos have not even been seen for years. Legacy publishers like The Times have a significant advantage over recent digital publishers due to their access to years of high-quality and historically valuable content. They can repackage this and sell it to both readers and advertisers, as well as use it to increase traffic. Several publishers including The Atlantic, The Economist and The New Yorker have already been doing so.

The New Yorker creates single-topic anthologies drawn from previously published material and sells them both as standalone issues and as part of subscriptions. According to former New Yorker publisher Lisa Hughes, “You’re monetizing it from the advertising side and the consumer side. You have the long tail of people buying it as a single copy. We also push it out to subscribers. And advertisers love them — they can own it entirely.”

“Everything old is new again”

In its internal 2014 Innovation report, the Times acknowledged that not using its archived content was a big missed opportunity. Now, the company intends to set that right.

The company announced recently that it will be using Google’s services to digitize the whole collection of around 5 to 7 million photos stored in its basement.

Digitization powered by AI

What sets this digitizing apart is that it will be powered by AI. Most of these photos have hand-written or printed notes attached to them indicating their location, date and in many cases, other contextual information.

These notes will be captured, categorized and added to the photo data. The AI, with its object recognition technology, can also extract additional information from the photos themselves and automatically add it to the photo data.

The front and back of a sample photo. It shows Penn Station in 1942. The clippings on the back are taken from captions in the paper. Image: Google / The New York Times

The Times’ assistant managing editor Monica Drake says, “We’ve always known that we were sitting on a trove of historical photos. Cloud technology allows us to not only preserve this archival source, but easily search and pull photos to provide even more historical context.”

The New York Times plans to use these photos to enrich reporting and provide better context to stories. It is also planning a special feature on them called Past Tense. There will likely be other initiatives down the road and it will be interesting to see how the Times’ digitization project unfolds.

“Bite the bullet”

Considering the challenges faced by publishers today, reusing old content is an idea they can’t afford to ignore. Back in 2015, the year the morgue was saved from disaster, Reuters announced that it was embarking on a project to make its historical archive footage available online.

Speaking on the occasion, Tim Redman, its head of archive, had these words of advice for publishers who had yet to digitize their archives: “Bite the bullet, plan hard and take the plunge sooner rather than later.”

Digitizing archives is also a focus for Exact Editions, the London-based specialist that has created archives (spanning multiple decades) for titles including Gramophone, The Wire, and The Numismatist.

Managing Director Daryl Rayner adds, “There is no doubt that the market is increasingly placing importance on complete archives. It is imperative to safeguard publishers’ valuable content, which will also serve as a historical cultural resource for generations to come.”