The New York Times overhauled its substantial digital archive this year, using Amazon Web Services to create, store and serve billions of small images to subscribers interested in back issues of the newspaper dating from 1851 to 1980.
The company used AWS storage when it first launched the archive, called TimesMachine, in 2008, but the new TimesMachine that launched last month was a much more intricate project, according to Evan Sandhaus, director for search, archives and semantics at The New York Times.
The original TimesMachine used a preview image that linked to a larger image of the page, where readers could see the headlines of articles but not their text. These zoomed-in images linked to PDF files that contained readable versions of the articles.
“With the new TimesMachine, we gave ourselves the challenge of, how can we create a single experience that lets you get a sense of a newspaper as a whole, but also lets you read individual articles -- all without exiting that experience?” Sandhaus said.
Small images no small undertaking
Downloading a full Sunday edition of the newspaper would require the client to pull down about 300 megabytes, overwhelming most readers’ machines.
“That’s several iTunes albums’ worth of data to show them an issue in which they’re probably interested in something very specific, like a single article,” Sandhaus said.
Instead, Sandhaus and his team took a cue from the Geographic Information System (GIS) mapping industry, which faces a similar problem with providing detailed views of large maps.
The GIS community addresses this problem through image tiling. The new TimesMachine broke its 9,000 x 7,000 pixel page images down into 256 x 256 pixel tiles computed at several different zoom levels. On the front end, it uses Leaflet, an open source JavaScript mapping library, to fetch only the tiles that correspond to the portion of the newspaper readers want to see.
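As a rough illustration of the tiling arithmetic, here is a minimal Python sketch. It is not the Times' code: the function names are hypothetical, and it assumes each coarser zoom level halves the resolution of the one before it, which the article does not specify.

```python
def ceil_div(a: int, b: int) -> int:
    """Integer division rounded up."""
    return -(-a // b)


def tile_grid(width: int, height: int, tile: int = 256) -> tuple[int, int]:
    """Columns and rows of tiles needed to cover a width x height image."""
    return ceil_div(width, tile), ceil_div(height, tile)


def pyramid(width: int, height: int, tile: int = 256) -> list[tuple[int, int]]:
    """Tile grids per zoom level, from full resolution down to a single tile.

    Assumption: each coarser level halves the previous level's resolution.
    """
    levels = []
    scale = 1
    while True:
        grid = tile_grid(ceil_div(width, scale), ceil_div(height, scale), tile)
        levels.append(grid)
        if grid == (1, 1):
            break
        scale *= 2
    return levels


def tiles_for_viewport(left: int, top: int, right: int, bottom: int,
                       tile: int = 256) -> list[tuple[int, int]]:
    """(col, row) indices of tiles overlapping a pixel rectangle at one level.

    The right and bottom edges are treated as exclusive.
    """
    cols = range(left // tile, ceil_div(right, tile))
    rows = range(top // tile, ceil_div(bottom, tile))
    return [(col, row) for row in rows for col in cols]


# Full-resolution grid for one 9000 x 7000 page scan.
print(tile_grid(9000, 7000))  # (36, 28)
```

At full resolution alone, a 36 x 28 grid is 1,008 tiles, which is in line with Sandhaus's figure of "about a thousand images for each page." A viewer only has to request the handful of tiles returned by something like `tiles_for_viewport`, rather than the whole page.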
“We started out with two and a half million images -- that’s about how many pages there are in the new TimesMachine,” Sandhaus said. “In the old TimesMachine, for each page, we computed two images, a zoomed out version and a zoomed in version, but for the new TimesMachine, we computed about a thousand images for each page.”
AWS storage, then and now
This meant that the Times’ AWS storage needs went from some five million objects to close to two and a half billion, computed with Amazon’s Elastic MapReduce service and stored in its Simple Storage Service (S3) object store.
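The object counts follow from the per-page figures Sandhaus gave. The short Python sketch below reproduces the arithmetic, using his rounded numbers as inputs; the variable names are illustrative only.

```python
# Rounded figures quoted in the article.
PAGES = 2_500_000             # "two and a half million" page scans
OLD_IMAGES_PER_PAGE = 2       # zoomed-out and zoomed-in versions
NEW_IMAGES_PER_PAGE = 1_000   # "about a thousand" tiles per page

old_objects = PAGES * OLD_IMAGES_PER_PAGE
new_objects = PAGES * NEW_IMAGES_PER_PAGE
growth = new_objects // old_objects

print(f"old: {old_objects:,}  new: {new_objects:,}  growth: {growth}x")
```

On these rounded inputs, the archive grew from about five million stored objects to about two and a half billion, a roughly 500-fold increase.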
With the old TimesMachine, the newspaper’s team had to stand up a Hadoop environment to perform the MapReduce jobs on its own, since Elastic MapReduce didn’t yet exist.
This time, “it was a lot easier because a lot of the infrastructure comes out of the box now,” Sandhaus said. “You just need to supply the parameters that are specific to your job, which dramatically lowers the number of weeks you have to spend getting your servers configured properly.”
Despite computing roughly 500 times as many image files, the MapReduce job needed only 400 c1.xlarge Elastic Compute Cloud machines, as opposed to the thousand required for the old TimesMachine, and took just three days to complete.
Putting history in context
The end result of this project is a complete digital archive of 46,592 issues of The New York Times in which all text from the original scans of the newspaper’s pages is easily readable without leaving the Web browser.
The newspaper has used this archive to add historical context to its current news stories, such as coverage of the 50th anniversary of the 1964 World’s Fair and the 50th anniversary of the introduction of the Ford Mustang. The original full-page 1964 advertisement introducing the Mustang is now clearly readable.
Overall, AWS’s newer services vastly improved the Times team’s experience creating the new TimesMachine, but there is one item on Sandhaus’s wish list: the ability to upload large numbers of files as a single archive, such as a zip file, that would then be decompressed on Amazon’s side.
“That could be a nice way of getting data into S3,” Sandhaus said. “But that’s really a minor request for us.”