The Daily San Diego

All of San Diego, every day

San Diego's Digital Archives Are Riddled With Duplicate Images — And the Numbers Tell a Costly Story

City databases and local cultural institutions are sitting on thousands of redundant digital files, draining storage budgets and slowing public access to records.

By San Diego News Desk · Published 4 July 2026, 11:39 AM

Updated 4 July 2026, 8:14 PM

How we reported this

This article was generated by AI from the linked public sources. The Daily San Diego is independently owned and covers San Diego news free from advertiser or sponsor influence. Read our editorial standards →

Photo by Samyantak Mohanty on Pexels

San Diego's public digital archives contain what are estimated to be tens of thousands of duplicate image files spread across municipal databases, library systems, and cultural repositories. The problem costs city agencies real money and undermines the reliability of public records searches. While the issue sounds like a technical housekeeping matter, the data behind it reveals something more troubling about how San Diego manages its growing digital infrastructure.

Duplicate image replacement — the systematic process of identifying, flagging, and replacing redundant digital files with single authoritative versions — has become a pressing operational concern for institutions managing large visual archives. The urgency has sharpened in 2025 and 2026 as cloud storage costs have climbed and as San Diego's various departments have pushed aggressively to digitize physical records, generating new image files at a faster rate than old ones are being audited or cleaned.

Where the Problem Lives in San Diego

The San Diego Public Library system, which operates 36 branch locations citywide including the flagship Central Library on Park Boulevard in East Village, has been working through a multi-year digitization program for historical photographs and documents. Librarians and digital archivists handling those collections have long acknowledged that redundant scans — sometimes three or four near-identical versions of the same photograph created during different scanning sessions — accumulate quickly when intake workflows lack automated deduplication checks.

The San Diego History Center in Balboa Park manages one of the region's largest photographic collections, with holdings numbering in the hundreds of thousands of images. Digital collections of that scale are particularly vulnerable to duplication because images are ingested from multiple donor sources, each arriving with its own file-naming conventions and metadata standards. Without a standardized deduplication protocol applied at the point of ingestion, matching files pile up across storage tiers.
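
For readers who want to see what a check at the point of ingestion might look like, here is a minimal sketch in Python. It is not the workflow of any institution named in this article. The class name `IngestIndex` and the file names are hypothetical, and the approach (comparing SHA-256 checksums of file contents) only catches byte-for-byte copies. It would not catch the near-identical rescans described above.

```python
import hashlib


def sha256_of(data: bytes) -> str:
    """Return the SHA-256 checksum of a file's contents as a hex string."""
    return hashlib.sha256(data).hexdigest()


class IngestIndex:
    """Hypothetical in-memory index of files already accepted into an archive."""

    def __init__(self):
        # Maps checksum -> name of the first file ingested with that content.
        self._first_seen = {}

    def ingest(self, name: str, data: bytes):
        """Record a file. Return the earlier file's name if the content is a
        byte-identical duplicate, otherwise return None."""
        digest = sha256_of(data)
        if digest in self._first_seen:
            return self._first_seen[digest]
        self._first_seen[digest] = name
        return None


# Illustrative data only: two donors submit the same bytes under different names.
index = IngestIndex()
scan = b"example image bytes"
first = index.ingest("donor_a/scan_001.tif", scan)
second = index.ingest("donor_b/IMG_4471.tif", scan)
third = index.ingest("donor_b/IMG_4472.tif", b"different image bytes")
print(first, second, third)
```

Because the check looks at contents rather than file names, differing naming conventions between donors do not hide an exact copy.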

The City of San Diego's Office of the City Clerk, which maintains official digital records under the city's records retention policies, uses a content management system that does flag some duplicate documents — but image files, particularly JPEGs and TIFFs used in planning and permitting records, fall into a category that automated flags sometimes miss. The city's IT department has not publicly released figures on the scale of duplication within those systems.

What the Numbers Actually Mean

Storage costs are the most immediate metric. Commercial cloud storage for large uncompressed TIFF image files (the format preferred by archival institutions) runs roughly $20 to $25 per terabyte per month on standard enterprise-tier platforms. A single high-resolution archival scan can run 50 to 100 megabytes. An archive holding 200,000 duplicate high-resolution files could therefore be carrying 10 to 20 terabytes of entirely redundant data. At those rates, that is a recurring cost of roughly $200 to $500 per month, or about $2,400 to $6,000 per year, for files that add nothing to the collection.
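
The arithmetic behind that estimate can be checked in a few lines of Python. This is a sketch using only the figures quoted in the paragraph above. The function name is illustrative, and it assumes decimal units (1 terabyte = 1,000,000 megabytes), which is how cloud providers typically bill.

```python
# Assumption: decimal storage units, 1 TB = 1,000,000 MB.
MB_PER_TB = 1_000_000


def redundant_storage_cost(file_count, mb_per_file, usd_per_tb_month):
    """Return (terabytes, monthly_usd, yearly_usd) for a set of redundant files."""
    terabytes = file_count * mb_per_file / MB_PER_TB
    monthly_usd = terabytes * usd_per_tb_month
    return terabytes, monthly_usd, monthly_usd * 12


# Low end: 200,000 files at 50 MB each, $20 per TB per month.
low = redundant_storage_cost(200_000, 50, 20)
# High end: 200,000 files at 100 MB each, $25 per TB per month.
high = redundant_storage_cost(200_000, 100, 25)

print(low)   # (10.0, 200.0, 2400.0)
print(high)  # (20.0, 500.0, 6000.0)
```

The low and high cases bracket the range: 10 to 20 terabytes, $200 to $500 per month, and $2,400 to $6,000 per year.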

Beyond raw storage, duplication adds noise to search results. When a researcher queries a digital archive and receives multiple near-identical versions of the same image without clear versioning labels, determining which file is the authoritative record requires manual review. A 2023 report from the Digital Preservation Coalition, a UK-based nonprofit whose guidance is widely referenced by U.S. archival institutions, found that redundant file management issues added measurable staff hours to routine retrieval workflows at mid-sized cultural institutions.

San Diego-based institutions looking to address the problem have a growing toolkit available. Perceptual hashing algorithms, which generate a fingerprint for each image based on visual content rather than file metadata, can identify near-duplicate photographs even when file names and creation dates differ. Several open-source implementations of these tools are already in use at peer institutions in the Los Angeles area.
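
To give a sense of how perceptual hashing works, here is a minimal sketch of one simple variant, often called an average hash, written in plain Python. It is an illustration under stated assumptions, not the tool any particular institution uses. It assumes each image has already been decoded into a grid of grayscale values from 0 to 255 that is at least 8 pixels on each side, and the function names and synthetic test images are made up for the example.

```python
def shrink(pixels, size=8):
    """Downscale a grayscale grid to size x size by averaging blocks.
    Assumes the grid has at least `size` rows and columns."""
    rows, cols = len(pixels), len(pixels[0])
    small = []
    for r in range(size):
        r0, r1 = r * rows // size, (r + 1) * rows // size
        row = []
        for c in range(size):
            c0, c1 = c * cols // size, (c + 1) * cols // size
            block = [pixels[y][x] for y in range(r0, r1) for x in range(c0, c1)]
            row.append(sum(block) / len(block))
        small.append(row)
    return small


def average_hash(pixels, size=8):
    """Build a fingerprint: one bit per cell, set when the cell is brighter
    than the mean of the downscaled image."""
    flat = [v for row in shrink(pixels, size) for v in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for v in flat:
        bits = (bits << 1) | (1 if v > mean else 0)
    return bits


def hamming_distance(a, b):
    """Count the bits that differ between two fingerprints."""
    return bin(a ^ b).count("1")


# Synthetic 32x32 images: a left-to-right gradient, a slightly brighter
# "rescan" of it, and an unrelated top-to-bottom gradient.
original = [[x * 8 for x in range(32)] for _ in range(32)]
rescan = [[min(255, v + 10) for v in row] for row in original]
unrelated = [[y * 8 for _ in range(32)] for y in range(32)]

near = hamming_distance(average_hash(original), average_hash(rescan))
far = hamming_distance(average_hash(original), average_hash(unrelated))
print(near, far)
```

A small distance suggests two files show the same picture even if their bytes, names, and dates differ. Where to set the cutoff is a judgment call that depends on the collection, and borderline matches would still need human review.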

For city agencies, the practical next step is auditing existing holdings before the next budget cycle. The City of San Diego's fiscal year 2028 budget process begins in earnest in early 2027, and technology line items for records management infrastructure will be under review. Institutions that can document specific storage savings from deduplication work, in concrete terabytes and dollar figures, are better positioned to make the case for dedicated staffing or software procurement. The data, in this case, makes the argument on its own.

About this article

Published by The Daily San Diego

Covering news in San Diego. This article was generated by AI from the linked sources and was not reviewed by a human editor before publishing. See our editorial standards.
