Addressing the global namespace
There is a moment in every data hoarder's life - and in every small media shop's IT history - when the first archive drive fills up. You slot in a second tape, plug in another external drive, configure a new cloud bucket. Problem solved.
Except now you have a different problem: you have no idea where anything is. Not in the "I lost the file" sense. More in the "I know I have that file somewhere across these six volumes, but I don't know which one, and I'm not sure I haven't archived it twice, and that third copy might be the old version" sense.
The data is there. The knowledge of where it is, and which copy is canonical, is not. This is the multi-volume namespace problem. It's been lurking in storage management since reel-to-reel tape in the 1960s, and the solutions to it span from "I have a spreadsheet" to "I have a $200,000 enterprise storage cluster." Most people end up somewhere in the uncomfortable middle.
Let's look at the problem properly, put some numbers on it, and walk through what people actually do - and why each approach eventually runs out of road.
The Numbers First
"Multiple volumes" sounds modest. It isn't, once you start measuring. LTO tape capacity by generation (native / 2.5:1 compressed):
| Generation | Native | Compressed | $/GB (cartridge only, approx.) |
|---|---|---|---|
| LTO-7 | 6 TB | 15 TB | ~$0.02/GB |
| LTO-8 | 12 TB | 30 TB | ~$0.015/GB |
| LTO-9 | 18 TB | 45 TB | ~$0.012/GB |
| LTO-10 | 40 TB | 100 TB | (2026, available now) |
A serious homelab data hoarder with a modest LTO-8 drive will fill a 12 TB cartridge in 6–18 months depending on what they're collecting. A four-tape rotation is a realistic minimum for anyone who takes offline copies seriously. That's 48 TB native - multiple volumes by definition.
Scale up to a small production company, a university research group, or a regional broadcaster. Now you have dozens of tapes, offline drives at an off-site location, and a cloud backend for disaster recovery. Each one a separate island of storage.
Hard drive archives have the same problem. A rotating set of 14–20 TB USB drives for cold backup is cheap and practical. It's also, over five years, eight or ten drives with no shared index.
The problem isn't finding the storage. The problem is that the more storage you buy, the worse your ability to reason about it gets - unless you build something to manage the namespace across all of it.
What "Namespace" Means in This Context
A namespace is simply a unified view of your files. On a single volume, it's trivial: mount the drive, run ls, see everything. The file system is the namespace.
The moment you have two volumes, the file system isn't enough. A file might be on either one. If both are mounted simultaneously, you have two separate trees. If one is offline, its files are invisible. If the same logical file has been archived to both volumes at different points in time, you have two physical copies with no built-in way to determine which is current.
The ideal global namespace would let you:
- List every file you've ever archived, regardless of which physical volume holds it
- Know immediately which volume a file is on, and what its current version is
- Know that you haven't accidentally archived the same content twice on different volumes
- Retrieve the right version without checking six different drives one by one
No single standard delivers all four of these, which is why the solutions below exist - and why each one is only a partial answer.
Approach 1: LTFS - A File System on Every Tape
The closest thing to a proper multi-generation standard for tape namespace management is LTFS (Linear Tape File System), adopted by the LTO Consortium in 2010 and standardized by SNIA. If you're running LTO-5 or later, you can format any cartridge as LTFS and mount it exactly like a USB drive.
The way it works is elegant. LTFS divides a tape cartridge into two partitions. Partition 0 is the index - an XML document describing your directory tree, file names, timestamps, and the exact tape block where each file's data begins. Partition 1 is the data itself.
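To make the index idea concrete, here is a minimal Python sketch that walks a hand-written, heavily simplified index document and lists each file with the block where its data starts. The element names are loosely modeled on the LTFS index format but should be treated as assumptions for illustration; a real index carries many more fields, and the LTFS specification is the authority on the actual schema.

```python
import xml.etree.ElementTree as ET

# Hand-written sample for illustration only. Element names are assumptions
# loosely modeled on an LTFS index; this is not a complete or valid index.
SAMPLE_INDEX = """
<ltfsindex>
  <directory>
    <name>projects</name>
    <contents>
      <file>
        <name>notes.txt</name>
        <length>2048</length>
        <extentinfo><extent><startblock>5</startblock></extent></extentinfo>
      </file>
      <directory>
        <name>alpha</name>
        <contents>
          <file>
            <name>final.mov</name>
            <length>1048576</length>
            <extentinfo><extent><startblock>9</startblock></extent></extentinfo>
          </file>
        </contents>
      </directory>
    </contents>
  </directory>
</ltfsindex>
"""


def walk(directory, prefix=""):
    """Yield (path, length, start_block) for every file under a <directory>."""
    path = f"{prefix}/{directory.findtext('name')}"
    contents = directory.find("contents")
    if contents is None:
        return
    for child in contents:
        if child.tag == "file":
            extent = child.find("extentinfo/extent")
            start = int(extent.findtext("startblock")) if extent is not None else None
            yield (f"{path}/{child.findtext('name')}", int(child.findtext("length")), start)
        elif child.tag == "directory":
            # Recurse into subdirectories, carrying the path built so far.
            yield from walk(child, path)


def list_files(index_xml):
    """Parse the index text and return a flat list of file records."""
    root = ET.fromstring(index_xml.strip())
    return list(walk(root.find("directory")))


if __name__ == "__main__":
    for record in list_files(SAMPLE_INDEX):
        print(record)
```

The point of the sketch is that the whole directory tree can be answered from the index alone, without touching the data partition.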
When you insert an LTFS tape and run the mount command, the LTFS software reads the index from Partition 0 into memory, and your OS presents the tape as a browsable volume:
ltfs -o devname=0 /mnt/tape
ls /mnt/tape/projects/alpha/ # rushes/finals/audio/
You can drag and drop. You can cp. Files are written sequentially to the end of the data partition and the index is updated.
It's a genuine open standard - an LTFS tape written on Linux can be read on macOS or Windows with any compatible LTFS implementation, with no proprietary backup format in the middle. For interchange and manual archiving, LTFS is excellent. If you're sending a tape to a collaborator or to a facility you've never worked with before, LTFS is the right choice. It is self-describing: the metadata travels with the media.
Where LTFS Stops
The problem is that LTFS solves the per-tape namespace and stops there. Each tape is its own island. When a tape is offline - sitting on a shelf - its files are invisible. There is no LTFS spanning standard that lets you query across multiple cartridges simultaneously. If you have four tapes and need to find a file from 2022, you are mounting tapes until you find it, or maintaining a separate external record of what's where.
LTFS also inherits tape's fundamental constraint: files can only be appended, never overwritten. If you delete a file in the LTFS index, the data blocks on tape are simply marked unavailable - the space is not recovered until you reformat the entire cartridge.
More importantly for long-lived archives: LTFS has no native versioning. The spec doesn't define what a version is or how to track one. If you archive final.mov, re-edit it, and archive it again to the same tape, the second write is a separate file entry in the same index. Version control becomes your problem.
The index partition itself has a ceiling - roughly 5–10% of tape capacity - which can become a bottleneck for archives with millions of small files.
For many users, these aren't dealbreakers for a single-tape workflow. But the moment you're managing a tape library, you need something sitting above LTFS to provide the cross-volume view. LTFS gives you a great building block. It doesn't give you a global namespace.
Approach 2: The Spreadsheet (And Its Cousins)
The most common approach in homelabs and small shops isn't LTFS or enterprise software. It's a spreadsheet. Or a text file. Or a Notion database. Or a Python script someone wrote in a weekend that generates a CSV of filenames and tape labels.
This approach is completely understandable and immediately comprehensible to everyone on the team. "Tape 003 has the 2023 raw footage. Tape 007 has the 2024 project deliverables. Check the sheet."
It also has a half-life of about six months.
The spreadsheet tracks what you intended to put on each tape at the moment you archived it. It doesn't automatically update when files change. It doesn't track versions - when you re-archive an updated file, do you add a new row? Update the existing one? Add a note? Different people make different decisions, and over time the spreadsheet becomes an archaeological record of past intentions rather than a live index.
Deduplication is purely manual. Nothing in the spreadsheet warns you that client_deliverable_FINAL_v3.mov on Tape 003 and client_deliverable_FINAL_v3_2.mov on Tape 009 are the same underlying content, archived twice by different team members six weeks apart.
Search is grep over a CSV or a Notion filter - workable for hundreds of files, painful for tens of thousands, and broken for millions.
The spreadsheet approach is a record of what someone thinks is on the tapes. The tapes themselves know what's actually on them. These two sources of truth diverge over time, and the divergence accelerates as the archive grows.
Some teams move from spreadsheets to proper databases - SQLite or Postgres - and write custom tooling to maintain the index. This is a genuine improvement. The index is now queryable, consistent, and maintainable. But custom tooling requires ownership: someone has to build it, update it when formats change, and ensure it stays in sync with the physical media. Most teams don't have that person, or lose them.
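As a rough idea of what that custom tooling can look like at its smallest, here is a Python sketch using the standard library's sqlite3 module. The table name, columns, and tape labels are all hypothetical; a real index would need more (checksums, sizes verified against the media, and so on).

```python
import sqlite3


def open_catalog(path=":memory:"):
    """Open (or create) the index. The schema is a hypothetical minimum."""
    db = sqlite3.connect(path)
    db.execute(
        """
        CREATE TABLE IF NOT EXISTS archived_files (
            path        TEXT NOT NULL,
            volume      TEXT NOT NULL,
            archived_at TEXT NOT NULL,   -- ISO 8601 date, so text order is date order
            size_bytes  INTEGER NOT NULL,
            PRIMARY KEY (path, volume, archived_at)
        )
        """
    )
    return db


def record(db, path, volume, archived_at, size_bytes):
    """Add one row per archive operation; the `with` block commits it."""
    with db:
        db.execute(
            "INSERT INTO archived_files VALUES (?, ?, ?, ?)",
            (path, volume, archived_at, size_bytes),
        )


def locate(db, path):
    """Return (volume, archived_at) for every copy of a path, newest first."""
    rows = db.execute(
        "SELECT volume, archived_at FROM archived_files "
        "WHERE path = ? ORDER BY archived_at DESC",
        (path,),
    )
    return rows.fetchall()


if __name__ == "__main__":
    db = open_catalog()
    record(db, "projects/alpha/final.mov", "TAPE003", "2023-04-01", 100)
    record(db, "projects/alpha/final.mov", "TAPE009", "2024-02-15", 120)
    print(locate(db, "projects/alpha/final.mov"))
```

Recording every archive operation as its own row, rather than updating a single row per file, sidesteps the "new row or edit the old one?" question that spreadsheets leave to individual judgment.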
Approach 3: Enterprise HSM
The "correct" industrial answer to this problem is a Hierarchical Storage Management (HSM) system. HSM has existed since IBM implemented it for mainframes in 1978. The idea is straightforward: files on fast, expensive storage automatically migrate to slower, cheaper storage based on access patterns, and transparent stub files replace them so users never notice the transition.
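The selection half of that idea, deciding which files count as cold, can be sketched in a few lines of Python. This is only the policy step under simple assumptions: it uses the file's last-access time as the signal, and it does not migrate or stub anything. Access times are not always trustworthy, since mount options such as noatime or relatime change how often they are updated, so treat the function name and threshold parameter as illustrative.

```python
import os
import time


def find_cold_files(root, max_idle_days, now=None):
    """Return files under `root` not accessed within `max_idle_days`.

    Assumption: st_atime reflects real access. On file systems mounted with
    noatime or relatime this may be stale, and a real system would need a
    more reliable signal.
    """
    now = time.time() if now is None else now
    cutoff = now - max_idle_days * 86400  # seconds per day
    cold = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            if os.stat(full).st_atime < cutoff:
                cold.append(full)
    return sorted(cold)


if __name__ == "__main__":
    for path in find_cold_files(".", max_idle_days=90):
        print("candidate for migration:", path)
```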
Modern commercial HSM products - IBM Spectrum Protect, HPE Data Management Framework, Quantum StorNext - provide exactly the global namespace you need. You have a single directory tree. Files that haven't been touched in 90 days silently move to tape. Open one, and the system recalls it transparently. The catalog tracks every file, every version, every volume.
They work. They work very well.
They're also built for organizations with dedicated storage administrators, SAN infrastructure, enterprise support contracts, and storage budgets in the low-to-mid six figures. Enterprise HSM is designed on the assumption that you have a team, a budget, and a vendor relationship.
The software itself can cost more than a homelab enthusiast's entire NAS. The hardware requirements (dedicated tape libraries, Fibre Channel HBAs, redundant metadata servers) further narrow the addressable audience.
For a small production company with four LTO drives and a NAS, enterprise HSM is a missile to kill a mosquito. And there is no meaningful open-source equivalent in this space - the gap between "hobbyist script" and "enterprise HSM" has historically been a cliff with nothing in the middle.
Approach 4: Backup Software (The Wrong Tool)
A common misconception: "I'll just use my backup software to manage my archive."
Backup software - Veeam, Bacula, Duplicati, Restic, Amanda - is excellent at what it does, which is taking point-in-time snapshots and storing them in a recoverable format. It is not designed for an archive, and the distinction matters.
A backup is a recovery mechanism. You restore from it when something goes wrong. The interface is: "restore system/file to state as of [date]." Backup software maintains its own catalog of backup sets, not a live representation of your file namespace.
An archive is a primary store. Files live there permanently, and you access them directly. The interface should be: "open this file" - and it should just work, regardless of which volume holds the bytes.
Backup software can hold your archival data. It will not give you a unified browsable namespace across multiple tapes. It will not transparently serve a file from tape when an application requests it. It will not tell you whether a file is already archived before you try to archive it again.
These are the functions a real archive system needs to provide, and backup software isn't designed to provide them.
What the Ideal Solution Would Look Like
At this point the shape of the problem is clear, and you can start to describe what a complete solution would need to provide.
Volume identity that outlives device paths. If your LTO drive moves from /dev/nst0 to /dev/nst1 because you rearranged your HBA, every restore should still work. Volume identity needs to be based on something burned into the media at format time - a UUID - not a device path or a human-assigned label.
A catalog that is separate from any single volume. You need a central index that can answer "where is this file?" without mounting every volume in your library. The catalog holds the address (volume UUID + byte offset) for every archived file.
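A small Python sketch of UUID-based addressing, under the assumption that some scan step can report which volume UUID currently sits behind which device path. The UUID value, the `Address` type, and the `resolve` helper are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Address:
    """Where a file's bytes live: stable volume identity plus an offset."""
    volume_uuid: str
    byte_offset: int


def resolve(address, attached):
    """Translate a stored address into (device_path, byte_offset).

    `attached` maps volume UUID -> current device path and is assumed to be
    rebuilt by scanning devices; it is never stored in the catalog.
    Returns None when the volume is not attached.
    """
    device = attached.get(address.volume_uuid)
    if device is None:
        return None
    return (device, address.byte_offset)


if __name__ == "__main__":
    uuid = "3f2c9a7e-0000-4000-8000-000000000001"  # made-up example value
    addr = Address(uuid, 4096)
    print(resolve(addr, {uuid: "/dev/nst0"}))  # before the HBA was rearranged
    print(resolve(addr, {uuid: "/dev/nst1"}))  # after: same address, new path
```

The stored address never changes when hardware moves; only the short-lived UUID-to-device map does.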
But also: volumes that are self-describing. The catalog is the fast path. But if you lose the catalog database, you should be able to reconstruct it from the physical media - because each volume should carry enough metadata to identify every object it contains. The catalog is an optimization, not a dependency.
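Rebuilding a lost catalog from the media might look roughly like the Python sketch below. It assumes each volume can hand back a manifest of the objects it holds; the manifest layout (a list of dicts with `path` and `offset` keys) is hypothetical, not any real on-media format.

```python
def rebuild_catalog(volumes):
    """Reconstruct path -> [(volume_uuid, byte_offset), ...] from manifests.

    `volumes` is an iterable of (volume_uuid, manifest) pairs, where each
    manifest is a list of {"path": ..., "offset": ...} dicts read from the
    volume itself. Both shapes are assumptions made for this sketch.
    """
    catalog = {}
    for volume_uuid, manifest in volumes:
        for obj in manifest:
            catalog.setdefault(obj["path"], []).append((volume_uuid, obj["offset"]))
    # Sort each location list so the result does not depend on scan order.
    return {path: sorted(locations) for path, locations in catalog.items()}


if __name__ == "__main__":
    scanned = [
        ("vol-b", [{"path": "final.mov", "offset": 0}]),
        ("vol-a", [{"path": "final.mov", "offset": 512},
                   {"path": "notes.txt", "offset": 0}]),
    ]
    print(rebuild_catalog(scanned))
```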
A namespace that is always visible, regardless of volume state. Files that are on offline tapes should still appear in your OS. You should be able to see the file, read its metadata, and - when you try to open it - get a useful message about which volume to insert, not a confusing "file not found" error.
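In code terms, the behavior described here is the difference between a generic "not found" and an error that names the volume. A minimal Python sketch, with a hypothetical `Entry` record and `VolumeOfflineError` exception:

```python
from dataclasses import dataclass


class VolumeOfflineError(Exception):
    """The file is known to the namespace, but its volume is not attached."""


@dataclass(frozen=True)
class Entry:
    path: str
    size_bytes: int
    volume_label: str  # human-readable name shown to the operator (hypothetical)


def list_namespace(entries):
    """Metadata is listed for every entry, online or not."""
    return [(e.path, e.size_bytes) for e in entries]


def open_entry(entry, online_volumes):
    """Simulate opening a file; fail with an actionable message if offline."""
    if entry.volume_label not in online_volumes:
        raise VolumeOfflineError(
            f"{entry.path} is on volume '{entry.volume_label}', "
            "which is offline; insert that volume and retry"
        )
    return f"reading {entry.path} from {entry.volume_label}"


if __name__ == "__main__":
    rushes = Entry("projects/alpha/rushes.mov", 1_000_000, "TAPE003")
    print(list_namespace([rushes]))
    try:
        open_entry(rushes, online_volumes={"TAPE007"})
    except VolumeOfflineError as err:
        print(err)
```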
Hash-based deduplication, not path-based. The system should know whether a file has already been archived by its content, not just its name. Renaming a file shouldn't trigger a redundant archive.
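A minimal Python sketch of content-based duplicate detection, using SHA-256 from the standard library. The choice of hash and the `needs_archiving` helper are assumptions for illustration; the point is only that the check looks at bytes, not names.

```python
import hashlib


def content_hash(path, chunk_size=1024 * 1024):
    """Hash a file's contents in chunks so large files never load fully into memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def needs_archiving(path, known_hashes):
    """True if no file with identical content has been archived before.

    `known_hashes` is assumed to be the set of hashes already in the catalog.
    """
    return content_hash(path) not in known_hashes
```

With this in place, a renamed copy of an already-archived file hashes to a value the catalog has seen, and can be skipped.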
Versioning as a first-class concept. When a file changes and is re-archived, both the old and new versions should be tracked and retrievable. The system should know that version 3 is on Tape C and version 1 is on Tape A, and serve the right one by default while making rollback possible on request.
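One way to picture versioning as a first-class concept is a per-file history where the newest version is the default and older ones stay addressable. The Python sketch below is hypothetical throughout (class names, tape labels, and hash strings are invented) and keeps everything in memory.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    number: int
    volume: str
    content_hash: str


class VersionedFile:
    """Ordered history of one logical file across volumes."""

    def __init__(self):
        self._versions = []

    def add(self, volume, content_hash):
        """Record a new version, unless the content matches the latest one."""
        if self._versions and self._versions[-1].content_hash == content_hash:
            return self._versions[-1]
        version = Version(len(self._versions) + 1, volume, content_hash)
        self._versions.append(version)
        return version

    def get(self, number=None):
        """Return the latest version by default, or a specific one for rollback."""
        if not self._versions:
            raise LookupError("no versions recorded")
        if number is None:
            return self._versions[-1]
        for version in self._versions:
            if version.number == number:
                return version
        raise LookupError(f"no version {number}")


if __name__ == "__main__":
    final = VersionedFile()
    final.add("Tape A", "hash-1")
    final.add("Tape B", "hash-2")
    final.add("Tape C", "hash-3")
    print(final.get())    # latest: version 3 on Tape C
    print(final.get(1))   # rollback target: version 1 on Tape A
```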
Of the approaches above, only enterprise HSM delivers all six of these properties, and only if you have an enterprise budget. LTFS gets you two or three. Spreadsheets get you one on a good day.
An Open-Source Answer
This gap - between hobbyist scripts and enterprise HSM - is where a relatively new open-source project called HuskHoard is trying to operate.
HuskHoard is a data-tiering engine built in Rust for Linux. It uses the fanotify kernel API to transparently stub cold files - replacing them with zero-byte placeholders while keeping them visible in your file system - and automatically recalls the data from the appropriate volume when an application opens the file.
It supports LTO tape drives natively (via the SCSI tape driver), flat image files, external drives, and cloud backends via rclone. The catalog is a SQLite database that tracks every archived file by its content hash, its version history, and its exact byte offset on a specific volume UUID - not a device path. Plug a drive into a different port, and the catalog heals the mapping automatically via a rescan. Volumes carry their own self-describing headers, so