As a library, the very high-level prioritization framework is "what would patrons find useful?" That's how we started with data.gov and federal GitHub repos as broad but principled collections; there's likely to be something in there that's useful and would otherwise get lost. Going forward, I think we'll be looking for patron stories along the lines of "if you could get this couple of TB of stuff, it would cover the core of what my research field depends on."
In practice it's some mix of: there aren't already lots of copies, it's valuable to people, and it's achievable to preserve.
> Is the second question about figuring out how to prioritize valuable stuff behind two depth traversals?
Right -- how do you look at the 300,000 entries and figure out what's not at depth one, is archivable, and is worth preserving? If we started with everything, it would be petabytes of raw datasets that probably shouldn't be at the top of the list.
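Roughly, that triage could be expressed as a scoring pass over the catalog. The sketch below is purely illustrative Python, not our actual pipeline; the field names (`copies_elsewhere`, `patron_requests`, `size_tb`) and the weighting are hypothetical stand-ins for "not already widely copied," "valuable to people," and "achievable to preserve."

```python
from dataclasses import dataclass

@dataclass
class Candidate:
    """One catalog entry being considered for preservation (hypothetical fields)."""
    name: str
    copies_elsewhere: int   # known mirrors/archives already holding the same data
    patron_requests: int    # rough proxy for "valuable to people"
    size_tb: float          # rough proxy for "achievable to preserve"

def triage_score(c: Candidate) -> float:
    """Blend the three informal criteria into a single sortable score.

    Higher is better: few existing copies, lots of patron interest,
    and a size that is realistic to store and serve.
    """
    scarcity = 1.0 / (1 + c.copies_elsewhere)   # fewer copies -> more urgent
    demand = c.patron_requests                  # more requests -> more valuable
    feasibility = 1.0 / (1 + c.size_tb)         # smaller -> easier to actually do
    return scarcity * demand * feasibility

# Hypothetical catalog entries, just to show the ordering the score produces.
catalog = [
    Candidate("agency-climate-grids", copies_elsewhere=0, patron_requests=40, size_tb=2.0),
    Candidate("raw-satellite-dump", copies_elsewhere=1, patron_requests=5, size_tb=900.0),
    Candidate("census-microdata-extract", copies_elsewhere=3, patron_requests=60, size_tb=0.5),
]

for c in sorted(catalog, key=triage_score, reverse=True):
    print(f"{c.name}: {triage_score(c):.2f}")
```

In practice the real judgment is much fuzzier than a formula, but the point is the same: rank by scarcity, demand, and feasibility rather than trying to grab everything at once.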