For the curious, the endpoint itself is fast mostly because it's heavily implemented with Redis. Are you interested in a writeup of how that works? If so, leave a comment and I'll consider spinning that up.
Also, why don't you do the recursive lookup internally instead of having bundler do several calls? It seems like performance would be strictly better because it saves network round trips.
This was pretty much just the first iteration; I'm fine with experimenting with going deeper (cue Inception soundclip).
I could definitely see future iterations going down more levels and caching the result as well. Right now each dependency lookup is not cached, so that might cause issues once 1.1 is actually released :)
tl;dr: The classic brute-force Ruby solution of downloading the whole list of gems and parsing it has (finally) been replaced by an intelligent way of fetching just what's needed, using the new RubyGems API that Bundler and RubyGems developed together.
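For the curious, here's a minimal sketch of what that flow looks like from the client side; the endpoint path and the shape of the response are my reading of the article, not a copy of Bundler's actual code:

    # Rough sketch: ask the API for dependency info only for the gems we care
    # about, instead of downloading and parsing the whole gem index.
    require 'net/http'
    require 'uri'

    def fetch_dependencies(gem_names)
      uri  = URI("https://rubygems.org/api/v1/dependencies?gems=#{gem_names.join(',')}")
      body = Net::HTTP.get(uri)
      # Assumed response shape: a Marshal-encoded array of hashes, one per gem
      # version, e.g. { name: "rack", number: "1.4.1", platform: "ruby",
      #                 dependencies: [["rack-cache", ">= 0"], ...] }
      Marshal.load(body)
    end

    p fetch_dependencies(%w[rails rack]).first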
That classic brute-force Ruby solution was the right call, seeing as Bundler (a) improves ad-hoc packaging dramatically and (b) has gained widespread adoption.
They did the simplest thing that could possibly work, and ended up delivering a 65% solution; now they're iterating. The hard work was getting people to adopt Bundler in the first place.
I'd rather have waited an extra 25 seconds on every bundle install over the past year than be forced to manage gem dependencies manually for another year.
A brute-force hack is always the right way to start out, and I love the Ruby community/the Ruby approach for this. Get something that works, and then optimize the internals if the project takes off.
For a solution to the RubyGems index download they needed changes to the RubyGems API, which they probably could not have brought about before becoming a successful project.
git (and hg, etc.) has already solved the problem of delivering reliable deltas of large, occasionally modified text files. Why not just replace the gem index download with that and save server CPU cycles?
I'd love to have git be a part of the solution here, or really any other solution would be awesome. Hit me up in #rubygems on IRC or email me (nick [at] quaran.to) and we can talk about it :)
Does Ruby's Marshal have the same problems that Python's pickle has? Could you construct a valid Marshal bytestring that, when loaded, would run malicious code? Is this exposing folks to a MITM attack on rubygems?
I don't know what the specific issue with pickle is, but Ruby's Marshal format is pretty bulletproof at this point. It is a data-only format with pretty strict verification of the stream as it builds the object tree.
Also, Marshal doesn't allow any kind of code to be included in the stream, so there is no way for a stream to perform remote code injection.
Marshal does call back into Ruby for non-builtin types, but it does so by simply calling a method on the constant and passing either the raw Marshal data or a previously created object tree. This provides enough protection that there haven't been any reported cases of it being exploited, and no known issues exist with it.
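To make the "data only, strictly verified" point concrete, here's a tiny illustration; nothing RubyGems-specific, just stock Marshal behaviour:

    # A Marshal stream round-trips plain state, and a truncated or garbled
    # stream is rejected while being read rather than partially interpreted.
    good = Marshal.dump(["rack", "1.4.1"])
    p Marshal.load(good)            # => ["rack", "1.4.1"]

    begin
      Marshal.load(good[0..-5])     # chop a few bytes off the end of the stream
    rescue ArgumentError => e
      puts "rejected: #{e.message}" # e.g. "marshal data too short"
    end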
I'd be willing to stake some money on a bet that they're going to regret the decision to build key Ruby infrastructure on Marshal, say, within 12 months.
Having said that, I cannot at this moment tell you how to take over a Ruby runtime with a malicious Marshal byte string.
What about the old Marshal problems? I just played with Marshal in my Rails console again and it seems it still encapsulates all sorts of implementation details. I haven't checked whether it's now compatible across Ruby versions, but I recall the Marshal format changed a few times, introducing incompatibilities.
Not sure what you mean by Marshal having implementation details in the bitstream; it doesn't. Marshal has been reimplemented in many different Ruby implementations just fine.
As for why not JSON: there is no JSON parser in the standard library, and RubyGems needs to be extremely careful about what dependencies it has.
AFAIK, Ruby's Marshal only calls internal stuff like allocate and special Marshal methods (e.g. marshal_dump and marshal_load) on classes in the Marshal data. It doesn't even actually use new or initialize to create instances, and it doesn't go through methods to set up the object. So unless you have a class that overrides Marshal hooks or Ruby internals in an insecure way, it shouldn't allow arbitrary code execution (barring buffer overflows and the like that could allow arbitrary code execution from any function).
Basically, I'm not convinced Marshal is necessarily any more risky than something like YAML would be, even though it feels scarier. But I haven't done an extensive audit or anything — I just looked over the Marshal code a while back because I was curious what it was doing.
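A toy example of those hooks (my own made-up class, nothing from RubyGems): Marshal restores the instance via allocate plus marshal_load and never touches new/initialize, so the only Ruby code that runs on load is whatever those hooks themselves do:

    # Toy class with explicit Marshal hooks; note that initialize never runs
    # during Marshal.load; the object is rebuilt via allocate + marshal_load.
    class DependencySpec
      attr_reader :name, :version

      def initialize(name, version)
        puts "initialize called"        # printed on .new, never on Marshal.load
        @name, @version = name, version
      end

      def marshal_dump
        [@name, @version]               # the only state that ends up in the stream
      end

      def marshal_load(data)
        @name, @version = data
      end
    end

    spec = DependencySpec.new("rack", "1.4.1")   # prints "initialize called"
    copy = Marshal.load(Marshal.dump(spec))      # prints nothing
    p copy.name                                  # => "rack"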
The RubyGems index needs a ton of work. It's definitely not ideal right now, this was the easiest thing we could do to get Bundler happier and speedier.
I would love to see a more apt-get like system where you cache things locally, but that has its own implications/difficulties as well.
Basically, it's hard to keep the time down from "gem push" to "gem install" if you go towards more distributed/delayed indexing systems...but I'm willing to compromise if it's way better.
That's a great question. I suppose they split it up into separate requests since downloading all of the dependencies at the same time might end up being slow and returning a lot of data, especially for something like Rails with a lot of dependencies. Or maybe separate requests was fast enough and a good trade off between speed/data and number of requests.
But bundler cannot determine which versions it needs until it has all dependencies, right? And in effect, all it does is recursively call for one "layer" of dependencies after the other, so it seems like the backend might just as well recursively query itself and return all dependencies at once.
As you have seen in the article, Bundler needs the dependencies of all versions, as it might have to deal with older releases. Building the whole dependency graph for all versions of a gem with many dependencies would a) take a very long time and b) probably be huge, likely much bigger than the whole gem index that was downloaded up until 1.1. The time required to parse megabyte after megabyte of data, most of which will be thrown away, would quickly become significant.
Another option would be to query for all dependencies of a specific version of a gem. That would reduce the amount of data, but it would still mean more work for the server, and it would produce a lot more data to cache (much of which would never be queried more than once).
With the current approach, only the data that's actually required has to be queried, and it can easily be cached on the server side too.
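To make the layer-by-layer point concrete, here's roughly what that loop looks like on the client side; the helper name and the [name, requirement] shape of the dependencies field are assumptions rather than Bundler's actual resolver code:

    # Hypothetical sketch of layer-by-layer dependency fetching: ask the API
    # about the gems you currently know about, collect any new gem names that
    # show up in their dependency lists, and repeat until nothing new appears.
    def all_dependency_info(root_gems)
      to_fetch = root_gems.uniq
      seen     = to_fetch.dup
      results  = []

      until to_fetch.empty?
        specs = yield(to_fetch)           # one API round trip per "layer"
        results.concat(specs)

        discovered = specs.flat_map { |s| s[:dependencies].map(&:first) }.uniq
        to_fetch   = discovered - seen
        seen      += to_fetch
      end

      results
    end

    # e.g. all_dependency_info(%w[rails]) { |names| fetch_dependencies(names) }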
I suspect it's a question of server resources. Dependency analysis is Bundler's job, and it's better to let thousands of CPUs run the algorithm. This leaves rubygems.org to simply service simple queries quickly without much load, even though it means there will be many queries as part of each bundle.