
Does anyone know what the "diameter" of the wikipedia graph is? In other words, what is the longest shortest path between two wikipedia articles?


Well, that question only makes sense for connected graphs, and we don't really know whether this one is connected. A version of the question that does make sense in general: among all the connected components of the Wikipedia graph, what is the largest diameter?


This is an interesting question that I'd like to answer now that I have all the data. I am curious to see how long it will take to find a solution as I believe even the most efficient algorithms for this have a high runtime complexity.

And yes, the graph is not connected (there are nodes with no outgoing links and nodes with no incoming links), but over 99% of the pages are in one connected component, so the answer would still be interesting and worthwhile.


Another interesting question would be what the largest components are that are disconnected from the "main" one. Are there any larger than 1 or 2 nodes? Is there some set of pages on some obscure topic that only link among themselves?
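One way to answer this is to compute the weakly connected components of the link graph (ignoring link direction) and look at their sizes. A minimal sketch in Python, using a hypothetical toy edge list in place of the real Wikipedia link dump:

```python
from collections import defaultdict, deque

def weakly_connected_components(edges):
    """Group nodes into weakly connected components (link direction ignored)."""
    adj = defaultdict(set)
    nodes = set()
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)  # treat each link as undirected for connectivity
        nodes.update((u, v))
    seen = set()
    components = []
    for start in nodes:
        if start in seen:
            continue
        # BFS to collect everything reachable from this start node
        comp = set()
        queue = deque([start])
        seen.add(start)
        while queue:
            n = queue.popleft()
            comp.add(n)
            for m in adj[n]:
                if m not in seen:
                    seen.add(m)
                    queue.append(m)
        components.append(comp)
    return sorted(components, key=len, reverse=True)

# Toy link graph: a "main" cluster of 3 pages plus an isolated pair
edges = [("A", "B"), ("B", "C"), ("C", "A"), ("D", "E")]
print([len(c) for c in weakly_connected_components(edges)])  # [3, 2]
```

Everything outside the largest component is then exactly the set of pages disconnected from the "main" graph.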


Floyd-Warshall solves all-pairs shortest paths in O(V^3) time. Running a BFS rooted at every node finds all-pairs shortest paths in an unweighted graph in O(V*(V + E)) total.
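The BFS-per-node approach can be sketched in a few lines: each BFS yields one node's eccentricity (its longest shortest path), and the diameter is the maximum over all nodes. A minimal sketch, assuming a toy adjacency dict rather than the real graph:

```python
from collections import deque

def bfs_eccentricity(adj, source):
    """Longest shortest-path distance from source, within its component."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

def diameter(adj):
    """One BFS per node: O(V*(V+E)) for an unweighted graph."""
    return max(bfs_eccentricity(adj, s) for s in adj)

# Toy path graph 0-1-2-3: the two endpoints are 3 hops apart
adj = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
print(diameter(adj))  # 3
```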


Hence, since there's a whole lot of Vs in the wikipedia graph, he's probably going to be satisfied with an approximate solution, unless he has a lot of CPU hours to spare.


There are 5,579,252 articles in English Wikipedia right now, which means the largest component has 5 million and change vertices. Each individual BFS should take a few CPU-seconds at most, and you can trivially parallelize the per-node BFS runs across as many CPUs as you have.
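Since each BFS is independent, the parallelization really is trivial: fan the source nodes out over a process pool and take the max of the results. A sketch with `multiprocessing.Pool`, again using a hypothetical toy adjacency list in place of the real graph:

```python
from collections import deque
from multiprocessing import Pool

# Toy adjacency list standing in for the Wikipedia link graph
ADJ = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}

def eccentricity(source):
    """BFS from one node; each call is independent, so Pool.map can fan them out."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        for v in ADJ[u]:
            if v not in dist:
                dist[v] = dist[u] + 1
                queue.append(v)
    return max(dist.values())

if __name__ == "__main__":
    with Pool() as pool:
        eccentricities = pool.map(eccentricity, list(ADJ))
    print(max(eccentricities))  # the diameter: 3 for this toy graph
```

For the real graph you'd load the adjacency list once per worker (or use shared memory) rather than pickling it per task, but the structure of the computation is the same.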


Yes, check this out: http://mu.netsoc.ie/wiki/

> Several people were asking about what's known as the "diameter" of Wikipedia, that is, the distance between the two articles furthest apart (the longest shortest path if that makes any sense to you). This was in fact the original goal of the project but it turned out not to be very interesting. Wikipedia has giant "tails", almost linear linked lists of articles that stretch out for 70 links. The worst offenders were the subpages of List of named asteroids as each is only linked from the previous one, and it takes about 70 links to get from anywhere to the last one. So, you find that the two articles furthest from each other are List of asteroids/145701-145800, linked to by List of asteroids/145601-145700, linked to by List of asteroids/145501-145600, and so on for 70 links until you get back to some vaguely normal article. This is far less interesting than I was hoping. Even when I special-cased out that string of 70 boring articles, a new one appeared (I think it was linked pages about administrations of Myanmar or something). Rather than special-casing out reams of articles, I decided to pick a different metric, one that better reflects the structure of wikipedia without arbitrarily removing parts of it.

It appears that that particular tail has since been "cut", however.



