How to Build a WikiNet

There are almost 5 million articles in the English Wikipedia alone, as I wrote in a previous post (Wikipedia as a Complex Network). As you may imagine, analyzing the whole Wikipedia network requires some computational power. I wrote several functions in Mathematica to analyze a small part of this network; for those who want more details, the Mathematica notebooks are in this repository. These functions photograph the local neighborhood of a chosen Wikipedia article. First, we select an article that will be the central page of a Wikipedia subgraph, which I called a WikiNet for short. Second, we decide the depth of this subgraph, although usually depth 2 already yields a large dataset. The functions then browse and store the subgraph within two clicks of the central page. In this procedure I kept only links towards other English Wikipedia articles, excluding special pages, portals, disambiguation pages and the like.
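The crawling procedure can be sketched as a breadth-first walk over article links. This is a Python illustration, not the original Mathematica code: `get_article_links`, the toy link table and the excluded-prefix list are all my assumptions standing in for whatever the notebooks actually use.

```python
from collections import deque

# Hypothetical stand-in for the real link-fetching code: a toy link table.
# In practice this would query Wikipedia for each page's outgoing links.
TOY_LINKS = {
    "Cloud_Atlas_(film)": ["Tom_Hanks", "Wachowskis", "Portal:Film"],
    "Tom_Hanks": ["Cloud_Atlas_(film)", "Forrest_Gump"],
    "Wachowskis": ["The_Matrix"],
}

# Assumed filter: drop special pages, portals, and similar non-article links.
EXCLUDED_PREFIXES = ("Portal:", "Special:", "Category:", "Help:", "Talk:")

def get_article_links(title):
    """Return outgoing article links, skipping non-article namespaces."""
    return [t for t in TOY_LINKS.get(title, ())
            if not t.startswith(EXCLUDED_PREFIXES)]

def crawl_wikinet(center, depth=2):
    """Breadth-first crawl of the link subgraph up to `depth` clicks away."""
    edges, dist = [], {center: 0}
    queue = deque([center])
    while queue:
        page = queue.popleft()
        if dist[page] == depth:
            continue  # pages at the maximum depth are not expanded further
        for target in get_article_links(page):
            edges.append((page, target))
            if target not in dist:
                dist[target] = dist[page] + 1
                queue.append(target)
    return edges
```

With the toy table above, `crawl_wikinet("Cloud_Atlas_(film)")` collects the links of the central page and its first neighbors, reaching "The_Matrix" and "Forrest_Gump" at distance 2 while filtering out "Portal:Film".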

Initially I had a directed graph, not necessarily a tree, because some first neighbors may link to other first neighbors or back to the central page. But I treated out-degrees and in-degrees simply as degrees, without orientation. Therefore the WikiNet is a simple graph, that is, an undirected graph without self-loops and multi-edges. I then considered the central page, all its first neighbors (FN, articles one click away from the central page) and all second neighbors (SN, articles one click away from any FN).
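Concretely, turning the raw directed edge list into a simple undirected graph means dropping self-loops and merging reciprocal or duplicate links. A minimal Python sketch (the edge list here is illustrative, not real Wikipedia data):

```python
def simplify(directed_edges):
    """Directed edge list -> simple undirected graph:
    drop self-loops, collapse reciprocal and duplicate links into one edge."""
    return {frozenset(e) for e in directed_edges if e[0] != e[1]}

raw = [("A", "B"), ("B", "A"), ("A", "A"), ("B", "C"), ("B", "C")]
simple = simplify(raw)
# ("A","B") and ("B","A") collapse into one edge; the self-loop is gone
```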

Let’s consider for instance the WikiNet centered on the “Cloud_Atlas_(film)” article. This graph has N = 2023 nodes (pages) and M = 2287 edges (links). With the following instructions in Mathematica it is possible to visualize the undirected graph, where CloudGraph is a list of all its edges.

[Figure: the Cloud Atlas WikiNet, visualized in Mathematica]
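For readers without Mathematica, a rough equivalent with networkx would look like the following. This is my sketch, assuming `CloudGraph` is a list of (source, target) pairs; the toy edge list stands in for the real data.

```python
import networkx as nx

# CloudGraph would be the full extracted edge list; a toy stand-in here.
CloudGraph = [("Cloud_Atlas_(film)", "Tom_Hanks"),
              ("Cloud_Atlas_(film)", "Wachowskis"),
              ("Tom_Hanks", "Forrest_Gump")]

G = nx.Graph(CloudGraph)  # undirected simple graph, as in the post
print(G.number_of_nodes(), G.number_of_edges())
# For the real WikiNet these counts are N = 2023 and M = 2287.
# nx.draw_spring(G)  # visualize with a spring layout (needs matplotlib)
```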

The structure and topology of this graph are clearly influenced by the way these functions extract the network; this gives us a rough (but still effective) outline of the local image of the entire Wikipedia network. Which approximations are we making? For the central page and all first neighbors I extracted the full out-degree but only partial information on in-degrees. For all second neighbors I have partial in-degrees and no information on out-degrees: all those nodes packed in groups with only one link are clearly the border of the network. As I said before, I had to limit the exploration to reduce computational costs, not so much for the graph itself as for the visit-history analysis that will follow (in future posts).
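Those border nodes can be picked out mechanically: in the undirected edge list, a page whose only connection is the link that discovered it has degree 1. A small sketch (the edge list is illustrative, not the real WikiNet):

```python
from collections import Counter

def border_nodes(edges):
    """Pages with a single link: the frontier of the crawled subgraph."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return {v for v, d in degree.items() if d == 1}

toy = [("center", "fn1"), ("center", "fn2"), ("fn1", "sn1"), ("fn1", "sn2")]
border_nodes(toy)  # fn2, sn1 and sn2 each have exactly one link
```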

[Figure: degree distribution of the Cloud Atlas WikiNet]

To get a quantitative idea, let’s look at the degree distribution. This plot describes how connections are distributed in the WikiNet. There is a large number of articles with very few links (blue dots on the left) and a small number of articles with many links (on the right). The average degree in this network is about 2.2, which confirms that a significant number of articles lie on the border of our subset of Wikipedia.
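The average degree is just 2M/N; with N = 2023 and M = 2287 that gives 2·2287/2023 ≈ 2.26, consistent with the figure quoted above. Computing the distribution from an edge list can be sketched as follows (toy data, not the real WikiNet):

```python
from collections import Counter

def degree_distribution(edges):
    """Histogram {degree: number of nodes} for an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())

toy = [("c", "f1"), ("c", "f2"), ("f1", "s1"), ("f1", "s2")]
dist = degree_distribution(toy)           # {1: 3, 2: 1, 3: 1}
avg = 2 * len(toy) / sum(dist.values())   # average degree 2M/N = 8/5 = 1.6
```

In this toy graph, as in the real WikiNet, most nodes have degree 1: they are the border of the subgraph.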