Category Archives: Wikipedia

Visits Time Series of a WikiNet

11 February 2015 | All Projects, Networks, Time Series, Wikipedia | English Wikipedia, exogenous event, movie release, network, page popularity, page visits, time series, wikipedia | levantina

I built a WikiNet (a subset of Wikipedia) starting from the article of a recently released movie, and I considered the visit history of all articles in the WikiNet for one year before and one year after the movie release. We will see how an exogenous event, the movie release, influences the number of visits to the articles in the network.

I chose a well-known movie, Iron Man 3 (released on May 3rd 2013 in the USA), in order to have a lot of visitors and a lot of links in the network. I extracted the local neighborhood in Wikipedia as I described in a previous post and obtained an undirected graph. From this graph I considered the central page, all its 461 first neighbors (FN) and a random selection of 1 000 effective second neighbors (SN), that is, second neighbors that are not already first neighbors. I had to restrict my analysis to a subset of articles because of computational limitations: downloading and processing time series for 24 months and for around 4 000 articles was far beyond my available time and computing power. Nevertheless, my analysis is performed on a significant data set.
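The selection of second neighbors described above can be sketched in Python as follows. This is only an illustration under my own assumptions, not the original Mathematica code: the function name, the fixed seed and the toy page names are all hypothetical.

```python
import random

def sample_second_neighbors(first_neighbors, second_candidates, k, seed=0):
    """Pick k pages that are second neighbors but not already first neighbors."""
    fn = set(first_neighbors)
    # Keep only "effective" second neighbors: candidates that are not FN.
    effective = sorted(set(second_candidates) - fn)
    rng = random.Random(seed)  # fixed seed (an assumption) for reproducibility
    return rng.sample(effective, min(k, len(effective)))

# Toy data: B and C show up as second-neighbor candidates but are already FN.
fn = ["A", "B", "C"]
candidates = ["B", "D", "E", "F", "C", "G"]
picked = sample_second_neighbors(fn, candidates, k=3)
```

With the real data, `k` would be 1 000 and `first_neighbors` would hold the 461 FN pages.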

I considered a time window of 24 months, and for each article I took the weekly average of the visits and its logarithm. First, I had to remove the weekly pattern of Wikipedia articles described in this post, which is a global and regular fluctuation; averaging over each week does that. Second, I had to reduce the differences among articles, because their typical average number of visits can be very different; taking the logarithm is a valid choice in cases like this.
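As a rough sketch of this preprocessing step (not the original Mathematica code), the weekly averaging and the logarithm could look like this in Python. The base-10 logarithm and the choice to drop an incomplete final week are my assumptions; the post does not specify either.

```python
import math

def weekly_log_average(daily_visits):
    """Average daily counts over consecutive 7-day blocks, then take log10."""
    weeks = []
    # Step through complete weeks only; an incomplete last week is dropped.
    for start in range(0, len(daily_visits) - 6, 7):
        block = daily_visits[start:start + 7]
        mean = sum(block) / 7
        # log10 is an assumption; a week with zero visits maps to -inf.
        weeks.append(math.log10(mean) if mean > 0 else float("-inf"))
    return weeks

# Two synthetic weeks: 10 visits per day, then 1000 visits per day.
example = weekly_log_average([10] * 7 + [1000] * 7)
```

Two pages whose averages differ by a factor of 100 end up only 2 units apart on the logarithmic scale, which is what makes articles of very different popularity comparable.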

How to Build a WikiNet

10 February 2015 | All Projects, Networks, Wikipedia | English Wikipedia, network, topology, wikipedia | levantina

There are almost 5 million articles in English Wikipedia alone, as I wrote in a previous post (Wikipedia as a Complex Network). As you may imagine, analyzing the whole Wikipedia network requires considerable computational power, so I wrote several functions in Mathematica to analyze a small part of this network. For those who want more details, the Mathematica notebooks are in this repository here. These functions take a snapshot of the local neighborhood of a chosen Wikipedia article. First, we select an article that will be the central page of a Wikipedia subgraph, which I called a WikiNet to make it a bit shorter. Second, we decide the depth of this subgraph; usually depth 2 already gives a large dataset. The functions then browse and store the current subgraph within two clicks of the central page. In this procedure I selected only links towards other English Wikipedia articles, excluding special pages, portals, disambiguation pages and the like.
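The exploration described above is essentially a depth-limited breadth-first crawl. Here is a minimal Python sketch of the idea, not the Mathematica functions from the repository. The `get_links` callable is a hypothetical stand-in for whatever returns the outgoing article links of a page, and the toy link table is invented.

```python
from collections import deque

def crawl_wikinet(center, get_links, depth=2):
    """Breadth-first crawl: collect directed links within `depth` clicks of center."""
    distance = {center: 0}
    edges = []
    queue = deque([center])
    while queue:
        page = queue.popleft()
        if distance[page] == depth:
            continue  # border page: its outgoing links are not explored
        for target in get_links(page):
            edges.append((page, target))
            if target not in distance:
                distance[target] = distance[page] + 1
                queue.append(target)
    return distance, edges

# Toy link table standing in for Wikipedia (hypothetical data).
toy_links = {
    "Center": ["A", "B"],
    "A": ["Center", "C"],
    "B": ["A", "D"],
    "C": ["E"],  # never followed, because C is already at depth 2
}
distance, edges = crawl_wikinet("Center", lambda page: toy_links.get(page, []))
```

Pages at the maximum depth are stored but never expanded, which is why the outermost pages of a WikiNet carry incomplete link information.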

Initially I had a directed graph, not necessarily a tree, because some first neighbors might link to other first neighbors or back to the central page. However, I treated outdegrees and indegrees simply as degrees, without orientation. Therefore the WikiNet is a simple graph, that is, an undirected graph without self-loops and multi-edges. I then considered the central page, all its first neighbors (FN, articles one click away from the central page) and all second neighbors (SN, articles one click away from any FN).
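To make the reduction to a simple graph concrete, here is a small Python sketch (my own illustration, with a hypothetical function name and toy edges): orientation is dropped by storing each link as an unordered pair, which also removes duplicate edges, and self-loops are skipped.

```python
def to_simple_graph(directed_edges):
    """Drop orientation, self-loops and duplicate edges."""
    simple = set()
    for source, target in directed_edges:
        if source == target:
            continue  # self-loop: a page linking to itself
        # frozenset({a, b}) == frozenset({b, a}), so A->B and B->A collapse.
        simple.add(frozenset((source, target)))
    return simple

directed = [("A", "B"), ("B", "A"), ("A", "A"), ("A", "C"), ("A", "C")]
undirected = to_simple_graph(directed)
```

In this toy example five directed links reduce to two undirected edges, A-B and A-C.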

Let’s consider, for instance, the WikiNet centered on the “Cloud_Atlas_(film)” article in Wikipedia. This graph is formed by N = 2023 nodes (pages) and M = 2287 edges (links). With the following instructions in Mathematica it is possible to visualize the undirected graph, where CloudGraph is a list of all edges.

Cloud Atlas WikiNet

The structure and topology of this graph are clearly influenced by the way these functions extract the network, which gives us a rough (but still effective) outline of the local image of the entire Wikipedia network. Which approximations are we making? For the central page and all first neighbors I extracted the correct outdegree but only partial information on the indegree. For all second neighbors I have partial indegrees and no information on the outdegrees; the nodes packed in groups with only one link, for instance, are clearly the border of my network. As I said before, I had to limit the exploration to reduce computational costs, not so much for the graph itself as for the visit history analysis that will follow (in future posts).

Degree distribution of the Cloud Atlas WikiNet

To get a quantitative idea, let’s look at the degree distribution. This plot describes how connections are distributed in this WikiNet. There is a large number of articles with very few links (blue dots on the left) and a small number of articles with many links (on the right). The average degree in this network is about 2.2, which confirms that a significant number of articles sit on the border of our subset of Wikipedia.
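For readers who want to reproduce these numbers, here is a short Python sketch (not the original Mathematica code; function names and the toy graph are mine). It counts how many nodes have each degree and computes the average degree as 2M/N, since every undirected edge contributes to the degree of two nodes.

```python
from collections import Counter

def degree_distribution(edges):
    """Return {degree: number of nodes with that degree} for an undirected edge list."""
    degree = Counter()
    for a, b in edges:
        degree[a] += 1
        degree[b] += 1
    return Counter(degree.values())

def average_degree(n_nodes, n_edges):
    """Each undirected edge adds one to the degree of two nodes."""
    return 2 * n_edges / n_nodes

# Toy graph: one hub, three neighbors, plus one extra link between two of them.
toy_edges = [("hub", "a"), ("hub", "b"), ("hub", "c"), ("a", "b")]
toy_distribution = degree_distribution(toy_edges)

# N and M quoted in the text for the Cloud Atlas WikiNet.
cloud_average = average_degree(2023, 2287)
```

With N = 2023 and M = 2287 this gives 2 × 2287 / 2023 ≈ 2.26, consistent with the average degree of about 2.2 quoted above.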

Wikipedia Visits Global Behavior

6 February 2015 | All Projects, Time Series, Wikipedia | English Wikipedia, page popularity, page visits, seasonal effect, time series, weekly pattern, wikipedia | levantina

I tried to understand and characterize Wikipedia visits time series (see also the post Wikipedia Visits Time Series). I analyzed a random sample of English Wikipedia articles: a first set of 1 000 pages and a second set of 10 000 pages. I downloaded and cleaned their visit history for 12 months (first set) and for 2 months (second set). Looking at the data, I observed two really interesting global behaviors with two different time scales. Let’s look at the first data set. In the following plot we have the average daily visits X ( t ) over a year, that is, for each day t, the mean over all the N pages in the set.
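The quantity X ( t ) defined above is a plain mean over pages, day by day. A minimal Python sketch follows, assuming the data are stored as one list of daily counts per page, all of the same length; this layout and the names are my assumptions.

```python
def average_daily_visits(visits_by_page):
    """X(t): for each day t, the mean visit count over all N pages."""
    series = list(visits_by_page.values())
    n_pages = len(series)
    n_days = len(series[0])  # assumes every page covers the same days
    return [sum(page[t] for page in series) / n_pages for t in range(n_days)]

# Two toy pages over three days (invented numbers).
toy = {"page_1": [10, 20, 30], "page_2": [30, 40, 50]}
x_t = average_daily_visits(toy)
```

Note that a plain mean like this is dominated by the most visited pages in the sample.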

The first global behavior has a scale of a few months: a global decrease of visits during the summer months and the Christmas days. This low-frequency fluctuation can be interpreted as a seasonal effect. It seems reasonable that people visit Wikipedia less regularly during the summer months. There is also a significant average decrease of visits around Christmas (red tick mark in the plot). This is also reasonable, since the majority of readers are located in the Northern Hemisphere and the Western World (according to Wikimedia Statistics, the main origins of visitors are: US 36%, UK 10.8%, Canada 6%, India 5%, Australia 3.3%, Germany 2.0%, Philippines 1.7%, Brazil 1.1%, Netherlands 1.1%, France 1.0%, Sweden 1.0%, Italy 0.9% … ). This can explain the 15% decrease of visits during the Christmas holidays and the summer months.

Daily visits of English Wikipedia pages from 08/2012 to 07/2013. Average value over a set of 1000 random articles.

What about these high-frequency fluctuations? The second global behavior has a scale of a few days. Let’s look at these fluctuations in a smaller time window. In the second plot we can see the average daily visits (normalized this time) over a set of 10 000 random articles, for only 60 days. The blue and the dark red lines are the data (two subsets). The light red line is a sine function with a period of 7 days.

Daily visits for 60 days, average value for 10000 random English Wikipedia articles. In red a sine function of period 7.

I interpreted this result as a weekly pattern; in fact, the periodic minima correspond to weekend days. This means that Wikipedia is usually consulted more during working days! Did you expect that?
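One simple way to check for a 7-day period numerically, instead of overlaying a sine curve by eye, is to look at the autocorrelation of the series. The Python sketch below is not the analysis used for the plot. It runs on synthetic data, and the zero-mean, unit-variance normalization is my assumption, since the post does not say how the series was normalized.

```python
import math

def normalize(series):
    """Rescale to zero mean and unit standard deviation (one possible normalization)."""
    mean = sum(series) / len(series)
    std = math.sqrt(sum((x - mean) ** 2 for x in series) / len(series))
    return [(x - mean) / std for x in series]

def autocorrelation(series, lag):
    """Average product of the normalized series with itself shifted by `lag` days."""
    z = normalize(series)
    pairs = zip(z[:-lag], z[lag:])
    return sum(a * b for a, b in pairs) / (len(z) - lag)

# Synthetic 60-day series with a 7-day sine component (not real Wikipedia data).
synthetic = [100 + 10 * math.sin(2 * math.pi * t / 7) for t in range(60)]

# The lag with the highest autocorrelation estimates the period.
best_lag = max(range(1, 11), key=lambda lag: autocorrelation(synthetic, lag))
```

On this synthetic series the autocorrelation peaks at a lag of 7 days; a real series with a weekly pattern should show a peak at the same lag, only less sharp.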

Wikipedia Visits Time Series

5 February 2015 | All Projects, Time Series, Wikipedia | English Wikipedia, page popularity, page visits, time series, wikipedia | levantina

Wikipedia is currently maintained by a non-profit organization called the Wikimedia Foundation, which has the goal of encouraging the growth, development and distribution of free, multilingual content. Because of this goal, Wikimedia makes the content of its wikis available not only through specific websites but also through dumps of their whole data that can be downloaded by end users (here). Each request for a page, whether for editing or reading, is collected and stored for several language editions of Wikipedia. These data are available in files of 90 to 120 MB for every hour of the day, which means that a whole day can amount to about 3 GB. A simpler way to access these data, since we often have computational limitations, is the Mitzuas’s project, which provides the daily visit history of a given Wikipedia article since December 2007. For every page there is a file in JSON format for each requested month. There is also a nice tool that allows you to visualize the visit history of a page (90 consecutive days at most).
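The daily size estimate above follows directly from the hourly file sizes, taking 24 files per day and 1 GB = 1000 MB:

```latex
24 \times 90\,\mathrm{MB} = 2160\,\mathrm{MB} \approx 2.2\,\mathrm{GB},
\qquad
24 \times 120\,\mathrm{MB} = 2880\,\mathrm{MB} \approx 2.9\,\mathrm{GB}
```

So a full day of raw request logs takes between roughly 2.2 and 2.9 GB, which is where the figure of about 3 GB comes from.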

To download and process these data automatically, I wrote several functions in Mathematica, and in the end I obtained clean and correct time series. For those who want more details, the Mathematica notebooks are in this repository here. From now on, in this blog, by Wikipedia pages I will always mean English Wikipedia articles only, excluding special pages such as the Main Page, Portal pages, Category pages, Project pages and others.

After downloading and cleaning the data, I tried to understand and characterize the visits time series of Wikipedia pages. I decided to restrict my analysis to articles related to movies: everybody loves cinema and I am no exception. Wikipedia pages are extremely heterogeneous in nature and, as a consequence, in behavior. Usually the visits of a page fluctuate a lot around its typical average, and different pages can have very different averages. This typical value can serve as a measure of the popularity of that page. Focusing on a specific category could help us reduce effects that are difficult to understand.

Below we can see an example of relatively stationary articles, for two well-known movies. The number of daily visits fluctuates around a typical value, and nothing particularly strange happens in a year. There is no particular reason for people to visit these pages during this time. Just the usual. The access to these articles is not strictly stationary (as we studied in class), but considering the rather random behavior of web users, this is about as constant as it gets.

Stationary pages

Different story for different movies. Below we can see the daily visit history for two popular movies, two beloved Christmas stories. I truncated the red peak to keep the details of the baseline visible. Something definitely happened. When talking about movies, understanding the possible origin of a burst of activity may be easier than for articles in other categories. These films were probably aired on TV. This example shows what an exogenous event looks like: something outside the net happened, and we can see how clearly it affects the number of visitors. The page for the movie “A Christmas Story” had a peak of 245 034 visits on December 25th 2012, while its baseline usually fluctuates around 1 000 daily visits. The same thing, with a smaller peak of 12 550 daily visits, happened to the page “The Nightmare Before Christmas”, which usually fluctuates around 2 000 daily visits.
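To put numbers on such peaks, one can compare each day with a robust baseline. The Python sketch below is my own illustration, not the method used for the plots. It takes the median as the baseline and flags days above a chosen multiple of it; the factor of 5 is an arbitrary assumption, and the series is synthetic apart from the two values quoted in the text.

```python
import statistics

def find_bursts(daily_visits, factor=5.0):
    """Return (day_index, visits, ratio) for days above factor x the median baseline."""
    baseline = statistics.median(daily_visits)  # barely affected by a few huge peaks
    return [
        (day, visits, visits / baseline)
        for day, visits in enumerate(daily_visits)
        if visits > factor * baseline
    ]

# Synthetic week shaped like the "A Christmas Story" example: a baseline near
# 1 000 visits and a single peak of 245 034 (both values quoted in the text).
series = [1000, 950, 1100, 1020, 245034, 3000, 990]
bursts = find_bursts(series)
```

For the two pages above, the peak is roughly 245 times the baseline for “A Christmas Story” (245 034 / 1 000) and about 6 times for “The Nightmare Before Christmas” (12 550 / 2 000).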

In conclusion, with this first look at the data I saw that the daily visits of a page generally fluctuate, often considerably, around a typical average that depends strongly on the popularity of that page. Moreover, this typical average can differ greatly from page to page. The system is very heterogeneous, with strong natural fluctuations. And sometimes an exogenous event occurs: the number of visits explodes, and after a while everything goes back to normal.

Wikipedia as a Complex Network

8 January 2015 | All Projects, Networks, Wikipedia | complex network, English Wikipedia, network, topology, wikipedia | levantina

Since its inception Wikipedia has been gaining popularity, and nowadays it is consistently ranked among the top 10 most popular sites according to Alexa (currently in 7th place). As of January 2015, it comprises more than 5 million articles on a wide variety of subjects, written in more than 200 languages. The number of articles on Wikipedia has been growing exponentially since its creation in 2001. This growth is mainly driven by the exponential increase in the number of users contributing new articles, which indicates the importance of Wikipedia’s open editorial policy for its current success (more technical details can be found here).

English Wikipedia has 9.5 billion page views per month; the number of articles is about 4.8 million and it grows by 7% every year (see here and here, very nice Wikimedia statistics). It is a complex network, with scale-free properties and a power-law distribution of degrees (more technical details here). Degree distributions, growth, topology, clustering and path length are common characteristics among different language versions of Wikipedia.

“Social Network Analysis Visualization” by Calvinius

Scale-free systems are one of the reasons why I decided to study Physics at University. I was reading “Chaos” by James Gleick, and it was kind of mind-blowing. I was eighteen and I had to decide what to do with my unclear skills in the many things that I liked, like everyone. But those things… were exciting in a completely new way. “Scale-free” means that if you look at a system at a small and at a large scale, you would not be able to say which is which (see also scale invariance). Let’s go back to Wikipedia: the essential property of scale-free networks is having many nodes with few connections and few nodes with many connections, that is, a power-law distribution of degrees.
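In the usual notation (the symbols are mine, they do not appear in the post), a power-law distribution of degrees means that the fraction P(k) of nodes with degree k decays as a power of k, with some positive exponent γ:

```latex
P(k) \propto k^{-\gamma}, \qquad \gamma > 0
```

On a log-log plot this is a straight line with slope −γ, and rescaling k by any factor changes P(k) only by a constant factor, which is the "same at every scale" property described above.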

Complex networks such as social networks usually also have a small diameter. The diameter of a graph is the longest of the shortest paths (that is, the largest distance) between any pair of nodes. This property is related to the well-known phenomenon of six degrees of separation, which was demonstrated by the social experiment conducted by Stanley Milgram before the Internet was born (here). In other words, in a social network everyone can be reached by everyone else within about six steps of social connections. But every network has its own degree of separation.
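As an illustration of this definition, the diameter of a small graph can be computed by running a breadth-first search from every node and keeping the largest distance found. This is a Python sketch on a toy graph of my own; it assumes the graph is connected and is far too slow for the whole of Wikipedia.

```python
from collections import deque

def bfs_distances(adjacency, source):
    """Shortest-path length (in links) from source to every reachable node."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in adjacency[node]:
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return dist

def diameter(adjacency):
    """Longest shortest path over all pairs (assumes a connected graph)."""
    return max(max(bfs_distances(adjacency, s).values()) for s in adjacency)

# Toy undirected graph: path A - B - C - D plus a shortcut B - D.
toy = {"A": ["B"], "B": ["A", "C", "D"], "C": ["B", "D"], "D": ["B", "C"]}
toy_diameter = diameter(toy)
```

Without the shortcut B - D the same four nodes would have diameter 3 (from A to D); a few well-placed links are enough to shrink distances, which is the small-world effect in miniature.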

What happens in a network like Wikipedia?

In an interesting work (here), the authors measured the average path length (the average distance between two nodes) of English Wikipedia and found a value of almost 3 links when considering the undirected graph (which means that every link is considered accessible in both directions, even though it is not). The average path length is instead almost 5 when considering the directed graph (where every link has a direction).
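The difference between the directed and the undirected value can be seen on a tiny example. The Python sketch below is my own toy illustration, unrelated to the cited work's code. It computes the average shortest-path length over all connected ordered pairs, first following links only in their direction and then treating every link as two-way.

```python
from collections import deque
from itertools import permutations

def shortest_path_length(adjacency, source, target):
    """Breadth-first search; returns None if target cannot be reached."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        if node == target:
            return dist[node]
        for neighbor in adjacency.get(node, ()):
            if neighbor not in dist:
                dist[neighbor] = dist[node] + 1
                queue.append(neighbor)
    return None

def average_path_length(adjacency):
    """Mean distance over ordered pairs of nodes connected by some path."""
    lengths = []
    for source, target in permutations(adjacency, 2):
        d = shortest_path_length(adjacency, source, target)
        if d is not None:
            lengths.append(d)
    return sum(lengths) / len(lengths)

def symmetrize(adjacency):
    """Treat every link as accessible in both directions."""
    undirected = {node: set() for node in adjacency}
    for node, targets in adjacency.items():
        for target in targets:
            undirected[node].add(target)
            undirected.setdefault(target, set()).add(node)
    return undirected

# Toy directed cycle A -> B -> C -> D -> A (hypothetical data).
directed = {"A": ["B"], "B": ["C"], "C": ["D"], "D": ["A"]}
directed_avg = average_path_length(directed)
undirected_avg = average_path_length(symmetrize(directed))
```

On this toy cycle the directed average is 2 links and the undirected one is 4/3, the same qualitative effect as the drop from almost 5 to almost 3 reported for English Wikipedia.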

I found a nice tool on the Internet, still in a beta version (degreesofwikipedia.com), where you can find the degree of separation (or distance) between two Wikipedia pages of your choice. Have fun!

chaos like home

understanding data