Category Archives: Time Series

Visits Time Series of a WikiNet

I built a WikiNet (a Wikipedia subset) starting from the article of a recently released movie, and I considered the visit history of all articles in the WikiNet for one year before and one year after the movie release. We will see how the exogenous event, the movie release, influences the number of visits to the articles in the network.

I chose a well-known movie, Iron Man 3 (released on May 3rd 2013 in the USA), in order to have a lot of visitors and a lot of links in the network. I extracted the local neighborhood in Wikipedia as I described in a previous post and obtained an undirected graph. From this graph I considered the central page, all its 461 first neighbors (FN) and a random selection of 1 000 true second neighbors (SN), that is, pages reached through a first neighbor that are not themselves first neighbors. I had to restrict my analysis to a subset of articles because of computational limitations: downloading and processing 24 months of time series for around 4 000 articles was far beyond my available time and computing power. Nevertheless, the analysis is performed on a significant data set.
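To make the FN/SN selection concrete, here is a minimal Python sketch. It is not the code I used (that was written in Mathematica); the graph representation (a dictionary of neighbor sets), the function names and the toy page titles are all assumptions made for illustration.

```python
import random

def first_and_second_neighbors(graph, center):
    """graph: dict mapping a page title to the set of pages linked with it
    (undirected, so every link appears in both directions)."""
    first = set(graph[center]) - {center}
    second = set()
    for page in first:
        second |= graph[page]
    # Keep only pages at distance exactly 2: drop the center and the first neighbors.
    second -= first | {center}
    return first, second

def sample_pages(pages, k, seed=0):
    """Pick at most k pages at random; sorting first makes the sample reproducible."""
    pool = sorted(pages)
    return random.Random(seed).sample(pool, min(k, len(pool)))

# Toy undirected graph with hypothetical titles (not real data).
toy = {
    "Movie": {"Actor", "Director"},
    "Actor": {"Movie", "Director", "Other film"},
    "Director": {"Movie", "Actor", "Studio"},
    "Other film": {"Actor"},
    "Studio": {"Director"},
}
fn, sn = first_and_second_neighbors(toy, "Movie")
```

On the real graph, `sample_pages(sn, 1000)` would play the role of the random selection of 1 000 second neighbors.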

I considered a time window of 24 months, and for each article I computed the weekly average of the visits and its logarithm. There are two reasons for this. First, I had to remove the effect of the weekly pattern of Wikipedia articles, described in this post, which is a global and regular fluctuation. Second, I had to reduce the differences among articles, because their typical average number of visits can be very different, and taking the logarithm is a valid choice in cases like this.
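The two preprocessing steps can be sketched in a few lines of Python. This is only an illustration under my own assumptions: weeks are taken as consecutive blocks of 7 days starting from the first day of the series, an incomplete last week is dropped, and the logarithm is in base 10 (the base only shifts the scale).

```python
import math

def weekly_log_average(daily_visits):
    """Average daily counts over consecutive 7-day blocks, then take log10."""
    weeks = []
    # Stop before an incomplete last week.
    for start in range(0, len(daily_visits) - 6, 7):
        mean = sum(daily_visits[start:start + 7]) / 7
        # log10 is undefined at 0; None marks a week without visits
        # (a choice made for this sketch).
        weeks.append(math.log10(mean) if mean > 0 else None)
    return weeks

# Two pages with very different popularity end up on a comparable scale.
small_page = weekly_log_average([100] * 14)
large_page = weekly_log_average([100000] * 14)
```

Averaging over 7 days cancels the weekday/weekend oscillation, and the logarithm turns a factor of 1 000 between two pages into a difference of 3.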

Figures: Iron Man 3, First Neighbors, Second Neighbors.

Wikipedia Visits Global Behavior

I tried to understand and characterize Wikipedia visits time series (see also the post Wikipedia Visits Time Series). I analyzed a random sample of English Wikipedia articles: a first set of 1 000 pages and a second set of 10 000 pages. I downloaded and cleaned their full visit history for 12 months (first set) and for 2 months (second set). Looking at the data I observed two really interesting global behaviors, on two different time scales. Let’s look at the first data set. In the following plot we have the average daily visits X(t) over a year, that is, the mean over all the N pages in the set for a given day t.


The first global behavior has a scale of a few months: a global decrease of visits during the summer months and the Christmas days. This low-frequency fluctuation can be interpreted as a seasonal effect. It seems reasonable that people visit Wikipedia less regularly during the summer months. There is also a significant average decrease of visits around Christmas (red tick mark in the plot). This is also reasonable, since the majority of readers are located in the Northern Hemisphere and the Western world (according to Wikimedia Statistics, the main visitor origins are: US 36%, UK 10.8%, Canada 6%, India 5%, Australia 3.3%, Germany 2.0%, Philippines 1.7%, Brazil 1.1%, Netherlands 1.1%, France 1.0%, Sweden 1.0%, Italy 0.9% …), which can explain the 15% decrease of visits during the Christmas holidays and the summer months.

Daily visits of English Wikipedia pages from 08/2012 to 07/2013. Average value over a set of 1000 random articles.
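The quantity plotted here, X(t), is simply the mean over the N pages of the set for each day t. A minimal Python sketch follows; the dictionary layout and the function name are assumptions for illustration, and every page is assumed to cover exactly the same days.

```python
def average_daily_visits(series_by_page):
    """series_by_page: dict title -> list of daily counts (same length for all).
    Returns X(t), the mean over the N pages for each day t."""
    series = list(series_by_page.values())
    n_pages = len(series)
    n_days = len(series[0])
    if any(len(s) != n_days for s in series):
        raise ValueError("all pages must cover the same days")
    return [sum(s[t] for s in series) / n_pages for t in range(n_days)]

# Toy example with two hypothetical pages and three days.
x_t = average_daily_visits({"Page A": [10, 20, 30], "Page B": [30, 40, 50]})
```

Note that a plain mean like this one is dominated by the most visited pages in the sample.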

What about the high-frequency fluctuations? The second global behavior has a scale of a few days. Let’s look at these fluctuations in a smaller time window. In the second plot we can see the average daily visits (normalized this time) over a set of 10 000 random articles, for only 60 days. The blue and the dark red lines are the data (two subsets). The light red line is a sine function with a period of 7 days.

Average daily visits for 10000 random English Wikipedia articles over 60 days. In red, a sine function with a period of 7 days.

I interpreted this result as a weekly pattern: the periodic minima correspond to weekend days. This means that Wikipedia is usually consulted more during working days! Did you expect that?
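One simple way to check that the minima fall on weekends, instead of comparing the curve with a sine by eye, is to average the series by day of the week. The Python sketch below does this on synthetic numbers; the function name and the data are assumptions for illustration, not my original analysis.

```python
from datetime import date, timedelta

def weekday_profile(first_day, daily_values):
    """Mean value for each day of the week (0 = Monday ... 6 = Sunday)."""
    totals = [0.0] * 7
    counts = [0] * 7
    for offset, value in enumerate(daily_values):
        wd = (first_day + timedelta(days=offset)).weekday()
        totals[wd] += value
        counts[wd] += 1
    return [totals[d] / counts[d] if counts[d] else None for d in range(7)]

# Synthetic series (not real data): 100 visits on working days, 80 on weekends.
start = date(2013, 6, 3)  # a Monday
values = [80 if (start + timedelta(days=i)).weekday() >= 5 else 100
          for i in range(56)]
profile = weekday_profile(start, values)
```

With real data, a weekly pattern like the one in the plot would show up as lower values in the last two entries of `profile` (Saturday and Sunday).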

Wikipedia Visits Time Series

Wikipedia is currently maintained by a non-profit organization, the Wikimedia Foundation, whose goal is to encourage the growth, development and distribution of free, multilingual content. In keeping with this goal, Wikimedia makes the content of its wikis available not only through its websites but also through dumps of the whole data set that end users can download (here). Each request for a page, whether for editing or reading, is collected and stored for several language editions of Wikipedia. These data are available as one file of 90 to 120 MB for every hour of the day, which means that a whole day can add up to almost 3 GB. A simpler way to access these data, since we often have computational limitations, is Mituzas’s project, which provides the daily visit history of a given Wikipedia article since December 2007. For every page there is a JSON file for each requested month. There is also a nice tool that lets you visualize the visit history of a page (90 consecutive days at most).
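Since the history comes as one JSON file per page and per month, the first step is to stitch the months together into a single daily series. The Python sketch below shows the idea. The file layout is an assumption: I suppose each monthly file stores a mapping from ISO dates to visit counts under a key called `daily_views` here, so check the actual files before reusing it. The function name is hypothetical too.

```python
import json
from datetime import date

def merge_monthly(json_texts, key="daily_views"):
    """Merge monthly JSON documents into one sorted list of (date, visits)."""
    merged = {}
    for text in json_texts:
        for day, visits in json.loads(text)[key].items():
            try:
                merged[date.fromisoformat(day)] = int(visits)
            except ValueError:
                # Skip entries with an impossible date (e.g. 2013-02-30)
                # or a count that is not a number.
                continue
    return sorted(merged.items())

# Two toy monthly files (made-up numbers), given in the wrong order on purpose.
jan = '{"daily_views": {"2013-01-02": 5, "2013-01-01": 3}}'
feb = '{"daily_views": {"2013-02-01": 7, "2013-02-30": 0}}'
history = merge_monthly([feb, jan])
```

Keying the merged data by date makes the result independent of the order in which the monthly files are read.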

Main Page visits on January 2015

To download and process these data automatically I wrote several functions in the Mathematica programming language, and in the end I obtained clean and correct time series. For those who want more details, the Mathematica notebooks are available in this repository (here). From now on, in this blog, by Wikipedia pages I will always mean English Wikipedia articles only, excluding special pages such as the Main Page, Portal pages, Category pages, Project pages and others.

After downloading and cleaning the data, I tried to understand and characterize the visits time series of Wikipedia pages. I decided to restrict my analysis to articles related to movies: everybody loves cinema, and I am no exception. Wikipedia pages are extremely heterogeneous in nature and, as a consequence, in behavior. Usually the visits of a page fluctuate a lot around a typical average, and different pages can have very different averages. This typical value can be taken as a measure of the popularity of the page. Focusing on a specific category can help reduce effects that are difficult to understand.

Below we can see an example of relatively stationary articles, two well-known movies. The number of daily visits fluctuates around a typical value, and nothing particularly strange happens in a year. There is no particular reason for people to visit these pages during this time. Just the usual. The access to these articles is not strictly stationary (as we studied in class), but considering the rather random behavior of web users, this is about as constant as it can get.

Stationary pages

It is a different story for other movies. Below we can see the daily visit history of two popular movies, two beloved Christmas stories. I truncated the red peak to keep the details of the baseline visible. Something definitely happened. When talking about movies, understanding the possible origin of a burst of activity may be easier than for articles in other categories. These films were probably aired on TV. This example shows what an exogenous event looks like: something happened outside the network, and we can clearly see how it affects the number of visitors. The page of the movie “A Christmas Story” had a peak of 245 034 visits on December 25th 2012, while its baseline usually fluctuates around 1 000 daily visits. The same thing happened, with a smaller peak of 12 550 daily visits, to the page “The Nightmare Before Christmas”, which usually fluctuates around 2 000 daily visits.
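A burst like this can also be flagged automatically by comparing each day with the baseline of the page. The Python sketch below is one possible way to do it, not the method behind the plots: using the median as the baseline and a threshold of 5 times the baseline are both assumptions chosen for illustration, and the series is synthetic.

```python
from statistics import median

def find_bursts(daily_visits, factor=5.0):
    """Return (baseline, bursts): the median daily count and a list of
    (day_index, visits, visits / baseline) for days above factor * baseline."""
    baseline = median(daily_visits)
    if baseline <= 0:
        return baseline, []
    bursts = [(i, v, v / baseline)
              for i, v in enumerate(daily_visits)
              if v > factor * baseline]
    return baseline, bursts

# Synthetic series built from the figures quoted above: a flat baseline
# of 1 000 daily visits and a single peak of 245 034 visits.
series = [1000] * 10 + [245034] + [1000] * 10
baseline, bursts = find_bursts(series)
```

The median is convenient here because, unlike the mean, it is barely affected by the peak itself: in this toy series the single burst day stands out at about 245 times the baseline.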

Christmas movies on Christmas days

In conclusion, with this first approach to the data I saw that the daily visits of a page generally fluctuate, often considerably, around a typical average that depends strongly on the popularity of that page, and that this typical average can take very different values from page to page. The system is very heterogeneous, with strong natural fluctuations. Sometimes an exogenous event occurs: the number of visits explodes, and after a while everything goes back to normal.