
The Long and Short of Wikiprediction


The Problem with Wikipedia.

In what may be a self-organized example of Occam’s Razor, consider the case of reliability of Wikipedia articles.

Recently, Joshua E. Blumenstock of UC Berkeley performed a statistical analysis of thousands of Wikipedia pages, looking for predictors of quality articles. ("Quality articles" were taken to be featured articles, a rating that Wikipedia editors assign using specific criteria. As of this posting, there are approximately 2000 featured articles out of over 2.4 million Wikipedia articles.)

In his paper Automatically Assessing the Quality of Wikipedia Articles, Blumenstock describes the search for correlation between "featuredness" and a wikiload of possible variables. The variables included surface features (e.g. number of characters, words, one-syllable words), structural features (e.g. links, images, tables), a variety of readability metrics (e.g. Gunning Fog, Coleman-Liau Index), and part-of-speech tags (e.g. nouns, past participles, preterites).

He needn’t have looked so deeply. It turns out that word count alone is an incredibly potent predictor. Amazingly, Blumenstock found that knowing whether an article had more or fewer than 1830 words was all that was needed to predict, with 97% accuracy, whether it was featured!
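To make the idea concrete, here is a minimal sketch of a one-rule classifier of this kind. The 1830-word cutoff comes from the result above; everything else is an assumption for illustration. The function names are hypothetical, words are counted by naive whitespace splitting (the paper's tokenization may differ), and the toy data is made up rather than drawn from Wikipedia.

```python
WORD_THRESHOLD = 1830  # cutoff quoted in this post


def word_count(text: str) -> int:
    """Count whitespace-delimited tokens (an assumed, naive tokenization)."""
    return len(text.split())


def predict_featured(text: str, threshold: int = WORD_THRESHOLD) -> bool:
    """Predict 'featured' when the article has more words than the threshold."""
    return word_count(text) > threshold


def accuracy(samples, threshold: int = WORD_THRESHOLD) -> float:
    """Fraction of (text, is_featured) pairs the length rule gets right."""
    samples = list(samples)
    correct = sum(
        predict_featured(text, threshold) == is_featured
        for text, is_featured in samples
    )
    return correct / len(samples)


# Made-up toy data: (article text, whether it is actually featured).
toy = [
    ("word " * 2500, True),   # long and featured: rule is right
    ("word " * 300, False),   # short and not featured: rule is right
    ("word " * 2000, False),  # long but not featured: rule is wrong
    ("word " * 401, False),   # short and not featured: rule is right
]

print(accuracy(toy))  # 0.75
```

On real data the accuracy would depend on which articles are sampled and how words are counted, so this sketch shows only the shape of the rule, not a reproduction of the 97% figure.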

Now why is this? The simple answer is Occam’s Razor at work again, in the form of a natural feature of wiki collaboration: "As articles grow, they likely receive the attention of more editors, and thus the quality would be expected to improve."

It would seem that the Spartan ideal of less is more is no more, a particularly nasty, oxymoronic cut indeed from Occam’s Razor.

But wait. With this simple predictor and simple reason for the correlation comes a potentially nasty side effect: “[f]eatured articles are meant to be ‘the best that Wikipedia has to offer’; these results indicate that they might merely be the longest Wikipedia has to offer.”

This implies either that millions of shorter articles (those with fewer than 1830 words) should be featured, or that the current number of featured articles is way too high.

Note: The Problem with Wikipedia cartoon is from XKCD, a webcomic of romance, sarcasm, math, and language. The site title is itself a great predictor. This site, maintained by Randall Munroe, is wonderfully strange, contains a wild and woolly "blag", and is itself featured on Wikipedia, although apparently it is not long enough to be "featured."

BTW, the number of words in this post is 401. You be the judge – is it too short to be featured?