Posted on Tuesday, October 26th, 2010 at 5:49 am under English Wikipedia 1.0, English Wikipedia Version 0.8, WikiTrust

The Wikipedia 1.0 Editorial Team on the English Wikipedia is currently preparing to publish its next test release, Version 0.8.   This is a general selection of almost 50,000 articles.  An IRC meeting was held last week, and the group is now aiming to have the selection ready for publication by late November.

Version 0.8 is being used primarily to test out an automated system for choosing a vandalism-free version (identified by a RevisionID, or RevID) of each article.  Every time a Wikipedia article is edited, the new version is saved under its own RevisionID; if that RevID is free of vandalism, it can provide a permanently clean version, even if the current article has been corrupted.  The problem (till recently) has been how to identify which RevIDs are “clean”.
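To make the idea of a pinned revision concrete, here is a minimal Python sketch that builds a permanent link from a RevID using MediaWiki's `oldid` URL parameter.  The function name `permalink` and the revision number are illustrative assumptions; this is not part of the Version 0.8 tooling.

```python
from urllib.parse import urlencode

# Entry point of the English Wikipedia's MediaWiki installation.
BASE_URL = "https://en.wikipedia.org/w/index.php"


def permalink(title: str, rev_id: int) -> str:
    """Return a URL that always shows one saved revision of an article.

    The `oldid` parameter pins the page to a single revision, so later
    edits (including vandalism) do not change what the link displays.
    """
    if rev_id <= 0:
        raise ValueError("rev_id must be a positive integer")
    # MediaWiki titles conventionally use underscores instead of spaces.
    query = urlencode({"title": title.replace(" ", "_"), "oldid": rev_id})
    return f"{BASE_URL}?{query}"


# Hypothetical revision number, used only for illustration.
print(permalink("Chemistry", 123456789))
```

A link built this way keeps pointing at the same text however many times the live article changes afterwards, which is what makes a clean RevID usable for an offline release.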

The obvious approach would be to use the latest flagged revision of each article, but the implementation of this on the English Wikipedia has been dogged by opposition – so we began to look at other options.  WikiTrust offered a possible method, but the original software was designed for editors/reviewers to look over the content of articles.  Luca de Alfaro and I met up at Wikimania 2009, and discussed how the code might be adapted to find the most “trustworthy” RevID of an article.  This has now been achieved (thanks Luca et al.!), and Version 0.8 was compiled using the WikiTrust-based code for selecting RevIDs.
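The selection step can be pictured roughly as follows.  This is only a sketch under stated assumptions: it supposes that each revision already carries a single numeric trust score, and it simply picks the highest-scoring revision, preferring the most recent one on a tie.  The `Revision` class, the `trust` field, the `min_trust` threshold and the sample data are all hypothetical; the actual WikiTrust-based code may work quite differently.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Iterable, Optional


@dataclass(frozen=True)
class Revision:
    rev_id: int
    timestamp: datetime
    trust: float  # hypothetical score in [0, 1]; higher means more trustworthy


def pick_revision(
    revisions: Iterable[Revision], min_trust: float = 0.0
) -> Optional[Revision]:
    """Pick the most trustworthy revision, breaking ties by recency.

    Returns None when no revision reaches `min_trust`, so the caller can
    fall back to manual review instead of shipping a doubtful revision.
    """
    candidates = [r for r in revisions if r.trust >= min_trust]
    if not candidates:
        return None
    return max(candidates, key=lambda r: (r.trust, r.timestamp))


# Made-up history: the newest edit is vandalism and scores poorly.
history = [
    Revision(101, datetime(2010, 9, 1), trust=0.90),
    Revision(102, datetime(2010, 9, 20), trust=0.90),
    Revision(103, datetime(2010, 10, 5), trust=0.15),
]

chosen = pick_revision(history, min_trust=0.5)
print(chosen.rev_id if chosen else "needs manual review")
```

In this made-up history the newest revision is passed over in favour of the latest one that still scores well, which is the behaviour one would want when the current article has just been vandalised.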

This represents a huge advance over Version 0.7.  That earlier test release proved very effectively the power of auto-selection of articles, but its large size (>31,000 articles) meant that I had to spend many months of my spare evenings locating vandalism using anti-vandalism scripts.  This was not only slow, it was (I believe) much less effective than the WikiTrust approach – it could not catch the more surreptitious forms of vandalism.  Overall, Version 0.7 took 18 months to bring to publication – an unacceptably long process, largely (but not entirely) because of the RevID problem.  We couldn’t even consider working on Version 0.8 until this issue was resolved.  Thankfully, we do seem to have a working solution, but we will have to wait for publication to see just how well it works.

Typical view of an assessment table for chemistry

There has been an additional benefit to the automated RevID selection.  We allow a month or so to solicit feedback from WikiProjects on the selection.  This “reality check” is invaluable for getting the right selection – for example, pointing out that only three parts of a four-part opera were chosen, or that Australia has a new prime minister who needs to be included.  But for Version 0.8, we were able to include our chosen RevIDs in the lists we presented to WikiProjects – this was also due to CBM’s improved layout on the Toolserver.  The articles selected for Version 0.8 are indicated by a diamond on the right; clicking on the diamond brings up the selected revision.  The result is that the WikiProjects not only checked which articles were selected, but were also able to point out vandalism or mention updates, as in this example.  This adds another layer of curation, by subject experts, that can only help both the offline release and the online Wikipedia.

At this point, I mainly need to finish reviewing the WikiProject comments; after that, the remaining steps are largely automated and routine.  The generation of a ZIM file is now straightforward, and Kiwix and Okawix are now working well as readers.  However, one knotty problem remains – we still don’t have a good way to generate an index by topic.  The code used for this on Version 0.7 was rather buggy, though the code for the geographical index was better.  This means we may only be able to offer an alphabetical and a geographical index in Version 0.8 – though we are beginning to work with two librarians who want to help tackle this problem for us.

Overall, things look quite rosy for offline releases at this point.  There is a good chance we’ll have 0.8 out in time for Christmas, and that it will be of higher quality and reliability than previous releases.  If things go as planned, we should be able to produce our first “official” release, Version 1.0, some time in 2011.
