The Wikipedia 1.0 Editorial Team on the English Wikipedia is currently preparing to publish its next test release, Version 0.8. This is a general selection of almost 50,000 articles. An IRC meeting was held last week, and the group is now aiming to have the selection ready for publication by late November.
Version 0.8 is being used primarily to test out an automated system for choosing a vandalism-free version (called a RevisionID) of each article. Every time a Wikipedia article is edited, a new RevisionID is saved; if that RevID is free of vandalism, it can provide a permanently clean version, even if the current article has been corrupted. The problem, until recently, has been how to identify which RevIDs are “clean”.
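To make the idea concrete, here is a minimal Python sketch of the “clean RevID” lookup described above. The revision numbers and the set of clean revisions are invented for illustration; the real selection works over the full article history in the database.

```python
# Hypothetical sketch: given an article's revision history (ordered
# newest-first) and a set of RevIDs already judged vandalism-free,
# pick the most recent clean revision. The data is illustrative only.

def latest_clean_revid(history, clean_revids):
    """Return the newest RevID in `history` known to be clean, or None."""
    for revid in history:          # history runs newest -> oldest
        if revid in clean_revids:
            return revid
    return None                    # no clean revision found

history = [1053, 1049, 1042, 1031]        # newest first (made-up IDs)
clean = {1049, 1031}                      # e.g. flagged by reviewers
print(latest_clean_revid(history, clean)) # -> 1049
```

The point of the “newest clean revision” rule is that even if the current version (1053 here) is vandalised, the offline release can still fall back to a recent good one.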
The obvious approach would be to use the latest flagged revision of each article, but the implementation of this on the English Wikipedia has been dogged by opposition – so we began to look at other options. WikiTrust offered a possible method, but the original software was designed for editors/reviewers to look over the content of articles. Luca de Alfaro and I met up at Wikimania 2009, and discussed how the code might be adapted to find the most “trustworthy” RevID of an article. This has now been achieved (thanks Luca et al.!), and Version 0.8 was compiled using the WikiTrust-based code for selecting RevIDs.
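The actual WikiTrust system derives per-word trust values from author reputation, which is well beyond a short example; the toy sketch below only illustrates the final step of picking the most “trustworthy” RevID, using entirely made-up scores.

```python
# Toy stand-in for WikiTrust-style selection. Real WikiTrust computes
# per-word trust from author reputation; the scores here are invented.

def most_trustworthy_revid(scores):
    """scores: {revid: trust_score}. Return the revid with the highest
    score, breaking ties in favour of the newer (larger) revid."""
    return max(scores, key=lambda rid: (scores[rid], rid))

scores = {1031: 0.92, 1042: 0.55, 1049: 0.88, 1053: 0.40}
print(most_trustworthy_revid(scores))   # -> 1031
```

Note that the selected revision need not be the newest one: a heavily-edited recent revision with low trust loses to an older, well-vetted one.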
This represents a huge advance over Version 0.7. That earlier test release demonstrated very effectively the power of auto-selecting articles, but its large size (>31,000 articles) meant that I had to spend many months of my spare evenings locating vandalism using anti-vandalism scripts. This was not only slow; it was (I believe) much less effective than the WikiTrust approach – it could not catch the more surreptitious forms of vandalism. Overall, Version 0.7 took 18 months to bring to publication – an unacceptably long process, largely (but not entirely) because of the RevID problem. We couldn’t even consider working on Version 0.8 until this issue was resolved. Thankfully, we do seem to have a working solution, but we will have to wait for publication to see just how well it works.
There has been an additional value to the automated RevID selection. We allow a month or so to solicit feedback from WikiProjects on the selection. This “reality check” is invaluable for getting the right selection – for example, pointing out that only three parts of a four-part opera were chosen, or that Australia has a new premier who needs to be included. But for Version 0.8, we were able to include our chosen RevIDs in the lists we presented to WikiProjects – this was also due to CBM’s improved layout on the Toolserver. The articles selected for Version 0.8 are indicated by a diamond on the right; clicking on the diamond brings up the selected revision. The result is that the WikiProjects not only checked which articles were selected, but they were able to point out vandalism or mention updates, as in this example. This adds another layer of curation, by subject experts, that can only help both the offline release and the online Wikipedia.
At this point, I mainly just need to finish reviewing the WikiProject comments; after that, the remaining steps are largely automated and routine. The generation of a ZIM file is now straightforward, and Kiwix and Okawix are now working well as readers. However, one knotty problem remains – we still don’t have a good way to generate an index by topic. The code used for this on Version 0.7 was rather buggy, though the code for the geographical index was better. This means we may only be able to offer an alphabetical and a geographical index in Version 0.8 – though we are beginning to work with two librarians who want to help tackle this problem for us.
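As an illustration of the easy part of the indexing problem, here is a short Python sketch that groups article titles into an alphabetical index of the kind mentioned above; the harder topic index is not attempted. The titles are invented examples.

```python
# Minimal sketch of an alphabetical index for an offline release:
# sort the article titles, then group them under their initial letter.

from itertools import groupby

def alphabetical_index(titles):
    """Return {initial letter: sorted list of titles}."""
    ordered = sorted(titles, key=str.casefold)
    return {k.upper(): list(g)
            for k, g in groupby(ordered, key=lambda t: t[:1].casefold())}

titles = ["Opera", "Australia", "Astronomy", "Gdansk"]
index = alphabetical_index(titles)
print(index["A"])   # -> ['Astronomy', 'Australia']
```

A geographical index can be built the same way once each article carries a region key; a topic index is harder precisely because no such single key exists per article.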
Overall, things look quite rosy for offline releases at this point. There is a good chance we’ll have 0.8 out in time for Christmas, and that it will be of higher quality and reliability than previous releases. If things go as planned, we should be able to produce our first “official” release, Version 1.0, some time in 2011.
One of the exciting things at Wikimania is the chance to meet people from around the world, especially people with whom you share a common interest. One of the main news stories in the offline Wikipedia world this year was the release of the Malayalam offline collection on April 17. This was important, since it was the first such release by one of the smaller Wikipedias, and the first involving a non-Latin script.
As we learnt from Shiju Alex and Santhosh Thottingal (centre & centre-left in picture) at Wikimania, non-Latin scripts present a particular barrier to offline releases. Although Malayalam is encoded in Unicode, the script is poorly supported by rendering software, so most applications do not display it correctly, and existing offline reader software such as Kiwix could not be used. In order to produce a collection of articles with the correct script, Santhosh Thottingal wrote the wiki2cd software from scratch, and this proved to be very successful. wiki2cd is written in Python, and “the program is written such a way that it can be reused with any wikiprojects to do the same kind of work.” Hopefully others will build on Santhosh’s work, and Kiwix and other readers will be able to incorporate non-Latin scripts too.
The CD was sponsored by the Kerala State Dept of General Education. This reinforces what we have seen with past offline releases such as Wikipedia for Schools, that one of the main applications for offline collections is in education.
As I mentioned in my previous post, the Wikimedia Foundation is (at last!) getting seriously interested in offline releases. At Wikimania, Jimmy Wales spoke about expanding the reach of Wikimedia by expanding in non-Western languages, and he chose to highlight the Malayalam release in his talk (see picture, taken by Ralf Roletschek). If this interest continues, and talented volunteers continue to arise, we may be fortunate enough to see more offline releases from around the world.
I had the good fortune to attend Wikimania 2010 in the beautiful city of Gdansk, Poland in July. As in the past, it was a great pleasure to meet other Wikimedians in person – for example Heiko and Headbomb (see picture).
When I attended Wikimania in 2006, discussions on offline content were minimal, and the main issues were quality (partly as fallout from the Seigenthaler incident), stable versions (aka flagged revisions/pending changes) and pushing on with the Wikipedia revolution. Four years later, offline content has become a major priority. There were several “offline” sessions in Gdansk, all giving different viewpoints and focussing on different parts of the publication process. We also heard Sue Gardner and Jimmy Wales talking about reaching out to the “global south”, which implicitly includes a major offline component.
Clearly, this is an exciting time for those of us who believe passionately in the idea of using offline collections to put free knowledge into the hands of people around the world. Coincidentally, we have a lot happening on the English Wikipedia; our stumbling efforts are finally giving us tools that will deliver reliable offline content, and (I believe!) stand up to scrutiny by the community and by the press. For all of these reasons, I thought it was about time to start a blog devoted to these activities, and hopefully persuade other Wikipedians to share their part of the work. If things go well, this may become a place for us to share ideas and track developments in offline Wikipedia collections.