Following my previous post on a version of Wikipedia for Windows Mobile, improved from the original Pocket Wikipedia 1.0 by free-soft.ro, I decided to find a MobiPocket (PRC) version to read on my Blackberry phone. Unfortunately, I could not find a usable one: many of the versions I found, including the PRC ones, are incomplete, have their images stripped, or are not suitable for mobile viewing. There is also expensive commercial Wikipedia software for most mobile platforms. TomeRaider offers a few free versions of Wikipedia in its proprietary format (one with images, and a compact one without images at around 50MB). As none of these suited my needs, I decided to go ahead and create my own PRC version of Wikipedia.
The article database
I decided to use the same article database (Wikipedia.wi) as Pocket Wikipedia 1.0, which turns out to be the 2007 School Wikipedia selection. Although the source code was never released, the binary was not obfuscated and after a bit of decompiling using .NET Reflector, I was able to extract the articles and images from the 180MB SevenZip-compressed database.
Building the PRC ebook
My first thought was to rely on the MobiPocket Creator user interface. However, its UI is terrible: there is no way to add multiple HTML/image files at a time, so you have to add them one by one. Although drag and drop is supported, the application stops responding when a lot of files are added. I therefore decided to create the OPF file myself and feed it into mobigen or kindlegen to create the final PRC file.
The source code to extract the articles and create the OPF file was written in .NET and can be downloaded here. Because there are more than 5,000 articles and 24,000 images, kindlegen/mobigen takes more than 15 minutes on a 3GHz processor to turn the finished OPF into the final PRC file.
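To give an idea of what generating the OPF by hand involves, here is a minimal sketch in Python (not the .NET code mentioned above). It assumes the extracted articles and images sit in one directory tree and writes an OPF 2.0 style package file. The function name build_opf, the media types, and the identifier value are my own assumptions for illustration; the actual OPF produced by the original tool may look different.

```python
from pathlib import Path
from xml.sax.saxutils import escape, quoteattr

# Assumed media types; the post does not say which ones the original OPF used.
MEDIA_TYPES = {
    ".html": "application/xhtml+xml",
    ".htm": "application/xhtml+xml",
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".gif": "image/gif",
}


def build_opf(content_dir, title, language="en"):
    """Return OPF 2.0 package XML listing every article and image under content_dir."""
    root = Path(content_dir)
    files = sorted(
        p for p in root.rglob("*")
        if p.is_file() and p.suffix.lower() in MEDIA_TYPES
    )
    manifest, spine = [], []
    for number, path in enumerate(files, start=1):
        item_id = f"item{number}"
        href = path.relative_to(root).as_posix()
        media_type = MEDIA_TYPES[path.suffix.lower()]
        manifest.append(
            f'    <item id="{item_id}" href={quoteattr(href)} media-type="{media_type}"/>'
        )
        # Only documents go in the spine (reading order); images stay manifest-only.
        if media_type == "application/xhtml+xml":
            spine.append(f'    <itemref idref="{item_id}"/>')
    lines = [
        '<?xml version="1.0" encoding="utf-8"?>',
        '<package xmlns="http://www.idpf.org/2007/opf" version="2.0" unique-identifier="BookId">',
        '  <metadata xmlns:dc="http://purl.org/dc/elements/1.1/">',
        f"    <dc:title>{escape(title)}</dc:title>",
        f"    <dc:language>{escape(language)}</dc:language>",
        # Placeholder identifier, made up for this sketch.
        '    <dc:identifier id="BookId">wikipedia-prc-sketch</dc:identifier>',
        "  </metadata>",
        "  <manifest>",
        *manifest,
        "  </manifest>",
        "  <spine>",
        *spine,
        "  </spine>",
        "</package>",
    ]
    return "\n".join(lines) + "\n"
```

The returned string can be saved next to the content, for example as wikipedia.opf (a hypothetical name), and then passed to kindlegen on the command line. Generating the manifest in a loop like this is what avoids adding thousands of files one by one in the Creator UI.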
Some of the articles contain Unicode characters (for example, various currency symbols) but were extracted and saved in ASCII format. I tried various methods in System.Text.Encoding to convert them to Unicode before saving, without success. The only solution I found was to use UTFCast Express (freeware) to convert the HTML files to Unicode before feeding them into kindlegen/mobigen.
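For readers without UTFCast, the same kind of conversion can be sketched in a few lines of Python. This is only an illustration under a loud assumption: that the extracted files are really in a single-byte Windows code page (cp1252 here), which the post does not confirm. If the characters were already replaced by question marks when the files were saved, no re-encoding will bring them back. The function name reencode_to_utf8 is hypothetical.

```python
from pathlib import Path


def reencode_to_utf8(path, source_encoding="cp1252"):
    """Rewrite one file as UTF-8; return True if the file was changed.

    source_encoding is an assumption: adjust it to whatever the
    extracted files actually use.
    """
    file_path = Path(path)
    raw = file_path.read_bytes()
    try:
        # Pure ASCII and existing UTF-8 both decode cleanly, so leave them alone.
        raw.decode("utf-8")
        return False
    except UnicodeDecodeError:
        pass
    # Bytes that are undefined in the source encoding become U+FFFD.
    text = raw.decode(source_encoding, errors="replace")
    file_path.write_bytes(text.encode("utf-8"))
    return True
```

Running it over every .html file in the content directory before calling kindlegen/mobigen would play the same role as the UTFCast step. The HTML files may also need a charset declaration that matches the new encoding.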
The product: Wikipedia on a 214MB PRC file
The final compressed PRC file can be downloaded here. It’s a multi-part RAR archive, so you will need to download both parts to the same directory and use WinRAR to extract the PRC file.
It contains all the articles and images of the original version, with a subject list and an index in which the titles of all articles can be looked up. As the title list is generated automatically by guessing from the first few words of each article, there are cases where the title is not retrieved properly. These can be fixed by editing the index manually in the OPF file before calling kindlegen to generate the PRC file.
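The title guessing can be pictured with a small Python sketch. The exact heuristic used to build the index is not described in this post, so the approach below (strip the markup, keep the first few words) and the names guess_title and max_words are assumptions made for illustration only.

```python
from html.parser import HTMLParser


class _TextCollector(HTMLParser):
    """Collect visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__()
        self.chunks = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self.chunks.append(data)


def guess_title(html, max_words=4):
    """Guess an article title from the first max_words words of its text."""
    parser = _TextCollector()
    parser.feed(html)
    parser.close()
    words = " ".join(parser.chunks).split()
    return " ".join(words[:max_words])
```

A fixed word count is exactly the kind of guess that goes wrong: a one-word title picks up the start of the first sentence, and a long title gets cut off, which is why some index entries need manual correction.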
Because of the large file size, some desktop versions of the MobiPocket ebook reader may fail to open the file and report a Win32 exception. Mobile versions, in particular those for Blackberry and Windows Mobile, seem to open the PRC file properly.
UPDATE (8 Oct 2010): A Vietnamese reader has used my instructions to create an improved version of the Wikipedia ebook in PRC format. The new version, which fixes some font problems and has improved search support, can be downloaded here and here. It’s a multi-part RAR archive, so you will need to download both parts to the same directory and use WinRAR to extract the PRC file.