Monday, January 4, 2010

Content migration from documents to wiki (MediaWiki)

This article covers the following topics:
• Statement of problem
• Feasible solutions to achieve the goal
• Automating conversion of documents using VBA macros
       o What are VBA macros
       o Functional details
• Technical aspects
       o Technology used
       o Class details
       o IO/user interface
• Configuration
       o How to use the application
       o Extend the application to web or windows
       o Exception
       o Limitation and support
Statement of problem:
The goal is to convert documents (DOC, PDF, Excel) to wiki-formatted text so that they can be hosted directly on a wiki server, which makes a mass import of content or documents into the wiki possible.
Feasible solutions to achieve the goal: Either of the following two approaches can be used.
• Automating conversion of the documents into wiki text, using VBA macros (preferred)
• Reading the entire content with inline formatting and applying wiki syntax to it.
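To make the second approach more concrete, here is a minimal Python sketch of applying wiki syntax to text that has already been read together with its inline formatting. The `Run` class, its fields and the function names are hypothetical and only serve the illustration; the wiki markup itself (`''` for italic, `'''` for bold, `==` for headings) is standard MediaWiki syntax.

```python
from dataclasses import dataclass


@dataclass
class Run:
    """A span of text with uniform inline formatting (hypothetical model)."""
    text: str
    bold: bool = False
    italic: bool = False


def run_to_wiki(run):
    # MediaWiki uses '' for italic, ''' for bold and ''''' for bold italic.
    quotes = "'" * ((3 if run.bold else 0) + (2 if run.italic else 0))
    return f"{quotes}{run.text}{quotes}"


def paragraph_to_wiki(runs, heading_level=0):
    """Join the runs of one paragraph; wrap in '=' marks if it is a heading."""
    body = "".join(run_to_wiki(r) for r in runs)
    if heading_level:
        marks = "=" * heading_level
        return f"{marks} {body} {marks}"
    return body


if __name__ == "__main__":
    runs = [Run("Hello "), Run("bold", bold=True), Run(" world")]
    print(paragraph_to_wiki(runs))
    print(paragraph_to_wiki([Run("Introduction")], heading_level=2))
```

A real reader would also have to deal with lists, tables, images and nested styles, which is a large part of why the macro-based approach is preferred here.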
Automating conversion of documents using VBA macros:
• What are VBA macros: Macros are code written in VBA (Visual Basic for Applications). The code is compiled to a proprietary intermediate language called P-code (packed code), which the hosting applications (Access, Excel, Word) store as a separate stream in structured storage files (e.g. .doc or .xls), independent of the document streams. The intermediate code is then interpreted.
• Functional details: the conversion proceeds in the following steps:
     o Write a macro for each document type (Excel, Word or PowerPoint) that converts the content to wiki format while preserving the data and formatting, and saves the output temporarily in the same file
     o Save the macro in .bas format (change/set the extension to .bas)
     o Automate the Office documents using .NET libraries, which attach the corresponding macro to each document and execute it
     o Read the output (wiki-formatted text) generated by the macro and publish it to the hosted wiki using the MediaWiki API for .NET
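As an illustration of the first step, the following Python sketch shows the kind of transformation an Excel macro has to perform: turning rows of cell values into MediaWiki table markup. It is not the macro itself (which is written in VBA), and the function name and the `header` parameter are made up for this example.

```python
def rows_to_wiki_table(rows, header=True):
    """Render rows of cell values as MediaWiki table markup.

    Assumption: cell values contain no '|' characters, which would need
    escaping in a real converter.
    """
    lines = ['{| class="wikitable"']
    for index, row in enumerate(rows):
        lines.append("|-")  # starts a new table row
        # '!' marks a header cell, '|' a normal data cell.
        marker = "!" if header and index == 0 else "|"
        for cell in row:
            lines.append(f"{marker} {cell}")
    lines.append("|}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(rows_to_wiki_table([["Name", "Qty"], ["Bolt", 4], ["Nut", 12]]))
```

The resulting text can be pasted into a wiki page as is, which is a quick way to check the output of a macro against what MediaWiki expects.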

Technical aspects:


Technology used: ASP.NET, VBA macros, PHP, MediaWiki, IIS, Apache web server
       o Write a macro for each of the Excel and Word document types to convert them to wiki-formatted text
       o Create a .NET-based class library to convert PDF to Word documents
       o Create a .NET-based class library to automate the Office applications, which can add and execute the macro
Class details: The project contains a class library (wikiconversion) that automates the document conversion, a MediaWiki API library (wikiaccess), and a console program that invokes the application.


IO/user interface: The project includes a console application that drives the conversion, so there is no visual UI. It reads all documents in the Documents folder and hosts the Word and Excel files in MediaWiki. An output file, named with the current timestamp, is generated in the Output folder under Documents and records the status of each document processed. (The same library can be extended to any web or Windows application as needed.)
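The batch behaviour described above can be sketched in Python roughly as follows. This is only an outline under stated assumptions: the actual application is a .NET console program, and the set of extensions, the report file name pattern and the `convert` callback are all hypothetical.

```python
from datetime import datetime
from pathlib import Path

# Assumption: only Word and Excel files are hosted; the extension list is a guess.
SUPPORTED = {".doc", ".docx", ".xls", ".xlsx"}


def is_supported(path):
    return Path(path).suffix.lower() in SUPPORTED


def report_name(now):
    # Hypothetical naming scheme: one status file per run, stamped with the time.
    return f"status_{now:%Y%m%d_%H%M%S}.txt"


def run_batch(documents_dir, convert, now=None):
    """Convert every supported file in documents_dir and write a status report.

    `convert` is a caller-supplied function that converts and hosts one file
    and raises an exception on failure.
    """
    documents_dir = Path(documents_dir)
    now = now or datetime.now()
    lines = []
    for path in sorted(documents_dir.iterdir()):
        if not path.is_file():
            continue  # skips the Output folder itself
        if not is_supported(path):
            lines.append(f"{path.name}\tSKIPPED (unsupported type)")
            continue
        try:
            convert(path)
            lines.append(f"{path.name}\tOK")
        except Exception as exc:  # one bad file should not stop the batch
            lines.append(f"{path.name}\tFAILED: {exc}")
    out_dir = documents_dir / "Output"
    out_dir.mkdir(exist_ok=True)
    report = out_dir / report_name(now)
    report.write_text("\n".join(lines) + "\n", encoding="utf-8")
    return report
```

Catching the exception per file keeps a single corrupted document from aborting the whole run, and the report then shows which files need attention.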


Configuration: The following configuration is required to run the application:

1. A MediaWiki instance hosted on an Apache server (MediaWiki must allow API calls, e.g. $wgEnableAPI = true in the LocalSettings.php file)
2. All documents to be converted must be in the Documents folder of the application
3. A Macros folder containing the relevant Word and Excel macros must be present in the application. For further details on macros, see http://en.wikipedia.org/wiki/Visual_Basic_for_Applications
4. Wikiconversion.config must be updated with the relevant values, such as the MediaWiki server name and the various file paths
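To show what the server name from the configuration is used for, here is a small Python sketch that builds a page edit request for the MediaWiki API. The parameters `action=edit`, `title`, `text`, `summary`, `token` and `format` are part of the MediaWiki API; the function names, the default script path `/w` and the summary text are assumptions. How the edit token is obtained depends on the MediaWiki version, so it is simply passed in here, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def api_url(server, script_path="/w"):
    # Assumption: api.php lives under the wiki's script path on the configured server.
    return f"http://{server}{script_path}/api.php"


def build_edit_request(title, wikitext, token, summary="Imported from document"):
    """Return the POST parameters for creating or replacing one wiki page."""
    return {
        "action": "edit",
        "format": "json",
        "title": title,
        "text": wikitext,
        "summary": summary,
        "token": token,  # kept last so a truncated request fails on the token check
    }


def encode_body(params):
    # The edit action must be sent as a form-encoded POST body.
    return urlencode(params).encode("utf-8")


if __name__ == "__main__":
    params = build_edit_request("Sample page", "== Heading ==\nText", "TOKEN")
    print(api_url("wiki.example.org"))
    print(encode_body(params))
```

If the API is disabled on the server, a request like this is rejected, which is why the first configuration item matters.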

o How to use the application: Set the configuration values in the config file, place the files to be converted in the Documents folder, and run the content import project.

o Extend the application to web or Windows: Create a web or Windows application and call the functions available in the wikiconversion class (refer to the code in the Main function for more details).

o Exception: A corrupted or very large file might take too long to host on the wiki, or fail to do so. The accuracy of the conversion from documents to wiki depends on the macros placed in the application; a macro can be replaced if a better one becomes available. The current macros convert most features and formatting appropriately, but some issues may remain in the Word macro.

o Limitation and support: Currently the application supports only MS Word and MS Excel files for conversion and hosting in MediaWiki, but it can be extended to PDF, HTML and PPT by first converting them to MS Word documents. The URL generated in the output file can sometimes be misleading.

References: http://www.mediawiki.org/wiki/API
