Project Description
This service is built in Perl CGI. It uses Sandeep Kumar's docx2txt, some modified CPAN Perl modules, SILVERCODERS DocToText, and the No-Frills Unzipper to extract text and data from corrupt Office 2007 XML format files, as well as from OpenOffice, Word 97-2003, and RTF files.

Corrupt Office File Text/Data Extracting Web Service

Introduction

Several free web services online now convert Microsoft Office 2007, Microsoft Word 97-2003, and RTF files, or some combination of these, into text or, in the case of Excel 2007 files, extract a spreadsheet's data. What these services lack is the ability to do these conversions with corrupt files. This project seeks to provide that service.

One of the developers, Paul Pruitt, discovered that corrupt instances of the conventionally zipped Microsoft Office 2007 XML format based files can be successfully unarchived by two different un-archivers, CakeCMD and Ccy's DelphiZip derived HaHa Zip. Paul Pruitt contracted Ccy to make a command line version of HaHa Zip called No-Frills Unzipper. It is available here:
http://www.godskingsandheroes.info/software/#No-Frills_Unzipper.

Paul later discovered that there are now at least four methods for using scripts to extract or convert Microsoft Office 2007 files to text/data. In each case, if CakeCMD or No-Frills Unzipper is used as the un-archiver of the Office 2007 XML format based file, either as a first step or incorporated into the method's command invocation, text/data can be recovered from corrupt Office 2007 files, insofar as the corruption lies in the zip layer of the file. From his experience with the files uploaded to his instance of the service at http://www.saveofficedata.com, Paul Pruitt estimates that 40% of Office 2007 XML based file corruption lies in the zip layer of the files, for instance in bad CRC values. Office 2007 itself can recover text data from some types of file corruption but seems unable to deal with bad CRC values in the crucial zipped XML files that make up the larger Office 2007 file.
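To illustrate why a bad CRC need not make the text unrecoverable, here is a minimal sketch in Python (not the project's Perl, and not a description of how CakeCMD or No-Frills Unzipper work internally, which this page does not say). It assumes the compressed data itself is intact and only the stored checksum is wrong. The function name extract_member_ignoring_crc is hypothetical. It scans the raw bytes for the member's local file header and inflates the data without ever comparing it against the CRC field.

```python
import struct
import zlib

LOCAL_HEADER_SIG = b"PK\x03\x04"  # marks the start of each member in a zip file
LOCAL_HEADER_LEN = 30             # size of the fixed part of a local file header


def extract_member_ignoring_crc(data: bytes, member: str) -> bytes:
    """Return the contents of `member` from raw zip bytes, skipping the CRC check.

    Hypothetical helper for illustration only. It reads local file headers
    directly instead of trusting the central directory.
    """
    pos = data.find(LOCAL_HEADER_SIG)
    while pos >= 0:
        header = data[pos:pos + LOCAL_HEADER_LEN]
        if len(header) < LOCAL_HEADER_LEN:
            break  # truncated header at the end of the file
        (_sig, _version, _flags, method, _mtime, _mdate,
         _crc, csize, _usize, name_len, extra_len) = struct.unpack(
            "<4sHHHHHIIIHH", header)
        name_start = pos + LOCAL_HEADER_LEN
        name = data[name_start:name_start + name_len].decode("utf-8", "replace")
        body_start = name_start + name_len + extra_len
        if name == member:
            if method == 0:
                # Stored (uncompressed): the bytes are the content. This relies
                # on the size field in the header being correct.
                return data[body_start:body_start + csize]
            if method == 8:
                # Deflate: a raw deflate stream marks its own end, so the
                # decompressor stops by itself and the CRC is never consulted.
                inflater = zlib.decompressobj(-zlib.MAX_WBITS)
                return inflater.decompress(data[body_start:])
            raise ValueError("unsupported compression method %d" % method)
        pos = data.find(LOCAL_HEADER_SIG, pos + 4)
    raise KeyError(member)
```

For a docx file read into memory, calling extract_member_ignoring_crc(raw_bytes, "word/document.xml") would return the main document XML even when a strict unzipper rejects the member for a checksum mismatch. If the compressed stream itself is damaged, zlib will still raise an error, so this only addresses the bad CRC case.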

Details

Here are the four methods for recovering text from Office 2007 format files:
  1. https://sourceforge.net/projects/docx2txt/ - Sandeep Kumar's docx2txt for converting Word 2007 docx format files into text. After unzipping the file, this script works mostly by removing the XML markup from the word/document.xml file, where the text resides in a Word 2007 docx format file. Sandeep's script also recovers basic formatting information from the word/_rels/document.xml.rels file. Replacing Sandeep's use of Perl's standard zlib based unzipper with one of the un-archivers named above allows text to be extracted from corrupt docx files.
  2. http://silvercoders.com/index.php?page=DocToText - SILVERCODERS DocToText. This is a nifty command line application that extracts text/data from Word 2007 docx, Excel 2007 xlsx, PowerPoint 2007 pptx, Word 97-2003, and RTF files. Paul Pruitt sponsored some further development of the application so that it can now be used with No-Frills Unzipper and, with the options "--fix-xml" or "--strip-xml", can extract data/text from corrupt docx, xlsx, and pptx files, although the "--strip-xml" option does not presently work with xlsx files.
  3. http://search.cpan.org/~dmow/Spreadsheet-XLSX-0.1/lib/Spreadsheet/XLSX.pm - Together with Ken Prows's xls2csv and several other Perl modules, this Perl module can even be used in CGI mode to convert xlsx files to csv, although extensive modifications are needed. The modified Perl modules will be included in this download. They are modified so that they work locally and do not need to be installed by your ISP administrator.
  4. An as yet unreleased command line version of an application in development by Ccy, again sponsored by Paul Pruitt, will allow conversion of corrupt docx, xlsx, and pptx files to data/text. The source code for this application won't be released as of yet, or at least not here, but the executable will be included in the download section.
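The markup removal idea behind method 1 can be sketched as follows. This is a simplified illustration in Python, not a port of docx2txt: the function name document_xml_to_text is hypothetical, it handles only paragraphs, tabs, and line breaks, and it ignores the formatting information docx2txt recovers from document.xml.rels. Because it uses regular expressions rather than an XML parser, it does not require the input to be well-formed.

```python
import html
import re


def document_xml_to_text(xml: str) -> str:
    """Reduce the contents of word/document.xml to plain text.

    Hypothetical helper for illustration only.
    """
    # Each closing paragraph tag becomes a newline.
    xml = re.sub(r"</w:p>", "\n", xml)
    # A bare <w:tab/> inside a run is a tab character. Tab-stop definitions
    # carry attributes and are left to be dropped with the other tags.
    xml = re.sub(r"<w:tab\s*/>", "\t", xml)
    # Line and page breaks become newlines.
    xml = re.sub(r"<w:br\b[^>]*/>", "\n", xml)
    # Drop every remaining tag, keeping only the text between tags.
    text = re.sub(r"<[^>]+>", "", xml)
    # Decode entities such as &amp; and &lt; back into characters.
    return html.unescape(text)
```

Fed the string returned by an unzipper for word/document.xml (decoded as UTF-8), this yields one line of text per paragraph. A truncated file still produces whatever text precedes the damage, which is the property that matters for corrupt documents.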

The script currently at http://www.saveofficedata.com uses the 0.3 version of Sandeep Kumar's docx2txt script, version 0.11 of doc2txt (which may still be unreleased), and version 0.1 of Spreadsheet::XLSX, and has yet to implement Ccy's command line app. That script is not included as a download here because the conversion code is embedded in a copyrighted file uploader script from here: http://www.perlservices.net/?ul. The script that is included instead uses an open source uploader from here: http://www.ftls.org/en/examples/cgi/eUpload.shtml. It does not work correctly and needs debugging.

The current primary developer, Paul Pruitt, is a beginner programmer and has simply attempted to crowd all the conversion code into the subroutine for the HTML results header of the uploader script, which apparently does not work. There are also serious security and cleanup problems that need resolving, and the Perl modules and CGI script are clumsily located in the same folder as the uploads.

Last edited Oct 17, 2009 at 3:08 PM by socrtwo, version 7