
l : libextractor@gnu.org 1 September 2008 • 4:23PM -0400

Re: [libextractor] Microsoft Office mimetype (OLE2) is not recognized reliable
by Christian Grothoff

On Saturday 30 August 2008 03:44:15 pm Marc wrote:
> Hi,
>
> great work on libextractor. I like to learn with it and figure things out
> while starting to learn Python.
>
> One problem I noticed:
>
> I am trying to distinguish the different Microsoft Office file formats
> using the mimetype information provided by libextractor (the files I am
> investigating have no filename extensions). The problem is that often only
> general information, e.g. "application/vnd.ms-office", is extracted.
> The result depends on the specific application that was last used to
> save the document/spreadsheet/presentation.
>
> I found out that other programs have similar problems with this job:
> - In the Linux distro Kubuntu Hardy that I use, XLS files without a
> filename extension appear as DOC in Konqueror
> - Windows XP can't do so either (in the file manager)
> - I also tried NLNZ Metadata Extractor v3.0 without success
> - The file command on the shell gives the wrong application type too

Well, AFAIK the reason is that to a large extent the
document/spreadsheet/presentation formats are pretty much the same, and they
all DO have the same mime-type (so it is not incorrect for LE to sometimes
report the same mime-type).  Internally, LE has one fallback mime-type
(application/vnd.ms-files) which is used if we have no idea what the actual MS
application is.  If LE is able to determine the "generator", then the
mime-type is chosen to be more specific:

  if (NULL != generator) {
    const char * mimetype = "application/vnd.ms-files";

    if((0 == strncmp(generator, "Microsoft Word", 14)) ||
       (0 == strncmp(generator, "Microsoft Office Word", 21)))
      mimetype = "application/msword";
    else if((0 == strncmp(generator, "Microsoft Excel", 15)) ||
            (0 == strncmp(generator, "Microsoft Office Excel", 22)))
      mimetype = "application/vnd.ms-excel";
    else if((0 == strncmp(generator, "Microsoft PowerPoint", 20)) ||
            (0 == strncmp(generator, "Microsoft Office PowerPoint", 27)))
      mimetype = "application/vnd.ms-powerpoint";
    else if(0 == strncmp(generator, "Microsoft Project", 17))
      mimetype = "application/vnd.ms-project";
    else if(0 == strncmp(generator, "Microsoft Visio", 15))
      mimetype = "application/vnd.visio";
    else if(0 == strncmp(generator, "Microsoft Office", 16))
      mimetype = "application/vnd.ms-office";

    prev = addKeyword(prev, mimetype, EXTRACTOR_MIMETYPE);
  }

One thing you could look at is the "generator" you get for the files that are
reported as vnd.ms-files.  If it is a specific application that is missing
from the above list, we could extend our list.
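
To illustrate what extending the list could look like, here is a minimal
sketch that expresses the same prefix matching as a lookup table, so that a
new generator becomes a one-line addition.  This is not the code in LE; the
names GeneratorMap, generator_map and mimetype_for_generator are hypothetical
and only serve the example.  The prefixes and mime-types are the ones from
the snippet above.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table entry (not part of libextractor):
   generator prefix -> mime-type. */
struct GeneratorMap
{
  const char *prefix;
  const char *mimetype;
};

/* Same prefixes as in the snippet above.  The generic "Microsoft Office"
   entry must stay last so that the more specific "Microsoft Office Word"
   etc. are matched first. */
static const struct GeneratorMap generator_map[] = {
  { "Microsoft Word",              "application/msword" },
  { "Microsoft Office Word",       "application/msword" },
  { "Microsoft Excel",             "application/vnd.ms-excel" },
  { "Microsoft Office Excel",      "application/vnd.ms-excel" },
  { "Microsoft PowerPoint",        "application/vnd.ms-powerpoint" },
  { "Microsoft Office PowerPoint", "application/vnd.ms-powerpoint" },
  { "Microsoft Project",           "application/vnd.ms-project" },
  { "Microsoft Visio",             "application/vnd.visio" },
  { "Microsoft Office",            "application/vnd.ms-office" },
  { NULL, NULL }                   /* end of table */
};

/* Returns the mime-type for a generator string, the fallback
   "application/vnd.ms-files" if no prefix matches, or NULL if there is
   no generator at all (the snippet above adds no keyword in that case). */
static const char *
mimetype_for_generator (const char *generator)
{
  const struct GeneratorMap *m;

  if (NULL == generator)
    return NULL;
  for (m = generator_map; NULL != m->prefix; m++)
    if (0 == strncmp (generator, m->prefix, strlen (m->prefix)))
      return m->mimetype;
  return "application/vnd.ms-files";
}
```

With this layout, a generator string that starts with "Microsoft Excel" maps
to application/vnd.ms-excel, and anything unknown falls back to
application/vnd.ms-files.  Using strlen on the prefix avoids keeping the
hand-counted lengths in sync with the strings.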

I'm not aware of any alternative / better way to determine the mimetype for MS
Office applications.

Christian


_______________________________________________
libextractor mailing list
libextractor@gnu....
http://lists.gnu.org/mailman/listinfo/libextractor
