On Saturday 30 August 2008 03:44:15 pm Marc wrote:
> great work on libextractor; I like learning with it and figuring things out
> while starting to learn Python.
> One problem I noticed:
> I am trying to distinguish between the different Microsoft Office file
> formats using the mimetype information provided by libextractor (I have no
> filename extensions for the files I am investigating). The problem is that
> often only general information, e.g. "application/vnd.ms-office", is extracted.
> The result depends on the specific application that was used for the last
> save of the document/spreadsheet/presentation.
> I found out that other programs have similar problems with this task:
> - In the Linux distro Kubuntu Hardy that I use, e.g. XLS files without a
> filename extension appear as DOC in Konqueror
> - Windows XP can't do so either (in the file manager)
> - I also tried NLNZ Metadata Extractor v3.0 without success
> - The file command in the shell gives the wrong application type too
Well, AFAIK the reason is that to a large extent the
document/spreadsheet/presentation formats are pretty much the same -- and they
all DO have the same mime-type (so it is not incorrect for LE to sometimes
report the same mime-type). Internally, LE has one mime-type (vnd.ms-files)
which is used if we have no idea what the actual MS application is. If LE is
able to determine the "generator", then the MimeType is chosen to be more