Bugzilla

Comment 6

•

17 years ago

I can confirm this bug.

See  mozilla\modules\libjar\nsJAR.cpp  - implements nsIZipReader

nsJAR::FindEntries(const char *aPattern, nsIUTF8StringEnumerator **result)
{
  NS_ENSURE_ARG_POINTER(result);

  nsZipFind *find;
  nsresult rv = mZip.FindInit(aPattern, &find);
  NS_ENSURE_SUCCESS(rv, rv);

  nsIUTF8StringEnumerator *zipEnum = new nsJAREnumerator(find);

which wants a UTF-8 string   

We end up in nsUTF8Utils.h

     NS_ERROR("Not a UTF-8 string. This code should only be used for converting from known UTF-8 strings.");
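
To illustrate why an entry name stored in a legacy code page trips that assertion, here is a minimal standalone sketch of a UTF-8 well-formedness check. The helper name IsWellFormedUtf8 is hypothetical and this is not the code in nsUTF8Utils.h; it only shows that a lone byte such as 0x82 (é in code page 437) cannot appear in valid UTF-8.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not Mozilla code): returns true if `s` is well-formed
// UTF-8, rejecting stray continuation bytes, truncated sequences, overlong
// forms, surrogates, and values above U+10FFFF.
bool IsWellFormedUtf8(const std::string& s) {
  std::size_t i = 0;
  while (i < s.size()) {
    const unsigned char lead = static_cast<unsigned char>(s[i]);
    std::size_t extra = 0;    // continuation bytes that must follow
    unsigned long cp = 0;     // code point being assembled
    unsigned long minCp = 0;  // smallest code point allowed for this length
    if (lead < 0x80) {
      cp = lead;
    } else if ((lead & 0xE0) == 0xC0) {
      extra = 1; cp = lead & 0x1F; minCp = 0x80;
    } else if ((lead & 0xF0) == 0xE0) {
      extra = 2; cp = lead & 0x0F; minCp = 0x800;
    } else if ((lead & 0xF8) == 0xF0) {
      extra = 3; cp = lead & 0x07; minCp = 0x10000;
    } else {
      return false;  // 0x80..0xBF cannot start a sequence; 0xF8..0xFF are never valid
    }
    if (extra > s.size() - i - 1) return false;  // sequence cut short
    for (std::size_t k = 1; k <= extra; ++k) {
      const unsigned char c = static_cast<unsigned char>(s[i + k]);
      if ((c & 0xC0) != 0x80) return false;      // not a continuation byte
      cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < minCp || cp > 0x10FFFF) return false;   // overlong or out of range
    if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
    i += extra + 1;
  }
  return true;
}
```

With this sketch, "Saut\x82.txt" (the code page 437 spelling) is rejected, while "Saut\xC3\xA9.txt" (the UTF-8 spelling) is accepted.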

Dave Townsend [:mossop]

Comment 7

•

17 years ago

The ultimate problem here is that nsZipArchive is built according to the original zip spec which only allowed for ascii characters in the names. Updating it for the UTF-8 extensions to zip would be some work and would also require changes to the nsIZipReader interface.

Comment 8

•

17 years ago

Attached file input zip file to test case — Details

Comment 9

•

17 years ago

Attached file Javascript to show bug in a console window — Details

Comment 10

•

17 years ago

Attached file xul file to run Javascript, change path to where you put source js — Details

Comment 11

•

17 years ago

(In reply to comment #7)
> original zip spec which only allowed for ascii characters in the names.

Actually the original zip just took bytes, interpreted in whatever character set the reader wanted to interpret them as (whatever the local file system understood, basically). If you archived your own files or shared zips with friends in the same country as you, things basically worked.

The underlying nsZipArchive implementation is likewise character-set agnostic, and just takes C strings.

nsIZipReader appears to have turned into a horrible mix of methods that take "string" and "AUTF8String".

Comment 12

•

17 years ago

Attached file a file with code points above 127 that extracts fine — Details

file in zip is named saute.txt
Sauté C:\saute.txt C:\saute.txt
C:\saute.txt
C:\CookBook\TermsDictionaryGlossary\ZZAccented\à la bourguignonne.txt
C:\CookBook\TermsDictionaryGlossary\ZZAccented\à la Maitre d'Hotel.txt
à la king.html
Pls	no quotes	in ->Pont l'Évêque cheese

Comment 13

•

17 years ago

Attached file a file with a file name with code point above 127 — Details

file named Sauté.txt with same contents as  C:\FileWithNONAsciiInternalCharters.ZIP.  Crash and burn

Gervase Markham [:gerv]

Comment 14

•

17 years ago

Confirming due to comment #13.

Gerv

Status: UNCONFIRMED → NEW

Ever confirmed: true

Comment 15

•

17 years ago

Part of the problem is that prior to the relatively recent (2006) "general
purpose bit 11" flag to indicate utf8 filenames most zip implementations
allowed whatever filenames happened to be valid on the current OS -- it's
just storing length-counted char* so it doesn't really care what the caller
passes in. The "official" spec from the PKZIP people says that it's always
to be interpreted as IBM's old DOS codepage 437. That's already more than
just ascii, and since it's totally unsuited to anything outside US/western
europe that was taken as permission to just go ahead and use whatever
you've got. Windows versions of zip utilities do appear to call AnsiToOem()
on the filenames, though, probably to remain compatible with old
DOS-produced archives.
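
As a rough sketch of the rule described above (UTF-8 when general purpose bit 11 is set, code page 437 otherwise), the following standalone code converts a raw entry name to UTF-8. All names here (kUtf8NameFlag, EntryNameToUtf8, and so on) are hypothetical and nothing like this exists in nsZipArchive. The table covers only the accented-letter range 0x80..0xAF of code page 437; other high bytes become U+FFFD, and the sketch deliberately ignores the "whatever the OS used" archives.

```cpp
#include <cstdint>
#include <string>

// General purpose bit 11 (mask 0x0800) marks the entry name as UTF-8.
constexpr std::uint16_t kUtf8NameFlag = 0x0800;

// Unicode code points for code page 437 bytes 0x80..0xAF only (partial table).
static const char16_t kCp437Letters[48] = {
  0x00C7, 0x00FC, 0x00E9, 0x00E2, 0x00E4, 0x00E0, 0x00E5, 0x00E7,  // 0x80
  0x00EA, 0x00EB, 0x00E8, 0x00EF, 0x00EE, 0x00EC, 0x00C4, 0x00C5,  // 0x88
  0x00C9, 0x00E6, 0x00C6, 0x00F4, 0x00F6, 0x00F2, 0x00FB, 0x00F9,  // 0x90
  0x00FF, 0x00D6, 0x00DC, 0x00A2, 0x00A3, 0x00A5, 0x20A7, 0x0192,  // 0x98
  0x00E1, 0x00ED, 0x00F3, 0x00FA, 0x00F1, 0x00D1, 0x00AA, 0x00BA,  // 0xA0
  0x00BF, 0x2310, 0x00AC, 0x00BD, 0x00BC, 0x00A1, 0x00AB, 0x00BB   // 0xA8
};

// Appends a code point from the Basic Multilingual Plane as UTF-8.
void AppendUtf8(std::uint32_t cp, std::string& out) {
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
}

// Returns the entry name as UTF-8, assuming code page 437 when bit 11 is clear.
std::string EntryNameToUtf8(const std::string& raw, std::uint16_t flags) {
  if (flags & kUtf8NameFlag) return raw;  // already UTF-8 per the flag
  std::string out;
  for (unsigned char b : raw) {
    if (b < 0x80) out += static_cast<char>(b);
    else if (b <= 0xAF) AppendUtf8(kCp437Letters[b - 0x80], out);
    else AppendUtf8(0xFFFD, out);         // outside the partial table
  }
  return out;
}
```

For the attachment discussed in this bug, the bytes "Saut\x82.txt" with flags 0 would come out as the UTF-8 name "Sauté.txt".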

Interesting facts:

- WinZip creates archives with cp437 file names (Windows' "OEM" charset).
- WinRAR creates archives with "OEM" filenames.
- in contrast, command-line zip (cygwin) created archives using iso-8859
file names (Windows' "ANSI" character set) -- just guessing, but iso-8859 is
a common file system charset on Unix platforms, so maybe that's "more
compatible" in that world.
- none of the three used utf8 encoding or the general purpose bit 11 flag.
- all of command-line zip, WinZip, and 7-zip correctly read both versions,
restoring the correct characters when the archive was listed or extracted
whether cygwin's "ANSI" charset or the other's "OEM" charset.
- WinRAR created archives with "OEM" filenames, and automatically assumed
the characters were OEM; it mangled the cygwin zip filenames

The other difference between WinZip archives and zip archives is that the
cygwin zip archives are flagged as "unix" ("version made by" has a high
byte of x03 - UNIX, although the "version needed to extract" does not).
The cygwin zip also uses the optional "extra field" for file permissions and misc stuff, but I think that's less relevant.
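
For readers who want to check these fields themselves, here is a minimal sketch that pulls "version made by" and the general purpose flags out of a central directory file header. The struct and function names are hypothetical; the offsets follow the header layout in the PKWARE APPNOTE (4-byte signature, then version made by, version needed to extract, and general purpose bit flag, each 2 bytes, little-endian).

```cpp
#include <cstddef>
#include <cstdint>

// Hypothetical result type for the fields discussed above.
struct CentralHeaderInfo {
  bool valid;                // false if the buffer is too short or not a header
  std::uint8_t specVersion;  // low byte of "version made by" (e.g. 30 = 3.0)
  std::uint8_t hostSystem;   // high byte: 0 = MS-DOS/FAT, 3 = UNIX
  std::uint16_t flags;       // general purpose bit flag
};

// Reads a 16-bit little-endian value.
static std::uint16_t ReadLe16(const unsigned char* p) {
  return static_cast<std::uint16_t>(p[0] | (p[1] << 8));
}

// Parses the first 10 bytes of a central directory file header.
CentralHeaderInfo ParseCentralHeader(const unsigned char* buf, std::size_t len) {
  CentralHeaderInfo info = {false, 0, 0, 0};
  if (len < 10) return info;
  // Signature 0x02014b50, stored little-endian as 'P' 'K' 0x01 0x02.
  if (buf[0] != 0x50 || buf[1] != 0x4B || buf[2] != 0x01 || buf[3] != 0x02)
    return info;
  const std::uint16_t madeBy = ReadLe16(buf + 4);
  info.specVersion = static_cast<std::uint8_t>(madeBy & 0xFF);
  info.hostSystem = static_cast<std::uint8_t>(madeBy >> 8);
  info.flags = ReadLe16(buf + 8);
  info.valid = true;
  return info;
}
```

Because the host system byte is stored per entry, a reader would have to call something like this for every entry rather than once per archive, which matches the mixed-archive case described in this comment.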

Supporting "UTF8" zip archives is _not_ going to help these guys. If we
want to make all the interfaces UTF8 and then make a stab at charset
conversion from cp437 (or guess at other "OEM" charsets) that might help,
but the most common zip utilities are not creating archives with unicode
filenames.

If we want to support this we'll have to convert all the names as we read
the directory structure because theoretically this "made by unix" flag
could differ from file to file. Well, more than theoretical, I've created
them by using cygwin to add to a winzip-created archive and vice versa --
you get a nice mix of charsets there. Again, zip, WinZip, and 7-zip handle
the mixture gracefully, displaying the list in a unified character set
(whether at the command line or GUI display, or when extracting the files
into the filesystem charset).

Another potential wrinkle: the zip library is currently also built standalone so it can be incorporated into the old Suite installer. In that context it won't have access to our charset conversion routines. Maybe we no longer care if we can get SuiteRunner to use the NSIS installer in the FF3 timeframe. Otherwise we'll have to put the conversion in the level above nsZipArchive, which would be more awkward. I guess we could stub them out with calls to Windows' OemToAnsi.

If we could simply fix this internally to nsZipArchive I'd say just make
those char* interfaces always UTF8, converted from whatever they actually
are on disk.

Comment 16

•

17 years ago

There are other "version made by" values we might be concerned about. I have absolutely no idea if they come up in practice but if they do they might imply different character set conversions. The Macintosh value, for example, is probably the Mac-Roman character set. There's a different value for OS X--what character set would that be?--but we'd have to poke around and see if that's used. There's a Windows NTFS value that none of the Windows zip archivers I tried actually used, they stuck with the original "MS-DOS" value for compatibility presumably.
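
For reference, a small sketch mapping the host system byte to the names mentioned in this comment. The function name is hypothetical; the numeric codes are the ones listed for "version made by" in the PKWARE APPNOTE. The sketch does not attempt to pick a character set, since the spec does not tie one to each value.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: names the host system stored in the high byte of the
// 16-bit "version made by" field. Only the values discussed here are listed.
std::string HostSystemName(std::uint16_t versionMadeBy) {
  switch (versionMadeBy >> 8) {
    case 0:  return "MS-DOS and OS/2 (FAT)";
    case 3:  return "UNIX";
    case 7:  return "Macintosh";
    case 10: return "Windows NTFS";
    case 19: return "OS X (Darwin)";
    default: return "other";
  }
}
```

A reader experimenting with real archives could log this value per entry to see which host systems actually turn up in practice.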

We could peek at the info-zip code and see what they do with all this. I think their license is compatible with ours, so we don't have to worry about being tainted just by looking.

For reference the PKZIP spec is found at
http://www.pkware.com/documents/casestudies/APPNOTE.TXT

Comment 17

•

17 years ago

If we change our thinking to supporting web applications, not just the legacy zip specification, I think we will be better off. Integrate something like libiconv (http://www.gnu.org/software/libiconv/) with compression and a new string interface to support it. Support the compression part of the old standard for the major file systems, but add options for binary files and files to be converted from one codepage to another, and support the file system calls on the major platforms. Supporting the developers working on this project with development access to Macs, PCs, and Linux and Unix boxes would be very helpful. Give them some clear specifications of what has to be supported and what is optional. Exclude a few things. ISO-8859-1 gives you Windows in the US. Forget about things like x80 - x90 on Windows-1252. Trying to work out the wrinkles with those code points could be a career or two. This is an area where a small team needs to form. Split the project into two, with a Western and an Eastern focus. Another option that just supports file transfers between systems would be to distribute an open source zipper on Windows and provide a protected way of calling tar on Mac, Unix, and Linux.

I think Mr. Daniel Veditz has proved that the ZIP spec has enough ambiguities that it should not be taken literally. BTW thanks.

Comment 18

•

17 years ago

What does "supporting web applications" mean? Web applications don't make zip archives; they call command-line zip, use Java classes, or perl libraries. We need to handle what those produce.

The zip spec hosted by PKWARE is a good place to start, it's a living document and does represent some cross-vendor consensus in the interests of compatibility. Like web standards, specific implementations do vary.

I don't understand the link to libiconv. We already have character set conversion functionality in Mozilla.

The trick is to figure out what conversion to use since there doesn't seem to be a marker present in the files. I don't care about the difference between iso-8859-1 and cp1252, I want to know how to tell the difference between iso-8859-1 and iso-8859-2 or whatever multibyte character set the Asian countries will be using. We have the converters, we just need to know which one to use.

Comment 19

•

17 years ago

1) Bytes is bytes, unless you give them context. For example, for US English, if you want to support all valid filenames on both the PC and the MAC, first of all, you can’t. Second you can support 99.99 of them by eliminating the illegal characters on both platforms (and a couple of other restrictions). But you still have to call some kind of code page conversion so that the valid PC characters above dec 127 get converted into UTF MAC and vice versa. If you are going from MAC to PC, it gets a whole lot more complicated. You can translate a resource fork, but without a lot more application level support, it’s not going to be meaningful. But in a real sense that is presentation level data, just like application formats carry presentation level data, and you can choose not to support that. What is missing from the zip spec is the concept of From and To. It was not the pressing concern when we had 640k and most data transfer was done with an IBM mainframe tape.

2) From our earlier posts, “The trick is to figure out what conversion” is “options to convert them” and “(whatever the local file system understood, basically)”.

3) Let's say “like” libiconv means Mozilla’s character conversion routines.

4) I am sure I do not understand all the things a web application might do. But I would not preclude doing things with files and data hosted on local machines. And if you look at the success of the zip format, one of the things it did was help move data around from one machine to another. For a definition of what I was thinking about when I mentioned web applications, you can start with this http://en.wikipedia.org/wiki/Web_application, and then expand it.

What I am trying to get at, is the state of the art is such that we cannot move all data around seamlessly. We can move around a lot of data, accurately, if we live with a few restrictions.

I do not know what would happen if someone on a mac tried to zip up files with non ascii characters and then someone tried to read that zip on a PC. Does the mac zip up UTF? What happens when you try to unzip them with various programs on the PC?

If you use plain ascii, you do not have these problems, but then you do not fully support zip applications. What we have now is going to fail as soon as filenames use non ascii data. But, that is not a big deal, as long as you know what will and will not work for you.

In my case, I can live with accessing the zip utilities on the various platforms. I will convert my data to the target machines format. In the case of the MAC and UNIX, I can try and use tar, on the PC I can distribute an open source zip from info-zip.org.

skierpage

Comment 20

•

15 years ago

Firefox's amazing built-in jar: protocol handler shows this problem. Using the attachment from comment #13, the URL
  jar:https://bug296795.bugzilla.mozilla.org/attachment.cgi?id=265062!/
displays the ZIP file's contents, showing the file as "File:Saut?.txt"  (that appears to be the ASCII '?' mark symbol, not a missing glyph icon).

If you click the filename, Firefox attempts to load the file
  jar:https://bug296795.bugzilla.mozilla.org/attachment.cgi?id=265062!/Saut?.txt
(no escaping or encoding), and reports File not found.  There is nothing in the Error Console in Firefox 3.6apre1.

This also seems to expose bugs in the way the jar viewer represents and parses non-ASCII characters.

Phil Ringnalda (:philor)

Updated

•

15 years ago

Assignee: file-handling → nobody

QA Contact: ian → file-handling

Nickolay_Ponomarev

Updated

•

14 years ago

Blocks: 445065

Comment 21

•

14 years ago

If you attach multiple files to an email and send them to a Hotmail recipient, the recipient gets the option to "download all" attachments.  When you click on "download all", it zips up all the files and saves them to your hard drive.  It zips up files with non ascii characters in the filename.

Assignee

Comment 23

•

13 years ago

Hi

I think the problem above has two aspects:
1) Which encoding comes from the archive?
2) How does the nsIZipReader.idl interface work between JavaScript and C++?

Regarding aspect 1:
My openSUSE Linux uses UTF-8 for filenames.
For a file test_ü.txt in an archive, zipinfo shows:
-rw-r--r--  3.0 unx        7 tx stor 11-Aug-10 16:15 test_??.txt


I have analyzed the source code for aspect 2.

In comment 11, Dave gets to the point:
> nsIZipReader appears to have turned into a horrible mix of methods that take "string" and "AUTF8String".

The reason is in the files:
   mozilla/modules/libjar/nsIZipReader.idl
   mozilla/modules/libjar/nsJAR.cpp

The methods are:
   nsIUTF8StringEnumerator findEntries(in string aPattern);
   void extract(in string zipEntry, in nsIFile outFile);
   nsIZipEntry getEntry(in string zipEntry);
   boolean hasEntry(in AUTF8String zipEntry);

see IDL String types
   https://developer.mozilla.org/En/Mozilla_internal_string_guide

If I use this IDL from JavaScript, the nsIUTF8StringEnumerator returned by findEntries() is automatically converted to UTF-16. If I then pass these strings as input parameters to the other methods, zipEntry has different types.

The method boolean hasEntry(in AUTF8String zipEntry); works fine!!!
Here the string from JavaScript is automatically converted back to UTF-8.

All other methods in this IDL take ...(in string zipEntry).
In this case the high byte of each UTF-16 code unit is cut off.
For German umlauts "äöü" this means the name arrives as ISO-8859-1 rather than UTF-8.
And this doesn't match in mZip->GetItem().
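
The mismatch can be reproduced outside of XPCOM. The sketch below is a standalone imitation, with a hypothetical function name, of what happens when a UTF-16 string is squeezed through an 8-bit "string" parameter by dropping the high byte of each code unit. It is not the actual XPConnect conversion code.

```cpp
#include <string>

// Hypothetical imitation of passing UTF-16 through an 8-bit "string"
// parameter: only the low byte of every code unit survives.
std::string KeepLowBytes(const std::u16string& utf16) {
  std::string out;
  for (char16_t unit : utf16) {
    out.push_back(static_cast<char>(unit & 0xFF));
  }
  return out;
}
```

For u"test_\u00FC.txt" this yields the byte 0xFC for ü, which is the ISO-8859-1 spelling, whereas an archive written on a UTF-8 system stores 0xC3 0xBC, so a byte-wise lookup fails. Characters above U+00FF fare worse: U+0151 collapses to the unrelated byte 0x51 ('Q').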

I will try to build a patch that improves the interface. Then we can test whether other functions are able to cope.

Assignee

Comment 24

•

13 years ago

Attached patch nsIZipReader.idl use always AUTF8String as in parameter v1.0 (obsolete) — Details — Splinter Review

This is a patch that improves the interface nsIZipReader.idl.
I changed the parameter (in string aEntryName) to (in AUTF8String zipEntry).

JavaScript users don't need to change their code, because XPCOM's automatic type conversion takes care of it.

C++ callers have to change:
    parameter of type nsCString:  (xxx.get()) --> (xxx)
    parameter of type char*: (xxx) --> (nsDependentCString(xxx))
    for string literal use (NS_LITERAL_CSTRING("xxx"))
    for NULL use EmptyCString()

This patch changes all users of nsIZipReader.idl in mozilla-central.
These are:
    modules/libjar/*
    caps/src/nsScriptSecurityManager.cpp
    xpcom/components/nsComponentManager.cpp
    xpinstall/src/nsXPInstallManager.cpp

Attachment #557793 - Flags: feedback?(tglek)

Attachment #557793 - Flags: feedback?(benjamin)

Benjamin Smedberg

Comment 25

•

13 years ago

Comment on attachment 557793 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v1.0

It's not clear to me that we want to support non-ASCII paths if there is any possible confusion about the character sets. But I'll defer to Taras since I really don't have the time to work through all the consequences.

Attachment #557793 - Flags: feedback?(benjamin)

Comment 26

•

13 years ago

Comment on attachment 557793 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v1.0


> NS_IMETHODIMP
>-nsJAR::Test(const char *aEntryName)
>+nsJAR::Test(const nsACString &aEntryName)
> {
>-  return mZip->Test(aEntryName);
>+  char *entry = PL_strdup(PromiseFlatCString(aEntryName).get());
>+  if (!entry)
>+    return NS_ERROR_OUT_OF_MEMORY;
>+
>+  if (*entry == '\0')
>+    entry = nsnull;

This is a memory leak.
Use nsCString for entry.
>+
>+  nsresult rv = mZip->Test(entry);
entry.get()

>-nsJAR::FindEntries(const char *aPattern, nsIUTF8StringEnumerator **result)
>+nsJAR::FindEntries(const nsACString &aPattern, nsIUTF8StringEnumerator **result)
> {
>   NS_ENSURE_ARG_POINTER(result);
> 
>+  char *pattern = PL_strdup(PromiseFlatCString(aPattern).get());
>+  if (!pattern)
>+    return NS_ERROR_OUT_OF_MEMORY;
>+
>+  if (*pattern == '\0')
>+    pattern = nsnull;

leak

Sorry for the review lag, traveling atm. The rest of the patch looks good. This is an important fix, thanks for taking it on. I'm r-ing this for now, I'd like to review it again in case I missed anything.

Attachment #557793 - Flags: feedback?(tglek) → feedback-

Assignee

Comment 27

•

13 years ago

Attached patch nsIZipReader.idl use always AUTF8String as in parameter v2 (obsolete) — Details — Splinter Review

I don't know why I used local copies of the parameter.
Now I use aEntryName.IsEmpty() ? nsnull : PromiseFlatCString(aEntryName).get()
It's easier to use and to understand. I changed that in nsJAR::Test and nsJAR::FindEntries.
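
The same idea can be sketched with standard C++ types. Here std::string stands in for nsACString, and both function names are hypothetical. The sketch assumes the C-level API treats a null pointer as "no name given", which is what the null substitution in the v1 patch implies. Because the pointer is borrowed from the string rather than duplicated, there is nothing to free and so nothing to leak.

```cpp
#include <string>

// Hypothetical helper: borrow a C string from `s`, or null when `s` is empty.
// The pointer is owned by `s`; it stays valid only while `s` is alive and
// unmodified, so the caller must not free it.
const char* CStrOrNull(const std::string& s) {
  return s.empty() ? nullptr : s.c_str();
}

// Hypothetical stand-in for a C-level API that treats null as "no name given".
bool IsWildcardRequest(const char* entryName) {
  return entryName == nullptr;
}
```

A call such as IsWildcardRequest(CStrOrNull(name)) keeps the borrowed pointer within the lifetime of name, which is the property the strdup-based version lacked.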

What about the IDL documentation at https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIZipReader and nsIZipReaderCache?
Is this auto-generated or written by hand? nsIZipWriter is also missing!

If I have to change the documentation by hand, I still need an example of how to describe an incompatible IDL change. Must I create a new UUID for the interface?

Attachment #557793 - Attachment is obsolete: true

Attachment #560540 - Flags: review?(tglek)

O. Atsushi (Torisugari)

Comment 28

•

13 years ago

If high bit is the only problem, why AUTF8String instead of ACString?

Most non-ASCII zip entries are not encoded in UTF-8. Indeed, attachment 265062 [details] (comment #13) seems to be Code Page 437.

S    a    u    t    é    .    t    x    t
0x53 0x61 0x75 0x74 0x82 0x2E 0x74 0x78 0x74 (0x00)

In unicode, é is 0x000000E9 and in utf-8, 0xE9.

O. Atsushi (Torisugari)

Comment 29

•

13 years ago

(In reply to O. Atsushi (Torisugari) from comment #28)
> In unicode, é is 0x000000E9 and in utf-8, 0xE9.

s/0xE9/0xC3A9/

I'm sorry for the spam.
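
To make the corrected byte values easy to verify, here is a minimal standalone UTF-8 encoder for a single code point. The function name is hypothetical and this is not Mozilla's converter; it returns an empty string for surrogates and values above U+10FFFF.

```cpp
#include <cstdint>
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8.
// Returns an empty string for surrogates and out-of-range values.
std::string EncodeUtf8(std::uint32_t cp) {
  std::string out;
  if (cp < 0x80) {
    out += static_cast<char>(cp);
  } else if (cp < 0x800) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp < 0x10000) {
    if (cp >= 0xD800 && cp <= 0xDFFF) return out;  // surrogates are not encodable
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0x10FFFF) {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

EncodeUtf8(0xE9) produces the two bytes 0xC3 0xA9, so the same é appears as 0x82 in Code Page 437, as code point 0xE9 in Unicode, and as 0xC3 0xA9 in UTF-8.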

Assignee

Comment 30

•

13 years ago

(In reply to O. Atsushi (Torisugari) from comment #28)
> If high bit is the only problem, why AUTF8String instead of ACString?

AUTF8String is the best type to communicate with unicode between c++ and javascript.
AUTF8String is auto converted to UTF-16 in Javascript and the UI works also with UTF-16.
see: https://developer.mozilla.org/En/Mozilla_internal_string_guide#IDL_String_types

> Most of non-ascii zip entries are not encoded in UTF-8. Indeed, attachment
> 265062 [details] (comment #13) seems Code Page 437.

This depends on the platform. Windows 7 and many Linux distributions work with UTF-8.

> S    a    u    t    é    .    t    x    t
> 0x53 0x61 0x75 0x74 0x82 0x2E 0x74 0x78 0x74 (0x00)
> 
> In unicode, é is 0x000000E9 and in utf-8, 0xC3A9.

This is still a problem. In comment 23 I wrote:
   I think the Problem above has two aspects!

The patch in attachment 560540 [details] [diff] [review] solves only aspect 2!
Aspect 1 is still there, and we must remember that the bug
is not completely fixed by my patch.

The important thing is that the situation keeps getting better
and that this opens the door for the next step!

Comment 31

•

13 years ago

Comment on attachment 560540 [details] [diff] [review]
nsIZipReader.idl use always  AUTF8String as in parameter v2

You do indeed need to bump the uuid in nsIZipReader.idl see https://developer.mozilla.org/en/Generating_GUIDs (short story: /msg firebot uuid & replace existing one)

Attachment #560540 - Flags: review?(tglek) → review+

Comment 32

•

13 years ago

Wolfgang, thanks for the patch. I think it is important to be able to read archives with unicode filenames.

Assignee

Comment 33

•

13 years ago

Attached patch nsIZipReader.idl use always AUTF8String as in parameter v3 — Details — Splinter Review

My changes to v2:
new UUID and CID for nsIZipReader and nsIZipReaderCache

Please check the patch on Try Server and check in.
Remember that the bug is not completely fixed with my patch. (comment 30)

What about the IDL documentation at https://developer.mozilla.org/en/XPCOM_Interface_Reference/nsIZipReader and nsIZipReaderCache?
Is this auto-generated or shall I change the wiki by hand?

Attachment #560540 - Attachment is obsolete: true

Attachment #562019 - Flags: checkin?(tglek)

Comment 34

•

13 years ago

Wolfgang,
You should apply for level 1 access so you can push to try yourself. Documentation is manual, assigning dev-doc-needed keyword to a bug after closing will get someone to update it.

I pushed the patch to try https://tbpl.mozilla.org/?tree=Try&usebuildbot=1&rev=bbfc4cc0c427 should get results posted in here. I would like to aim at landing this asap after the next aurora merge so this spends time in nightly testing.

Mozilla RelEng Bot

Comment 35

•

13 years ago

Try run for bbfc4cc0c427 is complete.
Detailed breakdown of the results available here:
    https://tbpl.mozilla.org/?tree=Try&rev=bbfc4cc0c427
Results (out of 171 total builds):
    exception: 3
    success: 162
    warnings: 5
    failure: 1
Builds available at http://ftp.mozilla.org/pub/mozilla.org/firefox/try-builds/tglek@mozilla.com-bbfc4cc0c427

Ed Morley [:emorley]

Updated

•

13 years ago

Assignee: nobody → wgermund

Status: NEW → ASSIGNED

OS: Windows XP → All

Hardware: x86 → All

Whiteboard: [land after 27th sept aurora uplift]

Ed Morley [:emorley]

Comment 36

•

13 years ago

Comment on attachment 562019 [details] [diff] [review]
nsIZipReader.idl use always AUTF8String as in parameter v3

Thanks for the patch!

The try run looks good, but I'm going to remove the checkin flag for now (for the reason in comment 34), so this doesn't get checked in accidentally. As soon as the next aurora uplift happens (27th Sept) please re-add it and someone will push for you.

If you need any help getting level 1 commit access so you can push to try for future patches, just ask :-)

Attachment #562019 - Flags: checkin?(tglek)

Ed Morley [:emorley]

Updated

•

13 years ago

Keywords: dev-doc-needed

Comment 37

•

13 years ago

http://hg.mozilla.org/integration/mozilla-inbound/rev/6032f7c15af8

Whiteboard: [land after 27th sept aurora uplift] → [inbound]

Michael Wu [:mwu]

Comment 38

•

13 years ago

https://hg.mozilla.org/mozilla-central/rev/6032f7c15af8

Status: ASSIGNED → RESOLVED

Closed: 19 years ago → 13 years ago

Resolution: --- → FIXED

Whiteboard: [inbound]

Matt Brubeck (:mbrubeck)

Updated

•

13 years ago

Target Milestone: --- → mozilla10

Alfred Kayser

Comment 39

•

13 years ago

Around 405:
  const char *filename = PromiseFlatCString(aFilename).get();
  if (*filename)
  {
    //-- Find the item
    nsCStringKey key(filename);
    nsJARManifestItem* manItem = static_cast<nsJARManifestItem*>(mManifestData.Get(&key));
Could have been:
  if (!aFilename.IsEmpty()) {
    //-- Find the item
    nsJARManifestItem* manItem = static_cast<nsJARManifestItem*>(mManifestData.Get(&aFilename));

This saves the creation of an nsCString and the copy of the string itself.

Assignee

Updated

•

13 years ago

Blocks: 697061

Assignee

Comment 40

•

13 years ago

(In reply to Alfred Kayser from comment #39)

I filed a follow-up as bug 697061.

The parameter of mManifestData.Get(&key) must be an nsCStringKey, not an nsCString.