Discussion:
problem with attachments in unicode (UTF16)
Philip Steeman
2008-03-20 17:46:57 UTC
Hello,
when I send a mail with IMP and add an attachment in UTF-16, it isn't
base64 encoded as it should be.
So it becomes corrupted when opened in Outlook or Thunderbird.

You can find a very short unicode-file here to do a test:
http://users.khbo.be/steeman/unicode.html

Is it an error in horde/imp, or in php or in ...

PS: I tested it with lots of browsers (in windows and linux) to get the
same result.

Versions:
horde 3.1.7
imp 4.1.6
php 4.3.10

Philip Steeman
Andrew Morgan
2008-03-20 22:49:30 UTC
Post by Philip Steeman
Hello,
when I sent a mail with imp and add a attachment in UTF16, it isn't
base64 encoded as it should be.
So it becomes corrupted when opened in Outlook or Thunderbird.
http://users.khbo.be/steeman/unicode.html
Is it an error in horde/imp, or in php or in ...
PS: I tested it with lots of browsers (in windows and linux) to get the
same result.
horde 3.1.7
imp 4.1.6
php 4.3.10
I tested this with Iceweasel (Firefox) 2.0.0.12 on Debian Unstable as the
client, and the latest stable releases of Horde and IMP with PHP5 on
Debian Etch as the server. The browser says the content-type is
text/plain when it uploads the attachment. Here is the exact attachment
that was sent in the email:

--=_3uemsho7ppkw
Content-Type: text/plain;
charset=UTF-8;
name="unicode.txt"
Content-Disposition: attachment;
filename="unicode.txt"
Content-Transfer-Encoding: quoted-printable

=FF=FEt=00h=00i=00s=00 =00i=00s=00 =00a=00 =00t=00e=00s=00t=00=0D=00
=00i=00n=00 =00U=00T=00F=001=006=00=0D=00
=00=0D=00
=00P=00h=00i=00l=00i=00p=00 =00S=00t=00e=00e=00m=00a=00n=00=0D=00
=00
--=_3uemsho7ppkw--

It used quoted-printable encoding instead of Base64. I'm not a
quoted-printable whiz, but it appears that the high-order bytes get encoded
as 00 (NUL) values. When I download this same attachment using IMP, it is
identical to your original unicode.txt file. However, I suspect
Thunderbird and Outlook are not combining the two bytes of data back
together (=FF=FE into FFFE) but are trying to render the NUL character.
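
For anyone who wants to check the decoding step, here is a minimal sketch. Python is used purely for illustration (it is not what IMP runs on), only the first few bytes of the sample above are decoded, and the variable names are made up.

```python
import quopri

# First bytes of the quoted-printable body shown above.
qp_body = b"=FF=FEt=00h=00i=00s=00"

# Undo the transfer encoding: this yields the raw bytes of the file.
raw = quopri.decodestring(qp_body)

# FF FE is a little-endian byte order mark; the generic "utf-16" codec
# reads it, picks the byte order, and drops it from the result.
text = raw.decode("utf-16")

print(raw.hex())
print(text)
```

The transfer encoding gives the original bytes back intact; whether they then turn into readable text depends only on which charset the reader applies to them.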

Anyways, I'm mostly writing this all down because I was interested enough
to test it myself. I have no idea if there is some way to get IMP to use
the UTF-16 character set instead.

Andy
Otto Stolz
2008-03-25 10:01:05 UTC
Hello,
Post by Andrew Morgan
I tested this with Iceweasel (Firefox) 2.0.0.12 on Debian Unstable as the
client, and the latest stable releases of Horde and IMP with PHP5 on
Debian Etch as the server. The browser says the content-type is
text/plain when it uploads the attachment. Here is the exact attachment
--=_3uemsho7ppkw
Content-Type: text/plain;
charset=UTF-8;
name="unicode.txt"
Content-Disposition: attachment;
filename="unicode.txt"
Content-Transfer-Encoding: quoted-printable
=FF=FEt=00h=00i=00s=00 =00i=00s=00 =00a=00 =00t=00e=00s=00t=00=0D=00
=00i=00n=00 =00U=00T=00F=001=006=00=0D=00
=00=0D=00
=00P=00h=00i=00l=00i=00p=00 =00S=00t=00e=00e=00m=00a=00n=00=0D=00
=00
--=_3uemsho7ppkw--
It used quoted-printable encoding instead of Base64. I'm not a
quoted-printable whiz, but it appears that the high-order bytes get encoded
as 00 (NUL) values. When I download this same attachment using IMP, it is
identical to your original unicode.txt file. However, I suspect
Thunderbird and Outlook are not combining the two bytes of data back
together (=FF=FE into FFFE) but are trying to render the NUL character.
The problem is the wrong charset specification: for a UTF-16 encoded text,
it should, of course, read "UTF-16" rather than "UTF-8". I guess that
wrong specification stems from the browser used to upload the file.

Because of that wrong specification, the addressee will not interpret the
text as intended. In particular:
- The individual bytes will not be assembled into 16-bit units.
- Any bytes above 127 will be interpreted according to UTF-8 rules;
in particular, the two leading bytes (meant as BOM) will be considered
as illegal input values, and most probably be replaced with Replacement
Characters U+FFFD.
- In due course, the endianness of the UTF-16 text will be lost.
That particular text is little-endian; the UTF-8 bytes will be
interpreted in the opposite sequence. Hence, the two halves of each
16-bit unit will effectively be swapped, and even if you try
to read the attachment as a UTF-16 file, you'll be out of luck.
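
The first two effects in the list above can be reproduced with a short sketch. It is written in Python only for illustration, on the assumption that the receiving client decodes the bytes strictly by the declared charset and substitutes U+FFFD for invalid input; real mail clients may behave differently.

```python
# UTF-16LE bytes with a BOM, as in the test file (shortened to one word).
raw = b"\xff\xfe" + "this".encode("utf-16-le")

# What a client does when it trusts the charset=UTF-8 label: the two BOM
# bytes are not valid UTF-8 and become U+FFFD; every NUL byte survives
# as a NUL character between the letters.
as_utf8 = raw.decode("utf-8", errors="replace")

# What the sender intended.
as_utf16 = raw.decode("utf-16")

print(repr(as_utf8))
print(repr(as_utf16))
```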

The quoted-printable encoding is alright; the Content-Transfer-Encoding
is totally irrelevant for the problems the two preceding posts in this
thread have described.

Good luck,
Otto Stolz
Philip Steeman
2008-03-25 10:10:32 UTC
I've tested the upload with all browsers I have
- IE6 (windows XP)
- IE7 (windows XP)
- Firefox 2 (windows XP)
- konqueror (Knoppix)

All gave the same wrong result.

Philip
Andrew Morgan
2008-03-25 19:12:47 UTC
Post by Philip Steeman
I've tested the upload with all browsers I have
- IE6 (windows XP)
- IE7 (windows XP)
- Firefox 2 (windows XP)
- konqueror (Knoppix)
All gave the same wrong result.
When I view the file locally using the url file:///tmp/unicode.txt,
Iceweasel correctly identifies it as UTF-16LE according to the Page Info
screen.

I managed to grab a packet capture of my browser uploading the file to IMP
during new message composition. This is from the POST data:

Content-Disposition: form-data; name="upload_1"; filename="unicode.txt"
Content-Type: text/plain

..t.h.i.s. .i.s. .a. .t.e.s.t.
.
.i.n. .U.T.F.1.6.
.
.
.
.P.h.i.l.i.p. .S.t.e.e.m.a.n.
.
.


The periods are actually null (00) bytes in the data stream.


Further testing shows that for attachments with Primary Type = 'text'
(from type 'text/plain' for example), IMP sets the charset of the
attachment to the character set of your language in IMP. When I choose
Japanese as my language when logging into IMP, the unicode.txt attachment
charset is "SHIFT_JIS". I suppose this means if I can force IMP into a
UTF-16 language, it would correctly identify the attachment.

This seems like a tricky issue.

Andy
Tim Bannister
2008-03-25 09:17:54 UTC
Post by Andrew Morgan
Post by Philip Steeman
So it becomes corrupted when opened in Outlook or Thunderbird.
I tested this with Iceweasel (Firefox) 2.0.0.12 on Debian Unstable as the
client, and the latest stable releases of Horde and IMP with PHP5 on
Debian Etch as the server. The browser says the content-type is
text/plain when it uploads the attachment. Here is the exact attachment
You haven't mentioned what encoding the browser claimed when uploading the
file. For this to work, the browser needs to upload the file with
metadata like this:
Content-Type: text/plain; charset=UTF-16

I don't think my main browser would get the upload part right. It tends
to be guesswork on most systems as filesystem metadata (etc) are not
trustworthy enough.
--
Tim Bannister
IT Services division
The University of Manchester

w: http://www.manchester.ac.uk/itservices
Michael M Slusarz
2008-03-25 17:02:04 UTC
Post by Tim Bannister
Post by Andrew Morgan
Post by Philip Steeman
So it becomes corrupted when opened in Outlook or Thunderbird.
I tested this with Iceweasel (Firefox) 2.0.0.12 on Debian Unstable as the
client, and the latest stable releases of Horde and IMP with PHP5 on
Debian Etch as the server. The browser says the content-type is
text/plain when it uploads the attachment. Here is the exact attachment
You haven't mentioned what encoding the browser claimed when uploading the
file. For this to work, the browser needs to upload the file with
Content-Type: text/plain; charset=UTF-16
No - this is incorrect. The correct (and unfortunate) answer is that
we can not detect the charset of a text attachment if it is in a
different charset than the browser. Browser upload information does
not contain the charset of the uploaded data, only the type - all we
have to go by is the charset the browser reports to us via the HTTP
headers.

This is the reason UTF-8 is used to encode the file, and this is why
the quoted-printable encoding is incorrect. There is nothing wrong
with the way we Q-P - but if we Q-P using the wrong charset, the data
is going to be invalid.

The greater issue is that PHP provides us no means to determine what
the charset of the given file is. There is a function in the mb
extension called mb_detect_encoding(). However this function is
non-mandatory for use of IMP, is buggy and not fully reliable, and
doesn't detect, among other charsets, UTF-16 data. So it is useless
for present purposes. The libmagic file extension can provide charset
guesses when it determines a file is a text file, but again it is not
required for IMP and doesn't produce correct results reliably enough
(for example, on my system it detects the UTF-16 test file as
audio/mpeg).

We could attach text files as application/octet-stream, base64 encoded
data but this gets us no further - it may work to view if you download
the text file to an OS that can detect the charset, but it would not
render properly in an environment where charset detection is not
available (say, for example, PHP and IMP).

michael
--
___________________________________
Michael Slusarz [slusarz at horde.org]
Otto Stolz
2008-03-26 09:36:34 UTC
Hello,
Post by Michael M Slusarz
No - this is incorrect. The correct (and unfortunate) answer is that
we can not detect the charset of a text attachment if it is in a
different charset than the browser. Browser upload information does
not contain the charset of the uploaded data, only the type - all we
have to go by is the charset the browser reports to us via the HTTP
headers.
...
Post by Michael M Slusarz
The greater issue is that PHP provides us no means to determine what
the charset of the given file is.
You could inspect the leading two or three bytes of the uploaded
text file:
- If they are EF BB BF, it is almost certainly UTF-8.
- If they are FE FF, it is most probably UTF-16BE.
- If they are FF FE, it is most probably UTF-16LE.

This would correctly identify every Unicode-encoded text file
uploaded from a Windows system (which still constitutes the
majority of the end-user systems). Of course, this method does
not detect every encoding from every end-user system, but it
would make a great step toward a correct tagging of text type
attachments.
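
The check described above could look roughly like the following sketch. It is in Python rather than PHP, purely as an illustration, and `sniff_bom` is a made-up name, not an existing IMP or Horde function.

```python
import codecs
from typing import Optional

# Signatures from the list above, mapped to the charset they suggest.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),         # EF BB BF
    (codecs.BOM_UTF16_BE, "UTF-16BE"),  # FE FF
    (codecs.BOM_UTF16_LE, "UTF-16LE"),  # FF FE
)


def sniff_bom(data: bytes) -> Optional[str]:
    """Return the charset suggested by a leading BOM, or None if there is none."""
    for bom, charset in _BOMS:
        if data.startswith(bom):
            return charset
    return None


print(sniff_bom(b"\xff\xfet\x00h\x00"))  # UTF-16LE
print(sniff_bom(b"plain ASCII text"))    # None
```

One caveat: a UTF-32LE file also begins with FF FE (followed by 00 00), so this check alone would label it as UTF-16LE.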

If the uploaded text file does not contain a BOM, you could
take the first entry from the Accept-Charset header as a guess
for the file's encoding. This is, of course, less reliable,
but would be right for most files from out-of-the-box browser
installations.
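
A minimal sketch of that fallback, in Python for illustration only. The function name is hypothetical and the header value is just an example of what a browser might send.

```python
from typing import Optional


def first_accept_charset(header: str) -> Optional[str]:
    """Return the first charset named in an Accept-Charset header value."""
    for item in header.split(","):
        # Drop any parameters such as ";q=0.7".
        name = item.split(";", 1)[0].strip()
        if name and name != "*":
            return name
    return None


print(first_accept_charset("ISO-8859-1,utf-8;q=0.7,*;q=0.7"))  # ISO-8859-1
```

This takes the first entry literally, as suggested above, and ignores the q-values; ranking by q-value would be a different heuristic.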

To be on the safe side, you could add a Charset field to the
Attachments line in the Message Composition form (similar to
the Charset field in the header zone of that form). That
attachment-charset field would be preset to the value resulting
from the procedure outlined above, but would provide an
opportunity to override the preset value.
Post by Michael M Slusarz
There is nothing wrong
with the way we Q-P - but if we Q-P using the wrong charset, the data
is going to be invalid.
To avoid a possible misunderstanding of this wording: The Q-P encoding
poses no problem, even if sailing under false colours, charsetwise.
Q-P simply encodes the byte 3D, and bytes above 7F, by their hexadecimal
values, which will be decoded without any problem. When tagged as UTF-8,
as in the examples discussed so far, even the byte-order is sure to be
preserved.
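
The round trip can be checked with a few lines. This is a sketch in Python, using its standard quopri module as a stand-in for whatever encoder IMP uses.

```python
import quopri

# BOM followed by "th=" in UTF-16LE; includes the bytes 3D and FF/FE.
raw = b"\xff\xfet\x00h\x00=\x00"

encoded = quopri.encodestring(raw)
decoded = quopri.decodestring(encoded)

print(encoded)
print(decoded == raw)
```

The bytes come back unchanged and in the same order, whatever charset label travels alongside them.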

The only problem is the wrong Charset tag, as it will cause particular
byte values (or sequences thereof) to be considered illegal and, in due
course, to be replaced with Replacement Characters (or, perhaps, even
dropped).

Best wishes,
Otto Stolz
Andrew Morgan
2008-03-26 17:43:37 UTC
Post by Philip Steeman
Hello,
Post by Michael M Slusarz
No - this is incorrect. The correct (and unfortunate) answer is that
we can not detect the charset of a text attachment if it is in a
different charset than the browser. Browser upload information does
not contain the charset of the uploaded data, only the type - all we
have to go by is the charset the browser reports to us via the HTTP
headers.
...
Post by Michael M Slusarz
The greater issue is that PHP provides us no means to determine what
the charset of the given file is.
You could inspect the leading two or three bytes of the uploaded
- If they are EF BB BF, it is almost certainly UTF-8.
- If they are FE FF, it is most probably UTF-16BE.
- If they are FF FE, it is most probably UTF-16LE.
This would correctly identify every Unicode-encoded text file
uploaded from a Windows system (which still constitutes the
majority of the end-user systems). Of course, this method does
not detect every encoding from every end-user system, but it
would make a great step toward a correct tagging of text type
attachments.
This seems like a reasonable method to detect UTF-16 encoded text files.
I'm not sure about using it for UTF-8 though. Wikipedia says:

Although not part of the standard, many Windows programs (including
Windows Notepad) use the byte sequence EF BB BF at the beginning of a
file to indicate that the file is encoded using UTF-8. This is the Byte
Order Mark U+FEFF encoded in UTF-8, which appears as the ISO-8859-1
characters "ï»¿" in most text editors and web browsers not prepared
to handle UTF-8.
Post by Philip Steeman
If the uploaded text file does not contain a BOM, you could
take the first entry from the Accept-Charset header as a guess
for the file's encoding. This is, of course, less reliable,
but would be right for most files from out-of-the-box browser
installations.
If IMP is not able to use the Byte Order Mark to detect the encoding, then
it should assume the file is encoded using the currently selected
language/encoding in Horde.
Post by Philip Steeman
To be on the safe side, you could add a Charset field to the
Attachments line in the Message Composition form (similar to
the Charset field in the header zone of that form). That
attachment-charset field would be preset to the value resulting
from the procedure outlined above, but would provide an
opportunity to override the preset value.
This is probably overkill, and would certainly clutter the interface a
lot. :)

Andy
Otto Stolz
2008-03-27 12:00:54 UTC
Hello,
Post by Michael M Slusarz
Browser upload information does
not contain the charset of the uploaded data, only the type - all we
have to go by is the charset the browser reports to us via the HTTP
headers.
This morning, I have tried to learn, from RFC 2616, the syntax and
requirements for uploading files via POST; but I couldn't make head
or tail of it. At least, I have found, in section 3.7.1
<http://rfc-ref.org/RFC-TEXTS/2616/chapter3.html#sub7sub1>, the
following:
When no explicit charset parameter is provided by the sender,
media subtypes of the "text" type are defined to have a default
charset value of "ISO-8859-1" when received via HTTP. Data in
character sets other than "ISO-8859-1" or its subsets MUST be
labeled with an appropriate charset value.
Apparently, the former clause says that Imp should tag uploaded
text-type attachments with "charset=ISO-8859-1", if no charset
parameter is given by the browser. Apparently, the latter clause
says that all browsers are buggy, as they do not provide a
charset parameter when uploading text files. Or am I mistaken?

So I am still in doubt, whether the right way is to lobby for
better, standard-conforming browsers, or to mend Imp to cope with
current browsers' behaviour (as discussed below), or even both.
Post by Michael M Slusarz
You could inspect the leading two or three bytes of the uploaded
- If they are EF BB BF, it is almost certainly UTF-8.
- If they are FE FF, it is most probably UTF-16BE.
- If they are FF FE, it is most probably UTF-16LE.
Although not part of the standard, many Windows programs (including
Windows Notepad) use the byte sequence EF BB BF at the beginning of a
file to indicate that the file is encoded using UTF-8.
This information is obsolete. The Unicode Standard 5.0 says otherwise,
cf. Table 2-4 in <http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G19273>,
sub-section "Unicode Signature" in
<http://www.unicode.org/versions/Unicode5.0.0/ch02.pdf#G9354>,
and <http://www.unicode.org/versions/Unicode5.0.0/ch16.pdf#G25817>.

Note, however, that an attachment commencing with a BOM
should be tagged with "charset=UTF-16". If it is tagged with
"charset=UTF-16BE", say, an initial FEFF code-unit would be interpreted
as a Zero Width No-Breaking Space, which would not be visible in the
rendering, but could well impede the automatic processing of the
data.
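
Python's codecs happen to follow the same convention, which makes the difference easy to see in a small sketch (for illustration only; the sample is little-endian, so the endian-specific label used is UTF-16LE):

```python
raw = b"\xff\xfet\x00e\x00s\x00t\x00"  # BOM followed by "test" in UTF-16LE

# Generic label: the BOM is consumed and only selects the byte order.
generic = raw.decode("utf-16")

# Endian-specific label: the BOM is kept as a leading U+FEFF character.
specific = raw.decode("utf-16-le")

print(repr(generic))   # 'test'
print(repr(specific))  # '\ufefftest'
```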
Post by Michael M Slusarz
If the uploaded text file does not contain a BOM, you could
take the first entry from the Accept-Charset header as a guess
for the file's encoding. This is, of course, less reliable,
but would be right for most files from out-of-the-box browser
installations.
If IMP is not able to use the Byte Order Mark to detect the encoding,
then it should assume the file is encoded using the currently selected
language/encoding in Horde.
My rationale is that, in a typical Windows system, most text files
will be stored in the system codepage, e. g. CP 1252 in a German
system; likewise, most browsers would be configured to accept the
system codepage as 1st priority (or an almost compatible one, such
as ISO 8859-1, in a German system). So you could take the browser's
Accept-Charset header as a hint for the prevalent encoding of text
files to be uploaded from that very system.

I am not familiar, though, with Mac, and Linux, workstations, so I
cannot exclude that they might deserve a different treatment (which
could be based on the User-Agent header). Any opinions from experts?

In contrast, the language in Horde is selected by the
current user of the system, e. g. a guest in an internet-shop,
or a student at a public workstation in our university. Of course,
they could bring in their text files on memory sticks, but they
also could cut various texts from various sources and paste and
then store them locally, using the system codepage. I am really
not sure what will be the more common case.

And the Horde encoding is selected by the translator of the
language files. For many languages, several different encodings
are possible, and even widely used. Hence, from the language
selected by the Horde user, you cannot reliably infer the pertinent
encoding (let alone the encoding of the files uploaded by him).
Post by Michael M Slusarz
To be on the safe side, you could add a Charset field to the
Attachments line in the Message Composition form (similar to
the Charset field in the header zone of that form).
This is probably overkill, and would certainly clutter the interface a
lot. :)
If the POST info does not (and never will) specify the encoding of
an uploaded text file, this is the only feasible way to comply with
section 4.1.2 of RFC 2046
<http://rfc-ref.org/RFC-TEXTS/2046/chapter4.html#sub1sub2>.

And I think, this will not clutter the interface, cf. attached
screen shots (faked, of course).
Post by Michael M Slusarz
That
attachment-charset field would be preset to the value resulting
from the procedure outlined above, but would provide an
opportunity to override the preset value.
On second thought, I am not sure whether this is feasible:
This would amount to reading the file, via JavaScript, immediately
after it has been selected, and before it is uploaded. I have not
delved enough into JavaScript to assess the feasibility of this
approach. But if JavaScript code is used to guess the encoding
of the text file to be uploaded, the second step proposed above
does not apply; rather, the JavaScript code should be able to
find the system codepage, directly.

Best wishes,
Otto Stolz

-------------- next part --------------
A non-text attachment was scrubbed...
Name: Imp-attachment-updated.png
Type: image/png
Size: 2683 bytes
Desc: not available
-------------- next part --------------
A non-text attachment was scrubbed...
Name: Imp-attachment.png
Type: image/png
Size: 1680 bytes
Desc: not available
Jan Schneider
2008-03-27 12:36:45 UTC
Post by Otto Stolz
Post by Otto Stolz
That
attachment-charset field would be preset to the value resulting
from the procedure outlined above, but would provide an
opportunity to override the preset value.
This would amount to reading the file, via JavaScript, immediately
after it has been selected, and before it is uploaded. I have not
delved enough into JavaScript to assess the feasibility of this
approach. But if JavaScript code is used to guess the encoding
of the text file to be uploaded, the second step proposed above
does not apply; rather, the JavaScript code should be able to
find the system codepage, directly.
This won't work of course, because you can't read local files with JavaScript.

Jan.
--
Do you need professional PHP or Horde consulting?
http://horde.org/consulting/
Philip Steeman
2008-03-27 12:49:07 UTC
I did some more small tests (perhaps not very useful):

- webmail used from my Belgian provider: unicode is encoded in base64
(so OK!). Really don't know what software it is.
- tried it with squirrelmail. Got a PHP error with unicode attachments
(not with normal text files). I'm not a squirrelmail wizard, I installed
it from Debian packages in a hurry.
- when you test it with a standard Linux command:
linux% file -i unicode.txt
unicode.txt: text/plain; charset=utf-16
So the file utility gets it right (as does the webmail of my provider).

Philip
Otto Stolz
2008-03-27 12:52:38 UTC
Hello,
Post by Jan Schneider
This won't work of course, because you can't read local files with JavaScript.
This is what I had feared.

So, I think the best solution would be:
- Provide a Charset field next to the file-selection widget
for the user to specify the encoding of the file he chooses
for uploading;
- if the user chooses a text file and a charset, tag the
attachment so; optionally, warn if the uploaded file contains
illegal data w.r.t. the charset chosen;
- if the user chooses a text file, but leaves the Charset
default value "unknown" alone, try to guess the charset,
as discussed earlier in this thread;
- if the user chooses a non-text file, ignore the value
of the Charset field.

Best wishes,
Otto Stolz
Andrew Morgan
2008-03-27 18:06:33 UTC
Post by Philip Steeman
Hello,
Post by Jan Schneider
This won't work of course, because you can't read local files with JavaScript.
This is what I had feared.
- Provide a Charset field next to the file-selection widget
for the user to specify the encoding of the file he chooses
for uploading;
- if the user chooses a text file and a charset, tag the
attachment so; optionally, warn if the uploaded file contains
illegal data w.r.t. the charset chosen;
- if the user chooses a text file, but leaves the Charset
default value "unknown" alone, try to guess the charset,
as discussed earlier in this thread;
- if the user chooses a non-text file, ignore the value
of the Charset field.
Or, only provide a Charset selection widget after the file has been
uploaded and identified as "text". Hmm, or would it make more sense to
just use the Charset chosen for the message in the existing Charset
selection widget?

Maybe I'm just being a silly American, but is this really such a large
problem that we need to add all this complexity to IMP? This discussion
is the first time I've ever seen a UTF-16 file. :)

Andy
Michael M Slusarz
2008-03-27 18:22:13 UTC
Post by Andrew Morgan
Maybe I'm just being a silly American, but is this really such a large
problem that we need to add all this complexity to IMP? This discussion
is the first time I've ever seen a UTF-16 file. :)
It is important for the somewhat more common/plausible example of a
browser reporting UTF-8 and uploading a text file encoded in
iso-8859-1 or vice versa.
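
Both directions of that mismatch are easy to demonstrate. This is a small sketch in Python, for illustration only, with a made-up sample word.

```python
text = "café"

latin1_bytes = text.encode("iso-8859-1")  # 63 61 66 e9
utf8_bytes = text.encode("utf-8")         # 63 61 66 c3 a9

# ISO-8859-1 file labelled as UTF-8: the lone E9 byte is invalid.
print(repr(latin1_bytes.decode("utf-8", errors="replace")))

# UTF-8 file labelled as ISO-8859-1: every byte is "valid", so the
# damage is silent and shows up as two characters instead of one.
print(repr(utf8_bytes.decode("iso-8859-1")))
```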

michael
--
___________________________________
Michael Slusarz [slusarz at horde.org]
Philip Steeman
2008-03-27 18:35:47 UTC
Post by Andrew Morgan
Maybe I'm just being a silly American, but is this really such a large
problem that we need to add all this complexity to IMP? This discussion
is the first time I've ever seen a UTF-16 file. :)
When you want to mail a .rdp file (remote desktop program), it becomes
corrupt.
Other problem files: .sav (from SPSS program). I think there must be
more examples.

So the real problem (for me) is not real text files (made with an editor),
but when a student has to mail his work (.sav file) with webmail and the
teacher uses Outlook/Thunderbird there is a problem.
For the moment we have 2 solutions:
- zip the file
- let the teacher use imp (best!)

Philip
Michael M Slusarz
2008-03-27 18:40:44 UTC
Post by Philip Steeman
Post by Andrew Morgan
Maybe I'm just being a silly American, but is this really such a large
problem that we need to add all this complexity to IMP? This discussion
is the first time I've ever seen a UTF-16 file. :)
When you want to mail a .rdp file (remote desktop program), it becomes
corrupt.
Other problem files: .sav (from SPSS program). I think there must be
more examples.
NO. This is an entirely different issue dealing with a broken
browser. This has nothing to do with charsets.

michael
--
___________________________________
Michael Slusarz [slusarz at horde.org]
Philip Steeman
2008-03-28 08:54:48 UTC
Post by Michael M Slusarz
Post by Philip Steeman
When you want to mail a .rdp file (remote desktop program), it becomes
corrupt.
Other problem files: .sav (from SPSS program). I think there must be
more examples.
NO. This is an entirely different issue dealing with a broken
browser. This has nothing to do with charsets.
You mean all the browsers I tested (IE6, IE7, Firefox, Konqueror) are
broken? I don't see a difference between a unicode text file (UTF-16) and
a configuration file '.rdp' or an output file '.sav', all in unicode.
To test you can download a .rdp file from
http://users.khbo.be/steeman/unicode.html

Thank you for all your effort.

Philip
Andrew Morgan
2008-03-28 23:20:14 UTC
Post by Philip Steeman
Post by Michael M Slusarz
Post by Philip Steeman
When you want to mail a .rdp file (remote desktop program), it becomes
corrupt.
Other problem files: .sav (from SPSS program). I think there must be
more examples.
NO. This is an entirely different issue dealing with a broken
browser. This has nothing to do with charsets.
You mean all the browser I tested (IE6 IE7 Firefox konqueror) are
broken? I don't see a difference between a unicode textfile (utf-16) and
a configuration-file '.rdp' or a output-file '.sav' all in unicode.
To test you can download a .rdp file from
http://users.khbo.be/steeman/unicode.html
Since test.rdp is simply a text file that happens to be encoded in UTF-16,
I agree that it will suffer from the same problem.

Andy
Philip Steeman
2008-04-23 11:46:50 UTC
Hello,
would it be a solution to map certain extensions to the correct MIME types
(not perfect, but enough for me)?

e.g.:
.rdp --> Content-Type: text/plain;
--> Content-Transfer-Encoding: base64
.sav --> Content-Type: text/plain;
--> Content-Transfer-Encoding: base64

best solution: via a configuration-file

other possibility: can I hardcode it anywhere in the source?
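
To make the idea concrete, here is a rough sketch of such a mapping. It is written in Python with the standard email package rather than in IMP's PHP, the extension list and function names are made up, and it labels the part application/octet-stream instead of the text/plain shown above, since no charset is known for these files.

```python
import os
from email import encoders
from email.mime.base import MIMEBase

# Hypothetical list of extensions whose content must travel untouched.
FORCE_BASE64 = {".rdp", ".sav"}


def wants_base64(filename: str) -> bool:
    """True if the file extension is on the force-base64 list."""
    return os.path.splitext(filename)[1].lower() in FORCE_BASE64


def build_opaque_attachment(filename: str, data: bytes) -> MIMEBase:
    """Build an attachment part that is sent as opaque base64 data."""
    part = MIMEBase("application", "octet-stream")
    part.set_payload(data)
    encoders.encode_base64(part)  # also sets Content-Transfer-Encoding
    part.add_header("Content-Disposition", "attachment", filename=filename)
    return part


data = b"\xff\xfes\x00c\x00r\x00"
if wants_base64("test.rdp"):
    part = build_opaque_attachment("test.rdp", data)
    print(part["Content-Type"], part["Content-Transfer-Encoding"])
```

In a real setup the extension list would come from a configuration file, as suggested above, rather than from a constant.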

Philip
Michael M Slusarz
2008-03-27 18:44:58 UTC
Post by Otto Stolz
- Provide a Charset field next to the file-selection widget
for the user to specify the encoding of the file he chooses
for uploading;
No. Because 99.9% of users have no idea what a charset is. Even
I, as a somewhat experienced user, have no idea what charset my text
docs are in (and nor do I care what their charset is).
Post by Otto Stolz
- if the user chooses a text file and a charset, tag the
attachment so; optionally, warn if the uploaded file contains
illegal data w.r.t. the charset chosen;
No. See above.
Post by Otto Stolz
- if the user chooses a text file, but leaves the Charset
default value "unknown" alone, try to guess the charset,
as discussed earlier in this thread;
Altered a bit: if a user uploads a text file, attempt to "guess" the
charset. This will need to be done in PHP code. Possible Perl
modules that may be useful to port to PHP for this purpose:

http://search.cpan.org/dist/Encode-Detect/
http://search.cpan.org/~dankogai/Encode-2.24/ (more specifically, the
Encode::Guess module)

Fall back to the charset the browser is using since that is (most
likely) the charset used by the underlying OS.
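
The overall flow (guess first, then fall back) might be sketched as follows. This is Python for illustration only and is not a port of either Perl module; the function name and the particular heuristics are assumptions, and a real detector would need to be considerably more thorough.

```python
import codecs


def guess_charset(data: bytes, browser_charset: str) -> str:
    """Guess the charset of an uploaded text file (illustrative heuristic only)."""
    # 1. A BOM is the strongest hint; "UTF-16" lets the BOM carry the byte order.
    if data.startswith(codecs.BOM_UTF8):
        return "UTF-8"
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "UTF-16"
    # 2. Pure ASCII tells us nothing, so skip straight to the fallback.
    # 3. Non-ASCII data that decodes cleanly as UTF-8 is very likely UTF-8.
    try:
        data.decode("ascii")
    except UnicodeDecodeError:
        try:
            data.decode("utf-8")
            return "UTF-8"
        except UnicodeDecodeError:
            pass
    # 4. Otherwise fall back to the charset the browser reported.
    return browser_charset


print(guess_charset(b"\xff\xfet\x00e\x00", "ISO-8859-1"))
print(guess_charset("café".encode("iso-8859-1"), "ISO-8859-1"))
```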

michael
--
___________________________________
Michael Slusarz [slusarz at horde.org]