IMAP4r1 Mailbox Names vs. Unicode
=================================
:author: Matthias_Andree_(ed.)_and_Mark_Crispin
:email: matthias.andree@gmx.de
:author initials: MA and MC
:revision: 1.001
:revdate: 2010-05-28
:toc:
:data-uri:
:icons:
:numbered:

''''

.Acknowledgment
****
This article would not have been possible without the
substantial contributions from Mark Crispin.
&mdash; Matthias Andree, editor
****

.Abstract
****
IMAP4rev1 is a widely used Internet Standards Track protocol for remote
email access. Its adoption in international environments raised
questions about the construction and interpretation of mailbox names,
in particular whether IMAP4rev1 contained contradictory information.

This article describes the problem and shows that IMAP4rev1 is
consistent with respect to mailbox names. We document how the evolution
of Unicode character sets and transformation formats made the
IMAP4rev1 standard difficult to interpret, and how it should be
interpreted.

Finally, we show that UTF-7, which is used in IMAP4rev1 to encode
mailbox names, does not impose artificial restrictions on the Unicode
character set.
****

== IMAP Mailbox Names in RFC-3501

In May 2010, some confusion arose on the getmail mailing list around a
Debian bug report (http://bugs.debian.org/513116[Debian Bug#513116])
complaining that getmail4 would not allow non-ASCII characters in an IMAP
folder name, and around how http://tools.ietf.org/html/rfc3501[RFC-3501]
supports international mailbox names. At first glance, IMAP4rev1 seemed
to be limited to the Basic Multilingual Plane of Unicode.

=== Problem statement

Notably, RFC-3501 mandates that mailbox names are 7-bit, yet clients are
supposed to accept 8-bit data and interpret it as UTF-8.  This seems
contradictory or extraneous, because 7-bit ASCII data need not be encoded.

Let us look at the IMAP4rev1 standard:

[quote, Mark Crispin, RFC3501]
____
5.1.    Mailbox Naming

Mailbox names are 7-bit.  Client implementations MUST NOT attempt to
create 8-bit mailbox names, and SHOULD interpret any 8-bit mailbox names
returned by LIST or LSUB as UTF-8.  Server implementations SHOULD
prohibit the creation of 8-bit mailbox names, and SHOULD NOT return
8-bit mailbox names in LIST or LSUB.  See section 5.1.3 for more
information on how to represent non-ASCII mailbox names. [...]
____

[quote, Mark Crispin, RFC3501]
____
5.1.3.  Mailbox International Naming Convention

By convention, international mailbox names in IMAP4rev1 are specified
using a modified version of the UTF-7 encoding described in [UTF-7].
Modified UTF-7 may also be usable in servers that implement an earlier
version of this protocol. [...]
____

This appears to be contradictory, because UTF-7 is not UTF-8. However, a UTF-7
mailbox name is not an 8-bit mailbox name, so the clause "interpret any
8-bit mailbox names ... as UTF-8" does not apply to it. Mark writes:

=== Clarification
_by Mark Crispin_

8-bit octets are prohibited in mailbox names.  Clients MUST use 7-bit
names, and servers MUST reject CREATE commands that contain 8-bit
octets.

However, clients MUST also interpret any 8-bit names in a list of
mailbox names (from LIST or LSUB) as UTF-8.

To understand the history here, we must go back to the 1990s, when
people (in spite of being told not to do so) were writing IMAP2 clients
and servers which used ISO-8859-1 and Shift-JIS mailbox names.  At that
time, it was by no means certain that UTF-8 would become the standard
Internet character set; I played an important role in making that
happen, but that was still a few years in the future.

The adoption of UTF-8 offered a chance to exterminate non-UTF-8 8-bit
mailbox names, and in 1996 the current rules were adopted.  The
transition to IMAP4 (which required substantial changes to any IMAP2
servers) provided an opportunity to exterminate these non-interoperable
names once and for all.

The modified UTF-7 was a temporary expedient to allow non-ASCII mailbox
names while remaining within the 7-bit framework.  Had punycode existed at
the time, it would have been a much better choice than UTF-7.  But
punycode did not exist until several years later, when it arrived with
IDN.  In fact, punycode was created because people learned the problems
of UTF-7 from IMAP.

The intent was always to move to a UTF-8 only environment and leave
behind UTF-7.  When that happens, clients will start encountering UTF-8
names.  It is therefore necessary to tell clients that, even though they
are not permitted to send them, they need to be written to handle them
so they work properly when the restriction is relaxed in the future.

=== Recommendations
_by Mark Crispin_

*Options for server implementors*

From the perspective of a server implementor, you have two choices
for how to handle MUTF-7:
footnote:[editor's note: Modified UTF-7 as specified by the ensemble of RFC-2152 and RFC-3501]

[horizontal]
[S1]:: Ignore it; just forbid 8-bit octets in the CREATE command.
[S2]:: Convert mailbox names in commands from MUTF-7 to UTF-8.  When doing a
LIST or LSUB, convert mailbox names from UTF-8 to MUTF-7 before sending
them to the client.

Servers of type [S1] were far more common in the 1990s.  [S2] is more
common today.  However, a client neither knows, nor cares, which type of
server it is talking to, because the rules make both types interoperate
the same way.

*Options for client implementors*

[horizontal]
[C1]:: Ignore it; you're an ASCII client.
[C2]:: Convert mailbox names from UTF-8 to MUTF-7 when sending a command.
When receiving a listing of mailboxes, convert MUTF-7 to UTF-8.

This all works, and works well.  The routines to do the conversions are
quite straightforward.  The only thing that you can't do well is to mix
wildcards with strings in non-ASCII names; and that is primarily a
curiosity, since no clients do that with ASCII names.

== Unicode, UCS-2, UTF-16, and UTF-7

.Incomplete specification:
WARNING: This section and its subsections are not normative references,
         and are insufficient to implement UCS-2, UTF-16 or UTF-7 based
         software.

=== UCS-2 and UTF-16
_by Mark Crispin_

RFC-3501 uses http://tools.ietf.org/html/rfc2152[RFC-2152] by reference.
Some of the confusion on the getmail list arose from the fact that
RFC-2152 talks about UCS-2 representation, which is limited to the Basic
Multilingual Plane (BMP) range U+0000 to U+FFFF.

However, RFC-2152 also (page 5) refers to the handling of surrogate
pairs, which are defined in UTF-16 but not UCS-2.

The correct interpretation is that the wording in RFC-2152 was written
at a time when "UCS-2" was interpreted as a synonym for "16-bit value"
as opposed to "BMP-only codepoints".  This happens frequently in older
standards.  Since UTF-7 is deprecated, nobody has done the work to
update RFC-2152 to clarify this point.

Using surrogate pairs extends the capability of 16-bit words beyond the
BMP range.

The 0x0000 to 0xFFFF range includes the so-called surrogates, two
ranges (0xD800 to 0xDBFF and 0xDC00 to 0xDFFF) of 1024 values (2^10^)
each. These ranges are technically removed from the BMP (thus there is
no such thing as U+D800); hence the BMP contains only 63,488
(65,536 minus 2 x 1,024) possible codepoints.

Both the UTF-7 and UTF-16 transformations leverage these ranges to map
Unicode code points in the range from U+010000 to U+10FFFF (which is the
highest Unicode code point) to a pair of 16-bit values in the
surrogate ranges.

This happens by first subtracting 0x10000, which maps the input into the
range 0x0 to 0xFFFFF, representable in 20 bits. The most significant
10-bit portion is mapped into the range 0xD800…0xDBFF, the least
significant 10-bit portion into the range 0xDC00…0xDFFF, and these two
16-bit values are used in this order.  UTF-7 does a further step of
encoding in modified BASE64.

Thus, UTF-7 and UTF-16 both deal with ``16-bit values'' and use the same
surrogate pair mechanism to access non-BMP codepoints.  Although not
strictly accurate (the two are technically independent encodings of
Unicode), it may be helpful to think of UTF-7 as a further encoding of
UTF-16.
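The subtract-and-split step described above can be sketched in a few lines of Python. This is an illustration only, not a normative implementation; the function name `to_surrogate_pair` is our own and does not come from any RFC.

```python
def to_surrogate_pair(code_point: int) -> tuple:
    """Split a non-BMP code point into its (high, low) 16-bit surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("only U+10000..U+10FFFF are mapped to surrogate pairs")
    value = code_point - 0x10000       # now in 0x0..0xFFFFF, i.e. 20 bits
    high = 0xD800 | (value >> 10)      # most significant 10 bits
    low = 0xDC00 | (value & 0x3FF)     # least significant 10 bits
    return high, low


high, low = to_surrogate_pair(0x1F600)
# Cross-check against the standard library's UTF-16 encoder (big-endian, no BOM).
as_bytes = bytes([high >> 8, high & 0xFF, low >> 8, low & 0xFF])
assert as_bytes == chr(0x1F600).encode("utf-16-be")
```

The lowest and highest non-BMP code points, U+10000 and U+10FFFF, land on the first and last values of the two surrogate ranges respectively.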

=== UTF-7

UTF-7 is a 7-bit representation of Unicode that makes use of character set
shifting. A character that is directly representable represents itself. Other
characters are subjected to a modified BASE64 encoding (one that omits the
padding "=" characters at the end of a group). Each encoded run is preceded by
a "+" character and terminated either by a "-" character, which is discarded,
or by any other character not in the modified BASE64 set, which remains in the
stream.

As a special case, the sequence "\+-" is a shorthand to represent
the "+" character itself.

The modified BASE64 character set uses the characters A-Z, a-z, digits 0-9,
and the characters "+" and "/", omitting "=" to avoid collisions with
RFC-2047 encoding.
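Python's standard library ships a `utf-7` codec, so the shifting rules above can be observed directly. The sketch below only decodes, because an encoder has some freedom in choosing which characters it shifts. The variable names are ours; the first sample is the "Hi Mom" example from RFC-2152, the others are made up for illustration.

```python
# "+Jjo" is the modified BASE64 form of the 16-bit value 0x263A; the "-" that
# follows it ends the shifted run and is discarded.
hi_mom = b"Hi Mom -+Jjo--!".decode("utf-7")

# A terminator other than "-" also ends the run, but stays in the stream.
kept = b"+Jjo.".decode("utf-7")

# "+-" is the shorthand for a literal "+".
plus = b"1 +- 1".decode("utf-7")

# A surrogate pair (0xD83D, 0xDE00) decodes to one non-BMP code point.
non_bmp = b"+2D3eAA-".decode("utf-7")

assert hi_mom == "Hi Mom -\u263a-!"
assert len(non_bmp) == 1 and ord(non_bmp) == 0x1F600
```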

=== Modified UTF-7

This works similarly to UTF-7, but mandates that printable ASCII characters
0x20...0x7E except 0x26 (the ampersand "&") represent themselves, and uses yet
another BASE64 alphabet consisting of the upper- and lowercase letters, the
digits, and the characters "+" and ",", with some further rules specified in
RFC-3501. The leading shift character is replaced by the ampersand "&",
the trailing one remains "-", and the "&" itself is encoded as "&-".
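As a minimal sketch of the rules just listed, the following Python functions convert between text and modified UTF-7. The names `mutf7_encode` and `mutf7_decode` are hypothetical (Python has no such codec built in), the decoder assumes well-formed 7-bit input, and neither function checks the further rules of RFC-3501, so treat this as an illustration rather than a reference implementation.

```python
import base64


def mutf7_encode(name: str) -> bytes:
    """Encode a mailbox name as modified UTF-7 (sketch, no input validation)."""
    out = bytearray()
    pending = []  # characters waiting to be emitted as one shifted run

    def flush() -> None:
        if pending:
            # UTF-16 supplies the surrogate pairs for non-BMP characters.
            utf16 = "".join(pending).encode("utf-16-be")
            b64 = base64.b64encode(utf16).rstrip(b"=").replace(b"/", b",")
            out.extend(b"&" + b64 + b"-")
            pending.clear()

    for ch in name:
        if ch == "&":
            flush()
            out.extend(b"&-")          # "&" itself is written as "&-"
        elif 0x20 <= ord(ch) <= 0x7E:
            flush()
            out.append(ord(ch))        # printable ASCII represents itself
        else:
            pending.append(ch)
    flush()
    return bytes(out)


def mutf7_decode(data: bytes) -> str:
    """Decode a 7-bit modified UTF-7 name (assumes well-formed input)."""
    out = []
    i = 0
    while i < len(data):
        if data[i] != ord("&"):
            out.append(chr(data[i]))
            i += 1
            continue
        end = data.index(b"-", i)          # every shifted run ends with "-"
        run = data[i + 1:end]
        if not run:
            out.append("&")
        else:
            b64 = run.replace(b",", b"/")
            b64 += b"=" * (-len(b64) % 4)  # restore the omitted padding
            out.append(base64.b64decode(b64).decode("utf-16-be"))
        i = end + 1
    return "".join(out)


encoded = mutf7_encode("~peter/mail/台北/日本語")
assert mutf7_decode(encoded) == "~peter/mail/台北/日本語"
```

The sample name is the mixed English, Chinese and Japanese mailbox name that RFC-3501 gives in section 5.1.3 in its encoded form, `~peter/mail/&U,BTFw-/&ZeVnLIqe-`. A character outside the BMP, such as U+1F600, passes through the same code path and comes out as a single shifted run carrying a surrogate pair.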

== Conclusions

IMAP clients that want to support international mailbox names should send
modified UTF-7, and be prepared to handle modified UTF-7 (if no 8-bit data is
found) and UTF-8 (if 8-bit data is found).
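The receiving half of this rule could look roughly like the sketch below. The function name `interpret_listed_name` and its `decode_mutf7` parameter are assumptions made for illustration; the caller is expected to supply a real modified UTF-7 decoder, and the `ascii_only` stand-in used here deliberately does not implement one.

```python
from typing import Callable


def interpret_listed_name(raw: bytes, decode_mutf7: Callable[[bytes], str]) -> str:
    """Turn a mailbox name from a LIST or LSUB response into text."""
    if any(octet > 0x7F for octet in raw):
        # 8-bit data found: interpret the name as UTF-8.
        return raw.decode("utf-8")
    # Pure 7-bit data: hand it to the modified UTF-7 decoder.
    return decode_mutf7(raw)


def ascii_only(raw: bytes) -> str:
    """Stand-in for the demonstration; NOT a modified UTF-7 decoder."""
    return raw.decode("ascii")


seven_bit = interpret_listed_name(b"INBOX", ascii_only)
eight_bit = interpret_listed_name("Entw\u00fcrfe".encode("utf-8"), ascii_only)
```

Passing the decoder in as a parameter keeps the 8-bit check separate from the conversion routine, so the same check works whichever decoder a client uses.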

Modified UTF-7 as specified in RFC-3501 is not limited to the Unicode Basic
Multilingual Plane, but maps the entire Unicode range.