
I'm using a "fun" HTML special character (✰) (see http://html5boilerplate.com/ for more info) in a Server HTTP header and am wondering if it is "allowed" per spec.

  • Using the Network tab in the dev tools in Chrome on Windows XP Pro SP3, I see the ✰ just fine.

  • In IE8 the ✰ is not rendered correctly.

  • The w3.org HTML validator does not render it correctly (displays "â°" instead).

Now, I'm not too keen on character encodings ... and frankly I don't really care too much about them; I just blindly use UTF-8 cus I'm told to. :-)


Is the disparity caused by bugs in the different parsers/browsers/engines/(whatever-they-are-called)?

Is there a spec for this or maybe a list of allowed characters for an HTTP-header "value"?

3 Comments

  • This question would be much better asked generally: "Which characters are allowed in an HTTP header value?" Commented Jan 20, 2012 at 23:22
  • Related: What encoding should I use for HTTP Basic Authentication? Commented Nov 12, 2014 at 7:41
  • "Now, I'm not too keen on character encodings ... and frankly I don't really care too much about them; I just blindly use UTF-8 cus I'm told to. :-)" <----- Obligatory link to joelonsoftware.com/2003/10/08/… Commented May 2, 2018 at 15:36

2 Answers

In short: Only ASCII is guaranteed to work. Some non-ASCII bytes are allowed for backwards compatibility, but are not supposed to be displayable.

HTTPbis gave up and specified that in the headers there is no useful encoding besides ASCII:

Historically, HTTP has allowed field content with text in the ISO-8859-1 charset [ISO-8859-1], supporting other charsets only through use of [RFC2047] encoding. In practice, most HTTP header field values use only a subset of the US-ASCII charset [USASCII]. Newly defined header fields SHOULD limit their field values to US-ASCII octets. A recipient SHOULD treat other octets in field content (obs-text) as opaque data.
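As a rough illustration of what this means for the ✰ in the question, here is a minimal sketch, assuming the server sends the character as raw UTF-8 bytes. The helper names `is_ascii_field_value` and `as_latin1` are made up for this example and do not come from any spec or library.

```python
STAR = "\u2730"  # the ✰ character from the question

def is_ascii_field_value(raw: bytes) -> bool:
    """True if every octet is printable US-ASCII, SP, or HT (hypothetical helper)."""
    return all(b == 0x09 or 0x20 <= b <= 0x7E for b in raw)

def as_latin1(raw: bytes) -> str:
    """Decode octets the way a legacy ISO-8859-1 recipient might (hypothetical helper)."""
    return raw.decode("iso-8859-1")

raw_star = STAR.encode("utf-8")          # b'\xe2\x9c\xb0'
print(is_ascii_field_value(b"nginx"))    # True
print(is_ascii_field_value(raw_star))    # False
print(repr(as_latin1(raw_star)))         # 'â\x9c°'
```

The middle character of the ISO-8859-1 reading is an invisible control character, which would be consistent with the "â°" that the validator in the question displays, though that is only a plausible explanation, not a confirmed one.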


Previously, RFC 2616 from 1999 defined this:

Words of *TEXT MAY contain characters from character sets other than ISO-8859-1 [22] only when encoded according to the rules of RFC 2047 [14].

and RFC 2047 is the MIME encoding, so it'd be:

=?UTF-8?Q?=E2=9C=B0?=

but I don't think that many (if any) clients support it.
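For reference, the encoded-word above can be reproduced with a few lines of Python. This is only a sketch: `q_encode_word` is a hypothetical helper that escapes everything except ASCII letters and digits and ignores the 75-character limit on encoded-words; the round trip uses the standard library's `email.header.decode_header`.

```python
import string
from email.header import decode_header

# Octets that are left unescaped in this sketch: ASCII letters and digits only.
_SAFE = set((string.ascii_letters + string.digits).encode("ascii"))

def q_encode_word(text: str, charset: str = "UTF-8") -> str:
    """Build an RFC 2047 "Q" encoded-word (hypothetical helper, no length limit)."""
    raw = text.encode(charset)
    body = "".join(chr(b) if b in _SAFE else f"={b:02X}" for b in raw)
    return f"=?{charset}?Q?{body}?="

encoded = q_encode_word("\u2730")
print(encoded)  # =?UTF-8?Q?=E2=9C=B0?=

# Round trip: decode_header returns (bytes, charset) pairs for encoded-words.
payload, charset = decode_header(encoded)[0]
print(payload.decode(charset) == "\u2730")  # True
```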


4 Comments

so, what does that mean? Is "✰" valid/allowed?
To expand a bit on a very useful answer: "UTF-8" is the character set, and "Q" means the value will be "quoted-printable". "B" could also be used if you wanted to BASE64-encode the value.
@porneL, So what does "opaque data" mean? What exactly should the HTTP recipient do when it receives these "opaque data"?
@Pacerier "opaque data" means it's a black box with a bunch of bytes that applications shouldn't try to display or interpret (like binary data). What happens with it depends on the header, might range from "nothing" to "discard".

Please read the comments first: this answer likely draws wrong conclusions from the right sources and needs an edit.


You can use any printable ASCII characters, but no special characters like ✰ (which is not ASCII).

Tip: you can encode anything in JSON.
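One caveat: JSON text may itself contain raw non-ASCII characters, so it only fits in a header value if those are escaped as \uXXXX. A minimal sketch in Python, where `json.dumps` escapes non-ASCII by default (`ensure_ascii=True`); the `"server"` key and its value are purely illustrative:

```python
import json

value = {"server": "\u2730 star"}  # illustrative data containing ✰

# ensure_ascii=True is the default, so non-ASCII becomes \uXXXX escapes.
header_value = json.dumps(value)
print(header_value)  # {"server": "\u2730 star"}

# The result is pure ASCII and survives a round trip.
header_value.encode("ascii")  # would raise UnicodeEncodeError otherwise
print(json.loads(header_value) == value)  # True
```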

Edit: it may not be obvious at first, but the character encoding declared in the header applies only to the response body, not to the header itself. (Otherwise it would cause a chicken-and-egg problem.)


I'd like to sum up all the relevant definitions as per the spec linked by Penchant.

message-header = field-name ":" [ field-value ]
field-name     = token
field-value    = *( field-content | LWS )

So, we are after field-value.

LWS            = [CRLF] 1*( SP | HT )
CRLF           = CR LF
CR             = <US-ASCII CR, carriage return (13)>
LF             = <US-ASCII LF, linefeed (10)>
SP             = <US-ASCII SP, space (32)>
HT             = <US-ASCII HT, horizontal-tab (9)>

LWS stands for Linear White Space. Essentially, LWS is Space or Tab, but you can break your field-value into multiple lines by starting a new line before a Space or Tab.
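To make the folding rule concrete, here is a small sketch of how a recipient might unfold such a value by collapsing each LWS run into a single space, which RFC 2616 permits. The `unfold` function is a hypothetical helper, not part of any library, and note that the later HTTPbis specs deprecate line folding.

```python
import re

# LWS = [CRLF] 1*( SP | HT )
_LWS = re.compile(r"(?:\r\n)?[ \t]+")

def unfold(field_value: str) -> str:
    """Replace every LWS run (including folds) with one space (hypothetical helper)."""
    return _LWS.sub(" ", field_value).strip()

folded = "text/html;\r\n\tcharset=utf-8"  # value continued on a second line
print(unfold(folded))  # text/html; charset=utf-8
```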

Let's simplify it to this:

field-value    = <any field-content or Space or Tab>

Now we are after field-content.

field-content  = <the OCTETs making up the field-value
                 and consisting of either *TEXT or combinations
                 of token, separators, and quoted-string>
OCTET          = <any 8-bit sequence of data>
TEXT           = <any OCTET except CTLs,
                 but including LWS>
CTL            = <any US-ASCII control character
                 (octets 0 - 31) and DEL (127)>
token          = 1*<any CHAR except CTLs or separators>
separators     = "(" | ")" | "<" | ">" | "@"
                 | "," | ";" | ":" | "\" | <">
                 | "/" | "[" | "]" | "?" | "="
                 | "{" | "}" | SP | HT

TEXT is the most general rule and includes all the rest, so forget about the rest. Here is the US-ASCII charset (= ASCII)

As you can see, all printable ASCII chars are allowed.
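The TEXT rule can be checked mechanically. The sketch below is a literal reading of the grammar quoted above, with hypothetical helper names; it only tests which octets the grammar admits, not how a recipient will interpret or display them (RFC 2616 treats unencoded TEXT as ISO-8859-1).

```python
def is_ctl(octet: int) -> bool:
    """CTL = octets 0 - 31 and DEL (127)."""
    return octet <= 31 or octet == 127

def is_text_octet(octet: int) -> bool:
    """TEXT = any OCTET except CTLs (LWS folding is not handled here)."""
    return 0 <= octet <= 255 and not is_ctl(octet)

printable_ascii = bytes(range(32, 127))
print(all(is_text_octet(b) for b in printable_ascii))  # True

star = "\u2730".encode("utf-8")  # the ✰ from the question as UTF-8 octets
print(list(star))                                      # [226, 156, 176]
print(all(is_text_octet(b) for b in star))             # True
```

So the octets of ✰ pass the grammar as well; whether they are shown as ✰ depends entirely on how the recipient decodes them.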

5 Comments

You are contradicting the passages you quoted. Why do you say "and no special chars like ✰"? Special characters are just OCTETs, and since TEXT is any OCTET except the CTLs (0 - 31 and 127), all other OCTETs from 32 to 255 are allowed. The octets of ✰ are 226, 156, and 176, and all three of them are allowed; therefore ✰ is allowed according to the passages you quoted.
@Pacerier you seem completely right, I don't see why I drew the conclusion I did.
@Pacerier yet I'm not ready to edit it, as I need to double-check the spec again. I'm afraid additional details restrict values to the US-ASCII charset, which would support the conclusion but leave the reasoning insufficient.
Saying "you can encode anything in JSON" is a bit misleading. JSON allows Unicode characters, whereas HTTP headers should be US-ASCII. Unicode characters would be treated as "opaque" data, and thus the behavior is undefined by the HTTP specification. That being said, JSON can be made safe for inclusion in an HTTP header by escaping the Unicode characters via the \uXXXX escape sequence.
@zupa, Another issue is... what does "except CTLs" mean? Does it mean the characters CR, LF are allowed? Or does it mean only the continuous sequence "CR LF SP/HT" is allowed? (In other words, can header values contain a single CR or LF or HT? Can header values contain the characters CR, LF, and HT in any order and amount?)
