The following will convert Unicode text to HTML numeric character references, using both
single codepoint values and surrogate pairs.
You may copy and paste text from any source, even UTF-8 or UTF-16 encoded, and even if
you do not have the actual font.
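As a rough illustration of the kind of conversion described above, here is a minimal sketch. The function name `toNumericReferences` and its `mode` argument are hypothetical, chosen for this example only; the sketch leaves ASCII untouched and writes every other character as a hexadecimal reference.

```javascript
// Hypothetical helper, not the script used on this page.
// mode "codepoint": one reference per Unicode codepoint.
// mode "surrogate": one reference per UTF-16 code unit, so a supplementary
// plane character becomes two references (a surrogate pair).
function toNumericReferences(text, mode) {
  var out = "";
  for (var i = 0; i < text.length; i++) {
    var unit = text.charCodeAt(i);
    var next = i + 1 < text.length ? text.charCodeAt(i + 1) : 0;
    var isPair = unit >= 0xD800 && unit <= 0xDBFF &&
                 next >= 0xDC00 && next <= 0xDFFF;
    if (unit < 0x80) {
      out += text.charAt(i); // plain ASCII is copied through unchanged
    } else if (isPair && mode === "codepoint") {
      // Combine the pair into a single scalar value.
      var scalar = (unit - 0xD800) * 0x400 + (next - 0xDC00) + 0x10000;
      out += "&#x" + scalar.toString(16).toUpperCase() + ";";
      i++; // the low surrogate has been consumed
    } else {
      out += "&#x" + unit.toString(16).toUpperCase() + ";";
    }
  }
  return out;
}

var asCodepoint = toNumericReferences("a\uD800\uDF81", "codepoint");
var asSurrogates = toNumericReferences("a\uD800\uDF81", "surrogate");
```

The surrogate form matches what this page tests in the browsers named below; the current HTML standard treats references to surrogate codepoints as errors, so the single codepoint form is the safer one to publish.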
This page also gives the algorithm for converting to and from Unicode surrogate pairs, and some tests of displaying supplementary plane characters.
For more explanation of surrogate pairs and supplementary plane characters, see
–Tex Texin's site (I18n Guy)
–David Perry's site
–James Kass's site, home of the Code2001 font
–Alphabetum, another useful font. This and Code2001 are used for
supplementary character display on this page.
The algorithm for converting to and from surrogate pairs is not widely published
on the internet. The official source is The Unicode Standard 3.0
(not later versions), Section 3.7, Surrogates.
Conversion of a Unicode scalar value S to a surrogate pair <H, L>:
H = (S - 10000₁₆) / 400₁₆ + D800₁₆
L = (S - 10000₁₆) % 400₁₆ + DC00₁₆
where the operator "/" is defined in Section 0.2, Notational Conventions,
as "integer division (rounded down),"
and "%" as "modulo operation; equivalent to the integer remainder for positive numbers."
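The conversion to a surrogate pair can be sketched in JavaScript as follows. The function name `toSurrogatePair` is an assumption made for this example; `Math.floor` stands in for the "integer division (rounded down)" of the Standard.

```javascript
// Hypothetical helper: split a supplementary plane scalar value S
// (0x10000 to 0x10FFFF) into its high and low surrogates.
function toSurrogatePair(scalar) {
  if (scalar < 0x10000 || scalar > 0x10FFFF) {
    throw new RangeError("scalar value is outside the supplementary planes");
  }
  var offset = scalar - 0x10000;
  var high = Math.floor(offset / 0x400) + 0xD800; // H
  var low = (offset % 0x400) + 0xDC00;            // L
  return [high, low];
}

var pair = toSurrogatePair(0x10381); // Ugaritic letter beta
```

For 0x10381 the offset is 0x381, which is smaller than 0x400, so the high surrogate is 0xD800 itself and the low surrogate is 0xDC00 + 0x381 = 0xDF81.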
The conversion of a surrogate pair <H, L> to a scalar value:
N = (H - D800₁₆) * 400₁₆ + (L - DC00₁₆) + 10000₁₆
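The reverse conversion can be sketched the same way. The function name `fromSurrogatePair` and the range checks are assumptions added for this example; they are not part of the formula itself.

```javascript
// Hypothetical helper: combine a high surrogate H (0xD800 to 0xDBFF) and
// a low surrogate L (0xDC00 to 0xDFFF) into the scalar value N.
function fromSurrogatePair(high, low) {
  if (high < 0xD800 || high > 0xDBFF) {
    throw new RangeError("not a high surrogate");
  }
  if (low < 0xDC00 || low > 0xDFFF) {
    throw new RangeError("not a low surrogate");
  }
  return (high - 0xD800) * 0x400 + (low - 0xDC00) + 0x10000;
}

var scalarValue = fromSurrogatePair(0xD800, 0xDF81);
var scalarHex = scalarValue.toString(16).toUpperCase();
```

Here `scalarHex` comes out as "10381", the codepoint of the test character used further down this page.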
Here is some miscellaneous test information.
Using Windows XP, I am
only able to display plane 1 characters in Internet Explorer 6 when:
1. they are written as numeric character references,
2. the character encoding for the page is set to "User Defined," and
3. the page is saved as an ANSI document, not any type of Unicode.
I am sometimes able to change the encoding of online documents displayed in the browser;
but for documents that reside on my computer, if they are saved as anything other
than ANSI, I am unable to view plane 1 characters in IE6 at all.
I am able to view UTF-8 and UTF-16 encoded characters on Mozilla Firefox 1.5,
Netscape 8, Opera 9, and K-Meleon 1.02 after making the following registry
change (see the MSDN library and
Tex Texin's site for details):
The following change is also recommended (use the font of your choice),
but seems to have no effect on my system. The keys in this group store the settings
made via Tools-Internet Options-Fonts. Key number 40 corresponds to
"User Defined"; adding key number 42 makes no change that I can identify.
Using Windows XP Professional, Internet Explorer 7 displays all the character encodings above without any registry changes.
When creating or saving documents with Notepad, four encoding choices are available, but the
nomenclature is not specific; in Windows applications, "Unicode" means "UTF-16":
• ANSI (plain text)
• Unicode (= UTF-16 little endian, the native Windows XP encoding)
• Unicode big endian (= UTF-16 big endian)
• UTF-8
This page was saved as an ANSI document, and the charset declaration is
<meta http-equiv="Content-Type" content="text/html; charset=x-user-defined">
You should be able to view the following if you have ALPHABETUM Unicode or Code2001 on your computer. If the character encoding is not already "User Defined," you can try changing it under the "View" menu. You can set a default "User Defined" font
via Tools-Internet Options-Fonts, or Tools-Options-Content or General-Fonts & Colors-Advanced.
The following test character is the one at codepoint U+10381, Ugaritic letter beta. It displays properly when the character is written with numeric character references,
either as the original codepoint OR as a surrogate pair, using either hexadecimal
or decimal values. Scripted as a surrogate pair, it can be written in either notation:
String.fromCharCode(0xD800) + String.fromCharCode(0xDF81);
String.fromCharCode(55296) + String.fromCharCode(57217);
It does not display when scripted using the original codepoint:
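A likely explanation, offered here as a sketch and not as the test originally shown at this point: `String.fromCharCode` converts each argument to a 16-bit code unit, so a value above 0xFFFF loses its upper bits. `String.fromCodePoint` exists only in ECMAScript 2015 and later, so the example checks for it before using it.

```javascript
// 0x10381 is reduced to its low 16 bits, 0x0381, an unrelated codepoint.
var truncated = String.fromCharCode(0x10381);
var truncatedHex = truncated.charCodeAt(0).toString(16).toUpperCase();

// Passing the two surrogates as separate code units builds the
// intended character.
var fromPair = String.fromCharCode(0xD800, 0xDF81);

// Newer engines accept the codepoint directly; otherwise fall back
// to the surrogate pair.
var fromCodePoint = typeof String.fromCodePoint === "function"
  ? String.fromCodePoint(0x10381)
  : fromPair;
```

After this runs, `truncated` holds a single code unit and `truncatedHex` is "381", while `fromPair` and `fromCodePoint` hold the same two code units.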
when the HTML is written as a single character.
The following form field contains the single HTML reference 𐎁
document.theForm.theField.value.charCodeAt(0) + " + " +
document.theForm.theField.value.charCodeAt(0).toString(16).toUpperCase() + " + " +
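To show what `charCodeAt` reports for a supplementary plane character, here is a minimal sketch that uses a string literal in place of the form field (an assumption, since the form is not reproduced here). The function name `describeCodeUnits` and its output format are hypothetical.

```javascript
// Hypothetical helper: list every UTF-16 code unit of a string in
// decimal and hexadecimal, joined with " + ".
function describeCodeUnits(value) {
  var parts = [];
  for (var i = 0; i < value.length; i++) {
    var unit = value.charCodeAt(i);
    parts.push(unit + " (0x" + unit.toString(16).toUpperCase() + ")");
  }
  return parts.join(" + ");
}

// Ugaritic letter beta, U+10381, written as its surrogate pair.
var report = describeCodeUnits("\uD800\uDF81");
```

A JavaScript string stores the character as two code units, so `report` is "55296 (0xD800) + 57217 (0xDF81)", the same values used in the scripted surrogate pair above.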