HTML character references

From Just Solve the File Format Problem
File Format
Name HTML character references
Wikidata ID Q15932945

HTML character references are sometimes referred to as "entities", but pedants will tell you that this term should apply only to the named character references, not the numeric ones. This distinction is more significant in the parent format, SGML, which has a more elaborate entity system in which named entities can be used as macros. In HTML, both named and numeric references are simply ways of representing literal characters. They are useful when a character would otherwise be treated as markup syntax (e.g., the less-than and greater-than signs), when it is not supported by the document's character encoding, or when it is not easy to see or type in the text editor you are using.

In addition to HTML and SGML, XML (and the multiplicity of formats derived from it) uses such references. The exact list of supported character references can vary between specific versions of formats.

HTML character references start with an ampersand (&) and normally end with a semicolon (;). In some versions of these formats the semicolon may be omitted if the reference is followed by a character that unambiguously is not part of a reference. The exact syntax rules vary by version, and some implementations (e.g., browser versions) do not follow the standards exactly on what is and isn't acceptable, so it's always safest to include the semicolon.
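As a rough illustration of how the optional semicolon plays out in one implementation, the sketch below uses html.unescape from the Python standard library, which decodes references according to the HTML5 rules. Other parsers and older format versions may behave differently, and the sample strings are made up for the example.

```python
import html

# With the terminating semicolon, decoding is unambiguous.
with_semicolon = html.unescape("AT&amp;T")

# The HTML5 rules tolerate a missing semicolon for a fixed set of
# legacy names such as "amp" and "lt".
without_semicolon = html.unescape("AT&amp T")

# A name outside that legacy set is left untouched when the
# semicolon is missing.
not_legacy = html.unescape("&hearts")

print(repr(with_semicolon))
print(repr(without_semicolon))
print(repr(not_legacy))
```

The last case shows why including the semicolon is the safe choice: whether a bare name is decoded depends on which list the parser happens to use.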

Numeric character references

A numeric reference follows the pattern &#nnnn;, where nnnn is a decimal number, or &#xhhhh;, where hhhh is a hexadecimal number. In both cases, the number refers to a Unicode code point. (This wasn't always true in older versions and implementations. In the early days of HTML, Unicode was just barely getting started and all sorts of other character encodings were in use, so HTML numeric references were sometimes construed as being in one of those, such as Windows-1252. Also, hexadecimal references didn't work in really old browsers, so for a long time decimal references were safer.)
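The two numeric patterns can be sketched in a few lines of Python. The helper name numeric_reference is hypothetical and exists only for this example; html.unescape is the standard-library decoder, which follows the HTML5 rules. Those rules also preserve a trace of the old Windows-1252 interpretation: a reference such as &#150; is decoded as the en dash rather than as the control character at that code point.

```python
import html

def numeric_reference(char: str, hexadecimal: bool = False) -> str:
    """Build a numeric character reference for a single character.

    Hypothetical helper for illustration; expects exactly one code point.
    """
    code_point = ord(char)
    if hexadecimal:
        return f"&#x{code_point:X};"
    return f"&#{code_point};"

# U+00E9 is LATIN SMALL LETTER E WITH ACUTE (decimal 233, hex E9).
decimal_form = numeric_reference("\u00e9")
hex_form = numeric_reference("\u00e9", hexadecimal=True)

# Both forms decode to the same character.
decoded = html.unescape(decimal_form + hex_form)

# 150 (0x96) is the en dash in Windows-1252; HTML5 decoders map it
# to U+2013 for compatibility with old pages.
legacy = html.unescape("&#150;")

print(decimal_form, hex_form, decoded, legacy)
```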

Named character references or entities

Named references are in the form &name;, with names taken from a list which varies by format and version. Commonly used ones include &lt; for the less-than sign (<), &gt; for the greater-than sign (>), &quot; for a double-quote ("), &apos; for an apostrophe/single-quote ('), and &amp; for an ampersand (&). Of these, &apos; is the least portable: it is defined in XML and HTML5 but was not part of HTML 4. Many other named references exist, but if the file is being developed in a Unicode-capable editor and served as UTF-8 or another Unicode encoding, you should be able to include anything else as literal characters with no escape sequence needed.
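A minimal sketch of escaping and decoding with named references, using only the Python standard library. The sample markup string is invented for the example. Note that html.escape writes the apostrophe as a numeric reference rather than as &apos;, and that html.entities.name2codepoint, which holds the HTML 4 names, has no entry for apos.

```python
import html
from html.entities import name2codepoint

original = '<a href="x">Tom & Jerry\'s</a>'

# Escape the characters that have syntactic meaning in HTML.
escaped = html.escape(original)

# Decoding restores the original text.
round_trip = html.unescape(escaped)

# Look up the code point behind a name from the HTML 4 list.
lt_code_point = name2codepoint["lt"]

# "apos" is absent from the HTML 4 list.
apos_in_html4 = "apos" in name2codepoint

print(escaped)
print(round_trip == original, lt_code_point, apos_in_html4)
```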

