URL

From Just Solve the File Format Problem
Revision as of 12:31, 3 December 2013 by AndyJackson (Talk | contribs)

File Format
Name URL
Ontology
Released 1990

A URL (Uniform Resource Locator) is an address of a resource as used on the World Wide Web. Technically speaking, a URL is just one category of such addresses, a subset of URI (Uniform Resource Identifier) and parallel to URN (Uniform Resource Name), but such distinctions aren't always consistently maintained even by technical people, and URL has entered the popular language in a way those other terms have not.

Over time, the precise definitions of the various terms for Web-related addresses have changed and been argued over extensively, and new terms have been added: an IRI (Internationalized Resource Identifier) is like a URI, but extended to allow non-ASCII characters so that languages other than English can be supported. However, the newest HTML 5 standards drafts take the more pragmatic approach of using "URL" to refer to anything that a browser is expected to resolve as an address, one of many "willful violations" of earlier technical specifications found in those drafts. (The "techie" equivalent of social conservatives may consider this to be "defining deviancy down" and hence an "abomination".)

Use of URLs (and URIs, etc.) is not limited to the Web, as there are a number of other technical usages such as in defining namespaces for file formats (e.g., XML), and in identifying even non-Web-accessible objects for the purpose of expressing taxonomic relations. In less-technical usage, URLs turn up in all sorts of places like TV commercials, billboards, and on the side of vans, but usually with the protocol portion left off because everybody assumes HTTP. These days most browsers don't even show the "http://" part in the address bar, though it's still officially part of the URL.

Types of identifiers

  • URI: The official "parent term" for URLs, URNs, and other such identifiers, but limited to ASCII characters, with anything else needing to be specially encoded. Even within the ASCII range, some characters such as the space are prohibited, reserved, or designated to be used only for specific syntactic purposes, with encoding necessary for all other uses.
  • IRI: The internationalized version of URIs, with more liberal rules about what characters in the entire Unicode range may be included. This allows text in non-English languages to be included without messy encoding, though various transfer protocols may still require the entire string to be encoded on transmission to produce an ASCII-based URI.
  • URL: Technically only the subset of URIs that are "locators", able to be used to retrieve resources because they designate a specific address for them, but in practice the distinction is very fuzzy and usually ignored. Some newer standards such as HTML 5.0 simply follow common non-techie usage and use URL to refer to the whole universe of Web-style addresses (encompassing URIs and IRIs, and anything else a browser can accept as an address even if it fails to comply with any of the standards).
  • URN: Uniform Resource Name. Another type of URI, intended to provide a stable, permanent identifier for a resource without including a specific (and changeable) address for it. Resolving a URN requires a resolver, such as a server or website that stores a table of the current locations of items with URNs. The current standards call for every URN to begin with the 'urn:' scheme identifier, followed by a URN namespace, another colon, and then the namespace-specific information. Some common naming schemes have been adopted as URN namespaces, such as ISBNs (International Standard Book Numbers), which take the form "urn:isbn:1-234567-890". Unfortunately, browsers haven't been quick to implement URN resolvers as standard features, though add-ons can be installed to do it.
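
The encoding rules and the URN layout described in the list above can be illustrated with a short Python sketch. It uses only the standard library's urllib.parse module. The sample strings and the split_urn helper are hypothetical, chosen for illustration, and the sketch covers only the path portion of an address (internationalized host names are handled by a separate mechanism not shown here).

```python
from urllib.parse import quote, unquote

# Characters that may not appear literally in a URI are percent-encoded.
# A space, for example, becomes %20.
encoded_name = quote("my file.txt")

# Non-ASCII text from an IRI is converted to UTF-8 bytes, and each byte is
# then percent-encoded, giving an ASCII-only URI path.
iri_path = "/wiki/Zürich"
uri_path = quote(iri_path)  # "/" is left unencoded by default

# Decoding reverses the process.
round_trip = unquote(uri_path)


def split_urn(urn):
    """Split 'urn:<namespace>:<namespace-specific part>' into its two parts.

    Hypothetical helper for illustration; it does no namespace validation.
    """
    scheme, namespace, specific = urn.split(":", 2)
    if scheme.lower() != "urn":
        raise ValueError("not a URN: " + urn)
    return namespace, specific


print(encoded_name)
print(uri_path)
print(split_urn("urn:isbn:1-234567-890"))
```

Note that splitting a URN this way only takes the string apart; finding the resource it names still requires a resolver.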

Standard syntax

Absolute URLs (and URIs, etc.) always start with a scheme (protocol), which ends with a colon (:). The most common scheme is HTTP. There are also relative URLs, which leave off parts at the beginning because they are interpreted relative to the URL of the document they appear in.
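
As a rough illustration of how relative URLs are resolved against a base, the following sketch uses urljoin from Python's standard library. The base address and the relative references are made-up examples on the reserved example.com domain.

```python
from urllib.parse import urljoin

# Hypothetical URL of the page on which the relative references appear.
base = "http://example.com/docs/formats/url.html"

# A bare file name replaces the last segment of the base path.
sibling = urljoin(base, "uri.html")

# ".." climbs one level up the hierarchy.
parent = urljoin(base, "../index.html")

# A leading slash keeps the scheme and host but replaces the whole path.
rooted = urljoin(base, "/about")

# A reference that has its own scheme is already absolute and is returned as-is.
absolute = urljoin(base, "mailto:someone@example.com")

for result in (sibling, parent, rooted, absolute):
    print(result)
```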

After the scheme, the rest of the URL is protocol-dependent, and different types of URLs use a number of different syntaxes. One common syntax, which the standards expect all schemes with hierarchical path structures to use, follows the scheme part with a double slash (//) introducing a host or authority portion (usually a domain name). This is followed by another slash and then the full path being addressed, with forward slashes separating the hierarchical levels (which may, but needn't, correspond to subdirectories in a filesystem).
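
The hierarchical syntax can be taken apart with urlsplit from Python's standard library. This is a minimal sketch, and the address used is a made-up example.

```python
from urllib.parse import urlsplit

# Hypothetical address using the common scheme://authority/path layout.
parts = urlsplit("http://www.example.com/formats/web/url.html")

scheme = parts.scheme      # the part before the colon
authority = parts.netloc   # the part introduced by the double slash
path = parts.path          # everything from the next slash onward

# Forward slashes separate the hierarchical levels of the path.
levels = [segment for segment in path.split("/") if segment]

print(scheme, authority, path, levels)
```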

There's a common misconception that URLs always have a double slash after the colon, which sometimes leads developers of new schemes to put one in their syntax where the standards don't call for it. The double slash is only supposed to be used if the following element is some sort of "authority" by which the subsequent path is to be interpreted. A number of schemes have no such authority, and hence no double slash; for instance, "mailto:".
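
To make the distinction concrete, here is a small sketch that reports the authority portion of an address only when a double slash actually follows the scheme. The authority_of helper is hypothetical, written for this illustration, and the sample addresses are made up.

```python
def authority_of(url):
    """Return the authority portion, or None if no '//' follows the scheme.

    Hypothetical helper for illustration; it does not validate the scheme.
    """
    scheme, colon, rest = url.partition(":")
    if not colon or not rest.startswith("//"):
        return None
    authority = rest[2:]
    # The authority ends at the next "/", "?" or "#", whichever comes first.
    for delimiter in "/?#":
        authority = authority.split(delimiter, 1)[0]
    return authority


print(authority_of("http://www.example.com/a/b"))   # has an authority
print(authority_of("mailto:someone@example.com"))   # no double slash
print(authority_of("urn:isbn:1-234567-890"))        # no double slash
```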

data: URLs

One scheme, data:, is actually a file format in its own right, since it encodes the entire contents of a file within the URL instead of referencing an external resource as other schemes do.
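
A minimal sketch of how a file's contents can be packed into, and recovered from, a data: URL follows. It assumes the common layout of a media type, an optional ";base64" marker, a comma, and then the payload. The make_data_url and read_data_url helpers are hypothetical names used for illustration, and media type parameters such as a character set are not interpreted.

```python
import base64
from urllib.parse import unquote_to_bytes


def make_data_url(payload, media_type="text/plain"):
    """Pack raw bytes into a base64-encoded data: URL."""
    encoded = base64.b64encode(payload).decode("ascii")
    return "data:" + media_type + ";base64," + encoded


def read_data_url(url):
    """Return (media type, payload bytes) from a data: URL."""
    if not url.startswith("data:"):
        raise ValueError("not a data: URL")
    header, comma, body = url[len("data:"):].partition(",")
    if not comma:
        raise ValueError("missing comma before the payload")
    if header.endswith(";base64"):
        return header[:-len(";base64")], base64.b64decode(body)
    # Without the base64 marker, the payload is percent-encoded text.
    return header, unquote_to_bytes(body)


url = make_data_url(b"Hello, world!")
print(url)
print(read_data_url(url))
```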

