
URL normalization (or URL canonicalization) is the process by which URLs are modified and standardized in a consistent manner. The goal of the normalization process is to transform a URL into a normalized or canonical URL so it is possible to determine if two syntactically different URLs are equivalent.

Search engines employ URL normalization in order to assign importance to web pages and to reduce indexing of duplicate pages. Web crawlers perform URL normalization in order to avoid crawling the same resource more than once. Web browsers may perform normalization to determine if a link has been visited or to determine if a page has been cached.

Normalization process


There are several types of normalization that may be performed:

  • Converting the scheme and host to lowercase – The scheme and host components of a URL are case-insensitive, so most normalizers convert them to lowercase. Example:
HTTP://www.FooBar.com/ → http://www.foobar.com/
  • Converting the entire URL to lowercase – Some web servers that run on case-insensitive file systems treat URLs as case-insensitive. URLs known to come from such a server may therefore be converted entirely to lowercase to avoid ambiguity. Example:
http://foo.org/BAR.html → http://foo.org/bar.html
  • Capitalizing hexadecimal digits – The hexadecimal digits within a percent-encoding triplet (e.g., "%3a") are case-insensitive, so the digits A-F should be written in uppercase. Example:
http://foo.org/?mode=%3a%b1+abc → http://foo.org/?mode=%3A%B1+abc
  • Removing the fragment – The fragment portion of a URL is usually removed because the URL identifies the same resource with or without it. Example:
http://foo.org/bar.html#section1 → http://foo.org/bar.html
  • Removing port 80 – The default HTTP port (80) may be removed from (or added to) a URL. Example:
http://foo.org:80/bar.html → http://foo.org/bar.html
  • Removing ".." and "." segments – The ".." and "." segments are usually removed from a URL. Many normalizers use the algorithm described in RFC 3986 (or a similar algorithm) to remove the segments. Example:
http://foo.org/../a/b/../c/./d.html → http://foo.org/a/c/d.html
  • Adding a terminating slash – A terminating slash may be added at the end of a URL that points to a directory. Most web servers will redirect HTTP requests that are missing a terminating slash to a URL with the terminating slash. Examples:
http://foo.org → http://foo.org/
http://foo.org/dir → http://foo.org/dir/
  • Removing "www" prefix – Some websites can be reached with or without an optional "www" prefix. For example, http://foo.org/ and http://www.foo.org/ may access the same website. Although many websites will redirect the user to the non-www version (or vice versa), some do not. A normalizer may perform extra processing to determine whether a non-www version exists and then normalize all URLs to the non-www form. Example:
http://www.foo.org/ → http://foo.org/
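
The purely syntactic steps in the list above (lowercasing the scheme and host, capitalizing hexadecimal digits, removing the fragment, removing the default port, and removing dot segments) can be sketched in Python with the standard urllib.parse module. This is a minimal illustration rather than a reference implementation. The names normalize_url, remove_dot_segments and DEFAULT_PORTS are hypothetical, the entry for port 443 is an added assumption (the text only mentions port 80), and the dot-segment handling is a simplified stack-based variant for absolute paths, not the RFC 3986 algorithm verbatim. The steps that depend on server behaviour (lowercasing the whole URL, adding a slash to directory URLs, removing "www") are deliberately left out.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Assumption: only port 80 for http is named in the text; 443 is added here.
DEFAULT_PORTS = {"http": 80, "https": 443}

_PERCENT_TRIPLET = re.compile(r"%[0-9a-fA-F]{2}")


def _upper_hex(text: str) -> str:
    """Write the hex digits of every percent-encoding triplet in uppercase."""
    return _PERCENT_TRIPLET.sub(lambda m: m.group(0).upper(), text)


def remove_dot_segments(path: str) -> str:
    """Resolve "." and ".." segments of an absolute path (simplified)."""
    if not path.startswith("/"):
        return path  # relative paths are left untouched in this sketch
    segments = path.split("/")[1:]
    output = []
    for segment in segments:
        if segment == "..":
            if output:
                output.pop()  # step back one level, never above the root
        elif segment != ".":
            output.append(segment)
    # A trailing "." or ".." names a directory, so keep the final slash.
    if segments[-1] in (".", ".."):
        output.append("")
    return "/" + "/".join(output)


def normalize_url(url: str) -> str:
    """Apply the syntactic normalizations described above to one URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = parts.hostname or ""  # urlsplit returns the host in lowercase
    if ":" in host:  # IPv6 literal: restore the brackets urlsplit removed
        host = f"[{host}]"
    hostport = host
    if parts.port is not None and parts.port != DEFAULT_PORTS.get(scheme):
        hostport = f"{host}:{parts.port}"
    # Keep any "user:password@" prefix exactly as it was written.
    userinfo, at, _ = parts.netloc.rpartition("@")
    netloc = userinfo + at + hostport
    path = _upper_hex(remove_dot_segments(parts.path))
    if netloc and not path:
        path = "/"  # http://foo.org and http://foo.org/ are equivalent
    query = _upper_hex(parts.query)
    # The empty last element drops the fragment.
    return urlunsplit((scheme, netloc, path, query, ""))
```

Under these assumptions, normalize_url("HTTP://www.FooBar.com:80/a/../b.html#top") gives http://www.foobar.com/b.html, so two URLs can be compared for equivalence by comparing their normalized forms.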

References


  • RFC 3986 - Uniform Resource Identifier (URI): Generic Syntax

See also


World Wide Web

 

This article is licensed under the GNU Free Documentation License. It uses material from the article "URL normalization".
