Website Page Code Validation Guide - 2
Part 2 - Validation How-to
In Part 1 - Code Validation Background we looked at the reasoning and the theory. In Part 2 we'll look at the code used, and fixing some errors. Please take careful note of the fact
we advise real-world, practical solutions that actually work. In some
cases the theoretical or 'ideal' answer might be different to the
solution we provide - but ours works, for most people, most of the time.
Choosing a Doctype
Here is a list of the four most common Doctypes used. It is unlikely
you would ever need anything else as these apply to 99% of websites. As
stated before, you should pick one to fit your code. If in doubt, pick
HTML Strict - but there could be a better choice - you must analyse
your code carefully. The output from 'tricky' web editors (like
FrontPage) may require the use of HTML Transitional.
1. HTML Strict <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
2. HTML Transitional <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
3. xHTML Strict <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
4. xHTML Transitional <!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
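If you are not sure which of these four a page already declares, a few lines of script can tell you. This is only a minimal sketch in Python (our choice for illustration, nothing to do with the W3C tools); the function name and the lookup table are made up for this example, and anything outside the four Doctypes above is simply reported as "other".

```python
import re

# The four public identifiers listed above, mapped to the names used in this guide.
KNOWN_DOCTYPES = {
    "-//W3C//DTD HTML 4.01//EN": "HTML Strict",
    "-//W3C//DTD HTML 4.01 Transitional//EN": "HTML Transitional",
    "-//W3C//DTD XHTML 1.0 Strict//EN": "xHTML Strict",
    "-//W3C//DTD XHTML 1.0 Transitional//EN": "xHTML Transitional",
}

def identify_doctype(source):
    """Return the guide's name for the page's Doctype, 'other', or None if absent."""
    match = re.search(r'<!DOCTYPE\s+html\s+PUBLIC\s+"([^"]+)"', source, re.IGNORECASE)
    if match is None:
        return None
    return KNOWN_DOCTYPES.get(match.group(1), "other")

page = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">\n<html>'
print(identify_doctype(page))
```

A result of None means the page has no Doctype at all, which is the first thing to fix.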
The W3C website
The
W3C is the place to go for resources in this area. In the past the site
was notorious for its user-unfriendliness, and it certainly qualified
as one of the top ten worst websites in the world for usability.
However, changes are being made and improvements are being seen all the
time.
The problem is that it is geek central - the world's
most geeky geeks run it, and the fact is that many of them seem to live
on another planet. Even other geeks cannot understand what they are
saying - never mind ordinary people - and their communication skills
and appreciation of usability issues are somewhere on a par with that
of a coelacanth.
However, progress is being made, and it is now
actually possible to derive real-world useful help from their site.
Here are two page links for more information on the subjects we cover
here:
http://validator.w3.org/docs/
http://www.w3.org/QA/2002/04/valid-dtd-list.html
Charset
The Charset, or character set, tells the browser which character encoding to use when it turns the bytes of the document into text on the page.
The
W3C are pressing very hard for all web pages, websites, and servers
(which is a tricky area) to use UTF-8. This is a fine idea but doesn't
work in practice as there is too little all-round support. As of 2008 it
is not something we can universally advise as being practical. It's
well worth trying on your own site - but you must test ruthlessly as
some browser / platform combinations (ie operating systems) will
introduce lots of meaningless junk characters. If you see gibberish
characters on the page, the first thing to check is if the page uses a
UTF-8 Charset. You can often fix the problem by reverting to an ISO
Charset - which normally works just fine.
UTF-8 does not work
for extended punctuation or symbols, on many browser / platform
combinations. Cynics would say it doesn't work in any situation for
these characters. Where you see gibberish on the page, 9 times out of
10 this will be a UTF-8 fault. Of course, UTF-8 is not faulty at all;
but there are many implementation issues to be resolved before it can
be used without multiple errors.
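The gibberish usually appears when a page is saved in one Charset and read as another. Here is a minimal sketch of that mismatch in Python; the function name and the sample text are ours, invented purely to illustrate the effect.

```python
def misread(text, saved_as="utf-8", read_as="iso-8859-1"):
    """Encode text the way the editor saved it, then decode it the way the browser was told to."""
    return text.encode(saved_as).decode(read_as)

sample = "caf\u00e9 \u00a310"            # an accented e and a pound sign

# Saved as UTF-8 but read as ISO-8859-1: each symbol turns into two junk characters.
print(misread(sample))

# Saved and read with the same Charset: the text comes back unchanged.
print(misread(sample, read_as="utf-8"))
```

Plain unaccented letters and digits survive either way, which is why only the extended punctuation and symbols go wrong.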
The server also has a hand here,
as the Charset can be set at server level, though it can be
overridden at local level. However, if the server and the page disagree then pages
can display gibberish, which is why hosts tend to leave their servers set to ISO
not UTF.
The advantage of UTF-8 is seen as the ability to cover all types of
languages on all platforms. This is fine but it's not supported yet.
There are 20 billion web pages out there using an ISO or Windows
Charset, so progress may be slow.
Therefore
in most circumstances an ISO Charset will be the right choice for
European-based languages. However, Windows applications output a
Windows Charset so this may have to be used. There is no point in
changing it; though if gibberish characters are seen on some browser /
platform combinations, an ISO Charset could be tried.
Here are the three main Charsets used:
iso-8859-1
windows-1252
utf-8
The HTML statement for their use is as follows:
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<meta http-equiv="content-type" content="text/html; charset=windows-1252">
This
next line is from xHTML as you can see from the final tag closure. To
use it in an HTML doc, just remove the final slash and the gap before
the greater-than symbol at the end:
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
W3C example meta header
Here is the example the W3C give for an xHTML page header. My annotations are the ## hash symbols.
You
will see at the first ## that there is an XML prolog. You do not use
this unless you wish to put (probably all versions of) Internet
Explorer into quirks mode. There is also a possibility it may affect
other browsers. Why they have included it here is unclear.
The second ## marks the language statement, which may or may not be placed here. Other layouts do not use this placement.
Remove the ## symbols if you use this example.
-------------------------------
##<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
##<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>An XHTML 1.0 Strict standard template</title>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
<meta http-equiv="content-style-type" content="text/css" />
</head>
<body>
<p>… Your HTML content here …</p>
</body>
</html>
---------------------------------
Here
is an example you can use for a full HTML header including most of the
basic metadata. You should check each line and ensure the details apply
to your site. For example the CSS file is given as 'main.css', which,
here, is in the webroot. If it was in a folder 'css', perhaps with
other CSS files, it would be written as 'css/main.css'. Replace the hash
symbols with your choice of meta title, meta desc., and
meta keywords. The best advice I can give here is: be brief. This has
the Transitional Doctype in order to make your life easier; once you
have achieved some success I recommend you try the Strict version. Just
change the Doctype here.
---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<title>################</title>
<meta name="description" content="#######################.">
<meta name="keywords" content="#######################">
<!-- <base href="http://www.##################.com" > -->
<meta name="generator" content="NoteTabPro">
<meta name="author" content="admin">
<link rel="StyleSheet" href="main.css" type="text/css">
<link rel="shortcut icon" href="favicon.ico">
<meta name="robots" content="index,follow">
</head>
---------------------------------
Technical note: base href issues
In the header above you will see a tag, base href. This tells browsers
(and search engines) that every relative address on the page is to be
worked out from the address you specify there. This is done for several reasons although the
main reason for its introduction was the pagejacking problem, whereby
sites with loose technical management were easy targets for content
stealers. However it is a good way to tighten up the site code
management. It is not strictly necessary and can be left out. Modern
code editors include it but no older ones do as it is a relatively
modern inclusion.
There are two important things you should know
about this tag, otherwise it will either cause inexplicable,
infuriating problems - or it just won't work:
1. In dev it is commented out - when working on the site you switch it off.
2. On a livesite it is uncommented.
If
it is uncommented and working, while the site is being built, you will
often find that images or scripts won't load. That's because the code
is saying, "Everything on this page comes from the server at......".
But while it's on your LAN that won't be true, and there will be page malfunctions - so comment it out.
However, you have to uncomment it on the livesite or it has no effect. It's invisible and does not operate. Duh!
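To make the dev problem concrete, here is a rough sketch in Python of how a relative address is worked out with and without a base href. The LAN address and the example.com domain are hypothetical, and the helper function is ours; it only mimics the basic rule, not everything a real browser does.

```python
from urllib.parse import urljoin

def resolve(relative, page_url, base_href=None):
    """Return the address that would be requested for a relative link on the page."""
    # With a base href the link is worked out from that address;
    # without one it is worked out from the address of the page itself.
    return urljoin(base_href or page_url, relative)

dev_page = "http://192.168.0.10/newsite/index.html"   # hypothetical dev copy on the LAN
live_base = "http://www.example.com/"                  # hypothetical live domain

# Base href commented out: the image is fetched from the dev machine.
print(resolve("boats.jpg", dev_page))

# Base href active: the image is requested from the live server, where it may not exist yet.
print(resolve("boats.jpg", dev_page, base_href=live_base))
```

The second result is why images and scripts seem to vanish while you are building the site with the tag switched on.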
Fixing validation errors
OK, so now for the difficult bit... It would be no use attempting this without getting all the previous items correct, but we've covered that well enough.
Let's assume you're validating an HTML Transitional page. It's about the
easiest choice. It is definitely a cop-out and a good developer would
never use HTML Transitional - you are not paying them for quirks mode
shortcuts, after all. However, for the average user, it may be
necessary.
We've uploaded a test page to this site that you can
play with. You can validate it here or, better still, take it away and
work with it on your computer. Upload it to your own site so that you
can change things and test it. You'll need the image that comes with
it, the nice boat picture, so make sure to save that when you save the
page. Also, you will need to FTP that image up to your site as well.
Alternatively, you can validate a page by uploading the file to the W3C
- they don't insist you run it from a website.
Here is our test page for your use: validator test page
There are about 8 faults reported on this page and we can use them to illustrate some general principles. These are:
1. Start at the top and work down. Many faults are cascading - that is to
say, they are only there because of a previous fault. Remove the
earlier fault and they disappear.
2. You can safely remove some
proprietary code. Some faulty applications (most, in fact, a few years
ago) insert their own weird statements or pseudo-commands which have no
known relevance to any modern browser's requirements. They might
originally have been in there for IE3 or Netscape 1 etc. Remove them
and see what the result is. Your ideal method here would be to look up
an HTML tutorial to see what the code should look like in reality, then
adjust it.
Removing some proprietary code can cause a page to
break, though. Even though it has no relevance now, browsers know to
apply quirks mode and build the page as an obsolete-coded page for
Internet Explorer 5, for example, when they see this code. What you
must do is to transfer the functionality to CSS. This applies to code
tags seen such as <microsoft_border> that have been obsolete for
years and were never part of HTML in any case.
Text editor for validation
You need a decent text
editor to fix these faults. Ideally it will have line numbers and
colour highlighting. If you are trying to do this job with something
like Notepad, you are making life impossible for yourself. There are
plenty of good free or cheap editors out there that are efficient for
this work, such as Crimson Editor or NoteTabPro. A good text HTML
editor will also give you the correct code to insert.
Head and body of page
An
HTML page comprises two parts, the head and the body. The head does
not print, ie it does not show on the page. Anything in the head is
invisible. It starts with <head> and ends with </head>. The
body starts immediately after, with <body>, and ends of course
with </body>. HTML opening and closing tags are above and below
this. So an outline example is:
<!DOCTYPE>
<html>
<head>
- - -your header tags here- - -
</head>
<body>
- - -your page content here- - -
</body>
</html>
Text case in validation
In
theory, only xHTML needs to be all lower case. In practice it won't
hurt if all your code is lower case. It's good practice now. Best to
change all code to lower case if you are changing elements of it. Of
course, this would be a major job by hand - if not impossible - but
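it's one click to do the entire page in a real text editor. Just blue
all the text out (highlight it with Ctrl+A) >> find your text
controls in the menu >> then, Case >> then hit Lower Case.
Sorted. Bear in mind this lower-cases everything, including the visible page text and any file names in links, so check those afterwards.
Oh, leave the Doctype as upper case. Sorry.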
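If you would rather not lower-case the visible text along with the code, a script can be more selective. The following is only a sketch in Python under a simplifying assumption: it lower-cases tag names and nothing else, so attribute names, page text and the Doctype are left exactly as they were. The function name is ours.

```python
import re

# Matches "<" or "</" followed by a tag name. "<!DOCTYPE" and "<!--" are skipped
# because the character after "<" is not a letter.
TAG_NAME = re.compile(r"(</?)([A-Za-z][A-Za-z0-9]*)")

def lowercase_tag_names(source):
    """Lower-case tag names only; attributes, text and the Doctype are untouched."""
    return TAG_NAME.sub(lambda m: m.group(1) + m.group(2).lower(), source)

page = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<BODY><H2>Boats For Sale</H2></BODY>'
print(lowercase_tag_names(page))
```

Attribute names would still need doing by hand or with the editor, but the headings and body text keep their capitals.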
OK - let's look at this dud page. Go to the W3C online validator and put in that test page address,
www.a3webtech.com/bad-page.html
...into the validator at:
http://validator.w3.org
Validation errors and how to fix them
The
first error is reported as being on line 16. Well, this is a good
start, at least we have 15 lines with no errors then... Ignore the
Column number unless you have a text editor that can identify columns.
The number does give you a good idea where the fault is, though, in a
long line.
1. Error Line 16, Column 13: there is no attribute "SCROLL".
<body SCROLL="auto" BGCOLOR="#FFFAFA" TOPMARGIN=0 LEFTMARGIN=0>
We
are looking here at the very first line of the page proper -
the previous code is in the header. The Validator is telling us it
doesn't recognise the word 'scroll' since it does not exist in HTML.
Well, if it doesn't exist, it won't hurt to lose it then - chop it out.
Remove SCROLL="auto".
2. Error Line 16, Column 48: there is no attribute "TOPMARGIN".
<body SCROLL="auto" BGCOLOR="#FFFAFA" TOPMARGIN=0 LEFTMARGIN=0>
Again, the 'attribute' (a fancy term for a setting written inside a tag) TOPMARGIN apparently does not exist in HTML so it is superfluous. Lose it.
3. Error Line 16, Column 61: there is no attribute "LEFTMARGIN".
… SCROLL="auto" BGCOLOR="#FFFAFA" TOPMARGIN=0 LEFTMARGIN=0>
And again - more of the same. LEFTMARGIN doesn't exist in HTML. Ditch it.
OK,
you can probably see what is happening here. I've used an old HTML
editor that inserts long-obsolete commands for Netscape 1 or something.
All these old tags can go, they are not part of HTML any more (if they
ever were). The final result when that line is cleaned up is as follows:
<body bgcolor="#fffafa">
That's
all that is needed - a body start command and a background (bg) colour.
If the page is white it doesn't need the colour either. Occam's Razor.
In that case you'd simply have:
<body>
Note
that spellings in code are always in US English, not International
English. There are also several famous misspellings, where the
original geeks who set up the HTML spec couldn't spell for toffee and
got it wrong. For example 'seperator'; you have to spell it wrongly
like this, or the code doesn't work...
4. Line 18, Column 32: end tag for element "H3" which is not open.
<h2>A Slightly Bad Test Page</h3>
This
is hard to interpret for a newbie, but you should look carefully at the
line identified, which has a typo. A text heading level like this
determines the size of the headings on the page. This is a large
heading and is an H2. Or is it? It's closed with, oops, H3. It should
be </h2> because that's what the opening tag says.
This
error has created a bunch of problems that cascade down the page.
Because the tags didn't match, there is an open tag and a closed one
with no opener. Hah. A number of cascading errors have resulted, which
appear lower down. One simple typo has wrecked the page.
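To see how one mismatched pair produces more than one complaint, here is a rough stand-in for this part of the Validator, sketched in Python with the standard html.parser module. It is not how the W3C Validator works internally; the class, the function and the wording of the messages are ours, and it only tracks which tags are open.

```python
from html.parser import HTMLParser

# Tags that never have an end tag, so they are not tracked.
VOID = {"area", "base", "br", "col", "hr", "img", "input", "link", "meta", "param"}

class TagBalanceChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []   # (tag name, line it was opened on)
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append((tag, self.getpos()[0]))

    def handle_endtag(self, tag):
        if tag in VOID:
            return
        line = self.getpos()[0]
        if tag not in [name for name, _ in self.open_tags]:
            self.problems.append(f'line {line}: end tag for "{tag}" which is not open')
            return
        # Close everything opened since the matching start tag, reporting each one.
        while self.open_tags:
            name, opened = self.open_tags.pop()
            if name == tag:
                break
            self.problems.append(f'line {line}: end tag for "{name}" omitted (opened on line {opened})')

def check_balance(source):
    """Return a list of tag-matching problems found in the page source."""
    checker = TagBalanceChecker()
    checker.feed(source)
    checker.close()
    for name, opened in checker.open_tags:
        checker.problems.append(f'end of page: end tag for "{name}" omitted (opened on line {opened})')
    return checker.problems

for problem in check_balance("<body>\n<h2>A Slightly Bad Test Page</h3>\n</body>"):
    print(problem)
```

One typo gives two reports: the stray h3 end tag, and later the h2 that was never closed. Fix the heading and both go away.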
5. [Warning] Line 24, Column 33: cannot generate system identifier for general entity "also".
Lots of lovely text here anyway &also a nice image just below.
Here
is a common issue: the use of an ampersand (&) on a page. In fact
we shouldn't use these, for two reasons: it's bad style in English to
use this, you should just say 'and'; also that symbol is used in code
and has a specific meaning. If we use it we risk many problems. Here,
because of a typo it's close up against another word and the Validator
is trying to work out what code statement is being made. None of course
- so that's caused a bunch of errors, which, again, cascade down the
page. Whoops.
The Validator is totally confused and keeps
looking for an 'entity' that isn't there: a code object which doesn't
exist. The answer here is simply to correct the typo; but in reality we
would be better off not using this symbol at all as it is asking for
trouble. However you can 'escape' it in the raw code if you write the
correct code for an ampersand directly into the code: &amp;
You would
use this if you cannot avoid using the symbol in text and it creates an
issue. This might happen if it were part of a firm's official name for
example, like Smith & Sons, which you would write in the code as Smith &amp; Sons. Otherwise - don't use ampersands.
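If text reaches your pages from a script or a database, the escaping can be done for you. Here is a minimal sketch using Python's standard html module; the wrapper function name is ours.

```python
from html import escape

def escape_text(text):
    """Escape &, < and > so the text is safe to place in the page body."""
    return escape(text, quote=False)

print(escape_text("Smith & Sons"))   # Smith &amp; Sons
```

Run it once only on the raw text: escaping text that is already escaped turns &amp; into &amp;amp; and the code shows up on the page.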
6. Line 24, Column 33: general entity "also" not defined and no default entity.
Lots of lovely text here anyway &also a nice image just below.
Again the ampersand problem, which has cascaded down to here.
7. Line 24, Column 37: reference to entity "also" for which no system identifier could be generated.
Lots of lovely text here anyway &also a nice image just below.
[info] Line 24, Column 32: entity was defined here.
Lots of lovely text here anyway &also a nice image just below.
Still the ampersand problem. Change it to 'and', as it should be.
8. Line 27, Column 69: required attribute "ALT" not specified.
…width="242" height="351" border="0" align="left">
Right,
we're on to something new at last. This is an image / graphics problem
and there will be plenty of those. The Validator tells us we are
missing an attribute called 'alt'. You will often see this called the
'alt tag', which is slang for the 'alt' or
'alternative text' attribute: a text description for an
image. Every image needs this.
We need to add this attribute into the line so that it reads:
<img src="boats.jpg" alt="" width="242" height="351" border="0" align="left">
That's
all you need, a blank, empty alt tag (there is nothing between the
quotation marks). However, for best results we should indeed use some
sort of description, so we could put this:
<img src="boats.jpg" alt="A nice picture of boats" width="242" height="351" border="0" align="left">
The
main purpose is to provide an alternative description of the content in
case images are turned off in the browser, or for sight-handicapped
people using screen reader browsers (a browser that converts page
content to a voice readout). This is an accessibility issue - you
should not be barring people with equipment or physical handicaps from
accessing your page content.
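On a page with many images it is easy to miss one. As a rough sketch, again in Python with the standard html.parser module, the following lists every image that has no alt attribute at all. The class and function names are ours, and an empty alt="" counts as present, just as it does for the Validator.

```python
from html.parser import HTMLParser

class MissingAltFinder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.missing = []   # (line number, src) of each image without an alt attribute

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag == "img" and "alt" not in attributes:
            self.missing.append((self.getpos()[0], attributes.get("src")))

def images_without_alt(source):
    """Return (line, src) for every img tag in the source that lacks an alt attribute."""
    finder = MissingAltFinder()
    finder.feed(source)
    finder.close()
    return finder.missing

page = '<p>Text</p>\n<img src="boats.jpg" width="242" height="351" border="0" align="left">'
for line, src in images_without_alt(page):
    print(f"line {line}: {src} has no alt attribute")
```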
9. Error Line 29, Column 6: end tag for "H2" omitted, but its declaration does not permit this.
</body>
[Info] Line 18, Column 0: start tag was here.
<h2>A Slightly Bad Test Page</h3>
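Well
- this is the last of the cascading errors caused by the typo
in that heading tag. The h2 was never properly closed, so the Validator was still waiting for its end tag when it reached the end of the body. Fix the H-level fault and this will disappear.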
And that's it for a basic validation tute. It's possible to
go on for pages like this, but you should have got the basic principles
of it now - which are:
- Use a decent text editor to hunt down these problems and fix them
- Start at the top, not in the middle somewhere
- Have a good HTML code reference at hand so you can see what it should be if done correctly
- Remove all proprietary tags because most are no longer necessary. Deleting some, though, will require you to convert some missing functionality into CSS.
Did you find this page useful?
If so, please consider linking to it. Thank you. Please
use our forum if you need more advice. Post to any board, don't worry
about the title of the board - we'll move the post if necessary. Tell
us if you need more help, or if you can add to this tute. We are
certainly not experts here and can use any help offered. Validation is
impossibly complex at first, but gets easier.
And if you want to make
it really easy, get a clean modern CMS :-)