Website Page Code Validation Guide - 2




Part 2 - Validation How-to

In Part 1 - Code Validation Background we looked at the reasoning and the theory. In Part 2 we'll look at the code used, and at fixing some errors.

Please take careful note of the fact we advise real-world, practical solutions that actually work. In some cases the theoretical or 'ideal' answer might be different to the solution we provide - but ours works, for most people, most of the time.

Choosing a Doctype

Here is a list of the four most common Doctypes used. It is unlikely you would ever need anything else as these apply to 99% of websites. As stated before, you should pick one to fit your code. If in doubt, pick HTML Strict - but there could be a better choice - you must analyse your code carefully. The output from 'tricky' web editors (like FrontPage) may require the use of HTML Transitional.

1. HTML Strict
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

2. HTML Transitional
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN" "http://www.w3.org/TR/html4/loose.dtd">

3. xHTML Strict
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

4. xHTML Transitional
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">

The W3C website

The W3C is the place to go for resources in this area. In the past the site was notorious for its user-unfriendliness, and it certainly qualified as one of the top ten worst websites in the world for usability. However, changes are being made and improvements are being seen all the time.

The problem is that it is geek central - the world's most geeky geeks run it, and the fact is that many of them seem to live on another planet. Even other geeks cannot understand what they are saying - never mind ordinary people - and their communication skills and appreciation of usability issues are somewhere on a par with that of a coelacanth.

However, progress is being made, and it is now actually possible to derive real-world useful help from their site. Here are two page links for more information on the subjects we cover here:

http://validator.w3.org/docs/
http://www.w3.org/QA/2002/04/valid-dtd-list.html

Charset

The Charset, or character set, tells the browser which character encoding to use when turning the bytes of the document into text on the page.

The W3C are pressing very hard for all web pages, websites, and servers (which is a tricky area) to use UTF-8. This is a fine idea but doesn't work in practice as there is too little all-round support. As of 2008 it is not something we can universally advise as being practical. It's well worth trying on your own site - but you must test ruthlessly, as some browser / platform (i.e. operating system) combinations will introduce lots of meaningless junk characters. If you see gibberish characters on the page, the first thing to check is whether the page uses a UTF-8 Charset. You can often fix the problem by reverting to an ISO Charset - which normally works just fine.

On many browser / platform combinations, UTF-8 does not work for extended punctuation or symbols. Cynics would say it doesn't work in any situation for these characters. Where you see gibberish on the page, 9 times out of 10 this will be a UTF-8 fault. Of course, UTF-8 itself is not faulty at all: the junk appears when a page is saved in one Charset and read in another, and there are many such implementation issues to be resolved before it can be used without multiple errors.
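The gibberish described here usually appears when the bytes of a page were written in one Charset and read back in another. The following is a minimal Python sketch of that mechanism, not a diagnosis of any particular browser; the sample text is made up for illustration.

```python
# Text containing an accented letter and "extended punctuation" (a curly apostrophe).
original = "café’s"

# The page is saved as UTF-8 ...
utf8_bytes = original.encode("utf-8")

# ... but read back as if it were windows-1252: the classic junk characters.
garbled = utf8_bytes.decode("windows-1252")
print(garbled)               # prints: cafÃ©â€™s

# Reading the same bytes with the Charset they were written in is lossless.
restored = utf8_bytes.decode("utf-8")
print(restored == original)  # prints: True
```

If the junk on your page looks like the first printed line, the likely cause is a mismatch between how the file was saved and the Charset it declares.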

The server also has a hand here, as a Charset can be set at server level and sent in the HTTP Content-Type header. The HTML 4.01 specification gives that header priority over the meta statement in the page. If the two conflict, pages can break, which is why hosts tend to leave their servers set to ISO, not UTF.

The advantage of UTF-8 is seen as the ability to cover all types of languages on all platforms. This is fine but it's not supported yet. There are 20 billion web pages out there using an ISO or Windows Charset, so progress may be slow.

Therefore, in most circumstances an ISO Charset will be the right choice for European-based languages. However, Windows applications output a Windows Charset, so this may have to be used. There is no point in changing it; though if gibberish characters are seen on some browser / platform combinations, an ISO Charset could be tried.

Here are the three main Charsets used:

iso-8859-1
windows-1252
utf-8

The HTML statement for their use is as follows:

<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">

<meta http-equiv="content-type" content="text/html; charset=windows-1252">

This next line is from xHTML as you can see from the final tag closure. To use it in an HTML doc, just remove the final slash and the gap before the greater-than symbol at the end:

<meta http-equiv="content-type" content="text/html;charset=utf-8" />
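When checking a page that shows gibberish, it can help to read back which Charset it actually declares. This is a rough Python sketch using a regular expression rather than a full HTML parser; the function name `declared_charset` is hypothetical.

```python
import re

# Matches charset=... inside a meta content attribute, with or without
# a space after the semicolon. A sketch only, not a full HTML parser.
CHARSET_RE = re.compile(r'charset\s*=\s*["\']?([A-Za-z0-9_-]+)', re.IGNORECASE)

def declared_charset(html):
    """Return the Charset named in the page's meta statement, or None."""
    match = CHARSET_RE.search(html)
    return match.group(1).lower() if match else None

print(declared_charset('<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">'))
print(declared_charset('<meta http-equiv="content-type" content="text/html;charset=utf-8" />'))
```

The two calls print iso-8859-1 and utf-8 respectively; a page with no Charset statement gives None.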

W3C example meta header

Here is the example the W3C give for an xHTML page header. My annotations are the ## hash symbols.

You will see at the first ## that there is an XML prolog. You do not use this unless you wish to put (probably all versions of) Internet Explorer into quirks mode. There is also a possibility it may affect other browsers. Why they have included it here is unclear.

The second ## marks the language statement, which may or may not be placed here. Other layouts do not use this placement.

Remove the ## symbols if you use this example.

-------------------------------
##<?xml version="1.0" encoding="utf-8"?>
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
##<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>An XHTML 1.0 Strict standard template</title>
<meta http-equiv="content-type" content="text/html;charset=utf-8" />
<meta http-equiv="content-style-type" content="text/css" />
</head>

<body>
<p>… Your HTML content here …</p>

</body>
</html>
---------------------------------


Here is an example you can use for a full HTML header including most of the basic metadata. You should check each line and ensure the details apply to your site. For example, the CSS stylesheet is given as 'main.css', which, here, is in the webroot. If it were in a folder 'css', perhaps with other stylesheets, it would be written as 'css/main.css'. The hash symbols should be replaced with your choice of meta title, meta description, and meta keywords. The best advice I can give here is: be brief. This has the Transitional Doctype in order to make your life easier; once you have achieved some success I recommend you try the Strict version. Just change the Doctype here.

---------------------------------
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
"http://www.w3.org/TR/html4/loose.dtd">
<html lang="en">
<head>
<meta http-equiv="content-type" content="text/html; charset=iso-8859-1">
<title>################</title>
<meta name="description" content="#######################.">
<meta name="keywords" content="#######################">
<!-- <base href="http://www.##################.com" > -->
<meta name="generator" content="NoteTabPro">
<meta name="author" content="admin">
<link rel="StyleSheet" href="main.css" type="text/css">
<link rel="shortcut icon" href="favicon.ico">
<meta name="robots" content="index,follow">
</head>
---------------------------------

Technical note: base href issues

In the header above you will see a tag, base href. This tells browsers (and search engines) that relative links on the page are to be resolved against the address you specify there, normally the root of the site at your domain name. This is done for several reasons; a common one is the pagejacking problem, whereby sites with loose technical management were easy targets for content stealers. It is also a good way to tighten up the site code management. It is not strictly necessary and can be left out. Some code editors include it and others do not.

There are two important things you should know about this tag, otherwise it will either cause inexplicable, infuriating problems - or it just won't work:

1. In dev it is commented out - when working on the site you switch it off.
2. On a livesite it is uncommented.

If it is uncommented and working, while the site is being built, you will often find that images or scripts won't load. That's because the code is saying, "Everything on this page comes from the server at......".

But while it's on your LAN that won't be true, and there will be page malfunctions - so comment it out.

However, you have to uncomment it on the livesite or it has no effect. It's invisible and does not operate. Duh!
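The effect of base href on relative links can be imitated with Python's standard `urljoin`. This is only an illustration of how the addresses resolve; the domain, the localhost path, and the image path are all hypothetical.

```python
from urllib.parse import urljoin

# Hypothetical addresses, for illustration only.
LIVE_BASE = "http://www.example.com/"
DEV_PAGE = "http://localhost/mysite/index.html"

image = "images/boats.jpg"

# With base href active, relative links resolve against the live domain,
# even when the page itself was opened from the development machine.
print(urljoin(LIVE_BASE, image))  # prints: http://www.example.com/images/boats.jpg

# With base href commented out, they resolve against the page's own address.
print(urljoin(DEV_PAGE, image))   # prints: http://localhost/mysite/images/boats.jpg
```

This is why images vanish in dev when the tag is left active: the browser goes looking for them on the live server.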

Fixing validation errors

OK, so now for the difficult bit...

It would be no use attempting this without getting all the previous items correct, but we've covered that well enough.

Let's assume you're validating an HTML Transitional page. It's about the easiest choice. It is definitely a cop-out and a good developer would never use HTML Transitional - you are not paying them for quirks mode shortcuts, after all. However, for the average user, it may be necessary.

We've uploaded a test page to this site that you can play with. You can validate it here or, better still, take it away and work with it on your computer. Upload it to your own site so that you can change things and test it. You'll need the image that comes with it, the nice boat picture, so make sure to save that when you save the page. Also, you will need to FTP that image up to your site as well. Alternatively, you can validate a page by uploading the file to the W3C - they don't insist you run it from a website.

Here is our test page for your use:

validator test page

There are about 8 faults reported on this page and we can use them to illustrate some general principles. These are:

1. Start at the top and work down. Many faults are cascading - that is to say, they are only there because of a previous fault. Remove the earlier fault and they disappear.

2. You can safely remove some proprietary code. Some faulty applications (most, in fact, a few years ago) insert their own weird statements or pseudo-commands which have no known relevance to any modern browser's requirements. They might originally have been in there for IE3 or Netscape 1 etc. Remove them and see what the result is. Your ideal method here would be to look up an HTML tutorial to see what the code should look like in reality, then adjust it.

Removing some proprietary code can cause a page to break, though. Even though it has no relevance now, browsers that see this code know to apply quirks mode and build the page as an obsolete-coded page for Internet Explorer 5, for example. What you must do is transfer the functionality to CSS. This applies to tags such as <microsoft_border>, which have been obsolete for years and were never part of HTML in any case.

Text editor for validation

You need a decent text editor to fix these faults. Ideally it will have line numbers and colour highlighting. If you are trying to do this job with something like Notepad, you are making life impossible for yourself. There are plenty of good free or cheap editors out there that are efficient for this work, such as Crimson Editor or NoteTabPro. A good text HTML editor will also give you the correct code to insert.

Head and body of page

An HTML page comprises two parts, the head and the body. The head does not print, i.e. it does not show on the page. Anything in the head is invisible. It starts with <head> and ends with </head>. The body starts immediately after, with <body>, and ends of course with </body>. The <html> opening and closing tags sit above and below both parts, with the Doctype on the very first line. So an outline example is:

<!DOCTYPE>
<html>
<head>
 - - -your header tags here- - -
</head>
<body>
- - -your page content here- - -
</body>
</html>

Text case in validation

In theory, only xHTML needs to be all lower case. In practice it won't hurt if all your code is lower case, and it's good practice now. It's best to change the code to lower case whenever you are changing elements of it. Of course, this would be a major job by hand - if not impossible - but it's one click to do the entire page in a real text editor. Just select all the text (Ctrl+A) >> find your text controls in the menu >> then Case >> then hit Lower Case. Sorted.

Oh, leave the Doctype as upper case. Sorry. Bear in mind, too, that lower-casing the whole page also changes your visible text and any file names in links, so check those afterwards.
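If your editor has no suitable case command, a small script can lower-case the tag names while leaving the Doctype and the page text alone. This is a minimal Python sketch with a hypothetical function name; it handles tag names only, not attribute names, and it will also touch tags inside comments.

```python
import re

# Matches the name of an opening or closing tag.
# "<!DOCTYPE" is skipped because "!" is not a letter.
TAG_NAME_RE = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)")

def lowercase_tag_names(html):
    """Lower-case tag names only; attributes and page text are untouched."""
    return TAG_NAME_RE.sub(lambda m: "<" + m.group(1) + m.group(2).lower(), html)

page = '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">\n<H2>My Boats</H2>'
print(lowercase_tag_names(page))
```

The Doctype line comes back unchanged, and the heading becomes <h2>My Boats</h2> with its text intact.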


OK - let's look at this dud page. Go to the W3C online validator and put in that test page address,

www.a3webtech.com/bad-page.html

...into the validator at:

http://validator.w3.org

Validation errors and how to fix them

The first error is reported as being on line 16. Well, this is a good start, at least we have 15 lines with no errors then... Ignore the Column number unless you have a text editor that can identify columns. The number does give you a good idea where the fault is, though, in a long line.

1. Error  Line 16, Column 13: there is no attribute "SCROLL".

<body SCROLL="auto" BGCOLOR="#FFFAFA" TOPMARGIN=0 LEFTMARGIN=0>

We are looking here at the very first line of code in the body of the page - everything before it is in the header. The Validator is telling us it doesn't recognise the word 'scroll' since it does not exist in HTML. Well, if it doesn't exist, it won't hurt to lose it then - chop it out. Remove SCROLL="auto".


2. Error Line 16, Column 48: there is no attribute "TOPMARGIN".

<body SCROLL="auto" BGCOLOR="#FFFAFA" TOPMARGIN=0 LEFTMARGIN=0>

Again, the 'attribute' (a fancy term for a setting written inside a tag) TOPMARGIN apparently does not exist in HTML so it is superfluous. Lose it.


3. Error  Line 16, Column 61: there is no attribute "LEFTMARGIN".

… SCROLL="auto" BGCOLOR="#FFFAFA" TOPMARGIN=0 LEFTMARGIN=0>

And again - more of the same. LEFTMARGIN doesn't exist in HTML. Ditch it.

OK, you can probably see what is happening here. I've used an old HTML editor that inserts long-obsolete commands for Netscape 1 or something. All these old attributes can go; they are not part of HTML any more (if they ever were). The final result when that line is cleaned up is as follows:

<body bgcolor="#fffafa">

That's all that is needed - a body start command and a background (bg) colour. If the page is white it doesn't need the colour either. Occam's Razor. In that case you'd simply have:

<body>
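On a site with many pages, the same clean-up can be scripted. The sketch below is one possible approach in Python, assuming you already know which attributes the Validator rejected; the function name `strip_attributes` and the tuple `UNKNOWN` are hypothetical.

```python
import re

# Attributes the Validator rejected on the test page's body tag.
UNKNOWN = ("scroll", "topmargin", "leftmargin")

def strip_attributes(tag, names):
    """Remove the named attributes (quoted or unquoted) from one tag."""
    pattern = re.compile(
        r"\s+(?:%s)\s*=\s*(?:\"[^\"]*\"|'[^']*'|[^\s>]+)" % "|".join(names),
        re.IGNORECASE,
    )
    return pattern.sub("", tag)

old = '<body SCROLL="auto" BGCOLOR="#FFFAFA" TOPMARGIN=0 LEFTMARGIN=0>'
print(strip_attributes(old, UNKNOWN).lower())  # prints: <body bgcolor="#fffafa">
```

Re-validate after any automated change like this; a script cannot tell you whether the removed attribute was doing something you now need to replace with CSS.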

Note that spellings in code are always in US English, not International English: 'color' and 'center', for example. There are also some famous misspellings in web standards, where the original geeks couldn't spell for toffee and got it wrong. The best known is the HTTP header 'Referer', which should be 'Referrer'; you have to spell it wrongly like this, or the code doesn't work...


4. Line 18, Column 32: end tag for element "H3" which is not open.

<h2>A Slightly Bad Test Page</h3>

This is hard to interpret for a newbie, but you should look carefully at the line identified, which has a typo. A text heading level like this determines the size of the headings on the page. This is a large heading and is an H2. Or is it? It's closed with, oops, H3. It should be </h2> because that's what the opening tag says.

This error has created a bunch of problems that cascade down the page. Because the tags didn't match, there is an opening tag with no closer and a closing tag with no opener. A number of cascading errors have resulted, which appear lower down. One simple typo has wrecked the page.
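To see why one mismatched heading produces two separate complaints, here is a simplified Python sketch that tracks open tags on a stack. It is not how the W3C Validator works internally, and it ignores the HTML rules that allow some end tags to be left out; the class and function names are hypothetical.

```python
from html.parser import HTMLParser

# Tags that never take a closing tag, so they are never put on the stack.
VOID = {"br", "img", "hr", "meta", "link", "input", "base"}

class TagChecker(HTMLParser):
    def __init__(self):
        super().__init__()
        self.open_tags = []   # stack of tags opened but not yet closed
        self.problems = []

    def handle_starttag(self, tag, attrs):
        if tag not in VOID:
            self.open_tags.append(tag)

    def handle_startendtag(self, tag, attrs):
        pass  # self-closing tags such as <br /> open and close at once

    def handle_endtag(self, tag):
        if tag not in self.open_tags:
            self.problems.append(f'end tag for "{tag}" which is not open')
            return
        # Close everything opened after the matching tag, reporting each one.
        while self.open_tags:
            top = self.open_tags.pop()
            if top == tag:
                break
            self.problems.append(f'end tag for "{top}" omitted')

def check_tags(html):
    checker = TagChecker()
    checker.feed(html)
    checker.close()
    for tag in reversed(checker.open_tags):
        checker.problems.append(f'end tag for "{tag}" omitted')
    return checker.problems

for problem in check_tags("<body><h2>A Slightly Bad Test Page</h3></body>"):
    print(problem)
```

The sketch reports the stray closing h3 first and the unclosed h2 only when it reaches the end of the body, which is the same pattern the Validator shows for this page.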


5. [Warning] Line 24, Column 33: cannot generate system identifier for general entity "also".

Lots of lovely text here anyway &also a nice image just below.

Here is a common issue: the use of an ampersand (&) on a page. In fact we shouldn't use these, for two reasons. First, it's bad style in English; you should just say 'and'. Second, that symbol is used in code and has a specific meaning, so if we use it we risk many problems. Here, because of a typo, it's right up against another word and the Validator is trying to work out what code statement is being made. None of course - so that's caused a bunch of errors, which, again, cascade down the page. Whoops.

The Validator is totally confused and keeps looking for an 'entity' that isn't there: a named code object which doesn't exist. The answer here is simply to correct the typo; but in reality we would be better off not using this symbol at all, as it is asking for trouble. However, you can 'escape' it by writing the entity for an ampersand directly into the raw code:
&amp;

You would use this if you cannot avoid using the symbol in text and it creates an issue. This might happen if it were part of a firm's official name, for example Smith & Sons. Otherwise - don't use ampersands.
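Where ampersands cannot be avoided across many pages, the escaping can be done by script. The following Python sketch replaces only the ampersands that do not already start an entity; the function name is hypothetical, and the pattern is a reasonable approximation rather than a full implementation of the HTML rules.

```python
import re

# An ampersand that does not already start an entity such as &amp; or &#169;
BARE_AMP_RE = re.compile(r"&(?!(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);)")

def escape_bare_ampersands(text):
    """Replace stray ampersands with the entity; leave real entities alone."""
    return BARE_AMP_RE.sub("&amp;", text)

print(escape_bare_ampersands("Smith & Sons"))      # prints: Smith &amp; Sons
print(escape_bare_ampersands("Smith &amp; Sons"))  # prints: Smith &amp; Sons
```

Note that this only makes the code valid. On the test page the right fix is still to correct the typo and write 'and'.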


6. Line 24, Column 33: general entity "also" not defined and no default entity.

Lots of lovely text here anyway &also a nice image just below.

Again the ampersand problem, which has cascaded down to here.


7. Line 24, Column 37: reference to entity "also" for which no system identifier could be generated.

Lots of lovely text here anyway &also a nice image just below.

[info] Line 24, Column 32: entity was defined here.

Lots of lovely text here anyway &also a nice image just below.

Still the ampersand problem. Change it to 'and', as it should be, and all three reports disappear.


8. Line 27, Column 69: required attribute "ALT" not specified.

…width="242" height="351" border="0" align="left">

Right, we're on to something new at last. This is an image / graphics problem and there will be plenty of those. The Validator tells us we are missing an attribute called 'alt'. This is often loosely called the 'alt tag'; strictly it is the 'alt' or 'alternative text' attribute, which gives a text description for an image. Every image needs this.

We need to add this attribute into the line so that it reads:

<img src="boats.jpg" alt="" width="242" height="351" border="0" align="left">

That's all you need to pass validation: an empty alt attribute (there is nothing between the quotation marks). However, for best results we should indeed use some sort of description, so we could put this:

<img src="boats.jpg" alt="A nice picture of boats" width="242" height="351" border="0" align="left">

The main purpose is to provide an alternative description of the content in case images are turned off in the browser, or for visually impaired people using a screen reader (software that converts page content to a voice readout). This is an accessibility issue - you should not be barring people from your page content because of the equipment they use or a disability.
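Images with no alt attribute can be found before you even run the Validator. This is a small Python sketch using the standard library's HTML parser; the class and function names are hypothetical, and an empty alt="" is deliberately counted as present, as the Validator does.

```python
from html.parser import HTMLParser

class AltChecker(HTMLParser):
    """Collect the src of every img tag that has no alt attribute at all."""

    def __init__(self):
        super().__init__()
        self.missing = []

    def handle_starttag(self, tag, attrs):
        if tag == "img":
            attributes = dict(attrs)
            if "alt" not in attributes:
                self.missing.append(attributes.get("src", "(no src)"))

def images_missing_alt(html):
    checker = AltChecker()
    checker.feed(html)
    checker.close()
    return checker.missing

bad = '<img src="boats.jpg" width="242" height="351" border="0" align="left">'
print(images_missing_alt(bad))  # prints: ['boats.jpg']
```

The script can only tell you that an alt attribute exists, not whether the description in it is any good.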


9. Error  Line 29, Column 6: end tag for "H2" omitted, but its declaration does not permit this.

</body>

[Info]  Line 18, Column 0: start tag was here.

<h2>A Slightly Bad Test Page</h3>

Well - this is the last of the cascading errors caused by the mismatched heading tags on line 18. Fix that heading fault (close the H2 with </h2>) and this will disappear.


And that's it for a basic validation tute. It's possible to go on for pages like this, but you should have got the basic principles of it now - which are:
  • Use a decent text editor to hunt down these problems and fix them

  • Start at the top, not in the middle somewhere

  • Have a good HTML code reference at hand so you can see what it should be if done correctly

  • Remove proprietary tags and attributes, because most are no longer necessary. Deleting some, though, will require you to move the missing functionality into CSS.



75% or more of websites you check will be faulty, even on the front page - and this applies to sites built by experts, as well. You have to decide if quality is important to you or not. Cheap trash is everywhere, and is the common standard.
 
Please tell us if you need more help, or if you can add to this tute. We are certainly not experts here and can use any help offered. Validation is impossibly complex at first, but gets easier.  
 
And if you want to make it really easy, get a clean modern CMS :-)

 