LenientHtmlDocument making changes to the code...

Topics: Developer Forum, User Forum
Jan 15, 2008 at 2:47 AM
First let me say your Library has proven very educational. I appreciate your work on so many useful utilities and classes.

I do have a question about LenientHtmlDocument. It seems to modify the HTML it parses. For example, <a name="top" id="top"></a> becomes <a name="top" id="top" />. The application I am considering this class for has to preserve the source HTML without modification. Can you help me figure this one out? It looks to me like your code discards the end tag and relies on the XML element to close itself. Any thoughts?

Thanks
Jan 15, 2008 at 1:51 PM
I have a fix for the issue, but it's outside of the parsing: I am modifying the XML returned from the parser. I am still curious about the design decision behind the modification to these elements. Again, really fantastic code for me to play with. It has introduced a whole bunch of techniques I was unfamiliar with.
Coordinator
Jan 21, 2008 at 8:54 PM
Hey jhersh, glad you found it helpful. The behavior you're seeing is expected: all the parser does is build an XmlDocument, and when a node is added without child nodes (such as <a></a>), .NET's System.Xml by default prints it without an ending tag when you use InnerXml or OuterXml. To get the ending tag explicitly, you'll have to use an XmlWriter, something like:

using System;
using System.Collections.Generic;
using System.Text;
using PublicDomain.Xml;
using PublicDomain;
using System.Xml;
using System.IO;
 
namespace play
{
    public class Program
    {
        public static void Main(string[] args)
        {
            LenientHtmlDocument doc = new LenientHtmlDocument();
            doc.LoadXml("<html><head><title>Test</title></head><body><a href=\"test.html\"></a></body></html>");
            XmlWriterSettings settings = new XmlWriterSettings();
            settings.ConformanceLevel = ConformanceLevel.Auto;
            settings.Indent = true;
            Console.WriteLine(FormatXml(doc, settings));
            Console.ReadKey(true);
        }
 
        public static string FormatXml(XmlDocument doc, XmlWriterSettings settings)
        {
            StringBuilder sb = new StringBuilder(512);
 
            using (StringWriter stringWriter = new StringWriter(sb))
            {
                using (XmlWriter xmlWriter = new HtmlDocumentWriter(XmlWriter.Create(stringWriter, settings)))
                {
                    doc.Save(xmlWriter);
                }
            }
 
            return sb.ToString();
        }
    }
 
    public class HtmlDocumentWriter : XmlWriter
    {
        public static readonly string[] EmptyElements = new string[] {
            "area",
            "base",
            "basefont",
            "br",
            "col",
            "frame",
            "hr",
            "img",
            "input",
            "isindex",
            "link",
            "meta",
            "param"
        };
 
        protected string m_current;
        protected XmlWriter m_baseWriter;
 
        public HtmlDocumentWriter(XmlWriter baseWriter)
        {
            m_baseWriter = baseWriter;
        }
 
        public override void WriteEndElement()
        {
            if (Array.IndexOf<string>(EmptyElements, m_current) != -1)
            {
                m_baseWriter.WriteEndElement();
            }
            else
            {
                m_baseWriter.WriteFullEndElement();
            }
        }
 
        public override void WriteStartElement(string prefix, string localName, string ns)
        {
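            // Culture-invariant lowercasing, so the EmptyElements lookup does not depend on the current locale
            m_current = localName.ToLowerInvariant();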
            m_baseWriter.WriteStartElement(prefix, localName, ns);
        }
 
        public override void Close()
        {
            m_baseWriter.Close();
        }
 
        public override void Flush()
        {
            m_baseWriter.Flush();
        }
 
        public override string LookupPrefix(string ns)
        {
            return m_baseWriter.LookupPrefix(ns);
        }
 
        public override void WriteBase64(byte[] buffer, int index, int count)
        {
            m_baseWriter.WriteBase64(buffer, index, count);
        }
 
        public override void WriteCData(string text)
        {
            m_baseWriter.WriteCData(text);
        }
 
        public override void WriteCharEntity(char ch)
        {
            m_baseWriter.WriteCharEntity(ch);
        }
 
        public override void WriteChars(char[] buffer, int index, int count)
        {
            m_baseWriter.WriteChars(buffer, index, count);
        }
 
        public override void WriteComment(string text)
        {
            m_baseWriter.WriteComment(text);
        }
 
        public override void WriteDocType(string name, string pubid, string sysid, string subset)
        {
            m_baseWriter.WriteDocType(name, pubid, sysid, subset);
        }
 
        public override void WriteEndAttribute()
        {
            m_baseWriter.WriteEndAttribute();
        }
 
        public override void WriteEndDocument()
        {
            m_baseWriter.WriteEndDocument();
        }
 
        public override void WriteEntityRef(string name)
        {
            m_baseWriter.WriteEntityRef(name);
        }
 
        public override void WriteFullEndElement()
        {
            m_baseWriter.WriteFullEndElement();
        }
 
        public override void WriteProcessingInstruction(string name, string text)
        {
            m_baseWriter.WriteProcessingInstruction(name, text);
        }
 
        public override void WriteRaw(string data)
        {
            m_baseWriter.WriteRaw(data);
        }
 
        public override void WriteRaw(char[] buffer, int index, int count)
        {
            m_baseWriter.WriteRaw(buffer, index, count);
        }
 
        public override void WriteStartAttribute(string prefix, string localName, string ns)
        {
            m_baseWriter.WriteStartAttribute(prefix, localName, ns);
        }
 
        public override void WriteStartDocument(bool standalone)
        {
            m_baseWriter.WriteStartDocument(standalone);
        }
 
        public override void WriteStartDocument()
        {
            m_baseWriter.WriteStartDocument();
        }
 
        public override WriteState WriteState
        {
            get
            {
                return m_baseWriter.WriteState;
            }
        }
 
        public override void WriteString(string text)
        {
            m_baseWriter.WriteString(text);
        }
 
        public override void WriteSurrogateCharEntity(char lowChar, char highChar)
        {
            m_baseWriter.WriteSurrogateCharEntity(lowChar, highChar);
        }
 
        public override void WriteWhitespace(string ws)
        {
            m_baseWriter.WriteWhitespace(ws);
        }
    }
}
 

I haven't tested this completely, but it looks like it works. Let me know if it doesn't...
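
For reference, the System.Xml default described above can be reproduced without the library. This is only a minimal sketch using standard System.Xml classes; the class name EmptyElementDemo and the method Render are made up for illustration and are not part of PublicDomain. It shows that an element created through the DOM starts out flagged as empty and serializes as a self-closing tag, and that clearing XmlElement.IsEmpty makes OuterXml emit an explicit end tag.

```csharp
using System;
using System.Xml;

// Hypothetical demo class; not part of the PublicDomain library.
public static class EmptyElementDemo
{
    // Builds a lone <a> element through the DOM API and returns its markup.
    public static string Render(bool forceEndTag)
    {
        XmlDocument doc = new XmlDocument();

        // Elements made with CreateElement start out with IsEmpty == true.
        XmlElement anchor = doc.CreateElement("a");
        anchor.SetAttribute("name", "top");
        doc.AppendChild(anchor);

        if (forceEndTag)
        {
            // Clearing the flag makes the serializer write <a ...></a>.
            anchor.IsEmpty = false;
        }

        return doc.OuterXml;
    }
}
```

Render(false) returns <a name="top" /> and Render(true) returns <a name="top"></a>. The HtmlDocumentWriter above reaches the same result at write time by choosing between WriteEndElement and WriteFullEndElement, which leaves the DOM itself untouched.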

Thanks!
Jan 23, 2008 at 5:20 PM
schizoid,

Your posted code has an issue with attributes in the HTML I am trying to process. The error is: "The prefix '' cannot be redefined from '' to 'http://www.w3.org/1999/xhtml' within the same start element tag." In the end it's not something I would worry about. Your response is much appreciated, and I have a different solution to my problem, which involves adding a text node as a child of the elements I want to expand.
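
The text-node approach mentioned above could look roughly like the following. This is a sketch under assumptions, not the code actually used in the thread: the class name EmptyElementExpander, the method ExpandEmptyElements, and the idea of passing in the list of elements to leave self-closed are all illustrative, and it works on a plain XmlDocument rather than anything specific to LenientHtmlDocument.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical helper; not part of the PublicDomain library.
public static class EmptyElementExpander
{
    // Appends an empty text node to every childless element whose name is
    // not in keepSelfClosed, so that it serializes with an explicit end tag.
    public static void ExpandEmptyElements(XmlDocument doc, string[] keepSelfClosed)
    {
        // Copy the live node list first so the DOM is not modified while iterating it.
        List<XmlElement> elements = new List<XmlElement>();
        foreach (XmlNode node in doc.GetElementsByTagName("*"))
        {
            elements.Add((XmlElement)node);
        }

        foreach (XmlElement element in elements)
        {
            string name = element.LocalName.ToLowerInvariant();
            bool isVoid = Array.IndexOf<string>(keepSelfClosed, name) != -1;

            if (!isVoid && !element.HasChildNodes)
            {
                // An empty text child adds no output but stops the element
                // from being written in the short <a /> form.
                element.AppendChild(doc.CreateTextNode(string.Empty));
            }
        }
    }
}
```

For example, loading <body><a name="top" /><br /><p>x</p></body> into an XmlDocument and calling ExpandEmptyElements(doc, new string[] { "br" }) gives an OuterXml of <body><a name="top"></a><br /><p>x</p></body>. Unlike the writer-based approach, this changes the DOM, so it is best applied just before output.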

One thing I found is that LenientHtmlDocument has a defaultRootElementName that defaults to "html". In the case listed below, I end up with duplicated <html> elements in the final document. This happens because the DOCTYPE is ignored and the next sequence in the parsed HTML is "\r\n", which is converted to an XmlNodeType.Whitespace node, which in turn triggers the defaultRootElement to be appended to my document.

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>

My amateur fix is to modify your LenientXmlDocument as follows:

        protected virtual void InternalAppendChild(XmlNode child, bool mayHaveChildren)
        {
            XmlNode root = DocumentElement;
 
            // There's already a root but we may be trying to
            // add another element causing multiple roots, which
            // is not allowed
            // An additional check skips the default root when the first node is a line feed.
            if (root == null && child.NodeType != XmlNodeType.Element && child.Value != "\r\n")
            {
                AppendChild(GetDefaultDocumentNode());
                m_current = DocumentElement;
            }


Hope that helps.
Coordinator
Feb 11, 2008 at 4:37 AM
Hi jhersh,

In release 0.2.41.0, I have fixed the "prefix cannot be redefined" problem, and also the problem of multiple <html> elements.

Also, I've added a convenience OuterHtml property to LenientHtmlDocument to make it easier to grab well-formed HTML (it basically uses HtmlDocumentWriter):

LenientHtmlDocument doc = new LenientHtmlDocument();
doc.LoadXml(@"<a name=""top"" id=""top""></a>");
Console.WriteLine(doc.OuterHtml);

Thanks for reporting the bugs!
Kevin