Using ScrapedPage.ConvertToTag

Topics: User Forum
Dec 6, 2007 at 10:59 PM
Hi, I love the screen scraping stuff in PublicDomain. I was just wondering if someone could give me an example of using ScrapedPage.ConvertToTag and ScrapedPage.ConvertToTagList?

Thanks!
Coordinator
Dec 9, 2007 at 5:09 AM
Hi bigjoe714,

To be honest, the ScreenScraper stuff is one of the oldest pieces of the code, and one of the more poorly designed pieces. I hobbled some of those static methods together just to get a little screen scraper working on bankofamerica.com, so there wasn't much thought put into the API. For this specific method, I think I used it for something like:

ScreenScraperTag tag = ScrapedPage.ConvertToTag("<td align=\"right\" style=\"background-color: white\">test</td>", true);
foreach (string attributeName in tag.Attributes.AllKeys)
{
   string attributeValue = tag.Attributes[attributeName];
   Console.WriteLine("{0}={1}", attributeName, attributeValue);
}
 
//Output:
// align=right
// style=background-color: white
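
As for ConvertToTagList, I believe it just runs the same parsing over a fragment containing multiple tags. I'm writing this from memory, so treat the return type and signature here as a guess, but usage would look something like:

using System;
using System.Collections.Generic;
using PublicDomain.ScreenScraper;
 
...
 
// Assumed: ConvertToTagList parses each tag in the fragment into a ScreenScraperTag
List<ScreenScraperTag> tags = ScrapedPage.ConvertToTagList("<td align=\"left\">one</td><td align=\"right\">two</td>", true);
foreach (ScreenScraperTag tag in tags)
{
    foreach (string attributeName in tag.Attributes.AllKeys)
    {
        Console.WriteLine("{0}={1}", attributeName, tag.Attributes[attributeName]);
    }
}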

Recently I've only been using the ScreenScraper for its redirecting and cookie handling capabilities and basic regex stuff, but what I'd like to do in the future is really make it conducive to screen scraping with logic (perhaps even XML-directive-based scraping). This would include things like extracting all the FORMs on a page, all of their respective INPUTs, etc. One other cool thing would be to use the recently built PublicDomain.Xml.LenientXmlDocument parser, which turns any HTML into a well-formed XmlDocument object on which you can execute XPath, etc.

If you have specific requests, let me know and I might be able to code them up quickly, although a fuller redesign as above will probably take some time (although feel free to contribute! :-D).

Thanks,
Kevin
Coordinator
Dec 9, 2007 at 5:15 AM
Also, just to note that ScrapedPage.ConvertToTag breaks pretty easily on malformed HTML. If you want to extract attributes or tags, I would suggest something like this instead:

using System.Xml;
using PublicDomain.ScreenScraper;
using PublicDomain.Xml;
 
...
 
Scraper scraper = new Scraper();
ScrapedPage page = scraper.Scrape(ScrapeType.GET, "http://www.google.com/");
LenientHtmlDocument html = new LenientHtmlDocument();
html.LoadXml(page.RawStream);
foreach (XmlElement td in html.SelectNodes("//td"))
{
    // Now we can do things like td.Attributes["align"], etc.
}
 
Dec 9, 2007 at 3:48 PM
Kevin,
Thanks for the info. I have been using the Scraper classes in PublicDomain only to retrieve the HTML, and then another class I got from CodeProject to parse the HTML into objects. It has been working really well; I just noticed ScreenScraperTag the other day and thought maybe I could use PublicDomain for all the parsing instead of also using the other class.
Thanks again,
Joe
Coordinator
Dec 9, 2007 at 9:15 PM
Hey Joe, out of curiosity, what is the other project you're using? I'd suggest the LenientHtmlDocument; I haven't found anything like it out there. There was an open source project called HtmlAgilityPack, but it made some weird decisions sometimes. I wouldn't suggest using ScreenScraperTag; it's one of those pieces of code that I'm not too proud of :-D

Good luck,
Kevin
Dec 15, 2007 at 5:25 PM
Kevin,
I am using the DOL HTML Parser from CodeProject, from the article "A non-well-formed HTML Parser and CSS Resolver":
http://www.codeproject.com/KB/cs/DOLHTMLParser.aspx?df=100&forumid=398849&exp=0&fr=26&select=1966853

It is easy to use; for example:

// resultsPage is the ScrapedPage retrieved with PublicDomain's Scraper
DHtmlDocument htmlDoc = new DOL.DHtml.DHtmlParser.DHtmlDocument(resultsPage.RawStream, htmlParser);
DHtmlNodeCollection tables = htmlDoc.Nodes.FindByName("table", true);

// Grab the eighth TABLE on the page (index 7)
DHtmlElement table = (DHtmlElement)tables[7];

// Find the A (link) elements under that table
DHtmlNodeCollection linkNodes = table.Nodes.FindByName("a", true);
foreach (DHtmlElement linkElem in linkNodes)
{
    string thisUrl = linkElem.Attributes["href"].Value;
}

However, I'm going to check out LenientHtmlDocument now. Thanks!
-Joe
Coordinator
Dec 16, 2007 at 2:53 AM
Thanks Joe, I didn't know about that package. I haven't tried it out, but it looks really good and easy to use.
Dec 17, 2007 at 12:40 PM
Kevin,
The only problem I am having with the DOL HTML parser is that when I parse a very large HTML file, it pegs the CPU and uses tons of memory. I was trying to find info about LenientHtmlDocument. Can you tell me more about it? Do you think it would help?
Thanks,
Joe
Coordinator
Dec 17, 2007 at 3:21 PM
Hey Joe,

I haven't done scale or performance testing, but I tried to design for efficiency. LenientHtmlDocument is based on LenientXmlDocument, which uses the System.Xml namespace for most of its data modeling. As far as I can tell, DOL HTML has created its own abstraction over the HTML document, whereas LenientXmlDocument actually subclasses System.Xml.XmlDocument and creates nodes using System.Xml, so in that sense it could be either more or less efficient. I haven't added many of the convenience methods that DOL HTML has, such as getting particular FORMs or TABLEs, but since the result is just a big XML document, you can use XPath instead.
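
For example, extracting every FORM and its INPUTs becomes a couple of XPath queries. A rough sketch (untested, written from memory; I'm also assuming the parser normalizes element names to lowercase):

using System;
using System.Xml;
using PublicDomain.ScreenScraper;
using PublicDomain.Xml;
 
...
 
Scraper scraper = new Scraper();
ScrapedPage page = scraper.Scrape(ScrapeType.GET, "http://www.google.com/");
LenientHtmlDocument html = new LenientHtmlDocument();
html.LoadXml(page.RawStream);
 
// Every FORM on the page, then every INPUT nested anywhere beneath it
foreach (XmlElement form in html.SelectNodes("//form"))
{
    Console.WriteLine("form action={0}", form.GetAttribute("action"));
    foreach (XmlElement input in form.SelectNodes(".//input"))
    {
        Console.WriteLine("  input name={0}", input.GetAttribute("name"));
    }
}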

As far as implementation, LenientXmlDocument (c:\Program Files\Public Domain\xml\LenientXmlDocument.cs) uses a state machine to emulate a language parser, which should be more efficient than a simple LR parser generated from a grammar. I also try to minimize string allocations.
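
To give a rough idea of what I mean by a state machine (a simplified illustration only, not the actual LenientXmlDocument code): the parser walks the input exactly once, character by character, switching between states instead of backtracking, and only materializes a string when a token is complete.

using System;
using System.Text;
 
...
 
private enum State { InText, InTag }
 
public static void Tokenize(string html)
{
    State state = State.InText;
    StringBuilder buffer = new StringBuilder();
    foreach (char c in html)
    {
        switch (state)
        {
            case State.InText:
                if (c == '<')
                {
                    if (buffer.Length > 0)
                    {
                        Console.WriteLine("TEXT: {0}", buffer);
                        buffer.Length = 0;
                    }
                    state = State.InTag;
                }
                else
                {
                    buffer.Append(c);
                }
                break;
            case State.InTag:
                if (c == '>')
                {
                    Console.WriteLine("TAG:  {0}", buffer);
                    buffer.Length = 0;
                    state = State.InText;
                }
                else
                {
                    buffer.Append(c);
                }
                break;
        }
    }
    // Lenient: leftover input at EOF is still emitted instead of raising an error
    if (buffer.Length > 0)
    {
        Console.WriteLine("LEFTOVER: {0}", buffer);
    }
}
 
The real parser has many more states (attributes, entities, comments, and so on), but that's the general shape: one pass, no backtracking, and few intermediate string allocations.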

So, hope that helps, but to answer your question: I don't know how the two compare; you'll have to try it. My guess is that both will peg the CPU, since this is pure CPU processing rather than I/O-bound work, so it's just a matter of which implementation is more efficient.
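
If you want rough numbers without firing up a profiler, you can wrap each parse in System.Diagnostics.Stopwatch and GC.GetTotalMemory. A sketch (the file name is a placeholder; do the same thing around the DHtmlDocument constructor for the comparison):

using System;
using System.Diagnostics;
using System.IO;
using PublicDomain.Xml;
 
...
 
string html = File.ReadAllText("large-page.html"); // placeholder: your big HTML file
 
long before = GC.GetTotalMemory(true); // force a collection for a cleaner baseline
Stopwatch watch = Stopwatch.StartNew();
 
LenientHtmlDocument doc = new LenientHtmlDocument();
doc.LoadXml(html);
 
watch.Stop();
long after = GC.GetTotalMemory(false);
Console.WriteLine("Parse took {0} ms, ~{1} KB allocated", watch.ElapsedMilliseconds, (after - before) / 1024);
 
Run each parser in a separate process (or at least in separate runs) so one parser's garbage isn't billed to the other.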

I would also suggest the Microsoft CLR Profiler to profile CPU and memory usage of a test app:

CLR Profiler 2.0:
http://www.microsoft.com/downloads/details.aspx?FamilyId=A362781C-3870-43BE-8926-862B40AA0CD0&displaylang=en
http://msdn2.microsoft.com/en-us/library/ms979205.aspx


Good luck, and let me know your results,
Thanks,
Kevin