arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

What is the best method to parse HTML tags?

Milan Solanki posted on Thu, Oct 22 2009 10:23 AM

What would be the best method of parsing HTML? I want to get all the tags into an array, using as little CPU as possible.

Normally, for parsing, you need to specify which pieces of information you want to extract.

To avoid parsing problems and extracting the wrong information from a web page, it would be good to find a known piece of text in the page, get its index, and add to that index until you reach the right info.

For example, for a jobs site:

index

0-[Home]
1-[Job Title] 2-[asp.net developer]
3-[City] 4-[New York City]

Here we have parsed the page into an indexed array. Now, to extract the job title of this job, we find the text "Job Title", get its index, add 1 to it, and extract the item at that position. This helps avoid extracting the wrong info if the page is updated.

So, to achieve this, I need to parse and extract all tags into an array. What would be the best and easiest method?

mshtml, HtmlAgilityPack... or something else?
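
The index-and-offset idea described above can be sketched roughly as follows. This is only an illustration in Python's standard library, not mshtml or HtmlAgilityPack; the markup in `SAMPLE_PAGE` and the names `TextCollector` and `value_after` are made up for the example.

```python
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collects the non-empty text pieces of a page, in document order."""

    def __init__(self):
        super().__init__()
        self.items = []

    def handle_data(self, data):
        # Skip whitespace-only runs between tags.
        text = data.strip()
        if text:
            self.items.append(text)


def value_after(items, label):
    """Return the item that follows `label`, or None if there is none."""
    try:
        position = items.index(label)
    except ValueError:
        return None
    return items[position + 1] if position + 1 < len(items) else None


# Hypothetical markup, invented for this example.
SAMPLE_PAGE = """
<ul>
  <li><a href="/">Home</a></li>
  <li><b>Job Title</b> <span>asp.net developer</span></li>
  <li><b>City</b> <span>New York City</span></li>
</ul>
"""

collector = TextCollector()
collector.feed(SAMPLE_PAGE)
collector.close()
items = collector.items

job_title = value_after(items, "Job Title")
city = value_after(items, "City")
```

Looking up the label text instead of hard-coding position 2 means the lookup still works if the site adds items before "Job Title". It still assumes the value is the very next text piece after its label, so it breaks if the page puts anything in between.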

 

Verified Answer

Verified by arachnode.net

Use HtmlAgilityPack and query with XPath. Hands down.
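
To give a feel for what "query with XPath" looks like, here is a minimal sketch. It is not HtmlAgilityPack code: it uses Python's standard `xml.etree.ElementTree`, which supports only a subset of XPath and needs well-formed markup, whereas HtmlAgilityPack is a .NET library that tolerates broken HTML (there you would load the page with `HtmlDocument.LoadHtml` and query via `DocumentNode.SelectNodes` or `SelectSingleNode`). The sample table and the helper name `field` are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed markup invented for this example.
SAMPLE = """
<html><body>
  <table>
    <tr><td>Job Title</td><td>asp.net developer</td></tr>
    <tr><td>City</td><td>New York City</td></tr>
  </table>
</body></html>
"""

root = ET.fromstring(SAMPLE.strip())


def field(root, label):
    """Return the text of the second cell in the row whose cell text is `label`."""
    # Select the first <tr> that has a <td> child whose text equals the label.
    row = root.find(f".//tr[td='{label}']")
    if row is None:
        return None
    # Positions in XPath are 1-based, so td[2] is the value cell.
    cell = row.find("td[2]")
    return cell.text if cell is not None else None


job_title = field(root, "Job Title")
city = field(root, "City")
```

The query names the label it is looking for, so it keeps working when rows are added or reordered, without building an array of every tag first.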

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies

Kevin replied on Fri, Oct 23 2009 2:15 PM

Milan, if you don't have it already, here's a link to the HtmlAgilityPack docs:

http://htmlagilitypack.codeplex.com/Release/ProjectReleases.aspx?ReleaseId=33903

 

 


 

Milan Solanki:

> What would be the best method of parsing html

 

I do this daily. I use biterscripting to parse our own web pages and extract all kinds of info from them in all kinds of formats. You can start with the sample script posted at http://www.biterscripting.com/SS_WebPageToText.html, as I did. To try it, simply enter the following command in biterscripting.

 

script SS_WebPageToText.txt page("http://arachnode.net/forums/t/710.aspx")

 

It will show you this very page by extracting the plain text from it. If you don't have biterscripting, you can download it for free from any download site.

 

Jenni

 


copyright 2004-2017, arachnode.net LLC