
Regarding partial crawling - indexing

Answered (Verified) This post has 1 verified answer | 17 Replies | 2 Followers

Top 10 Contributor
83 Posts
InvestisDev posted on Thu, Oct 10 2013 11:23 PM

Hi,

We are currently looking into partial crawling and indexing. I also saw a previous post from this account at http://arachnode.net/forums/p/1747/16842.aspx. The problem is that we cannot store the already-crawled data in the database and filesystem, because of the number of sites we crawl and the size of the downloaded data.

My question is: is it possible to store the last-modified date in the index file and, based on that, crawl and download only the updated pages, updating the existing index file with just the changed documents? Maybe I am being optimistic here, but any solution close to the above would work great for us, given that we can't store the downloaded data on our resources. :)
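To make the idea concrete, roughly this is the incremental update we are hoping for, in Lucene.NET terms (a sketch only; Lucene.NET 3.x API assumed, and the field names and paths are ours, not AN's):

using System.IO;
using Lucene.Net.Analysis.Standard;
using Lucene.Net.Documents;
using Lucene.Net.Index;
using Lucene.Net.Store;

string absoluteUri = "http://www.example.com/page.aspx";  // a page the crawler found changed
string extractedText = "...";                             // text parsed from the fresh download

var directory = FSDirectory.Open(new DirectoryInfo(@"C:\Indexes\site1"));  // hypothetical per-site index folder
var analyzer = new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);

// Open the existing index (create: false) and replace only this one document.
using (var writer = new IndexWriter(directory, analyzer, false, IndexWriter.MaxFieldLength.UNLIMITED))
{
    var document = new Document();
    document.Add(new Field("absoluteuri", absoluteUri, Field.Store.YES, Field.Index.NOT_ANALYZED));
    document.Add(new Field("text", extractedText, Field.Store.NO, Field.Index.ANALYZED));

    // UpdateDocument deletes any document with the same key, then adds the new one.
    writer.UpdateDocument(new Term("absoluteuri", absoluteUri), document);
    writer.Commit();
}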

Thanks


All Replies

Top 10 Contributor
83 Posts

Also, you may already be aware of this, but to add to the question above, I would like to explain our architecture:

  • We have customized the AN.net solution to crawl 500+ of our clients' sites.
  • Crawling is done every night for the sites that have new data published from our CMS.
  • The issue lies here: even when only one page is updated/published, the full site is crawled again. :(
  • Site-wise index files are stored in individual folders.
  • A search service continuously reads these index files for results.
  • We do not store the downloaded webpages/files, and we clean up the database every time a site is crawled, due to feasibility issues.

Let me know if any further details are required.

Top 10 Contributor
1,905 Posts
Verified by arachnode.net

AN won't crawl a page if the Last-modified header indicates the page hasn't changed.

Continuous indexing/searching is supported out of the box.  As I understand it, your solution has changed the service that does this - perhaps something has changed on your end?

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
83 Posts

>>AN won't crawl a page if the Last-modified header indicates the page hasn't changed

I get the Last-Modified header part. But we don't store anything in the database or file system after crawling. So will this help us with partial crawling?

>>Continuous indexing/searching is supported/provided out of the box - as I understand your solution you have changed the service that does this - perhaps something has changed on your end?

We have customized some projects, but not the Service project. It wouldn't help anyway, because we don't use the AN service; we created a custom project to handle the dynamic list of sites that need to be crawled and indexed.

Top 10 Contributor
1,905 Posts

OK, insert the WebPages but don't insert the Source or other fields you don't need.  You will need to store something to tell AN the age/freshness of the content - otherwise, no, it won't help with partial crawling.  You should only need the AbsoluteUri and the ResponseHeaders.  You can drop all but the 'Last-modified' ResponseHeader.
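For example (these are the same ApplicationSettings discussed later in this thread; the last line is illustrative only, using the response object AN exposes on the crawl request):

ApplicationSettings.InsertWebPages = true;         // keep a row per page: AbsoluteUri, ResponseHeaders, dates
ApplicationSettings.InsertWebPageSource = false;   // don't store the downloaded HTML
ApplicationSettings.InsertWebPageMetaData = false; // skip extracted metadata if you don't query it

// Illustrative: the one header AN needs for the freshness check.
string lastModified = crawlRequest.WebClient.HttpWebResponse.Headers["Last-Modified"];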

The Web/WebService projects support continuous searching while a crawling process may be updating the index.
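On the Lucene side, continuous searching boils down to reopening the IndexReader when the crawler commits; a minimal sketch (Lucene.NET 3.x API assumed; reader/searcher are whatever your search service holds open):

using Lucene.Net.Index;
using Lucene.Net.Search;

// Refresh so queries see documents the crawl process has committed.
static void RefreshSearcher(ref IndexReader reader, ref IndexSearcher searcher)
{
    IndexReader newReader = reader.Reopen();  // returns the same instance when nothing has changed
    if (newReader != reader)
    {
        searcher.Dispose();
        reader.Dispose();
        reader = newReader;
        searcher = new IndexSearcher(reader);
    }
}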

Top 10 Contributor
1,905 Posts

If you want hints about new pages on a domain, sign up for Google Alerts - then parse the emails.

Top 10 Contributor
1,905 Posts

Yes, it is possible to store the date, but I would just use the built-in tables.

Turn off all indexes that aren't being used and optionally (preferably) disable all data inserts into the WebPages table for data you don't need.

You only need the AbsoluteUri and the ResponseHeaders - sorry, I didn't mean to mislead you in a previous post.  Look in DataManager.cs - this is where the Last-Modified filtering is executed.
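The check looks roughly like this (paraphrased; lastModified is the value stored for the WebPage from the previous crawl):

if (crawlRequest.WebClient.HttpWebResponse.LastModified > lastModified)
{
    // The server reports content newer than what we last recorded: process it.
    crawlRequest.ProcessData = true;
}
else
{
    // Nothing newer: skip parsing/indexing for this AbsoluteUri.
    crawlRequest.ProcessData = false;
}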

Top 10 Contributor
83 Posts

Mike,

I took Arachnode.net 2.6 + AN.Next and one sample website with 4 pages.

The ResponseHeaders were stored, and I also tried persisting all the data in the database as well as the file system.

On the first crawl, four entries were made in [arachnode.net].[dbo].[WebPages], with the LastModified field NULL.

On the second crawl, the LastDiscovered field changed to the latest datetime, but LastModified remained NULL, which I guess shouldn't be the case by now.

I also tried debugging the DataManager.cs code, where the request is made using the "HEAD" method.

"if ((crawlRequest.WebClient.HttpWebResponse).LastModified > lastModified)" evaluated to false, and crawlRequest.ProcessData was set to false.

Also, the log entries show that the full code path was executed for requests from the AN.Net crawler against the sample website.

So, am I missing something in the configuration here?

Top 10 Contributor
83 Posts

FYI, the following are the field values for the same discovery from [arachnode.net].[dbo].[WebPages]:

ResponseHeaders -
  Content-Encoding: gzip
  Vary: Accept-Encoding
  Connection: close
  Content-Length: 812
  Cache-Control: private
  Content-Type: text/html; charset=utf-8
  Date: Mon, 21 Oct 2013 12:23:05 GMT
  Last-Modified: Mon, 21 Oct 2013 10:11:15 GMT
  Set-Cookie: ASP.NET_SessionId=1cfpxs45novklz555m0h5zve; path=/; HttpOnly
  Server: Microsoft-IIS/7.5
  X-AspNet-Version: 2.0.50727
  X-Powered-By: ASP.NET

LastDiscovered - 2013-10-21 13:33:49.637

InitiallyDiscovered - 2013-10-21 13:33:21.820

LastModified - NULL

Top 10 Contributor
1,905 Posts

Not every website implements 'Last-modified'.  AN detects the modification, but of course, only after the source has been pulled down.
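You can verify what a site actually sends with a plain HEAD request (a standalone sketch; note that in .NET, HttpWebResponse.LastModified falls back to the current time when the header is absent, so inspect the raw header instead):

using System;
using System.Net;

var request = (HttpWebRequest)WebRequest.Create("http://www.example.com/");  // hypothetical URL
request.Method = "HEAD";  // ask for headers only; no body is downloaded

using (var response = (HttpWebResponse)request.GetResponse())
{
    string lastModified = response.Headers["Last-Modified"];
    Console.WriteLine(lastModified ?? "no Last-Modified header sent");
}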

Top 10 Contributor
83 Posts

OK, but for that we need to download all the data. What we are trying to do here is reduce the time taken by the second crawl. Is that possible?

Top 10 Contributor
1,905 Posts

Yes, I follow.  Don't insert the Source...  :)  ApplicationSettings.InsertWebPageSource = false;

Top 10 Contributor
83 Posts

We already have ApplicationSettings.InsertWebPageSource = false;

We have tried not requesting images at all from DataManager.cs, as we don't need images in the results, and that has improved the time somewhat.

Is there any possibility to squeeze it further? :)

Top 10 Contributor
1,905 Posts

You can disable indexes you don't need - and you can also skip inserting the ResponseHeaders.

Look at cfg.AllowedDataTypes - you can filter down to specific ContentTypes to download.

What are your full ApplicationSettings?

Top 10 Contributor
83 Posts

arachnode.net:
You can disable indexes you don't need

I didn't get this point. Can you please elaborate? We need the response headers.

arachnode.net:
Look at cfg.AllowedDataTypes

Will surely do that, as it seems a better option than code changes.

Also, I came across your blog post at http://arachnode.net/blogs/arachnode_net/archive/2009/03/11/400-performance-gain.aspx

I made the changes accordingly, and it looks great on the development machine. We will deploy it to live and hope for the best. :)

arachnode.net:
What are your full ApplicationSettings?

//ApplicationSettings can be set from code, overriding Database settings found in cfg.Configuration.
//ApplicationSettings.AllowedQueryString = ConfigurationManager.ConnectionStrings["AllowedQueryStringParams"].ConnectionString;
ApplicationSettings.AssignCrawlRequestPrioritiesForFiles = true;
ApplicationSettings.AssignCrawlRequestPrioritiesForHyperLinks = true;
ApplicationSettings.AssignCrawlRequestPrioritiesForImages = false; //true
ApplicationSettings.AssignCrawlRequestPrioritiesForWebPages = true;
ApplicationSettings.AssignEmailAddressDiscoveries = false;
ApplicationSettings.AssignFileAndImageDiscoveries = true;
ApplicationSettings.AssignHyperLinkDiscoveries = true;
ApplicationSettings.ClassifyAbsoluteUris = false;
//ApplicationSettings.ConnectionString = "";
//ApplicationSettings.ConsoleOutputLogsDirectory = "";
ApplicationSettings.CrawlRequestTimeoutInMinutes = 1;
ApplicationSettings.CreateCrawlRequestsFromDatabaseCrawlRequests = true;
ApplicationSettings.CreateCrawlRequestsFromDatabaseFiles = false;
ApplicationSettings.CreateCrawlRequestsFromDatabaseHyperLinks = false;
ApplicationSettings.CreateCrawlRequestsFromDatabaseImages = false;
ApplicationSettings.CreateCrawlRequestsFromDatabaseWebPages = false;
ApplicationSettings.DesiredMaximumMemoryUsageInMegabytes = 1024;
//ApplicationSettings.DownloadedFilesDirectory = "";
//ApplicationSettings.DownloadedImagesDirectory = "";
//ApplicationSettings.DownloadedWebPagesDirectory = "";
ApplicationSettings.EnableConsoleOutput = true;
ApplicationSettings.ExtractFileMetaData = true;
ApplicationSettings.ExtractImageMetaData = false;
ApplicationSettings.ExtractWebPageMetaData = true;
ApplicationSettings.HttpWebRequestRetries = 5;
ApplicationSettings.InsertDisallowedAbsoluteUriDiscoveries = false;
ApplicationSettings.InsertDisallowedAbsoluteUris = true;
ApplicationSettings.InsertEmailAddressDiscoveries = false;
ApplicationSettings.InsertEmailAddresses = false;
ApplicationSettings.InsertExceptions = true;
ApplicationSettings.InsertFileDiscoveries = false;
ApplicationSettings.InsertFileMetaData = true;
ApplicationSettings.InsertFiles = true;
ApplicationSettings.InsertFileSource = false;
ApplicationSettings.InsertHyperLinkDiscoveries = true; // false
ApplicationSettings.InsertHyperLinks = true; //false
ApplicationSettings.InsertImageDiscoveries = false;
ApplicationSettings.InsertImageMetaData = false;
ApplicationSettings.InsertImages = false; // true
ApplicationSettings.InsertImageSource = false;
ApplicationSettings.InsertWebPageMetaData = true;
ApplicationSettings.InsertWebPages = true;
ApplicationSettings.InsertWebPageSource = false;
ApplicationSettings.MaximumNumberOfCrawlRequestsToCreatePerBatch = 1000;
int maxCrawlThreads = Convert.ToInt32(ConfigurationManager.AppSettings["MaxCrawlThreads"]);
ApplicationSettings.MaximumNumberOfCrawlThreads = maxCrawlThreads;
ApplicationSettings.MaximumNumberOfHostsAndPrioritiesToSelect = 10000;
ApplicationSettings.OutputConsoleToLogs = false;
ApplicationSettings.OutputStatistics = false;
ApplicationSettings.SaveDiscoveredFilesToDisk = true;
ApplicationSettings.SaveDiscoveredImagesToDisk = false; //true
ApplicationSettings.SaveDiscoveredWebPagesToDisk = true;
ApplicationSettings.SqlCommandTimeoutInMinutes = 60;
ApplicationSettings.UserAgent = "Investis Search Crawler"; // "Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/534.24 (KHTML, like Gecko) Chrome/11.0.696.71 Safari/534.24"; //If you find yourself blocked from crawling a website, change this to a common crawler string, such as 'Googlebot' or 'Slurp'...
ApplicationSettings.VerboseOutput = false;

We are looking to make the following change:

ApplicationSettings.InsertWebPageMetaData = false;

Do you see any other changes that would help us?

Thanks, you've been a great help, Mike! :)
