arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE and MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub


Where to enter the initial crawl request?

Answered (Verified) This post has 1 verified answer | 16 Replies | 2 Followers

Top 25 Contributor
26 Posts
simonjng posted on Sat, Nov 7 2009 4:14 AM

Hi - I have been through all the various threads and the inline commentary in the code itself, but I can't for the life of me see exactly where to enter the "ad.InsertCrawlRequest..." code.

I'm in the ArachnodeDAO.cs file and the following seems to be the place... but where?!

       }

        /// <summary>
        /// Inserts the crawl request.
        /// See: http://arachnode.net/forums/p/564/10762.aspx#10762 if you experience difficulty with this function.
        /// Use this function if you want to insert CrawlRequests into the database.  The overload with the split 'currentDepth' and 'maximumDepth' parameters is intended for use by the Crawler itself.
        /// </summary>
        /// <param name="created">The created.</param>
        /// <param name="absoluteUri1">The absolute uri1.</param>
        /// <param name="absoluteUri2">The absolute uri2.</param>
        /// <param name="depth">The depth.</param>
        /// <param name="restrictCrawlTo">The restrict crawl to.  Restricting a Crawl to a specific UriClassificationType means that the Crawl will only crawl WebPages that match the initial UriClassificationType.</param>
        /// <param name="restrictDiscoveriesTo">The restrict discoveries to.  Restricting a Crawl's Discoveries to a specific UriClassificationType means that the Crawl will only collect Discoveries that match the initial UriClassificationType.</param>
        /// <param name="priority">The priority.</param>
        /// <returns></returns>
       
        public long? InsertCrawlRequest(DateTime created, string absoluteUri1, string absoluteUri2, int depth, byte restrictCrawlTo, byte restrictDiscoveriesTo, double priority)
        {
            return InsertCrawlRequest(created, absoluteUri1, absoluteUri2, 1, depth, restrictCrawlTo, restrictDiscoveriesTo, priority);
        }

 

If I insert the call above the "public long?..." line, I get errors around the opening ( and the , after DateTime.Now.

I've obviously overwritten something critical now, as I'm getting repetitions of the following warning:

Warning    5    The variable 'i' is assigned but its value is never used    G:\Text analysis\Arachnode\SiteCrawler\Core\Cache.cs    306    21    SiteCrawler

I've really been trying to follow the examples in the forum posts, but some seem to refer to code that isn't in v1.3, and none of the code examples show which lines come before or after, so I don't know where in the appropriate project files to insert them.

I know it's boring to handhold a C# newbie through what is obvious to anyone who recognises the code, so apologies for that!

Thanks

Simon

Verified Answer

Top 10 Contributor
1,905 Posts

Simon -

See Program.cs... (in the 'crawl test code' region...)

_crawler.Crawl(new CrawlRequest(new Discovery("http://fark.com"), int.MaxValue, UriClassificationType.None, UriClassificationType.None, 1));

The warnings you see are debug artifacts I have left behind.  As long as the solution builds, you are fine.

The DAO methods do work, but it is also valid to submit a CrawlRequest directly to Crawler.cs.
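For later readers, here is a sketch showing both entry points together.  The Crawler call is the line quoted above; the DAO call assumes the v1.3 InsertCrawlRequest signature quoted earlier in this thread, and the ArachnodeDAO constructor arguments, if any, are omitted, so treat this as illustrative rather than copy-paste ready:

```csharp
// Place either of these in Program.cs (the 'crawl test code' region),
// NOT inside ArachnodeDAO.cs itself.

// Option 1: submit the request directly to the Crawler:
_crawler.Crawl(new CrawlRequest(new Discovery("http://fark.com"),
    int.MaxValue,                   // depth
    UriClassificationType.None,     // restrictCrawlTo
    UriClassificationType.None,     // restrictDiscoveriesTo
    1));                            // priority

// Option 2: insert the request into the database via the DAO
// (sketch - constructor arguments, if any, are omitted):
ArachnodeDAO arachnodeDAO = new ArachnodeDAO();
arachnodeDAO.InsertCrawlRequest(
    DateTime.Now,                       // created
    "http://fark.com",                  // absoluteUri1
    "http://fark.com",                  // absoluteUri2
    int.MaxValue,                       // depth
    (byte)UriClassificationType.None,   // restrictCrawlTo
    (byte)UriClassificationType.None,   // restrictDiscoveriesTo
    1);                                 // priority
```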

Always glad to answer your questions!

-Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

All Replies


Top 25 Contributor
26 Posts

Hi there - getting further forward... slowly, slowly, ever so slowly!

Thanks for the above - I thought from earlier forum posts/responses that using the DAO was the preferred method, but I'll use whatever works!

I've got an exception popping up in Crawler.cs - pasted below:

System.NullReferenceException was unhandled
  Message="Object reference not set to an instance of an object."
  Source="Arachnode.SiteCrawler"
  StackTrace:
       at Arachnode.SiteCrawler.Crawler.Crawl(CrawlRequest crawlRequest) in G:\Text analysis\Arachnode\SiteCrawler\Crawler.cs:line 191
       at Arachnode.Console.Program.Main() in G:\Text analysis\Arachnode\Console\Program.cs:line 81
       at System.AppDomain._nExecuteAssembly(Assembly assembly, String[] args)
       at Microsoft.VisualStudio.HostingProcess.HostProc.RunUsersAssembly()
       at System.Threading.ExecutionContext.Run(ExecutionContext executionContext, ContextCallback callback, Object state)
       at System.Threading.ThreadHelper.ThreadStart()
  InnerException:

Any idea what this might be?

(I had to resolve the UriClassificationType reference in the code above before getting this far - have I caused the exception?)

Thanks

Simon

Top 10 Contributor
1,905 Posts

Great!  Progress!  :)

I haven't seen this exception before, but I am glad to take a look.  Let me know when you have a TeamViewer session ready?

(resolving the reference wouldn't have caused this...)

Mike


Top 25 Contributor
26 Posts

Hi Mike - in about 20 to 30 mins suit you?

Simon

Top 25 Contributor
26 Posts

Thanks, Mike...

All was running well, until I got two BSODs.  Console ran for about 20 secs both times pre <kaboom>.

I've got the report from the second one, but I've left the office for the evening - it's getting late, it's freezing cold, and sausage and mash were waiting.  The database got corrupted by the second crash, so I'll have to delete, restore, re-reset and re-upgrade etc.  I'll post the report here tomorrow AM my time.

Don't know if this is going to turn out to be memory-related (as it ran for a while before crashing), but I'll be interested to see if I can turn the memory config setting I saw in the database earlier (set to 1GB as default) up quite a bit, as I've got 12GB to get through. Will AN make use of the 64-bit RAM headroom?
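A note for anyone reading later: if the 1GB setting mentioned above is SQL Server's 'max server memory' option (an assumption - it may equally be an arachnode.net configuration value), it can be raised with sp_configure:

```sql
-- Assumption: the 1GB default is SQL Server's 'max server memory' (in MB).
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max server memory (MB)', 8192;  -- leave headroom below the 12GB total
RECONFIGURE;
```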

Onwards and upwards!

Cheers

Simon


Top 10 Contributor
1,905 Posts

Have you run a full MEMTEST on your machine?

If your DB was corrupted it is a strong sign you have bad RAM.  Sad

AN makes full use of the 64-bit address space and operates across multiple processors.

To the moon!

-Mike


Top 25 Contributor
26 Posts

Hi - just put all 12GB through a workout via Memtest 86+. Passed with flying colours.

Here's the error report from the second BSOD yesterday:

Problem signature:
  Problem Event Name:    BlueScreen
  OS Version:    6.0.6002.2.2.0.256.1
  Locale ID:    2057

Additional information about the problem:
  BCCode:    1e
  BCP1:    FFFFFFFFC0000005
  BCP2:    FFFFF800046FA21E
  BCP3:    0000000000000000
  BCP4:    0000000000000000
  OS Version:    6_0_6002
  Service Pack:    2_0
  Product:    256_1

Files that help describe the problem:
  C:\Windows\Minidump\Mini110909-02.dmp
  C:\Users\Simon\AppData\Local\Temp\WER-162022-0.sysdata.xml
  C:\Users\Simon\AppData\Local\Temp\WER209A.tmp.version.txt

Of course, it's possible that this isn't an AN problem (I'm not intending to use the forum as a free general IT service!), but I think this is the only BSOD I've experienced on this machine in 18 months (something of a miracle bearing in mind I'm running Vista x64), so I imagine there's a connection somewhere, even if it isn't an AN fault per se.

Couldn't the DB corruption simply be due to the machine blue-screening during a write?  I couldn't access the DB anyway - it was listed as 'suspect' in SQL Server, and neither SQL nor VS would connect - so I've recreated it entirely.

Per ardua ad astra

Simon

Top 10 Contributor
1,905 Posts

What is the faulting module in the crash dumps?  Do you know how to use WinDbg to view the source of the crash?
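For reference, a typical workflow for finding the faulting module from a minidump with WinDbg (the dump path is the one from the report above; these are standard WinDbg commands, not AN-specific):

```
windbg -z C:\Windows\Minidump\Mini110909-02.dmp

(then, at the WinDbg command prompt)
!analyze -v     <- automated analysis: bugcheck code, probable cause, faulting module
lm              <- list the loaded modules
```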

AN makes only one PInvoke call/hook/reference (everything else is .NET/managed code), so I doubt that AN is causing the problem.  What about a bad block on your HDD?

SQL is supposed to conform to ACID, so if the machine goes down and the H/W is AOK, the DB shouldn't be corrupted.  Are there errors in the System log indicating disk failures?

Always glad to help!

Mike


Top 25 Contributor
26 Posts

Well... having re-configed and backed the DB up this time (!), I tinkered with a few settings and <F5>'d again...  this time, no problems!

I started by turning DOWN the crawl depth and turning up the memory allocation (yes, I know changing two things at once doesn't help the analysis) - crawl depth to 1 and memory allocation up to 4GB.

It crawled a couple of shallow sites to depth 1 pretty fast, so I set depth back to int.MaxValue and unleashed it again. No BSOD after about 5 mins of crawling.

When I've got time, I'll decrease the memory allocation again to see if I can recreate the issue, but for the moment I'm very happy to be running and looking forward to twiddling with some settings rather than debugging.

Thanks

Simon

Top 10 Contributor
1,905 Posts

Ahh... computers.

Glad to hear you are back up and running... Big Smile


Top 25 Contributor
26 Posts

Well - I was for a while!  Got another BSOD about 15 mins after my last post :(

Same error output as before.  The two XML files listed in the report didn't seem to be there, but the 'minidump' file is screengrabbed below for what it's worth (it doesn't seem to say anything?)...

To answer your question, I haven't used any error tracking methods before, but I'll search on your suggestion and see if I can come up with something more meaningful later today.

At least I managed to run a couple of crawls and gather some DB content, so I can now focus on the actual functionality as well as configuration issues!

Cheers

Simon


Top 10 Contributor
1,905 Posts

What events, if any, are in the Event Logs?


Top 25 Contributor
26 Posts

Hi Mike - I have emailed you the last 24hrs' worth of Windows > Application events (errors, warnings and criticals only).

I've done complete surface scans of every disk in and attached to this machine now - the only errors were found on an external HDD that isn't referenced in any way by AN.
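(For later readers: a surface scan on Windows can be run with chkdsk from an elevated command prompt; the /r switch locates bad sectors and recovers readable information.  Scanning the system volume schedules the check for the next reboot.  The drive letter below is illustrative:)

```
chkdsk D: /r
```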

One thing I have managed to do (and I post this here for the potential greater good) is finally get rid of Acronis True Image Home, which I used to clone a drive after a complete OS fail a few months ago.  A nastier, more pernicious piece of cr*pware can hardly exist: Acronis' own instructions for when the uninstall fails (as it always does on Vista, apparently) involve manually removing registry keys and settings and manually deleting about 20 background services that were causing the machine to take about 15 mins to shut down.

Cheers

Simon

Top 10 Contributor
1,905 Posts

Is that external HDD being used as a page file?



copyright 2004-2017, arachnode.net LLC