arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE or MongoDB/RavenDB/Hadoop

Result shows numbers while crawling a PDF file.

Answered (Verified) This post has 2 verified answers | 16 Replies | 29 Followers

Top 10 Contributor
83 Posts
InvestisDev posted on Wed, Jun 13 2012 5:43 AM

Hello,

While crawling a PDF file, the search result shows the file name with its URL on the first line, followed by the description of the PDF file. The description shows numbers in place of the PDF's actual text.

The PDF we crawled was:

0876.Atkins_Renewables.pdf

Please suggest a way to solve this.

Let me know if you need any further details.

Thanks,

Answered (Verified) Verified Answer

Top 10 Contributor
83 Posts
Verified by InvestisDev

Hello,

I tried to get a solution from the iTextSharp community but have had no reply so far. In the meantime, I tried another iTextSharp method:

iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage()

With this method, the PDF attached to this thread, which previously gave numbers as the result, now shows proper data.

Could you please take a look at this method and confirm whether it is fine to use in place of the existing one?

I changed the code in the "PerformAction()" method of the "Arachnode.Plugins.CrawlActions.ManageLuceneDotNetIndexes" class, and with that change I get the text of the PDF as well.
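For readers who want to try this, here is a minimal sketch of page-by-page extraction with GetTextFromPage() and an explicit text extraction strategy. It is not the actual change made in PerformAction(): the class and method names (PdfTextSketch, ExtractText) are hypothetical, the choice of SimpleTextExtractionStrategy is an assumption, and it assumes iTextSharp 5.x.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextSketch
{
    // Hypothetical helper (not part of arachnode.net): returns the text of every page of a PDF file.
    public static string ExtractText(string pdfPath)
    {
        PdfReader reader = new PdfReader(pdfPath);
        try
        {
            StringBuilder text = new StringBuilder();

            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Assumption: SimpleTextExtractionStrategy; the strategy used in the thread is not stated.
                // A new instance is created per page because a strategy keeps the text it has already seen.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }

            return text.ToString();
        }
        finally
        {
            // Release the underlying file handle.
            reader.Close();
        }
    }
}
```

The returned string could then be handed to whatever code builds the Lucene.NET document; how that is wired into PerformAction() is not shown here.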

Let me know if you have any concerns.

Thanks,

 

Top 10 Contributor
83 Posts
Verified by InvestisDev

I got that method from a forum, but the same method has one more overload, which does not require a text extraction strategy to be specified.

Using that overload, I crawled the same PDF file that was not showing proper data, and it now shows the actual content.

Please let me know whether we can use this as a solution. The change is again in the "PerformAction()" method of the "Arachnode.Plugins.CrawlActions.ManageLuceneDotNetIndexes" class.
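A rough sketch of the overload without a strategy argument, again assuming iTextSharp 5.x and not taken from the actual change. The names PdfTextOverloadSketch and ExtractText are hypothetical, and it assumes the PDF is already in memory as a byte array. As far as I know, this overload falls back to a location-based extraction strategy internally, but please verify that against your iTextSharp version.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class PdfTextOverloadSketch
{
    // Hypothetical helper (not part of arachnode.net): extracts text from PDF bytes
    // using the two-argument GetTextFromPage overload, so no strategy is passed in.
    public static string ExtractText(byte[] pdfBytes)
    {
        PdfReader reader = new PdfReader(pdfBytes);
        try
        {
            StringBuilder text = new StringBuilder();

            // iTextSharp page numbers are 1-based.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page));
            }

            return text.ToString();
        }
        finally
        {
            reader.Close();
        }
    }
}
```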

 

 

Thanks,

All Replies

Top 10 Contributor
1,905 Posts

Thanks for the heads-up on this fix... I really appreciate it.

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
83 Posts

Thanks for your support.

- VC


copyright 2004-2017, arachnode.net LLC