arachnode.net
An Open Source C# web crawler with Lucene.NET search using SQL Server 2008/2012/2014/2016/CE
An Open Source C# web crawler with Lucene.NET search using MongoDB/RavenDB/Hadoop

Completely Open Source @ GitHub

Result shows numbers while crawling a PDF file.

Answered (Verified) This post has 2 verified answers | 16 Replies | 29 Followers

Top 10 Contributor
83 Posts
InvestisDev posted on Wed, Jun 13 2012 5:43 AM

Hello,

When crawling a PDF file, the search result shows the description below,

where the first line is the file name with its URL and the text beneath it is the description of the PDF file. The description shows numbers in place of the PDF's text.

The PDF we crawled was:

0876.Atkins_Renewables.pdf

Please suggest a way to solve this.

Let me know if you need any further details.

Thanks,

All Replies

Top 10 Contributor
1,905 Posts

Hmm... I don't see this.  What is the full AbsoluteUri of the .pdf from atkins, etc.?

It looks like changes have been made to the source over there.  Could this be the cause?  Can you perform a diff between your source and the trunk?

Mike

For best service when you require assistance:

  1. Check the DisallowedAbsoluteUris and Exceptions tables first.
  2. Cut and paste actual exceptions from the Exceptions table.
  3. Include screenshots.

Skype: arachnodedotnet

Top 10 Contributor
83 Posts

Hello Mike,

Apologies for the delay in reply.

First, I made some changes in "ManageLuceneDotNetIndexes.cs" and "CustomManageLuceneDotNetIndexes.cs". I added a "text" field to the document as below:

document.Add(new Field("text", UserDefinedFunctions.ExtractText(contentToIndex).Value, Field.Store.YES, Field.Index.ANALYZED));

The value of this "text" field is the response of the PDF file or web page.

That is, when crawling this PDF, the response of the PDF comes back as numbers.

You can check this in the response variable "stringBuilder", as highlighted below, by crawling the PDF again in "ManageLuceneDotNetIndexes.cs": just add a breakpoint on CreateDocument() and inspect the "stringBuilder" variable.

 

Below is the URL that produces numbers in the description ("text" field) when crawled:

http://northamerica.atkinsglobal.com/~/media/Files/A/Atkins-North-America/Attachments/sectors/renewables/library-docs/brochures/Atkins_Renewables.pdf

Also, in the screenshot you shared there is no description; it shows only "...".

Let me know if you need any further details.

Thanks,

Top 10 Contributor
1,905 Posts

I can take a look in a day or less... busy, busy...

Top 10 Contributor
1,905 Posts

Looks like iTextSharp isn't crashing when I use the latest version.

It still doesn't know how to extract the text though...  looking at it further...

Top 10 Contributor
1,905 Posts

As best as I can tell, the fonts/text are actually vector positions.  So, there isn't any text to capture.  ???

I updated itextsharp.dll FWIW.  It seems to have fixed a problem in reading the byte[] into the PdfReader object.

Top 10 Contributor
83 Posts

Thanks for the reply. Can you please tell me which project you updated so I can get the latest from SVN?

:)

Top 10 Contributor
1,905 Posts

I updated the Library and the Plugins projects.  You can always use SVN > Check for Modifications to see what has changed.  :)

Top 10 Contributor
83 Posts

Hello Mike,

Apologies for the delay in replying. I took the latest from SVN and checked the same file by crawling it again.

It still shows the same issue.

:(

I took the latest copies of both the Plugins and Library projects.

Please suggest a way forward.

Thanks,

 

Top 10 Contributor
1,905 Posts

The issue is that there doesn't appear to be any text to extract.  arachnode.net (AN) uses iTextSharp for this, so if iTextSharp can't extract the text, it is either an iTextSharp bug, or...  the text is presented in vector format and there is a code per letter.

Top 10 Contributor
83 Posts
Verified by InvestisDev

Hello,

I tried to get a solution from iTextSharp but have had no reply yet. In the meantime, I tried another iTextSharp method:

iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(). With this method, the PDF attached to this post, which previously gave numbers as the result, now shows proper data.

Can you please take a look at this method and verify whether it is fine to use in place of the existing one?

Below is a screenshot of the code I changed to get the response of the PDF. I changed the code in the "Arachnode.Plugins.CrawlActions.ManageLuceneDotNetIndexes" class, "PerformAction()" method.
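
For readers without the screenshot, here is a minimal sketch of page-by-page extraction with PdfTextExtractor.GetTextFromPage() and an explicit text extraction strategy. It assumes iTextSharp 5.x. The class and method names are hypothetical and not part of arachnode.net, and SimpleTextExtractionStrategy is an assumption, since the post does not say which strategy was used.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper for illustration only; not the code from the post.
public static class PdfTextWithStrategySketch
{
    public static string ExtractText(byte[] pdfBytes)
    {
        var stringBuilder = new StringBuilder();
        var reader = new PdfReader(pdfBytes);

        try
        {
            // iTextSharp page numbers start at 1, not 0.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // Assumed strategy. A fresh instance per page keeps text
                // from one page out of the result for the next.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

                stringBuilder.AppendLine(
                    PdfTextExtractor.GetTextFromPage(reader, page, strategy));
            }
        }
        finally
        {
            reader.Close();
        }

        return stringBuilder.ToString();
    }
}
```

The returned string could then be used as the value of the "text" field in place of the previous extraction result.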

Let me know if you have any concern.

Thanks,

 

Top 10 Contributor
1,905 Posts

Yes, I will take a look.

Top 10 Contributor
1,905 Posts

Yes!  Perfect!

What have you discovered in the usage of the text strategies?

Mike

Top 10 Contributor
83 Posts
Verified by InvestisDev

I got that method from a forum, but the same method has one more overload, which does not require a text extraction strategy to be specified.

Using that overload, the same PDF that was not showing proper data now shows its actual content. I changed the code as below.

Please let me know whether we can use this as a solution. Below is a screenshot of the code change in the same class, "Arachnode.Plugins.CrawlActions.ManageLuceneDotNetIndexes", method "PerformAction()".
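
As an illustration of the two-argument overload described in this post, here is a minimal sketch assuming iTextSharp 5.x. The class and method names are hypothetical and not part of arachnode.net. As far as I know, this overload falls back to a location-based strategy internally in the 5.x releases, but treat that as an assumption and check the version you have.

```csharp
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

// Hypothetical helper for illustration only; not the code from the post.
public static class PdfTextDefaultOverloadSketch
{
    public static string ExtractText(byte[] pdfBytes)
    {
        var stringBuilder = new StringBuilder();
        var reader = new PdfReader(pdfBytes);

        try
        {
            // iTextSharp page numbers start at 1, not 0.
            for (int page = 1; page <= reader.NumberOfPages; page++)
            {
                // No strategy argument: the library chooses its default.
                stringBuilder.AppendLine(
                    PdfTextExtractor.GetTextFromPage(reader, page));
            }
        }
        finally
        {
            reader.Close();
        }

        return stringBuilder.ToString();
    }
}
```

Leaving the strategy out keeps the call site shorter, at the cost of depending on whichever default the installed iTextSharp version uses.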

 

 

Thanks,

Top 10 Contributor
1,905 Posts

Yes, this is correct and should match what I checked in.

Thanks,
Mike 


copyright 2004-2017, arachnode.net LLC