lets see: Best method to extract text from PDF in ASP.NET C#

Thursday, January 21, 2016

Best method to extract text from PDF in ASP.NET C#

There are several ways to extract text from a PDF:

- Microsoft IFilter interface and Adobe IFilter implementation
- iTextSharp
- PDFBox

In my experience, PDFBox is the best of these, for two reasons.

The Adobe IFilter implementation needs a lot of other things set up on the machine it runs on:
- Windows 2000 or later
- Adobe Acrobat or Reader 7.0.5+
- IFilter COM wrapper class

iTextSharp is open source, but version 5 is licensed under the AGPL, so using it in a closed-source commercial product requires buying a commercial license.

So the best solution is PDFBox. It does not need anything else installed on the machine; the DLLs it depends on are deployed with your application.

PDFBox is a Java PDF library, and there is a .NET version of it, built with IKVM (it can be downloaded from here).

Here is an example of how to use PDFBox:


using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// ...

private static string ExtractTextFromPdf(string path)
{
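  PDDocument doc = null;
  try {
    // Load the PDF from disk
    doc = PDDocument.load(path);
    // By default the stripper extracts the text of every page
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
  }
  finally {
    // Always close the document, even if extraction throws
    if (doc != null) {
      doc.close();
    }
  }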
}
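
If you only need part of a document, the same stripper can be limited to a page range. The following is a minimal sketch, assuming the same PDFBox 1.8.x .NET build as above; the class name PdfPageText and the method name ExtractPages are made up for this example, while setStartPage and setEndPage are PDFTextStripper methods that take 1-based, inclusive page numbers.

```csharp
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// Hypothetical helper class, not part of PDFBox.
public static class PdfPageText
{
    // Extracts text from firstPage to lastPage (1-based, inclusive).
    public static string ExtractPages(string path, int firstPage, int lastPage)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);
            PDFTextStripper stripper = new PDFTextStripper();
            // Restrict extraction to the requested pages
            stripper.setStartPage(firstPage);
            stripper.setEndPage(lastPage);
            return stripper.getText(doc);
        }
        finally
        {
            // Close the document even if extraction throws
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

For example, PdfPageText.ExtractPages(path, 1, 1) would return only the text of the first page.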

NOTE:

The following DLL files from the PDFBox library must be copied to the bin folder of your project:
- commons-logging.dll
- fontbox-1.8.9.dll
- IKVM.OpenJDK.Text.dll
- IKVM.OpenJDK.Util.dll
- IKVM.Runtime.dll
References to the following DLLs must be added to your project:
- IKVM.OpenJDK.Core.dll
- IKVM.OpenJDK.SwingAWT.dll
- pdfbox-1.8.9.dll
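
A missing DLL usually only shows up as an exception at run time, so it can help to check the bin folder once at startup. This is a rough sketch that uses only System.IO; the class PdfBoxDependencyCheck and the method FindMissing are names invented for this example, and the file names are simply the ones listed above for PDFBox 1.8.9, so adjust them if your version differs.

```csharp
using System;
using System.Collections.Generic;
using System.IO;

// Hypothetical helper, not part of PDFBox or IKVM.
public static class PdfBoxDependencyCheck
{
    // File names copied from the list above (PDFBox 1.8.9 build).
    private static readonly string[] RequiredDlls =
    {
        "commons-logging.dll",
        "fontbox-1.8.9.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll",
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.8.9.dll"
    };

    // Returns the names of the required DLLs that are not in binDirectory.
    public static List<string> FindMissing(string binDirectory)
    {
        var missing = new List<string>();
        foreach (string name in RequiredDlls)
        {
            if (!File.Exists(Path.Combine(binDirectory, name)))
            {
                missing.Add(name);
            }
        }
        return missing;
    }
}
```

In an ASP.NET application you could pass HttpRuntime.BinDirectory as the folder and log whatever names come back; an empty list means every file from the lists above was found.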
