Thursday, January 21, 2016

Best method to extract text from a PDF in ASP.NET C#


There are several ways to extract text from a PDF:
  • The Microsoft IFilter interface with the Adobe IFilter implementation
  • iTextSharp
  • PDFBox
In my experience, PDFBox is the best of these options, because:
  • The Adobe IFilter implementation needs several other things set up on the machine it runs on, such as:
    • Windows 2000 or later
    • Adobe Acrobat or Reader 7.0.5+
    • An IFilter COM wrapper class
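  • iTextSharp is open source, but recent versions are licensed under the AGPL, so it can't be used in closed-source commercial products without buying a commercial license.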
So the best solution is PDFBox. It doesn't need anything else installed on your machine.

PDFBox is a Java PDF library, and there is a .NET version of PDFBox (it can be downloaded from here).


Here is an example of how to use PDFBox:


using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// ...

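// Loads the PDF at the given path and returns all of its text.
private static string ExtractTextFromPdf(string path)
{
  PDDocument doc = null;
  try {
    doc = PDDocument.load(path);
    PDFTextStripper stripper = new PDFTextStripper();
    return stripper.getText(doc);
  }
  finally {
    // Always close the document so the underlying file is released.
    if (doc != null) {
      doc.close();
    }
  }
}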

NOTE:
  1. The following DLL files from the PDFBox library should be copied to the bin folder of your project:
    • commons-logging.dll
    • fontbox-1.8.9.dll
    • IKVM.OpenJDK.Text.dll
    • IKVM.OpenJDK.Util.dll
    • IKVM.Runtime.dll
  2. References to the following DLLs should be added to the project:
    • IKVM.OpenJDK.Core.dll
    • IKVM.OpenJDK.SwingAWT.dll
    • pdfbox-1.8.9.dll
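
To show how the method above might be called, here is a minimal sketch of a console program built around it. The class name PdfTextDemo and the command-line handling are my own assumptions for illustration; the sketch also assumes the PDFBox 1.8.9 and IKVM DLLs listed in the note above are referenced and present in the bin folder.

```csharp
using System;
using System.IO;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

// Hypothetical wrapper class, used only for this illustration.
internal static class PdfTextDemo
{
    // Same approach as the method shown earlier: load, strip, always close.
    private static string ExtractTextFromPdf(string path)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            if (doc != null)
            {
                doc.close();
            }
        }
    }

    private static int Main(string[] args)
    {
        // Check the input before handing it to PDFBox.
        if (args.Length != 1 || !File.Exists(args[0]))
        {
            Console.Error.WriteLine("Usage: PdfTextDemo <path-to-pdf>");
            return 1;
        }

        string text = ExtractTextFromPdf(args[0]);
        Console.WriteLine("Extracted {0} characters.", text.Length);
        Console.WriteLine(text);
        return 0;
    }
}
```

In an ASP.NET project the same ExtractTextFromPdf method could be called from a page or controller instead of Main, with the path pointing to a file on the server.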
