There are several ways to extract text from a PDF:
- Microsoft IFilter interface and Adobe IFilter implementation
- iTextSharp
- PDFBox
Of these options, PDFBox has been the best in my experience, because:
- The Adobe IFilter implementation needs a lot of other things set up on the machine it runs on, such as:
  - Windows 2000 or later
  - Adobe Acrobat or Reader 7.0.5+
  - An IFilter COM wrapper class
- iTextSharp is open source, but its AGPL license (versions 5 and later) means closed-source commercial products need a paid commercial license.
So the best solution is to use PDFBox. It doesn't need any other software installed on your machine.
PDFBox is a Java PDF library, and there is also a .NET version of PDFBox (it can be downloaded from here).
Here is an example of how to use PDFBox:
    using org.apache.pdfbox.pdmodel;
    using org.apache.pdfbox.util;
    // ...
    private static string ExtractTextFromPdf(string path)
    {
        PDDocument doc = null;
        try
        {
            // Load the PDF from disk.
            doc = PDDocument.load(path);
            // PDFTextStripper walks the pages and collects their text.
            PDFTextStripper stripper = new PDFTextStripper();
            return stripper.getText(doc);
        }
        finally
        {
            // Always release the document, even if extraction fails.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
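If you only need part of a document, the same approach can be narrowed to a page range. The following is a minimal sketch, assuming the PDFBox 1.8.9 .NET build described in this post; the class name `PdfPageText` and the method `ExtractPages` are hypothetical names made up for illustration, while `setStartPage`, `setEndPage` and `getNumberOfPages` are the PDFBox 1.8 methods as I understand them.

```csharp
using System;
using org.apache.pdfbox.pdmodel;
using org.apache.pdfbox.util;

public static class PdfPageText
{
    // Hypothetical helper (not from the original example): extracts text
    // from pages firstPage..lastPage, 1-based and inclusive.
    public static string ExtractPages(string path, int firstPage, int lastPage)
    {
        PDDocument doc = null;
        try
        {
            doc = PDDocument.load(path);

            // Clamp the requested range to the pages that actually exist.
            int pageCount = doc.getNumberOfPages();
            int start = Math.Max(1, firstPage);
            int end = Math.Min(pageCount, lastPage);
            if (start > end)
            {
                return string.Empty;
            }

            PDFTextStripper stripper = new PDFTextStripper();
            stripper.setStartPage(start);
            stripper.setEndPage(end);
            return stripper.getText(doc);
        }
        finally
        {
            // Always release the document, even if extraction fails.
            if (doc != null)
            {
                doc.close();
            }
        }
    }
}
```

For example, `PdfPageText.ExtractPages(path, 1, 2)` would return the text of the first two pages only, which can save time on large documents.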
NOTE:
- The following DLL files from the PDFBox library should be copied to the bin folder of your project:
  - commons-logging.dll
  - fontbox-1.8.9.dll
  - IKVM.OpenJDK.Text.dll
  - IKVM.OpenJDK.Util.dll
  - IKVM.Runtime.dll
- References to the following DLLs should be added to your project:
  - IKVM.OpenJDK.Core.dll
  - IKVM.OpenJDK.SwingAWT.dll
  - pdfbox-1.8.9.dll
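A missing DLL usually only shows up at run time, so it can help to check for the files listed in the note before calling PDFBox. This is a rough sketch using only the standard .NET class library; the class `PdfBoxDependencyCheck` and its `FindMissing` method are hypothetical names introduced here, and the file names are copied from the note (adjust them if your PDFBox version differs).

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

public static class PdfBoxDependencyCheck
{
    // File names taken from the note; they assume PDFBox 1.8.9.
    public static readonly string[] RequiredDlls =
    {
        "commons-logging.dll",
        "fontbox-1.8.9.dll",
        "IKVM.OpenJDK.Text.dll",
        "IKVM.OpenJDK.Util.dll",
        "IKVM.Runtime.dll",
        "IKVM.OpenJDK.Core.dll",
        "IKVM.OpenJDK.SwingAWT.dll",
        "pdfbox-1.8.9.dll",
    };

    // Returns the names of required DLLs that are not present in binDirectory.
    public static List<string> FindMissing(string binDirectory)
    {
        return RequiredDlls
            .Where(name => !File.Exists(Path.Combine(binDirectory, name)))
            .ToList();
    }
}
```

At startup you could call `PdfBoxDependencyCheck.FindMissing(AppDomain.CurrentDomain.BaseDirectory)` and log or report any names it returns, which gives a clearer message than a late assembly-load failure.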