I have come across the case where I have to read PDF
document, parse it and write content to database using SSIS. PDF document is
not usual source for any data so there is no component available to read PDF
file from SSIS. I have used iTextSharp library
in Script Component to read the PDF file from SSIS package. I also found very good link where I found good
example on how to use iTextsharp library.
Using iTextsharp library it is very easy to process PDF document. In my case it was simple case of reading PDF document page by page and parses the content using custom logic. You can do much more than just reading PDF document using this library.
To read data from PDF file and write to database, I have
used Script Component as data source, which reads the PDF document and return
result and output record of Script Component. The simply used OLE DB
destination to write it to database. You might be aware that any third party
dlls can be used in SSIS Script component by referencing it. But to add it as
reference and use it in script component you have to register dll in GAC. It
can be referenced without adding it but dll might not be available on the production
environment at the same location. So better add them to GAC. In my case I was
not allowed to add dll to production server directly, so I used .Net reflection
to load the dll dynamically and added dll location as configuration in SSIS
package.
Below is my code
//Below two variable are used to pass ITextSharp
dll location and PDF file path.
//Which ultimately supplied from dts config file.String configITextPath = Variables.itextdllfilepath;
String configSourceDataFile = Variables.filepath;
//Load ITextSharp library
Assembly assembly = Assembly.LoadFrom(configITextPath);
//Get PdfReader object and create its instance
Type readerType = assembly.GetType("iTextSharp.text.pdf.PdfReader");object reader = Activator.CreateInstance(readerType, new object[] { configSourceDataFile });
Type extractorType = assembly.GetType("iTextSharp.text.pdf.parser.PdfTextExtractor");
Type extractionStrategyType = assembly.GetType("iTextSharp.text.pdf.parser.LocationTextExtractionStrategy");
object
extractionStrategy = Activator.CreateInstance(extractionStrategyType);
//Get number of pages in PDF document
Int32 pages = (Int32)reader.GetType().GetProperty("NumberOfPages").GetValue(reader, null);
//Read PDF document page by page and process it and generate Output rows
for script component.
String pageText;List<PepRecord> pepRecords = new List<PepRecord>();
for (int page = configStartPageNumber; page <= pages; page++)
{
reader = Activator.CreateInstance(readerType, new object[] { configSourceDataFile });
extractionStrategy = Activator.CreateInstance(extractionStrategyType);
pageText = (String)extractorType.GetMethods()[0].Invoke(reader, new Object[] { reader, page, extractionStrategy });
//code to process PDF file content and create Output rows for script component
}
I know better way to do this is to add dll in GAC and refer
from there. But in my case I have to quickly found the solution to process few
PDF document suing SSIS package, so I have used reflection to load it.
How to add DLL In GAC
ReplyDeletehttps://bipassion.wordpress.com/2012/03/08/ssis-dll-assembly-reference/
ReplyDelete