How to Search for Text inside PDF Files in SQL Server – Configuring Adobe PDF iFilter and implementing Full Text Search

Hi Guys,  In this post I will explore the implementation details for searching the contents inside the pdf's,
Writing this post based on a real-time requirement in my project, the requirement was to provide a UI where the user can enter the text to search, and display the results i.e. pdf's in a grid view that contains the search keyword with the option to download the pdf files.

Initial plan was to search by reading the pdf's from physical file system one by one, but since No of pdfs were large we decided to store the pdf files in the database.

Below Third-party libraries options were initially explored for searching pdf 's stored in File system –

Sl No
Library
Type
Remarks
1
iText
Paid

2
BitMiracle
Paid

3
FreeSpire.PDF
Free
Page limit per pdf is limited to 10 Pages in the Free version
4
Spire.PDF
Paid


To Search pdf files a software called as PDF iFilter is provided by Adobe so that we can search for pdf files that was stored and indexed in Sql Server.
Adobe PDFiFilter provides the ability for SQL Server to read the pdf contents.
Adobe PDFiFilter uses the Microsoft iFilter interface and allows third-party indexing tools to extract text from Adobe PDF files.

Below steps will all the configurations steps for configuring the Adobe PDF iFilter -  

Step 1) Full Text Search Installation check -

To check whether Full Text Search is installed or not, query the IsFullTextInstalled parameter.
select serverproperty('IsFullTextInstalled')
It should return 1
If it returns 0, the feature needs to be installed from SQL Server installation. install the Full Text Search in the feature selection page of Sql server installation



Explanation - Using Full-Text Search we can query formatted binary data or query a large amount of unstructured text data. Full-Text Search in SQL Server lets users and applications run full-text queries against character-based data in SQL Server tables.
Full Text queries are required since LIKE predicate query against a large amount of unstructured text data is much slower than an equivalent full-text query against the same data
A full-text query against large rows of text data will take only seconds; whereas a LIKE predicate query against same data can take minutes to return.

Step 2) Download and install iPDF Filter
PDF iFilter is required by SQL Server to read the PDF Documents and is a free download provided by Adobe, It must be installed in the server where SQL Server is installed and running –

Link to download Adobe PDF iFilter -

Install the PDFiFilter in any folder without any spaces
(As per adobe its preferred to install  in - C:\AdobePDFiFilter)

Step 3) After installation, we need to add the bin folder of the Adobe PDF iFilter (bin folder will be inside the AdobePDF iFilter installation) to the system Path variable.




Step 4) Execute below commands to register Adobe PDF iFilter in the SQL Server instance that you want to enable the search functionality.

Exec sp_fulltext_service 'load_os_resources', 1
Exec sp_fulltext_service 'verify_signature', 0

Explanation - First line is to make sure SQL Server uses the PDF iFilter, i.e. saying its available
Second line is used to validate the signature should be disabled for faster in testing environment
In Production it should be Enabled, It can be enabled by setting the parameter from 0 to 1.

Step 5) Restart all SQL Server Services (using services.msc)
Once instance is restarted, Verify Adobe PDFiFilter is correctly installed and configured by executing below query –

Select * From sys.fulltext_document_types where Document_Type = '.pdf'


Step 6) Create the tables to store the pdf documents
CREATE TABLE dbo.DocumentsTable (
   Id int IDENTITY(1, 1) not null,
   [FileName] nvarchar(255),
   [Filetype] nvarchar(6),
   [FileContent] varbinary(max),
   CreatedDate Datetime Default(GETDATE())
);

ALTER TABLE dbo.DocumentsTable ADD CONSTRAINT PK_DocumentsTable PRIMARY KEY CLUSTERED (
   Id
);

ALTER TABLE [dbo].[DocumentsTable] ADD  DEFAULT (GETDATE()) FOR [CreatedDate]
GO

Explanation - Here a Table named Documents Table is created with multiple columns with description that are self-explanatory, the most important column in FileContent that contains the actual pdf content.

Step 7) Create the FullText Index
To enable Full-text indexing,
Launch Microsoft SQL Server Management Studio. Right-click the database you have created > select Properties > choose Files tab > check “Use full-text indexing” > OK;
OR
Execute the following command:
exec sp_fulltext_database 'enable' 3.

Explanation -  Full-text indexes must be created as the Full Text Queries to search the pdf columns use these indexes to search the values in the pdf columns.
Full Text index is similar to SQL Server indexes, but the differences is - Only one full-text index allowed per table where Several regular indexes are allowed per table and

Step 8) Create the FullText Catalog
A Full-text CATALOG to store full-text indexes by executing the following command:
Below is for my Database Implementation
CREATE FULLTEXT CATALOG DocumentsTable;

A Full-Text Search index created as -
CREATE FULLTEXT INDEX ON dbo.DocumentsTable(
   [FileContent] TYPE COLUMN [Filetype] LANGUAGE 1033
)
KEY INDEX PK_DocumentsTable ON DocumentsTable

Explanation - Full-text CATALOG can be considered as warehouse which will contain all the Indexes, i.e. Full Text Search indexes.

Step 9) Insert the PDF documents (PDF Documents can be inserted in multiple ways like using OpenRowset command or using C# Code)
Below code can be used to insert the pdf document – Here the pdf file is present in the given directory and also we have mentioned the File Name and File Type expicitly,

INSERT INTO dbo.DocumentsTable([FileContent], [Filetype], [FileName])
SELECT bulkcolumn, '.pdf', ' VehicleInsurance '
FROM OPENROWSET(BULK 'C:\Users\pnuser\Desktop\PDFFiles\VehicleInsurance.pdf', SINGLE_BLOB) AS t;

Step 10) Full Text Queries to search pdf contents –
DECLARE  @varSearchCriteriaIN    NVARCHAR(2000)= NULL,
SET @varSearchCriteriaIN     = 'name'

SELECT *
FROM dbo.DocumentsTable
WHERE contains([FileContent], @varSearchCriteriaIN);


1 comment:

Rohit Rana said...

This article is so useful for me, thanks for sharing with us PDF Files Phone Number and Email Extractor