Hi Guys, In this post I will explore the implementation details for searching the contents inside the pdf's,
Writing this post based on a real-time requirement in my project, the requirement was to provide a UI where the user can enter the text to search, and display the results i.e. pdf's in a grid view that contains the search keyword with the option to download the pdf files.
Step 1) Full Text Search Installation check -
Step 5) Restart all SQL Server Services (using services.msc)
Step 8) Create the FullText Catalog
Step 10) Full Text Queries to search pdf contents –
Writing this post based on a real-time requirement in my project, the requirement was to provide a UI where the user can enter the text to search, and display the results i.e. pdf's in a grid view that contains the search keyword with the option to download the pdf files.
Initial plan
was to search by reading the pdf's from physical file system one by one, but since No of pdfs were large we decided to store the pdf files in the database.
Below Third-party libraries options
were initially explored for searching pdf 's stored in File system –
Sl No
|
Library
|
Type
|
Remarks
|
1
|
iText
|
Paid
|
|
2
|
BitMiracle
|
Paid
|
|
3
|
FreeSpire.PDF
|
Free
|
Page limit
per pdf is limited to 10 Pages in the Free version
|
4
|
Spire.PDF
|
Paid
|
To Search
pdf files a software called as PDF iFilter is provided by Adobe so that we can search for pdf files that was stored and indexed in Sql Server.
Adobe PDFiFilter provides the ability for SQL Server to read the pdf contents.
Adobe PDFiFilter provides the ability for SQL Server to read the pdf contents.
Adobe PDFiFilter uses the Microsoft iFilter interface and
allows third-party indexing tools to extract text from Adobe PDF files.
Below steps will all the configurations steps for configuring the Adobe PDF iFilter -
Step 1) Full Text Search Installation check -
To check whether Full Text Search is installed or not, query the IsFullTextInstalled parameter.
select serverproperty('IsFullTextInstalled')
It should return 1
If it returns 0, the feature needs to be
installed from SQL Server installation. install the Full Text Search in the feature selection page of Sql server installation
Explanation - Using Full-Text Search we can query formatted binary data or query a large amount of unstructured text data. Full-Text Search in SQL Server lets users and applications run full-text queries against character-based data in SQL Server tables.
Explanation - Using Full-Text Search we can query formatted binary data or query a large amount of unstructured text data. Full-Text Search in SQL Server lets users and applications run full-text queries against character-based data in SQL Server tables.
Full Text queries are required since LIKE
predicate query against a large amount of unstructured text data is much slower
than an equivalent full-text query against the same data
A full-text query against large rows of text data will take only
seconds; whereas a LIKE predicate query against same data can take minutes to
return.
Step 2) Download and install iPDF Filter
Step 2) Download and install iPDF Filter
PDF iFilter is required by SQL Server to
read the PDF Documents and is a free download provided by Adobe, It must be
installed in the server where SQL Server is installed and running –
Link to download Adobe PDF iFilter -
Install the PDFiFilter in any folder
without any spaces
(As per adobe its preferred to install in - C:\AdobePDFiFilter)
Step 3) After installation, we need to add the bin folder of the
Adobe PDF iFilter (bin folder will be inside the AdobePDF iFilter installation)
to the system Path variable.
Step 4) Execute below commands to register Adobe PDF iFilter in
the SQL Server instance that you want to enable the search functionality.
Exec sp_fulltext_service 'load_os_resources', 1
Exec sp_fulltext_service
'verify_signature', 0
Explanation - First line is to make sure SQL Server uses the PDF iFilter, i.e. saying its available
Second line is used to validate the signature should be disabled for
faster in testing environment
In Production it should be Enabled, It can be enabled by setting the
parameter from 0 to 1.
Step 5) Restart all SQL Server Services (using services.msc)
Once instance is restarted, Verify Adobe
PDFiFilter is correctly installed and configured by executing below query –
Select *
From sys.fulltext_document_types
where Document_Type =
'.pdf'
Step 6) Create the tables to store the pdf documents
CREATE TABLE dbo.DocumentsTable
(
Id int IDENTITY(1, 1) not null,
[FileName] nvarchar(255),
[Filetype] nvarchar(6),
[FileContent] varbinary(max),
CreatedDate Datetime Default(GETDATE())
);
ALTER TABLE dbo.DocumentsTable ADD CONSTRAINT PK_DocumentsTable PRIMARY KEY CLUSTERED
(
Id
);
ALTER TABLE [dbo].[DocumentsTable]
ADD
DEFAULT (GETDATE()) FOR [CreatedDate]
GO
Explanation - Here a Table named Documents Table is
created with multiple columns with description that are self-explanatory, the
most important column in FileContent that contains the actual pdf content.
Step 7) Create the FullText Index
To enable Full-text indexing,
Launch Microsoft
SQL Server Management Studio. Right-click the database you have created >
select Properties > choose Files tab > check “Use full-text indexing”
> OK;
OR
Execute the
following command:
exec sp_fulltext_database
'enable'
3.
Explanation - Full-text indexes must be
created as the Full Text Queries to search the pdf columns use these indexes to
search the values in the pdf columns.
Full Text index is similar to SQL Server indexes, but the differences
is - Only one full-text index allowed per table where Several regular indexes are
allowed per table and
Step 8) Create the FullText Catalog
A Full-text CATALOG to store full-text indexes by executing the
following command:
Below is for my Database Implementation
CREATE FULLTEXT CATALOG DocumentsTable;
A Full-Text Search index created as -
CREATE FULLTEXT INDEX ON dbo.DocumentsTable(
[FileContent] TYPE COLUMN [Filetype] LANGUAGE 1033
)
KEY INDEX
PK_DocumentsTable ON DocumentsTable
Explanation - Full-text CATALOG can be
considered as warehouse which will contain all the Indexes, i.e. Full Text
Search indexes.
Step 9) Insert the PDF documents (PDF Documents
can be inserted in multiple ways like using OpenRowset command or using C#
Code)
Below code can be used to insert the pdf
document – Here the pdf file is present in the given directory and also we have
mentioned the File Name and File Type expicitly,
INSERT INTO dbo.DocumentsTable([FileContent], [Filetype], [FileName])
SELECT bulkcolumn, '.pdf', ' VehicleInsurance '
FROM OPENROWSET(BULK 'C:\Users\pnuser\Desktop\PDFFiles\VehicleInsurance.pdf', SINGLE_BLOB) AS t;
Step 10) Full Text Queries to search pdf contents –
DECLARE
@varSearchCriteriaIN NVARCHAR(2000)= NULL,
SET @varSearchCriteriaIN
= 'name'
SELECT *
FROM dbo.DocumentsTable
WHERE contains([FileContent], @varSearchCriteriaIN);
1 comment:
This article is so useful for me, thanks for sharing with us PDF Files Phone Number and Email Extractor
Post a Comment