Saturday, April 9, 2011

Implementing the Windows 2008 TIFF IFilter and FAST Search for SharePoint 2010 (FS4SP)

Topic: SharePoint 2010, FAST Search Server (FS4SP), Windows 2008 TIFF IFilter, optical character recognition (OCR)
Subject: Extending the FAST Search for SharePoint 2010 (FS4SP) pipeline: user_converter_rules.xml, Windows 2008 TIFF IFilter
Problem: Every once in a while someone has TIFFs that have not been run through optical character recognition (OCR) and would like to crawl them. How do I crawl them?
Response: Windows 2008 Server has a built-in Windows TIFF IFilter. The FS4SP pipeline can be extended to use this IFilter, bypassing the built-in IFilterConverter.
Solution:
1.      Enable the Windows TIFF IFilter on each FS4SP server that has Document Processors enabled.
a.      Open Server Manager -> Features -> Add Features
b.      Select Windows TIFF IFilter and click Install

2.      Configure the FS4SP pipeline to use the new IFilter
a.      Navigate to:
<FAST Install Drive>\FASTSearch\etc\config_data\DocumentProcessor\formatdetector
b.      Edit: user_converter_rules.xml
c.      Add new entries for TIFFs to both the “trust” and “MimeMapping” nodes:
<ConverterRules>
    <IFilter>
        <trust>
            <!--     <ext name=".xxx" mimetype="application/xxx" /> -->
            <ext name=".tif" mimetype="image/tif"/>
            <ext name=".tiff" mimetype="image/tiff"/>
        </trust>
    </IFilter>

    <MimeMapping>
        <!--
        A mapping between mime types and the description of them.
        <mime type="application/xxx">XXX Document</mime>
        -->
        <mime type="image/tif">TIF File</mime>
        <mime type="image/tiff">TIF File</mime>
    </MimeMapping>
</ConverterRules>
3.      Reset the FAST Processor Servers (pipeline)
a.      Open FAST Command Shell as Administrator
b.      Issue the command: “psctrl reset”.  (All processor servers are tied together, so this command only needs to be run once, after all processor servers are configured.)
4.      Set up a new Content Source crawl to test
5.      Search for the newly crawled content


Summary:  TIFFs can be crawled using the built-in IFilterConverter (with the Advanced Filter Pack enabled) and will be searchable by crawled properties such as Title and Author.  If the TIFF has been OCR’d, it will also be searchable by the actual text content of the TIFF.  Using the Windows 2008 TIFF IFilter adds overhead to the FAST pipeline because of the OCR component, so when should you use the built-in IFilterConverter vs. the Windows 2008 TIFF IFilter?
a.      TIFF has already been OCR’d.  Use IFilterConverter.
b.      TIFF has not been OCR’d but has no useful content to index except title.  Use IFilterConverter.
c.      TIFF has not been OCR’d but has useful content to index.  Use Windows 2008 TIFF IFilter.
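
The three cases above reduce to one rule: use the Windows 2008 TIFF IFilter only when the TIFF has not been OCR’d and has content worth indexing. As a small illustration in Python (the function name choose_tiff_converter and its parameters are hypothetical, not an FS4SP API):

```python
def choose_tiff_converter(already_ocrd, has_useful_content):
    """Map cases a, b and c above to the converter to use."""
    # Cases a and b: OCR adds overhead with nothing to gain.
    if already_ocrd or not has_useful_content:
        return "IFilterConverter"
    # Case c: not OCR'd, but the content is worth indexing.
    return "Windows 2008 TIFF IFilter"
```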
Side Note: If you are NOT using the SharePoint Crawler, see my blog post on “FS4SP and User_Converter_Rules.xml”.
KORITFW
