Indexing files in Solr using Tika

Sunday, January 22, 2012

#.NET #C# #Solr #Tika

After installing the latest Solr release I noticed that the schema.xml file (the XML-file that holds the information about how Solr should index and search your data) already was setup for something called Apache Tika that can be used to index all sorts of documents and other files that holds all kinds of metadata (audio-files for instance).

See this post to learn how to install Solr on Windows: Installing Apache Solr on Windows

The great thing about Tika is that you don’t need to do anything in order to make it work – Tika does everything for you – or almost everything. You can also set it up so that it suits your needs just by adding a config-XML file in the Solr directory and a reference in the Solr-config XML-file. I tried to get it to work, but found that quit difficult also because there wasn’t much help to get out on Google. Because of all my problems getting it to work I created this post on StackOverflow. Before I got a reply on the subject though, I found the solution myself burried deep inside some forum posts about SolrNet.

This how you do in order to get it to work inside Solr using SolrNet as client.

Install Solr using the link from my earlier post above or something similiar – the main thing is that you install it using the release from Solr that is already build. Create a new folder called “lib” inside your Solr-install folder. Copy the apache-solr-cell-3.4.0.jar file from the “dist”-folder from the Solr zip-file to the newly created “lib”-folder the folder where you installed Solr. Copy the content of contrib\extraction\lib from the Solr-zip to the same newly created “lib”-folder the folder where you installed Solr. Now Tika is installed in Solr! Remember to go to http://localhost:8080/solr and confirm that it is installed correctly.

To use it in a .NET client application you can use the newest release of SolrNet (currently the 0.4.0.2002 beta version release) and add the DLLs to your .NET project (all of them – seriously!). This is an example of how to use it in C#:

Startup.Init("YOUR-SOLR-SERVICE-PATH");
var solr = ServiceLocator.Current.GetInstance();
 
using (FileStream fileStream = File.OpenRead("FILE-PATH-FOR-THE-FILE-TO-BE-INDEXED"))
{
   var response =
      solr.Extract(
         new ExtractParameters(fileStream, "doc1")
         {
            ExtractFormat = ExtractFormat.Text,
            ExtractOnly = false
         });
}

solr.Commit();

The response will hold all the metadata that has been extracted from the file using Apache Tika. The ExtractParameters is given a FileStream object and an ID for the Solr index (here just “doc1” – can be anything as long as it is unique). The ExtractOnly property can be set to true if you don’t wan’t to get Tika to index the data, but only wan’t it to extract the metadata from the file that is sent. The file is streamed to the Solr API using HTTP POST. You can read more about that here: http://wiki.apache.org/solr/ExtractingRequestHandler

In the above code the data sent to Solr is indexed in the last line, where the data is committed to Solr. If you would like Solr to index and commit the files when sent to the service you can set the AutoCommit property to true inside the initiation of ExtractParameters:

   var response =
      solr.Extract(
         new ExtractParameters(fileStream, "doc1")
         {
            ExtractFormat = ExtractFormat.Text,
            ExtractOnly = false,
            AutoCommit = true
         });

Because the commit is done everytime you send a new file to the Solr-API you can search during the indexing and, of course, you don’t need to call the solr.Commit() method after indexing.

You need a request handler inside your solrconfig.xml (inside {your-solr-install-path}/conf) to make Solr understand the request from the client. Below is an example of how the solrconfig.xml looks when you haven’t changed anything after install of Solr. See this for further information about configuring Tika inside Solr: http://wiki.apache.org/solr/ExtractingRequestHandler

  <requestHandler name="/update/extract" 
                  startup="lazy"
                  class="solr.extraction.ExtractingRequestHandler" >
    <lst name="defaults">
      <!-- All the main content goes into "text"... if you need to return
           the extracted text or do highlighting, use a stored field. -->
      <str name="fmap.content">text</str>
      <str name="lowernames">true</str>
      <str name="uprefix">ignored_</str>
 
      <!-- capture link hrefs but ignore div attributes -->
      <str name="captureAttr">true</str>
      <str name="fmap.a">links</str>
      <str name="fmap.div">ignored_</str>
    </lst>
  </requestHandler>

Your Solr schema.xml file (inside {your-solr-install-path}/conf) needs some fields in order to index the metadata from the files you send to Solr. You can provide the fields you need and index/store/ the metadata as is required for the files you need to index. This is the fields that Solr is installed with:

   <field name="id" type="string" indexed="true" stored="true" required="true" /> 
   <field name="sku" type="text_en_splitting_tight" indexed="true" stored="true" omitNorms="true"/>
   <field name="name" type="text_general" indexed="true" stored="true"/>
   <field name="alphaNameSort" type="alphaOnlySort" indexed="true" stored="false"/>
   <field name="manu" type="text_general" indexed="true" stored="true" omitNorms="true"/>
   <field name="cat" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="features" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="includes" type="text_general" indexed="true" stored="true" termVectors="true" termPositions="true" termOffsets="true" />
 
   <field name="weight" type="float" indexed="true" stored="true"/>
   <field name="price"  type="float" indexed="true" stored="true"/>
   <field name="popularity" type="int" indexed="true" stored="true" />
   <field name="inStock" type="boolean" indexed="true" stored="true" />
 
   <!--
   The following store examples are used to demonstrate the various ways one might _CHOOSE_ to
    implement spatial.  It is highly unlikely that you would ever have ALL of these fields defined.
    -->
   <field name="store" type="location" indexed="true" stored="true"/>
 
   <!-- Common metadata fields, named specifically to match up with
     SolrCell metadata when parsing rich documents such as Word, PDF.
     Some fields are multiValued only because Tika currently may return
     multiple values for them.
   -->
   <field name="title" type="text_general" indexed="true" stored="true" multiValued="true"/>
   <field name="subject" type="text_general" indexed="true" stored="true"/>
   <field name="description" type="text_general" indexed="true" stored="true"/>
   <field name="comments" type="text_general" indexed="true" stored="true"/>
   <field name="author" type="text_general" indexed="true" stored="true"/>
   <field name="keywords" type="text_general" indexed="true" stored="true"/>
   <field name="category" type="text_general" indexed="true" stored="true"/>
   <field name="content_type" type="string" indexed="true" stored="true" multiValued="true"/>
   <field name="last_modified" type="date" indexed="true" stored="true"/>
   <field name="links" type="string" indexed="true" stored="true" multiValued="true"/>
 
   <!-- catchall field, containing all other searchable text fields (implemented
        via copyField further on in this schema  -->
   <field name="text" type="text_general" indexed="true" stored="false" multiValued="true"/>
 
   <!-- catchall text field that indexes tokens both normally and in reverse for efficient
        leading wildcard queries. -->
   <field name="text_rev" type="text_general_rev" indexed="true" stored="false" multiValued="true"/>
 
   <!-- non-tokenized version of manufacturer to make it easier to sort or group
        results by manufacturer.  copied from "manu" via copyField -->
   <field name="manu_exact" type="string" indexed="true" stored="false"/>
 
   <field name="payloads" type="payloads" indexed="true" stored="true"/>

The fields above is a fits-all scenario, so with this you can both index audio and document files.

Supported formats in Tika can be found here: http://tika.apache.org/1.0/formats.htm