C Razor Sharp / C# / .NET: January 2010

Link to source code and other goodies discussed here.

UPDATE: TextFileHarvester for SQL 2008 can now be found here.

I think the first thing to answer is, What is Solr? I'll let Solr explain itself:

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.

Simply put, Lucene is an open-source Java library that does full-text searching. In other words, if you want to provide searching beyond a simple SQL LIKE query, you'll need full-text search. Lucene does just that, but firstly, it's written in Java (there is a .NET version called Lucene.Net, but frankly it's way behind the original Java project, and I don't know how active that community is). Secondly, it's just a library, so you'd need to create your own search engine around it. That's where Solr comes in. Solr is a Java web application that functions as a search engine (using Lucene under the covers). The nice thing about it is that it has a simple XML-based web service API, which means it can be used from C# (or any other language, for that matter).

The point of this tutorial is to provide a simple walkthrough for getting Solr up and running to index and search some documents using C#. I work in Computer Forensics / E-Discovery, where being able to search and filter documents is vital. I learned a lot on the job, and felt that there wasn't really one good end-to-end tutorial aimed at C# developers, so I hope someone will find at least some of this information useful.

Here's how this will work. First, I'll walk you through getting Tomcat up and running. Then, I'll demonstrate how to install Solr. Next, I'll have you download a little utility I wrote which basically creates a database, and downloads a whole bunch of text files off the net. Finally, we'll write some C# code to index and search those text files.

Part 1: Installing Tomcat

From their website:

Apache Tomcat is an open source software implementation of the Java Servlet and JavaServer Pages technologies. The Java Servlet and JavaServer Pages specifications are developed under the Java Community Process.

Since Solr is a Java web app, we need a server to host the Java Servlet (that's basically a Java web app). First, you'll need to make sure Java is installed on your machine. You can go to this site which will tell you whether or not it's installed: http://www.java.com/en/download/help/testvm.xml If you see this:

then you have Java installed. Otherwise, head on over to this link, and grab the latest version of the JRE (Java Runtime Environment.)

The next step is to install Tomcat. You can grab it here: http://apache.cyberuse.com/tomcat/tomcat-6/v6.0.24/bin/apache-tomcat-6.0.24.exe That's the Windows installer version. Run the installer and click Next until you get to this page:

You can put any port number into the Port text box (the default is 8080). I used 8983, as that's the default you'll see in all the Solr examples on the net. You can choose any port you'd like; it doesn't really matter. Just remember which port you chose, as it will always be part of the URL when connecting to Solr. In all my examples here, I'll be using 8983.

On the next screen, it'll try to find the location of the JRE on your system:

It should find it automatically, but if it doesn't, make sure to point it to the directory where Java was installed on your system. Click Next and Tomcat will install.

Once you finish the installation, Tomcat will start and you'll see this:

(If you don't see that, either look at your Windows taskbar in the bottom right for the Tomcat icon, or go to Start -> Programs -> Apache Tomcat 6.0 -> Configure Tomcat.)

Click on the Start button and watch Tomcat start up. Above the Start button you should now see "Service Status: Started". If you don't, here's what you need to do. Go to Program Files\Java\jre6\bin and locate the file called msvcr71.dll. Copy that file and paste it into C:\Windows\system. Now try starting Tomcat again, and it should work. Don't ask me why, but after breaking my head over this a while back on one of my test machines, I found the answer on Google.

Once Tomcat is started, open a web browser and navigate to: http://localhost:8983 (remember, if you used a different port during installation, use that one instead). You should be greeted with this beautiful page:

If you don't see that page, please refer to Tomcat's FAQ.

Part 2: Installing Solr

The next step in this process is to install the Solr web application. First, let's shut down Tomcat. (Click Start -> Programs -> Apache Tomcat 6.0 -> Configure Tomcat and click on the Stop button.) Next, we'll need to download Solr. As of the time of this blog post, the latest version of Solr is 1.4, so let's download and install that one. Head on over to this page and choose the mirror of your choice: http://www.apache.org/dyn/closer.cgi/lucene/solr/1.4.0 Once you click on one of the mirror links, you'll be taken to a page where you can choose between the different formats to download. Download this one:

apache-solr-1.4.0.zip

Once that's downloaded, unzip the file and locate the folder "$\apache-solr-1.4.0\dist". In there you'll see a file "apache-solr-1.4.0.war"; war files are basically zipped-up files that contain all of the necessary binaries for the servlet container (Tomcat) to run the webapp. Copy that file and paste it to: "$\Program Files\Apache Software Foundation\Tomcat 6.0\webapps" Rename the file to solr.war, because the file name becomes the path in the URL used to access Solr.

Next, we need to create the "Solr Home". The Solr home is basically a folder where Tomcat will look for all the relevant Solr configurations, as well as where the actual index files will be stored. Go back to the unzipped solr package, and locate the folder: "$\apache-solr-1.4.0\example\solr". Copy the contents of that directory (should be a bin folder, a conf folder and a readme.txt) and place it under your root directory, under a folder named Solr (e.g. C:\Solr). Note: You can place it anywhere on your hard drive, doesn't have to be under the root, but wherever you place it, make sure to remember the full path. For now, don't worry about what exactly you're doing, I'll explain more later.

The next step is to tell Tomcat where the Solr Home folder is. Open Tomcat's configuration window again, and go to the Java tab. Locate the Java Options text box, and enter this line at the end of all the other stuff that's already there:

-Dsolr.solr.home=c:\solr

(If you chose a different Home Folder, place that path instead.)

Now go back to the General tab, and click Start. Once Tomcat starts up, open your browser and navigate to:

http://localhost:8983/solr/

You should be greeted with this page:

Click on the Solr Admin link and you'll see this page:

In the Query String box, replace the word solr with this: *:* (this is the Lucene syntax for "select all") and click Search. You should now see a response in XML format. At this point we haven't indexed anything, so it won't contain any results, but congratulations, you've set up Solr!

Part 2.1: Configuring Solr
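The same check can be made from C# without any Solr client library. This is only a rough sketch: the class and method names are made up for illustration, and it assumes Solr's standard XML response, where the document count sits in the numFound attribute of the result element.

```csharp
using System;
using System.Net;
using System.Xml.Linq;

// Hypothetical helper (not part of Solr or SolrNet).
public static class SolrPing
{
    // Pulls the numFound attribute out of a Solr XML select response.
    public static int ParseNumFound(string responseXml)
    {
        XElement result = XDocument.Parse(responseXml).Root.Element("result");
        return (int)result.Attribute("numFound");
    }

    // Runs the *:* ("select all") query against a running Solr instance,
    // e.g. solrUrl = "http://localhost:8983/solr".
    public static int CountAllDocuments(string solrUrl)
    {
        string url = solrUrl.TrimEnd('/') + "/select?q=" + Uri.EscapeDataString("*:*");
        using (var client = new WebClient())
        {
            return ParseNumFound(client.DownloadString(url));
        }
    }
}
```

Calling SolrPing.CountAllDocuments("http://localhost:8983/solr") on a freshly installed, empty index should report zero documents.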

In order to move on to Solr Configurations, it's important to take a minute and explore how Lucene works. I'm no expert, and if you want a deeper understanding, use Google, but I will mention a few key points.

The first step to making documents searchable is to index them. I'm sure you've heard that term before, but what exactly does it mean? At the heart of indexing are two concepts known as tokenizing and filtering. The idea is that you use analyzers (which are built from tokenizers and filters) to analyze the data and store the result in a custom file structure known as an index. I think this page does a great job of explaining it:

The analyzer's job is to take apart a string of text and give you back a stream of tokens. The tokens are presumably usually words from the text content of the string, and that's what gets stored (along with the location and other details) in the index.

Each analyzer includes one or more tokenizers and may include filters. The tokenizers take care of the actual rules for where to break the text up into words (typically whitespace). The filters do any post-tokenizing work on the tokens (typically dropping out punctuation and commonly occurring words like "the", "an", "a", etc).

So you feed Lucene the data, it analyzes and tokenizes it, and at that point it is searchable. Then, when you submit a query, it uses the same kind of tokenizers and filters to parse the query and return the results. Why is this relevant? Well, because different data needs to be tokenized differently. For most text you'll use a WhitespaceTokenizer, which splits on whitespace, with filters that generally take care of special characters. For integers or dates, for example, you obviously don't want that kind of tokenizer. You want to be able to specify that this is an int field, so that it is sorted as an int, for example.

That's where the Solr config file comes in. You specify all the fields that you want to index, as well as which tokenizers and filters to use for those fields. If you've been following along with this tutorial, you'll find the config file(s) under $/Solr/Conf. There will be many files there, but we won't concern ourselves with all of them. For now we'll focus on schema.xml. The other one I'd like to mention is solrconfig.xml, which basically holds the settings for Solr itself (how often to commit the index, how much RAM to use while indexing, etc.). The great thing about the example files is that there's very detailed documentation right in the XML.
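To make the tokenize-then-filter idea a bit more concrete, here's a toy version in C#. It is NOT how Lucene is implemented; the class name, the tiny stop word list, and the punctuation handling are all made up for illustration. It just shows a whitespace tokenizer followed by a lowercase filter and a stop word filter, applied the same way to a document and to a query.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Toy illustration only; Lucene's real analyzers are far more involved.
public static class ToyAnalyzer
{
    // A tiny stand-in for a stop word file.
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "an", "a" };

    public static List<string> Analyze(string text)
    {
        return text
            // "Tokenizer": break the text up on whitespace.
            .Split(new[] { ' ', '\t', '\r', '\n' }, StringSplitOptions.RemoveEmptyEntries)
            // "Filters": strip punctuation at the edges, lowercase, drop stop words.
            .Select(token => token.Trim('.', ',', '!', '?', ';', ':', '"'))
            .Select(token => token.ToLowerInvariant())
            .Where(token => token.Length > 0 && !StopWords.Contains(token))
            .ToList();
    }
}
```

Because the document text and the query go through the same steps, a query for "the FOX" ends up as the single token "fox", which matches the "fox" token produced from "The quick brown Fox."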

For this tutorial, I'll be using text files that can be downloaded for free from http://www.textfiles.com. There you'll find thousands of text files from the early days of the net. I'll provide a link further down to a simple application that I wrote that scrapes the site, downloads as many files as you specify, and creates a database for you with the relevant information. For now though, I'd like to show you the config file needed to index these documents. We'll be indexing these fields:

fileid

doctext

title

datecreated

Here's the schema.xml file needed:


<?xml version="1.0" encoding="UTF-8" ?> 
  <schema name="example" version="1.2">
  <types>
      <fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" /> 
      <fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0" /> 
      <fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0" /> 
      <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
              <tokenizer class="solr.WhitespaceTokenizerFactory" /> 
              <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> 
              <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" /> 
              <filter class="solr.LowerCaseFilterFactory" /> 
              <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" /> 
          </analyzer>
        <analyzer type="query">
          <tokenizer class="solr.WhitespaceTokenizerFactory" /> 
          <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" /> 
          <filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" /> 
          <filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" /> 
          <filter class="solr.LowerCaseFilterFactory" /> 
          <filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" /> 
        </analyzer>
      </fieldType>
      </types>
  <fields>
      <field name="fileid" type="int" indexed="true" stored="true" required="true" /> 
      <field name="doctext" type="text" indexed="true" stored="false" required="false" /> 
      <field name="title" type="text" indexed="true" stored="false" required="false" /> 
      <field name="datecreated" type="date" indexed="true" stored="false" /> 
  </fields>
  <uniqueKey>fileid</uniqueKey> 
  <defaultSearchField>doctext</defaultSearchField> 
  <solrQueryParser defaultOperator="OR" /> 
  </schema>

As you can see, the XML is pretty straightforward. I'd like to touch on two of the attributes in the field tag:

The first one is "stored". A lot of people make this mistake the first time they mess with Lucene: they store everything in the index. I'd like to stress this point really strongly. An index is NOT a data storage mechanism. It's not intended for that, and it's not optimized for that. That's what databases are for. What I mean by that is, the text you send to get indexed gets tokenized and totally bastardized. There's no way to reconstruct the original document from it. So what Lucene allows you to do is store the text as-is in the index as well, for later retrieval. That's OK for small indexes or small fields. However, when you get to larger indexes or larger fields, your performance will suffer noticeably. What's most commonly done is that you index everything you need, but you only STORE the primary key (e.g. fileid). When you retrieve the items from the index during a search, you get the primary key field, and then you query your database, which IS meant for data storage, to get the rest of the fields you need. That's the approach I'll be taking here.

The other attribute I'd like to point out is "type". This tells Solr which kind of field this one will be, and based on the fieldType definitions specified earlier, it uses the correct tokenizer.

There's also one change I think is worth making in solrconfig.xml. Not too far down in the solrconfig.xml file you'll see this:


  <!-- Used to specify an alternate directory to hold all index data
       other than the default ./data under the Solr home.
       If replication is in use, this should match the replication configuration.-->
  <dataDir>${solr.data.dir:./solr/data}</dataDir>

For some reason the example defaults to putting the index files under ./solr/data, relative to the directory the Tomcat server runs from. Just comment out that line:


  <!-- Used to specify an alternate directory to hold all index data
       other than the default ./data under the Solr home.
       If replication is in use, this should match the replication configuration.
  <dataDir>${solr.data.dir:./solr/data}</dataDir>-->

We'll use the default location under the "Solr Home" folder ($\solr\data\index) to store the index files.

At this point, in your /solr/conf folder, you should have these files: (delete all the other ones)

schema.xml (copy and paste the xml from above)

elevate.xml

protwords.txt

solrconfig.xml

stopwords.txt

synonyms.txt

Part 3: Downloading Some Text Files

OK, now we have most of our environment set up, it's time to get some files! I created a simple application, that connects to http://www.textfiles.com and downloads some files. It then creates a database and adds a record for each file downloaded. You can download it here:

For SQL 2005: http://www.box.net/shared/du4zac9vaa

For SQL 2008: http://www.box.net/shared/vtbod3rub6ibvof91xut

Here's how you use it:

TextFileHarvester.exe "ALEX\SQLEXPRESS" "TextFilesDatabase" "C:\MyTextFiles" 1000

The first parameter is the SQL Server name. The second one is the name of the database to create. The third one is the location where the text files should be stored on disk, and the last parameter is how many files to download.

Once that's done, you should have a database that looks something like this:

Just as a side point, the dates are made up. They're just random dates between 1995 and 2005.

You should also have a directory (in the location you specified to the TextFileHarvester app) with many text files in it.

Part 4: Writing Some Code!!

Now comes the fun part. Earlier I mentioned that Solr is basically a web service that you interact with by sending XML over HTTP. You can do all of the work manually and create the XML by hand, but there's an awesome library out there that already does this. The library is open source and called SolrNet. Head on over there and download the DLLs. (There's actually another library called SolrSharp, but that's much older and not as up to date as SolrNet. Also, in my opinion, SolrNet is much easier to use.)

OK, now that we have the SolrNet DLLs, we can create a simple application to index the files. We'll keep it simple and do this: connect to the database, get all of the files in one shot, connect to Solr, and send them all to be indexed.

The way you send stuff to Solr using SolrNet is by having a class that holds your data, and then adding an attribute to each property that maps it to a field in the index. Let me explain with code:
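Just to give a feel for what "creating the XML by hand" would involve, here's a rough sketch that builds a Solr add message for one document using the field names from the schema above. The class and method names are made up, and this isn't SolrNet's code; it only assumes Solr's standard update format (an add element containing doc and field elements) and that dates are sent as UTC in ISO 8601 form.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

// Hypothetical helper showing the shape of a Solr <add> message.
public static class SolrXml
{
    public static string BuildAddXml(int fileId, string title, string docText, DateTime created)
    {
        var doc = new XElement("doc",
            Field("fileid", fileId.ToString(CultureInfo.InvariantCulture)),
            Field("title", title),
            Field("doctext", docText),
            // Solr date fields expect UTC timestamps such as 2000-01-02T03:04:05Z.
            Field("datecreated", created.ToUniversalTime()
                .ToString("yyyy-MM-dd'T'HH:mm:ss'Z'", CultureInfo.InvariantCulture)));

        return new XElement("add", doc).ToString(SaveOptions.DisableFormatting);
    }

    private static XElement Field(string name, string value)
    {
        return new XElement("field", new XAttribute("name", name), value);
    }
}
```

The resulting string would then be POSTed to Solr's update URL, followed by a commit. A client library takes care of all of that plumbing for you.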


    using System;
    using System.IO;
    using SolrNet.Attributes;

    public class TextFile
    {
        #region Members

        private string documentText;

        #endregion

        [SolrUniqueKey("fileid")]
        public int FileID { get; internal set; }

        public string FileLocation { get; internal set; }

        [SolrField("doctext")]
        public string DocumentText
        {
            get
            {
                if (this.documentText == null)
                {
                    this.documentText = File.ReadAllText(FileLocation);
                }
                return this.documentText;
            }
            
        }

        [SolrField("title")]
        public string Title { get; internal set; }

        [SolrField("datecreated")]
        public DateTime? DateCreated { get; internal set; }
    }

You add the SolrField attribute (or SolrUniqueKey for the index's primary key field) and provide the name of that field in the index. I made the DocumentText property lazy loaded so we don't hit the file system for every file's text in one shot (there should probably be more error handling there, but hey, this is just a demo...).

SolrNet requires a collection of these objects, which it then sends off to the index. The next step, therefore, is to write some code to populate these objects. You can use LINQ to SQL, or whatever you want, to connect to the database and populate them. I wrote some (very) basic ADO.NET code to do this:


    using System;
    using System.Collections.Generic;
    using System.Data.SqlClient;
    using System.Linq;

    internal class TextFileRepository
    {
        #region Members

        private string connectionString;

        #endregion

        #region Constructors

        public TextFileRepository(string connectionString)
        {
            this.connectionString = connectionString;
        }

        #endregion

        #region Methods

        public IEnumerable<TextFile> GetTextFiles()
        {
            return ExecuteSql("SELECT * FROM FILES");
        }

        public IEnumerable<TextFile> GetTextFiles(IEnumerable<int> fileIds)
        {
            if (!fileIds.Any()) { yield break; }
            string sql = String.Format("SELECT * FROM FILES WHERE FILEID IN({0})", fileIds.ToDelimetedString());
            foreach (var item in ExecuteSql(sql))
            {
                yield return item;
            }
        }

        private IEnumerable<TextFile> ExecuteSql(string sql)
        {
            using (SqlConnection connection = new SqlConnection(this.connectionString))
            using (SqlCommand command = connection.CreateCommand())
            {
                command.CommandText = sql;
                connection.Open();
                // Dispose the reader as soon as enumeration finishes.
                using (var reader = command.ExecuteReader())
                {
                    while (reader.Read())
                    {
                        yield return FromReader(reader);
                    }
                }
            }
        }

        private TextFile FromReader(SqlDataReader reader)
        {
            var result = new TextFile();
            result.FileID = (int)reader["FileID"];
            result.Title = reader["Title"] as string;
            result.FileLocation = reader["FileLocation"] as string;
            var date = reader["DateCreated"];
            result.DateCreated = date == DBNull.Value ? (DateTime?)null : date as DateTime?;

            return result;
        }

        #endregion
    }

Nothing fancy. If you look at the GetTextFiles method, it's about as straightforward as they come. Just do a SELECT * and convert each row to a TextFile object. (You'll notice a GetTextFiles overload with a parameter; that one is used for searching, as I'll explain shortly.)
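The repository calls fileIds.ToDelimetedString(), an extension method that isn't listed in this post. A minimal sketch of what such a helper could look like (this is a guess at its behavior, not necessarily the version in the download) is:

```csharp
using System.Collections.Generic;
using System.Globalization;
using System.Linq;

public static class EnumerableExtensions
{
    // Joins the ids with commas so they can be dropped into an IN (...) clause.
    // Safe to concatenate into SQL here only because the values are ints.
    public static string ToDelimetedString(this IEnumerable<int> values)
    {
        return string.Join(",",
            values.Select(v => v.ToString(CultureInfo.InvariantCulture)).ToArray());
    }
}
```

With that in place, the ids 1, 2 and 3 turn into the string "1,2,3", which is what the IN clause in GetTextFiles needs.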

Part 4.1: Indexing the Files

Now, we can write some code to Index these files:


    using Microsoft.Practices.ServiceLocation;
    using SolrNet;

    public class BasicIndexer
    {
        private string connectionString;
        private string solrUrl;

        public BasicIndexer(string connectionString, string solrUrl)
        {
            this.connectionString = connectionString;
            this.solrUrl = solrUrl;
        }

        public void IndexFiles()
        {
            Startup.Init<TextFile>(this.solrUrl);
            var solrWorker = ServiceLocator.Current.GetInstance<ISolrOperations<TextFile>>();
            var files = new TextFileRepository(this.connectionString).GetTextFiles();
            solrWorker.Add(files).Commit();
        }
    }

Pretty simple, no? First, we need to call the Init method. Internally, SolrNet uses IoC to handle the instantiation of its classes, so we can just use Microsoft.Practices.ServiceLocation from the Enterprise Library to get hold of them. (I'm not a huge fan of the Enterprise Library, but it's quick and easy for this example. Refer to the SolrNet documentation for better approaches using Castle Windsor.) Once we have the ISolrOperations, we call the database to get all the files, and then call Add on the ISolrOperations. Once that's done, we just call Commit() and that's it!

Let's give it a whirl. Create a quick ConsoleApplication and call the BasicIndexer class:


    public class Program
    {
        public static void Main(string[] args)
        {
            string connectionString = "";
            string solrUrl = "";

            BasicIndexer indexer = new BasicIndexer(connectionString, solrUrl);
            indexer.IndexFiles();
        }
    }

Fill in the blanks with your specific connection string and your Solr URL (e.g. http://localhost:8983/solr). MAKE SURE YOU RESTART TOMCAT so that the configuration changes are picked up! (Start -> Programs -> Configure Tomcat -> Start.) Run the program and voilà, you just indexed all of your files! (In a real-world scenario, you'd obviously batch this operation, getting only x rows at a time from the database, possibly even multithreading it to maximize performance, but again, this is just a demo :) ). Now let's see if our indexing actually worked. Open your browser to http://localhost:8983/solr (or whatever your Solr URL is) and click on Solr Admin. In the Query String box enter "*:*" (the Lucene equivalent of SELECT *) and click the Search button. You SHOULD see this:

(depending on how many files you decided to download using the Harvester application you'll see different totals). Congrats! You've successfully indexed the files!
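As for the batching I mentioned, here's one way it could be sketched. InBatchesOf is a made-up helper name, not something from SolrNet; it simply chops any sequence into lists of a fixed size, so each list can be handed to Add separately.

```csharp
using System;
using System.Collections.Generic;

public static class BatchExtensions
{
    // Lazily splits a sequence into lists of at most 'size' items.
    // (Being an iterator, the size check runs when enumeration starts.)
    public static IEnumerable<List<T>> InBatchesOf<T>(this IEnumerable<T> source, int size)
    {
        if (size <= 0) throw new ArgumentOutOfRangeException("size");

        var batch = new List<T>(size);
        foreach (T item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size);
            }
        }

        // Hand back whatever is left over as a final, smaller batch.
        if (batch.Count > 0) yield return batch;
    }
}
```

In IndexFiles you could then loop over files.InBatchesOf(500), call solrWorker.Add on each batch, and call Commit once at the end. The batch size of 500 is an arbitrary number; you'd want to tune it for your own documents.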

Part 4.2: Searching the Files

Now we want to search using C#. That too is VERY easy using SolrNet. The only thing to remember here is that, like I mentioned earlier, we didn't STORE the fields in Solr. All we stored was the FileID. Therefore, we need to first retrieve those file IDs, and then hit the database to get the rest of the information. It may seem like double work, but TRUST me, when dealing with larger data sets, it's MUCH faster. So, first let's create a class to hold the FileIDs:


    internal class FileIDResult
    {
        [SolrField("fileid")]
        public int FileID { get; set; }
    }

Same concept as indexing. We use the SolrField attribute on the properties that map to the index fields.

Now we can write some code to execute the search. There are a few parameters I'd like to discuss first. Aside from the search query itself, you need to specify the number of results per page and which result to start from. Solr has paging built right into it, so the way it works is that you specify how many items you want per page, and then how many items to skip over. So for example, if you have 100 results, with 10 items per page, and you want page 3, you'd start at item 20 (it's zero based: page 1 covers items 0-9 and page 2 covers items 10-19). With that in mind, here's the searching code:


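    using System.Collections.Generic;
    using System.Linq;
    using Microsoft.Practices.ServiceLocation;
    using SolrNet;
    using SolrNet.Commands.Parameters;

    // Keeps one ISolrOperations<T> per document type so that Startup.Init<T>
    // is only called once for that type (not thread-safe).
    internal static class SolrOperationsCache<T>
        where T : new()
    {
        private static ISolrOperations<T> solrOperations;

        public static ISolrOperations<T> GetSolrOperations(string solrUrl)
        {
            if (solrOperations == null)
            {
                Startup.Init<T>(solrUrl);
                solrOperations = ServiceLocator.Current.GetInstance<ISolrOperations<T>>();
            }

            return solrOperations;
        }
    }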


    public class BasicSearcher
    {
        private string connectionString;
        private string solrUrl;

        public BasicSearcher(string connectionString, string solrUrl)
        {
            this.connectionString = connectionString;
            this.solrUrl = solrUrl;
        }

        public SearchResults Search(string query, int resultsPerPage, int pageNumber)
        {
            var solrWorker = SolrOperationsCache<FileIDResult>.GetSolrOperations(this.solrUrl);

            QueryOptions options = new QueryOptions
            {
                Rows = resultsPerPage,
                Start = (pageNumber - 1) * resultsPerPage,
            };

            ISolrQueryResults<FileIDResult> results = solrWorker.Query(query, options);
            var textFiles = new TextFileRepository(this.connectionString)
                .GetTextFiles(results.Select(r => r.FileID));


            var searchResults = new SearchResults
            {
                Results = textFiles,
                QueryTime = results.Header.QTime,
                TotalResults = results.NumFound
            };

            return searchResults;
        }
    }

    public class SearchResults
    {
        public IEnumerable<TextFile> Results { get; set; }
        public int QueryTime { get; set; }
        public int TotalResults { get; set; }
    }

Same idea as before. Get an instance of ISolrOperations, and call the Query method. I used the overload that takes a QueryOptions object so that I can specify the page number and items per page. The result that comes back from that method call is an ISolrQueryResults which, in addition to the search results themselves, has some metrics. I wrapped all that up in the SearchResults class. Once we have the file IDs, we can hit the database and get the rest of the data.
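One thing to be aware of: a SQL IN query without an ORDER BY makes no promise about row order, so the rows from the database may not come back in the order Solr ranked them. If the ranking matters to you, the rows can be put back in Solr's order. Here's a small sketch of that; the helper name is made up and it's not part of the download.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class RankingHelper
{
    // Returns the rows in the order given by rankedIds.
    // Ids that have no matching row are simply skipped.
    public static List<TRow> InRankedOrder<TRow>(
        IEnumerable<int> rankedIds, IEnumerable<TRow> rows, Func<TRow, int> getId)
    {
        Dictionary<int, TRow> byId = rows.ToDictionary(getId);
        var ordered = new List<TRow>();
        foreach (int id in rankedIds)
        {
            TRow row;
            if (byId.TryGetValue(id, out row))
            {
                ordered.Add(row);
            }
        }
        return ordered;
    }
}
```

In BasicSearcher.Search this could be used as RankingHelper.InRankedOrder(results.Select(r => r.FileID), textFiles, f => f.FileID) before assigning Results.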

Let's give it a whirl:


    public class Program
    {
        public static void Main(string[] args)
        {
            string connectionString = "";
            string solrUrl = "";

            BasicSearcher searcher = new BasicSearcher(connectionString, solrUrl);
            var results = searcher.Search("*:*", 10, 1);
            foreach (TextFile file in results.Results)
            {
                Console.WriteLine("FileID: {0}, Title: {1}",file.FileID,file.Title);
            }
        }
    }

If you run this, you should see the file id's and titles for the first 10 documents! Congratulations, you've just executed a search!!

OK, this has gotten WAAAAAY longer than I ever imagined. There's A LOT going on here, I won't deny that. There are a few random closing points I'd like to make:

There are many different tokenizers and analyzers for different kinds of data. If you need to tweak one or create your own, you're SOL doing it in C#; you'll have to blow the dust off that old Java book and do it there.

The actual index is stored under $\solr\data\index. Sometimes it's useful to actually look into the index files and read them. For that you can use Luke (a standalone Java application.)

In the real world, you'd possibly have your Indexing Server somewhere other than where the data is. Keep that in mind, as the XML sent across can get quite big, so see if you can optimize that.

There's so much more to Lucene and Solr than what I covered here. Here are some links for more reading:

Lucene Home Page
Lucene tutorial. (While this is an actual Lucene tutorial of how to use the Lucene Java libraries, it has some great insight in to what's happening behind the scenes.)
Solr Home Page
Solr Wiki
The various Tokenizers and Analyzers available from Solr

That's all I can think of for now. If there's anything else, I'll add it in the future.

Lastly, here's a zip file containing everything I covered in this post. You'll find the TextFileHarvester app. You'll find the solr configuration files I've used for this demo. You'll also find a simple library I created to wrap the indexing and searching functionality of these text files, as well as a basic windows application to test both. Good luck, and Happy Coding!

P.S. This tutorial took a lot of time and effort. If you find it useful, just drop me a line in the comments.

Sunday, January 24, 2010

Full Text Search using Solr / Lucene and C# / .NET. End to end tutorial on using these technologies with SolrNet.
