Sunday, January 24, 2010

Full Text Search using Solr / Lucene and C# / .NET. End to end tutorial on using these technologies with SolrNet.

Link to source code and other goodies discussed here.

UPDATE: TextFileHarvester for SQL 2008 can now be found here.

I think the first thing to answer is, What is Solr? I'll let Solr explain itself:

Solr is the popular, blazing fast open source enterprise search platform from the Apache Lucene project. Its major features include powerful full-text search, hit highlighting, faceted search, dynamic clustering, database integration, and rich document (e.g., Word, PDF) handling. Solr is highly scalable, providing distributed search and index replication, and it powers the search and navigation features of many of the world's largest internet sites.

Solr is written in Java and runs as a standalone full-text search server within a servlet container such as Tomcat. Solr uses the Lucene Java search library at its core for full-text indexing and search, and has REST-like HTTP/XML and JSON APIs that make it easy to use from virtually any programming language. Solr's powerful external configuration allows it to be tailored to almost any type of application without Java coding, and it has an extensive plugin architecture when more advanced customization is required.


Simply put, Lucene is an open-source Java library that does full text searching. In other words, if you want to provide searching beyond the simple "sql like query" you'll need full text search. Now Lucene does just that, but firstly, it's written in Java. (There is a .NET version called Lucene.Net, but frankly they're way behind the original Java project, and I don't know how active that community is.) Secondly, it's just a library, so you'd need to create your own search engine around it. That's where Solr comes in. Solr is a Java web application that functions as a search engine (using Lucene under the covers). The nice thing about it is that it has a simple XML-based web service API, which means it can be used from C# (or any other language for that matter).

The point of this tutorial is to provide a simple walkthrough to getting Solr up and running to index and search some documents using C#. I work in Computer Forensics / E-Discovery where being able to search and filter documents is vital. I learned a lot on the job, and felt that there wasn't really one good end to end tutorial aimed at C# developers, so I hope someone will find at least some of this information useful.

Here's how this will work. First, I'll walk you through getting Tomcat up and running. Then, I'll demonstrate how to install Solr. Next, I'll have you download a little utility I wrote which basically creates a database, and downloads a whole bunch of text files off the net. Finally, we'll write some C# code to index and search those text files.

Part 1: Installing Tomcat

From their website:

Apache Tomcat is an open source software implementation of the Java Servlet and JavaServer Pages technologies. The Java Servlet and JavaServer Pages specifications are developed under the Java Community Process.


Since Solr is a Java web app, we need a server to host the Java Servlet (that's basically a Java web app). First, you'll need to make sure Java is installed on your machine. You can go to this site which will tell you whether or not it's installed: http://www.java.com/en/download/help/testvm.xml If you see this:



then you have Java installed. Otherwise, head on over to this link, and grab the latest version of the JRE (Java Runtime Environment.)

The next step is to install Tomcat. You can grab it here: http://apache.cyberuse.com/tomcat/tomcat-6/v6.0.24/bin/apache-tomcat-6.0.24.exe That's the windows Installer version. Run the install and click Next until you get to this page:



You can put any port number into the Port text box (the default is 8080). I put 8983 as that's the default you'll see in all Solr examples on the net. You can choose any port you'd like; it doesn't really matter. Just remember which port you choose, as that'll always be part of the URL when connecting to Solr. In all my examples here, I'll be using 8983.

On the next screen, it'll try to find the location of the JRE on your system:



It should find it automatically, but if it doesn't, make sure to point it to the directory where Java was installed on your system. Click Next and Tomcat will install.

Once you finish the installation, Tomcat will start and you'll see this:



(If you don't see that, either look at your Windows Taskbar in the bottom right for the Tomcat icon, or go to Start -> Programs -> Apache Tomcat 6.0 -> Configure Tomcat)

Click on the Start button and watch Tomcat start up. Above the Start button you should now see "Service Status: Started". If you don't, here's what you need to do. Go to Program Files\Java\jre6\bin and locate the file called msvcr71.dll. Copy that file and paste it into C:\Windows\system. Now try starting Tomcat up again, and it should work. Don't ask me why, but after breaking my head a while back on one of my test machines, I found the answer on Google.

Once Tomcat is started, open a web browser and navigate to: http://localhost:8983 (remember, if you used a different port during installation, use that one instead). You should be greeted with this beautiful page:



If you don't see that page, please refer to Tomcat's FAQ.

Part 2: Installing Solr

The next step in this process is to install the Solr web application. First, let's shut down Tomcat. (Click Start -> Programs -> Apache Tomcat 6.0 -> Configure Tomcat and click on the Stop button.) Next, we'll need to download Solr. As of the time of this blog post, the latest version of Solr is 1.4, so let's download and install that one. Head on over to this page and choose the mirror of your choice: http://www.apache.org/dyn/closer.cgi/lucene/solr/1.4.0 Once you click on one of the mirror links, you'll be taken to a page where you can choose the different formats to download. Download this one:

apache-solr-1.4.0.zip

Once that's downloaded, unzip the file and locate the folder "$\apache-solr-1.4.0\dist". In there you'll see a file "apache-solr-1.4.0.war"; war files are basically zipped up files that contain all of the necessary binaries for the servlet container (Tomcat) to run the webapp. Copy that file and paste it to: "$\Program Files\Apache Software Foundation\Tomcat 6.0\webapps". Rename the file to solr.war, because the file name becomes the path in the URL used to access Solr.

Next, we need to create the "Solr Home". The Solr home is basically a folder where Tomcat will look for all the relevant Solr configurations, as well as where the actual index files will be stored. Go back to the unzipped solr package, and locate the folder: "$\apache-solr-1.4.0\example\solr". Copy the contents of that directory (should be a bin folder, a conf folder and a readme.txt) and place it under your root directory, under a folder named Solr (e.g. C:\Solr). Note: You can place it anywhere on your hard drive, doesn't have to be under the root, but wherever you place it, make sure to remember the full path. For now, don't worry about what exactly you're doing, I'll explain more later.

The next step is to tell Tomcat where the Solr Home folder is. Open Tomcat's configuration window again, and go to the Java tab. Locate the Java Options text box, and enter this line at the end of all the other stuff that's already there:

-Dsolr.solr.home=c:\solr

(If you chose a different Home Folder, place that path instead.)



Now go back to the General tab, and click Start. Once Tomcat starts up, open your browser and navigate to:

http://localhost:8983/solr/

You should be greeted with this page:



Click on the Solr Admin link and you'll see this page:



In the Query String box, replace the word solr with this: *:* (this is the Lucene syntax for "select all") and click Search. You should now see a response in XML format. At this point we haven't indexed anything, so the response will contain zero results, but Congratulations, you've set up Solr!
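Under the hood, the admin form just issues an HTTP GET against Solr's select handler, which is all a C# client needs to do as well. Here's a minimal sketch, assuming the port and /solr path used in this tutorial. The SolrUrlSketch class and its methods are names I made up for illustration; they are not part of Solr or SolrNet.

```csharp
using System;
using System.Net;

public static class SolrUrlSketch
{
    // Builds a query URL for Solr's /select handler.
    // baseUrl is assumed to look like http://localhost:8983/solr
    public static string BuildSelectUrl(string baseUrl, string query, int start, int rows)
    {
        return baseUrl.TrimEnd('/')
            + "/select?q=" + Uri.EscapeDataString(query) // escape spaces and reserved characters
            + "&start=" + start
            + "&rows=" + rows;
    }

    // Fetches the raw XML response as a string. Needs a running Solr instance.
    public static string RunQuery(string baseUrl, string query)
    {
        using (var client = new WebClient())
        {
            return client.DownloadString(BuildSelectUrl(baseUrl, query, 0, 10));
        }
    }
}
```

Calling SolrUrlSketch.RunQuery("http://localhost:8983/solr", "*:*") should hand back the same XML you see in the browser. Later on SolrNet will do this (and the XML parsing) for us.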

Part 2.1: Configuring Solr

In order to move on to Solr Configurations, it's important to take a minute and explore how Lucene works. I'm no expert, and if you want a deeper understanding, use Google, but I will mention a few key points.

The first step to getting documents searchable is to index them. I'm sure you've heard that term before, but what exactly does that mean? At the heart of indexing are two concepts known as Tokenizing and Filtering. The idea is, you use analyzers (each made up of a tokenizer and, optionally, some filters) to analyze the data and store it in a custom file structure known as an index. I think this page does a great job of explaining it:

The analyzer's job is to take apart a string of text and give you back a stream of tokens. The tokens are presumably usually words from the text content of the string, and that's what gets stored (along with the location and other details) in the index.

Each analyzer includes one or more tokenizers and may include filters. The tokenizers take care of the actual rules for where to break the text up into words (typically whitespace). The filters do any post-tokenizing work on the tokens (typically dropping out punctuation and commonly occurring words like "the", "an", "a", etc).


So you feed Lucene the data, it analyzes it and tokenizes it, at which point it is now searchable. Then, when you submit a query, it runs the query through the analyzer configured for that field (usually the same tokenizer and a very similar set of filters) and returns the results. Why is this relevant? Well, because different data will be tokenized differently. For most text you'll use a WhitespaceTokenizer, which splits on white space, followed by filters that lowercase the words and deal with punctuation and special characters. For integers or dates, for example, you obviously don't want that kind of tokenizer. You want to be able to specify that this is an int field, so that it's sorted as an int, for example.
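To make the tokenizer-plus-filters idea a bit more concrete, here's a toy sketch in C#. It is NOT how Lucene implements analysis; it only mimics the pipeline shape (split on white space, lowercase, trim punctuation, drop stop words). The ToyAnalyzer class and its tiny stop word list are made up for illustration.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class ToyAnalyzer
{
    // A tiny, hypothetical stop word list (real ones live in stopwords.txt).
    private static readonly HashSet<string> StopWords =
        new HashSet<string> { "the", "an", "a" };

    public static List<string> Analyze(string text)
    {
        return text
            // "Tokenizer": passing null as the separator splits on white space.
            .Split((char[])null, StringSplitOptions.RemoveEmptyEntries)
            // "Filters": lowercase, trim surrounding punctuation, drop stop words.
            .Select(token => token.ToLowerInvariant())
            .Select(token => token.Trim('.', ',', '!', '?', ';', ':', '"'))
            .Where(token => token.Length > 0 && !StopWords.Contains(token))
            .ToList();
    }
}
```

Feeding it "The quick brown Fox, a legend." gives back the tokens quick, brown, fox and legend. Run the same method over a query string and you can see why "FOX" and "fox," both end up matching the same indexed token.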

That's where the Solr config file comes in. You specify all the fields that you want to index, as well as which tokenizers and filters to use for those fields. If you've been following along with this tutorial, you'll find the config file(s) under $/Solr/Conf. There will be many files there, but we won't concern ourselves with all of them. For now we'll focus on schema.xml. The other one I'd like to mention is solrconfig.xml, which basically holds the settings for Solr itself (how often to commit the index, how much RAM to use while indexing, etc.). The great thing about the example files is that the XML itself is very thoroughly documented.

For this tutorial, I'll be using text files that can be downloaded for free from http://www.textfiles.com. There you'll find thousands of text files from the early days of the net. I'll provide a link further down to a simple application that I wrote that scrapes the site, downloads as many files as you specify, and creates a database for you with the relevant information. For now though, I'd like to show you the config file needed to index these documents. We'll be indexing these fields:


  • fileid

  • doctext

  • title

  • datecreated



Here's the schema.xml file needed:



<?xml version="1.0" encoding="UTF-8" ?>
<schema name="example" version="1.2">
<types>
<fieldType name="string" class="solr.StrField" sortMissingLast="true" omitNorms="true" />
<fieldType name="int" class="solr.TrieIntField" precisionStep="0" omitNorms="true" positionIncrementGap="0" />
<fieldType name="date" class="solr.TrieDateField" omitNorms="true" precisionStep="0" positionIncrementGap="0" />
<fieldType name="text" class="solr.TextField" positionIncrementGap="100">
<analyzer type="index">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="1" catenateNumbers="1" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
</analyzer>
<analyzer type="query">
<tokenizer class="solr.WhitespaceTokenizerFactory" />
<filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true" />
<filter class="solr.StopFilterFactory" ignoreCase="true" words="stopwords.txt" enablePositionIncrements="true" />
<filter class="solr.WordDelimiterFilterFactory" generateWordParts="1" generateNumberParts="1" catenateWords="0" catenateNumbers="0" catenateAll="0" splitOnCaseChange="1" />
<filter class="solr.LowerCaseFilterFactory" />
<filter class="solr.SnowballPorterFilterFactory" language="English" protected="protwords.txt" />
</analyzer>
</fieldType>
</types>
<fields>
<field name="fileid" type="int" indexed="true" stored="true" required="true" />
<field name="doctext" type="text" indexed="true" stored="false" required="false" />
<field name="title" type="text" indexed="true" stored="false" required="false" />
<field name="datecreated" type="date" indexed="true" stored="false" />
</fields>
<uniqueKey>fileid</uniqueKey>
<defaultSearchField>doctext</defaultSearchField>
<solrQueryParser defaultOperator="OR" />
</schema>


As you can see, the XML is pretty straightforward. I'd like to touch on two of the attributes in the field tag:

The first one is "stored". A lot of people make this mistake the first time they mess with Lucene: they store everything in the index. I'd like to stress this point really, really strongly. An index is NOT a data storage mechanism. It's not intended for that, and it's not optimized for that. That's what databases are for. What I mean by that is, the text you send to get indexed gets tokenized and totally bastardized. There's no way to reconstruct the original document from it. So what Lucene allows you to do is store the text as is in the index as well, for later retrieval. That's OK for small indexes or small fields. However, when you get to larger indexes or larger fields, your performance will suffer noticeably. What's most commonly done is that you index everything you need, but only STORE the primary key (e.g. fileid). When you retrieve the items from the index during search, you get the primary key field, and then you query your database, which IS meant for data storage, to get the rest of the fields you need. That's the approach I'll be taking here.

The other attribute I'd like to point out is "type". This tells Solr which kind of field this one will be, and based on the fieldTypes specified earlier, it uses the correct analyzer (tokenizer and filters).

There's also one change I think is worth making in solrconfig.xml. Not too far down in the file you'll see this:



<!-- Used to specify an alternate directory to hold all index data
other than the default ./data under the Solr home.
If replication is in use, this should match the replication configuration.-->
<dataDir>${solr.data.dir:./solr/data}</dataDir>


For some reason the example defaults to putting the index file in the same directory as the Tomcat server. Just comment out that line:



<!-- Used to specify an alternate directory to hold all index data
other than the default ./data under the Solr home.
If replication is in use, this should match the replication configuration.
<dataDir>${solr.data.dir:./solr/data}</dataDir>-->


We'll use the default "Solr Home" folder ($\solr\data\index) to store the index files.

At this point, in your /solr/conf folder, you should have these files: (delete all the other ones)


  • schema.xml (copy and paste the xml from above)

  • elevate.xml

  • protwords.txt

  • solrconfig.xml

  • stopwords.txt

  • synonyms.txt



Part 3: Downloading Some Text Files

OK, now we have most of our environment set up, it's time to get some files! I created a simple application, that connects to http://www.textfiles.com and downloads some files. It then creates a database and adds a record for each file downloaded. You can download it here:

For SQL 2005: http://www.box.net/shared/du4zac9vaa

For SQL 2008: http://www.box.net/shared/vtbod3rub6ibvof91xut

Here's how you use it:

TextFileHarvester.exe "ALEX\SQLEXPRESS" "TextFilesDatabase" "C:\MyTextFiles" 1000

First parameter is the Sql Server name. The second one is the Database name to create. The third one is the location where to store the text files on disk, and the last parameter is how many files to download.

Once that's done, you should have a database that looks something like this:



Just as a side point, the dates are made up. They're just random dates between 1995 and 2005.

You should also have a directory (in the location you specified to the TextFileHarvester app) with many text files in it.

Part 4: Writing Some Code!!

Now comes the fun part. Earlier I mentioned that Solr is basically a web service that you interact with by sending XML over HTTP. You can do all of the work manually and create the XML by hand, but there's an awesome library out there that already does this. The library is open source and called SolrNet. Head on over there and download the DLLs. (There's actually another library called SolrSharp, but that's much older and not as up to date as SolrNet. Also, in my opinion, SolrNet is much easier to use.)

OK, now that we have the SolrNet DLLs, we can create a simple application to index the text files. We'll keep it simple, and do this: connect to the database, get all of the files in one shot, connect to Solr and send them all to be indexed.

The way you send stuff to Solr using SolrNet is by having a class that holds your data, and then adding an attribute to each property that maps it to a field in the index. Let me explain with code:
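To give a feel for what "creating the XML by hand" means: Solr's update handler accepts an add message containing one doc element per document, with one field element per field. Here's a rough sketch using the field names from the schema above. SolrXmlSketch and BuildAddMessage are hypothetical names of mine, and the XML SolrNet actually generates may differ in detail.

```csharp
using System;
using System.Globalization;
using System.Xml.Linq;

public static class SolrXmlSketch
{
    // Builds <add><doc><field name="...">...</field>...</doc></add> for one document.
    public static XElement BuildAddMessage(int fileId, string title, string docText, DateTime? dateCreated)
    {
        // Solr date fields expect UTC in this ISO 8601 form, e.g. 1995-12-31T23:59:59Z.
        string date = dateCreated.HasValue
            ? dateCreated.Value.ToUniversalTime()
                .ToString("yyyy-MM-dd'T'HH:mm:ss'Z'", CultureInfo.InvariantCulture)
            : null;

        return new XElement("add",
            new XElement("doc",
                Field("fileid", fileId.ToString(CultureInfo.InvariantCulture)),
                Field("title", title),
                Field("doctext", docText),
                Field("datecreated", date)));
    }

    // Returns null for missing values; XElement silently skips null content.
    private static XElement Field(string name, string value)
    {
        return value == null
            ? null
            : new XElement("field", new XAttribute("name", name), value);
    }
}
```

You'd POST the resulting XML to http://localhost:8983/solr/update with a content type of text/xml, and follow it with a commit message. That's exactly the plumbing a client library saves you from writing.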



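using System;
using System.IO;
using SolrNet.Attributes;

public class TextFile
{
    #region Members

    // Backing field for the lazily loaded document text.
    private string documentText;

    #endregion

    // Maps to the uniqueKey field ("fileid") in schema.xml.
    [SolrUniqueKey("fileid")]
    public int FileID { get; internal set; }

    // Path of the file on disk. It has no attribute, so it isn't mapped to an index field.
    public string FileLocation { get; internal set; }

    [SolrField("doctext")]
    public string DocumentText
    {
        get
        {
            // Read the file only the first time the text is requested.
            if (this.documentText == null)
            {
                this.documentText = File.ReadAllText(FileLocation);
            }
            return this.documentText;
        }
    }

    [SolrField("title")]
    public string Title { get; internal set; }

    [SolrField("datecreated")]
    public DateTime? DateCreated { get; internal set; }
}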


You add the SolrField attribute (or the SolrUniqueKey for the index primary key field) and provide the name for that field in the index. I made the DocumentText property lazy loaded so we don't hit the file system for every file's text in one shot (there should probably be more error handling there, but hey, this is just a demo...).

SolrNet will require a collection of these objects, and it will then send that off to the index. The next step therefore, is to write some code to populate these objects. We can use Linq to Sql, or whatever you want to connect to the database and populate these objects. I wrote some (very) basic ADO.NET code to do this:



using System;
using System.Collections.Generic;
using System.Data.SqlClient;
using System.Linq;

internal class TextFileRepository
{
    #region Members

    private string connectionString;

    #endregion

    #region Constructors

    public TextFileRepository(string connectionString)
    {
        this.connectionString = connectionString;
    }

    #endregion

    #region Methods

    // Returns every file in the database (used for indexing).
    public IEnumerable<TextFile> GetTextFiles()
    {
        return ExecuteSql("SELECT * FROM FILES");
    }

    // Returns only the files with the given ids (used for searching).
    public IEnumerable<TextFile> GetTextFiles(IEnumerable<int> fileIds)
    {
        if (!fileIds.Any()) { yield break; }
        // The ids are ints, so joining them into the IN clause can't inject SQL.
        string idList = String.Join(",", fileIds.Select(id => id.ToString()).ToArray());
        string sql = String.Format("SELECT * FROM FILES WHERE FILEID IN({0})", idList);
        foreach (var item in ExecuteSql(sql))
        {
            yield return item;
        }
    }

    private IEnumerable<TextFile> ExecuteSql(string sql)
    {
        using (SqlConnection connection = new SqlConnection(this.connectionString))
        using (SqlCommand command = connection.CreateCommand())
        {
            command.CommandText = sql;
            connection.Open();
            // Dispose the reader too, not just the connection and command.
            using (SqlDataReader reader = command.ExecuteReader())
            {
                while (reader.Read())
                {
                    yield return FromReader(reader);
                }
            }
        }
    }

    private TextFile FromReader(SqlDataReader reader)
    {
        var result = new TextFile();
        result.FileID = (int)reader["FileID"];
        result.Title = reader["Title"] as string;
        result.FileLocation = reader["FileLocation"] as string;
        // "as DateTime?" yields null when the column holds DBNull.
        result.DateCreated = reader["DateCreated"] as DateTime?;

        return result;
    }

    #endregion
}


Nothing fancy. If you look at the GetTextFiles method, it's about as straightforward as they come. Just do a SELECT * and convert to the TextFile object. (You'll notice a GetTextFiles with a parameter, but that's used for searching, as I'll explain shortly.)

Part 4.1: Indexing the Files

Now, we can write some code to Index these files:



using Microsoft.Practices.ServiceLocation;
using SolrNet;

public class BasicIndexer
{
    private string connectionString;
    private string solrUrl;

    public BasicIndexer(string connectionString, string solrUrl)
    {
        this.connectionString = connectionString;
        this.solrUrl = solrUrl;
    }

    public void IndexFiles()
    {
        // Register the TextFile mapping and the Solr URL with SolrNet.
        Startup.Init<TextFile>(this.solrUrl);
        var solrWorker = ServiceLocator.Current.GetInstance<ISolrOperations<TextFile>>();
        // Pull every file from the database, send them to Solr, then commit.
        var files = new TextFileRepository(this.connectionString).GetTextFiles();
        solrWorker.Add(files).Commit();
    }
}


Pretty simple no? First, we need to call the Init method. Internally SolrNet uses IoC to handle the instantiation of the classes. Therefore, we can just use Microsoft.Practices.ServiceLocation from the Enterprise Library. (I'm not a huge fan of the Enterprise Library, but it's quick and easy for this example. Refer to the SolrNet documentation for better approaches using Castle Windsor.) Once we have the ISolrOperations, we can just call the Database to get all the files, and then call Add on the ISolrOperations. Once that's done, we just call Commit() and that's it!!

Let's give it a whirl. Create a quick ConsoleApplication and call the BasicIndexer class:



public class Program
{
    public static void Main(string[] args)
    {
        string connectionString = "";
        string solrUrl = "";

        BasicIndexer indexer = new BasicIndexer(connectionString, solrUrl);
        indexer.IndexFiles();
    }
}


Fill in the blanks for your specific connection string, and your Solr URL (e.g. http://localhost:8983/solr). MAKE SURE YOU RESTART TOMCAT so that Solr picks up the new configuration files!! (Start -> Programs -> Configure Tomcat -> Start)! Run the program and voilà, you just indexed all of your files! (In a real world scenario, you'd obviously batch this operation, getting only x amount from the database at a time, possibly even multithreading it to maximize performance, but again this is just a demo :) ). Now let's see if our indexing in fact worked. Open your browser to http://localhost:8983/solr (or whatever the Solr URL is) and click on Solr Admin. In the Search box enter "*:*" (the Lucene equivalent of SELECT *) and click the Search button. You SHOULD see this:



(depending on how many files you decided to download using the Harvester application you'll see different totals). Congrats! You've successfully indexed the files!
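Since I mentioned batching, here's one way it could be sketched: a small extension method that chops any sequence into fixed-size chunks. InBatchesOf is a hypothetical helper of mine, not something SolrNet ships, and it assumes a batch size greater than zero.

```csharp
using System.Collections.Generic;

public static class BatchingSketch
{
    // Lazily splits a sequence into lists of at most "size" items (size must be > 0).
    public static IEnumerable<List<T>> InBatchesOf<T>(this IEnumerable<T> source, int size)
    {
        var batch = new List<T>(size);
        foreach (var item in source)
        {
            batch.Add(item);
            if (batch.Count == size)
            {
                yield return batch;
                batch = new List<T>(size); // start a fresh list for the next batch
            }
        }

        // Hand back whatever is left over as a final, smaller batch.
        if (batch.Count > 0)
        {
            yield return batch;
        }
    }
}
```

In the indexer you could then loop over something like files.InBatchesOf(500), call Add on each batch, and Commit once at the end. The 500 is an arbitrary number; the right size depends on how big your documents are.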

Part 4.2: Searching the Files

Now we want to search using C#. That too is VERY easy using SolrNet. The only thing to remember here, is like I mentioned earlier, we didn't STORE the fields in Solr. All we stored was the FileID. Therefore, we need to first retrieve those file id's, and then hit the database to get the rest of the information. It may seem like double work, but TRUST me!!! when dealing with larger data sets, it's MUCH faster. So, first let's create a class to hold the FileID's:



internal class FileIDResult
{
    [SolrField("fileid")]
    public int FileID { get; set; }
}


Same concept as indexing. We use the SolrField attributes for the properties that will map to the index fields.

Now, we can write some code to execute the search. There are a few parameters I'd like to discuss first. Aside from the search query itself, you need to specify the amount per page, and which item to start from. Solr has paging built right into it, so the way it works is, you specify how many items you want per page, and then how many items to skip over. So for example, if you have 100 results, and you have 10 items per page, and you want page 3, you'd skip the first 20 and start at item 20 (the offset is zero based, so page 3 is items 20 through 29). With that in mind, here's the searching code:



using System.Collections.Generic;
using System.Linq;
using Microsoft.Practices.ServiceLocation;
using SolrNet;
using SolrNet.Commands.Parameters;

// Small wrapper so Startup.Init is only called once per document type.
internal static class SolrOperationsCache<T>
    where T : new()
{
    private static ISolrOperations<T> solrOperations;

    public static ISolrOperations<T> GetSolrOperations(string solrUrl)
    {
        if (solrOperations == null)
        {
            Startup.Init<T>(solrUrl);
            solrOperations = ServiceLocator.Current.GetInstance<ISolrOperations<T>>();
        }

        return solrOperations;
    }
}


public class BasicSearcher
{
    private string connectionString;
    private string solrUrl;

    public BasicSearcher(string connectionString, string solrUrl)
    {
        this.connectionString = connectionString;
        this.solrUrl = solrUrl;
    }

    public SearchResults Search(string query, int resultsPerPage, int pageNumber)
    {
        var solrWorker = SolrOperationsCache<FileIDResult>.GetSolrOperations(this.solrUrl);

        QueryOptions options = new QueryOptions
        {
            Rows = resultsPerPage,
            // pageNumber is 1-based, Start is a zero-based offset.
            Start = (pageNumber - 1) * resultsPerPage,
        };

        ISolrQueryResults<FileIDResult> results = solrWorker.Query(query, options);

        // Solr returns the ids in ranked order, but the database doesn't know
        // about that order, so put the rows back in the order Solr gave us.
        var ids = results.Select(r => r.FileID).ToList();
        var filesById = new TextFileRepository(this.connectionString)
            .GetTextFiles(ids)
            .ToDictionary(f => f.FileID);
        var textFiles = ids
            .Where(id => filesById.ContainsKey(id))
            .Select(id => filesById[id])
            .ToList();

        var searchResults = new SearchResults
        {
            Results = textFiles,
            QueryTime = results.Header.QTime,
            TotalResults = results.NumFound
        };

        return searchResults;
    }
}

public class SearchResults
{
    public IEnumerable<TextFile> Results { get; set; }
    public int QueryTime { get; set; }
    public int TotalResults { get; set; }
}


Same idea as before. Get an instance of ISolrOperations, and call the Query method. I used the overload that takes a QueryOptions object so that I can specify the page number and items per page. The result that comes back from that method call is an ISolrQueryResults which, in addition to the search results themselves, has some metrics. I wrapped all that up in the SearchResults class. Once we have the file ids, we can hit the database and get the rest of the data.
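The paging arithmetic is easy to get off by one, so here it is in isolation as a small sketch. It assumes 1-based page numbers, the same way the Search method above does. PagingSketch is just an illustrative name; the page count is something you'd typically compute from TotalResults when rendering a pager.

```csharp
public static class PagingSketch
{
    // Zero-based offset of the first item on a 1-based page.
    public static int StartOffset(int pageNumber, int resultsPerPage)
    {
        return (pageNumber - 1) * resultsPerPage;
    }

    // Number of pages needed to show totalResults items (integer ceiling division).
    public static int PageCount(int totalResults, int resultsPerPage)
    {
        return (totalResults + resultsPerPage - 1) / resultsPerPage;
    }
}
```

So with 10 results per page, page 1 starts at offset 0 and page 3 starts at offset 20, and 101 total results need 11 pages.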

Let's give it a whirl:



using System;

public class Program
{
    public static void Main(string[] args)
    {
        string connectionString = "";
        string solrUrl = "";

        BasicSearcher searcher = new BasicSearcher(connectionString, solrUrl);
        // "*:*" matches everything; ask for page 1 with 10 results per page.
        var results = searcher.Search("*:*", 10, 1);
        foreach (TextFile file in results.Results)
        {
            Console.WriteLine("FileID: {0}, Title: {1}", file.FileID, file.Title);
        }
    }
}


If you run this, you should see the file id's and titles for the first 10 documents! Congratulations, you've just executed a search!!

OK, this has gotten WAAAAAY longer than I ever imagined. There's A LOT going on here, I won't deny that. There are a few random closing points I'd like to make:


  • There are many Tokenizers and Analyzers for various kinds of data. If you need to tweak one or create your own, you're SOL if you want to do it in C#; you'll have to blow the dust off that old Java book and do it there.

  • The actual index is stored under $\solr\data\index. Sometimes it's useful to actually look into the index files and read them. For that you can use Luke (a standalone Java application.)

  • In the real world, you'd possibly have your Indexing Server somewhere other than where the data is. Keep that in mind, as the XML sent across can get quite big, so see if you can optimize that.



There's so much more to Lucene and Solr than what I covered here. Here are some links for more reading:

Lucene Home Page
Lucene tutorial. (While this is an actual Lucene tutorial of how to use the Lucene Java libraries, it has some great insight in to what's happening behind the scenes.)
Solr Home Page
Solr Wiki
The various Tokenizers and Analyzers available from Solr

That's all I can think of for now. If there's anything else, I'll add it in the future.

Lastly, here's a zip file containing everything I covered in this post. You'll find the TextFileHarvester app. You'll find the solr configuration files I've used for this demo. You'll also find a simple library I created to wrap the indexing and searching functionality of these text files, as well as a basic windows application to test both. Good luck, and Happy Coding!

P.S. This tutorial took a lot of time and effort. If you find it useful, just drop me a line in the comments.

36 comments:

Ali Oral said...

Great article! It covers almost everything that you need to start a .Net project with Solr. Most of the Solr/Lucene related code samples are in java. It is good to see how easy is to consume Solr with .NET also.

David Craft said...

Hey man.. really great article..

I'm using a very old version of SolrNet.. Looks like the Attribute stuff i new. Currently i'm using XSL to translate an xml document into a solr xml file and using SolrNet to post it to Solr.

Solr is very addictive. I've just discovered the LocalSolr addon which allows you to do Geo Spacial Searches over Solr by Long/Lat and Radius.

I also use windows with Tomcat and Solr 1.4..

If you want you can check out how to install LocalSolr on my blog

http://www.craftyfella.com/2009/12/installing-localsolr-onto-solr-14.html

Unknown said...

Seems like a lot of java tomfoolery for a .net group. Have you tried the LucidWorks installerty at http://www.lucidimagination.com/Downloads/Lucidworks-for-Solr ... it configuresjet all this tomcat (or jetty stuff) for you

A.Friedman said...

@John: While you are right and the package from LucidWorks does automate installing Tomcat and Solr, I feel like it A) doesn't save you all that much time from doing it yourself, and more importantly B) it kind of hides the details from you. I feel like it's important when writing a complete tutorial, to demonstrate all the steps involved. That way you understand what Tomcat is, what the Solr Home folder is, what the Tomcat and Solr configs are, where they're stored, how you can change the location of those files etc etc. I just think it's important to show the detailed steps, rather than hiding it and letting it happen automatically.

Bottom line though, the main point of this tutorial was more about communicating with Solr from C#. I think I did a decent job of demonstrating that.

Unknown said...

Hi..This is great !!
But i want to know that how i can use SolrJ in .net ?
Is there any other option otherthen converting Solj.jar lib to .net dll throught IKVM ?

A.Friedman said...

@Jagdish: I think IKVM is your only option. Personally, I've only ever messed with IKVM for the fun of it, never actually tried it on large production .jar files.

That said, SolrNet is a great client, and there's no reason not to use it IMHO.

Frank Birch said...

Hi - I liked the article - it's a great starting point. I've tried to work through the demo you've put together. The first part indexing - went fine. But when I tried to build a version of your "search" code I ran into a problem with "SolrOperationsCache" used in the BasicSearcher class which I've been unable to resolve. It doesn't seem to be in the 0.2.3 SolrNet namespace. I checked 0.2.0 as well in case it used to exist, but has since been retired. If you can point me to where it lives I'd be grateful.

best regards, Frank Birch

A.Friedman said...

@Frank: That's my mistake. It's actually a class that I created myself as a wrapper around the Startup.Init call. Here's the code:

internal static class SolrOperationsCache<T>
where T: new()
{
private static ISolrOperations<T> solrOperations;

public static ISolrOperations<T> GetSolrOperations(string solrUrl)
{
if (solrOperations == null)
{
Startup.Init<T>(solrUrl);
solrOperations = ServiceLocator.Current.GetInstance<ISolrOperations<T>>();
}

return solrOperations;
}

}

If you download the code I linked at the top of the blog post, you'll find it there. My bad, I'll update the blog post when I get a chance.

Frank Birch said...

Many thanks - I'll give it a whirl during the day.

Unknown said...

hi,
i use this library but i have one problem what solr url i put because if i put http:\\localhost:8983\Solr then it gives me "500 internal server error" so what i put in that url

Alyona said...

Hi!
Great tutorial, easy to follow!

I am interested in the problem of retrieving the original text of the document.

You suggest that the Lucene index stores just the file ID and hence the database storing the corresponding file location is required.

What if the Lucene index stored the file location directly? Then it wouldn't be necessary to have the database, right?
What are the advantages of using SQL Server in this application?

Thanks,
Aly

A.Friedman said...

@Aly: Regarding retrieving the text from the documents, that's a whole 'nother can of worms. It really depends on which kind of document you're dealing with, as each has its own suite of options. For Office documents, you can look into the Microsoft Interop libraries (which are MUCH easier to use now with .NET 4.0). For PDFs you can look into iTextSharp. Basically, each format has its own challenges. Oh, and iFilters are another option.

As for your second point regarding storing the path in the index, that's actually a nice idea, and for my simple example would work. In the real world though, you wouldn't want to do that for a few reasons:

A) With a database you can store a lot more metadata that would be lost if you stored just the path in the index. For example, in my sample, when I display the search results, I display the title of the article. How would you do that without a DB?

B) What if you need to change servers and the location of all your files changes? Good luck updating the index! With a database, that's trivial.

There are other more subtle benefits for using a DB, but the main point is that an index isn't a storage medium, that's what the DB is for.
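
As a rough illustration of that split, the sketch below treats the search hits as nothing more than a list of IDs and resolves them against a lookup table that stands in for the database. Everything here (`FileRecord`, `ResultResolver`, the in-memory dictionary) is hypothetical and not part of the sample code; a real application would query SQL Server instead.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical row shape: the metadata lives in the database, not in the index.
public class FileRecord
{
    public int FileId { get; set; }
    public string Title { get; set; }
    public string Path { get; set; }
}

public static class ResultResolver
{
    // Maps the IDs returned by the index to their metadata rows, keeping the
    // hit order and skipping IDs that have no row in the table.
    public static List<FileRecord> Resolve(
        IEnumerable<int> hitIds, IDictionary<int, FileRecord> table)
    {
        return hitIds
            .Where(id => table.ContainsKey(id))
            .Select(id => table[id])
            .ToList();
    }
}
```

With this arrangement, moving the files to a new server means updating the `Path` values in one table, while the index, which only knows the IDs, stays untouched.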

HTH
--Alex

Alyona said...

@Alex This makes sense. Thanks for your quick answer!

sun.parth said...

I am getting an error while running the exe from the DOS prompt:
"Failed to connect to server c18/MSSQLSERVER2008".

Please help.

A.Friedman said...

@sun: That little app uses SMO to create the database and populate it. Unfortunately, when I wrote that app, I was targeting SQL2005, and never got around to updating it to support SQL2008 as well. Send me your email address, and I'll try to send you an updated version.

--Alex

bea said...

I am running into trouble when I replace the schema.xml file content with the index schema in the tutorial. The error I get when I try to navigate to ".../solr/admin" after doing this is:

"QueryElevationComponent requires the schema to have a uniqueKeyField implemented using StrField at org.apache.solr.handler.component.QueryElevationComponent.inform(QueryElevationComponent.java:157)"

I'm wondering if anyone else has had and solved this problem? I can't find much to help through Google, and can't understand why there would be an insistence that a uniqueKeyField be a StrField, if that's what this means. The only configuration change I have made to Solr is to set the "solr/home" environment entry in the "Tomcat\webapps\solr\WEB-INF\web.xml" file to "C:\Solr", as for some reason things wouldn't work just setting "-Dsolr.solr.home=c:\solr" in the Java Opts. Sorry, a bit long. Really grateful for any help anyone can give on this.

Thalaiselvam said...

1) Is it possible to create more than one data folder? How can the Solr home folder be configured through code?

2) Can we index records from multiple tables in a single data folder?

Kindly advise.
Thanks

Thalaiselvam said...

Hi,

Can we create multiple entities in a single data file? (Could you give me a sample schema.xml for multiple entities?)

Satheesh said...

Excellent article!! You have covered almost everything for dummies. Good job.

bea said...

I resolved my problem by removing the QueryElevationComponent and its handler from the solrconfig.xml file.
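
For anyone hitting the same error, the change probably looks something like the fragment below. This assumes a solrconfig.xml close to the stock example shipped with Solr 1.4; the element names and values in your copy may differ, so treat it as a sketch rather than an exact patch.

```xml
<!-- Disabled: QueryElevationComponent requires the uniqueKey field to be a StrField
<searchComponent name="elevator" class="solr.QueryElevationComponent">
  <str name="queryFieldType">string</str>
  <str name="config-file">elevate.xml</str>
</searchComponent>

<requestHandler name="/elevate" class="solr.SearchHandler" startup="lazy">
  <lst name="defaults">
    <str name="echoParams">explicit</str>
  </lst>
  <arr name="last-components">
    <str>elevator</str>
  </arr>
</requestHandler>
-->
```

Going by the error message, the other way out would be to declare the uniqueKey field with a string type in schema.xml and keep the component, at the cost of treating the ID as text.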

Héctor said...

This is a great article. I like it very much; it is very clear and helpful for a beginner. Can you tell me where I can find more information, documentation, examples or a book about SolrNet?
I'm starting to work with this and I need more examples, and the Solr page is not very easy to understand. Can we use facets with SolrNet to search with metadata associated with the files? Does SolrNet support facets?
Thanks for your work!!

Unknown said...

First of all, great article! Very well written and easy to follow. Not sure if this is a newbie mistake, but I wasn't sure what to replace "ALEX\SQLEXPRESS" with on my local PC to get the TextFileHarvester to work. After digging around a bit, I found that my setup needed ".\SQLEXPRESS" - just wanted to add this comment in case anyone else ran into it.
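
To spell that tip out: a SQL Server instance is addressed as `machine\instance`, and `.` is accepted as shorthand for the local machine. The tiny helper below is purely illustrative (the `SqlServerNames` class is made up and not part of TextFileHarvester); it just shows the two forms side by side.

```csharp
public static class SqlServerNames
{
    // Server name for a named instance on the local machine, e.g. ".\SQLEXPRESS".
    public static string LocalInstance(string instanceName)
    {
        return @".\" + instanceName;
    }

    // Server name for a named instance on a specific machine, e.g. "ALEX\SQLEXPRESS".
    public static string NamedInstance(string machineName, string instanceName)
    {
        return machineName + @"\" + instanceName;
    }
}
```

`SqlServerNames.LocalInstance("SQLEXPRESS")` gives the value that worked above, while `SqlServerNames.NamedInstance(System.Environment.MachineName, "SQLEXPRESS")` spells out the machine name the way the original "ALEX\SQLEXPRESS" did.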

N said...

Hi,
It's a great article, and I am working on multifaceted searching. It is very helpful to me. Thank you for this article.

Batagoda said...

Great Article. Probably the best article on the web to start Solr integration with .net

Unknown said...

Thanks for the article, every bit of "I used X and this is how it works" is absolute gold for newbies scrambling to keep up.

I saw a mention of LucidWorks Enterprise. I have to say it's a really quick and easy way of getting Solr installed and configured on Windows. Having said that, the licensing (which is necessary for any actual use) is prohibitive for all but large businesses and commercial websites. Wish there was something else.

Matt Dameron said...

Can you send me the updated app to support SQL2008 as well? Or the code? matt@incasestudio.com

A.Friedman said...

I've updated the post to include a link to a TextFileHarvester application that works with SQL Server 2008 as well. Make sure you have the SMO DLLs installed on your machine (if you have Management Studio then you'll have them).

Prakash said...

Hi, this article is really helpful. I am getting a SQL string error while debugging your code. Can you help me out, please?

Prakash said...

Thanks for the article. It's a great help. Can we index .pdf and .doc files as well?

qntmfred said...

Very helpful, thanks!

Raja said...

A.Friedman, excellent tutorial with just the right amount of detail for any .NET developer to get started on Solr. Thank you for taking the time to write such a useful tutorial.

Raja

dmead said...

Nice article! We're trying to get Solr integrated with our .NET app and your article is the first one we've read - it really takes you through step by step - thanks!

sebusarg said...

Very nice post. I've been looking for something like this and hadn't found much.

Thank you very much, I'll give it a try!

plemon said...

Very good and detailed article to get one started with Solr, very easy to understand and code shows just what you need, kudos!

Unknown said...

Really enjoyed reading this, now time to go play with it!