Prologue
In 1999, when I was in my first year of university, if I wanted to check my e-mail I had to come into a lab. I used to come into the class, open browser, type address and then go out to grab some coffee. Usually when I came back I was really happy to see that at least half of the page had been loaded.
Today people are not so patient. They used to get response from the web-sites at least in a few seconds and if your application is not that fast, you have a problem. In fact, “few seconds” is a very optimistic term. Nowadays we measure response time with milliseconds.
If you want to create something really popular in the Web (and your application operates big amount of data), sooner or later you will face the need to implement full text search engine over your system.
My Goals
In this post I will try to describe and compare two technologies – Microsoft SQL Server Full Text Search and Apache Lucene. The first one is an embedded SQL Server feature and the second one is a third-party software library, originally developed in Java and ported to many other platforms, including .NET.
How to build a super fast search for your web application.
Introducing SQL Server Full Text Search
Microsoft SQL Server supports full text search since 1998, when version 7.0 was released. The best thing about it is that it is embedded part of SQL Server, which provides you all benefits of full integration with RDBMS.
So what are the benefits of using full text search over “like” operator? The main difference is in underlying index structure. Regular database index is built on the whole field value, full text search technique uses inverted index instead. FTS involves indexing the individual words within a text field in order to make searching through many records quick (tokenization). Using “like” still requires a string search within the field. So specific index organization provides such benefits as “precision vs recall tradeoff”, high search performance, stemming, ranking and many others typical FTS features. You will have to create this index before you can start searching, and after that SQL Server will handle updating the index automatically.
How Lucene Search Works
Lucene Search also use inverted index, but since it is just a software library, you will have to manage index by yourself. This and other major differences between two technologies are shown in the table below:
Comparing MS SQL Full Text Search and Lucene
Lucene | MS SQL FTS | |
Index auto update | No | Yes |
Store data in index | Yes | No |
Location in RAM | Yes | No |
Interface | API | SQL |
Queering multiple columns | Yes | Yes |
Stop words, synonyms, sounds-like | Yes | Yes |
Custom Index Documents Structure | Yes | No |
Wildcards | Yes | With restrictions |
Spellchecking, hit-highlighting and other extensions | Provided in “contrib” extensions library | No |
So, MS FTS will handle updating the index. On the other hand, Lucene provides the opportunity to put it into a RAM and also allows to store data right in the index, which obviously increases the performance. MS FTS is a part of RDBMS so you will have to write SQL-queries (using specific FTS statements like CONTAINS, FREETEXT etc) and Lucene provides you an API.
Both of them support stop words, synonyms, stemming and multiple columns queering. Lucene allows you to use more query operators than FTS and also supports wildcards in the beginning and middle of the words, when FTS allows it only in the end. With Lucene you can choose any custom structure for your index (analyze and store data from different sources in one document); FTS allows you to create index only on one table (but you can use indexed views as a workaround).
Another great thing about Lucene is that many useful extensions like spellchecking or hits highlighting are already implemented and ready to use. But in case with SQL Server FTS you will have to create it by yourself (using CLR or client application code).
Getting to the Specifics
Some specific functionality you will have to implement manually using both Lucene and FTS. But there will be a difference in the approach and eventual performance. Let’s see it on a small example.
We took Wikipedia dump for our tests. Let’s assume, that we want to display most relevant articles that were recently modified in the top of search results.
SQL-statement for FTS will be look like that:
select top 20 p.title, p.text
from CONTAINSTABLE(dbo.WikiArticles, text, 'cats') ft
join dbo.WikiArticles p on ft.[KEY] = p.id
order by case when datediff(dd, p.date, @today) > @maxdaysago then ft.rank else ft.rank* (1 + @multiplier * (( @maxdaysago - datediff(dd, p.date, @today)) / @maxdaysago)) end desc
We have to join main table with CONTAINSTABLE results to get dates and then add to query complex “order by” clause, which makes it pretty hard for the SQL Server engine.
And here is the source code for Lucene:
(You can find formatted code in Appendix I in the bottom of the post)
We’ve created two new classes (derived from Lucene standard classes) and override just one standard method. Than we can use it in search queries like that:
IndexSearcher searcher = new IndexSearcher(reader);
searcher.SetDefaultFieldSortScoring(true, true);
QueryParser parser = new QueryParser(Lucene.Net.Util.Version.LUCENE_29, "text", new StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29));
Query query = parser.Parse("cats");
Query boostQuery = new RecencyBoostingQuery(query, 15, 100, "date");
Sort sort = new Sort(new SortField[]{SortField.FIELD_SCORE, new SortField("title", SortField.STRING)});
TopDocs hits = searcher.Search(boostQuery, null, 20, sort);
The only difference from a standard Lucene search in the listing above is using RecencyBoostingQuery with required boosting parameters instead of standard Query class.
We’ve tested these queries for SQL Server and Lucene on the same environment (Intel i5-3330 @ 3.0 GHz RAM: 8GB) and it appears that Lucene handles it at least 5 times faster. The results of this and other performance tests are shown in the tables below:
Indexing speed, size and single query execution time
Lucene | MS SQL FTS | |
Indexing Speed | 3 MB/sec | 1 MB/sec |
Index Size | 10-25% | 25-30% |
Simple query | <20 ms | < 20 ms |
Query With Custom Score | < 4 sec | >20 sec |
Parallel Query Executions (10 threads, average execution time per query in ms)
MS SQL FTS | Lucene (File System) | Lucene (RAM) | ||
Cold System | Simple Query | 56 | 643 | 21 |
Boost Query | 19669* | 859 | 27 | |
Second executions | Simple Query | 14 | 8 | <5 |
Boost Query | 465 | 17 | 9 |
*average time, the very first query could be executed up to 2 min(!)
As you can see, SQL Server is a little faster with simple queries executed first time, and is way behind in case of complex query with custom scoring criteria. And Lucene with index placed in RAM is unbeatable favorite.
Conclusion
So if your performance requirements are critical and you have enough RAM to store the whole index, you probably have to choose Lucene as your search engine. Same choice will be reasonable when you are going to build a complex system with a lot of custom functionality like spellchecking, hit-highlighting, auto complete, advanced scoring etc. And if your search engine requirements are not so strong, maybe using SQL Server Full Text Search will be the best option – it is very simple to maintain and fast enough with simple queries.
Appendix I
Recency Boosting listing
{
public double Multipier { get; set; }
public int Today { get; set; }
public int MaxDaysAgo { get; set; }
public string DayField { get; set; }
public static int MsecPerDay = 1000*3600*24;
public RecencyBoostingQuery(Query subQuery, double multipier, int maxDaysAgo, string dayField)
: base(subQuery)
{
Today = (int) ((DateTime.Now.Ticks/TimeSpan.TicksPerMillisecond)/(MsecPerDay));
Multipier = multipier;
MaxDaysAgo = maxDaysAgo;
DayField = dayField;
}
protected override CustomScoreProvider GetCustomScoreProvider(IndexReader reader)
{
return new RecencyBooster(reader, this);
}
}
public class RecencyBooster : CustomScoreProvider
{
private readonly int[] _publishDay;
private readonly RecencyBoostingQuery _recencyBoostingQuery;
public RecencyBooster(IndexReader reader, RecencyBoostingQuery recencyBoostingQuery)
: base(reader)
{
_recencyBoostingQuery = recencyBoostingQuery;
_publishDay = FieldCache_Fields.DEFAULT.GetInts(reader, _recencyBoostingQuery.DayField);
}
public override float CustomScore(int doc, float subQueryScore, float valSrcScore)
{
int daysAgo = (_recencyBoostingQuery.Today - _publishDay[doc]);
if (daysAgo < _recencyBoostingQuery.MaxDaysAgo)
{
var boost =
(float)
(_recencyBoostingQuery.Multipier*(_recencyBoostingQuery.MaxDaysAgo - daysAgo)/
_recencyBoostingQuery.MaxDaysAgo);
return (float) (subQueryScore*(1.0 + boost));
}
return subQueryScore;
}