Thursday, February 4, 2010

Exploring CouchDB-Lucene 0.5


Thanks to a fix by the clairvoyant Robert Newson, I was able to get the latest couchdb-lucene built. Now to install it.

Per the README, the Maven build process creates a .zip file that contains everything needed. But what to do with the contents on my development netbook? They do not exactly follow a typical Linux layout:
cstrom@whitefall:~/tmp/couchdb-lucene-0.5-SNAPSHOT$ ls
bin conf lib LICENSE README.md tools
These directories look to be intended as a self-contained unit, perhaps under /opt/couchdb-lucene or /usr/local/couchdb-lucene. For now, I will create a local directory in my home directory, unpack the archive there, and symlink the versioned directory to couchdb-lucene (so that future installs only need to repoint the symlink):
cstrom@whitefall:~$ mkdir local
cstrom@whitefall:~$ cd !$
cstrom@whitefall:~/local$ cp /home/cstrom/repos/couchdb-lucene/target/couchdb-lucene-0.5-SNAPSHOT-dist.zip .
cstrom@whitefall:~/local$ unzip couchdb-lucene-0.5-SNAPSHOT-dist.zip
cstrom@whitefall:~/local$ ln -s couchdb-lucene-0.5-SNAPSHOT couchdb-lucene
With that, I can add the necessary configuration to /etc/couchdb/local.ini:
[couchdb]
os_process_timeout=60000 ; increase the timeout from 5 seconds.

[external]
;fti=/path/to/python /path/to/couchdb-lucene/tools/couchdb-external-hook.py
fti=/usr/bin/python /home/cstrom/local/couchdb-lucene/tools/couchdb-external-hook.py

[httpd_db_handlers]
_fti = {couch_httpd_external, handle_external_req, <<"fti">>}
Lastly, I need to start up the couchdb-lucene server (say, that's new!):
cstrom@whitefall:~/local/couchdb-lucene$ ./bin/run 
2010-02-04 21:33:45,309 INFO [Main] Index output goes to: /home/cstrom/local/couchdb-lucene-0.5-SNAPSHOT/indexes
2010-02-04 21:33:45,416 INFO [Main] Accepting connections with SelectChannelConnector@localhost:5985
So I have my couchdb-lucene server running, and I have restarted my CouchDB server (to pick up the local.ini changes). Now what?

To actually get indexing working, I need a CouchDB design document that contains a "fulltext" object. Since I am only doing exploratory code here, I name the design document "test" and define it as:
{
  "_id": "_design/test",
  "_rev": "1-93d99ffe0bddccd4b1aa521f3a569a50",
  "fulltext": {
    "by_title": {
      "index": "function(rec) { var doc=new Document(); doc.add(rec.title); return doc; }\n"
    }
  }
}
The "fulltext" object can contain any number of index types. Here, I define only one: "by_title". The "by_title" object contains only an "index" function. It can contain other attributes (and I will need some of them eventually), but for exploratory purposes, all I need to do is define the "index" function.
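Because the "index" function is stored in the design document as a string, it can be more comfortable to write it as ordinary JavaScript and serialize it afterwards. The following is only a sketch of that idea: buildDesignDoc is a hypothetical helper of my own, not part of CouchDB or couchdb-lucene, and I am assuming that couchdb-lucene accepts the function text exactly as toString() produces it.

```javascript
// Index function written as ordinary JavaScript. Document is supplied by
// couchdb-lucene when the function runs there; it is never called here.
var byTitle = function(rec) {
  var doc = new Document();
  doc.add(rec.title);
  return doc;
};

// Hypothetical helper: builds a design document whose "fulltext" object
// maps each index name to the source text of its index function.
function buildDesignDoc(name, indexes) {
  var fulltext = {};
  Object.keys(indexes).forEach(function(indexName) {
    fulltext[indexName] = { index: indexes[indexName].toString() };
  });
  return { _id: "_design/" + name, fulltext: fulltext };
}

var designDoc = buildDesignDoc("test", { by_title: byTitle });
console.log(JSON.stringify(designDoc, null, 2));
```

The sketch leaves out "_rev" because CouchDB assigns it; updating an existing design document requires sending the current "_rev" along with the new body.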

The index function will operate on each record in the CouchDB database—that is what the rec input value represents. Inside the function, I create a new couchdb-lucene document object ("new Document()"), add the title from the CouchDB record ("rec.title") to the couchdb-lucene document, and finally return that couchdb-lucene document. If I wanted to index only recipes and exclude the meal in which they were served, I could have added a condition to return null if the rec's type was "Meal". For now, I index everything.
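The recipe-only variation described above might look something like the sketch below. It assumes that each record carries a "type" attribute and that meals have the value "Meal"; the "Recipe" value in the usage lines is my own placeholder. The small Document class is a stand-in so that the function can be exercised outside of couchdb-lucene, which supplies the real one.

```javascript
// Stand-in for the Document class that couchdb-lucene supplies to index
// functions; defined here only so the sketch runs outside couchdb-lucene.
class Document {
  constructor() {
    this.fields = [];
  }
  add(value, options) {
    this.fields.push({ value: value, options: options || {} });
  }
}

// Index function that skips meals and indexes the title of everything else.
// Assumes each record has a "type" attribute set to "Meal" for meals.
function indexByTitle(rec) {
  if (rec.type === "Meal") {
    return null; // returning null leaves the record out of the index
  }
  var doc = new Document();
  doc.add(rec.title);
  return doc;
}

// Usage: a meal is skipped, anything else gets its title indexed.
console.log(indexByTitle({ type: "Meal", title: "Fish Sticks and Leftovers" }));
console.log(indexByTitle({ type: "Recipe", title: "Fish Curry" }).fields);
```

In the design document itself, only the body of indexByTitle would be needed, written as an anonymous function and stored as a string.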

That should do it. I access this test resource with curl thusly:
cstrom@whitefall:~/repos/eee-code$ curl http://localhost:5984/eee/_fti/test/by_title?q=fish
{"q":"default:fish",
"etag":"357888d94e68","skip":0,"limit":25,"total_rows":20,"search_duration":1,"fetch_duration":4,
"rows":[{"id":"2004-10-24-fish","score":3.0870423316955566},
{"id":"2004-02-27","score":3.0870423316955566},
{"id":"2002-04-07","score":3.0870423316955566},
{"id":"2006-02-05-chili","score":3.0870423316955566},
{"id":"2002-02-15","score":2.4696338176727295},
{"id":"2002-02-18","score":2.4696338176727295},
{"id":"2004-03-31-fish","score":2.4696338176727295},
//...
]}
(I added the formatting for readability.)
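The URL in that curl request is made up of the database (eee), the _fti handler from local.ini, the design document (test), the index name (by_title), and the query. A small sketch of assembling it, where ftiUrl is a hypothetical helper name of my own and the escaping is there for queries with spaces or other special characters:

```javascript
// Hypothetical helper: builds the search URL from its parts, following the
// /<db>/_fti/<design doc>/<index>?q=<query> shape used in the curl request.
function ftiUrl(base, db, designDoc, indexName, query) {
  return base + "/" + encodeURIComponent(db) +
    "/_fti/" + encodeURIComponent(designDoc) +
    "/" + encodeURIComponent(indexName) +
    "?q=" + encodeURIComponent(query);
}

console.log(ftiUrl("http://localhost:5984", "eee", "test", "by_title", "fish"));
```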

That seems to be working. To be sure, I could check a sample of those document IDs in the CouchDB server to verify that each has "fish" in the title. Instead, I will try out couchdb-lucene's field store feature, which stores a field so that it can be returned with the results. To use it, I only need to pass {"store":"yes"} as the second argument to the add method of the couchdb-lucene document:
function(rec) { var doc=new Document(); doc.add(rec.title, {"store":"yes"}); return doc; }
Then, when I request the same query I find:
cstrom@whitefall:~/repos/eee-code$ curl http://localhost:5984/eee/_fti/test/by_title?q=fish
{"q":"default:fish",
"etag":"35b5363947ea","skip":0,"limit":25,"total_rows":20,"search_duration":1,"fetch_duration":6,
"rows":[{"id":"2004-10-24-fish","score":3.0870423316955566,"fields":{"default":"Fish Ghosts"}},
{"id":"2004-02-27","score":3.0870423316955566,"fields":{"default":"Fish in a Packet"}},
{"id":"2002-04-07","score":3.0870423316955566,"fields":{"default":"Fish Curry"}},
{"id":"2006-02-05-chili","score":3.0870423316955566,"fields":{"default":"Fish Chili"}},
{"id":"2002-02-15","score":2.4696338176727295,"fields":{"default":"Fish Sticks and Leftovers"}},
{"id":"2002-02-18","score":2.4696338176727295,"fields":{"default":"Sesame Fish Sticks"}},
{"id":"2004-03-31-fish","score":2.4696338176727295,"fields":{"default":"Quick Thai Fish Curry"}},
//...
]}
Indeed, the titles are now returned with the results and, yes, the query is returning only documents with "fish" in the title.
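Rather than eyeballing the titles, the same check could be scripted against the parsed response. This is only a sketch: the response object is a trimmed copy of the first two rows above, and titlesNotMatching is a hypothetical helper of my own. It relies on the stored title coming back under fields.default, as it does in the output above.

```javascript
// Trimmed copy of the search response, limited to the first two rows.
var response = {
  q: "default:fish",
  rows: [
    { id: "2004-10-24-fish", score: 3.0870423316955566, fields: { default: "Fish Ghosts" } },
    { id: "2004-02-27", score: 3.0870423316955566, fields: { default: "Fish in a Packet" } }
  ]
};

// Hypothetical helper: returns the IDs of rows whose stored title does not
// contain the search term (compared case-insensitively).
function titlesNotMatching(response, term) {
  var needle = term.toLowerCase();
  return response.rows
    .filter(function(row) {
      var title = (row.fields && row.fields.default) || "";
      return title.toLowerCase().indexOf(needle) === -1;
    })
    .map(function(row) { return row.id; });
}

console.log(titlesNotMatching(response, "fish"));
```

An empty list means every returned title contains the term.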

That is a good stopping point for tonight. Up tomorrow: using these and more spiffy couchdb-lucene features to get my various search Cucumber scenarios passing again.

Day #4
