Difference between revisions of "Reference/Search"


Revision as of 13:20, 25 February 2013

There are two forms of search on Dreamwidth installations: both are optional and require further setup.

User search searches for users matching certain characteristics. Text search searches through entries and comments. This page will focus on text search.


User Search

User search is not heavily documented. Instructions for setup can be found in Set_up_UserSearch.

Text Search

Dreamwidth uses Sphinx, an open-source search package, to implement text search. Search is available on the search page. There are two modes of search: site search, and per-journal search.

Site search only shows public content. Journal search may include locked content, following the regular rules for whether you can see it: if you can see an entry on the journal, you can find it with search, and if you can't see it on the journal, it won't appear in your search results. There's also an option to search comments. For technical reasons (site load), only comments made on paid users' journals are indexed.

Text search is resource-intensive and is a separate system from the main Dreamwidth site. This makes it possible to run on a different machine from the webservers on a production site. You don't have to worry about this too much on a development server where it's basically just you on the site. Still, be warned that it might be good to only turn on the search workers when you're testing something specific.

Installation

You'll need to install the Sphinx package and a couple of Perl modules that make it easy for us to use Sphinx.

Installing the Sphinx package

You will need to download the Sphinx package:

 wget http://sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz

And then install it:

 tar -zxvf sphinx-0.9.9.tar.gz
 cd sphinx-0.9.9/
 ./configure
 make
 make install    # as root
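
Once the install finishes, it can help to confirm that the Sphinx tools used later on this page are actually visible on your PATH. The following is a minimal sketch, not part of the Dreamwidth codebase: the function name `check_sphinx_tools` is made up for illustration, and the tool names are simply the ones this page runs later (`indexer`, `searchd`, `search`).

```shell
# Sketch: report which Sphinx command-line tools are visible on PATH.
# check_sphinx_tools is a hypothetical helper, not a Sphinx or Dreamwidth command.
check_sphinx_tools() {
    missing=0
    for tool in indexer searchd search; do
        if command -v "$tool" >/dev/null 2>&1; then
            echo "found: $tool"
        else
            echo "missing: $tool"
            missing=1
        fi
    done
    # Exit status is 0 only if every tool was found.
    return "$missing"
}
```

If anything is reported missing, a likely cause is that the install location (by default under /usr/local for a plain `./configure`) is not on your PATH.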

Installing File::SearchPath and Sphinx::Search

These are available via Ubuntu's package system, so:

 apt-get install libfile-searchpath-perl libsphinx-search-perl

It is important that you match up the versions of the Perl packages and the Sphinx package; otherwise, your searches will silently fail due to incompatibilities in the API. Assuming that all works, you should have everything installed that you need to get the search system set up. Moving on!

Setup

Database

You will need to create a new database. Something like this will work, but make sure to adjust the username and password:

 CREATE DATABASE dw_sphinx;
 GRANT ALL ON dw_sphinx.* TO dw@'localhost' IDENTIFIED BY '__YOURPASSWORD__';
 USE dw_sphinx;
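
If you'd rather not hand-edit the statements each time, they can be generated with the username and password filled in. This is only a sketch: `search_db_sql` is a made-up helper name, it emits the same three statements shown above, and it assumes the password contains no single quotes.

```shell
# Sketch: print the database setup statements with your own user and password.
# search_db_sql is a hypothetical helper; it does not talk to MySQL itself.
# Assumes the password contains no single quotes.
search_db_sql() {
    db_user=$1
    db_pass=$2
    cat <<EOF
CREATE DATABASE dw_sphinx;
GRANT ALL ON dw_sphinx.* TO ${db_user}@'localhost' IDENTIFIED BY '${db_pass}';
USE dw_sphinx;
EOF
}
```

For example, `search_db_sql dw 'secret'` prints the statements so you can review them before pasting them into your MySQL client.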

Now you have to create the tables:

The table is pretty straightforward. It just stores the posts, who they're by, where they're located, and some basic security information. Note that this table holds the full (compressed) subject and text of the entries, so it can get rather large.

Site Configuration

Configuring your site is next. This involves adding a new section to your %DBINFO hash, like this:

  sphinx => {
      host => '127.0.0.1',
      port => 3306,
      user => 'dw',
      pass => '__YOURPASSWORD__',
      dbname => 'dw_sphinx',
      role => {
          sphinx_search => 1,
      },
  },

You also need to add a configuration elsewhere in the file that tells your system where the search daemon will be. Port 3312 is the default:

  # sphinx search daemon
  @SPHINX_SEARCHD = ( '127.0.0.1', 3312 );

That's it for site configuration. Once you have the above two options in, then your site will do all the right things to make the search active. Of course, we still have to configure Sphinx itself...

Sphinx

The first step, assuming you're going to be running Sphinx as root, is to make the directory it needs:

  mkdir /var/data

Now, we need to set up the configuration file. By default, Sphinx will look for the file in `/usr/local/etc/sphinx.conf`. To confirm, run this:

 indexer --quiet

It will fail if it can't find a config file, but will helpfully tell you where it tried to look. Now that we know where the config file is, we need to replace it with this:

That's right. It's long. But it's actually almost identical to the configuration file that comes with Sphinx. There are a lot of tweaks in it to figure out the right combination of values for UTF-8 support and the like, but the rest is pretty straightforward.

Make sure to customize `sql_user` and `sql_pass` in the configuration file to match what you used earlier.

To make sure that your test setup is working, once you have all of the configuration done, try running the indexer (as root):

 indexer --all

You should see it spit out some stuff saying it's collecting documents, and if all goes well, you should see files in /var/data. You won't be able to search yet because you haven't placed any data in your search database, but you'll at least have confirmed that you have Sphinx configured properly.
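
The "did the indexer write anything?" check can be scripted. Here's a rough sketch, assuming the index files land in /var/data as described above; the function name `index_files_present` is made up, and it only checks that some regular file exists, not that the index is valid.

```shell
# Sketch: succeed if the directory exists and contains at least one regular file.
# index_files_present is a hypothetical helper; /var/data is the default
# because that is the directory created earlier on this page.
index_files_present() {
    dir=${1:-/var/data}
    [ -d "$dir" ] || return 1
    for f in "$dir"/*; do
        # If the glob matched nothing, "$f" is the literal pattern and -f fails.
        [ -f "$f" ] && return 0
    done
    return 1
}
```

For example: `index_files_present /var/data && echo "indexer produced output"`.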

The Search Process

Getting content into search requires a few things:

  • a database containing entry/comment text - we have a separate database for the text of entries/comments that we want to be searchable. We copy each entry/comment when it's posted or edited from the main database into the search database
  • a Sphinx index - doing a search on raw text is painfully slow, so Sphinx processes the contents of the search database further, creating an index of the words. Processing the text this way also makes it possible for a search for "test" to turn up "tests", "testing", etc

Getting search results involves a couple more:

  • a Sphinx daemon - actually runs the search
  • a search worker - connects to Sphinx and retrieves the results

For search to work, you have to first set up TheSchwartz and set up Gearman.

The full process of a search is as follows:

  1. schedule for copying any entries/comments that should be searchable

    All entries/comments posted before you set up search will need to be copied over. The following script will schedule a full copy of entries/comments for all accounts on your server:

      bin/schedule-copier-jobs

    Entries/comments posted or edited after you set up search will be automatically scheduled to be copied.

  2. copy the original text into the search database
     bin/worker/sphinx-copier -v
    
  3. run the Sphinx indexer to make the text searchable (as root)
     indexer --all
    

    After this, the text is finally searchable. If you want, you can test by running:

     search -q words
    

    And if that works, you're on your way to making search work on the site itself.

  4. run the Sphinx search daemon. Put this in a window, and leave it alone until you're done. There's some debugging output in console mode:
     searchd --console
    
  5. make sure that the search worker is running
     bin/worker/sphinx-search-gm -v
    

After the first time, you'll only need to care about step 2 onwards. All scheduling as in step 1 will happen automatically.
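
As a quick reminder of which commands apply when, the steps above can be sketched as a small helper that prints them in order. This is illustrative only: `search_steps` is a made-up name, and it deliberately prints the commands rather than running them, since `searchd --console` is meant to stay open in its own window and the indexer needs root.

```shell
# Sketch: print the commands for each search step, in order.
# Pass "first" for a first-time setup; anything else skips step 1,
# since scheduling happens automatically after the initial copy.
# search_steps is a hypothetical helper, not part of Dreamwidth.
search_steps() {
    if [ "${1:-}" = "first" ]; then
        echo "bin/schedule-copier-jobs"        # step 1: schedule the full copy
    fi
    echo "bin/worker/sphinx-copier -v"         # step 2: copy text to the search database
    echo "indexer --all"                       # step 3: build the index (as root)
    echo "searchd --console"                   # step 4: search daemon, in its own window
    echo "bin/worker/sphinx-search-gm -v"      # step 5: search worker
}
```

Run `search_steps first` on a fresh setup, or `search_steps` on later runs to list only step 2 onwards.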
