Difference between revisions of "Reference/Search"

From Dreamwidth Notes

Revision as of 08:57, 25 February 2013

There are two forms of search on Dreamwidth installations. Both are optional and require further setup.

User search searches for users matching certain characteristics. Text search searches through entries and comments. This page will focus on text search.


User Search

User search is not heavily documented; setup instructions can be found in Set_up_UserSearch.

Text Search

Dreamwidth uses Sphinx, an open-source search package, to implement text search. Search is available on the search page. There are two modes of search: site search, and per-journal search.

Site search only shows public content. Journal search may include locked content, following the usual access rules: if you can see an entry on the journal, you can find it with search; if you can't see it on the journal, it won't appear in your search results. There's also an option to search comments. For technical reasons (site load), only comments made on paid users' journals are indexed.

Text search is resource-intensive and runs as a separate system from the main Dreamwidth site, so on a production site it can run on a different machine from the webservers. You don't have to worry about this much on a development server where you are basically the only user. Still, consider turning on the search workers only when you're testing something specific.

Installation

You'll need to install the Sphinx package and a couple of Perl modules that make it easy for us to use Sphinx.

Installing the Sphinx package

You will need to download the Sphinx package:

 wget http://sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz

And then install it:

 tar -zxvf sphinx-0.9.9.tar.gz
 cd sphinx-0.9.9/
 ./configure
 make
 make install    # as root
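
Before moving on, it can help to confirm that the build put the Sphinx programs on your PATH. The following is a minimal sketch, assuming a POSIX shell. `missing_tools` is a hypothetical helper name, not part of Sphinx or Dreamwidth; `indexer` and `searchd` are the Sphinx programs installed by the steps above.

```shell
# Print the name of every given command that is not on PATH.
# missing_tools is an illustrative helper, not a Sphinx tool.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || printf '%s\n' "$tool"
    done
}

# indexer and searchd are installed by "make install".
missing=$(missing_tools indexer searchd)
if [ -n "$missing" ]; then
    echo "Not found on PATH:" $missing
else
    echo "Sphinx binaries found"
fi
```

If something is reported missing, check that the install prefix (by default under /usr/local) is on root's PATH.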

Installing File::SearchPath and Sphinx::Search

These are available via Ubuntu's package system, so:

 apt-get install libfile-searchpath-perl libsphinx-search-perl

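It is important that you match up the versions of the Perl packages and the Sphinx package (the Sphinx::Search documentation lists the compatible versions: http://search.cpan.org/~jjschutz/Sphinx-Search-0.22/lib/Sphinx/Search.pm#VERSION); otherwise, your searches will silently fail due to incompatibilities in the API. Assuming that all works, you should have everything installed that you need to get the search system set up. Moving on!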
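
To compare the two versions, you can print them side by side. This is a sketch under two assumptions: that `searchd --help` starts with a banner line such as `Sphinx 0.9.9-release (r2117)`, and that Sphinx::Search exposes the usual `$VERSION` package variable. `sphinx_version_from_banner` is a hypothetical helper name.

```shell
# Extract the version number from a Sphinx banner on stdin.
# Assumes the first line looks like "Sphinx 0.9.9-release (r2117)".
sphinx_version_from_banner() {
    sed -n '1s/^Sphinx \([0-9][0-9.]*\).*/\1/p'
}

# Only query the tools that are actually installed.
if command -v searchd >/dev/null 2>&1; then
    echo "Sphinx: $(searchd --help 2>/dev/null | sphinx_version_from_banner)"
fi
if command -v perl >/dev/null 2>&1; then
    perl -MSphinx::Search -e 'print "Sphinx::Search: $Sphinx::Search::VERSION\n"' 2>/dev/null \
        || echo "Sphinx::Search: not installed"
fi
```

Check the two numbers against the compatibility list in the Sphinx::Search documentation rather than expecting them to be equal.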

Setup

Database

You will need to create a new database. Something like this will work, but make sure to adjust the username and password:

 CREATE DATABASE dw_sphinx;
 GRANT ALL ON dw_sphinx.* TO dw@'localhost' IDENTIFIED BY '__YOURPASSWORD__';
 USE dw_sphinx;
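
If you would rather not edit the statements by hand, a small shell helper can fill in the username and password for you. This is only a sketch: `sphinx_db_sql` is a hypothetical name, it assumes the password contains no single quotes, and it prints the SQL so that you can review it before running it.

```shell
# Print the database setup statements for a given user and password.
# sphinx_db_sql is an illustrative helper; the statements mirror the ones above.
sphinx_db_sql() {
    db_user=$1
    db_pass=$2
    printf 'CREATE DATABASE dw_sphinx;\n'
    printf "GRANT ALL ON dw_sphinx.* TO %s@'localhost' IDENTIFIED BY '%s';\n" "$db_user" "$db_pass"
}

# Review the output, then pipe it to the mysql client as an administrative user:
#   sphinx_db_sql dw 'your-password' | mysql -u root -p
sphinx_db_sql dw '__YOURPASSWORD__'
```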

Now you have to create the tables. The table definitions are in gist ba988dfd02e49822246f.

The table is pretty straightforward. It just stores the posts, who they're by, where they were posted, and some basic security information. Note that this table holds the full (compressed) subject and text of the entries, so it can get rather large.

Site Configuration

Configuring your site is next. This involves adding a new section to your %DBINFO hash, like this:

  sphinx => {
      host => '127.0.0.1',
      port => 3306,
      user => 'dw',
      pass => '__YOURPASSWORD__',
      dbname => 'dw_sphinx',
      role => {
          sphinx_search => 1,
      },
  },

You also need to add a setting elsewhere in the same file that tells your system where the search daemon will be listening. Port 3312 is the default:

  # sphinx search daemon
  @SPHINX_SEARCHD = ( '127.0.0.1', 3312 );

That's it for site configuration. Once both of the settings above are in place, your site will do all the right things to make search active. Of course, we still have to configure Sphinx itself...

Sphinx

The first step, assuming you're going to be running Sphinx as root, is to make the directory it needs:

  mkdir /var/data

Now we need to set up the configuration file. By default, Sphinx will look for the file at `/usr/local/etc/sphinx.conf`. To confirm, run this:

 indexer --quiet

It will fail if it can't find a config file, but it will helpfully tell you where it looked. Now that we know where the config file is, replace its contents with the configuration in gist 995d62983c3a72fdd249.

That's right. It's long. But it's actually almost identical to the configuration file that comes with Sphinx. It contains a lot of tweaks to get the right combination of values for UTF-8 support and the like, but the rest is pretty straightforward.

Make sure to customize `sql_user` and `sql_pass` in the configuration file to match the database username and password you used earlier.
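
A quick way to double-check the credentials is to list the connection settings currently in the file. A minimal sketch, assuming the default config location mentioned above; `show_sql_settings` is a hypothetical helper name, and `sql_host`, `sql_db` and `sql_port` are included alongside `sql_user` and `sql_pass` on the assumption that your config sets them too.

```shell
# List the sql_* connection settings in a Sphinx config file, with line numbers.
# show_sql_settings is an illustrative helper, not a Sphinx command.
show_sql_settings() {
    grep -nE '^[[:space:]]*sql_(host|user|pass|db|port)[[:space:]]*=' "$1"
}

conf=/usr/local/etc/sphinx.conf   # default location mentioned above
if [ -r "$conf" ]; then
    show_sql_settings "$conf" || echo "no sql_* settings found in $conf"
fi
```

Commented-out settings are skipped, so what you see is what Sphinx will actually use.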

The Search Process

Getting content into search requires a few things:

  • a separate database containing entry/comment text: the text of the entries and comments that we want to be searchable is kept in its own database. Each entry or comment is copied from the main database into the search database when it's posted or edited.
  • a Sphinx index: searching raw text is painfully slow, so Sphinx processes the contents of the search database further, creating an index of the words. Processing the text this way also makes it possible for "test" to turn up "tests", "testing", etc.
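
To see why an index helps, here is a toy word index built with standard shell tools. It only illustrates the idea and is not how Sphinx stores its index: `build_index` is a hypothetical name, the input format (an id, a tab, then the text) is an assumption made for the example, and no stemming is done, so "test" would not match "tests" here.

```shell
# Build a toy word index from "id<TAB>text" lines on stdin.
# Output: one "word id id ..." line per word, sorted by word.
build_index() {
    awk -F '\t' '
        {
            n = split(tolower($2), words, /[^a-z0-9]+/)
            for (i = 1; i <= n; i++) {
                w = words[i]
                if (w == "" || seen[w, $1]++) continue   # skip blanks and repeats
                index_of[w] = index_of[w] " " $1
            }
        }
        END { for (w in index_of) print w index_of[w] }
    ' | sort
}

printf '1\tTesting the search worker\n2\tSearch is slow without an index\n' | build_index
```

Looking up a word is then a single match against the index (for example `grep '^search '`) instead of a scan through the text of every entry.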


Getting search results involves a couple more:

  •  ??
  • a search worker