Difference between revisions of "Search/Sphinx"

From Dreamwidth Notes
(Sphinx)
m (new gist with most recent config values)
 
(17 intermediate revisions by 6 users not shown)
Line 1: Line 1:
 +
 +
Sphinx indexes entries and comments.  It's available from the [http://dreamwidth.org/search search page].
 +
 +
There are two modes of search: site search, and per-journal search.
 +
 +
Site search only shows public content. Journal search may contain locked content, following the regular behavior for whether you can see the locked content or not. That is, if you can see it on the journal, then you can find it with search. If you can't see it on the journal, then you won't see it in the search results. There's also an option to search by comments. Only comments made on paid users' journals are indexed for technical reasons (site load).
 +
 +
 +
== Setup ==
 +
 
This page documents the process you will need to go through to setup the Sphinx search system that Dreamwidth uses.  This is not an easy process, and the documentation is probably going to need some iterations to get to a very useful state.
 
  
I only suggest setting up the search system if you have a good amount of time to mess around with things.  If you need some help, feel free to grab me anytime and I'll help out.
+
I only suggest setting up the search system if you have a good amount of time to mess around with things.  If you need some help, feel free to grab me (<dwuser>mark</dwuser>) anytime and I'll help out.
  
 
== Software Installation ==
 
Line 9: Line 19:
 
  apt-get install libpath-class-perl libmysqlclient15-dev g++
 
  
There are three packages that you will have to download:
+
The instructions are different depending on your version of Ubuntu, so choose the appropriate version:
 +
 
 +
=== Jaunty/9.04 and older ===
 +
 
 +
==== Installing File::SearchPath and Sphinx::Search ====
 +
 
 +
There are two Perl packages that you will have to download:
  
http://sphinxsearch.com/downloads/sphinx-0.9.8.1.tar.gz
 
http://search.cpan.org/CPAN/authors/id/J/JJ/JJSCHUTZ/Sphinx-Search-0.12.tar.gz
 
 
  http://search.cpan.org/CPAN/authors/id/T/TJ/TJENNESS/File-SearchPath-0.05.tar.gz
 
 +
http://search.cpan.org/CPAN/authors/id/J/JJ/JJSCHUTZ/Sphinx-Search-0.12.tar.gz
  
Now, you need to build these.  The second two are standard Perl packages which you can build with dh-make-perl.  Do File::SearchPath first (and then install it) and then you can build Sphinx::Search (and install it).
+
Now, you need to build these.  They are standard Perl packages which you can build with dh-make-perl.  Do File::SearchPath first (and then install it) and then you can build Sphinx::Search (and install it).
 +
 
 +
==== Installing the Sphinx package ====
 +
 
 +
You will need to download the Sphinx package:
 +
 
 +
http://sphinxsearch.com/downloads/sphinx-0.9.8.1.tar.gz
  
 
The Sphinx package itself is a standard project style.  Setup and installation looks something like this:
 
Line 24: Line 45:
 
  make
 
  make
 
  make install
 
  make install
 +
 +
=== Karmic/9.10 and newer ===
 +
 +
==== Installing File::SearchPath and Sphinx::Search ====
 +
 +
From Ubuntu 9.10 and up, these Perl packages are available in the packaging system. You can install them in one step:
 +
 +
apt-get install libfile-searchpath-perl libsphinx-search-perl
 +
 +
==== Installing the Sphinx package ====
 +
 +
You will need to download the Sphinx package:
 +
 +
http://sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz
 +
 +
The Sphinx package itself is a standard project style.  Setup and installation looks something like this:
 +
 +
tar -zxvf sphinx-0.9.9.tar.gz
 +
cd sphinx-0.9.9/
 +
./configure
 +
make
 +
make install
 +
 +
 +
It is important that you [http://search.cpan.org/~jjschutz/Sphinx-Search-0.22/lib/Sphinx/Search.pm#VERSION match up the versions] of the Perl packages and the Sphinx package; otherwise, your searches will silently fail due to incompatibilities in the API.  (For instance, assuming the proper workers are running, if <tt>"search -q terms"</tt> returns results, while a site search always fails, this is one possible reason.)
  
 
Assuming that all works, you should have everything installed that you need to get the search system setup.  Moving on!
 
Line 37: Line 83:
 
  CREATE DATABASE dw_sphinx;
 
 
  GRANT ALL ON dw_sphinx.* TO dw@'localhost' IDENTIFIED BY 'dw';
 
 +
USE dw_sphinx;
  
Now, you have to make a table:
+
Now, you have to make a table. Use these CREATE TABLE statements:
  
USE dw_sphinx;
+
<gist>19693079b7378531bde13cf9bce981e1</gist>
CREATE TABLE `posts_raw` (
+
  `id` int(10) unsigned NOT NULL auto_increment,
+
  `journal_id` int(10) unsigned NOT NULL,
+
  `jitemid` int(10) unsigned NOT NULL,
+
  `poster_id` int(10) unsigned NOT NULL,
+
  `security_bits` varchar(255) NOT NULL,
+
  `allow_global_search` enum('0','1') NOT NULL default '1',
+
  `is_deleted` enum('0','1') NOT NULL default '0',
+
  `date_posted` int(10) unsigned NOT NULL,
+
  `title` varchar(255) default NULL,
+
  `data` mediumtext,
+
  `revtime` int(10) unsigned NOT NULL,
+
  `touchtime` int(10) unsigned NOT NULL,
+
  PRIMARY KEY  (`id`),
+
  UNIQUE KEY `journal_id` (`journal_id`,`jitemid`)
+
) ENGINE=InnoDB;
+
  
The table is a pretty straightforward table.  It just stores the posts, who they're by, where they're at, and some basic security information.  Note that this table has the full uncompressed subject and text of the entries, so it can get rather large.
+
The <code>items_raw</code> table is a pretty straightforward table.  It just stores the posts, who they're by, where they're at, and some basic security information.  Note that this table has the full uncompressed subject and text of the entries, so it can get rather large.  The <code>support_raw</code> table stores similar information for support requests.
  
 
=== Site ===
 
=== Site ===
Line 88: Line 119:
 
  mkdir /var/data
 
  mkdir /var/data
  
Now, we need to setup the configuration file.  Here's mine:
+
Now, we need to set up the configuration file.  By default, sphinx looks for the file <tt>/usr/local/etc/sphinx.conf</tt>.  If that's not present on your system, try running "indexer --quiet"; it will fail if it didn't find a config file, but will helpfully tell you where it tried to look.
  
## Dreamwidth sphinx configuration
 
 
source src1
 
{
 
    type                  = mysql
 
 
    sql_host              = localhost
 
    sql_user              = dw
 
    sql_pass              = dw
 
    sql_db                = dw_sphinx
 
    sql_port              = 3306
 
 
    sql_query_pre        = SET NAMES 'utf8'
 
 
    sql_query            = \
 
        SELECT id, journal_id, jitemid, poster_id, security_bits \
 
              allow_global_search, is_deleted, date_posted, title, data \
 
        FROM posts_raw \
 
        WHERE id >= $start AND id <= $end
 
 
    sql_query_range      = SELECT MIN(id), MAX(id) FROM posts_raw
 
 
    sql_attr_uint        = journal_id
 
    sql_attr_uint        = poster_id
 
    sql_attr_uint        = jitemid
 
    sql_attr_bool        = is_deleted
 
    sql_attr_timestamp    = date_posted
 
    sql_attr_bool        = allow_global_search
 
    sql_attr_multi        = uint security_bits from field
 
  
    sql_query_info        = SELECT * FROM posts_raw WHERE id = $id
+
<gist>eafc5b3b1dca62dabd90c785d6b62d9d</gist>
}
+
+
index dw1
+
{
+
    source        = src1
+
    path          = /var/data/dreamwidth
+
    charset_type  = utf-8
+
+
    charset_table = 0..9, a..z, _, A..Z->a..z, U+00C0->a, U+00C1->a, \
+
        U+00C2->a, U+00C3->a, U+00C4->a, U+00C5->a, U+00C7->c, U+00C8->e, \
+
        U+00C9->e, U+00CA->e, U+00CB->e, U+00CC->i, U+00CD->i, U+00CE->i, \
+
        U+00CF->i, U+00D1->n, U+00D2->o, U+00D3->o, U+00D4->o, U+00D5->o, \
+
        U+00D6->o, U+00D9->u, U+00DA->u, U+00DB->u, U+00DC->u, U+00DD->y, \
+
        U+00E0->a, U+00E1->a, U+00E2->a, U+00E3->a, U+00E4->a, U+00E5->a, \
+
        U+00E7->c, U+00E8->e, U+00E9->e, U+00EA->e, U+00EB->e, U+00EC->i, \
+
        U+00ED->i, U+00EE->i, U+00EF->i, U+00F1->n, U+00F2->o, U+00F3->o, \
+
        U+00F4->o, U+00F5->o, U+00F6->o, U+00F9->u, U+00FA->u, U+00FB->u, \
+
        U+00FC->u, U+00FD->y, U+00FF->y, U+0100->a, U+0101->a, U+0102->a, \
+
        U+0103->a, U+0104->a, U+0105->a, U+0106->c, U+0107->c, U+0108->c, \
+
        U+0109->c, U+010A->c, U+010B->c, U+010C->c, U+010D->c, U+010E->d, \
+
        U+010F->d, U+0112->e, U+0113->e, U+0114->e, U+0115->e, U+0116->e, \
+
        U+0117->e, U+0118->e, U+0119->e, U+011A->e, U+011B->e, U+011C->g, \
+
        U+011D->g, U+011E->g, U+011F->g, U+0120->g, U+0121->g, U+0122->g, \
+
        U+0123->g, U+0124->h, U+0125->h, U+0128->i, U+0129->i, U+012A->i, \
+
        U+012B->i, U+012C->i, U+012D->i, U+012E->i, U+012F->i, U+0130->i, \
+
        U+0134->j, U+0135->j, U+0136->k, U+0137->k, U+0139->l, U+013A->l, \
+
        U+013B->l, U+013C->l, U+013D->l, U+013E->l, U+0142->l, U+0143->n, \
+
        U+0144->n, U+0145->n, U+0146->n, U+0147->n, U+0148->n, U+014C->o, \
+
        U+014D->o, U+014E->o, U+014F->o, U+0150->o, U+0151->o, U+0154->r, \
+
        U+0155->r, U+0156->r, U+0157->r, U+0158->r, U+0159->r, U+015A->s, \
+
        U+015B->s, U+015C->s, U+015D->s, U+015E->s, U+015F->s, U+0160->s, \
+
        U+0161->s, U+0162->t, U+0163->t, U+0164->t, U+0165->t, U+0168->u, \
+
        U+0169->u, U+016A->u, U+016B->u, U+016C->u, U+016D->u, U+016E->u, \
+
        U+016F->u, U+0170->u, U+0171->u, U+0172->u, U+0173->u, U+0174->w, \
+
        U+0175->w, U+0176->y, U+0177->y, U+0178->y, U+0179->z, U+017A->z, \
+
        U+017B->z, U+017C->z, U+017D->z, U+017E->z, U+01A0->o, U+01A1->o, \
+
        U+01AF->u, U+01B0->u, U+01CD->a, U+01CE->a, U+01CF->i, U+01D0->i, \
+
        U+01D1->o, U+01D2->o, U+01D3->u, U+01D4->u, U+01D5->u, U+01D6->u, \
+
        U+01D7->u, U+01D8->u, U+01D9->u, U+01DA->u, U+01DB->u, U+01DC->u, \
+
        U+01DE->a, U+01DF->a, U+01E0->a, U+01E1->a, U+01E6->g, U+01E7->g, \
+
        U+01E8->k, U+01E9->k, U+01EA->o, U+01EB->o, U+01EC->o, U+01ED->o, \
+
        U+01F0->j, U+01F4->g, U+01F5->g, U+01F8->n, U+01F9->n, U+01FA->a, \
+
        U+01FB->a, U+0200->a, U+0201->a, U+0202->a, U+0203->a, U+0204->e, \
+
        U+0205->e, U+0206->e, U+0207->e, U+0208->i, U+0209->i, U+020A->i, \
+
        U+020B->i, U+020C->o, U+020D->o, U+020E->o, U+020F->o, U+0210->r, \
+
        U+0211->r, U+0212->r, U+0213->r, U+0214->u, U+0215->u, U+0216->u, \
+
        U+0217->u, U+0218->s, U+0219->s, U+021A->t, U+021B->t, U+021E->h, \
+
        U+021F->h, U+0226->a, U+0227->a, U+0228->e, U+0229->e, U+022A->o, \
+
        U+022B->o, U+022C->o, U+022D->o, U+022E->o, U+022F->o, U+0230->o, \
+
        U+0231->o, U+0232->y, U+0233->y, U+1E00->a, U+1E01->a, U+1E02->b, \
+
        U+1E03->b, U+1E04->b, U+1E05->b, U+1E06->b, U+1E07->b, U+1E08->c, \
+
        U+1E09->c, U+1E0A->d, U+1E0B->d, U+1E0C->d, U+1E0D->d, U+1E0E->d, \
+
        U+1E0F->d, U+1E10->d, U+1E11->d, U+1E12->d, U+1E13->d, U+1E14->e, \
+
        U+1E15->e, U+1E16->e, U+1E17->e, U+1E18->e, U+1E19->e, U+1E1A->e, \
+
        U+1E1B->e, U+1E1C->e, U+1E1D->e, U+1E1E->f, U+1E1F->f, U+1E20->g, \
+
        U+1E21->g, U+1E22->h, U+1E23->h, U+1E24->h, U+1E25->h, U+1E26->h, \
+
        U+1E27->h, U+1E28->h, U+1E29->h, U+1E2A->h, U+1E2B->h, U+1E2C->i, \
+
        U+1E2D->i, U+1E2E->i, U+1E2F->i, U+1E30->k, U+1E31->k, U+1E32->k, \
+
        U+1E33->k, U+1E34->k, U+1E35->k, U+1E36->l, U+1E37->l, U+1E38->l, \
+
        U+1E39->l, U+1E3A->l, U+1E3B->l, U+1E3C->l, U+1E3D->l, U+1E3E->m, \
+
        U+1E3F->m, U+1E40->m, U+1E41->m, U+1E42->m, U+1E43->m, U+1E44->n, \
+
        U+1E45->n, U+1E46->n, U+1E47->n, U+1E48->n, U+1E49->n, U+1E4A->n, \
+
        U+1E4B->n, U+1E4C->o, U+1E4D->o, U+1E4E->o, U+1E4F->o, U+1E50->o, \
+
        U+1E51->o, U+1E52->o, U+1E53->o, U+1E54->p, U+1E55->p, U+1E56->p, \
+
        U+1E57->p, U+1E58->r, U+1E59->r, U+1E5A->r, U+1E5B->r, U+1E5C->r, \
+
        U+1E5D->r, U+1E5E->r, U+1E5F->r, U+1E60->s, U+1E61->s, U+1E62->s, \
+
        U+1E63->s, U+1E64->s, U+1E65->s, U+1E66->s, U+1E67->s, U+1E68->s, \
+
        U+1E69->s, U+1E6A->t, U+1E6B->t, U+1E6C->t, U+1E6D->t, U+1E6E->t, \
+
        U+1E6F->t, U+1E70->t, U+1E71->t, U+1E72->u, U+1E73->u, U+1E74->u, \
+
        U+1E75->u, U+1E76->u, U+1E77->u, U+1E78->u, U+1E79->u, U+1E7A->u, \
+
        U+1E7B->u, U+1E7C->v, U+1E7D->v, U+1E7E->v, U+1E7F->v, U+1E80->w, \
+
        U+1E81->w, U+1E82->w, U+1E83->w, U+1E84->w, U+1E85->w, U+1E86->w, \
+
        U+1E87->w, U+1E88->w, U+1E89->w, U+1E8A->x, U+1E8B->x, U+1E8C->x, \
+
        U+1E8D->x, U+1E8E->y, U+1E8F->y, U+1E96->h, U+1E97->t, U+1E98->w, \
+
        U+1E99->y, U+1EA0->a, U+1EA1->a, U+1EA2->a, U+1EA3->a, U+1EA4->a, \
+
        U+1EA5->a, U+1EA6->a, U+1EA7->a, U+1EA8->a, U+1EA9->a, U+1EAA->a, \
+
        U+1EAB->a, U+1EAC->a, U+1EAD->a, U+1EAE->a, U+1EAF->a, U+1EB0->a, \
+
        U+1EB1->a, U+1EB2->a, U+1EB3->a, U+1EB4->a, U+1EB5->a, U+1EB6->a, \
+
        U+1EB7->a, U+1EB8->e, U+1EB9->e, U+1EBA->e, U+1EBB->e, U+1EBC->e, \
+
        U+1EBD->e, U+1EBE->e, U+1EBF->e, U+1EC0->e, U+1EC1->e, U+1EC2->e, \
+
        U+1EC3->e, U+1EC4->e, U+1EC5->e, U+1EC6->e, U+1EC7->e, U+1EC8->i, \
+
        U+1EC9->i, U+1ECA->i, U+1ECB->i, U+1ECC->o, U+1ECD->o, U+1ECE->o, \
+
        U+1ECF->o, U+1ED0->o, U+1ED1->o, U+1ED2->o, U+1ED3->o, U+1ED4->o, \
+
        U+1ED5->o, U+1ED6->o, U+1ED7->o, U+1ED8->o, U+1ED9->o, U+1EDA->o, \
+
        U+1EDB->o, U+1EDC->o, U+1EDD->o, U+1EDE->o, U+1EDF->o, U+1EE0->o, \
+
        U+1EE1->o, U+1EE2->o, U+1EE3->o, U+1EE4->u, U+1EE5->u, U+1EE6->u, \
+
        U+1EE7->u, U+1EE8->u, U+1EE9->u, U+1EEA->u, U+1EEB->u, U+1EEC->u, \
+
        U+1EED->u, U+1EEE->u, U+1EEF->u, U+1EF0->u, U+1EF1->u, U+1EF2->y, \
+
        U+1EF3->y, U+1EF4->y, U+1EF5->y, U+1EF6->y, U+1EF7->y, U+1EF8->y, \
+
        U+1EF9->y
+
}
+
+
index dw1stemmed : dw1
+
{
+
    path      = /var/data/dw1stemmed
+
    morphology = stem_en
+
}
+
 
+
indexer
+
{
+
    mem_limit  = 512M
+
}
+
+
searchd
+
{
+
    address    = 127.0.0.1
+
    port      = 3312
+
    log        = /var/log/searchd.log
+
    query_log  = /var/log/query.log
+
    pid_file  = /var/log/searchd.pid
+
}
+
  
 
That's right.  It's long.  But it's actually almost identical to the configuration file that comes with Sphinx.  I had to do a lot of tweaking to figure out the right combination of values for UTF-8 support and the like, but the rest is pretty straightforward.
 
 +
 +
Make sure to customize `sql_user` and `sql_pass` in the configuration files to match what you used earlier.
  
 
== Testing ==
 
== Testing ==
Line 241: Line 134:
 
  indexer --all
 
  indexer --all
  
You should see it spit out some stuff saying it's collecting documents, and if all goes well, you should see files in /var/data.  You can also test search from the command line:
+
You should see it spit out some stuff saying it's collecting documents, and if all goes well, you should see files in /var/data.  You won't be able to search yet because you haven't placed any data in your search database, but you'll at least have confirmed that you have Sphinx configured properly.
  
search -q some words
 
  
That's the gist of it.  If all of that seems to be working, then you should be able to skip down to the usage section.
+
== Search Architecture ==
  
== Usage ==
+
Making content searchable requires two things:
  
The search system requires a few components to be up and running for it to actually work.  In a nutshell, you should make sure the following are running at all times:
+
* a '''search database'''.  The search database contains the text of entries and comments that we want to be searchable.  It's separate from the main database.  Every time an entry or comment is posted or edited, it needs to be copied from the main database into the search database.  The worker that does this is <tt>'''sphinx-copier'''</tt>, which is run by TheSchwartz.
 +
* a '''search index'''.  Doing a search on raw text is painfully slow, so Sphinx processes the contents of the search database further -- it creates an index of words, then runs searches against that index instead of the raw text itself. Processing the text this way also makes it possible for a search for "test" to turn up "tests", "testing", etc.  The index is created by running a program named, surprisingly enough, <tt>'''indexer'''</tt>.
  
bin/worker/sphinx-copier -v
+
Running searches and getting results requires two more:
bin/worker/sphinx-search-gm -v
+
  
The former is a TheSchwartz job (so you need to have that configured) and the latter is a Gearman worker (and you need that configured too).  Once those are running, have to make sure the search daemon is running.  Again, as root:
+
* the Sphinx '''search daemon''' -- the program that actually runs searches.  This is <tt>'''searchd'''</tt>.
 +
* a '''search worker''', which connects to the Sphinx daemon, feeds it queries, and retrieves the results.  This is <tt>'''sphinx-search-gm'''</tt>, a Gearman job.
  
searchd --console
 
  
All of these command lines are "foreground debug mode" versions.  If you drop the arguments, the workers/searchd will spawn themselves off into the background and disappear from sight.
 
  
Now, once you have those components up and running, you can do one of two things to actually get data in the system to search.  You can use the manual copier or you can go post on one of your paid accounts.  The manual copier is probably easiest if you don't have a zillion accounts on the system:
+
== Assembling the pieces (test env) ==
  
 +
Now, here's how we put all of those pieces together in a test environment.
 +
 +
* First, we need to have both [[TheSchwartz Setup | TheSchwartz]] and [[Setting up Gearman | Gearman]] set up and working -- <tt>sphinx-copier</tt> depends on the former, and <tt>sphinx-search-gm</tt> on the latter.
 +
* Now, we want to run both of those workers in the foreground so that we can keep an eye on them.  In separate terminal sessions, run
 +
bin/worker/sphinx-copier -v
 +
bin/worker/sphinx-search-gm -v
 +
* Next, we need to have the search daemon running -- also in the foreground:
 +
searchd --console
 +
* To get data into the search db, you have two options: 
 +
** You can post to some of your paid accounts.  Now that <tt>sphinx-copier</tt> is running, it will pick up anything new that you post and schedule it for addition to the search db.
 +
** Alternatively, you can run the manual copier, which will tell <tt>sphinx-copier</tt> about any posts or comments that are already on your site:
 
  bin/schedule-copier-jobs
 
  bin/schedule-copier-jobs
 +
* ...running that will get your sphinx-copier busy copying data into the dw_sphinx database you made earlier.  You can see if it works by watching the output of the sphinx-copier -- it should say something about inserting posts.  You can then go to the dw_sphinx database and select from posts_raw to see the data is actually in the system.
 +
* Now that we have data in the search database, we have to index it.  On a production site, you'd want to run the indexer every 15 minutes or so; in test, you can just run it before you do a search, if you've added new content since the last run.
 +
  indexer --all --rotate
  
Just run that and it will get your sphinx-copier busy copying data into the dw_sphinx database you made earlier.  You can see if it works by watching the output of the sphinx-copier, it should say something about inserting posts.  You can then go to the dw_sphinx database and select from posts_raw to see the data is actually in the system.
+
* Finally, restart searchd.   
  
Once you have data in the system, you have to index. This is pretty easy, just run the indexer again (as root):
+
You should now be able to search for things from the command line or from your site! To search from the command line, use:
  
  indexer --all
+
  search -q some words
 +
 
 +
 
 +
Sphinx is resource-intensive.  It's intentionally been made a separate system from the main Dreamwidth site, so that it can be run on a different machine from the webservers in production.
 +
 
 +
You don't have to worry too much about load on a development server where you have little data to index and it's only you on the machine.  Still, it may make sense to only turn on the search workers when you're testing something search-related.
  
Once that's done, you can restart searchd, and you should be able to search for things from the command line or from your site.
 
  
[[Category: Development]][[Category: Dreamwidth Installation]]
+
[[Category: Development]][[Category: Dreamwidth Installation]][[Category: Reference]][[Category:Search]]

Latest revision as of 21:42, 27 March 2017

Sphinx indexes entries and comments. It's available from the search page.

There are two modes of search: site search, and per-journal search.

Site search only shows public content. Journal search may include locked content, subject to the usual access rules: if you can see an entry on the journal, you can find it with search; if you can't see it on the journal, it won't appear in your search results. There's also an option to search comments. For technical reasons (site load), only comments made on paid users' journals are indexed.


Setup

This page documents the process you will need to go through to set up the Sphinx search system that Dreamwidth uses. This is not an easy process, and the documentation will probably need a few more iterations before it is really useful.

I only suggest setting up the search system if you have a good amount of time to mess around with things. If you need some help, feel free to grab me (mark) anytime and I'll help out.

Software Installation

First, you will need to set up the Sphinx software. Before anything else, make sure you have the following packages installed (Ubuntu Intrepid):

apt-get install libpath-class-perl libmysqlclient15-dev g++

The instructions are different depending on your version of Ubuntu, so choose the appropriate version:

Jaunty/9.04 and older

Installing File::SearchPath and Sphinx::Search

There are two Perl packages that you will have to download:

http://search.cpan.org/CPAN/authors/id/T/TJ/TJENNESS/File-SearchPath-0.05.tar.gz
http://search.cpan.org/CPAN/authors/id/J/JJ/JJSCHUTZ/Sphinx-Search-0.12.tar.gz

Now, you need to build these. They are standard Perl packages which you can build with dh-make-perl. Build and install File::SearchPath first; then build and install Sphinx::Search.

Installing the Sphinx package

You will need to download the Sphinx package:

http://sphinxsearch.com/downloads/sphinx-0.9.8.1.tar.gz

The Sphinx package itself uses the standard configure/make build. Setup and installation looks something like this:

tar -zxvf sphinx-0.9.8.1.tar.gz
cd sphinx-0.9.8.1/
./configure
make
make install

Karmic/9.10 and newer

Installing File::SearchPath and Sphinx::Search

From Ubuntu 9.10 and up, these Perl packages are available in the packaging system. You can install them in one step:

apt-get install libfile-searchpath-perl libsphinx-search-perl

Installing the Sphinx package

You will need to download the Sphinx package:

http://sphinxsearch.com/downloads/sphinx-0.9.9.tar.gz

The Sphinx package itself uses the standard configure/make build. Setup and installation looks something like this:

tar -zxvf sphinx-0.9.9.tar.gz
cd sphinx-0.9.9/
./configure
make
make install


It is important that you match up the versions of the Perl packages and the Sphinx package; otherwise, your searches will silently fail due to incompatibilities in the API. (For instance, assuming the proper workers are running, if "search -q terms" returns results, while a site search always fails, this is one possible reason.)
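
To compare the two versions by hand, something like the following sketch may help. It assumes that indexer prints a banner of the form "Sphinx 0.9.9-release (r2117)" when run without arguments, and that Sphinx::Search exposes the usual Perl $VERSION variable; the function names are made up for this example.

```shell
# Extract the version number from a Sphinx banner read on stdin.
# Assumption: the banner line looks like "Sphinx 0.9.9-release (r2117)".
sphinx_banner_version() {
    sed -n 's/^Sphinx \([0-9][0-9.]*\).*/\1/p' | head -n 1
}

# Print both versions so they can be compared against the VERSION
# section of the Sphinx::Search documentation.
show_sphinx_versions() {
    # Assumption: indexer prints its banner (plus usage) with no arguments.
    printf 'Sphinx:         %s\n' "$(indexer 2>&1 | sphinx_banner_version)"
    # $VERSION is the conventional Perl module version variable.
    printf 'Sphinx::Search: %s\n' \
        "$(perl -MSphinx::Search -e 'print $Sphinx::Search::VERSION')"
}
```

Run show_sphinx_versions after installing both pieces; if the pair doesn't match what the Sphinx::Search documentation lists, that is the first thing to fix when site search fails.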

Assuming that all works, you should have everything installed that you need to get the search system set up. Moving on!

Configuration

There are several points to configure. Let's start with the configuration of your database.

Database

You will need to create a new database. Something like this process will work:

CREATE DATABASE dw_sphinx;
GRANT ALL ON dw_sphinx.* TO dw@'localhost' IDENTIFIED BY 'dw';
USE dw_sphinx;

Now, you have to make the tables. The CREATE TABLE statements are embedded in the wiki page as a gist (19693079b7378531bde13cf9bce981e1).

The items_raw table is a pretty straightforward table. It just stores the posts, who they're by, where they're at, and some basic security information. Note that this table has the full uncompressed subject and text of the entries, so it can get rather large. The support_raw table stores similar information for support requests.
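
The current statements live in the gist and aren't reproduced in this text. As a rough sketch of the shape of the data, here is the single posts_raw table that an earlier revision of this page used, wrapped in a shell function (the function name is made up). Treat it as an illustration only: the current schema uses items_raw and support_raw and may differ.

```shell
# Print the posts_raw schema from the earlier revision of this page.
# This is NOT the current items_raw/support_raw schema from the gist.
posts_raw_schema() {
    cat <<'SQL'
CREATE TABLE `posts_raw` (
  `id` int(10) unsigned NOT NULL auto_increment,
  `journal_id` int(10) unsigned NOT NULL,
  `jitemid` int(10) unsigned NOT NULL,
  `poster_id` int(10) unsigned NOT NULL,
  `security_bits` varchar(255) NOT NULL,
  `allow_global_search` enum('0','1') NOT NULL default '1',
  `is_deleted` enum('0','1') NOT NULL default '0',
  `date_posted` int(10) unsigned NOT NULL,
  `title` varchar(255) default NULL,
  `data` mediumtext,
  `revtime` int(10) unsigned NOT NULL,
  `touchtime` int(10) unsigned NOT NULL,
  PRIMARY KEY (`id`),
  UNIQUE KEY `journal_id` (`journal_id`,`jitemid`)
) ENGINE=InnoDB;
SQL
}
```

To load it, pipe the output into the mysql client, for example: posts_raw_schema | mysql -u dw -p dw_sphinx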

Site

Configuring your site is next. This involves adding a new section to your %DBINFO hash, like this:

sphinx => {
    host => '127.0.0.1',
    port => 3306,
    user => 'dw',
    pass => 'dw',
    dbname => 'dw_sphinx',
    role => {
        sphinx_search => 1,
    },
},

You also need to add a configuration elsewhere in the file that tells your system where the search daemon will be. Port 3312 is the default:

# sphinx search daemon
@SPHINX_SEARCHD = ( '127.0.0.1', 3312 );

That's it for site configuration. Once you have the above two options in, then your site will do all the right things to make the search active. Of course, we still have to configure Sphinx itself...

Sphinx

I left this for last as it's probably the trickiest. The first step is, assuming you're going to be running Sphinx as root (I do), to make the directory it needs:

mkdir /var/data

Now, we need to set up the configuration file. By default, Sphinx looks for the file /usr/local/etc/sphinx.conf. If that's not present on your system, try running "indexer --quiet"; it will fail if it can't find a config file, but will helpfully tell you where it looked. The configuration itself is embedded in the wiki page as a gist (eafc5b3b1dca62dabd90c785d6b62d9d).


That's right. It's long. But it's actually almost identical to the configuration file that comes with Sphinx. I had to do a lot of tweaking to figure out the right combination of values for UTF-8 support and the like, but the rest is pretty straightforward.

Make sure to customize `sql_user` and `sql_pass` in the configuration files to match what you used earlier.
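
The gist isn't reproduced in this text. As a sketch of what gets customized, the function below prints a trimmed-down configuration based on the one in an earlier revision of this page, with the MySQL user and password filled in from its arguments. The function name is made up, the long charset_table is left out, and the query adds the comma after security_bits that the earlier revision was missing. The current gist may well differ (for example, in table names).

```shell
# Emit a trimmed sphinx.conf on stdout.
# $1 = MySQL user, $2 = MySQL password (the ones granted on dw_sphinx).
emit_sphinx_conf() {
    cat <<'EOF'
source src1
{
    type     = mysql
    sql_host = localhost
    sql_db   = dw_sphinx
    sql_port = 3306
EOF
    printf '    sql_user = %s\n    sql_pass = %s\n' "$1" "$2"
    cat <<'EOF'
    sql_query_pre   = SET NAMES 'utf8'
    sql_query       = SELECT id, journal_id, jitemid, poster_id, security_bits, allow_global_search, is_deleted, date_posted, title, data FROM posts_raw WHERE id >= $start AND id <= $end
    sql_query_range = SELECT MIN(id), MAX(id) FROM posts_raw
    sql_attr_uint      = journal_id
    sql_attr_uint      = poster_id
    sql_attr_uint      = jitemid
    sql_attr_bool      = is_deleted
    sql_attr_timestamp = date_posted
    sql_attr_bool      = allow_global_search
    sql_attr_multi     = uint security_bits from field
    sql_query_info     = SELECT * FROM posts_raw WHERE id = $id
}

index dw1
{
    source       = src1
    path         = /var/data/dreamwidth
    charset_type = utf-8
    # charset_table (accent folding) omitted from this sketch
}

index dw1stemmed : dw1
{
    path       = /var/data/dw1stemmed
    morphology = stem_en
}

indexer
{
    mem_limit = 512M
}

searchd
{
    address   = 127.0.0.1
    port      = 3312
    log       = /var/log/searchd.log
    query_log = /var/log/query.log
    pid_file  = /var/log/searchd.pid
}
EOF
}
```

For the credentials used in the GRANT statement earlier, that would be: emit_sphinx_conf dw dw > /usr/local/etc/sphinx.conf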

Testing

To make sure that your test setup is working, once you have all of the configuration done, try to run the indexer (as root).

indexer --all

You should see it spit out some stuff saying it's collecting documents, and if all goes well, you should see files in /var/data. You won't be able to search yet because you haven't placed any data in your search database, but you'll at least have confirmed that you have Sphinx configured properly.


Search Architecture

Making content searchable requires two things:

  • a search database. The search database contains the text of entries and comments that we want to be searchable. It's separate from the main database. Every time an entry or comment is posted or edited, it needs to be copied from the main database into the search database. The worker that does this is sphinx-copier, which is run by TheSchwartz.
  • a search index. Doing a search on raw text is painfully slow, so Sphinx processes the contents of the search database further -- it creates an index of words, then runs searches against that index instead of the raw text itself. Processing the text this way also makes it possible for a search for "test" to turn up "tests", "testing", etc. The index is created by running a program named, surprisingly enough, indexer.

Running searches and getting results requires two more:

  • the Sphinx search daemon -- the program that actually runs searches. This is searchd.
  • a search worker, which connects to the Sphinx daemon, feeds it queries, and retrieves the results. This is sphinx-search-gm, a Gearman job.


Assembling the pieces (test env)

Now, here's how we put all of those pieces together in a test environment.

  • First, we need to have both TheSchwartz and Gearman set up and working -- sphinx-copier depends on the former, and sphinx-search-gm on the latter.
  • Now, we want to run both of those workers in the foreground so that we can keep an eye on them. In separate terminal sessions, run:
bin/worker/sphinx-copier -v
bin/worker/sphinx-search-gm -v
  • Next, we need to have the search daemon running -- also in the foreground:
searchd --console
  • To get data into the search db, you have two options:
    • You can post to some of your paid accounts. Now that sphinx-copier is running, it will pick up anything new that you post and schedule it for addition to the search db.
    • Alternatively, you can run the manual copier, which will tell sphinx-copier about any posts or comments that are already on your site:
bin/schedule-copier-jobs
  • ...running that will get your sphinx-copier busy copying data into the dw_sphinx database you made earlier. You can see if it works by watching the output of the sphinx-copier -- it should say something about inserting posts. You can then go to the dw_sphinx database and select from posts_raw to see the data is actually in the system.
  • Now that we have data in the search database, we have to index it. On a production site, you'd want to run the indexer every 15 minutes or so; in test, you can just run it before you do a search, if you've added new content since the last run.
  indexer --all --rotate
  • Finally, restart searchd.

You should now be able to search for things from the command line or from your site! To search from the command line, use:

search -q some words
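
The copy check and the indexing step above can be wrapped in two small helpers, sketched below. The function names and the MYSQL/INDEXER override variables are made up for this example, and the credentials and the posts_raw table name are the ones used earlier on this page; adjust them to your setup.

```shell
# Count the rows sphinx-copier has inserted into the search database.
# MYSQL can be overridden in the environment (hypothetical convenience).
count_copied_posts() {
    "${MYSQL:-mysql}" -u dw -pdw -N -e 'SELECT COUNT(*) FROM posts_raw' dw_sphinx
}

# Rebuild every index, as in the indexing step above.
# INDEXER can be overridden in the environment for a dry run.
reindex_all() {
    "${INDEXER:-indexer}" --all --rotate
}

# On a production site, a crontab entry along these lines would run the
# indexer every 15 minutes (shown as a comment, not executed here):
#   */15 * * * * indexer --all --rotate
```

If count_copied_posts prints 0 after you've run bin/schedule-copier-jobs, check the sphinx-copier output before bothering to index.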


Sphinx is resource-intensive. It's intentionally been made a separate system from the main Dreamwidth site, so that it can be run on a different machine from the webservers in production.

You don't have to worry too much about load on a development server where you have little data to index and it's only you on the machine. Still, it may make sense to only turn on the search workers when you're testing something search-related.