There are two forms of search on Dreamwidth installations; both are optional and require further setup.

User search looks for users matching certain characteristics. Text search searches through entries and comments. This page focuses on text search.

= User Search =

User search is not heavily documented; instructions for setup can be found in [[Set_up_UserSearch]].

= Text Search =

Dreamwidth uses Sphinx, an open-source search package, to implement text search. Search is available on the [http://dreamwidth.org/search search page]. There are two modes of search: site search and per-journal search.

Site search only shows public content. Journal search may include locked content, following the usual visibility rules: if you can see an entry on the journal, you can find it with search; if you can't see it on the journal, you won't see it in the search results. There is also an option to search comments. For technical reasons (site load), only comments made on paid users' journals are indexed.

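The visibility rule above can be sketched as a simple predicate. This is illustrative pseudologic only, not Dreamwidth's actual implementation:

```shell
#!/bin/sh
# Illustrative sketch only -- not Dreamwidth's actual code.
# An item turns up in journal search exactly when the viewer could read
# it on the journal itself: public items always, locked items only when
# the viewer has access.
visible_in_search() {
    security=$1      # "public" or "locked"
    has_access=$2    # "yes" if the viewer can read locked content here
    [ "$security" = "public" ] || [ "$has_access" = "yes" ]
}

visible_in_search public no && echo "shown"    # public: always shown
visible_in_search locked no || echo "hidden"   # locked, no access: hidden
```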
Text search is resource-intensive, so it is a separate system from the main Dreamwidth site; on a production site, this makes it possible to run it on a different machine from the webservers. You don't have to worry about this too much on a development server, where it's basically just you on the site. Still, it may be good to turn on the search workers only when you're testing something specific.

== Installation ==

You'll need to install the Sphinx package and a couple of Perl modules that make it easy for us to use Sphinx:

=== Installing the Sphinx package ===

 apt-get install sphinxsearch

=== Installing File::SearchPath and Sphinx::Search ===

These are available via Ubuntu's package system, so:

 apt-get install libfile-searchpath-perl libsphinx-search-perl

It is important to [http://search.cpan.org/~jjschutz/Sphinx-Search-0.22/lib/Sphinx/Search.pm#VERSION match up the versions] of the Perl packages and the Sphinx package; otherwise, your searches will fail silently due to incompatibilities in the API. If the proper workers are running and "search -q terms" returns results while a site search always fails, a version mismatch is one possible cause. Assuming all of that works, you should have everything installed that you need to get the search system set up. Moving on!

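Since a mismatch fails silently, it can help to compare the two versions side by side. A rough sketch: the version-printing commands in the comments assume both packages are installed, and the helper below only checks that the major.minor parts agree (the authoritative compatibility table is in the Sphinx::Search documentation linked above):

```shell
#!/bin/sh
# To print the two versions on a live system (assumes both are installed):
#   perl -MSphinx::Search -e 'print $Sphinx::Search::VERSION'
#   searchd --help | head -n 1
#
# Standalone helper: do two dotted version strings share major.minor?
# (A rough heuristic, not Sphinx::Search's exact compatibility rule.)
same_major_minor() {
    a=$(printf '%s' "$1" | cut -d. -f1-2)
    b=$(printf '%s' "$2" | cut -d. -f1-2)
    [ "$a" = "$b" ]
}

same_major_minor "0.22" "0.22.1" && echo "looks compatible"
```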
== Setup ==

=== Database ===

You will need to create a new database. Something like this will work, but make sure to adjust the username and password:

 CREATE DATABASE dw_sphinx;
 GRANT ALL ON dw_sphinx.* TO dw@'localhost' IDENTIFIED BY '__YOURPASSWORD__';
 USE dw_sphinx;

Now you have to create the tables:

<gist>ba988dfd02e49822246f</gist>

The table is pretty straightforward: it stores the posts, who they're by, where they were posted, and some basic security information. Note that this table holds the full (compressed) subject and text of the entries, so it can get rather large.

=== Site Configuration ===

Configuring your site is next. This involves adding a new section to your %DBINFO hash, like this:

 sphinx => {
     host => '127.0.0.1',
     port => 3306,
     user => 'dw',
     pass => '__YOURPASSWORD__',
     dbname => 'dw_sphinx',
     role => {
         sphinx_search => 1,
     },
 },

You also need to add a configuration line elsewhere in the file that tells your system where the search daemon will be. Port 3312 is the default:

 # sphinx search daemon
 @SPHINX_SEARCHD = ( '127.0.0.1', 3312 );

That's it for site configuration. Once you have the two options above in place, your site will do all the right things to make search active. Of course, we still have to configure Sphinx itself...

=== Sphinx ===

The first step, assuming you're going to be running Sphinx as root, is to make the directory it needs:

 mkdir /var/data

Now we need to set up the configuration file. By default, Sphinx will look for the file at `/usr/local/etc/sphinx.conf`. To confirm, run this:

 indexer --quiet

It will fail if it can't find a config file, but it will helpfully tell you where it tried to look. Now that we know where the config file goes, we need to replace it with this:

<gist>f50f0604a064db0464ad</gist>

That's right, it's long. But it's actually almost identical to the configuration file that comes with Sphinx. It contains a lot of tweaks to get the right combination of values for UTF-8 support and the like, but the rest is pretty straightforward.

Make sure to customize `sql_user` and `sql_pass` in the configuration file to match what you used earlier.

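If you'd rather script that edit than do it by hand, something like this works against a stock config file. This is a sketch: it assumes GNU sed and the usual `sql_user = ...` / `sql_pass = ...` line format, and the path in the example comment is just the default location mentioned above:

```shell
#!/bin/sh
# Sketch: rewrite the sql_user/sql_pass lines in a sphinx.conf-style file.
# Assumes GNU sed (-i with no suffix); pass in whatever credentials you
# used when creating the dw_sphinx database.
set_sphinx_creds() {
    conf=$1; user=$2; pass=$3
    sed -i \
        -e "s|^\([[:space:]]*sql_user[[:space:]]*=\).*|\1 $user|" \
        -e "s|^\([[:space:]]*sql_pass[[:space:]]*=\).*|\1 $pass|" \
        "$conf"
}

# e.g.: set_sphinx_creds /usr/local/etc/sphinx.conf dw __YOURPASSWORD__
```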
Once you have all of the configuration done, make sure your setup is working by running the indexer (as root):

 indexer --all

You should see output saying it's collecting documents, and if all goes well, files will appear in /var/data. You won't be able to search yet, because you haven't put any data in your search database, but you'll at least have confirmed that Sphinx is configured properly.

== The Search Process ==

Getting content into search requires a few things:
* a database containing entry/comment text - we keep a separate database for the text of the entries/comments that we want to be searchable, and copy each entry/comment from the main database into the search database when it's posted or edited
* a Sphinx index - searching raw text is painfully slow, so Sphinx processes the contents of the search database further, creating an index of the words. Processing the text this way also makes it possible for a search for "test" to turn up "tests", "testing", etc.

Getting search results involves a couple more:
* a Sphinx daemon - actually runs the search
* a search worker - connects to Sphinx and retrieves the results

For search to work, you have to first [[TheSchwartz_Setup | set up TheSchwartz]] and [[Setting up Gearman | set up Gearman]].

The full process of a search is as follows:
<ol><li>schedule any entries/comments that should be searchable for copying

Entries/comments that existed before you set up search will all need to be copied over. The following script schedules a full copy of entries/comments for all accounts on your server:

<pre> bin/schedule-copier-jobs</pre>

Entries/comments posted or edited after you set up search will be scheduled for copying automatically.
</li><li>copy the original text into the search database. You can leave this running:

 bin/worker/sphinx-copier -v

</li><li>run the Sphinx indexer to make the text searchable (as root). On a production server, you might want to put this in a crontab and run it every 15 minutes or so. If you're just testing and have inserted new content, make sure to run it before you search, so the new content shows up in the results:

 indexer --all --rotate

After this, the text is finally searchable. If you want, you can test by running:

 search -q words

If that works, you're on your way to making search work on the site itself.
</li><li>run the Sphinx search daemon (as root). Put this in a window and leave it alone until you're done; in console mode it prints some debugging output:

 searchd --console --pidfile

</li><li>make sure the search worker is running. You can leave this running in a window:

 bin/worker/sphinx-search-gm -v

</li></ol>
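For the production crontab mentioned in step 3, an entry along these lines would reindex every 15 minutes. The indexer path is an assumption; check yours with `which indexer`:

```
# m   h  dom mon dow  command (run from root's crontab)
*/15  *   *   *   *   /usr/bin/indexer --all --rotate >/dev/null 2>&1
```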
After the first time, you'll only need to care about steps 2 onward; the scheduling in step 1 will happen automatically.

[[Category: Development]][[Category: Dreamwidth Installation]][[Category: Reference]]