Production Notes
Revision as of 06:43, 19 May 2009
This document is meant to be read by people with sysadmin experience. I'll go back at some point and clean it up, break it down into sections, etc. But for now I'm just trying to dump as much information as possible so that Matthew and Robby have some state on how things are.
Links
- Cacti: http://z.dreamwidth.org/cacti/
- Nagios: http://z.dreamwidth.org/nagios3/
- Healthy?: http://www.dreamwidth.org/admin/healthy.bml
Nagios
The Nagios setup runs on dfw-admin01 in /etc/nagios3; most of the configuration files are in /etc/nagios3/conf.d, as you can imagine. You can poke around if you want to change it; it's pretty straightforward.
If you do change things, you probably want to commit them to the operations repository.
make your changes... etc etc
$ sync-back-nagios
$ cd /root/dw-ops/nagios/conf.d
$ hg status
if everything looks good, then:
$ commit -a mark -m "Some commit message."
Replace mark with matthew or alierak as appropriate.
Cacti
Most of the graphs are more or less useful. I spend a lot of time looking at dfw-lb01 which shows all of the incoming site traffic. In particular: eth0 is always the "Internet" interface, on all slices. eth1 is the "Internal/Private" interface. And lo is lo.
The only time lo is really interesting is on the dfw-lb01/dfw-lb02 machines. Look at the SSL configuration to see why, but lo is the measure of how much SSL traffic we're doing.
Traffic Flow
This summarizes the flow of traffic. There are a lot more sections that talk far more in depth about various things, but here you go...
- Site external IP is on dfw-lb01 (or dfw-lb02), which runs Perlbal.
- User connects to Perlbal. If it's a static request, it serves it locally. If it's dynamic, it hands off to a webserver.
- Perlbal connects to dfw-webXX and proxies the request.
- Webservers connect to lots of things: databases, memcache, mogilefsd, gearmand, etc.
- Response is returned.
That's the basic flow of things and what connects to what. There's a separate flow that happens when the user requests a userpic (or any other MogileFS resource, but for now it's just userpics).
- User -> Perlbal, "GET /userpic/XXXX/YYY"
- Perlbal -> Webserver, "GET /userpic/XXXX/YYY"
- Webserver replies: X-REPROXY-URL: http://dfw-mog01/dev1/0/00/000/234.fid
- Perlbal -> dfw-mog01, "GET /dev1/..."
- Mogile storage node replies with image
- Perlbal munges the headers from the webserver's original reply, adds the body of the image from the Mogile storage node, and returns that to the user.
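The reproxy steps above can be sketched roughly as follows. This is only an illustration of the flow, not Perlbal's actual code: the function name resolve_reproxy, the injected fetch callable, and the header-merge policy (keep the webserver's headers, drop the reproxy header, recompute Content-Length) are all assumptions made for the example. It also treats the header value as a single URL.

```python
from typing import Callable, Dict, Tuple

Headers = Dict[str, str]
Response = Tuple[Headers, bytes]


def resolve_reproxy(backend_response: Response,
                    fetch: Callable[[str], Response]) -> Response:
    """Return what the load balancer would send back to the user.

    `fetch` is a hypothetical callable that GETs a URL from a storage
    node and returns (headers, body); it is injected so the sketch can
    run without a network.
    """
    headers, body = backend_response

    # Header names are case-insensitive, so find the key that way.
    reproxy_key = next(
        (k for k in headers if k.lower() == "x-reproxy-url"), None)
    if reproxy_key is None:
        # Ordinary dynamic response: pass it through untouched.
        return headers, body

    # The webserver told us where the real content lives; go get it.
    _storage_headers, storage_body = fetch(headers[reproxy_key])

    # Assumed merge policy: keep the webserver's headers, drop the
    # internal reproxy header, and fix up the length for the new body.
    merged = {
        k: v for k, v in headers.items()
        if k.lower() not in ("x-reproxy-url", "content-length")
    }
    merged["Content-Length"] = str(len(storage_body))
    return merged, storage_body
```

The point of the design is that the webserver only decides where the file lives; the bytes themselves never pass through it.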
SSL is different again:
- User -> Pound.
- Pound handles the SSL handshake and decryption/encryption.
- Pound connects to localhost:80 (Perlbal).
- Same process now as originally.
Perlbal
Perlbal is the main software load balancer. With the Dreamwidth configuration, it doesn't do terribly much except handle reproxying.
- Runs on: dfw-lb01, dfw-lb02
- Admin port: 60000
Nagios is set up to monitor HTTP and SSL on these machines, but not necessarily the admin port. (That could be useful.)
Start/Restart
If Perlbal happens to crash or otherwise become unavailable, you can start/restart it.
run as root on dfw-admin01
$ bin/restart-perlbal
Health Checking
Kareila made a script for doing status checks on the perlbals. You can run it like this:
[dw @ dfw-admin01 - ~/current] -> bin/pbadm 1
Name "LJ::PERLBAL_SERVERS" used only once: possible typo at bin/pbadm line 37.
Tue May 19 05:44:52 2009: [lb01 - 003, 0000] [lb02 - 000, 0000]
Tue May 19 05:44:53 2009: [lb01 - 001, 0000] [lb02 - 000, 0000]
Tue May 19 05:44:54 2009: [lb01 - 004, 0000] [lb02 - 000, 0000]
Ignore the warning. These lines should be color coded: green is okay, yellow is intriguing, red is problematic. But generally as long as the numbers look pretty low, it should be alright. (Unless it says DOWN of course...)
MogileFS
Gearman
Very simple server that just handles jobs. If this goes down, it should be started back up.
- Runs on: dfw-jobs01, dfw-jobs02
- Port: 7003
There is no administrative port. I think there are some commands you can use to see how deep the queues are, but I don't know off the top of my head. We use gearman for only one thing right now (userpic resizes?) so I can't imagine it falling behind.
Start/Restart
This is manual; I have no tool to do it. SSH to the servers that run gearman and use the /etc/init.d/gearman-server script.
Memcached
These generally stay up and never give any trouble. They store data; it's basically an LRU cache. We don't push them that hard right now. You can find all of the basic information in Cacti, where I set up a nice memcached graphing library with interesting stats.
- Runs on: dfw-memc01, dfw-memc02
- Port: 11211
Start/Restart
Same as for Perlbal:
as root on dfw-admin01
$ bin/restart-memcache
KEEP IN MIND: Restarting memcache puts a heavy strain on the databases. While we can get away with it without any trouble right now (our databases are bored), at some point in the future restarting memcache will become synonymous with shooting the site in the knee and watching it hobble along.
Admin Stats
If you want the nitty gritty you can telnet to one of the instances on the port above and type stats which will give you a nice dump.
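If you'd rather script that than telnet by hand, here is a minimal sketch. It relies on the memcached text protocol, where the reply to stats is a series of "STAT name value" lines followed by END. The function names parse_stats and fetch_stats are made up for this example, and the timeout is an arbitrary choice.

```python
import socket
from typing import Dict


def parse_stats(raw: str) -> Dict[str, str]:
    """Turn the text reply to `stats` into a name -> value dict."""
    stats: Dict[str, str] = {}
    for line in raw.splitlines():
        if line == "END":
            break  # END marks the end of the stats dump
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_stats(host: str, port: int = 11211,
                timeout: float = 5.0) -> Dict[str, str]:
    """Connect to one memcached instance and return its stats."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        buf = b""
        # Keep reading until the terminating END line arrives
        # or the server closes the connection.
        while not buf.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            buf += chunk
    return parse_stats(buf.decode("ascii", errors="replace"))
```

For example, fetch_stats("dfw-memc01") should return a dict containing keys such as curr_items, get_hits and get_misses. Values are left as strings, so convert them yourself if you want to do arithmetic.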
TheSchwartz
Workers
Incoming Mail
Outgoing Mail
Databases
Webservers
SSL
ddlockd
This is a very damn simple locking daemon.
- Runs on: dfw-jobs01, dfw-jobs02
- Port: 7002
Start/Restart
Given that it's a locking system, it won't actually restart. The following command will start the daemons up if they're down, and that's it:
run as root on dfw-admin01
$ bin/restart-ddlockd
Status
You can telnet to the port and issue the command status to see what's going on. It's a little terse.
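The same check can be scripted instead of typed into telnet. This is a rough sketch with some guesses baked in: the helper name send_command is invented here, the command is assumed to be terminated with CRLF, and since the shape of the reply isn't documented above, the code just reads until the server closes the connection or goes quiet for the timeout, then returns the raw text.

```python
import socket


def send_command(host: str, port: int, command: str,
                 timeout: float = 2.0) -> str:
    """Send one line to a text-protocol daemon and return the raw reply.

    Assumptions: the daemon reads CRLF-terminated commands, and we do
    not know its reply terminator, so we stop on close or on timeout.
    """
    chunks = []
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii") + b"\r\n")
        try:
            while True:
                chunk = sock.recv(4096)
                if not chunk:
                    break  # server closed the connection
                chunks.append(chunk)
        except socket.timeout:
            pass  # nothing more arrived within the timeout; assume done
    return b"".join(chunks).decode("ascii", errors="replace")


if __name__ == "__main__":
    # Hostnames and port are the ones listed for ddlockd above.
    for host in ("dfw-jobs01", "dfw-jobs02"):
        print(host, send_command(host, 7002, "status"))
```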