Production Notes
Latest revision as of 03:33, 7 November 2021
This document is for the Dreamwidth staff. Most of this won't work for you, but if you're curious, feel free to look. We host on AWS.
Links
- Dashboard: https://app.datadoghq.com/dashboard/v8n-hrk-jgt/dreamwidth-health
- Healthy?: http://www.dreamwidth.org/admin/healthy.bml
Traffic Flow
Traffic looks like:
1. We use Cloudflare for some things (www, userpics, attachments, some high traffic domains).
2. AWS CloudFront. Use this to configure caching and such.
3. AWS WAF. Use this to configure mitigations against bad actors or things.
4. AWS Application Load Balancer. Use this to route traffic to different internal endpoints.
5. Destination instance(s).
It's hard to say what #5 is since it varies depending on the request. In general though, it hits one of our EC2 instances which handles the request.
Gearman
Very simple server that just handles jobs. If this goes down, it should be started back up.
- Runs on: dfw-jobs01, dfw-jobs02
- Port: 7003
There is no administrative port. I think there are some commands you can use to see how deep the queues are, but I don't know off the top of my head. We use gearman for only one thing right now (userpic resizes? directory searches?) so I can't imagine it falling behind.
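On the queue-depth question: Gearman servers generally accept plain-text admin commands on the same port as job traffic, and <tt>status</tt> returns one tab-separated line per function (name, total jobs in queue, jobs running, available workers), ending with a line that contains only a dot. Below is a minimal Python sketch, assuming our gearman-server build speaks that text protocol. The helper names are made up for illustration and this has not been run against our servers.

```python
import socket


def parse_gearman_status(text):
    """Parse the reply to Gearman's text 'status' command.

    Each line is: function<TAB>total<TAB>running<TAB>available_workers.
    A line containing only '.' ends the reply.
    """
    queues = {}
    for line in text.splitlines():
        line = line.strip()
        if line == ".":
            break
        if not line:
            continue
        name, total, running, workers = line.split("\t")
        queues[name] = {
            "total": int(total),
            "running": int(running),
            "workers": int(workers),
        }
    return queues


def fetch_gearman_status(host, port=7003, timeout=5.0):
    """Send 'status' to a Gearman server and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"status\n")
        data = b""
        # Keep reading until the terminating '.' line has arrived.
        while not (data == b".\n" or data.endswith(b"\n.\n")):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_gearman_status(data.decode("utf-8", "replace"))
```

Calling <tt>fetch_gearman_status("dfw-jobs01")</tt> from a machine that can reach port 7003 should return a dict keyed by function name. A "total" that keeps growing while "workers" is 0 would suggest the workers for that function are down.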
Start/Restart
This is manual; I have no tool to do it. SSH to the servers that run gearman and use the /etc/init.d/gearman-server script.
Dreamhacks
See Setting up Gearman.
Memcached
These generally stay up and never give any trouble. They store data; it's basically an LRU cache. We don't push them that hard right now. You can find all of the basic information in Cacti; I set up a nice memcached graphing library with interesting stats.
- Runs on: dfw-memc01, dfw-memc02
- Port: 11211
Start/Restart
Run the restart script:
as root on dfw-admin01
$ bin/restart-memcache
KEEP IN MIND: Restarting memcache puts a heavy strain on the databases. While we can get away with it without any trouble right now (our databases are bored), at some point in the future restarting memcache will become synonymous with shooting the site in the knee and watching it hobble along.
Admin Stats
If you want the nitty gritty you can telnet to one of the instances on the port above and type stats which will give you a nice dump.
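The same check can be scripted instead of typed into telnet. In memcached's text protocol, the reply to <tt>stats</tt> is a series of <tt>STAT name value</tt> lines followed by <tt>END</tt>. Below is a minimal Python sketch. The helper names are illustrative and not part of any existing tool of ours, and it assumes the instance is reachable on port 11211 from wherever you run it.

```python
import socket


def parse_memcached_stats(text):
    """Turn 'STAT <name> <value>' lines into a dict, stopping at 'END'."""
    stats = {}
    for line in text.splitlines():
        if line == "END":
            break
        parts = line.split(" ", 2)
        if len(parts) == 3 and parts[0] == "STAT":
            stats[parts[1]] = parts[2]
    return stats


def fetch_memcached_stats(host, port=11211, timeout=5.0):
    """Send 'stats' to a memcached instance and return the parsed reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(b"stats\r\n")
        data = b""
        # The reply ends with a line containing only 'END'.
        while not data.endswith(b"END\r\n"):
            chunk = sock.recv(4096)
            if not chunk:
                break
            data += chunk
    return parse_memcached_stats(data.decode("ascii", "replace"))


def hit_rate(stats):
    """Fraction of gets served from cache, or None if there were no gets."""
    hits = int(stats.get("get_hits", 0))
    misses = int(stats.get("get_misses", 0))
    total = hits + misses
    return hits / total if total else None
```

For example, <tt>hit_rate(fetch_memcached_stats("dfw-memc01"))</tt> gives a quick sense of how warm the cache is, which is useful right after a restart when the databases are taking the extra load.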
TheSchwartz
There's not much to mention here. The actual work is done by workers, which are covered in the Workers section of this document. TheSchwartz database is maintained on the global database (see Databases section). Logical db name is dw_schwartz.
Workers
The workers do async tasks that we don't need to happen inline with someone doing something on the website. Okay, so I lied, some workers actually are synchronous (thinking of the Gearman things here).
- Runs on: dfw-jobs01, dfw-jobs02
There is no port or management for these; they're just tasks. Typically speaking, you can see if they're running by looking at ps on the machine.
Start/Restart
Second verse...
as root on dfw-admin01
$ bin/restart-jobs
CAVEAT LECTOR: Restarting the workers can be hard on the content-importer workers, since they allocate 12 hours to process entry and comment imports. If you restart workers while an import is in progress, it will cause that user's import to effectively pause halfway for 12 hours until it gets retried later.
There is no current way around this. You just have to know when is a good time to restart workers. While I'm gone, if you need to restart them, just do it. If a user has a problem with a delayed import, support will be awesome and let them know that it might take a while.
If you want to check on the importers...
[root @ dfw-admin01 - ~] -> bin/importer-status
dfw-mail01
  4084 ? S 2:11 content-importer [bored]
dfw-jobs01
  5663 ? S 4:29 content-importer [bored]
  25034 ? S 0:05 content-importer [bored]
dfw-jobs02
  28323 ? S 0:04 content-importer [bored]
  28528 ? S 0:05 content-importer [bored]
Note they're all bored. That means you are safe to just restart the workers. On the other hand, if it says it's posting entries or comments, you might want to wait. (But if it's an emergency, just do it.)
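If you'd rather not eyeball the output, the "is everyone bored?" check is easy to script. Below is a minimal Python sketch that reads the text printed by bin/importer-status and lists any importer that is not <tt>[bored]</tt>. It assumes the output always looks like the sample above (a host name line followed by ps-style rows). The function name is made up for illustration, and the exact wording of the non-bored states is not documented here, so the sketch treats anything other than "bored" as busy.

```python
import re

# A ps-style row: pid, tty, state, cpu time, process name, [status].
PROCESS_LINE = re.compile(
    r"^\s*(\d+)\s+\S+\s+\S+\s+\S+\s+content-importer\s+\[(.+)\]\s*$"
)


def busy_importers(status_output):
    """Return (host, pid, state) for each importer that is not [bored]."""
    busy = []
    host = None
    for line in status_output.splitlines():
        if not line.strip():
            continue
        match = PROCESS_LINE.match(line)
        if match is None:
            # Anything that is not a ps row is treated as a host heading.
            host = line.strip()
            continue
        pid, state = int(match.group(1)), match.group(2)
        if state != "bored":
            busy.append((host, pid, state))
    return busy
```

An empty list means it's safe to restart the workers. Anything else tells you which host and pid to watch before you do (unless it's an emergency).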
Incoming Mail
The machine dfw-mail01 handles incoming mail. It's a postfix system, with the MySQL module so that it can handle mail aliases/forwarding for users.
Sorry this is lacking in detail. If you're familiar with postfix you can dig around /etc/postfix for more information, specifically the /etc/postfix/dw directory.
Outgoing Mail
Amazon SES is our outgoing mail provider. You can use the AWS dashboards if needed.
Databases
We use Amazon RDS and operate on the Aurora database technology. This is basically MySQL but it automates things like leader elections/follower promotions, backups, etc.
Webservers
Serves web requests. Pretty straightforward.
- Runs on: va-web{01,02,03,04,05,06}
Start/Restart
You're probably used to this by now. There is a handy tool to do the restarts, but this one lets you give it a "delay". If you have an emergency, and need to restart everything, you can just run the command with a 0 argument.
run as root on va-admin01
$ bin/restart-webs 5
That restarts the webservers with a 5 second delay. If you don't specify a delay, the current default is 5 seconds.
ddlockd
This is a very damn simple locking daemon.
- Runs on: dfw-jobs01, dfw-jobs02
- Port: 7002
Start/Restart
Given that it's a locking system, it won't actually restart. The following command only starts the daemons if they're down:
run as root on dfw-admin01
$ bin/restart-ddlockd
Status
You can telnet to the port and issue the command status to see what's going on. It's a little terse.