Difference between revisions of "English-stripping"
Revision as of 17:49, 11 July 2013
English-stripping a page refers to the process of taking hardcoded English text out of the BML pages, giving each piece of text an ID you can use to refer to the string, and then putting the original English text in a lookup file. In this way, you're stripping the BML files of any English text, hence the name. This is useful because it makes it easy to support multiple languages; the text for different languages is held in the database and can be looked up by the aforementioned ID.
Although Dreamwidth Studios itself won't be supporting any language other than English, it's still important to learn how to English-strip pages, as it means our Site Copy team can change text as necessary on the site without having to go through the code, and also because we want other users of the code to be able to implement other languages with a minimum of hassle. (For both of these reasons, we're also going to be replacing the current translation system with something better - although to be perfectly honest, that's not going to be too hard.)
Glossary
First, a bit of explanation about some of the terms we're going to use:
- String: this refers to a piece of text. For example, this sentence can be considered a string. We'll normally use this when referring to the text that a multi-language ID refers to (defined below).
The Anatomy of a Multi-Language ID
The IDs that replace English text in a BML page are referred to as Multi-Language IDs. There are two types of ID - global IDs (which can be used by any BML page) and page-specific IDs (which are only valid on one page). When you English-strip a file, you will almost always be using page-specific IDs, but it's helpful to know about global IDs anyway.
Global IDs
Global IDs are, for the most part, defined in bin/upgrading/en.dat. However, for features specific to Dreamwidth Studios (and unusable by any other site using our code), any corresponding global IDs will be defined in bin/upgrading/en_DW.dat instead. For example, the Tropospherical sitescheme strings are stored in en_DW.dat since Tropospherical is specific to DWS, and because the strings appear in every page, page-specific IDs can't be used.
A global ID looks something like this:
date.month.december.short
You'll notice this ID is split into several parts with dots. This helps to know precisely how the string is being used; ideally, each separate part should be a subset of the part before it in some way. In this example, month is part of date; december is a month, and short means the short version of how to say this month. (In this example, it's "Dec" in the English text; there's a corresponding long version too, which is simply "December".)
Each section name should be lower-case and use only letters, digits, and the underscore and hyphen characters. (There aren't actually any set rules for the characters you can use in IDs in the code, but this is how it's been done so far.) The number of sections in an ID is arbitrary, as are the section names themselves. However, you should always have at least two sections in an ID for ease of use.
Page-specific IDs
Page-specific IDs are defined in a file with the same name as the page they apply to, plus the additional extension .text. For example, for a page htdocs/login.bml, the corresponding page-specific ID file will be htdocs/login.bml.text.
Page-specific IDs begin with a dot, and thereafter follow the same rules as global IDs. For example, one of the strings in the htdocs/login.bml.text file in dw-free has this ID:
.createaccount.header
Generally, in page-specific IDs, the names you'll use for your sections will correspond to the sections of the page in question. So this ID, for example, refers to the header of the section that invites the user to create an account if they don't already have one.
Again, the actual names and number of sections are arbitrary, but you should always have your IDs follow the structural flow of the content of the page for ease of use.
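The naming conventions for both kinds of ID can be sketched as a small check. This is only an illustration: follows_id_conventions is a made-up helper, not something in the Dreamwidth codebase, and it treats the "at least two sections" recommendation as a hard rule and does not count the leading dot of a page-specific ID as a section.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Dreamwidth codebase): returns 1 if an
# ID follows the conventions described above, 0 otherwise.
sub follows_id_conventions {
    my ($id) = @_;

    # Page-specific IDs start with a single dot; drop it before splitting.
    $id =~ s/^\.//;

    # The -1 limit keeps trailing empty sections, so "a.b." is rejected.
    my @sections = split /\./, $id, -1;

    # The guide recommends at least two sections per ID.
    return 0 if @sections < 2;

    # Each section: lower-case letters, digits, underscore and hyphen only.
    for my $section (@sections) {
        return 0 unless $section =~ /\A[a-z0-9_-]+\z/;
    }
    return 1;
}
```

For instance, both date.month.december.short and .createaccount.header pass this check, while Date.Month (upper-case letters) does not.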
The Anatomy of a .text File
There isn't too much to learn about how a .text file works - it's pretty straightforward. For each ID referenced in the page, you put the name of the ID, an equals sign (=), then the English text stripped from the file. (We'll talk about how precisely to do that in the next few sections.) Ideally, you should have one string correspond to one unbroken line of English. (This doesn't mean just one sentence - it's perfectly valid to have whole paragraphs under one ID. Just make sure you don't have any HTML in a string unless it's part of a sentence; i.e., don't include wrapping <p> tags, etc.)
For example, the .createaccount.header page-specific ID referred to in the last section is defined in the .text file like so:
.createaccount.header=Not a <?sitename?> member?
(The <?sitename?> part of this is a BML tag; for more information on these, see the linked page.)
It's possible to have a multi-line string in a .text file. You should never need to do this in a page-specific ID, but if you do, you simply replace the equals sign with two less-than signs (<<), and end the string with a dot on its own line. For example, here's the definition for the global ID email.invitecoderequest.accept.body:
email.invitecoderequest.accept.body<<
Your request for invites has been granted. You can view all your invite codes here:
[[invitesurl]]
.
All IDs should be listed in alphabetical order, if possible.
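To make the format concrete, here is a rough sketch of how a .text file could be read and checked for ordering. Both parse_text_file and ids_are_sorted are hypothetical helpers written for this guide; the real loader in the codebase may well behave differently, and "alphabetical" here just means Perl's default string sort.

```perl
use strict;
use warnings;

# Hypothetical parser for the .text format described above. Returns the IDs
# in file order and a hash mapping each ID to its string.
sub parse_text_file {
    my ($content) = @_;
    my ( @ids, %strings );
    my $multiline_id;    # set while we are inside an "id<<" block

    for my $line ( split /\n/, $content ) {
        if ( defined $multiline_id ) {
            if ( $line eq '.' ) {    # a lone dot ends the block
                $strings{$multiline_id} =~ s/\n\z//;    # drop final newline
                undef $multiline_id;
            }
            else {
                $strings{$multiline_id} .= "$line\n";
            }
        }
        elsif ( $line =~ /\A([^=<\s]+)<<\s*\z/ ) {    # "id<<" starts a block
            $multiline_id = $1;
            push @ids, $multiline_id;
            $strings{$multiline_id} = '';
        }
        elsif ( $line =~ /\A([^=<\s]+)=(.*)\z/ ) {    # "id=text"
            push @ids, $1;
            $strings{$1} = $2;
        }
    }
    return ( \@ids, \%strings );
}

# True if the IDs appear in (default string sort) alphabetical order.
sub ids_are_sorted {
    my ($ids) = @_;
    my @sorted = sort @$ids;
    return "@sorted" eq "@$ids";
}
```

A check like this is only a convenience while editing; a reviewer will still look over the .text file as a whole.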
How to English-strip
You might think, after learning the above, that English-stripping a page is fairly easy - and in theory, it is. In practice, however, you need to know at least something about how both Perl and HTML work.
A basic example
In Perl, literal text strings (that is, text which is mostly left unchanged) are represented by surrounding quote marks. For example:
"Enter your invite code below:\n"
The "\n" in this example is called a 'newline', and signals to Perl that it should start a new line when it encounters it.
The string itself may be surrounded on the same line by other Perl code, such as:
$ret .= "Enter your invite code below:\n";
In these examples, the string is highlighted in red. Your aim here is to get this string English-stripped using the LJ::Lang::ml Perl function. The function itself isn't used in a literal Perl string, so it doesn't need quotes around it. However, the newline "\n" character *does* need to be in a literal string with quotes around it, which means you need to combine the two. This is done using a dot (.), which is how you tell Perl to combine a literal string and something else. This is how it would end up:
$ret .= LJ::Lang::ml( ".createaccount.enter_invite_code" ) . "\n";
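If you want to convince yourself that the stripped line produces the same output, here's a tiny stand-alone sketch. The ml_stub function and the %text hash are stand-ins invented for this example (the real LJ::Lang::ml looks strings up through the translation system), and the sketch assumes the .text file contains the entry shown in the comment.

```perl
use strict;
use warnings;

# Stand-in for the page's .text file, assuming it contains:
#   .createaccount.enter_invite_code=Enter your invite code below:
my %text = (
    ".createaccount.enter_invite_code" => "Enter your invite code below:",
);

# Hypothetical stand-in for LJ::Lang::ml: look the ID up and return its text.
sub ml_stub {
    my ($id) = @_;
    return $text{$id};
}

# The original hardcoded line and the English-stripped line build the same
# string; note that the "\n" stays in the Perl code, not in the .text file.
my $before = "Enter your invite code below:\n";
my $after  = ml_stub(".createaccount.enter_invite_code") . "\n";
```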
Unknown data
Sometimes, there will be parts of a string which contain information that you can't specifically know when you're English-stripping, such as the username of the logged-in user. For example, the code might say something like:
$ret .= "$u->{user}, use your $LJ::SITENAMESHORT invite code:\n";
In the actual HTML output, this might look something like:
sophie, use your Dreamwidth invite code:
Even though the "$u->{user}" and "$LJ::SITENAMESHORT" parts in this example are highlighted in red, you can tell they're pieces of data by the dollar sign; anything in a string that starts with a dollar sign is data that needs to be kept somehow. You don't need to understand what the names mean in order to English-strip them, just the way they're used. (Of course, if you do understand them, it'll be easier to give them meaningful labels.)
In order to use a piece of data in a multi-language string, you need to assign it a label. For example, for the first piece, let's call it "username". Then, you take the data part exactly as written (including the dollar sign), and combine the two with =>:
username => $u->{user}
The above means that the label 'username' should have the value of whatever $u->{user} comes out to be.
If you have multiple pieces of data, as above, you can use commas to separate them; simply copy the above format and separate them with a comma. For example, let's assign the site name a label of "sitename":
username => $u->{user}, sitename => $LJ::SITENAMESHORT
You then surround the whole thing with braces:
{ username => $u->{user}, sitename => $LJ::SITENAMESHORT }
You then need to use the LJ::Lang::ml function, described above, and in addition to giving it the multi-language ID, you need to also give it the data itself:
$ret .= LJ::Lang::ml( ".createaccount.enter_invite_code",
    { username => $u->{user}, sitename => $LJ::SITENAMESHORT }
    ) . "\n";
Note that in this example, I've split the statement across three lines. Perl is perfectly happy with this as long as you break it in the right place - for example, not in the middle of a literal string. You could write the above as a single line if you wanted, but that's not very easy to read, and shorter lines are generally preferred.
The key thing to note here is that after the multi-language ID, I've put a comma, then pasted the notation we constructed above after it, while still inside the parentheses of the LJ::Lang::ml() function. After that, we continue on as normal - the closing parenthesis, and the newline character.
We're now done for the Perl part of it, and we now need to add the text to the .text file. Fortunately, this is a lot easier; when you need to put data in a string, simply refer to its label surrounded by two square brackets. The above string would be represented in the .text file as follows:
.createaccount.enter_invite_code=[[username]], use your [[sitename]] invite code:
We're done. Yay!
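To see how the labels and the .text string fit together, here's a self-contained sketch of the substitution. Everything in it is an assumption made for illustration: ml_stub is a stand-in rather than the real LJ::Lang::ml, the %text hash plays the part of the .text file, and the user record and site name are made-up values.

```perl
use strict;
use warnings;

# Stand-in for the .text file entry for this string.
my %text = (
    ".createaccount.enter_invite_code" =>
        "[[username]], use your [[sitename]] invite code:",
);

# Hypothetical stand-in for LJ::Lang::ml: looks up the ID, then replaces each
# [[label]] with the value given for that label.
sub ml_stub {
    my ( $id, $vars ) = @_;
    my $string = $text{$id};
    $vars ||= {};

    # Unknown labels are left untouched so mistakes stay visible.
    $string =~ s/\[\[(\w+)\]\]/exists $vars->{$1} ? $vars->{$1} : "[[$1]]"/ge;
    return $string;
}

my $u             = { user => "sophie" };    # made-up user record
my $sitenameshort = "Dreamwidth";            # stands in for $LJ::SITENAMESHORT

my $ret = ml_stub( ".createaccount.enter_invite_code",
    { username => $u->{user}, sitename => $sitenameshort } ) . "\n";
```

The labels in the braces only have to match the labels in the .text string; they don't have to match the Perl variable names.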
Plurals and numbers
These are described in excruciating detail in Embedding plural forms into translations, but here's a quick example:
$ret .= "You have $num message" . ( ( $num != 1 ) ? 's' : '' ) . " in your inbox.";
You could use a variable for the plural, like this:
$ret .= LJ::Lang::ml( ".inbox.num_msgs", { num => $num, plural => ( $num != 1 ) ? 's' : '' } );
.inbox.num_msgs=You have [[num]] message[[plural]] in your inbox.
However, this would still be baking English into the source code - not actual English text in this case, but English grammar in the form of singular and plural inflections. Instead, you can use:
$ret .= LJ::Lang::ml( ".inbox.num_msgs", { num => $num } );
.inbox.num_msgs=You have [[num]] [[?num|message|messages]] in your inbox.
This takes care of applying the rules for English plurals for you, and lets translators (with help from some magic in LiveJournal and Dreamwidth source code) handle it appropriately by just specifying a text string for their language, without having to muck around in the Dreamwidth source code - which is, after all, the goal of the translation system.
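As a rough picture of what that "magic" does for English, here's a small stand-alone sketch. It assumes the plural notation has the form [[?label|singular|plural]], and expand_stub is a hypothetical, English-only stand-in: the real translation system picks the form using per-language rules, which is the whole point of moving the choice out of the page code.

```perl
use strict;
use warnings;

# Hypothetical, English-only stand-in for the plural handling.
sub expand_stub {
    my ( $string, $vars ) = @_;

    # [[?label|singular|plural]]: pick a form based on the label's value.
    $string =~ s{\[\[\?(\w+)\|([^|\]]*)\|([^|\]]*)\]\]}
                { $vars->{$1} == 1 ? $2 : $3 }ge;

    # [[label]]: plain substitution of the value.
    $string =~ s{\[\[(\w+)\]\]}{ $vars->{$1} }ge;
    return $string;
}

my $template = "You have [[num]] [[?num|message|messages]] in your inbox.";
my $one      = expand_stub( $template, { num => 1 } );
my $three    = expand_stub( $template, { num => 3 } );
```

Notice that the calling code only passes the number; which word form to use is decided entirely by the string.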
Don't split sentences
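Keep each sentence in a single string, even when it contains data. For example, you should not do this in your .text file:
.createaccount.enter_invite_code.part1=, enter your
.createaccount.enter_invite_code.part2= invite code:
Splitting a sentence this way fixes the English word order in the code, so translators can't move the username or site name to where their language needs them. Use one string with labels instead, as described under "Unknown data".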
Heredocs
Sometimes, you'll come across Perl constructs that look something like this:
$ret .= <<HTML;
...followed by a block of text that isn't Perl code, followed by an "HTML" on its own line (or whatever was after the << in the original line). This format of text is known in Perl parlance as a "heredoc". If you can replace any text in there with <?_ml ... _ml?> tags, and it works, you should do so. However, if any need the use of the LJ::Lang::ml function, it's probably best to have someone look at it who codes Perl, since these require care in order to get right. (Of course, if you know Perl and know how to fix it, go ahead; otherwise, just make a note of it and move on.)
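In case you haven't seen one before, here's what a complete heredoc looks like in a small stand-alone script. The HTML content and the $title variable are made up for this example; the point is just to show where the block starts and ends, and that variables inside it are filled in the same way as in a double-quoted string.

```perl
use strict;
use warnings;

my $ret   = "";
my $title = "Create Account";    # made-up value for illustration

# Everything up to the line containing only "HTML" is appended to $ret.
# Because the terminator is unquoted, $title is interpolated just as it
# would be in a double-quoted string.
$ret .= <<HTML;
<h1>$title</h1>
<p>Enter your invite code:</p>
HTML
```

Both the heading text (via $title) and the paragraph text in a block like this are English that would need stripping.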
HTML forms
Sometimes, you'll come across text that needs to be stripped in HTML forms. In the following example, the contents of the <p> tag and the value of the submit button need to be English-stripped:
<form method="post" action="create.bml">
<input type="hidden" name="mode" value="codesubmit">
<p>Enter your invite code:</p>
<input type="text" name="invite" size="20" maxlength="20">
<input type="submit" value="Create Account">
</form>
Notice that we do *not* want to English-strip:
- name="invite": The word "invite", although English, is being used here as a field name, and is not shown to the user. The code will be looking out for a field called "invite" no matter what language the user is using, so this must not be changed.
- value="codesubmit": Same as above - this value is used by the code.
However, the value of a submit button (in this case, "Create Account") *is* shown to the user (as this is what's shown on the button itself), and thus needs to be English-stripped, despite being in a value attribute.
There is one exception to this. If you find a submit button with a "name" attribute, check to see if there are any more. If any two submit buttons have the same 'name' attribute, such as:
<input type="submit" name="action" value="Rename">
<input type="submit" name="action" value="Delete">
...then do not English-strip it, and make a note for whoever reviews your patch that this will need to be fixed. (This is because the code will be checking for the value of whatever button is clicked on, and that prohibits English-stripping from taking place; if you were to English-strip the value here, it would no longer work for non-English users.)
If you do not find two submit buttons with the same 'name' attribute, but there is nonetheless still a 'name' attribute on at least one, make a note for the reviewer that this is the case, but go ahead and English-strip it as normal. (The reviewer will check to see whether the code is actually checking for this value.)
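To see why two submit buttons sharing a name get in the way, here's a sketch of the kind of check the page code might be doing. The chosen_action function is hypothetical (the real pages will each do this in their own way), and "Umbenennen" is just an example of what a German translation of the "Rename" label could be.

```perl
use strict;
use warnings;

# Hypothetical handler: works out which button was clicked by comparing the
# submitted value with hardcoded English text.
sub chosen_action {
    my ($post) = @_;
    return "rename" if $post->{action} eq "Rename";
    return "delete" if $post->{action} eq "Delete";
    return "unknown";
}

# The browser submits whatever label the clicked button displays, so a
# translated label no longer matches the hardcoded English comparison.
my $english = chosen_action( { action => "Rename" } );
my $german  = chosen_action( { action => "Umbenennen" } );
```

With the English label the handler recognises the click; with the translated label it falls through to "unknown". That's why the code itself has to be changed before buttons like these can be stripped.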
Fin
That's basically the guide for how to English-strip a page. Don't be too afraid of messing things up; a reviewer will tell you if you have anything wrong.