English-stripping

From Dreamwidth Notes

Latest revision as of 02:31, 11 December 2018

Needs Update: BML::ml is deprecated and you should use LJ::Lang::ml instead. LJ::Lang::get_text isn't explained.

English-stripping a page is the process of taking hardcoded English text out of the BML pages, giving each piece of text an ID you can use to refer to it, and putting the original English text in a lookup file. In this way, you're stripping the BML files of any English text, hence the name. Doing this makes it easy to support multiple languages: the text for each language is held in the database and is looked up by that ID.

Developers with experience on the LiveJournal code may refer to this as "translation" or "the translation system". Although Dreamwidth Studios itself won't be supporting any language other than English, it's still important to learn how to English-strip pages. It means our Site Copy team can change text on the site without having to go through the code, and it lets other users of the code implement other languages with the minimum of hassle.

Glossary

First, a bit of explanation about some of the terms we're going to use:

  • String: this refers to a piece of text. For example, this sentence can be considered a string. We'll normally use this when referring to the text that a multi-language ID refers to (defined below).
  • Multi-Language IDs: what replaces English text in a file. There are two types of ID - global IDs (which can be used by any BML page) and page-specific IDs (which are only valid on one page). When you English-strip a file, you will almost always be using page-specific IDs, but it's helpful to know about global IDs anyway.

Global IDs

Global IDs are, for the most part, defined in bin/upgrading/en.dat. However, for features specific to Dreamwidth Studios (and unusable by any other site using our code), any corresponding global IDs will be defined in bin/upgrading/en_DW.dat instead. For example, the Tropospherical sitescheme strings are stored in en_DW.dat since Tropospherical is specific to DWS, and because the strings appear in every page, page-specific IDs can't be used.

A global ID looks something like this:

date.month.december.short

You'll notice this ID is split into several parts with dots. This helps show precisely how the string is being used; ideally, each part should be a subset of the part before it in some way. In this example, month is part of date, december is a month, and short means the short version of the month's name. (In this example, it's "Dec" in the English text; there's a corresponding long version too, which is simply "December".)

Each section name should be lower-case and use only letters, digits, and the underscore and hyphen characters. (There aren't actually any set rules for the characters you can use in IDs in the code, but this is how it's been done so far.) The number of sections in an ID is arbitrary, as are the section names themselves. However, you should always have at least two sections in an ID for ease of use.
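
The naming convention above is easy to check mechanically. Here's a minimal sketch of such a check; the helper name is_conventional_id is made up for this example and isn't part of the Dreamwidth code, and it assumes the "at least two sections" rule is counted after the leading dot of a page-specific ID (described in the next section):

```perl
use strict;
use warnings;

# Hypothetical helper (not part of the Dreamwidth code): returns 1 if an ID
# follows the naming convention described above, 0 otherwise.
sub is_conventional_id {
    my ($id) = @_;

    # A page-specific ID starts with a single dot; drop it before checking.
    ( my $body = $id ) =~ s/^\.//;

    # Split into sections; the -1 keeps empty trailing sections so "a.b." fails.
    my @sections = split /\./, $body, -1;

    # Assumption: "at least two sections" is counted after the leading dot.
    return 0 if @sections < 2;

    # Each section: lower-case letters, digits, underscore and hyphen only.
    for my $section (@sections) {
        return 0 unless $section =~ /\A[a-z0-9_-]+\z/;
    }
    return 1;
}

for my $id ( 'date.month.december.short', '.createaccount.header', 'Date.Month', 'december' ) {
    printf "%-28s %s\n", $id, is_conventional_id($id) ? 'ok' : 'does not follow the convention';
}
```

Remember that the code itself doesn't enforce any of this, so a check like this is only a convenience for keeping new IDs consistent with the existing ones.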

Page-specific IDs

Page-specific IDs are defined in a file with the same name as the page they apply to, plus the additional extension .text. For example, for a page htdocs/login.bml, the corresponding page-specific ID file will be htdocs/login.bml.text.

Page-specific IDs begin with a dot, and thereafter follow the same rules as global IDs. For example, one of the strings in the htdocs/login.bml.text file in dw-free has this ID:

.createaccount.header

Generally, in page-specific IDs, the names you'll use for your sections will correspond to the sections of the page in question. So this ID, for example, refers to the header of the section that invites the user to create an account if they don't already have one.

Again, the actual names and the number of sections are arbitrary, but you should always have your IDs follow the structural flow of the content of the page for ease of use.

All IDs should be listed in alphabetical order, if possible.

How to English-strip

A basic example

In Perl, literal text strings are represented by surrounding quote marks. For example:

"Enter your invite code below:\n"

The "\n" in this example is called a 'newline', and signals to Perl that it should start a new line when it encounters it.

The string itself may be surrounded on the same line by other Perl code, such as:

$ret .= "Enter your invite code below:\n";

In these examples, the string is the part inside the quote marks. Your aim here is to get this string English-stripped using the LJ::Lang::ml Perl function. The function call isn't part of a literal Perl string, so it doesn't need quotes around it. However, the newline "\n" character *does* need to be in a literal string with quotes around it, which means you need to combine the two. This is done using a dot - . - which is how you tell Perl to join a literal string to something else. This is how it would end up:

$ret .= LJ::Lang::ml( ".createaccount.enter_invite_code" ) . "\n";

Then put the string into the corresponding .text file, as the ID followed by an equals sign and the English text:

.createaccount.enter_invite_code=Enter your invite code below:

Always make sure you don't have any HTML in a string, unless it's part of a sentence (i.e., don't include wrapping <p> tags and the like).

Updating strings

If you're working on a bug that requires the content of a string to be updated, you must create an entirely new string with a new ID. The details are a bit arcane (see https://github.com/fhocutt/dw-free/commit/dc4f49ffda345967e1afa7ee7733716566f3f17d#commitcomment-5891643), but the upshot is that this is necessary to ensure the database gets updated.

Say you have example.string = This is an example. To update it, instead of simply editing the line in the .text file to example.string = This is a different example, you must change the ID as well, to (e.g.) example.string2 = This is a different example, and change the corresponding instance of example.string in the parent file to match.


Multi-line strings

It's possible to have a multi-line string in a .text file. Simply replace the equals sign with two less-than signs (<<), and end the string with a dot on its own line:

email.invitecoderequest.accept.body<<
Your request for invites has been granted. You can view all your invite codes here:

  [[invitesurl]]

.

Unknown data

Sometimes, there will be parts of a string which contain information that you can't specifically know when you're English-stripping, such as the username of the logged-in user. For example, the code might say something like:

$ret .= "$u->{user}, send your $LJ::SITENAMESHORT invite code:\n";

In the actual HTML output, this might look something like:

mark, send your Dreamwidth invite code:

The "$u->{user}" and "$LJ::SITENAMESHORT" parts in this example sit inside the quote marks, but you can tell they're pieces of data by the dollar sign; anything in a string that starts with a dollar sign is data that needs to be kept somehow. You don't need to understand what the names mean in order to English-strip them, just the way they're used. (Of course, if you do understand them, it'll be easier to give them meaningful labels.)

In order to use a piece of data in a multi-language string, you need to assign it a label. For example, for the first piece, let's call it "username". Then, you take the data part exactly as written (including the dollar sign), and combine the two with =>. If you have multiple pieces of data, as above, you can use commas to separate them:

username => $u->{user}, sitename => $LJ::SITENAMESHORT

You then surround the whole thing with braces:

{ username => $u->{user}, sitename => $LJ::SITENAMESHORT }

You then need to use the LJ::Lang::ml function and give it both the multi-language ID and the data itself:

$ret .= LJ::Lang::ml( ".invites.send_invite_code",
                   { username => $u->{user}, sitename => $LJ::SITENAMESHORT }
               ) . "\n";

Note that in this example, I've split the statement across three lines. Perl is perfectly happy with this as long as you break it in the right place - for example, not in the middle of a literal string. You could write the above as a single line if you wanted, but it wouldn't be very easy to read, and shorter lines are generally preferred.

The key thing to note here is that after the multi-language ID, I've put a comma, then pasted the notation we constructed above after it, while still inside the parentheses of the LJ::Lang::ml() function. After that, we continue on as normal - the closing parenthesis, and the newline character.

We're now done with the Perl part, and we need to add the text to the .text file. Fortunately, this is a lot easier; when you need to put data in a string, simply refer to its label surrounded by double square brackets. The above string would be represented in the .text file as follows:

.invites.send_invite_code=[[username]], send your [[sitename]] invite code:

We're done. Yay!
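
If you're curious what happens to the labels, here's a rough sketch of the substitution step. It is only an illustration: fill_placeholders is a made-up stand-in, and the real lookup and substitution are done by LJ::Lang::ml, whose internals this guide doesn't cover.

```perl
use strict;
use warnings;

# Hypothetical stand-in for the substitution that LJ::Lang::ml performs.
# Takes the text from the .text file and the braces-enclosed data.
sub fill_placeholders {
    my ( $template, $data ) = @_;

    # Replace each [[label]] with its value; leave unknown labels untouched.
    $template =~ s/\[\[(\w+)\]\]/exists $data->{$1} ? $data->{$1} : "[[$1]]"/ge;

    return $template;
}

my $template = '[[username]], send your [[sitename]] invite code:';
my $line     = fill_placeholders( $template, { username => 'mark', sitename => 'Dreamwidth' } );

print "$line\n";    # mark, send your Dreamwidth invite code:
```

The point to take away is that the labels in the .text file have to match the labels you chose in the Perl code exactly.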

Plurals and numbers

These are described in excruciating detail in Embedding plural forms into translations, but here's a quick example:

$ret .= "You have $num message" . ( ( $num != 1 ) ? 's' : '' ) . " in your inbox.";

You could use a variable for the plural, like this:

$ret .= LJ::Lang::ml( ".inbox.num_msgs",
                   { num => $num, plural => ( $num != 1 ) ? 's' : '' }
               );

.inbox.num_msgs=You have [[num]] message[[plural]] in your inbox.

However, this would still be baking English into the source code - not actual English text in this case, but English grammar in the form of singular and plural inflections. Instead, you can use:

$ret .= LJ::Lang::ml( ".inbox.num_msgs", { num => $num } );

.inbox.num_msgs=You have [[num]] [[?num|message|messages]] in your inbox.

This takes care of applying the rules for English plurals for you, and lets translators (with help from some magic in LiveJournal and Dreamwidth source code) handle it appropriately by just specifying a text string for their language, without having to muck around in the Dreamwidth source code - which is, after all, the goal of the translation system.
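
To make the idea concrete, here's a small sketch of what choosing a plural form amounts to for English. The function expand_english is invented for this example and only handles the two-form English case, with the forms written as singular then plural and separated by vertical bars; the real expansion happens inside the LiveJournal and Dreamwidth code and isn't reproduced here.

```perl
use strict;
use warnings;

# Hypothetical illustration of the English rule only (1 is singular,
# everything else is plural). Not the actual Dreamwidth implementation.
sub expand_english {
    my ( $template, $data ) = @_;

    # [[?label|singular|plural]]: pick a form based on the number in $data.
    $template =~ s{\[\[\?(\w+)\|([^|\]]*)\|([^|\]]*)\]\]}
                  { $data->{$1} == 1 ? $2 : $3 }ge;

    # [[label]]: plain substitution, as in the previous section.
    $template =~ s{\[\[(\w+)\]\]}{ $data->{$1} }ge;

    return $template;
}

my $template = 'You have [[num]] [[?num|message|messages]] in your inbox.';

# Prints the singular sentence for 1 and the plural sentence for 3.
print expand_english( $template, { num => $_ } ), "\n" for ( 1, 3 );
```

Notice that the Perl code calling it no longer contains any English grammar at all; the choice of form lives entirely in the string.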

Don't split sentences

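Keep each sentence in a single string and use labels for the data inside it, as in the "Unknown data" section above. Word order differs between languages, so translators need the whole sentence in one piece to be able to rearrange it. For example, you should not do this in your .text file: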

.createaccount.enter_invite_code.part1=, enter your 

.createaccount.enter_invite_code.part2= invite code:

Heredocs

Sometimes, you'll come across Perl constructs that look something like this:

$ret .= <<HTML;

...followed by a block of text that isn't Perl code, followed by an "HTML" on its own line (or whatever was after the << in the original line). This format of text is known in Perl parlance as a "heredoc". If you can replace any text in there with <?_ml ... _ml?> tags, and it works, you should do so. However, if any of the text needs the LJ::Lang::ml function, it's probably best to have someone who codes Perl look at it, since these require care to get right. (Of course, if you know Perl and know how to fix it, go ahead; otherwise, just make a note of it and move on.)
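
As a sketch of the simple case, here's a heredoc where the English text has been swapped for an <?_ml ... _ml?> tag. The heading text "Create an account" is made up for the example, and the ID is borrowed from the login page example earlier. Run on its own, this just prints the tag as-is; it only turns into text when the page's output is processed for BML tags, which is the "and it works" part you need to check.

```perl
use strict;
use warnings;

my $ret = '';

# Before (hypothetical hardcoded English inside the heredoc):
#   <h1>Create an account</h1>
# After: the text is replaced by an <?_ml ... _ml?> tag holding the ID.
$ret .= <<HTML;
<h1><?_ml .createaccount.header _ml?></h1>
HTML

print $ret;
```

The tag contains no dollar signs or at signs, so Perl leaves it alone inside the heredoc.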

HTML forms

Sometimes, you'll come across text that needs to be stripped in HTML forms. In the following example, the contents of the <p> tag and the value of the submit button need to be English-stripped:

<form method="post" action="create.bml">
    <input type="hidden" name="mode" value="codesubmit">
    <p>Enter your invite code:</p>
    <input type="text" name="invite" size="20" maxlength="20">
    <input type="submit" value="Create Account">
</form>

Notice that we do *not* want to English-strip:

  • name="invite": The word "invite", although English, is being used here as a field name, and is not shown to the user. The code will be looking out for a field called "invite" no matter what language the user is using, so this must not be changed.
  • value="codesubmit": Same as above - this value is used by the code.

However, the value of a submit button (in this case, "Create Account") *is* shown to the user (as this is what's shown on the button itself), and thus needs to be English-stripped, despite being in a value attribute.
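
Here's a sketch of how the form above might look once stripped, written as a heredoc so it can be run as-is. The IDs .invite.prompt and .btn.create are invented for this example, and it assumes the page's output is processed for <?_ml ... _ml?> tags:

```perl
use strict;
use warnings;

my $ret = '';

# Only the text shown to the user is replaced. The field name "invite" and
# the hidden value "codesubmit" are left exactly as they were.
# The IDs .invite.prompt and .btn.create are hypothetical.
$ret .= <<HTML;
<form method="post" action="create.bml">
    <input type="hidden" name="mode" value="codesubmit">
    <p><?_ml .invite.prompt _ml?></p>
    <input type="text" name="invite" size="20" maxlength="20">
    <input type="submit" value="<?_ml .btn.create _ml?>">
</form>
HTML

print $ret;
```

The matching .text file would then gain the lines .btn.create=Create Account and .invite.prompt=Enter your invite code: (in alphabetical order).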

There is one exception to this. If you find a submit button with a 'name' attribute, check to see if there are any more. If any two submit buttons have the same 'name' attribute, such as:

<input type="submit" name="action" value="Rename">
<input type="submit" name="action" value="Delete">

...then do not English-strip them, and make a note for whoever reviews your patch that this will need to be fixed. (This is because the code will be checking the value of whichever button is clicked, which prevents English-stripping; if you were to English-strip the values here, the check would no longer work for non-English users.)

If you do not find two submit buttons with the same 'name' attribute, but there is nonetheless still a 'name' attribute on at least one, make a note for the reviewer that this is the case, but go ahead and English-strip it as normal. (The reviewer will check to see whether the code is actually checking for this value.)
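
To see why same-named submit buttons are a problem, here's a sketch of the kind of check the page's code might be doing. The handle_action function and the shape of the form data are assumptions made for this illustration, not code from Dreamwidth:

```perl
use strict;
use warnings;

# Hypothetical handler: decides what to do by comparing the submitted
# button value against the English label.
sub handle_action {
    my ($post) = @_;

    return 'rename' if $post->{action} eq 'Rename';
    return 'delete' if $post->{action} eq 'Delete';
    return 'unknown';
}

# With the English label, the comparison matches.
print handle_action( { action => 'Rename' } ), "\n";

# If the button were English-stripped, a German user's browser would send
# the translated label, and the comparison would no longer match.
print handle_action( { action => 'Umbenennen' } ), "\n";
```

This is the situation the reviewer needs to fix before the buttons can be stripped, for example by giving each button its own name.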

Fin

That's basically the guide for how to English-strip a page. Don't be too afraid of messing things up; a reviewer will tell you if you have anything wrong.