Difference between revisions of "Explanations"
Latest revision as of 23:02, 27 May 2020
See also: Decisions and tradeoffs
Contents
- 1 Why do we do it that way?
- 1.1 Why entry IDs are not in sequential order
- 1.2 Why you can't just change text in a translation string in a patch
- 1.3 Why we rate-limit failed logins
- 1.4 Why the base directory is $LJHOME instead of $DWHOME
- 1.5 Why you can only scroll back so far on the reading page or the Recent Entries page
- 1.6 User-level action logging: when to use infohistory vs userlog
- 1.7 dversion: old database revisions
- 1.8 Gender choices
- 2 Outdated terminology that we can't shake
Why do we do it that way?
There's a lot of stuff in the code that isn't obvious at first glance, and the way things are done can seem confusing or bizarre. A number of code-design choices aren't intuitive but have very good reasons behind them -- reasons people generally only learn the first time a pull request is bounced back to them with a request for revision. This is an attempt to get some of them out of our heads and onto the screen. Some of them are issues that will come up in pull requests; some of them are just a "this is why we do things this way" documentation of design decisions or common conventions.
Why entry IDs are not in sequential order
When an entry ID is assigned -- the "1239582.html" part of an entry URL -- it isn't sequential from the last entry posted to the account. That is, entries in your account aren't numbered "1.html", "2.html", etc.: each is assigned a number that combines the journal entry number (the 'jitemid') with a random number (the 'anum') between 0 and 255.
This is done for two reasons: to slow down bots, spiders, and spammers from going through all entries in an account one by one, and to prevent someone from being able to quickly tell that they can't see an entry in a journal. (With sequential IDs, a casual observer may notice that the last visible entry, "I'm A Teapot", was #418 and this one, "I'm So High Right Now", is #420, implying the existence of #419, which they can't see. They're much less likely to notice a gap when the numbers of consecutive entries usually differ by a few hundred anyway.)
Discussion
- nfagerlund
- Oh also, is the whole tangle of anum / itemid / ditemid bitwise math documented anywhere, like on the wiki or something?
- none of the code seems to have explanatory comments
- momijizukamori
- Not that I am aware of :c
- nfagerlund
- do you know it well enough to refresh my memory?
- I know SOMEONE here seemed to have an intuitive grasp on it the last time it came up
- momijizukamori
- no, that's more of a Mark question I think. Bit math gives me a headache
- nfagerlund
- *thumbsup_tone2* hard same
- momijizukamori
- I know one of them (I think itemid?) is the DB id number, and then it gets combined with the anum in some way to create the ditemid, which is what the user sees - basically so the visible item ids of sequential entries aren't themselves sequential, etc
- mark
- Yeah so
- ditemid means “display itemid” and is the value we use in URLs and show to users etc
- Think of it like the public ID
- alierak
- anum is just a random number to frustrate attempts to walk id space, yeah
- mark
- “jitemid” is the actual journal itemid, i.e. what’s in the database, this is just a sequential number
- “anum” is “a number”, it’s a random 1 byte value 0-255
- The ditemid is “jitemid * 256 + anum”
- Or in other words, shift the jitemid left one byte and add the anum
- anum is random but persistent, when the entry is posted we assign it an anum and it keeps that forever
- So the ditemid is stable
- alierak
- So ditemid >> 8 is a jitemid, and ditemid & 255 is the anum
- I think Entry.pm does a better job of documenting this than Talk.pm
- nfagerlund
- Thank you! So, this ID obfuscation is a property of entries... do comments have a similar thing?
- (and if so, does it use the same names for the different properties.)
- (tangent: if I remove a localized string, am I supposed to also add it to deadphrases.dat? )
- kareila
- supposed to but it's not the end of the world if you forget
- it just reduces string bloat I think
- alierak
- I don't think comments have the same sort of id obfuscation, but maybe threads do?
- Dunno, there are jtalkid and dtalkid things floating around though
- (maybe comments are using the same anum as the entry)
- nfagerlund
- well, explaining how it works with entries definitely helps me know what to watch out for! I'll see what turns up.
- alierak
- Indeed, comments are using the same anum as the entry, to get from jtalkid to dtalkid
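The arithmetic described in the discussion above can be sketched in a few lines. This is an illustration in Python, not the site's actual code; the function names make_ditemid and split_ditemid are made up for this example, and only the formula itself ("jitemid * 256 + anum") comes from the discussion.

```python
ANUM_BITS = 8                      # anum is one byte
ANUM_MASK = (1 << ANUM_BITS) - 1   # 255


def make_ditemid(jitemid: int, anum: int) -> int:
    """Combine the sequential jitemid and the random anum into the public ID."""
    if not 0 <= anum <= ANUM_MASK:
        raise ValueError("anum must fit in one byte (0-255)")
    # Shifting left by one byte is the same as multiplying by 256.
    return (jitemid << ANUM_BITS) | anum


def split_ditemid(ditemid: int) -> tuple[int, int]:
    """Recover (jitemid, anum) from a public ID."""
    return ditemid >> ANUM_BITS, ditemid & ANUM_MASK


# The example URL "1239582.html" would decode to jitemid 4842, anum 30.
print(split_ditemid(1239582))   # (4842, 30)
print(make_ditemid(4842, 30))   # 1239582
```

Because the anum is chosen once when the entry is posted and then stored, the same jitemid always maps to the same ditemid, which is why entry URLs stay stable.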
Why you can't just change text in a translation string in a patch
If you want to change some text that appears on a page, and the page has already been English-stripped, you can't just change the text in the translation file. That is, if you have code that's referencing the translation string "example.foo.string", and you want to change the text in example.foo.string, you can't just edit en.dat so the version of the string in en.dat is different. Instead, you have to change the call to the string by removing the old string and referencing a new one (in this case, convention would be to change the code so it's referencing "example.foo.string2" and to put "example.foo.string" in deadphrases.dat) and put your new text in the new string.
This is because the version of the text that's in the code is not always the version of the text that's on the live site -- site administrators can edit the text "on the fly" by using the site's translation system to change the text that's shown to users of dreamwidth.org. That doesn't change the text string in the code, though. To avoid overwriting all those changes that have been made over time, the site admins don't allow texttool.pl, the script that loads and manages translation strings, to overwrite the version of the text that's displayed on the live site: we assume that if the version in GitHub and the version on the live site are different, the version on the live site is the preferred version. So, if you only change the translation string in the associated text file, your change will never be loaded onto the live site.
Newly-created strings, however, are loaded without any problem -- so all text changes have to be done that way.
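As a rough mental model of that rule, here is a toy loader in Python. It is not how texttool.pl is actually implemented; the function load_strings and the dictionaries standing in for en.dat and the live site's string table are assumptions made purely for illustration.

```python
def load_strings(live_site: dict[str, str], en_dat: dict[str, str]) -> list[str]:
    """Toy loader: add strings the live site lacks, never overwrite existing ones.

    Returns the list of keys that were actually loaded.
    """
    loaded = []
    for key, text in en_dat.items():
        if key not in live_site:
            # New string: safe to load, nobody can have edited it on the site yet.
            live_site[key] = text
            loaded.append(key)
        # Existing string: the live version wins, even if en_dat differs.
    return loaded


# The live site already has an admin-edited version of the old string.
live = {"example.foo.string": "Text an admin edited on the live site"}
repo = {
    "example.foo.string": "Text changed only in en.dat",   # silently ignored
    "example.foo.string2": "Replacement text under a new key",
}

print(load_strings(live, repo))       # ['example.foo.string2']
print(live["example.foo.string"])     # Text an admin edited on the live site
```

In this model, editing the text under the old key has no effect, while the text under the new key is picked up, which is why the code has to be changed to reference "example.foo.string2".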
Why we rate-limit failed logins
If you input a wrong password more than 3 times, the login page applies a "backoff" -- it won't let you try again until five minutes have passed. (And on subsequent login failures, the waiting period increases.) This is to prevent automated bots from trying to break into accounts by trying every password in a password file or dictionary. (Sometimes people ask "but who would want to break into my account on a journal site" -- the answer is, spammers. Accounts that already exist, that have been frequently linked to, and that have a history of content have more "search engine juice": breaking into those and using them to post spam gives the spammers a wider reach.)
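The shape of such a backoff can be sketched as follows. The three-failure threshold and the five-minute first wait come from the description above; the doubling schedule for later failures is purely an assumption for illustration (the real growth rate isn't documented here), and lockout_seconds is a made-up name.

```python
FREE_FAILURES = 3            # failures allowed before any wait is imposed
BASE_LOCKOUT_SECONDS = 5 * 60


def lockout_seconds(failed_attempts: int) -> int:
    """Return how long the next login attempt must wait, in seconds."""
    if failed_attempts <= FREE_FAILURES:
        return 0
    # ASSUMPTION: each failure past the threshold doubles the wait.
    extra_failures = failed_attempts - FREE_FAILURES - 1
    return BASE_LOCKOUT_SECONDS * (2 ** extra_failures)


for attempts in (3, 4, 5, 6):
    print(attempts, lockout_seconds(attempts))
```

Whatever the exact schedule, the point is the same: a bot working through a password dictionary gets only a handful of guesses per account before each further guess costs minutes.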
Why the base directory is $LJHOME instead of $DWHOME
...and why there are a few more references to LJ in the code or in variable names: basically, those things were so deeply baked into the code that changing them could have introduced a ton of bugs for no good reason. Since end-users never see those things -- they're development-only -- we felt that the risk of introducing bugs (many of which had the potential for being very subtle and hard to diagnose) was too high to muck about with things that had been working for a decade.
Why you can only scroll back so far on the reading page or the Recent Entries page
Performance reasons. Older entries have a much lower chance of being cached, so the further back you scroll, the more you have to hit the database directly instead of pulling an entry out of the cache. Disabling pages that call a larger number of entries, such as the reading page or the Recent Entries page once you get past a certain skip value, reduces the server load.
To browse older entries in journals, browse by date.
To browse older entries on reading pages, you need a paid account; then browse by date.
User-level action logging: when to use infohistory vs userlog
There are two systems for logging "this user did this thing at this time" type events: statushistory (/admin/statushistory, uses prop 'historyview') and userlog (/admin/userlog, uses prop canview:userlog). There isn't really a great rubric for picking one over the other. Roughly speaking, statushistory is for system-level events that were initiated by the application itself or by a site admin (suspension/unsuspension, payments, priv addition/deletion, many console commands, etc.) and userlog is for account-level changes that were initiated by the user (entry deletion, icon deletion, community maintainer changes), but there are a few exceptions. (Renames go in statushistory, for instance.)
If you can't figure out which logging system to use, go ahead and ask somebody. Factors to keep in mind: with statushistory, you can search by account ("show me all the things that have been done to the account denise"), by event type ("show me all recent suspends"), or by admin who took the action ("show me all actions denise has taken, on any account"), while userlog can only search by target account ("show me all the things that have been done to the account denise"). Userlog shows all things that have been done to that account, back to account creation, while statushistory is limited to the last 1000 actions. Statushistory only logs the action (and any notes generated by the code/any comments included in the console command), while userlog also logs the IP address and uniq cookie that the request came from. These, and various other factors, can influence which logging option makes the most sense.
dversion: old database revisions
You may occasionally see references to 'dversion' in the code. This stands for 'data version', and was used on LJ when the site's data structure changed in a way that was incompatible with the old way of doing things and the change was rolled out slowly over a period of time to avoid slamming the DB too hard with one mass update. Incrementing the dversion means that you can check whether a particular user was converted to the new way of doing things or not, so you know what style of data you're getting back for a particular account.
This is mostly legacy: Dreamwidth has only made one dversion change, from 8 to 9. There are still some lingering remnants in the code, although we have tried to clean that up a bit. For historical purposes, the old dversion changes were:
- 0: the first version of LJ, in which there were no database clusters.
- 1: the first pass at database clustering, in which user journals were clustered but icons weren't.
- 2: the first version with all user data fully clustered.
- 3: conversion to 'weekuserusage', an old (now obsolete) way of tracking user activity.
- 4: clustering of the userproplite2 and userproplist tables, and allowing userprops to be stored on both the global cluster and the user's cluster ('multihomed' userprops)
- 5: clustering of the S1 styles and S1 style overrides
- 6: clustering of memories and memory keywords, plus friend groups
- 7: clustering of all icons with keywords and support for icon comments
- 8: clustering of polls
We then incremented to dversion 9, changing how icons were stored and accessed to allow for icon renaming.
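The kind of check that a dversion makes possible looks roughly like this. The sketch is in Python and everything in it other than the number 9 is hypothetical: the User class, the icon_storage_style function, and the "renamable"/"legacy" labels are stand-ins, not names from the codebase.

```python
from dataclasses import dataclass

# From the text above: dversion 9 changed icon storage to allow renaming.
ICON_RENAME_DVERSION = 9


@dataclass
class User:
    name: str
    dversion: int


def icon_storage_style(user: User) -> str:
    """Say which icon data layout to expect for this account."""
    if user.dversion >= ICON_RENAME_DVERSION:
        return "renamable"   # account has been converted
    return "legacy"          # account still uses the old layout


print(icon_storage_style(User("converted_user", 9)))     # renamable
print(icon_storage_style(User("unconverted_user", 8)))   # legacy
```

Since accounts are converted gradually, code that touches the changed data has to branch like this until every account has reached the new dversion.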
Gender choices
The gender choices for the site statistics are:
- Female
- Male
- Other
- Rather not say
There are enough people in the world whose gender does not fit into a binary female/male classification, and who are willing to disclose this information and be counted in the site's statistics, that it was not appropriate to combine "Other" with "Rather not say".
Outdated terminology that we can't shake
In some cases, you may see people who have worked on the code for ages use terminology that's really outdated. Some of these include:
lastn
Occasionally used instead of 'Recent' or 'Recent Entries'. Under S1, the original customization system, 'lastn' stood for "last N entries" -- ie, the page that showed the N most recent entries, based on what the journal owner had chosen for the number of entries to show.
When we broke LJ's concept of 'friend' into the component parts of "I want to read you" (subscription) vs "I want to authorize you to read my locked stuff" (access), we changed a lot of stuff. You may still sometimes hear:
- friends page: the Reading page
- friendsfriends: the Network
- friends list or flist: the Circle
- friend group: access (or, sometimes but rarely, subscription) filter
- friendslocked or flocked: protected entry which is only visible to your access list.
- checkfriends: the API function to query whether or not new entries had been posted to the friends page
userpic
The user-facing text and all the documentation (should) refer to them as 'icons', which is what we standardized on, but on LJ they were called 'userpics' and 'icons' interchangeably for so long that some of us can't shake calling them 'userpics'. You should use 'icon', though.
layout
Often used instead of 'style'; not to be confused with 'layout layer'.
site scheme
Often used instead of 'site skin'.