Wed 20 Dec 2006
Determining Language
Posted by Jamie under Development News, Gengo.
This is a technical article about how Gengo determines the language of a WordPress page and how it will change in version 0.81.
There have been a number of questions recently relating to how Gengo uses redirects to ensure a consistent language interface. Determining the language is a complex process, made all the more problematic by the fact that inevitably people will follow bad links to your site and that Gengo now supports reading in multiple languages (site author determined, for MySQL 4.1 or above).
When a person visits a Gengo-enabled site, the language is determined using the following steps. As soon as we find the language, we stop:
- Is the language specified in the URL?
- Does the user have a Gengo cookie from a previous visit?
- Can we guess the language from the visitor’s browser (the Accept-Language header sent by the user agent)?
- Has the site author chosen which language to display articles in for a first visit?
- Finally, if all else fails, use the site’s default language.
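The steps above amount to a "first match wins" cascade. Here is a minimal sketch of that idea, not Gengo's actual code: the function name, the array keys and the sample language codes are all made up for illustration, and it assumes each step has already produced either a language code or null.

```php
<?php
// Hypothetical helper, not part of Gengo: return the first candidate that is
// one of the site's configured language codes, together with where it came from.
function sketch_determine_language(array $candidates, array $site_languages, $default)
{
    foreach ($candidates as $source => $code) {
        if ($code !== null && in_array($code, $site_languages, true)) {
            // Stop as soon as a usable language is found.
            return array('language' => $code, 'source' => $source);
        }
    }
    // Nothing matched: fall back to the site's default language.
    return array('language' => $default, 'source' => 'default');
}

// Candidates are listed in the same order as the steps above.
$result = sketch_determine_language(
    array(
        'url'         => null, // no code found in the URL
        'cookie'      => null, // no cookie from a previous visit
        'browser'     => 'fr', // best guess from the browser
        'first_visit' => 'en', // site author's first-visit choice
    ),
    array('en', 'fr', 'ja'),
    'en'
);
echo $result['language'] . ' via ' . $result['source'] . "\n";
```

In this example the URL and cookie steps find nothing, so the browser guess wins and the later steps are never consulted.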
Sometimes people follow links to a site that aren’t quite right. In those cases, Gengo has to fix the language so that we don’t return 404 errors all the time. It tries to do this by redirecting the visitor to the place they meant to go.
Some people like permalinks with /en/ or /fr/ on the end, like this site. Some people don’t. If we aren’t appending language codes, but somebody visits with a code appended, we should strip it off. If we are appending codes, but somebody visits without one, we need to append it. If somebody types a URL with a non-existent code, we have to handle that too. In all of these cases, a redirect is currently issued. There have been some concerns that this will leak a bit of PageRank.
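To make the three cases concrete, here is a rough sketch of the decision, under some loud assumptions: the function and its parameters are hypothetical, any trailing two-letter segment is treated as a language code (which a real site could not safely assume), and combined codes such as en+de+fr are ignored. It returns the path to redirect to, or null when no redirect is needed.

```php
<?php
// Hypothetical sketch, not Gengo's code: work out whether a request path needs
// a redirect so that its trailing language code matches the site's settings.
function sketch_normalise_path($path, $append_codes, array $site_languages, $fallback)
{
    // Split the path into its non-empty segments.
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));

    // Assumption: a trailing two-letter segment is meant as a language code.
    $code = null;
    $count = count($segments);
    if ($count > 0 && preg_match('/^[a-z]{2}$/', $segments[$count - 1]) === 1) {
        $code = array_pop($segments);
    }

    // An unknown or missing code falls back to the language chosen elsewhere.
    $language = in_array($code, $site_languages, true) ? $code : $fallback;

    if ($append_codes) {
        $segments[] = $language;
    }

    $target = '/' . ($segments ? implode('/', $segments) . '/' : '');

    // Null means the visitor is already at the right address.
    return $target === $path ? null : $target;
}

var_dump(sketch_normalise_path('/about/en/', false, array('en', 'fr'), 'en'));
var_dump(sketch_normalise_path('/about/', true, array('en', 'fr'), 'fr'));
var_dump(sketch_normalise_path('/about/xx/', true, array('en', 'fr'), 'en'));
```

The three calls cover stripping an unwanted code, appending a missing one, and replacing a code that doesn't exist.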
WordPress plugins set up their locale using:
load_plugin_textdomain('PluginName');
As soon as the first call to this function is made, the ‘get_locale’ hook executes. Gengo listens for this hook, then sets the locale depending on the page content. Unfortunately, many plugins call this function as soon as they are loaded, rather than waiting until the appropriate time, which is the ‘init’ hook. The ‘get_locale’ hook is only called once, which means that if a plugin calls load_plugin_textdomain() before Gengo is loaded, the locale will not be set correctly. In an effort to work around this, Gengo determines the page language and locale as soon as it is loaded. This still depends on Gengo being loaded before other plugins, which can’t be guaranteed.
Another side effect is that, because language determination is done so early, some useful WordPress classes aren’t yet loaded or initialised. A very useful one for language determination is WP_Rewrite, which describes the site’s permalink information. Without it, Gengo has to do a lot of work itself to figure out what kind of permalinks a site is using.
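For plugin authors wondering what "waiting for the ‘init’ hook" looks like in practice, here is a minimal sketch. The function name pluginname_load_textdomain is a placeholder of mine, and the function_exists() guard is only there so the snippet can be read and parsed outside WordPress; a real plugin would normally call add_action() directly.

```php
<?php
// Hypothetical plugin code: 'PluginName' and the function name are placeholders.
function pluginname_load_textdomain()
{
    // Runs when WordPress fires 'init', not when the plugin file is loaded.
    load_plugin_textdomain('PluginName');
}

// Register the callback instead of calling load_plugin_textdomain() at load time.
if (function_exists('add_action')) {
    add_action('init', 'pluginname_load_textdomain');
}
```

The point is simply that nothing locale-related happens at the moment the plugin file is included.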
Originally, language determination was going to be looked at in version 0.9 of Gengo, but I think it’s an important enough problem to be at least partially addressed in 0.81. A complete solution needs to take into account the following factors:
- We may or may not use ‘pretty’ permalinks
- We may or may not have /en/ or language=en appearing in our permalinks
- We may or may not have ‘index.php’ in our permalinks
- The user may or may not have a cookie
- The user may or may not be reading in a combination of languages
- The user may be searching for something using ?s=
- The language codes supplied may be incorrect, or non-existent
- The initial URL might point to a non-existent post, in which case we don’t want to redirect
- The number of redirects needs to be kept to a minimum
- Some URLs may be specifically marked for exclusion
- In the future, some posts may be marked as having ‘No Language’ (scheduled for Gengo 0.9)
They weren’t kidding when they said every option you give a user doubles the amount of code! That’s a lot of conditions to handle. Gengo 0.8 does this mostly pretty well, although it falls down in one or two areas. The code that does this is also pretty complicated and hard to debug, which isn’t good for future maintenance. So, work is currently ongoing to sort it out.
One major part of this change is that the code to do all this is going to move into the ‘get_locale’ hook we talked about before. This will allow the permalink detection to be greatly simplified, as well as being more logical. One side effect is that localised plugins which call load_plugin_textdomain() before the ‘init’ hook will cause language determination to break. This will not be because Gengo is broken; it will be because the other plugin is broken. This isn’t just my opinion, it’s Ryan’s too. Although it would be nice if Gengo could work around this, the only solutions are inconsistent and unreliable at best. At least now there will be clarity.
I’m currently also investigating the feasibility of preventing a 302 redirect when presented with a URL that genuinely leads nowhere. As was pointed out in the Gengo forums, this leads to a situation like:
http://url.com/does-not-exist -> 302 Found ->
http://url.com/does-not-exist/$lang/ -> 404 Not Found
Even if this doesn’t leak PageRank, it’s still messy. More news on this to come when I have a few more answers…
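One way the ordering could work is to check whether the requested post exists before deciding on any redirect. This is only a sketch of the idea under investigation, not something Gengo does today: the function is hypothetical, and $post_exists and $redirect_target stand in for lookups that are not shown here.

```php
<?php
// Hypothetical sketch: decide on a response without redirecting to a dead end.
function sketch_plan_response($post_exists, $redirect_target)
{
    if (!$post_exists) {
        // Nothing lives at this address: answer 404 directly, no redirect first.
        return array('status' => 404, 'location' => null);
    }
    if ($redirect_target !== null) {
        // The post exists but the URL needs fixing, so redirect as before.
        return array('status' => 302, 'location' => $redirect_target);
    }
    // The URL is already in the right form.
    return array('status' => 200, 'location' => null);
}

$plan = sketch_plan_response(false, '/does-not-exist/en/');
echo $plan['status'] . "\n";
```

With this ordering, the does-not-exist example above would produce a single 404 rather than a 302 followed by a 404.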
4 Responses
Comments:

December 20th, 2006 at 10:20 pm
Are you sure that we should have language codes appended?
Maybe it’s OK for blogs with only two languages, but http://wp-multilingual.net/fr+yi+de+es+it+bg+hu+en/ is likely to cause confusion.
And as I mentioned elsewhere, it’s not a “pretty” permalink, if you link from (for example) wp-multilingual.net/gengo/en/ to wp-multilingual.net/gengo/faq/en/. It doesn’t reflect any directory structure at all.
December 21st, 2006 at 12:12 am
I take your point on that, though it’s worth remembering the following:
Firstly, your example permalink would only be used for people who could read all 8 languages on this site, and that’s pretty unlikely.
Secondly, the multiple prefixes occur only for pages where posts are grouped, such as the home page and categories. For single posts and pages, it will always be a single code, the code of the article you are reading.
And yes, it doesn’t reflect directory structure, but it does reflect the content of the page and helps to show at a glance what language a page will be in. Not all URLs reflect directory structure - take Flickr, for example.
Jamie.
December 22nd, 2006 at 11:12 am
What do you think about a URL like:
site.com/2006/12/22/article/?l=en+de+fr
The query string would prevent the directory structure from being broken, and it looks like it could be implemented without doubling the amount of code.
And I do hate you for that picture. :)
Georg
December 22nd, 2006 at 10:46 pm
That actually already works, would you believe :-)
Try site.com/2006/12/22/article/?language=en+de+fr
The site.com/2006/12/22/article/en+de+fr/ is actually just rewritten to the above permalink. Honestly though, I think the second one is cleaner. Just my opinion :-)
Jamie.