This is a technical article about how Gengo determines the language of a WordPress page and how that process will change in version 0.81.

There have been a number of questions recently about how Gengo uses redirects to ensure a consistent language interface. Determining the language is a complex process, made harder by two facts: people will inevitably follow bad links to your site, and Gengo now supports reading in multiple languages (as determined by the site author, on MySQL 4.1 or above).

When a person visits a Gengo-enabled site, the language is determined using the following steps. They are tried in order, and we stop as soon as one of them gives us a language:

  1. Is the language specified in the URL?
  2. Does the user have a Gengo cookie from a previous visit?
  3. Can we guess the language from the visitor’s browser (the User-Agent)?
  4. Has the site author chosen which language to display articles in for a first visit?
  5. Finally, if all else fails, use the site’s default language.
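
The order above amounts to a first-match-wins chain. Here is a minimal sketch of that idea in plain PHP. It is not Gengo’s actual code: the function names and array keys are made up for illustration, and it assumes the browser hint arrives as an Accept-Language header, parsed here in a simplified way that ignores q-values.

```php
<?php
// Hypothetical sketch of the five-step lookup; none of these names come from Gengo.

/**
 * Return the first language in an Accept-Language header that the site
 * supports, or null. Simplified: ignores q-values and trusts header order.
 */
function sketch_language_from_header(string $header, array $supported): ?string
{
    foreach (explode(',', $header) as $part) {
        // "en-GB;q=0.8" -> "en"
        $code = strtolower(substr(trim($part), 0, 2));
        if (in_array($code, $supported, true)) {
            return $code;
        }
    }
    return null;
}

/**
 * Try each source in order and stop at the first supported language.
 * The $request keys ('url', 'cookie', 'accept_language') are assumed inputs.
 */
function sketch_determine_language(array $request, array $supported, ?string $first_visit, string $default): string
{
    $candidates = [
        $request['url'] ?? null,     // 1. code in the URL
        $request['cookie'] ?? null,  // 2. cookie from a previous visit
        sketch_language_from_header($request['accept_language'] ?? '', $supported), // 3. browser
        $first_visit,                // 4. site author's first-visit choice
    ];
    foreach ($candidates as $code) {
        if ($code !== null && in_array($code, $supported, true)) {
            return $code;
        }
    }
    return $default;                 // 5. site default
}

// No URL code and no cookie, so the browser preference decides: prints "fr".
echo sketch_determine_language(
    ['accept_language' => 'fr-FR,fr;q=0.9,en;q=0.8'],
    ['en', 'fr'],
    null,
    'en'
);
```

Note that an unsupported code in the URL or cookie is skipped rather than treated as an error, so the chain simply falls through to the next source.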

Sometimes people follow links that aren’t quite right. In those cases, Gengo has to fix the language so that visitors don’t keep getting 404 errors. It tries to do this by redirecting the user to the page they meant to reach.

Some people like permalinks with /en/ or /fr/ on the end, like this site. Some people don’t. If we aren’t appending language codes, but somebody visits with a code appended, we should strip it off. If we are appending language codes, but somebody visits without one, we need to append it. If somebody types a URL with a non-existent code, we have to handle that too. In all of these cases, a redirect is currently issued. There have been some concerns that this will leak a bit of PageRank.
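
These cases reduce to computing one canonical path and redirecting only when the requested path differs from it. The following is a rough sketch in plain PHP, not Gengo’s real logic. The function name is hypothetical, and it assumes pretty permalinks with a trailing slash, that any trailing two-letter segment is meant as a language code, and that an unknown code is replaced by the language already determined for the visitor.

```php
<?php
// Hypothetical helper, not part of Gengo or WordPress.

/**
 * Return the path the visitor should be redirected to, or null when the
 * requested path is already canonical (no redirect needed).
 */
function sketch_canonical_path(string $path, bool $append_codes, array $known, string $language): ?string
{
    // Split into non-empty segments: "/my-post/fr/" -> ["my-post", "fr"]
    $segments = array_values(array_filter(explode('/', $path), 'strlen'));
    $last = end($segments);

    // Strip a trailing code; keep it as the language only if we know it.
    if ($last !== false && preg_match('/^[a-z]{2}$/', $last) === 1) {
        array_pop($segments);
        if (in_array($last, $known, true)) {
            $language = $last;
        }
    }

    // Re-append the code only when the site is configured to use one.
    if ($append_codes) {
        $segments[] = $language;
    }

    $canonical = '/' . implode('/', $segments);
    if ($segments !== []) {
        $canonical .= '/'; // assumed trailing-slash permalink style
    }
    return $canonical === $path ? null : $canonical;
}

// Unknown code on a site that appends codes: prints "/my-post/en/".
echo sketch_canonical_path('/my-post/xx/', true, ['en', 'fr'], 'en');
```

Returning null for an already-canonical path keeps the number of redirects down: a request is redirected at most once, straight to its final form.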

WordPress plugins set up their locale using:

load_plugin_textdomain('PluginName');

As soon as the first call to this function is made, the ‘get_locale’ hook executes. Gengo listens for this hook, then sets the locale depending on the page content. Unfortunately, many plugins call this function as soon as they are loaded, rather than waiting until the appropriate time, which is the ‘init’ hook. The ‘get_locale’ hook is only called once, which means that if a plugin calls load_plugin_textdomain() before Gengo is loaded, the locale will not be set correctly. To work around this, Gengo determines the page language and locale as soon as Gengo itself is loaded. This still depends on Gengo being loaded before other plugins, which can’t be guaranteed.

Determining the language this early has another side effect: some useful WordPress classes aren’t yet loaded or initialised. A very useful one for language determination is WP_Rewrite, which describes the site’s permalink information. Without it, Gengo has to do a lot of work itself to figure out what kind of permalinks a site is using.
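
For plugin authors, moving the load_plugin_textdomain() call into the ‘init’ hook is a small change. The fragment below sketches the pattern for a plugin whose text domain is ‘PluginName’. The callback name is made up for this example, and the code only runs inside WordPress, since add_action() and load_plugin_textdomain() are WordPress functions.

```php
<?php
// Wiring fragment for a plugin file; runs only inside WordPress.
// 'pluginname_load_textdomain' is a hypothetical name chosen for this example.

// Too early: this would run when the plugin file is included,
// possibly before Gengo has been loaded.
// load_plugin_textdomain('PluginName');

// Deferred: 'init' fires after all active plugins have been loaded,
// so the locale is requested only once every plugin is in place.
function pluginname_load_textdomain()
{
    load_plugin_textdomain('PluginName');
}
add_action('init', 'pluginname_load_textdomain');
```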

Originally, language determination was going to be looked at in version 0.9 of Gengo, but I think it’s an important enough problem to be at least partially addressed in 0.81. A complete solution needs to take into account the following factors:

  1. We may or may not use ‘pretty’ permalinks
  2. We may or may not have /en/ or language=en appearing in our permalinks
  3. We may or may not have ‘index.php’ in our permalinks
  4. The user may or may not have a cookie
  5. The user may or may not be reading in a combination of languages
  6. The user may be searching for something using ?s=
  7. The language codes supplied may be incorrect, or non-existent
  8. The initial URL might point to a non-existent post, in which case we don’t want to redirect
  9. The number of redirects needs to be kept to a minimum
  10. Some URLs may be specifically marked for exclusion
  11. In the future, some posts may be marked as having ‘No Language’ (scheduled for Gengo 0.9)

They weren’t kidding when they said every option you give a user doubles the amount of code! That’s a lot of conditions to handle. Gengo 0.8 handles most of them pretty well, although it falls down in one or two areas. The code that does this is also complicated and hard to debug, which isn’t good for future maintenance. So, work is currently ongoing to sort it out.

One major part of this change is that the code to do all this is going to move into the ‘get_locale’ hook we talked about before. This will greatly simplify the permalink detection, as well as making it more logical. One side effect is that localised plugins which call load_plugin_textdomain() before the ‘init’ hook will cause language determination to break. This will not be because Gengo is broken; it will be because the other plugin is broken. This isn’t just my opinion, it’s Ryan’s too. Although it would be nice if Gengo could work around this, the only solutions are inconsistent and unreliable at best. At least now there will be clarity.

I’m currently also investigating the feasibility of preventing a 302 redirect when presented with a URL that genuinely leads nowhere. As was pointed out in the Gengo forums, this leads to a situation like:

http://url.com/does-not-exist -> 302 Found ->
http://url.com/does-not-exist/$lang/ -> 404 Not Found

Even if this doesn’t leak PageRank, it’s still messy. More news on this to come when I have a few more answers…