Historically, when identifying the language of a blog post, we have looked only at the post’s raw text, i.e. the post’s body text stripped of HTML. In general this works very well, provided that the body text is “long enough”.
However, we have noticed that some bloggers write very little in the actual blog post. For example, the post’s body text may consist of only a single word and a bunch of images. Against all odds, we still attempted to identify the language of such posts, with varying results.
In order to somewhat increase the accuracy of our language identification, we decided, effective as of yesterday, to include the post’s title when identifying the language, provided that the title is not the same as the body text. Naturally, if the title is short or non-existent, this will likely not improve the situation at all, but in cases where the title is at least a few words long we should expect to see more reliable results.
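As a rough illustration of the rule described above, the following sketch builds the text that would be handed to a language identifier. It is not our production code: the function names `strip_html` and `text_for_identification` are hypothetical, the HTML stripping uses Python's standard `html.parser`, and comparing title and body after whitespace normalisation is an assumption about what "the same" means.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects only the text nodes of an HTML fragment."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def strip_html(html):
    """Return the plain text of an HTML fragment (hypothetical helper)."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse the whitespace left behind where tags were removed.
    return " ".join(" ".join(parser.chunks).split())


def text_for_identification(title, body_html):
    """Combine title and body text, skipping the title if it duplicates the body.

    Assumption: "the same" means equal after whitespace normalisation.
    """
    body = strip_html(body_html)
    title = " ".join((title or "").split())
    if title and title != body:
        # The title adds information, so it is included alongside the body.
        return f"{title}\n{body}".strip()
    # Missing title, or title identical to the body: fall back to the body only.
    return body
```

Under these assumptions, a post with a title but an image-only body yields just the title, which is the situation discussed in the next paragraph.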
Nothing is free, though. Since the change, we have noticed that some Tumblr blogs use a certain title and omit the body text, leaving us with only the title for identification (in our old algorithm these posts would not have been identified at all!). This has caused an interesting eight-fold increase in the number of posts identified as Vietnamese. We hope to address this peculiarity promptly.
Oh, and as a bonus, as of today we have started to identify Chinese (zh) posts. We expect this to improve the quality across all languages, as Chinese posts may previously have been identified as non-Chinese.
- We include the blog post’s title when identifying its language, rather than just the body text; this should improve the quality of the language field
- We can detect Chinese posts now; this will also improve the quality of other languages
As always, please contact us if you have any questions or concerns.
By Robin Wallin