Back to the Nordic roots

Beautiful Atlanterhavsveien, Norway

We at Twingly have been delivering Nordic blog data since 2006 and have long considered ourselves the market leader in Scandinavia.

However, in our quest to find blogs in every corner of the world, we started to sense that our Nordic blog coverage was slipping somewhat. We have plenty of automatic methods for finding new blogs, of course, but each market needs attention and care to flourish.

To increase our Nordic coverage, we selected the Nordic languages that we support: Swedish, Norwegian, Danish, Finnish and Icelandic. There are other languages in this region too, such as Sami and Kven, but we rarely get requests for these 🙂

We gave the Nordic languages our best shot: we scanned our current blogs for links to other blogs and made sure that all of the blogs we found were added to our automatic system for picking up newly published blog posts. Since we have pretty strict rules about what kind of blogs we welcome into our universe, in order to prevent spam, we also made sure that every new blog was approved correctly.
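To give a rough idea of what scanning blogs for links to other blogs can look like, here is a minimal sketch using only the Python standard library. It is not our production crawler; the function name `candidate_blog_hosts` and the `known_hosts` parameter are made up for this illustration, and a real system would still need to check that a found host actually is a blog and passes the spam rules.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class _LinkCollector(HTMLParser):
    """Collects the href value of every <a> tag on a page."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def candidate_blog_hosts(page_url, html, known_hosts):
    """Return external hosts linked from the page that are not already known.

    Hypothetical helper: known_hosts stands in for whatever index of
    already-tracked blogs a real system would keep.
    """
    collector = _LinkCollector()
    collector.feed(html)
    collector.close()

    own_host = urlparse(page_url).netloc.lower()
    found = set()
    for href in collector.hrefs:
        # Resolve relative links against the page they were found on.
        parsed = urlparse(urljoin(page_url, href))
        host = parsed.netloc.lower()
        if parsed.scheme not in ("http", "https") or not host:
            continue  # skip mailto:, javascript: and similar links
        if host == own_host or host in known_hosts:
            continue  # skip internal links and blogs we already track
        found.add(host)
    return sorted(found)
```

Each returned host is only a candidate; it would go through the same approval step as any other new blog.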

Whenever we make specific efforts like this, we always try to automate the processes that we see give good results. In this case it has led to enhanced crawling services, and once they are in production we list them in Source discovery and ingestion.

Icelandic blog posts per week

As always when it comes to coverage, you have to wait some time to see the effect. Now, one month later, we can see that the average number of daily blog posts has increased considerably in Finnish (30-40%) as well as in Icelandic (about 75%).

However, we didn’t see the same increase in the other Nordic languages. In one way that is good, because it means that we already have good coverage in Swedish, Norwegian and Danish, but it is always disappointing when you make an effort and the result just confirms that you should have focused elsewhere.

Once we have developed even more blog discovery techniques we will apply them to these languages too, to see if we can find those small sparkling blogs hiding somewhere behind dwarf birches, next to fjord beds or daydreaming under a windmill.

By Pontus Edenberg

Language detection changes

Historically, when trying to identify the language of a given blog post, we have only looked at the post’s raw text, i.e. the post’s body text, stripped of HTML. In general this works very well, provided that the body text is “long enough”.

However, we have noticed that some bloggers tend to write very little in the actual blog post. For example, the post’s body text may consist of only a single word and a bunch of images. Against all odds, we still attempted to identify the language of such posts, with varying outcomes.

In order to increase the accuracy of our language identification somewhat, we have decided, effective yesterday, to include the post’s title when identifying the language, provided that the title is not the same as the body text. Naturally, if the title is short or non-existent, this is unlikely to improve the situation at all, but in those cases where the title is at least a few words long we should expect to see more reliable results.
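As an illustration of the rule described above, here is a minimal sketch in Python of how the text handed to a language identifier could be assembled. It is not our actual implementation: the function names `strip_html` and `text_for_language_detection` are invented for this example, and the language identifier itself is left out.

```python
from html.parser import HTMLParser


class _TextExtractor(HTMLParser):
    """Collects text nodes and drops all tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)


def strip_html(html):
    """Return the text of an HTML fragment with whitespace collapsed."""
    extractor = _TextExtractor()
    extractor.feed(html)
    extractor.close()
    return " ".join(" ".join(extractor.parts).split())


def text_for_language_detection(title, body_html):
    """Build the input for language identification (hypothetical helper).

    The title is included only when it exists and differs from the body
    text, so that a post whose title merely repeats its body is not
    counted twice.
    """
    body = strip_html(body_html or "")
    title = " ".join((title or "").split())
    if title and title != body:
        return f"{title}\n{body}".strip()
    return body
```

With a post whose body is a single word and an image, a title of a few words gives the identifier noticeably more text to work with, while a post with an empty body falls back on the title alone.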

The language improvements apply to the Twingly Search API, the Twingly LiveFeed API and our public search.

Market on the Mekong river

Nothing comes for free, though. Since the change, we have noticed that some Tumblr blogs use a certain title [1] and omit the body text, leaving us with only the title for identification (with our old algorithm these posts would not have been identified at all! [2]). This has caused an interesting eight-fold increase in the number of posts identified as Vietnamese. We hope to address this peculiarity promptly.

Oh, and as a bonus, as of today we have started to identify Chinese (zh) posts. We expect this to improve the quality across all languages, as Chinese posts may previously have been identified as non-Chinese.

tl;dr

  • We now include the blog post’s title, rather than just the body text, when identifying its language. This should improve the quality of the language field
  • We can now detect Chinese posts, which should also improve the quality for other languages

As always, please contact us if you have any questions or concerns.

By Robin Wallin


  1. Hint 
  2. In some cases we would fall back on our best guess for the blog in general