8 million reasons for transparency in media coverage

Show’em what you got…

Coverage is essential to every kind of media monitoring. It is difficult to track what is said about a brand, or to measure the effect of a campaign, if you are not supplied with the right data.

However, as a media monitoring company it is hard to know whether the sources you bring to your clients are sufficient, or sometimes even whether they are still active. A standard for how to measure coverage would make it much easier to compare data suppliers and their advantages, and it would also help the media monitoring companies that supply their own data.

A first step towards a standard could be for everyone to be transparent and publish their numbers, whether for blogs, news, message boards, podcasts, etc. Then we can start to adjust, and eventually there might be an accepted way to measure coverage among data suppliers and media monitoring companies.

For us, dealing with blog data, there will always be more blogs out there to monitor, and finding them is a constant struggle. We are continuously adding new methods to increase our coverage, but seeing others’ numbers would definitely spur us, and most likely other data suppliers in this industry, to do even better.

We have made the numbers for our data public, regardless of how they measure up against others’, and it would be great to see others do the same. When it comes to blogs, we have chosen the term “active blogs” to separate the data that matters from giant empty numbers. An “active blog” for us is a blog with at least one post during the last 6 months.
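
The “active blog” definition can be illustrated with a minimal sketch in Python. It is an illustration only, not Twingly’s production code: approximating 6 months as 183 days, and the names `is_active` and `count_active`, are assumptions made for the example.

```python
from datetime import datetime, timedelta

# Assumption: "6 months" is approximated as 183 days. The post does not
# say exactly how the window is computed.
ACTIVE_WINDOW = timedelta(days=183)


def is_active(last_post_at: datetime, now: datetime) -> bool:
    """Return True if the blog's latest post falls inside the window."""
    return now - last_post_at <= ACTIVE_WINDOW


def count_active(last_posts: dict[str, datetime], now: datetime) -> int:
    """Count the blogs (keyed by URL) whose latest post is recent enough."""
    return sum(is_active(ts, now) for ts in last_posts.values())
```

Counting only blogs that pass such a check is what separates a figure like “active blogs” from the total number of blogs ever seen.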

Veerabhadra Temple in Andhra Pradesh, home of Telugu

Every new active blog that we add to our monitoring is important. Even though we talk about them in bulk, every single source is a reason for transparency, whether it is the 3 new blogs that we add every day on average in Telugu (an Indian language spoken mainly in Andhra Pradesh) or the entire volume of 8 million active blogs.

Naturally, the quality of the data you supply is also important. However, quality is more difficult to measure at these volumes of data, divided over different markets and so on. Please share any ideas you have here; otherwise we can start by agreeing on how to show the numbers and then get down to the tough business of quality 🙂

Please let us know what you think, or if you would prefer to see blog coverage presented in another way. We can of course also supply you with other specific numbers from our blog data if you like.

By Pontus Edenberg

Finding the needles: blog discovery

First blogs indexed by Twingly

Twingly has, as mentioned in our last blog post, been indexing blogs since 2006 [1]. A few of you out there might wonder: the Internet is a big place, so how do you find blogs in the enormous haystack? To answer that question, we need to journey back to 2006 and travel forward to the present day. We will also take a glimpse at the future of blog indexing here at Twingly.

Before we start, we ought to mention one very important requirement that a blog must meet for us to be able to index it: the blog must have a discoverable feed in either the Atom or the RSS format. This requirement, or limitation if you will, alone makes life much easier, as we can quickly discard most of the pages found on the Internet.
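
As a rough illustration of what “discoverable” can mean, the sketch below looks for the common feed autodiscovery convention: a `<link rel="alternate">` element in the page whose `type` is RSS or Atom. This is an assumption about one widespread convention, not a description of Twingly’s actual checks, and the names `FeedLinkParser` and `discover_feeds` are made up for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# MIME types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collect feed URLs advertised via <link rel="alternate" ...>."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        is_feed = attr.get("type", "").lower() in FEED_TYPES
        if "alternate" in rels and is_feed and attr.get("href"):
            # Resolve relative hrefs against the page URL.
            self.feeds.append(urljoin(self.base_url, attr["href"]))


def discover_feeds(html, base_url):
    """Return the feed URLs a page advertises (empty list if none)."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

A page for which `discover_feeds` returns an empty list could be discarded early, which is the kind of shortcut the feed requirement makes possible.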

In the beginning, there was our “Blog Provider Monitor system”, or “Provider system” for short. In short, the system consists of a set of specialized automatic indexers, also known as crawlers or spiders, each with defined rules. For example, one provider might keep watch on a certain blog hotel, while another could watch an aggregated blog top list.

In January 2007 we introduced an interface for automatic pings, XML-RPC ping, which enables blogs to notify us automatically when there is new blog content to be found. This allowed self-hosted blogs, which cannot be found by our Provider system, to find their way into our index.
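
For readers unfamiliar with XML-RPC pings, here is a minimal sketch of what a blog platform might send. It assumes the conventional `weblogUpdates.ping` method, which takes the blog name and blog URL; the post does not say which method Twingly’s interface expects, and the endpoint passed to `send_ping` is a placeholder that the caller must supply.

```python
import xmlrpc.client


def build_ping_request(blog_name: str, blog_url: str) -> str:
    """Serialize a weblogUpdates.ping call to its XML-RPC request body.

    Assumption: the receiver follows the conventional weblogUpdates.ping
    signature (blog name, blog URL).
    """
    return xmlrpc.client.dumps(
        (blog_name, blog_url), methodname="weblogUpdates.ping"
    )


def send_ping(endpoint: str, blog_name: str, blog_url: str):
    """Send the ping to an XML-RPC endpoint (performs a network call).

    `endpoint` is a placeholder: use the ping URL documented by the
    service you want to notify.
    """
    proxy = xmlrpc.client.ServerProxy(endpoint)
    return proxy.weblogUpdates.ping(blog_name, blog_url)
```

Blog software typically fires such a call right after a post is published, which is why pings reach blogs that no crawler would otherwise stumble upon.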

In February 2008 manual ping saw the light of day on http://www.twingly.se. The ping page allows bloggers to notify us manually when they want to have their blog indexed by us, further increasing our coverage.

While the three systems mentioned above are great at finding new blogs, they all share a flaw: they have no memory. In theory they could find a blog once, never to find it again. To solve that problem our Automatic Ping system, or Autoping, was born. It was fed with blogs that were deemed worthy of continuous monitoring. Blogs put into the automatic system were typically customer requests, blogs found by certain providers and blogs manually added by Twingly.

The first iterations of Autoping were quite rudimentary; our current generation of Autoping came alive in September 2012. It supports balancing (i.e. how often to ping a given blog), duplicate detection and other techniques that ensure that we do not consume more resources than necessary and that we retrieve blogs in a timely fashion.
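
To give a feel for what balancing could look like, the sketch below derives a polling interval from a blog’s past posting times. It is a hypothetical heuristic (half the median gap between posts, clamped to fixed bounds), not the algorithm Autoping actually uses; the bounds and the name `next_poll_interval` are assumptions.

```python
from datetime import datetime, timedelta
from statistics import median

# Assumed bounds, chosen only for illustration.
MIN_INTERVAL = timedelta(minutes=15)
MAX_INTERVAL = timedelta(days=7)


def next_poll_interval(post_times: list[datetime]) -> timedelta:
    """Guess how long to wait before checking a blog again.

    Blogs that post often get checked often; quiet blogs get checked
    rarely, so resources are not spent where nothing changes.
    """
    if len(post_times) < 2:
        # Not enough history to estimate a posting pattern.
        return MAX_INTERVAL
    ordered = sorted(post_times)
    gaps = [(b - a).total_seconds() for a, b in zip(ordered, ordered[1:])]
    # Poll at half the typical gap between posts, within the bounds.
    guess = timedelta(seconds=median(gaps) / 2)
    return min(max(guess, MIN_INTERVAL), MAX_INTERVAL)
```

With this heuristic a blog that posts once a day would be checked every 12 hours, while a blog with a single known post falls back to the longest interval.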

For the past year we have been working very hard to further increase our blog coverage. This includes projects such as:

  • Fine-tuning and creating new Providers.
  • Finding blogs in outgoing links in newly indexed blog posts (since May 2014). The system was extended to check for outgoing links on the blog’s front page in October 2015.
  • Finding blogs mentioned in social media (several projects during 2015).
  • Re-visiting all of the blogs (over 80 million(!)) in our index, ensuring that we keep the ones that are still alive and active under automatic monitoring (ongoing since September 2015).
  • Ensuring all of our newly discovered blogs are automatically monitored (October 2015).
  • Providers capable of handling web pages that create their content dynamically via JavaScript (November 2015).
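
The outgoing-links idea from the list above can be sketched as follows. This is a simplified illustration under our own assumptions, not the production system: it only collects the hosts that a post links to, and the names `LinkParser` and `outgoing_hosts` are invented for the example. Deciding whether a collected host really is a blog is a separate step.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class LinkParser(HTMLParser):
    """Collect the href of every <a> element in a document."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def outgoing_hosts(post_html: str, post_url: str) -> set[str]:
    """Return the hosts a post links to, excluding the post's own host."""
    own_host = urlparse(post_url).netloc.lower()
    parser = LinkParser()
    parser.feed(post_html)
    hosts = set()
    for href in parser.hrefs:
        # Resolve relative links, then keep only external http(s) targets.
        target = urlparse(urljoin(post_url, href))
        host = target.netloc.lower()
        if target.scheme in ("http", "https") and host and host != own_host:
            hosts.add(host)
    return hosts
```

Each host returned this way is only a candidate; it would still have to pass the feed requirement and the other rules before being monitored.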

We are still not satisfied and the future holds many interesting projects.

Twingly’s Blog AI?

A huge challenge when indexing blogs is preventing the accidental ingestion of undesirable sites such as news sites and forums. We therefore have quite strict rules for content reaching our systems through, for example, social media, outgoing links and XML-RPC. Naturally, such strict rules probably make us reject some actual blogs. To remedy this we have started a “Blog AI” project. As the name implies, we want an automated system that can decide whether a given site on the Internet is a blog or not. The project is split into several parts, and the first part concerns detecting custom domains [2]. We expect to see parts of that system in production soon.

Another challenge is to find and index newly published posts in a timely fashion. As mentioned, we do have our Automatic Ping system, but it makes assumptions about the user’s blogging pattern based on past behavior. To overcome this problem we have started work on the next generation of Autoping, which will use the PubSubHubbub protocol for blogs that support it. This means that we will be able to index posts instantaneously after they have been published! We hope to have this ready for evaluation soon.
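
For the curious, here is a minimal sketch of the two basic steps a PubSubHubbub subscriber performs: finding the hub a feed advertises through a `rel="hub"` link, and building the form-encoded subscription request with the protocol’s `hub.mode`, `hub.topic` and `hub.callback` parameters. It is not the next-generation Autoping itself; the function names are assumptions, and a real hub may require more parameters than shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def find_hub(feed_xml: str):
    """Return the hub URL a feed advertises, or None if there is none.

    Looks for an Atom <link rel="hub"> anywhere in the feed, which also
    covers RSS feeds that embed an atom:link element.
    """
    root = ET.fromstring(feed_xml)
    for link in root.iter(ATOM_NS + "link"):
        if link.get("rel") == "hub":
            return link.get("href")
    return None


def subscription_body(topic_url: str, callback_url: str) -> str:
    """Build the form-encoded body of a subscription request.

    Minimal parameter set only; the callback URL is where the hub will
    push notifications about new content.
    """
    return urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,
        "hub.callback": callback_url,
    })
```

The body would be sent as an HTTP POST to the hub URL; after that, the hub notifies the callback when the feed changes, so no guessing about blogging patterns is needed.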

Keep an eye out for more in-depth posts, and in the meantime check out our source discovery and ingestion documentation.

By Robin Wallin


  1. 4th of October, 2006 to be precise.
  2. A blog that is hosted on a blog platform but uses its own domain name, e.g. blog.twingly.com, which is in fact hosted on wordpress.com.

Back to the Nordic roots

Beautiful Atlanterhavsveien, Norway

We at Twingly have been delivering Nordic blog data since 2006 and have for a long time considered ourselves the market leader in Scandinavia.

However, in our quest to find blogs in the different corners of the world, we began to sense that we were slipping somewhat when it came to our Nordic blog coverage. We have a lot of automatic methods for finding new blogs, of course, but each market needs attention and care to flourish.

To increase our Nordic coverage, we selected the languages that we support: Swedish, Norwegian, Danish, Finnish and Icelandic. There are also other languages in this region, such as Sami and Kven, but we rarely get requests for these 🙂

We gave the Nordic languages our best shot: we scanned our current blogs for links to other blogs, made sure that all of them are in our automatic system so that we pick up every newly published blog post, and so on. Since we have pretty strict rules about what kinds of blogs we welcome into our universe in order to prevent spam, we also made sure that all new blogs were approved correctly.

Whenever we make specific efforts like this, we always try to automate the processes that we see give good results. In this case it has led to enhanced services for crawling, and once they are in production we will list them in Source discovery and ingestion.

Icelandic blog posts per week

As always when it comes to coverage, you have to wait some time to see the effect. Now, one month later, we can see a large increase in average daily blog posts in Finnish (30-40%) as well as in Icelandic (about 75%).

However, we didn’t see the same increase in the other Nordic languages. In one way that is good, because it means that we already have good coverage in Swedish, Norwegian and Danish, but it is always disappointing when you make an effort and the result just confirms that you should have focused elsewhere.

Once we have developed even more blog discovery techniques we will apply them to these languages too, to see if we can find those small sparkling blogs hiding somewhere behind dwarf birches, next to fjord beds or daydreaming under a windmill.

By Pontus Edenberg