Twingly has, as mentioned in our last blog post, been indexing blogs since 2006 [1]. A few of you out there might wonder: the Internet is a big place, so how do you find blogs in the enormous haystack? To answer that question, we need to journey back to 2006 and travel forward to the present day. We will also take a glimpse at the future of blog indexing here at Twingly.
Before we start, we ought to mention that we have a very important requirement on the blog in order for us to be able to index it: the blog must have a discoverable feed in either the Atom or the RSS format. This requirement, or limitation if you will, alone makes life much easier as we can quickly discard most of the pages found on the Internet.
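To make the feed requirement concrete, here is a minimal sketch of feed autodiscovery using only the Python standard library. It assumes the common convention of advertising feeds through `<link rel="alternate">` elements with an RSS or Atom media type. The names `FeedLinkParser` and `discover_feeds` are hypothetical, and the sketch is an illustration rather than a description of the production system.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Media types conventionally used to advertise RSS and Atom feeds.
FEED_TYPES = {"application/rss+xml", "application/atom+xml"}


class FeedLinkParser(HTMLParser):
    """Collects feed URLs advertised through <link rel="alternate"> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.feeds = []

    def handle_starttag(self, tag, attrs):
        if tag != "link":
            return
        attr = {name: (value or "") for name, value in attrs}
        rels = attr.get("rel", "").lower().split()
        media_type = attr.get("type", "").strip().lower()
        href = attr.get("href", "")
        if "alternate" in rels and media_type in FEED_TYPES and href:
            # Resolve relative feed URLs against the page URL.
            self.feeds.append(urljoin(self.base_url, href))


def discover_feeds(html, base_url):
    """Return the feed URLs a page advertises, or an empty list."""
    parser = FeedLinkParser(base_url)
    parser.feed(html)
    return parser.feeds
```

A page for which `discover_feeds` returns an empty list would be discarded under the requirement above, which is what lets most of the Internet be filtered out cheaply.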
In the beginning, there was our “Blog Provider Monitor system”, or “Provider system” for short. In brief, the system consists of a set of specialized automatic indexers, also known as crawlers or spiders, each with defined rules. For example, one provider might keep watch on a certain blog hotel, while another could watch an aggregated blog top list.
In January 2007 we introduced an interface for automatic pings, XML-RPC ping, which enables blogs to automatically notify us when there is new blog content to be found. This enabled self-hosted blogs, which cannot be found by our Provider system, to find their way into our index.
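For readers unfamiliar with XML-RPC pings, the sketch below shows what such a notification can look like from the blog's side. It assumes the conventional `weblogUpdates.ping` method, which takes the blog's name and URL. The post does not state the method name or the endpoint, so both are assumptions here, and `build_ping_request` and `send_ping` are hypothetical helper names.

```python
import xmlrpc.client

# Assumption: the ping service accepts the conventional
# weblogUpdates.ping(name, url) call.
PING_METHOD = "weblogUpdates.ping"


def build_ping_request(blog_name, blog_url):
    """Return the XML-RPC request body for a ping, without sending it."""
    return xmlrpc.client.dumps((blog_name, blog_url), methodname=PING_METHOD)


def send_ping(endpoint, blog_name, blog_url):
    """Send the ping to `endpoint` (a placeholder for the service URL)."""
    proxy = xmlrpc.client.ServerProxy(endpoint)
    return proxy.weblogUpdates.ping(blog_name, blog_url)
```

Most blog platforms can send a ping like this automatically on publish, which is why a self-hosted blog can announce itself without any crawler having to find it first.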
In February 2008 manual ping saw the light of day on http://www.twingly.se. The ping page allows bloggers to manually notify us when they want to have their blog indexed by us, further increasing our coverage.
While the three systems mentioned above are great at finding new blogs, they all share a flaw: they have no memory. In theory they could find a blog once, never to find it again. To solve that problem our Automatic Ping system (Autoping) was born. It was fed with blogs that were deemed worthy of continuous monitoring. Blogs put into the automatic system were typically customer requests, blogs found by certain providers and blogs manually added by Twingly.
The first iterations of Autoping were quite rudimentary; our current generation of Autoping came alive in September 2012. It supports balancing (i.e. how often to ping a given blog), duplicate detection and other techniques that ensure that we do not consume more resources than necessary and that we retrieve blogs in a timely fashion.
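The idea of balancing can be sketched as follows. This is not the Autoping algorithm, which the post does not describe. It is one simple heuristic, assumed for illustration: poll a blog at half its median gap between recent posts, clamped to a minimum and maximum interval. The constants and the name `next_poll_interval` are hypothetical.

```python
from datetime import datetime, timedelta
from statistics import median

# Hypothetical bounds; real values would depend on available resources.
MIN_INTERVAL = timedelta(minutes=15)
MAX_INTERVAL = timedelta(days=7)


def next_poll_interval(post_times):
    """Guess how long to wait before checking a blog again.

    `post_times` is a list of datetimes for the blog's recent posts.
    """
    if len(post_times) < 2:
        # Too little history to estimate a pattern: poll rarely.
        return MAX_INTERVAL
    ordered = sorted(post_times)
    gaps = [(later - earlier).total_seconds()
            for earlier, later in zip(ordered, ordered[1:])]
    # Poll at half the typical gap so a new post is not left waiting long.
    guess = timedelta(seconds=median(gaps) / 2)
    return max(MIN_INTERVAL, min(MAX_INTERVAL, guess))
```

With this heuristic a blog that posts daily is checked every twelve hours, while a dormant blog costs at most one request a week. Any scheme of this kind depends on past behavior predicting future behavior.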
For the past year we have been working very hard to further increase our blog coverage. This includes projects such as:
- Fine-tuning and creating new Providers.
- Finding blogs in outgoing links in newly indexed blog posts (since May 2014). The system was extended to check for outgoing links on the blog’s front page in October 2015.
- Finding blogs mentioned in social media (several projects during 2015).
- Re-visiting all of the blogs (over 80 million(!)) in our index, ensuring that we keep the ones that are still alive and active under automatic monitoring (ongoing since September 2015).
- Ensuring all of our newly discovered blogs are automatically monitored (October 2015).
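As an illustration of the outgoing-links project in the list above, the following sketch extracts candidate sites from the HTML of a blog post. It assumes that any link to a different host is worth a closer look. The names `AnchorParser` and `outgoing_candidates` are hypothetical, and a real system would still need to check each candidate against the feed requirement and the rules for rejecting non-blogs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlsplit


class AnchorParser(HTMLParser):
    """Collects the href value of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)


def outgoing_candidates(post_html, post_url):
    """Return front-page URLs of other hosts linked from a post."""
    parser = AnchorParser()
    parser.feed(post_html)
    own_host = urlsplit(post_url).hostname
    seen = set()
    candidates = []
    for href in parser.hrefs:
        parts = urlsplit(urljoin(post_url, href))
        # Skip mailto:, javascript: and similar non-web links.
        if parts.scheme not in ("http", "https") or not parts.hostname:
            continue
        # Links back to the same blog are not new discoveries.
        if parts.hostname == own_host:
            continue
        # Reduce to the site's front page (ports are ignored for brevity).
        site = f"{parts.scheme}://{parts.hostname}/"
        if site not in seen:
            seen.add(site)
            candidates.append(site)
    return candidates
```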
We are still not satisfied and the future holds many interesting projects.
A huge challenge when indexing blogs is to prevent the accidental ingestion of undesirable sites such as news sites and forums. Therefore we have quite strict rules for content reaching our systems through, for example, social media, outgoing links and XML-RPC. Naturally, the strict rules likely make us reject some actual blogs. To remedy this we have started a “Blog AI” project. As the name implies, we want an automated system that can determine whether a given site on the Internet is a blog or not. The project is split into several parts, and the first part concerns the ability to detect custom domains [2]. We expect to see parts of that system in production soon.
Another challenge is to find and index newly published posts in a timely fashion. As mentioned, we do have our Automatic Ping system, but it makes assumptions about the user’s blogging pattern based on past behavior. To overcome this problem we have started to work on the next generation of Autoping, which will use the PubSubHubbub protocol for blogs that support it. This means that we will be able to index posts almost instantly after they have been published! We hope to have this ready for evaluation soon.
Keep an eye out for more in-depth posts and in the meantime check out our source discovery and ingestion documentation.
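For context, a PubSubHubbub subscriber asks a hub to push updates for a feed (the topic) to a callback URL, and the hub then verifies the request by sending a challenge to that callback. The sketch below shows both steps using the protocol's `hub.mode`, `hub.topic`, `hub.callback` and `hub.challenge` parameters. The function names and all URLs are hypothetical, and this is an illustration of the protocol rather than the planned Autoping design.

```python
from urllib.parse import parse_qs, urlencode
from urllib.request import Request


def build_subscribe_request(hub_url, topic_url, callback_url):
    """Build (but do not send) the POST that asks a hub for push updates."""
    body = urlencode({
        "hub.mode": "subscribe",
        "hub.topic": topic_url,        # the blog's feed URL
        "hub.callback": callback_url,  # where the hub should deliver updates
    }).encode("ascii")
    return Request(
        hub_url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
        method="POST",
    )


def answer_verification(query_string, expected_topic):
    """Decide how the callback responds to the hub's verification GET.

    Returns a (status_code, body) pair. Echoing hub.challenge confirms
    the subscription; a 404 tells the hub the request was not ours.
    """
    params = parse_qs(query_string)
    if (params.get("hub.mode") == ["subscribe"]
            and params.get("hub.topic") == [expected_topic]):
        return 200, params.get("hub.challenge", [""])[0]
    return 404, ""
```

Once a subscription is confirmed, the hub delivers new entries to the callback as they are published, so no guess about the blogger's posting pattern is needed.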
By Robin Wallin
[1] 4th of October, 2006, to be precise.
[2] A blog that is hosted on a blog platform but uses its own domain name, e.g. blog.twingly.com, which is in fact hosted on wordpress.com.