“The biggest challenge with Big Data is to stop focusing on Big Data”

Every second, a huge and ever-increasing amount of data is published on the web. Gavagai, a Twingly Data client based in Stockholm, has developed a technology to read, aggregate and understand this content. Fredrik Olsson, the company's Chief Data Officer, gives us some insights into this fascinating business and into what the startup is able to do with the blog data it collects.

At Gavagai, you do some sophisticated stuff. Please tell us in a few sentences what your business is all about?
It’s about continuously reading tremendously large and dynamic text streams, and delivering timely and actionable intelligence based on the aggregation of the information therein. Of course, what is actionable depends on your information needs as an actor in a particular domain, be it brand management, assessing threat levels for targets-at-risk, or keeping track of the sentiment towards a particular tradable asset. Example information needs that you are able to address using Ethersource, our system, include:

* How is my brand perceived in comparison to those of my competitors?
* Why are my customers unsubscribing from the services that I’m offering?
* When is the best time to launch this particular advertising campaign?
* How is the campaign recently launched by my competitor received among my target audience?
* Where are online protests against a certain phenomenon most likely to be publicly manifested in terms of a demonstration?

We have a number of case studies available on our blog.

Fredrik Olsson

What’s the founding story of Gavagai?
Gavagai was founded in 2008 by my colleagues Jussi Karlgren and Magnus Sahlgren as a spin-off from the Swedish Institute of Computer Science (SICS). The company was formed in response to the many inquiries Magnus and Jussi received from people outside SICS regarding their research. Gavagai has been operational in its current incarnation since late 2010.

You are one of Twingly’s data clients, which means you are using our API to access data from Swedish- and English-language blogs. Why do you need this information and what do you use it for?
We read data from Twingly 24/7. In particular, the Twingly live feed gives what we believe to be very good coverage of Swedish blogs, which of course is very important to us in meeting the kinds of information needs outlined above when they are expressed by domestic actors.

Do you have any insights about this data from Swedish and English blogs that you want to share? Some surprising fact or observation?
One epiphany we had some time ago was that we’re now able to aggregate and inspect the attitudes and opinions of a population as a whole, which are not necessarily visible in any of its parts. For instance, we can clearly see that Swedish bloggers are optimistic during holidays and weekends, something that is very hard to assess from the posts of any one individual. Analogously, we also pick up on aversive or hostile tendencies in the online population towards a given subject, where it is hard to identify all the facets of the tendency in any one individual. For example, we recently set up a Xenophobic Tracker using, among other things, the Swedish blogosphere as input; the prevalence of violent expressions in that context is not a pretty read.

But it’s not the peak items that we’re most pleased with. With Ethersource, we can pick up and note weak signals and tendencies where other methods fail.

What type of companies or organisations use your services?
The kinds of actors that require actionable intelligence in their efforts to manage brands, make informed decisions based on the ‘temperature’ of an online population as a whole, keep track of the general mood in the markets, or trade specific assets.

Your title is “Chief Data Officer”. That’s not too common, is it? Do you think every company will need a CDO in the future?
No, I don’t think every company will need a CDO in the future. Hopefully, companies will be able to scale down their data management activities, perhaps due to their use of tools and techniques such as Ethersource, and instead focus on their core business, in much the same way that we are able to focus on ours by obtaining data from Twingly instead of harvesting it all ourselves.

Big Data is one of the hottest buzzwords right now, and it’s a field you are active in. What are the potential and the biggest challenges of the increasing amount of data?
We’re currently concerned with human-generated text, so my answer to this question should be read in that light.

The biggest challenge with Big Data is to stop focusing on Big Data. Big Data will, by virtue of the prevailing definition, always be slightly too big to handle with common tools. This has mainly resulted in people being obsessed with processing speed and the ability to store large amounts of data. Few, if any, have focused on a layer in the so-called Big Data Stack that has so far been missing: the Semantic Processing Layer. The key challenge for Big Data is to get to the point where it is easy and swift to turn massive data streams into actionable intelligence; knowledge that you and your organization can act upon in order to obtain a competitive advantage. To put it another way: the key challenge of Big Data is to be of service.

Being a researcher by training and at heart, I believe that we’ve yet to imagine the biggest potential there is in harnessing truly Big Data. Let’s talk about that in a few years, when a more representative sample of the world’s population is active online. Then we’ll be able to find collective answers to questions about mankind that we’re not yet able to even think of.

What’s on your roadmap for the upcoming years? Where do you see the biggest growth and potential for Gavagai?
We’ve got very exciting times ahead of us! Ethersource is already unique in the way it is able to read amounts of text that would overwhelm traditional language processing methods, handle multiple (all) languages in real time, and learn from variations in the input in an unsupervised manner.

Our development plans involve some fairly hefty stuff. In the short term, we’ll roll out a game changer: a way of identifying the many meanings of a given concept and using that information to disambiguate expressions of that concept as they appear in social media. For instance, imagine that you are a brand manager for Apple, Visa, “3” or some other brand with an inherently ambiguous and common name: how do you go about monitoring the attitudes and opinions towards the meaning of the word that constitutes your brand, and only that meaning? There is a solution…

The biggest growth and potential for Gavagai lies in being a supplier of the Ethersource technology to other companies, such as analytics firms, trading desks, governmental agencies and so on, that already have an infrastructure in place but lack the competitive edge that comes from being able to understand and make sense of large text streams in multiple languages. Ethersource is an implementation of the Semantic Processing Layer of the Big Data Stack, and we intend to position it as such.