Inventing smartify: how our Labs team did it

Some background

Smartocto has innovation hardwired in its DNA. We believe that we can only help online journalism overcome its challenges with digital transformation if we continuously try to improve our own product, services, features and functionalities. To streamline requests for innovations, we set up a Labs Team in which we discuss all the opportunities and ideas on our roadmap, and determine which should have priority.

AI for the newsroom

Over the past year, smartocto's Labs Team has been working hard to deliver AI-driven decision support to newsrooms. We aim high: our goal is to predict the impact of a story across various channels before the story itself is launched and shared. We started to envision how powerful it would be if editors could learn in advance about a particular story's potential to drive traffic back to their platforms via link clicks, shares, likes, conversions and more.

But we wanted more, all packed around a simple yet all-encompassing question: how much more could news organisations profit from focusing their content strategy on potential successes? That simple thought (which took us only a few minutes to think up) was the basis for building smartify as we know it now (which took a little more time).

Web content popularity prediction in smartify

Generally, the problem that we are cracking at smartocto is known in data science as the problem of web popularity prediction. In the era of solutionism, we are confronted with a problem that basically has no ready-made solution. Algorithms are assigned unprecedented powers (when it comes to automated machine learning, for example). Everyone believes that there is an arithmetic solution for everything. Any set of data or numbers must be solvable or interpretable.

We expect predictions to be accurate. At all times. About everything. We decided to share with you our inspiration, our struggle and our fascination with this problem of a markedly different nature, letting you peek under the hood of what is happening in smartocto’s Labs, which will soon begin deploying its decision support solution to selected clients. The solution itself is: smartify.

Smartify faces a very difficult variant of the web popularity prediction problem, known as ex-ante popularity prediction (cf. this review article from 2019 for an elaboration). In this version of the problem, a predictive model needs to estimate the future popularity of a post (on Facebook, for example) before the post is actually shared on the social network. To do so, smartify uses only one feature that follows the publication of a post on the client’s website: the number of pageviews the post achieves in the first half hour after its publication.

These early pageviews are then used in a particular heuristic discovered in smartocto’s Labs to estimate the “real thing”, namely the number of impressions that a post might reach. We do not believe that a higher number of website pageviews somehow causes a higher number of impressions or clicks on the social channel, of course - but the two are certainly correlated to some degree.

The reach of a post proved to be of crucial importance in all our research on web popularity prediction, and it is itself very difficult to predict. How many impressions a particular post will reach depends upon many factors, some of which are deeply buried in the black box of a particular web channel’s algorithm. Such algorithms determine what content will be shown to which users and when.

Besides, impressions can sometimes be put under the users’ control, e.g. users can boost their posts to obtain more impressions. That makes it very difficult to estimate post impressions precisely before the post is actually shared, but we have figured out a suitable approximation which helps us guess at least the league in which a particular post will play.

Beyond this estimate of post impressions, our mathematical models of web popularity take into account a plenitude of other factors: temporal features, describing when the post was shared or the time elapsed between its publication on the website and on another web channel; semantic analyses of the post titles and the editorial tags used; additional content metrics such as readability and lexical diversity measures; historical statistics on the performance of similar posts in the past; and more. What is missing in the variant of the web popularity prediction problem that smartify addresses is a set of behavioural features, like early post clicks, shares, likes, and similar: they become available only after the post is shared.
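To make a couple of these feature families concrete, here is a minimal sketch of how two of them could be computed. It is only an illustration: the function names are hypothetical, the type-token ratio is just one common lexical diversity measure (which measures smartify actually uses is not stated here), and the timestamps are made up.

```python
import re
from datetime import datetime


def type_token_ratio(text):
    """Lexical diversity as unique words divided by total words."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)


def temporal_features(published_at, shared_at):
    """Describe when a post was shared and how long after publication."""
    delay = shared_at - published_at
    return {
        "share_hour": shared_at.hour,
        "share_weekday": shared_at.weekday(),  # Monday is 0
        "delay_minutes": delay.total_seconds() / 60,
    }


# Made-up example post: title, publication time and sharing time.
title = "The cat and the dog"
published = datetime(2021, 3, 1, 9, 0)
shared = datetime(2021, 3, 1, 10, 30)

features = {"title_ttr": type_token_ratio(title)}
features.update(temporal_features(published, shared))
print(features)
```

Features like these would sit next to the impression estimate, the semantic features and the historical statistics in a model's input.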

Reusing history

Certainly, decision support systems like smartify will never take over human editorial decisions. That is the reason why we call them decision support systems: they are built to help us sort out large amounts of information so as to understand the decision landscape better, to advise us and not just bluntly decide for us. That, in conjunction with the fact that post impressions - as the main drivers of post popularity - are not fully predictable, made us think about building a version of smartify that splits the control and prediction parts of its operation between the editors and their AI.

Our research efforts showed us that we are able to build quite good predictive models of web popularity under the condition that we train our machine learning models with real, historical data on the number of impressions that any particular post in our models’ training sets has reached. However, we do not have that information before the post is shared (even worse - such information is available only at the very end of the post’s lifecycle), and we can only predict it up to a certain degree.

But if we already have a model pre-trained on real, historical data, then for each new post we can ask the model to provide a series of predictions, each made on the assumption that the post would reach some number of impressions. That means that we can analyze any given post and then predict the number of clicks it will reach, conditional on that post reaching 100, 1,000, 10,000, 100,000, ... impressions.
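The idea of conditional predictions can be sketched in a few lines. This is a toy illustration under loud assumptions, not smartify's model: the historical numbers are invented, the "model" is a plain least-squares fit of log clicks on log impressions, and it ignores all the other post features (titles, tags, timing) that a real model would use. All function names are hypothetical.

```python
import math

# Invented historical data: (impressions, clicks) for past posts.
HISTORY = [(100, 4), (1_000, 30), (10_000, 250), (100_000, 1_800)]


def fit_log_log(history):
    """Least-squares fit of log10(clicks) on log10(impressions)."""
    xs = [math.log10(impressions) for impressions, _ in history]
    ys = [math.log10(clicks) for _, clicks in history]
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict_clicks(model, impressions):
    """Predicted clicks, conditional on an assumed number of impressions."""
    slope, intercept = model
    return 10 ** (intercept + slope * math.log10(impressions))


def scenario_predictions(model, scenarios=(100, 1_000, 10_000, 100_000)):
    """One prediction per assumed impression level."""
    return {impressions: predict_clicks(model, impressions) for impressions in scenarios}


model = fit_log_log(HISTORY)
for impressions, clicks in scenario_predictions(model).items():
    print(f"{impressions:>7} impressions -> about {clicks:.0f} clicks")
```

The point of the design is that the number of impressions is treated as an input the editor can vary, rather than as something the model has to guess correctly up front.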

Since impressions necessarily grow with time, in effect we have begun to estimate the time series of a post’s popularity. And since our models are not bad at predicting a post's popularity for a certain number of impressions, we have every reason to believe that we can estimate such time series nicely.

Finally, it turns out to be possible in this approach to qualify any given post by a single metric of its future potential, ranking the posts according to their expected click-through rate (CTR). And that is how smartify can choose which posts to recommend to its users for further action (e.g. by sharing them, boosting their performance, or following up on the post discussion).
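A ranking of this kind could look roughly like the sketch below. It assumes that each post comes with predicted clicks for a few impression scenarios, and it averages the per-scenario CTRs with equal weights; that aggregation rule, the function names and the numbers are assumptions made for illustration, not a description of how smartify computes its metric.

```python
def expected_ctr(predicted_clicks):
    """Mean click-through rate over impression scenarios.

    predicted_clicks maps an assumed number of impressions to the
    predicted number of clicks. Equal scenario weights are an assumption.
    """
    rates = [clicks / impressions for impressions, clicks in predicted_clicks.items()]
    return sum(rates) / len(rates)


def rank_posts(posts):
    """Return post ids ordered from highest to lowest expected CTR."""
    return sorted(posts, key=lambda post_id: expected_ctr(posts[post_id]), reverse=True)


# Invented predictions for two hypothetical posts.
posts = {
    "post-a": {100: 2, 1_000: 20},
    "post-b": {100: 5, 1_000: 40},
}
for post_id in rank_posts(posts):
    print(post_id, round(expected_ctr(posts[post_id]), 4))
```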

On control and prediction

Going beyond what can be done and back to the question of the division of labor between us humans and the AI, we now begin to think that providing feedback on the future performance of a post under different scenarios could be the best thing to do. Again, post performance can be influenced by a myriad of factors, of which only some are under the publisher’s control.

Looking at the full landscape of possibilities - while still being able to assess any post by a single performance metric - leaves enough control with the editors in the decision-making process. A post might be ranked high by a single metric, but what if an editor expects a post from a particular section of their website to grow faster in popularity than the model predicts?

What follows is that single metrics of popularity should not be considered in isolation, and smartify is able to provide a wider context for such decision problems. A post could be boosted not because the system predicts it will be highly popular, but because the newsroom considers the topic highly important and wants to raise public awareness of it.

Well then, let’s check how many clicks the post would obtain via smartify’s predictions over a range of impressions that it might reach, to estimate how much to invest in it.

To wrap it all up

Smartify is a very promising and innovative tech solution that will bring a lot of knowledge and relevant real-time decision making to the newsroom. We’re convinced of that. But - although we like what we see so far - we need to be modest and realistic as well. Smartify needs as much data as we can get to become more accurate and precise. We aim for the moon, but in all fairness there’s a lot to overcome in the coming period. Working closely with our clients, getting feedback and collecting more data to challenge the predictions will help to optimise smartify altogether. So here’s our little shopping list:

The data sets that are currently used to train the smartify predictive models are scarce. The Labs Team is currently working with data sets that are orders of magnitude smaller than the data sets used to explore different approaches to this problem in the academic literature. Therefore more data collection is our first concern. With more data we can estimate the limits of the existing approaches to web content popularity prediction. You know by now that this is something to strive for. (You’re always welcome to reach out if you believe you can help.)
We foresee that the smartify predictive machinery will be improved by exploring the public discourse in real time and developing an “early warning system” to discover topics that might become popular in the immediate future. To be honest, this is the first spin-off. But the first results already give direct clues that this could become an important additional suggestion for newsrooms.
We are also contemplating a more complex approach where (a) the first prediction is made by the current smartify predictive machinery before the post is shared on social media, and (b) that prediction is subsequently improved by another, different predictive model after the post is shared and the first reactions to it become available (and validated).
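As a rough illustration of the last item, here is one simple way an ex-ante estimate could be refined once early reactions arrive. Note that this swaps in a standard technique, the posterior mean of a Beta-binomial model, purely as a stand-in; the second model the Labs Team has in mind is not described here, and the function name, the prior strength and all numbers are hypothetical.

```python
def updated_ctr(prior_ctr, prior_strength, observed_clicks, observed_impressions):
    """Blend an ex-ante CTR estimate with early observed data.

    prior_ctr is the prediction made before sharing. prior_strength is the
    number of pseudo-impressions that prediction is worth (an assumption).
    The result is the posterior mean of a Beta-binomial model.
    """
    pseudo_clicks = prior_ctr * prior_strength
    return (pseudo_clicks + observed_clicks) / (prior_strength + observed_impressions)


# Before sharing: no observations, so the ex-ante estimate is returned.
print(updated_ctr(0.02, 1_000, 0, 0))

# After sharing: 80 clicks on the first 1,000 impressions pull the estimate up.
print(updated_ctr(0.02, 1_000, 80, 1_000))
```

With few early impressions the ex-ante prediction dominates; as observations accumulate, the observed click rate takes over.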

So as you can imagine, we will be busy in the coming period with optimising, validating and fine-tuning the way smartify can help newsrooms make better decisions. We believe it to be the next paradigm in content analytics and are more than excited about the results we already see. But both Serbia and the Netherlands are famous soccer-playing countries, and the saying goes that the first goal doesn’t win the game (per se) - there are always the Germans to consider, as Gary Lineker once delicately explained*.

So, although it feels like we’re 1-0 ahead, there’s at least another half to play. So please give us your thoughts or follow us to find out how we will proceed with the development of smartify! And in the meantime you can check our smartify whitepaper for some more information.

* The famous quote: "Football is a simple game: 22 men chase a ball for 90 minutes and, in the end, the Germans always win," Lineker said.


---

This project is developed with support from the Ministry of Economic Affairs, National EZ subsidies, MIT executed by RVO.