Modeling News Coverage with Python. Part 2: Starting News Forecasting with Limited Data

Matt Brockman
Towards Data Science
9 min read · Oct 10, 2019


Earlier I wrote a bit about getting news stories to start modeling; now I’ll play around with how to forecast coverage and measure whether a model’s any good.

I’ll start this post by using an off-the-shelf neural network from fast.ai to build models of publishers’ daily production from articles by a handful of publishers, evaluate how well those models fit, and then show why they aren’t that great given the limitations of the dataset we’re working with (entirely arbitrary limitations to keep the code easy to run; changing the granularity and pulling a longer period of articles will get you better results). Then we’ll play a bit with auto-regression and moving averages and see how they outperform the neural network on this limited data, ending up with coverage trends for each publication like this:

SARIMA models for select publications for select coverage with error on test predictions

Let’s get started!

Checking if we still have the data

Last time we pulled 60 days of news coverage from GDELT (https://www.gdeltproject.org). We’ll start by making sure we still have it, and if we don’t, we’ll download it.
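In case the files have gone missing, a minimal check-and-download sketch looks like the following. It assumes the Part 1 setup of one zipped GKG 1.0 file per day saved locally, and the 60-day window here is a guess ending October 6; swap in whatever dates you actually pulled.

import os
import urllib.request
import pandas as pd

# Hypothetical 60-day window; adjust to the dates you pulled in Part 1.
dates = pd.date_range("2019-08-08", "2019-10-06")

for d in dates:
    fname = "{}.gkg.csv.zip".format(d.strftime("%Y%m%d"))
    if not os.path.exists(fname):
        # GDELT 1.0 GKG daily files follow this URL pattern
        urllib.request.urlretrieve("http://data.gdeltproject.org/gkg/" + fname, fname)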

Now that we’ve verified we have all the files we need, we can go through them and pull the articles for the New York Times, Washington Post, Fox News, and CNN like before.
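Since the Part 1 loading code isn’t reproduced here, a rough sketch of that filtering step might look like this (it reuses dates from the sketch above and assumes the GKG 1.0 tab-separated columns, including DATE, LOCATIONS, and SOURCES):

mySources = ["nytimes.com", "washingtonpost.com", "foxnews.com", "cnn.com"]

frames = []
for d in dates:
    day = pd.read_csv("{}.gkg.csv.zip".format(d.strftime("%Y%m%d")), sep="\t", dtype=str)
    # keep rows attributed to one of our four publishers
    frames.append(day[day["SOURCES"].fillna("").isin(mySources)])
df = pd.concat(frames, ignore_index=True)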

After running the code, we have the news articles in our df. Next we can go ahead and break down the coverage of each of the target countries per day per publisher.

First we’ll make sure the date’s a good datetime and set our index to the date.

df.DATE = df.DATE.apply(lambda x: str(x))
df.DATE = pd.to_datetime(df.DATE)
df.fillna("", inplace=True)
df.set_index("DATE", drop=True, inplace=True)

Next we’ll go and one-hot encode the df for whether the article’s been labelled as relevant to one of our countries and group by the publisher and date.

df["dprk"] = df["LOCATIONS"].apply(lambda x: x.find("North Korea") > -1)
df["ukraine"] = df["LOCATIONS"].apply(lambda x: x.find("Ukraine") > -1)
df["russia"] = df["LOCATIONS"].apply(lambda x: x.find("Russia") > -1)
df["iran"] = df["LOCATIONS"].apply(lambda x: x.find("Iran") > -1)
df["china"] = df["LOCATIONS"].apply(lambda x: x.find("China") > -1)
loc_df = df.groupby(["SOURCES", "DATE"])[["dprk", "ukraine", "russia", "iran", "china"]].sum()

So now we can create a big df we’ll call time_series to hold the date values for each publisher and country with the columns labelled as such.

time_series = pd.DataFrame()
for publisher in mySources:
    time_series = pd.concat([time_series, loc_df.loc[publisher].add_prefix("{}_".format(publisher))], axis=1)
time_series.plot(figsize=(20,10))

Cool, so you should have something that looks like the plot above, where we can see there’s a weekly seasonality to the data. Also, the Ukraine coverage takes off around September 23rd.

So let’s run these time series through a neural network and see what it comes up with!

Using Tabular Model from Fast.ai

So now we’re going to use a tabular model (https://docs.fast.ai/tabular.html) from fastai. I’m using the tabular model because it’s quick and easy, even though the time series is way too short to expect a great outcome, and I’m not going to fiddle with polishing the results; it still gets the point across.

Noticing the seasonality, we’re going to want the model to look at the last couple of days of data. Since the seasonality is weekly, we can give it the last 7 days and hope that helps it figure out that there’s a seasonal component to the data. We can also give it the day of the week the article was produced on, and maybe that’ll help it out. For example, for Fox News on North Korea on day X, we’re going to want all publication data from days X-1, X-2, and so forth. So we can create a function to do that for us; a sketch of one is below.
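The original helper isn’t shown in this post, but a sketch of it could look like the following; make_lagged_features is a hypothetical name, and all it does is bolt the previous seven days of every publisher/country column, plus the day of week, onto each row.

def make_lagged_features(time_series, target, n_lags=7):
    # Hypothetical helper: attach the previous n_lags days of every column
    # plus the day of week as features for predicting `target` on each day.
    df = pd.DataFrame(index=time_series.index)
    df[target] = time_series[target]
    for lag in range(1, n_lags + 1):
        df = pd.concat([df, time_series.shift(lag).add_suffix("_lag{}".format(lag))], axis=1)
    df["day_of_week"] = df.index.dayofweek
    return df.dropna()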

We can then create a method that lets us take one of the series from our time_series dataframe and train a neural network to try to model it.
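The actual genNN isn’t reproduced here either, but a rough reconstruction using the fastai v1 tabular API might look like this. The layer sizes, batch size, and ten-day validation split are guesses, and the real function may also hold out a final test slice.

from fastai.tabular import *  # fastai v1

def genNN(time_series, target, valid_days=10):
    # Hypothetical reconstruction: lagged features in, tabular learner out.
    df = make_lagged_features(time_series, target)
    cat_names = ["day_of_week"]
    cont_names = [c for c in df.columns if c not in cat_names + [target]]
    valid_idx = list(range(len(df) - valid_days, len(df)))  # last days as validation
    data = (TabularList.from_df(df, cat_names=cat_names, cont_names=cont_names,
                                procs=[Categorify, Normalize])
            .split_by_idx(valid_idx)
            .label_from_df(cols=target, label_cls=FloatList)
            .databunch(bs=8))
    return tabular_learner(data, layers=[200, 100])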

So for demonstration purposes, we can pick one of our time series and see what the model it generates looks like. We’ll just run through once at a pretty large learning rate just to see what it looks like. If you’re not familiar with fastai, go take the course. Otherwise, you basically want to see how well you can play golf with the fit_one_cycle numbers, although this is just an arbitrary example.

learn = genNN(time_series, "foxnews.com_russia")
learn.fit_one_cycle(1, 3e-02)

So to visualize the results, we’ll want to pull the predictions out of our learner and graph them against the real results. We can color the training, validation, and testing sets differently to get a sense of how different parts of the model perform.
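The plotting code isn’t shown in full, but a sketch along these lines, pulling ordered predictions for the training and held-out data with get_preds, produces a comparable chart (the split sizes follow the hypothetical genNN above):

import numpy as np
import matplotlib.pyplot as plt
from fastai.basic_data import DatasetType

# DatasetType.Fix is the training data without shuffling, so predictions line
# up with the dates; Valid is the held-out final stretch.
train_preds, train_y = learn.get_preds(ds_type=DatasetType.Fix)
valid_preds, valid_y = learn.get_preds(ds_type=DatasetType.Valid)

actuals = np.concatenate([train_y.numpy().ravel(), valid_y.numpy().ravel()])
n_train = len(train_y)

plt.figure(figsize=(12, 8))
plt.plot(actuals, color="steelblue", label="actual")
plt.plot(range(n_train), train_preds.numpy().ravel(), color="green", label="train predictions")
plt.plot(range(n_train, len(actuals)), valid_preds.numpy().ravel(), color="red", label="held-out predictions")
plt.title("foxnews.com_russia: actual vs. model")
plt.legend()
plt.show()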

So is our model good or bad? (The October accuracy is likely a fluke.) We also pulled the mean of the training set so we can check whether we’re even doing better than chance. We’ll use root mean square error (RMSE) to evaluate how well the model’s doing.
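Here’s a sketch of that comparison, reusing the prediction tensors from the plotting snippet above and treating the held-out stretch as the test period (exact numbers will differ from run to run):

from math import sqrt
from sklearn.metrics import mean_squared_error

train_actual = train_y.numpy().ravel()
valid_actual = valid_y.numpy().ravel()
train_mean = train_actual.mean()

print("RMSE training using mean: {}".format(sqrt(mean_squared_error(train_actual, [train_mean] * len(train_actual)))))
print("RMSE training using model: {}".format(sqrt(mean_squared_error(train_actual, train_preds.numpy().ravel()))))
print("RMSE testing using mean: {}".format(sqrt(mean_squared_error(valid_actual, [train_mean] * len(valid_actual)))))
print("RMSE testing using model: {}".format(sqrt(mean_squared_error(valid_actual, valid_preds.numpy().ravel()))))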

>>> RMSE training  using mean: 7.255011530500299
>>> RMSE training using model: 7.2815201128101155
>>> RMSE testing using mean: 8.930425521776664
>>> RMSE testing using model: 6.544865479426577

So this model actually doesn’t do too horribly: it does a really bad job of predicting the training set, but somehow it figures out the early October spike in reporting (I don’t know why). Let’s re-run the model generation and see if it happens again. We should get different results because the model starts off with randomized weights and we’re just following the gradient.

learn = genNN(time_series, "foxnews.com_russia")
learn.fit_one_cycle(1, 3e-02)

And re-running the graphing code gives:

Re-running the evaluation, the model is outperformed on the training set by just taking the mean, and it does a bit worse on the test set than the previous run.

>>> RMSE training  using mean: 7.255011530500299
>>> RMSE training using model: 7.314209362904273
>>> RMSE testing using mean: 8.930425521776664
>>> RMSE testing using model: 7.068090866517802

We can mess with the learning rates to make it nicer,

learn = genNN(time_series, "foxnews.com_russia")
learn.fit_one_cycle(1, 2e-02)
learn.fit_one_cycle(5, max_lr =3e-04)
...
>>> RMSE training using mean: 7.255011530500299
>>> RMSE training using model: 9.782066217485792
>>> RMSE testing using mean: 8.930425521776664
>>> RMSE testing using model: 5.876779847258329

But generally speaking, it’s not a very good model. Then again, it shouldn’t be: it’s only using a couple of publishers and a few broad country labels over 60 days of data.

That was fun. We trained a neural network, but it wasn’t very smart; it didn’t seem to figure out the weekly seasonality even though we hand fed it the days. Let’s move on to what happens if we look at auto-regression with moving averages using more traditional models. (Which I guess we should have started with).

Seasonality, Auto-Regression, and Moving Average

It’s pretty easy to decompose a time series with Python. Assuming you still have the time_series dataframe loaded, we can decompose each column into its observed, trend, seasonal, and residual pieces using statsmodels (https://www.statsmodels.org/stable/index.html) and plot them.

import statsmodels.api as sm
res = sm.tsa.seasonal_decompose(time_series["foxnews.com_russia"])
res.plot()
Foxnews Coverage including Russia Decomposed

Hooray, it decomposed!

We can also look at the autocorrelation, or how much values relate to previous values.

from statsmodels.graphics.tsaplots import plot_acf
plot_acf(time_series["foxnews.com_russia"])

We can see the seasonal component in there as well as the trend.

And we can look at the moving average. We’ll take a simple moving average of the previous two days.

import numpy as np

f_ru = time_series[["foxnews.com_russia"]].copy()
f_ru.columns = ["actual"]
f_ru["sma"] = 0
for i in range(0, len(f_ru) - 2):
    # each day's SMA is the mean of the previous two days' counts
    f_ru.iloc[i + 2, 1] = np.round((f_ru.iloc[i, 0] + f_ru.iloc[i + 1, 0]) / 2, 1)
f_ru.plot(title="Fox Russia prior 2 days SMA")

And we can check the RMSE.

from math import sqrt
from sklearn.metrics import mean_squared_error

f_ru = f_ru[2:]
print("RMSE testing using SMA2: {}".format(sqrt(mean_squared_error(f_ru.actual, f_ru.sma))))
>>> RMSE testing using SMA2: 7.666822836990112

So we have an RMSE that’s slightly worse than what we got above, but for way less effort (aside from me forgetting I had to use .copy() to keep pandas from complaining). Not too shabby, but it’s actually worse than just taking the mean. Probably. (I didn’t check over the whole duration before, as opposed to this time.) So let’s integrate the seasonality, auto-regression, and moving average using the SARIMAX model from statsmodels.

So we’re going to eyeball the parameters on this one. Basically, we need to give the model an autoregressive and moving-average specification as its order, along with the seasonal equivalent as its seasonal_order. We’ll also treat the seasonality as weekly, which is generally true here.

So here goes. First we declare and fit a model.

from statsmodels.tsa.statespace.sarimax import SARIMAX

issue = "foxnews.com_russia"
train_l = 55
s_model = SARIMAX(endog=time_series[[issue]][:train_l][1:],
                  exog=time_series[[x for x in time_series.columns if x != issue]][:train_l].shift().add_suffix("_l1")[1:],
                  order=(2, 0, 1),
                  seasonal_order=(1, 0, 1, 7)).fit()

Now we want to go and get our predictions. We’re OK using the real values for previous counts because we’d have that in the real world unless we’re predicting out really far (but we didn’t do that earlier).

import datetime

f_ru = time_series[[issue]].copy()[1:]
f_ru.columns = ["actual"]
f_ru["predicted"] = s_model.predict(end=datetime.datetime(2019, 10, 6),
                                    endog=time_series[[issue]][-5:],
                                    exog=time_series[[x for x in time_series.columns if x != issue]].shift()[-5:],
                                    dynamic=False)

# plotting
ax = f_ru["actual"].plot(title="Russia Mentions in Fox", figsize=(12, 8))
f_ru["predicted"][:-5].plot(color="orangered")
f_ru["predicted"][-6:].plot(color="red")

So we have a SARIMA model that looks pretty good! And how are the RMSEs?

training = f_ru[:-5].copy()
print("RMSE training using model: {}".format(sqrt(mean_squared_error(training.actual, training.predicted))))
testing = f_ru[-5:].copy()
print("RMSE testing using model: {}".format(sqrt(mean_squared_error(testing.actual, testing.predicted))))
>>> RMSE training using model: 3.098788909893913
>>> RMSE testing using model: 6.281417067978587

So the SARIMAX here does way better than the neural network above. We can then re-run the code in a loop over all of the combinations we’re looking at (and you can add any publishers and topics you like); a sketch of that loop is below. Just a heads up: the code will spit out a lot of warnings.
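Here’s roughly what that loop could look like, fitting the same SARIMAX spec to every publisher/topic column and plotting each fit with the last five days out of sample:

import matplotlib.pyplot as plt
from statsmodels.tsa.statespace.sarimax import SARIMAX

train_l = 55
for issue in time_series.columns:
    # lagged counts from every other publisher/topic column as exogenous regressors
    exog = time_series[[c for c in time_series.columns if c != issue]].shift().add_suffix("_l1")
    model = SARIMAX(endog=time_series[[issue]][:train_l][1:],
                    exog=exog[:train_l][1:],
                    order=(2, 0, 1),
                    seasonal_order=(1, 0, 1, 7)).fit(disp=False)
    preds = model.predict(end=time_series.index[-1], exog=exog[-5:], dynamic=False)
    ax = time_series[issue][1:].plot(title=issue, figsize=(12, 6))
    preds.plot(ax=ax, color="red")
    plt.show()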

It doesn’t do too well on the Ukraine forecasting, but that should be expected because the late-September spike doesn’t fit the previous trends.

Try playing with the parameters of the models and changing around the publications and coverage topics. Or better yet, go back and download a longer period of articles and see how more data improves training and testing.

Moving On

Between Part 1: Introduction and this post, you should have a pretty good idea of how we can start to understand the ways news outlets interact with one another. By modeling news coverage over time, we can evaluate whether one model incorporates the factors influencing coverage better than another.

What does this matter? Well, when people run around saying that the news is doing this or that, if we don’t have models to evaluate their claims it’s really hard to tell who’s telling the truth or not. But in just (hopefully) a few minutes we were able to grab data, run computational models, and forecast future coverage based on media coverage alone. This all plays into the political science/communications/journalism studies on agenda setting (e.g. https://www.tandfonline.com/doi/full/10.1080/15205436.2014.964871) but actually predicting news coverage isn’t as widespread as I’d thought and I’m not sure why that is.

Now, there are undoubtedly a lot of issues with the math here, and I leave improving it to the reader (or writing up their results :)). There are also issues with the data: we used 4 of the 100,000 news sources available to us, 5 locations out of the very large domain of possible combinations of themes, locations, and people, and only 60 days out of years of coverage. We also eyeballed a LOT of the parameters; ultimately you don’t want to be looking at just one topic but constantly monitoring them all anyway. However, hopefully this was straightforward enough that anyone can pick it up and mess with whatever they’re interested in.

I’m not entirely sure what I’m going to cover in Part 3. Incorporating measures of public interest in issues, operationalized through Google Search or various Twitter data, might be fun, although the hard part for a blog post is making sure the data’s easily accessible. I might also do longer trends. I just started writing these to procrastinate on studying for midterms, so I haven’t thought it all the way through.
