{"Title": "Modeling 20 years of scandals with python", "Date": "2013-7-4", "Category": "ipython", "Tags": "nlp, ipython, gensim, pandas", "slug": "scandal-modeling-with-python", "Author": "Chris" }

Topic detection with Pandas and Gensim

A few months ago, the unending series of recent scandals inspired me to see whether it would be possible to comb through the text of New York Times articles and automatically detect and identify different scandals that have occurred. I wanted to see whether, given articles about the DOJ, IRS, NSA and all the rest, the text would be enough for an algorithm to identify them as distinct scandals and distinguish them from one another, in an unsupervised fashion.

This also gave me an excuse to explore gensim and show off some of pandas' capabilities for data-wrangling.

The IPython notebook for this post is available at this repo (and I grabbed the ggplot-esque plot settings from Probabilistic Programming for Hackers).

Let's get started by picking up where we left off scraping the data in part 1, and pull all those articles out of mongo.

In [2]:
from __future__ import division
import json
import pandas as pd
import numpy as np
from time import sleep
import itertools
import pymongo
import re
from operator import itemgetter
from gensim import corpora, models, similarities
import gensim
from collections import Counter
import datetime as dt
import matplotlib.pyplot as plt

pd.options.display.max_columns = 30
pd.options.display.notebook_repr_html = False
In [3]:
%load_ext autosave
%autosave 30
Usage: %autosave [seconds]
autosaving every 30s

Init

In [4]:
connection = pymongo.Connection("localhost", 27017)
db = connection.nyt
In [5]:
raw = list(db.raw_text.find({'text': {'$exists': True}}))

I ran this the first time to make sure it doesn't choke on title-less documents (there should be code to fix that in pt. 1 now, though):

for dct in raw:
    if 'title' not in dct:
        dct['title'] = ''

Some helpful functions

The format function should be pretty self-explanatory, and search is to be used later on to verify topic words.

In [6]:
def format(txt):
    """Turns a text document to a list of formatted words.
    Get rid of possessives, special characters, multiple spaces, etc.
    """
    tt = re.sub(r"'s\b", '', txt).lower()  #possessives
    tt = re.sub(r'[\.\,\;\:\'\"\(\)\&\%\*\+\[\]\=\?\!/]', '', tt)  #weird stuff
    tt = re.sub(r' *\$[0-9]\S* ?', ' <money> ', tt)  #dollar amounts
    tt = re.sub(r' *[0-9]\S* ?', ' <num> ', tt)    
    tt = re.sub(r'[\-\s]+', ' ', tt)  #hyphen -> space
    tt = re.sub(r' [a-z] ', ' ', tt)  # single letter -> space
    return tt.strip().split()


def search(wrd, df=True): 
    """Searches through `raw` list of documents for term `wrd` (case-insensitive).
    Returns titles and dates of matching articles, sorted by date. Returns
    DataFrame by default.
    """
    wrd = wrd.lower()
    _srch = lambda x: wrd in x['text'].lower()
    title_yr = ((b['title'], b['date'].year) for b in filter(_srch, raw))
    ret = sorted(title_yr, key=itemgetter(1))
    return pd.DataFrame(ret, columns=['Title', 'Year']) if df else ret


dmap = lambda dct, a: [dct[e] for e in a]
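
As a quick sanity check (this snippet isn't in the original notebook), here's what format does to a made-up sentence: possessives and punctuation get stripped, dollar amounts and other numbers are collapsed into placeholder tokens, and single letters disappear.

print(format("The Senator's $4,000 payment, made in 2006, was a scandal!"))
# -> ['the', 'senator', '<money>', 'payment', 'made', 'in', '<num>', 'was', 'scandal']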

Model generation

Now apply the format function to all the text, build a gensim dictionary from the tokens, and convert each document to the bag-of-words (word count) form that gensim's models can work with. The TfidfModel transformation will take into account how common a word is in a certain document compared to how common it is overall (so the algorithm won't just be looking at the most common but uninformative words like the or and).

In [7]:
texts = [format(doc['text']) for doc in raw]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
tcorpus = dmap(tfidf, corpus)
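
To see what the bag-of-words and tf-idf steps are doing, here's a toy example that isn't part of the original analysis: a word that appears in every document, like scandal below, gets a tf-idf weight of zero and drops out of the vector, while the rarer words carry most of the weight.

toy_texts = [['senate', 'scandal'], ['enron', 'scandal'], ['enron', 'accounting', 'scandal']]
toy_dict = corpora.Dictionary(toy_texts)
toy_corpus = [toy_dict.doc2bow(t) for t in toy_texts]
toy_tfidf = models.TfidfModel(toy_corpus)
print(toy_tfidf[toy_corpus[2]])  # 'scandal' is gone; 'accounting' outweighs 'enron'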
In [8]:
np.random.seed(42)
model = models.lsimodel.LsiModel(corpus=tcorpus, id2word=dictionary, num_topics=15)

Analysis

As far as I know, the only way to get the topic information for each article after fitting the model is to loop through and manually grab the topic-score list for each one.

In [9]:
_kwargs = dict(formatted=0, num_words=20)
topic_words = [[w for _, w in tups] for tups in model.show_topics(**_kwargs)]
In [10]:
%%time

it = itertools.izip(corpus, ((d['title'], d['date']) for d in raw))
topic_data = []  #for each article, collect topic with highest score
_topic_stats = []  # gets all topic-score pairs for each document

for corp_txt, (tit, date) in it:
    _srtd = sorted(model[corp_txt], key=itemgetter(1), reverse=1)
    top, score = _srtd[0]
    topic_data.append((tit, date, top, score))
    _topic_stats.append(_srtd)

topic_stats = [tup for tups in _topic_stats for tup in tups]  #flatten list(tuples) -> list
CPU times: user 7.92 s, sys: 94 ms, total: 8.01 s
Wall time: 8.84 s

The topic_data and _topic_stats lists keep data on each article and sorted lists of topic-score tuples:

In [11]:
print topic_data[0]
print _topic_stats[0]
(u'Prostitution Allegations Surface In a Congressional Bribery Case', datetime.datetime(2006, 4, 30, 0, 0), 7, 2.7811526623548612)
[(7, 2.7811526623548612), (6, 1.7265555367260221), (1, 1.5921552774434533), (4, 1.4378598933149978), (2, 1.2497103178186022), (13, 0.98475274493146003), (3, 0.86188853116490982), (5, 0.64386289075029712), (12, 0.22717706836159843), (14, -0.023088674938776539), (10, -0.040687360022584669), (9, -0.59031083481273061), (8, -0.97975393896831242), (11, -1.2718015704185801), (0, -10.095652453409469)]

Now we can put the topic information into pandas, for faster, easier analysis.

In [12]:
df = pd.DataFrame(topic_data, columns=['Title', 'Date', 'Topic', 'Score'])
df.Date = df.Date.map(lambda d: d.date())
print df.shape
df.head()
(9706, 4)
Out[12]:
                                               Title        Date  Topic  \
0  Prostitution Allegations Surface In a Congress...  2006-04-30      7   
1  Senate Panel Asked to Give S.E.C. Proposals a ...  2004-04-09      2   
2          Career C.I.A. Figure Is at Eye of Scandal  2006-05-12      7   
3  Bush Heads to Colombia as Scandal Taints Key A...  2007-03-11      1   
4            Senate Panel Rejects Ethics Office Plan  2006-03-03      2   

       Score  
0   2.781153  
1   2.808630  
2   6.272936  
3   2.713814  
4  10.141214  

By plotting the distribution of topic labels for each document, we can now see that the detected topics are not very evenly distributed.

In [13]:
vc = df.Topic.value_counts()
plt.bar(vc.index, vc)
_ = plt.ylabel('Topic count')
# df.Topic.value_counts()  #This would give the actual frequency values

One high-level question I had was whether certain topics can be seen varying in frequency over time. Pandas' groupby can be used to aggregate the article counts by year and topic:

In [14]:
year = lambda x: x.year
sz = df.set_index('Date').groupby(['Topic', year]).size()
sz.index.names, sz.name = ['Topic', 'Year'], 'Count'
sz = sz.reset_index()
sz.head()
Out[14]:
   Topic  Year  Count
0      1  1992    105
1      1  1993    113
2      1  1994     68
3      1  1995     54
4      1  1996     41

which can then be reshaped with pivot, giving us a Year $\times$ Topic grid:

In [15]:
top_year = sz.pivot(index='Year', columns='Topic', values='Count').fillna(0)
top_year
Out[15]:
Topic   1    2   3   4   5   6    7   8   9   10  11  12   13  14
Year                                                             
1992   105  122  48  40  17  33   99   2  12  21   8   0   46  37
1993   113   43  50  20  12  46   60   0  14  32  10   1   31   8
1994    68   37  29  23   9  31   51   0   8  37   7   0   51   7
1995    54   32  26  11   5   7   41   1   2  19   4   1   33   9
1996    41   27  26  18   7  12   45   1   8  37  13   7   38  43
1997    66   88  32  28  13  33   52   9  10  13   9   0   52  42
1998    45  156  94  67  11  29  107   0  14  56  73   2   74  33
1999    75   59  41  21  43  16   70   3   4  23  23   3   59  19
2000    65   38  47  10  96  52   56   1   4  22   9   5   65  17
2001    54   26  46  13  15  23   58   5   3  19   3   4   39  11
2002   122  140  41  66  29  34   44   1   4  13   2  11   81  15
2003    81   56  45  23  13  11   50   1   3  11   7   2   48  18
2004    66   59  69  43  29   7   56   1   2  11  20  17   57  30
2005    69   61  51  20   7   8   41   0   1  13   1  43   42   5
2006    89  248  79  19  11  10   82   5   2  16   7  11  134  16
2007    69   82  66  28  13  19   79   3   4   6   2   2   84  29
2008    61   57  62  19   8   5   70   1   4  10   6   2  153  33
2009    42   44  49  11   5  23   39   0   3  17   6   4   92  17
2010    47   58  73   4  15  27   45   1   6   9   7   1  109  14
2011    58   27  56   3   6  21   29   0   2  12   6   3  118   7
2012    53   39  69   6  15  13   58   0   0  17   1   0  102  14
2013    53   34  54   6   5  13   38   0   0   9   0   1   50   4

In Pandas land it's easy to find lots of basic information about the distribution: a simple boxplot will give us a good idea of the min/median/max number of times a topic was represented over the two decades of articles.

In [16]:
plt.figure(figsize=(12, 8))
top_year.boxplot() and None

Topics 8, 9, 11 and 12 hardly show up, while in typical years topics like 1 and 2 are heavily represented. The plot also shows that articles most closely associated with topic 2 show up nearly 250 times in a single year.

(For the curious, viewing the distribution of scandalous articles across topics for each year is as easy as top_year.T.boxplot().)

The plot method can automatically plot each column as a separate time series, which can give a view of the trend for each scandal-topic:

In [17]:
_ = top_year.plot(figsize=(12, 8))

The number of times articles with different topics show up in a year varies a lot for most of the topics. It even looks like there are a few years, like 1998 and 2006, where multiple topics spike. Plotting the sum of articles over all topics in a given year can verify this:

In [18]:
_ = top_year.sum(axis=1).plot()

Topic words

Now it's time to look at the words that the model associated with each topic, to see if it's possible to infer what each detected topic is about. Stacking all the words of the topics and getting the value counts gives an idea of how often certain words show up among the topics:

In [19]:
pd.options.display.max_rows = 22
top_wds_df = pd.DataFrame(zip(*topic_words))
vc = top_wds_df.stack(0).value_counts()
print vc
pd.options.display.max_rows = 400
page       12
enron      11
clinton    10
bush       10
gore        9
japan       9
her         8
she         8
cuomo       7
...
kohl        1
tax         1
voters      1
attorney    1
raising     1
mr          1
budget      1
bill        1
district    1
Length: 101, dtype: int64

It looks like the most common topic words are page, enron, clinton and bush, with gore just behind. These words might be less helpful for finding the meaning of topics, since they're closely associated with practically every topic of political scandal in the past two decades. It shouldn't be surprising that presidents show up among the most common topic words, and a cursory look at articles with the word page (using search, defined above) suggests the word shows up both in sexual scandals involving congressional pages and in a bunch of references to front-page scandals or the op-ed page.

You can find the specific headlines from my dataset that include the word page with search('page') (and duckduckgo should be able to handle the rest).
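
For instance (a small sketch; the exact numbers will depend on the scraped dataset), you can see how those page articles spread out over the years:

page_articles = search('page')
print(page_articles.Year.value_counts().sort_index())  # article counts per year
page_articles.head()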

In the following, I've given each topic word a simple score based on how unique it is, from a low score of 0 for the most common word up to 11 for words that only appear in a single topic. All 15 topics are summarized below, with the top words scored by how common they are.

Topics 1 and 6 look to have the most cliched scandal words, while the last few topics are characterized by quite a few unique words.

In [20]:
pd.options.display.line_width = 130
# score = vc.max() - count: 0 for the most common word, 11 for words unique to one topic
top_wd_freq = {w: '{}-{}'.format(w, vc.max() - cnt) for w, cnt in vc.iteritems()}
top_wds_df.apply(lambda s: s.map(top_wd_freq))
Out[20]:
               0              1              2           3              4              5              6                 7   \
0       clinton-2      clinton-2          her-4      gore-3        enron-1         gore-3         gore-3           enron-1   
1          bush-2         gore-3          she-4       her-4        party-9  impeachment-6         page-0             mrs-5   
2           mr-11          her-4     minister-8       she-4     governor-6      bradley-8        enron-1             her-4   
3           she-4          mrs-5        japan-3      bush-2        starr-6        starr-6        japan-3            gore-3   
4           her-4          she-4        enron-1       mrs-5      voters-11      clinton-2         bush-2             she-4   
5   republicans-7        starr-6  republicans-7        ms-6         gore-3        japan-3      rowland-5           obama-8   
6         house-9    lewinsky-10          mrs-5   bradley-8      clinton-2         bush-2          she-4          mccain-7   
7          gore-3  impeachment-6        prime-8  governor-6      corzine-7        house-9         iraq-7           starr-6   
8   republican-10        japan-3         page-0   rowland-5         page-0     minister-8          her-4     impeachment-6   
9         party-9     minister-8   committee-10     japan-3  department-11  republicans-7     minister-8            bush-2   
10   democrats-10           ms-6      senate-10     cuomo-5  candidates-11    lewinsky-10        ozawa-7         spitzer-5   
11           ms-6      bradley-8  republican-10      rell-8  accounting-10    gingrich-11     japanese-6     accounting-10   
12      senate-10        enron-1   lawmakers-11   spitzer-5        cuomo-5         page-0          mrs-5          stock-11   
13    campaign-11        prime-8     japanese-6    county-7        white-9         city-7           ms-6      corporate-11   
14        state-9         page-0      ethics-11      city-7       county-7        ozawa-7        obama-8        counsel-10   
15     governor-6        white-9        bill-11     state-9     counsel-10        cuomo-5        prime-8       attorney-11   
16   president-10   companies-10        italy-8    mayor-10        state-9        enron-1    hosokawa-10           cuomo-5   
17        white-9     japanese-6   berlusconi-6     vice-11     justice-10        party-9      bradley-8        justice-10   
18     <money>-11   president-10        house-9    mccain-7        japan-3     japanese-6  republicans-7  investigation-11   
19   committee-10    clintons-11  parliament-11   albany-10   democrats-10        prime-8    kanemaru-10      companies-10   

               8             9               10              11             12              13                 14  
0         enron-1  berlusconi-6       rowland-5       rowland-5      corzine-7          page-0             page-0  
1        mccain-7       japan-3          rell-8          page-0    forrester-8         enron-1            enron-1  
2       corzine-7       italy-8    berlusconi-6          bush-2         page-0         obama-8          corzine-7  
3          page-0     rowland-5      governor-6       clinton-2    mcgreevey-9       clinton-2           mccain-7  
4          bush-2     clinton-2         italy-8   impeachment-6      spitzer-5           mrs-5          spitzer-5  
5      governor-6    japanese-6         japan-3    berlusconi-6      jersey-11        mccain-7             gore-3  
6          gore-3       enron-1  connecticut-10            ms-6        cuomo-5       spitzer-5            obama-8  
7           her-4     spitzer-5          bush-2          iraq-7         bush-2       budget-11        forrester-8  
8           she-4        page-0          iraq-7           she-4  torricelli-11          tax-11           delay-11  
9     forrester-8       cuomo-5      japanese-6           her-4    paterson-10      percent-11           county-7  
10  impeachment-6       ozawa-7         enron-1         italy-8      albany-10        nixon-11        mcgreevey-9  
11    mcgreevey-9    italian-10       clinton-2      governor-6       codey-11          city-7  representative-11  
12        soft-11        bush-2         cuomo-5          rell-8         iraq-7  blagojevich-11       berlusconi-6  
13      rowland-5  christian-11          page-0       corzine-7      rowland-5    berlusconi-6            dole-11  
14      clinton-2   hosokawa-10         ozawa-7         japan-3       county-7     giuliani-11      impeachment-6  
15       money-11       kohl-11          city-7         ozawa-7        starr-6          iraq-7           foley-11  
16      spitzer-5        rell-8      italian-10  connecticut-10       bruno-11     abramoff-11             city-7  
17     raising-11   kanemaru-10       hevesi-11         starr-6   schundler-11        mayor-10      republicans-7  
18    earmarks-11   paterson-10        county-7     forrester-8  lautenberg-11      billion-11        district-11  
19   lobbyists-11  andreotti-11           mrs-5          cia-11      clinton-2         plan-11               ms-6  

Story telling

Now comes the fun part, where we can try to find explanations for the choices of topics generated by Gensim's implementation of LSI. While LSI can be very good at finding hidden factors and relationships (i.e., topics) across different documents, there is no easy way that I'm aware of to interpret the model and see why it groups documents under certain topics. The best way I know of is to eyeball it, which we can do with the topic-word dataframe above.

For example, topics 1 and 5 include the words impeachment, lewinsky, gore, clinton and starr, so it's probably a safe bet to say they're referring to the Lewinsky scandal. And looking at the topic-year plot from above (In [17]), we can see that at least topic 5 has a major spike in the years following the scandal.

Both also include the rather high-scoring terms prime and minister, which are probably indicative of the large number of world news summaries included under the topics. For example, 343 of Topic 1's articles have the title News Summary, while no other topic has even 40 summaries:

In [21]:
t1 = df[df.Topic == 1].Title.value_counts()
t1[t1 > 10]
Out[21]:
NEWS SUMMARY       275
News Summary        68
BUSINESS DIGEST     26
WORLD BRIEFING      16
World Briefing      14
dtype: int64
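
To check the other topics too, something like this (a sketch, assuming the summaries are always titled some variant of 'news summary') counts the summary-titled articles per topic:

is_summary = df.Title.str.lower().str.contains('news summary')
print(df[is_summary].groupby('Topic').size())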

Topic 3 looks like it's associated with state- and city-level scandals around New York and New England. Aside from the cliched terms, we have rowland and rell, likely in reference to corruption in Connecticut, and some more pretty specific indicators like mayor, governor, cuomo, spitzer, city, state and albany.

Topic 12 looks like it covers New Jersey pretty well. Other than the state's name itself as one of the topic words, you've got Corzine, Torricelli, Codey, Schundler and Lautenberg, none of which appear outside of this topic except for Corzine.

Several look international in nature, especially topic 9, which has strong Italian (berlusconi, italy, italian and the unique andreotti) and Japanese (japan, japanese, ozawa, hosokawa and kanemaru) showings, and also uniquely identifies German chancellor Helmut Kohl.

Topic 13 seems to represent public finance scandals, with unique terms budget, tax, percent, billion and plan, while topic 8 looks like it pertains more to campaign finance, with unique terms soft, money, raising, earmarks and lobbyists. Topic 7 looks like it has to do with corporate scandals, leading with the admittedly pervasive enron term, but with largely unique terms accounting, stock, corporate, attorney, counsel, investigation, companies and justice [as in Department of...?] as well.

And finally the 2nd topic appears to have a lot of legislative factors in it, with terms like house, senate, lawmakers, ethics, committee, bill and parliament.

Conclusion

The results give a much less fine-grained view of scandals than I was expecting, either because of the sources (not enough articles devoted specifically to particular scandals? text not sufficiently preprocessed?) or the algorithm (wrong algorithm for the task? wrong settings?). Plus, it turns out there have been a lot of American political scandals in the last 20 years. Perhaps clearer patterns could be discerned by expanding the number of topics.
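
Refitting with a larger topic count is cheap to try, if you want to experiment (an untested sketch; 50 is an arbitrary choice):

np.random.seed(42)
bigger_model = models.lsimodel.LsiModel(corpus=tcorpus, id2word=dictionary, num_topics=50)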

The detected topics seem to have a lot of noise (for example, the presidents' names show up as key words in almost every topic), possibly due to the imbalance from some scandals being cited more frequently than others. But when I cut out the noise and tried to characterize the topics by the more infrequent key words, I was surprised by the topic clusters the model was actually able to detect, from international scandals to corporate scandals to Jersey scandals. I was unfortunately not able to detect the most recent set of scandals, but judging from this experiment, the good scandals seem to require a few years to age before there is enough data to detect them. Hopefully today's events will be easy enough to spot by rerunning this in a few months or years.

All in all, it was a fun exercise and a good reminder of the strong tradition of corruption we're part of.