MOTIVATION
The way we consume news and media reveals a lot about the society we live in, as well as about how styles of communication and writing change across location and time. Journalism, unlike other forms of writing, is relied upon to be objective and to present facts without an infusion of biases and opinions. In reality, however, we are intuitively aware that this is not the case. Journalism naturally carries the author's assumptions and biases within the text, visible not only through the words chosen but also through the very structure of the writing style.
This opens the door to examining many features, and many potential predictors, to learn what journalism and news media reveal about both the writers and their readers. This is useful both as an analytical tool and as a set of predictors that can help businesses better understand a changing, diverse audience.
In addition to examining stylometry between media sources and across time, it would be interesting to see whether writing style can predict the gender of the author. This could be useful in the many cases where gender is not disclosed; if a company wants to target a specific gender, this would provide a way of inferring it purely from the text. Additionally, a topic of great concern within tech and other industries is the wide gender gap. It is claimed that this gap leads to products, designs, and policies being decided by one gender alone, potentially hindering progress by ignoring different modes of thinking. If a clear distinction can be found purely by analyzing political and world-news text, this would provide strong evidence that the genders do, in fact, observe the world in different ways, to the extent that the difference can be picked out even from writing that should ideally be objective and unconcerned with its author.
To investigate the stylometry of various news sites and the properties of their articles, several indicators of style were parsed out of the text. This feature analysis covered text properties such as mean word length, frequency of given words, and variance of sentence length, among other parameters (see Methods below for details). Data was obtained through web scraping as well as through available APIs.
METHODS
For this project, I web-scraped, cleaned, and structured data gathered from multiple media sites. I used the Python library BeautifulSoup to extract the relevant text from the HTML, and the TextBlob library to extract language features.
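As a rough illustration of the scraping step, here is a minimal sketch of pulling article text out of a page with requests and BeautifulSoup. The URL handling, the function name, and the simple h1/p selectors are assumptions for the example; each news site in the project needed its own selectors for headline, author, and body.

```python
import requests
from bs4 import BeautifulSoup


def scrape_article_text(url):
    """Fetch a page and pull out the headline and article paragraphs.

    Illustrative only: real sites need site-specific selectors and
    filtering of navigation, captions, and ads.
    """
    response = requests.get(url, timeout=10)
    response.raise_for_status()
    soup = BeautifulSoup(response.text, "html.parser")

    h1 = soup.find("h1")
    headline = h1.get_text(strip=True) if h1 else ""
    paragraphs = [p.get_text(strip=True) for p in soup.find_all("p")]
    return headline, " ".join(paragraphs)
```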
I then analyzed the data to discover and interpret trends, patterns, and relationships in writing across articles and between media sites, by engineering features that capture writing style.
After statistical analysis of the data, I trained an AdaBoost classifier to predict the gender of the author from these engineered features. You can input your own text and get your gender prediction by clicking on the 'Web App' tab above.
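The training step looks roughly like the sketch below, assuming a feature matrix X of per-article style features and gender labels y. The hyperparameters and the train/test split are illustrative, not the tuned values used in the project.

```python
from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split


def train_gender_classifier(X, y):
    """Fit an AdaBoost classifier on engineered style features.

    X: 2-D array of per-article feature vectors; y: author gender labels.
    Hyperparameters here are placeholders for the example.
    """
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=42, stratify=y
    )
    clf = AdaBoostClassifier(n_estimators=200, learning_rate=0.5, random_state=42)
    clf.fit(X_train, y_train)
    print("held-out accuracy:", accuracy_score(y_test, clf.predict(X_test)))
    return clf
```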
Project Pipeline
Engineered Features
Type token ratio • Mean word length • Mean sentence length • Standard deviation of sentence length • Frequency of commas • Frequency of semicolons • Frequency of exclamation marks • Frequency of question marks • Polarity • Subjectivity • etc ...
You can look through the source file "stylometry_analysis.py" (GitHub: Lexophilia) for more detail on how these features were extracted from the text. Most are counts of the frequencies of certain tokens. Polarity and subjectivity scores were computed through sentiment analysis using Python's TextBlob module, which looks up each word in a lexicon of polarity and subjectivity ratings and outputs an average score for each article. Polarity scores range from -1 (most negative) to +1 (most positive), with 0 as neutral; similarly, per-word subjectivity scores range from 0 (low subjectivity) to 1 (high subjectivity).
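Below is a simplified sketch of how a handful of these features can be computed; the feature names and exact normalizations here are illustrative, and the full implementation lives in stylometry_analysis.py in the repo.

```python
import numpy as np
from textblob import TextBlob


def extract_style_features(text):
    """Compute a subset of the stylometric features listed above."""
    blob = TextBlob(text)
    words = blob.words
    sentence_lengths = [len(s.words) for s in blob.sentences]
    n_chars = float(len(text))

    return {
        "type_token_ratio": len(set(w.lower() for w in words)) / float(len(words)),
        "mean_word_length": np.mean([len(w) for w in words]),
        "mean_sentence_length": np.mean(sentence_lengths),
        "std_sentence_length": np.std(sentence_lengths),
        "commas_per_char": text.count(",") / n_chars,
        "semicolons_per_char": text.count(";") / n_chars,
        "exclamations_per_char": text.count("!") / n_chars,
        "questions_per_char": text.count("?") / n_chars,
        "polarity": blob.sentiment.polarity,          # -1 (negative) to +1 (positive)
        "subjectivity": blob.sentiment.subjectivity,  # 0 (objective) to 1 (subjective)
    }
```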
Differences between News Sites
I compiled data from several different news sites, including the ones visualized below. To compare them, I took each site's articles from the last two years and compared writing style within that timeframe. Many features differed clearly and strongly between sites, yet appeared to remain more or less stable within the observed time frame.
Each news site seemed to have its own characteristic stylistic parameters, which were generally static throughout the observed time frame. The Opinion sections displayed the most variability, while the Politics/World sections were more stable. BuzzFeed News appears to be rising in many of the features, potentially in an attempt to move from an informal, "click-bait" style toward a more conventional news site (see visualizations below for more information).
Differences between Genders
The gender ratio was relatively similar across the different news sites, though Breitbart and Slate had slightly higher proportions of men. For this reason, I removed them from the exploratory gender analysis and the predictive training set, to ensure I was modeling gender itself and not implicit characteristics of a specific news site.
Female and male authors displayed significant differences across many stylistic features; a few examples are shown below. These differences remained more or less consistent throughout the explored six-year time frame.
STYLOMETRIC DIFFERENCES BETWEEN NEWS SITES
STYLOMETRIC DIFFERENCES BETWEEN GENDERS
MOST COMMON COUNTRIES IN HEADLINES
I examined the most common countries that appeared in each news site's headlines from 2014 to 2017. Israel, Syria, and Russia appeared among the top seven most-mentioned countries for every news site.
In addition to counting the most common countries mentioned in article headlines, I also investigated how the countries relate to one another. Below you can see the connections between countries that appeared together in the same headline.
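A minimal sketch of the underlying counting, under the assumption that country names are matched as substrings of the headline: mention counts feed the "most common countries" ranking, and pair counts give the co-occurrence connections. The short COUNTRIES list and the function name are placeholders; the analysis used a full list of country names.

```python
from collections import Counter
from itertools import combinations

# Illustrative subset; the real analysis used a complete country list.
COUNTRIES = ["Israel", "Syria", "Russia", "Iran", "China", "Ukraine", "France"]


def country_mentions_and_pairs(headlines):
    """Count country mentions and co-occurring country pairs in headlines."""
    mention_counts = Counter()
    pair_counts = Counter()
    for headline in headlines:
        found = sorted({c for c in COUNTRIES if c.lower() in headline.lower()})
        mention_counts.update(found)
        pair_counts.update(combinations(found, 2))
    return mention_counts, pair_counts
```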
WEB APP
To explore how the algorithm makes its predictions, you can test the web app with your own text here: Gender Prediction Web App