Language and Representation

From UBC Wiki

About the writers

COGS200 Group 11: Joshua Grant, Gina Hong, Rafael Paterno, Melon Oh


Why did you choose your career? Some people choose after being inspired by a family member, a teacher or a course. Some choose based on a fleeting opportunity that seems too good to pass up. Many people, though, seem to choose after being exposed to a career in a movie or television show. The media seems to play an increasingly important role in exposing young people to opportunities. But how does the diversity of people that appear in particular media roles affect future career choices?

The research was partly inspired by the increasing number of female doctors portrayed in media, which coincided with the increasing proportion of female medical school graduates. The three stills from different medical TV shows illustrate how the representation of women as doctors increased over time.

A TV show centered on doctors that ran from 1969 to 1973.
Trapper John MD, a TV show that ran from 1979 to 1986.
A TV show centered on doctors that ran from 1994 to 2009.
The graph shows that as time passes, the distribution becomes fairly even between male and female med school graduates.

Project Aim and Hypothesis

Our aim is to evaluate the role and strength of language in pop culture as an attractor for societal gender roles and gender perception, specifically with regard to occupation choice. We hypothesize that increased usage of gender-neutral profession terms and increased female representation in particular professions in popular media will correlate with an increase in the number of women in the associated professions (e.g. doctors, police officers). We also hypothesize that the effect will have a predictable delay as young viewers grow up and older viewers train for a new career. This would give a numerical basis to a common belief: that certain kinds of representation and language can influence our perceptions regarding diversity and opportunities.

This is a project in the vein of culturomics. To get meaningful data, we’ll construct a large corpus using algorithms that keep track of not just what is said in the given pop culture texts, but also who is saying it, who they are referring to, and other related data. This will allow us to quantify cultural change and correlate it to various media criteria.

It is also important to note that, due to the observational nature of this research, we cannot argue for a direct causal link between language in media and occupational choices. However, using a corpus to track language changes in popular media may uncover patterns or trends that suggest topics for further research.

Potential Contribution of the Research

This research can potentially contribute to the dialogue surrounding the effects of gender-neutral language, especially regarding how influential language and representation can be in shaping our perception. The importance of quantitative research is noted by Parks and Roberton (2005) in “Explaining Age and Gender Effects on Attitudes toward Sexist Language”. They found that cognitive arguments, rather than emotional arguments, better appealed to certain demographics in how they perceived the importance of gender-neutral language. If our hypothesis holds true, this research may encourage dialogue between those who believe in the importance of gender-neutral language and representation and those who are more skeptical of the extent of its influence.

Analysis of prior research in relevant fields, such as the link between media and gender, the application of dynamic states, and the application of n-grams, also revealed certain gaps in knowledge that our research aims to fill. These points will be addressed in the sections below.

Prior research on link between media and gender

Notable links between popular media and gender have been explored extensively in various fields. For example, research conducted at Morgan State University by A. L. Smith studied the effects of media consumption on the formation of adults' expectations about topics such as gender norms, relationship dynamics, and sexual conduct. In Smith's (2014) study, “Pop culture vs. Rape Culture: The media’s impact on the attitudes towards women”, moderate correlations were found between listening to sexually aggressive music lyrics and people's outlook on certain scenarios and phrases. This may indicate either a preference for certain music based on a person's existing world views or a causal relationship in which media consumption shifted gender expectations. The researcher was unsure which factor was causal but concluded that the two factors were, in some way, linked to each other.

In another study titled “Sexualised music media and children’s gender role and self-identity development: a four-phase study,” researchers at the University of South Australia examined the effects of sexualized music on developing teens and children. They found that children are influenced by sexualized media, which can shape their views of self-identity, gender roles, and judgements based on another's appearance. The paper suggested that the pervasive influence of mass media contributed to the creation of a sexualized environment in which children grow and develop. This sexualized environment can then influence their cognitive development such that they mature expecting genders to fit into the framework that popular media has laid down.

There has also been prior research related to language usage and gender perception. A research paper written by Alison Lenton and a team of psychologists at the University of Edinburgh investigates the role of linguistic abstraction (terms becoming representative of concepts away from the objects they were originally attached to) in affecting people's perception of gender. The team used latent semantic analysis, a mathematical tool which calculates the degree of similarity in meaning between two words. They found that gendered language carries stereotypical information with it. The paper concluded that gender stereotypes exist within the most common forms of categorical referents for men and women. Furthermore, they found a high degree of similarity between these categorical gender referents and certain words: dietician, nanny, and nurse were found to be feminine, while farmer, physicist, and soldier were found to be masculine.

Not only are professions linked to certain genders, but the connotations surrounding a profession are associated with the linked gender too. In a research article titled “Does Gender-Fair Language Pay Off? The Social Perception of Professions from a Cross-Linguistic Perspective,” researchers from the University of Bern investigated how traits associated with a profession can be transferred to the gender associated with that profession. For occupations whose workforce is predominantly male, such as political leaders, men were associated with the traits of that occupation.

Prior applications of dynamic systems and attractors in research

In his book “The dynamics and evolution of social systems: new foundations of a mathematical sociology,” Jürgen Klüver (2000) writes about the algorithmic complexity of sociological systems and argues that in such complex systems, self-organization is a recurring stable state. Using a computer program designed to model a differentiating society, Klüver found that the program settled into a classist stable state. He also noted that changing the system dynamics and introducing new attractor basins was possible, although only likely if the stimuli that enacted these changes varied enough of the individual values or ‘people’ within the program.

Building on this use of dynamic systems, stable states, and attractor basins, we will apply these concepts to our own research. We hypothesize that language usage in popular media acts as an attractor state on a societal level, such that there is a change in the representation of certain professions.

The two studies, “Pop culture vs. Rape Culture: The media’s impact on the attitudes towards women” and “Sexualised music media and children’s gender role and self-identity development: a four-phase study”, have explored a correlational link between media consumption and the psychological development of mental biases and expectations with regard to gender roles. These findings, in combination with the dynamic systems model and the concept of attractor states used in various scientific fields, guided us towards our broad topic: evaluating the role and strength of language in pop culture as an attractor for societal gender roles and gender perception.

Although these studies provide valuable insight into how the psychological development of biases can be influenced by exposure to certain content in popular media, few prior studies focus specifically on trends of language change in pop culture and how such changes both influence and are influenced by societal trends like the proportion of women in a certain job field. These gaps in prior research guided us in formulating our more specific main hypothesis: increased usage of gender-neutral profession terms and female representation in particular professions in popular media will correlate with an increase in the number of women in associated professions. Our proposal is designed to fill these gaps by tracking language trends with a movie/TV-show corpus and comparing this data to the changes in gender proportion for several professions.

Prior applications of n-grams in research

Other relevant prior research includes a study conducted by Amaç Herdağdelen (2013) that quantified language usage over time. This was done by measuring the usage of n-grams against the demographic information of Twitter users. The study used phrase detection heuristics and computed how often a phrase was mentioned by a male or female user. This number was then used to associate certain phrases with the gender that more frequently performs a given action or says a given phrase. The study found that certain phrases such as “become a nurse” were far more associated with one gender (in this case, women). This research provided us with an example of how n-grams could be used in tandem with demographic data to produce quantitative data on language patterns and trends.


Methodology Overview:

The research will create a corpus from the top 50 North American box-office grossing movies of each year from 1960 to 2010. The corpus will track data such as the percentage of speaking time by gender, the terms used for occupations, and the proportion of positive and negative adjectives applied to certain professions and genders. These language trends and representations in movies will be tracked for several professions and compared to the trend in gender proportion in each profession. The potential results of this data tracking are addressed later in the discussion section.

The data amassed in the corpus and the occupational data will be analyzed through a dynamic state framework, and further explained by prior research on linguistic relativity.

About the Corpus

We will construct a large corpus of annotated movie scripts. These will track data including:

  • Speaking roles and character occupations by gender.
  • Percentage of speaking time by gender.
  • Mentions of people by professional terms, by gender of the referent.
  • The terms used for occupations.
  • Adjectives applied to characters by gender and profession.

The data from each movie will then be scaled based on the popularity of the film (evaluated through its gross box office or ticket sales) and the number of lines per character in the movie. We may also wish to filter out period pieces, or films set in a fantasy or science fiction setting, since these films will use language that deviates from that of their production date. However, since contemporary films likely share certain linguistic sensibilities regardless of when they are set, it would also be interesting to graph these movies on a separate scale. This could be a topic for further research or discussion.
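The scaling step described above can be read as a weighted average. The following is a minimal sketch under that assumption; the `FilmStats` fields, the film labels, and all figures are hypothetical placeholders rather than values from the planned corpus.

```python
from dataclasses import dataclass


@dataclass
class FilmStats:
    title: str           # placeholder label, not a real film
    gross_musd: float    # box-office gross in millions of dollars (made-up value)
    female_lines: int    # lines spoken by female characters
    total_lines: int     # all lines in the script


def weighted_female_line_share(films):
    """Gross-weighted mean of each film's share of lines spoken by women."""
    total_weight = sum(f.gross_musd for f in films)
    if total_weight <= 0:
        raise ValueError("total gross must be positive")
    weighted_sum = sum(
        f.gross_musd * (f.female_lines / f.total_lines) for f in films
    )
    return weighted_sum / total_weight


films = [
    FilmStats("Film A", 300.0, 200, 1000),  # 20% of lines spoken by women
    FilmStats("Film B", 100.0, 600, 1000),  # 60% of lines spoken by women
]
share = weighted_female_line_share(films)
print(round(share, 3))
```

Weighting by gross is only one possible choice; ticket sales or per-character line counts could be substituted as the weight without changing the structure of the calculation.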

Why a corpus?

In order to draw any reasonable conclusions from our data set, we will need to process a sizeable amount of data. Since most movies average about 120 minutes, going through all movies manually would mean analyzing roughly 2,500 movies (50 per year for five decades), or about 300,000 minutes of footage. Given the variability and high chance of error involved in manual data parsing done by multiple people, developing a corpus to track this data will ensure a more consistent data set. In addition, the developed corpus could be applied to other research in fields such as film studies, or to further linguistic research that involves mainstream media.

Testing the parsing program and areas of potential errors

The parsing program will be tested on multiple smaller data sets of approximately 500 words each for quality control. This will enable us to evaluate the error rate of the program and improve it before running it on the ~2,500 scripts that we will be feeding it.
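As a rough illustration of this quality-control step, the error rate could be estimated by comparing the parser's tags with hand-annotated tags for the same sample. This is a sketch only; the tag format (`OCCUPATION:GENDER`) and the sample values are assumptions made for the example.

```python
def tagging_error_rate(predicted, gold):
    """Fraction of items where the parser's tag differs from the hand annotation."""
    if len(predicted) != len(gold):
        raise ValueError("predicted and gold must have the same length")
    if not gold:
        raise ValueError("cannot compute an error rate on an empty sample")
    mismatches = sum(p != g for p, g in zip(predicted, gold))
    return mismatches / len(gold)


# Hypothetical tags for four occupation mentions in a small test script.
gold = ["DOCTOR:F", "NURSE:F", "LAWYER:M", "OFFICER:M"]
predicted = ["DOCTOR:F", "NURSE:F", "LAWYER:F", "OFFICER:M"]

rate = tagging_error_rate(predicted, gold)
print(rate)
```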

Output of the parsing program for statistical analysis

Because of the variety of data we wish to track through our corpus, the parsing program (responsible for constructing the corpus) will have to be smart enough to tag data non-linearly, meaning it will need the ability to do some form of semantic analysis. This will involve incorporating coherence relations and similar heuristics into our parsing algorithm in order to output more accurate data, though we do expect some errors that may need to be corrected by hand or ignored as noise.

For example, to parse adjectives applied to characters we can use semantic analysis to find the similarity between any two words. This linguistic method measures the degree of similarity between two words by turning the words into mathematical vectors. It then calculates the similarity value of a given word pair (e.g. lawyer & man). This numerical value can then be compared to the similarity values of other word pairs to gain an understanding of how similar words are in relation to other words. Below are examples of possible words we can pair with ‘man/woman’:

  • Clerk
  • Lawyer
  • Doctor
  • Designer
  • Affectionate
  • Caring
  • Warm
  • Adaptable
  • Earnest
  • Happy
Annotated Legally Blonde script
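The word-pair comparison described above is commonly computed as the cosine similarity between two word vectors. The sketch below assumes that measure; the three-dimensional vectors are arbitrary toy numbers chosen only to make the arithmetic visible, not values derived from any real corpus.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)


# Toy vectors (assumed for illustration only).
toy_vectors = {
    "man": [1.0, 0.0, 1.0],
    "woman": [0.0, 1.0, 1.0],
    "lawyer": [2.0, 1.0, 0.0],
}

sim_lawyer_man = cosine_similarity(toy_vectors["lawyer"], toy_vectors["man"])
sim_lawyer_woman = cosine_similarity(toy_vectors["lawyer"], toy_vectors["woman"])
print(sim_lawyer_man, sim_lawyer_woman)
```

In a real analysis the vectors would come from a semantic model trained on the corpus, and the pairs of interest would be each occupation or adjective against ‘man’ and ‘woman’.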

We will also obtain data on the proportion of women in various occupations by year. This data can be plotted against the media data from our corpus as an (imperfect) metric for real world progress. This could tell us to what degree media representation plays the role of an attractor to increase female participation in certain careers, how real life participation increases media representation, and if there are identifiable patterns in specific fields.

Since it’s not clear what the best predictor of state change will be, we’ll make sure that the corpus is filterable by a variety of criteria in order to give us some control in assessing the strength of each variable. Hence, we will be able to sort and filter data by year, genre, percentage of female cast members, percentage of female dialogue, etc.
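One simple way to make the corpus filterable by these criteria is to store one record per film and filter on its fields. This is a sketch under that assumption; the `FilmRecord` fields and the sample records are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class FilmRecord:
    title: str                  # placeholder label, not a real film
    year: int
    genre: str
    pct_female_cast: float      # percentage of cast members who are women
    pct_female_dialogue: float  # percentage of dialogue spoken by women


def filter_films(records, *, year_range=None, genre=None, min_female_dialogue=None):
    """Return the records that satisfy every criterion that was supplied."""
    selected = []
    for r in records:
        if year_range is not None and not (year_range[0] <= r.year <= year_range[1]):
            continue
        if genre is not None and r.genre != genre:
            continue
        if min_female_dialogue is not None and r.pct_female_dialogue < min_female_dialogue:
            continue
        selected.append(r)
    return selected


records = [
    FilmRecord("Film A", 1965, "drama", 20.0, 15.0),
    FilmRecord("Film B", 1995, "drama", 45.0, 40.0),
    FilmRecord("Film C", 1998, "comedy", 50.0, 55.0),
]

nineties_drama = filter_films(records, year_range=(1990, 1999), genre="drama")
high_dialogue = sorted(
    filter_films(records, min_female_dialogue=30.0), key=lambda r: r.year
)
print([r.title for r in nineties_drama], [r.title for r in high_dialogue])
```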

This large corpus can then be used for statistical analysis. We can compare proportions of women in various occupations vs. women portrayed in those occupations on screen, weighted for screen time, prominence and many other factors. This will help suggest possible correlations between media prominence and increased professional participation. Since we’ll be dealing with a large amount of data, we’ll use data visualization tools to explore possible correlations.

The data should be capable of telling us what’s changing and how much, but in order for this to be useful we should speculate on why it’s changing. From there, we can draw on insights from psychology to interpret our findings and draw larger conclusions between media and career choice for women.

Improving the Corpus

We may also find that there are opportunities to use machine learning to make even smarter algorithms. This has been attempted on similar topics. During the course of our research, researchers at the University of Washington released a tool that compares sentence structure used by female and male characters to quantify power imbalance. This project has a large database, but tends to focus on specific insights between popular movies rather than overall linguistic trends, and doesn't attempt to draw comparisons with real-world workforce participation.

Dynamic States as an Analytical Framework

In order to analyze the data retrieved from the corpus and relate it to how the consumption of media may influence a large audience, we will use the mathematical model described in Klüver’s book, “The Dynamics and Evolution of Social Systems.” Klüver’s model describes the trajectory of a system based on the strengths of applied attractor basins.

Formula for dynamic systems theory.


  • A is an attractor/point attractor
  • Z1 is the initial state of the system
  • f is the system function and n represents the number of times to apply f to itself

This equation states that the trajectory of a system depends on the three variables mentioned above. In our experiment, we can use these variables to mathematically quantify the possible influence that changing language usage in mainstream media has on the gender proportion within an occupation.
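To make the role of the three variables concrete, the sketch below assumes the relation takes the familiar point-attractor form f^n(Z1) = A, meaning that applying f repeatedly to the initial state eventually reaches (or approaches) A. This form is an assumption based on the symbol definitions above, and `toy_system` is a hypothetical system function invented for the example, not Klüver's model.

```python
def iterate(f, z1, n):
    """Apply f to z1 n times; return the trajectory [z1, f(z1), ..., f^n(z1)]."""
    trajectory = [z1]
    for _ in range(n):
        trajectory.append(f(trajectory[-1]))
    return trajectory


def toy_system(z):
    # Hypothetical dynamics: each step closes half of the gap between the
    # current state and 0.5, so 0.5 acts as a point attractor.
    return z + 0.5 * (0.5 - z)


# Z1 = 0.1 could stand for, say, a 10% proportion of women in a profession.
trajectory = iterate(toy_system, 0.1, 30)
print(trajectory[:4], trajectory[-1])
```

Here the state moves from 0.1 towards 0.5 and stays there once reached, which is what it means for 0.5 to be a point attractor of this toy function.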

Linguistic Relativity to Explain the Data

Understanding how language is associated with our perception of the world, both in processing information and in developing psychological biases, is also an integral part of our investigation. One of the most discussed ideas in cognitive linguistics is Linguistic Relativity, more popularly known as the Sapir-Whorf hypothesis. It is divided into two versions. The weak version argues that differences in the structure of languages can affect thought and influence our perception. The strong version, on the other hand, argues that the structure of a language determines thought.

Although there have been several arguments against linguistic relativity in the past, recently there has been an influx of studies arguing that some parts of the Sapir-Whorf hypothesis hold true. One example is Boroditsky’s (2001) study, which found that Mandarin speakers perceived time in a vertical sense rather than horizontally like English speakers. The study concluded that while native language does affect the formation of one’s “habitual thought”, the degree of such influence is weaker than what is proposed by the strong Whorfian argument. Another recent study is Regier and Kay’s (2009) “Language, thought, and color: Whorf was half right”, in which the researchers found that different semantic domains for colors in languages do influence color perception to an extent. We will use the framework of these prior studies in analyzing any potential trends we see in the data we amass.

Today, the development of language is rapidly influenced by many factors such as social media, technology, and pop culture. Our research focuses specifically on the impact of pop culture on language, which shapes how people identify themselves and how they communicate with others. Language use has also changed how society portrays women’s roles and identities over the years. In earlier decades, women rarely played main characters and did not have large parts in movies and television shows. Today, women star in many movies as main characters in professions such as doctor or lawyer, roles that were previously played mostly by men. As stated in our hypothesis, we predict that as time passes, an increasing number of women will enter more “dominant” careers when women in film and television productions are shown in “masculine” roles.

As mentioned in the introduction, studies have shown a relationship between the consumption of media and the ways that each gender's role is perceived and portrayed in society.


The question of to what degree diverse media casting choices impact career aspirations and development for women and girls is a crucial and much-discussed one; however, it’s difficult to quantify. We thought that a data-focused approach to the language and choices of major film and television media could help identify how, and how much, these choices might impact (or mirror) social and psychological realities.

We anticipate that there will be some variance from field to field and genre to genre, but that media representation will be a strong predictor of future workforce participation. There is also likely to be a feedback effect: as women become more represented in the workplace, they’re likely to appear more often in contemporary films. It may also be interesting to see how the portrayal of women in fantasy and period pieces changes with the decades. Warrior princess may not be a real job, but it has arisen alongside more traditional (masculine) fantasy roles as an extremely popular trope in recent decades.

This project offers a way to quantify the effects of diverse casting on future professional participation. It may help governments and activist groups interested in promoting equal participation in professions to make data-driven decisions about content rules. It’s easy to argue against feelings, instincts and experiences, but difficult to argue against numbers.

The research might also cast doubt on the effects of certain types of representation on driving participation. Are there certain professions which have had a greater pull towards equality following an increase in female representation in film or television? A lesser pull? Is portraying more women a more powerful attractor than portraying a few higher-power women? Is there a quantifiable “lag” between an important performance and a boost in participation?

Dealing with Data

While dealing with our data analysis, we will have to be careful to keep in mind how other factors influence both the media data and workforce data. Film representation may be an important influence, but there are likely more powerful influences in specific instances. We may still be able to draw some conclusions regarding specific trends since film has been ubiquitous in pop culture throughout the 20th and 21st centuries.

Data may have to be compared in aggregate; we may find, for instance, that we get the most predictable results from combining weighted data for the proportion of female lines in a script and the proportion of women portrayed in each occupation. The corpus we construct should make it easy to perform numerous queries and enable us to explore the data to find the best fits. We can evaluate our chosen explanatory variables by seeing how well they conform to the following model: media representation precedes a predictable bump in workforce representation.
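One way to check whether a media series precedes a workforce series by a predictable delay is to correlate the two series at several candidate lags and see which lag fits best. The sketch below assumes Pearson correlation as the fit measure; the yearly proportions are invented toy data in which the workforce series simply repeats the media series two years later.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) * (x - mean_x) for x in xs)
    var_y = sum((y - mean_y) * (y - mean_y) for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


def best_lag(media, workforce, max_lag):
    """Return (lag, r) with the highest correlation of media[t] vs workforce[t + lag]."""
    best = None
    for lag in range(max_lag + 1):
        xs = media[: len(media) - lag]   # media values that have a partner `lag` years later
        ys = workforce[lag:]             # workforce values shifted back by `lag` years
        r = pearson_r(xs, ys)
        if best is None or r > best[1]:
            best = (lag, r)
    return best


# Toy yearly proportions (assumed): workforce repeats media with a 2-year delay.
media = [0.10, 0.10, 0.15, 0.30, 0.35, 0.35, 0.36, 0.36, 0.37, 0.37]
workforce = [0.10, 0.10, 0.10, 0.10, 0.15, 0.30, 0.35, 0.35, 0.36, 0.36]

lag, r = best_lag(media, workforce, max_lag=4)
print(lag, round(r, 3))
```

A high correlation at some lag would still only be a correlation; it would suggest a delay worth studying rather than establish that media caused the change.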

Expected Results

It’s clear that media and real life influence each other, so we expect to see some feedback in many data sets. What’s most interesting is how and how much. By plotting data from our corpus against a real life outcome, we may be able to take the first steps in quantifying this interaction.

We expect our findings to fall into one of these four main categories.

  1. No discernible pattern or correlation
  2. Simple feedback from media portrayal to job occupation
  3. Nuanced feedback from media portrayal to job occupation
  4. Feedback from job occupation to media portrayal

The graphs on the right-hand side represent what one of our data points from the corpus, plotted against the gender proportion in a certain profession, may look like. Note that the dotted line could represent any of the potential language trends we may find from our corpus through semantic analysis. Among these four potential results, we predict that the third option, "nuanced feedback", is the most likely one.

No discernible pattern or correlation

A “noise” situation where we cannot identify any significant trends. This would indicate little to no correlation between language usage trends tracked from the corpus and the proportion of females in the particular occupation.

Sample graph of no discernible pattern or correlation. For discussion of possible results in COGS 200 group 11 proposal.

Simple feedback from media portrayal to job occupation

The graph indicates an increase in media representation that is directly replicated in job occupation proportion after ~10 years.

Sample graph of simple feedback (media portrayal influence real life). For discussion of possible results in COGS 200 group 11 proposal.

Nuanced feedback from media portrayal to job occupation

A more nuanced “feedback” situation where workforce representation leads to a pronounced bump in media representation which then amplifies the workforce representation after a few years (or vice-versa).

Sample graph of nuanced feedback (media portrayal influence real life). For discussion of possible results in COGS 200 group 11 proposal.

Life Imitating Art or Art Imitating Life?

Feedback from job occupation to media portrayal. This is an example where increased media representation follows, or is caused by, an increase in the real-life representation of women in certain professions.

Sample graph of opposite feedback (real life impacting media portrayal). For discussion of possible results in COGS 200 group 11 proposal.


From this project, we learned how to integrate the fields of psychology, computer science, and linguistics to engage with information learned in class and gleaned from outside research. Researching and exploring related phenomena made us more engaged with the material, which helped it become more interesting and meaningful. Of particular interest is the work we have done in dynamic systems theory, which can help us understand the significance of attractors in real-life situations. In general, we’ve explored how we can draw on multidisciplinary insights to develop a model to make sense of (or, at least, better understand) complicated, dynamic subject matter.

Sifting through dozens upon dozens of articles taught us about the statistical, mathematical, and logical aspects of research in linguistics, psychology and computer science. We learned about the Sapir-Whorf hypothesis and the connection between spoken language, thought and perception. We explored interdisciplinary topics such as how mathematical models like semantic analysis and dynamic states can be applied to solve problems in linguistics. Most importantly, we’ve gained a greater understanding of how language can influence our thoughts and actions.

We think that there are a lot of interesting conclusions that can be drawn from corpuses of readily available data that can be digitally represented and judiciously analyzed. Our project may be a more obvious example, but during the process we’ve come up with many more data sets that could yield interesting results from such a treatment. Similar projects could be done for other diversity studies (such as race, immigration status, or sexuality) or less related fields.

Of course, any correlations found within these data sets, or between these and similar data sets, are just that: correlations. Some data that we gather may be misleading or anomalous, and none of it proves a causal relationship. However, in a large enough sample with a good enough algorithm, trends in data may point to possible causal mechanisms for further study.


Boroditsky, L. (2001). Does Language Shape Thought?: Mandarin and English Speakers Conceptions of Time. Cognitive Psychology, 43(1), 1-22. doi:10.1006/cogp.2001.0748

Ey, L. (2016). Sexualised music media and children’s gender role and self-identity development: a four-phase study. Sex Education, 16(6), 634-648. doi:10.1080/14681811.2016.1162148

Herdağdelen, A. (2013). Twitter n-gram corpus with demographic metadata. Language Resources and Evaluation, 47(4), 1127-1147. doi:10.1007/s10579-013-9227-2

Horvath, L. K., Merkel, E. F., Maass, A., & Sczesny, S. (2016). Does Gender-Fair Language Pay Off? The Social Perception of Professions from a Cross-Linguistic Perspective. Frontiers in Psychology, 6. doi:10.3389/fpsyg.2015.02018

Klüver, J. (2000). The dynamics and evolution of social systems: new foundations of a mathematical sociology. Dordrecht: Kluwer Academic.

Langston, J. (2017, November 13). New tool quantifies power imbalance between female and male characters in Hollywood movie scripts. Retrieved November 28, 2017, from

Lenton, A. P., Sedikides, C., & Bruder, M. (2009). A latent semantic analysis of gender stereotype-consistency and narrowness in American English. Sex Roles, 60, 269-278.

Parks, J. B., & Roberton, M. A. (2005). Explaining Age and Gender Effects on Attitudes toward Sexist Language. Journal of Language and Social Psychology, 24(4), 401-411. doi:10.1177/0261927x05281427

Regier, T., & Kay, P. (2009). Language, thought, and color: Whorf was half right. Trends in Cognitive Sciences, 13(10), 439-446. doi:10.1016/j.tics.2009.07.001

Smith, A. L. (2014). "Pop culture v. rape culture: The media's impact on the attitudes towards women" (Order No. 1560042). Available from ProQuest Dissertations & Theses Global. (1556118286). Retrieved from

Yule, G. (2014). The study of language. Cambridge: Cambridge University Press.