By Alison Strobel, Monica Miller, Zoe Wake Hyde, Josh Oliveira and Alice Fleerackers


We at Team DAT!Analysis want to gain a deeper understanding of the content generated by SFU’s Master of Publishing students. We are curious how—or perhaps whether—MPub writing has transformed throughout the years to reflect changing societal and industry trends. We want to know how events such as the rise of social media or the financial crisis of 2008 have altered the kind of things we write about as publishing students and the way we write about them.

We are also curious how the kind of writing published on TKBR differs from that published in SFU’s Summit Research Repository. Are the TKBR posts more blog-like? Are they more closely tied to current events? How does the sentence length in TKBR posts compare to the Project Reports? The TKBR posts also present a unique opportunity to conduct some more detailed analysis using the associated metadata. Does the time of day published affect the sentence length of posts? Are the tags students assign truly effective descriptions of their essay topics? What factors (if any) influence the number of links and sources used throughout a post?

These are just some of the questions we might investigate over the course of our analysis. We hope that the results of our investigation will provide greater insight into what is being written in the publishing program and why—findings that could be used to inform acquisition strategy and site organization for the forthcoming Book of MPub 2016.


To conduct our analysis, our team will have to execute the following steps. For efficiency’s sake, these steps can be executed concurrently (using sample data) by different team members.

Collect & Extract

We will download the past three years of Technology essays from TKBR, either by using the WP JSON API or by writing an SQL script to collect them directly from the database. Josh and Alison will take the lead on the TKBR scrape.
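If we take the API route, the collection step might look roughly like the sketch below. It assumes the site exposes the WordPress REST API v2 `posts` route with its `per_page`, `page`, and `after` parameters and the `X-WP-TotalPages` response header; the JSON API plugin actually installed on TKBR may differ. The base URL, the cutoff date, and all function names are hypothetical.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Hypothetical address: the real TKBR URL is not given in this proposal.
BASE_URL = "https://example.com/wp-json/wp/v2/posts"


def build_posts_url(base_url, page, after_iso, per_page=100):
    """Return the URL for one page of posts published after `after_iso`."""
    query = urlencode({"per_page": per_page, "page": page, "after": after_iso})
    return f"{base_url}?{query}"


def simplify_post(post):
    """Keep only the fields we expect to need from one post object."""
    return {
        "id": post["id"],
        "date": post["date"],
        "title": post["title"]["rendered"],
        "html": post["content"]["rendered"],
    }


def fetch_posts(base_url, after_iso):
    """Page through the API (needs network access; not run here)."""
    posts, page, total_pages = [], 1, 1
    while page <= total_pages:
        with urlopen(build_posts_url(base_url, page, after_iso)) as response:
            # Assumption: the server reports the page count in this header.
            total_pages = int(response.headers.get("X-WP-TotalPages", "1"))
            batch = json.load(response)
        posts.extend(simplify_post(p) for p in batch)
        page += 1
    return posts
```

Calling `fetch_posts(BASE_URL, "2013-03-01T00:00:00")` would return a list of small dictionaries that can be saved as raw data before any cleaning.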

At the same time, we will collect the 297 PDFs in Summit Research Repository’s Publishing Program – Theses, Dissertations, and other Required Graduate Degree Essays Collection. We will need to write a web scraping script with Python to locate and download the PDFs. We will then use another Python script to convert them to text files for analysis. Zoe will be heading this step.
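The link-finding half of the scraping script could be sketched as follows, using only the Python standard library. This is an illustration under assumptions, not the script we will necessarily write: the class and function names are hypothetical, and we have not yet confirmed how Summit's collection pages link to their PDFs.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class PdfLinkParser(HTMLParser):
    """Collect absolute URLs of every <a href> that points at a .pdf file."""

    def __init__(self, page_url):
        super().__init__()
        self.page_url = page_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Ignore any query string when checking the file extension.
        if href and href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.page_url, href))


def find_pdf_links(html, page_url):
    """Return the PDF links found in one page of HTML, in document order."""
    parser = PdfLinkParser(page_url)
    parser.feed(html)
    return parser.pdf_urls
```

Each returned URL can then be downloaded and saved to disk. Converting the saved PDFs to text needs a separate PDF library, which is not shown here.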


While collection is occurring, Monica will investigate how to clean up the data using a small sample of WordPress posts. She will use OpenRefine (formerly Google Refine) to conduct the cleanup of the TKBR essays.
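OpenRefine is the tool named for this step, so the snippet below is only a sketch of the kind of cleanup the posts are likely to need: removing HTML tags, decoding entities such as `&amp;`, and collapsing whitespace. It assumes post bodies arrive as HTML strings; the names are hypothetical.

```python
import re
from html.parser import HTMLParser

# Tags that should act as word boundaries once the markup is removed.
BLOCK_TAGS = {"p", "div", "br", "li", "blockquote",
              "h1", "h2", "h3", "h4", "h5", "h6"}


class TextExtractor(HTMLParser):
    """Gather visible text, skipping the contents of <script> and <style>."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in ("script", "style"):
            self._skip_depth += 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in ("script", "style") and self._skip_depth:
            self._skip_depth -= 1
        elif tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        if not self._skip_depth:
            self.parts.append(data)


def clean_post_html(html):
    """Return plain text with tags removed and whitespace collapsed."""
    extractor = TextExtractor()
    extractor.feed(html)
    extractor.close()  # flush any text still buffered by the parser
    return re.sub(r"\s+", " ", "".join(extractor.parts)).strip()
```

Running a small sample of posts through a function like this would show which problems remain for manual cleanup.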


Once the data has been cleaned, we will use MALLET (MAchine Learning for LanguagE Toolkit) to conduct our analysis. We will identify common topics and examine their evolution over time. We will conduct analyses both between and across our two data sets (i.e. compare TKBR to Project Report data and investigate overall trends within MPub writing in general). Alice will take the lead on exploring MALLET’s capabilities.
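One way to examine topic evolution over time is to average each topic's share of every document by publication year. The sketch below assumes the per-document topic proportions have already been exported from MALLET and paired with a year; that input shape and the function name are our assumptions, not MALLET output as such.

```python
from collections import defaultdict


def mean_topic_weights_by_year(documents):
    """Average topic proportions per year.

    `documents` is an iterable of (year, [proportion per topic]) pairs.
    Returns {year: [mean proportion per topic]} with years in ascending order.
    """
    sums = {}
    counts = defaultdict(int)
    for year, weights in documents:
        if year not in sums:
            sums[year] = [0.0] * len(weights)
        for index, weight in enumerate(weights):
            sums[year][index] += weight
        counts[year] += 1
    return {
        year: [total / counts[year] for total in sums[year]]
        for year in sorted(sums)
    }
```

A topic whose yearly mean rises steadily would be a candidate for closer reading of the underlying essays.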


We plan to create some data visualizations of our findings using Tableau. This may include building word clouds, stacked bar charts, heat maps and more to chart topic frequency over time and visualize lists of terms that frequently occur together.
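Charting tools such as Tableau generally work best with one observation per row, so yearly topic weights may need reshaping before import. The sketch below assumes a mapping from year to a list of mean topic weights and writes it as long-format CSV text; the column names, topic labels, and function name are hypothetical.

```python
import csv
import io


def topic_weights_to_csv(weights_by_year, topic_labels):
    """Return CSV text with one row per (year, topic) pair."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["year", "topic", "mean_weight"])
    for year in sorted(weights_by_year):
        for label, weight in zip(topic_labels, weights_by_year[year]):
            writer.writerow([year, label, weight])
    return buffer.getvalue()
```

Saving the returned text to a `.csv` file gives a table that can be loaded as a data source for stacked bar charts or heat maps of topic frequency by year.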


We will document our work with frequent screenshots and notes, which we will later compile into a formal report. We will structure this final document like a scientific report, with Introduction, Purpose, Methods, Results, and Discussion sections. It will include our most useful data visualizations, outline key findings, and offer some context for understanding our results. For example, we may choose to compare topic evolution in our dataset with the evolution of the same terms on Google Ngram Viewer. At this point, we will also assess the usefulness of our analyses within the context of the Book of MPub 2016.


We will share our process and findings with the class in a fun but informative presentation.


  1. Web scraping script for locating and downloading PDFs from SFU’s Summit repository
  2. SQL script for extracting TKBR post text from the database OR the steps used to interact with the WordPress JSON API
  3. Complete data set: both cleaned and raw data
  4. List / detailed steps for cleaning data in OpenRefine
  5. Overview of MALLET analyses performed and results
  6. Tableau visualizations of key findings
  7. Final Report


  • Mar 18 Proposal due (Alice)
  • Mar 21 Initial research due (all) and data collected (Josh & Alison, Zoe)
    • Each member has explored their assigned tool
    • We will use class to fill each other in on our findings
    • Data is collected and ready to clean
  • Mar 24 Data extracted and cleaned (Monica/all)
    • Data is ready to start MALLET analysis
    • Team has compiled list of questions to investigate
  • Mar 29 Data analysis due (Alice/all)
    • Team meets to discuss findings and conduct data visualizations
    • Start working on final report
  • April 4 Pre-final deliverables due
    • Start working on presentation
    • Revise and finesse report
  • April 11 Final document and presentation due