Overview


Methods

For this project, we began by gathering .txt files of the scripts for The Breakfast Club, Sixteen Candles, and Ferris Bueller's Day Off. Each team member then took a screenplay and marked it up (mostly with Regular Expressions) into a clean XML structure that the group agreed on and recorded in a project document. These marked-up documents all conform to our Schema (visible on its own page), which we constructed together and which enforces the general markup rules, a necessity given that our source documents came from different places. Next, on the website side, each XML file was run through an XSLT file that converted it to a formatted, readable HTML view, which we then adapted to our style selections and published on the site. Our data collection and visualization consisted of a few different processes: Python word analysis (using tools like NLTK and Matplotlib), SVG/graph creation (using XSLT and otherwise), TSV tables, and Cytoscape networks based on our marked-up documents. The data (charting everything from dialogue frequencies to shot descriptions) and the code that produced it are all visible on this site.
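As a rough illustration of the regex markup step, here is a minimal Python sketch. It is not the project's actual markup code: the element names (script, scene_heading, speech, action), the patterns, and the sample text are all hypothetical, and it assumes a simple layout in which scene headings start with INT. or EXT. and speaker cues are written in capitals on their own line.

```python
import re
from xml.sax.saxutils import escape, quoteattr

# Hypothetical conventions: a scene heading starts with INT. or EXT.,
# and a speaker cue is a line made up only of capitals and a few marks.
SCENE_RE = re.compile(r"^(INT\.|EXT\.).*$")
SPEAKER_RE = re.compile(r"^[A-Z][A-Z .'-]*$")

def mark_up(script_text):
    """Wrap scene headings, dialogue, and action lines in illustrative XML tags."""
    out = ["<script>"]
    speaker = None
    for raw in script_text.splitlines():
        line = raw.strip()
        if not line:
            speaker = None  # a blank line ends the current speech
            continue
        if SCENE_RE.match(line):
            out.append(f"  <scene_heading>{escape(line)}</scene_heading>")
            speaker = None
        elif SPEAKER_RE.match(line):
            speaker = line  # remember who speaks the lines that follow
        elif speaker is not None:
            out.append(f"  <speech speaker={quoteattr(speaker)}>{escape(line)}</speech>")
        else:
            out.append(f"  <action>{escape(line)}</action>")
    out.append("</script>")
    return "\n".join(out)

# Invented sample text, not taken from any of the three screenplays.
sample = "INT. LIBRARY - DAY\n\nAlex sits down.\n\nALEX\nGood morning.\n"
print(mark_up(sample))
```

This sketch emits one speech element per line of dialogue; a real screenplay would also need handling for multi-line speeches, parentheticals, and camera directions, which is where the differences between source documents start to matter.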

Issues

While we ran into some issues when working on this project, most of them were resolved with relative ease. All three screenplays were accessed from the internet; however, some of the Sixteen Candles screenplay had to be corrected by hand early on (even before the XML stage) because, unlike the other two scripts, it was initially sourced from a PDF scan that did not easily become a workable .txt. The screenplays also had different camera directions, scene headings, and formatting, which made markup consistency difficult to track. On our GitHub repo, we had a continuous issue with getting the .DS_Store file to stay permanently gone (despite multiple .gitignore reinstallations). Some of the SVG files were redone early on as well in order to make them more comprehensible (or comprehensive). We had to redo our .tsv file for the Cytoscape networks a few times until it could be read and understood. We also had difficulty getting the Cytoscape networks up on the site, though that was later fixed. From a coding angle, the main struggle was the sheer act of chipping away at the various Python functions (many of which were new to us) and getting them to work with our documents. The trickiest single task was probably the XSLT used to make the Ferris Bueller 4th Wall timeline visual. Besides that, the coding mostly went smoothly (outside of crashes and finicky frustrations like not having the proper libraries installed).

Example of Project XSLT (Used for Ferris Bueller Timeline):

Conclusion

We learned a lot during the course of this project. First off, it was interesting to see that while all three scripts were written by the same person, they still differ quite a bit. Outside of the normal narrative and dialogue elements present in all scripts (and those admittedly were similar; our lexical data showed only a 2% difference across the screenplays), there are various inclusions of montage descriptions, camera shots, specific cuts, opening quotations, voiceover, music direction, and more. Some of the scripts feature those kinds of things, and some don’t. The biggest takeaway there is that script format or conformity should not dictate a movie, but rather the movie should dictate the script. Hughes shows us that a writer should branch out creatively when it’s needed, and not get comfortable in familiar practices if that could mean a worse end result.

Something else that’s interesting to note is the variance of camera intentionality throughout the scripts. Ferris Bueller and Sixteen Candles feature plenty of specific shot descriptions, but Breakfast Club has none at all. Our initial thinking was that the opposite would be true: a movie about a handful of people might need more precision or “control” in the script to maintain order over all these characters, and a story driven more by a single protagonist would be easier to figure out during the shoot. However, our findings contradicted this thinking. Maybe instead, the idea is that when you have an ensemble cast, it’s better to let them learn things more freely, so that the actors’ chemistry might benefit from a more organic arrangement (especially in a case like this, where the characters go through so many different emotions together). On a similar note, Breakfast Club also has a far lower number of scenes (42) than the other two movies (which themselves are similar, at 205 and 231), even though all three have comparable runtimes (107, 97, 93 min.). We feel that this reflects the previous idea: fewer separate events and fewer calls for smaller moments are better for a big-cast movie, in that they can increase the impact of each character within the minimal screen time they get.

We also learned a few things from all of our word data. Average lengths of dialogue were close (approx. 10, 11.3, and 12.4 words). As mentioned earlier, the three scripts definitely read as the work of the same person, in that their lexical richness values were 0.17, 0.18, and 0.19. Part of this may also be attributable to the fact that they were written within a few years of each other (language would have remained similar) and that all of the movies are about high schoolers. On the numbers side, we got an idea of the breakdown of certain parts of speech in screenplays. In particular, we analyzed all three scripts for verbs, adjectives, and prepositions. Surprisingly, all three scripts follow similar distribution curves for each type, regardless of script length or word count.
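For readers unfamiliar with these measures, the sketch below shows one way such figures can be computed in Python. It rests on assumptions: it treats lexical richness as the type-token ratio (unique words divided by total words), which is a common definition but not necessarily the one behind the values above, and it uses a simple regex tokenizer where the project used NLTK. The function names and sample strings are hypothetical.

```python
import re

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    # Keeps internal apostrophes so that "don't" stays one token.
    return re.findall(r"[a-z]+(?:'[a-z]+)?", text.lower())

def lexical_richness(text):
    """Type-token ratio (assumed definition): unique words / total words."""
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def average_speech_length(speeches):
    """Mean number of words per speech in a list of dialogue strings."""
    if not speeches:
        return 0.0
    return sum(len(tokenize(s)) for s in speeches) / len(speeches)

# Invented sample data, not taken from the screenplays.
print(lexical_richness("the cat and the dog and the bird"))  # 5 unique / 8 total = 0.625
print(average_speech_length(["Hello there.", "How are you today?"]))  # (2 + 4) / 2 = 3.0
```

The type-token ratio tends to fall as a text gets longer, so values like these are most comparable between scripts of similar length.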

From a technical perspective, this project also gave us some new coding and processing knowledge and abilities. Our HTML skills naturally got a bit sharper through the website work, but more important was everything we learned about Python and its outputs. We learned the full process of taking in a text (or collection of texts) and running it through different word functions to output the desired data into various kinds of visuals. We learned how to create and display SVGs more accurately (from XSLT and otherwise), as well as the fact that external PNGs can be linked into SVGs. We gained better proficiency with Regular Expressions and a feel for the best ways to use them. Additionally, we got better at handling TSV files and using them for things like Cytoscape networks. Overall, this project was a good experience. Not only did we learn more about movies we love, but we also learned more about word analysis and text encoding practices. We hope you enjoyed looking through everything we put together.