Oleg Seletsky Tiger Huang William Henderson Frost Dartmouth College December 12, 2007
Many in the academic world accept as fact that William Shakespeare of Stratford-upon-Avon was the pen behind such works as Romeo and Juliet, King Lear, etc. However, there has been a debate dating back to the 18th century about whether the works attributed to the famed playwright were actually composed by another writer. The debate thus far has focused on the lack of historical records concerning Shakespeare’s life, as well as on evidence of a level of higher learning in Shakespeare’s works that is inconsistent with Shakespeare’s background. This paper, however, will use modern text-analysis and data-mining techniques to analyze the works attributed to Shakespeare and to several leading alternative candidates in an attempt to provide a more concrete answer to the authorship question.
The Shakespeare authorship question has mainly been approached from a historical point of view. The work attributed to Shakespeare shows a knowledge of geography, foreign languages, and politics, and an immense vocabulary, that many find inconsistent with what is known about Shakespeare’s education. Shakespeare’s will also makes no mention of his shares in the Globe Theatre, books, letters, or any of the 18 works still unpublished at the time of his death. However, there is no one piece of concrete evidence that conclusively tips the argument either way. The arguments of many Anti-Stratfordians are simply based on a subjective impression of Shakespeare’s work.
names that precede spoken lines. The works were then loaded into R using the tm text mining library and run through stemming and stopword removal, as well as a conversion that maps all characters to lowercase. A collection of five poems and a number of sonnets was taken from a different website, but these were also normalized into a single text file in a similar manner.
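The normalization steps described above (lowercasing, stopword removal, and stemming) can be illustrated with a minimal sketch. It is written in Python rather than R and is not the authors' code: the stopword list is a tiny illustrative sample, and `crude_stem` is a hypothetical suffix stripper standing in for the Porter-style stemmer that a text mining library would normally supply.

```python
import re

# Tiny illustrative stopword list (an assumption); real lists are much longer.
STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "that", "it"}

# Suffixes removed by the toy stemmer, checked in this order.
SUFFIXES = ("ing", "ed", "es", "s")


def crude_stem(word):
    """Strip one common suffix, keeping at least three leading characters.

    This is a simplified stand-in for a real stemmer, not the Porter algorithm.
    """
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def normalize(text):
    """Lowercase the text, drop stopwords, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


print(normalize("The King is walking to the castles"))
# → ['king', 'walk', 'castl']
```

Stems such as `castl` are not dictionary words. That is acceptable here, because the goal is only to map inflected forms of a word onto one shared token before counting.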
The works of Sir Francis Bacon, Christopher Marlowe, and Edward de Vere are also available online and were put through a similar normalizing procedure.
The purpose of this paper is then to apply modern text analysis techniques, using the R statistical package, to compare the works attributed to Shakespeare with those of leading alternative candidates such as Sir Francis Bacon, Christopher Marlowe, and Edward de Vere. We compare writing styles using quantifiable measures, namely character usage, word length, and the percentage of unique words.
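To make the three measures concrete, the following minimal sketch computes them for a single text. It is in Python rather than R, and the exact definitions are assumptions made for illustration: character usage is taken as the relative frequency of each letter, word length as the mean number of letters per word, and the percentage of unique words as distinct words divided by total words. The function name `style_measures` is hypothetical.

```python
import re
from collections import Counter


def style_measures(text):
    """Return letter frequencies, mean word length, and percent unique words."""
    words = re.findall(r"[a-z]+", text.lower())
    if not words:
        raise ValueError("text contains no words")

    letters = "".join(words)
    total_letters = len(letters)
    letter_counts = Counter(letters)

    return {
        # Share of all letters accounted for by each letter.
        "char_freq": {c: n / total_letters for c, n in letter_counts.items()},
        # Average number of letters per word.
        "mean_word_length": total_letters / len(words),
        # Distinct words as a percentage of all words.
        "pct_unique_words": 100.0 * len(set(words)) / len(words),
    }


measures = style_measures("To be or not to be")
print(round(measures["mean_word_length"], 2))   # → 2.17
print(round(measures["pct_unique_words"], 2))   # → 66.67
```

Note that the percentage of unique words tends to fall as a text grows longer, so comparing authors on this measure is most meaningful when the samples are of similar size.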
Our data on Shakespeare’s plays were pulled from MIT’s online repository of Shakespeare’s works. A total of 37 works are available there in HTML format, which we ran through a Ruby script to remove the HTML tags, stage directions, and character
The first Shakespeare candidate we will examine is Christopher Marlowe. He was born in Canterbury in 1564. His father was a shoemaker, so he was fortunate to receive scholarships to both the King’s School, Canterbury, and Corpus Christi College, Cambridge. There he practiced translation, poetry, and playwriting. After his studies he entered the Queen’s service instead of entering the Church. He is said to have been killed in 1593, an account denied by those who believe the Marlovian Theory (that Shakespeare was Marlowe). Those who do not believe in this murder think he went off into some sort of exile and continued to write plays as Shakespeare. Fortunately, his works published before 1593 give us some room to test this hypothesis. Plays by Marlowe used in our analysis include Dido, Queen of Carthage, Tamburlaine part 1, Tamburlaine part 2, The Jew of Malta, and