Data mining in the digital humanities involves performing some kind of extraction of information from a body of texts and/or their metadata in order to ask research questions that may or may not be quantitative.
Suppose you want to compare the frequency of the words “she” and “he” in newspaper accounts of political speeches in the early 20th century, before and after the 19th Amendment guaranteed women the right to vote in August 1920. Suppose you also want to collocate these words with the phrases in which they appear and sort the results by various factors: frequency, affective value, attribution, and so on. This kind of text analysis is a subset of data mining.
Quite a few tools have been developed to analyze unstructured texts, that is, texts in conventional formats. Text analysis programs use word counts, keyword density, frequency, and other methods to extract meaningful information. The question of what constitutes meaningful information is always up for discussion, and completely silly or meaningless results can be generated as readily from text analysis tools as from any other method.
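As a rough illustration of the word counts such programs rely on, here is a minimal Python sketch. The function names (tokenize, word_frequencies) and the sample sentence are invented for this example, and the tokenizer is deliberately naive; real tools handle punctuation, hyphenation, and Unicode far more carefully.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into runs of letters and apostrophes.

    This is a simplifying assumption, not how any particular tool tokenizes.
    """
    return re.findall(r"[a-z']+", text.lower())

def word_frequencies(text):
    """Return a Counter mapping each word to the number of times it occurs."""
    return Counter(tokenize(text))

# Hypothetical sample text, made up for illustration only.
sample = "She said he would vote. He said she already had."
counts = word_frequencies(sample)
print(counts["she"], counts["he"])
```

Comparing counts like these across two sets of newspaper accounts would be one simple way to begin the “she” versus “he” comparison described above.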
A stop word list is a set of words that should be excluded from the results of a tool. Typically, stop word lists contain so-called function words that don’t carry as much meaning, such as determiners and prepositions (in, to, from, etc.).
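The effect of a stop word list can be sketched in a few lines of Python. The list below is a tiny hand-made assumption for illustration; the lists that ship with real tools are much longer and differ from tool to tool.

```python
import re
from collections import Counter

# A tiny illustrative stop word list (an assumption, not any tool's actual list).
STOP_WORDS = {"a", "an", "the", "in", "to", "from", "of", "and"}

def content_word_counts(text, stop_words=STOP_WORDS):
    """Count words in the text, skipping any word found in the stop word list."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

counts = content_word_counts("The right to vote and the right to speak")
print(counts.most_common(3))
```

Without the filter, “the” and “to” would tie with “right” as the most frequent words in this sentence; with it, only the content words remain.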
Optical character recognition, or optical character reader (OCR), is the mechanical or electronic conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example from a television broadcast).
Widely used as a form of information entry from printed paper data records – whether passport documents, invoices, bank statements, computerised receipts, business cards, mail, printouts of static-data, or any suitable documentation – it is a common method of digitising printed texts so that they can be electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.
Free Online OCR Software
Online Sources of Text
Online Corpus Examples
Shakespeare: Project Gutenberg’s collection of 37 plays from William Shakespeare
Austen: Project Gutenberg’s collection of 8 novels from Jane Austen: Love And Freindship, Lady Susan, Sense and Sensibility, Pride and Prejudice, Mansfield Park, Emma, Northanger Abbey, Persuasion
Examples of Text Analysis Projects
Voyant Tools is a web-based text reading and analysis environment.
It’s designed to make it easy for you to work with your own text or collection of texts in a variety of formats, including plain text, HTML, XML, PDF, RTF, and MS Word.
What is a corpus?
A corpus is a collection of texts of written (or spoken) language presented in electronic form.
Cirrus: a word cloud that displays the highest-frequency terms; the larger the term, the more frequent it is. You can hover over words to see their frequency and click on them to see additional information.
Summary: an overview of the corpus that provides basic information about the text(s) in the collection, including the number of words, the length of documents, vocabulary density, and distinctive words for each document.
Corpus Reader: a scalable text reader for the text(s) in the collection, suitable for very large documents; more text appears as you scroll. You can hover over words to view their frequency and click on terms to see more information.
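One of the figures reported by Summary, vocabulary density, can be approximated as the number of distinct words divided by the total number of words. The sketch below assumes that definition and a naive tokenizer; the function name is invented, and a given tool may compute the figure differently.

```python
import re

def vocabulary_density(text):
    """Return distinct words divided by total words (0.0 for empty text).

    Assumes a simple letters-and-apostrophes tokenizer for illustration.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

# Six words, four of them distinct.
print(round(vocabulary_density("to be or not to be"), 2))
```

A higher value suggests a more varied vocabulary, although the measure is sensitive to document length, so comparisons work best between texts of similar size.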
Clicking on terms in the environment will open up additional tools.
For example, if you click on a word in Cirrus, you’ll see the “Word Trends” tool appear, and clicking on one of the dots in “Word Trends” will cause the “Keyword in Context” tool to open.
Word Trends is a distribution graph that shows word frequencies across multiple documents or within a single document.
Keyword in Context shows occurrences of each word in its context.
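The idea behind a keyword-in-context display can be sketched as follows. This is only an illustration of the general technique, not Voyant's implementation; the function name, the window parameter, and the sample sentence are all assumptions made for the example.

```python
import re

def keyword_in_context(text, keyword, window=3):
    """Return (left, keyword, right) tuples with up to `window` words per side."""
    tokens = re.findall(r"[A-Za-z']+", text)
    hits = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - window):i])
            right = " ".join(tokens[i + 1:i + 1 + window])
            hits.append((left, token, right))
    return hits

# Hypothetical sample text, made up for illustration only.
text = "She spoke first. Then he replied, and she answered him at once."
for left, word, right in keyword_in_context(text, "she", window=2):
    print(f"{left:>15} | {word} | {right}")
```

Aligning every occurrence on the keyword makes it easy to scan the words that tend to appear on either side of it.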
Special Tools (accessed by clicking on the header or on the arrows):
Words in the Entire Corpus: this shows an ordered list of terms in all documents, including a micro-graph (sparkline) showing distribution across the corpus (when your corpus includes multiple documents).
Corpus: a grid that shows available metadata for documents in the collection.
Words in Documents: this shows frequency information for terms in each document.
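The per-document frequencies that Words in Documents lists and Word Trends plots can be approximated with a short sketch like the one below. It assumes relative frequency means a term's count divided by the document's total word count; the function name and the two miniature documents are invented for illustration.

```python
import re

def relative_frequencies(documents, term):
    """Map each document name to the share of its words that equal `term`."""
    result = {}
    for name, text in documents.items():
        tokens = re.findall(r"[a-z']+", text.lower())
        result[name] = tokens.count(term.lower()) / len(tokens) if tokens else 0.0
    return result

# Hypothetical miniature corpus, made up for illustration only.
docs = {
    "doc1": "race and hope and race",
    "doc2": "hope and change",
}
print(relative_frequencies(docs, "race"))
```

Using relative rather than raw counts keeps long documents from dominating the comparison simply because they contain more words.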
You can bookmark and share URLs that refer to your collection of texts, so that you can continue to work on your project without having to reload all the documents each time.
You can export a link for the entire project (or “skin”, which is the combination of tools) by clicking on the “Export” (diskette) icon in the blue bar at the top, or export a link for an individual tool by clicking on the “Export” icon in one of the tool panes.
Examples of Voyant Projects:
In the lead-up to the 2008 US Presidential election the news media became interested in what Barack Obama and his spiritual mentor Jeremiah A. Wright Jr. had to say about race.
What if we took them at their word and looked away from the podium-and-pews drama?
What if we take them seriously and look at what they say?
What if we used our hermeneutica to try to "analyze that," interrogating and interpreting the similarities and differences between their speeches?
We chose to look at:
Barack Obama's March 18, 2008 speech “A More Perfect Union,” given in response to the media attention. This speech has been generally considered one of Obama's finest on race and America.
Jeremiah Wright's April 27th speech to the National Association for the Advancement of Colored People (NAACP), which followed Obama's speech and also deals with race.
In Hume’s Dialogues Concerning Natural Religion (1779), one of the great philosophical dialogues of the eighteenth century, a fictional author-narrator named Pamphilus describes a conversation among Philo (a sceptic), Cleanthes (a theist), and Demea (a fundamentalist) about God’s existence and God’s nature.
We began our inquiry into Hume’s work by asking what we could learn about scepticism from it using text analysis. Scepticism interested us because it is an approach that underlies much of what we think about knowing, and because much of this book is about how computational tools can help us in thinking through.
James Baker describes a project in which he gathered metadata for a large number of British cartoons from the 1960s and 1970s and analyzed it using a variety of tools. Baker used Voyant for topic modeling, which Miriam Posner describes as “a method for finding and tracing clusters of words (called ‘topics’ in shorthand) in large bodies of texts.”
Baker anticipates the question often posed about the results of text mining: “[W]hat did I actually discover in the data?” Examining a word cloud derived from Voyant’s Cirrus tool, Baker suggests that “[t]he themes of the cartoons in the corpus track the politics of the day.” He argues that “textual content within cartoons during the same period tended toward natural language.” Baker also uses Voyant’s Word Trends graphs to track the relative frequencies of key terms in the cartoons under examination.
Exporting the URL of the current tool(s) and corpus can be helpful for bookmarking or for sharing work with colleagues (by email, Twitter, and so forth). Although there is no guarantee that the corpus will be kept indefinitely, it is unlikely to be deleted if it has been accessed in the previous two weeks.
The skin provides links to the tool browser and the skin builder (to create your own combination of tools) but does not provide functionality to embed code snippets or to export data. In contrast, individual tools do provide functionality for exporting data, depending on the type of tool it is. For instance, some of the tabular data tools allow the user to export comma-separated values whereas some of the visualization tools allow the user to export a static image.
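To give a sense of what exported comma-separated values look like, here is a minimal sketch that writes a term-frequency table as CSV using Python's standard library. The column names and the function name are assumptions made for this example; the files a given tool exports may be laid out differently.

```python
import csv
import io
from collections import Counter

def counts_to_csv(counts):
    """Serialize a Counter as CSV text with a header row, most frequent first."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["term", "count"])  # hypothetical column names
    for term, count in counts.most_common():
        writer.writerow([term, count])
    return buffer.getvalue()

# Counts derived from a made-up word list, for illustration only.
table = counts_to_csv(Counter(["race", "race", "hope"]))
print(table)
```

A file in this form can be opened in a spreadsheet or loaded into another analysis tool.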
Additional User Interface Elements: