“Well, well well, how the turntables…” (A humanities scholar cleaning big data)

As someone who has catalogued thousands of music performances, I find that big data is no mystery, and neither is data mining or tidying data. This week, however, I not only learned some new tools and tricks of the trade but also sharpened my musicological case for the importance of big data in humanities work. As I explored these (to me) new methodologies for data mining, in both the theoretical and practical sense, I strengthened my ability to argue for their usefulness.

The debates over the usefulness of data mining in the humanities are ubiquitous for those who do this kind of work: when an article concerning Digital Humanities and Art History, Musicology, or any other humanities field is published, it is often polemical, reacting to or pushing against the idea that the digital and the human are mutually exclusive. This binary view of humanities and technology, which sets human emotion and experience against the binary-coded computer, has been disputed for nearly fifty years; Jules Prown argued for the usefulness of computational analysis in art history as early as 1966.

The troubling human/technological binary is too much to unpack here, but it becomes the elephant in the room as soon as someone mentions digital humanities. It often leads to questions like “so what?”, “where’s the buck?”, or “how does this do anything more than what we can already do in prose?” I will answer these questions below by discussing a couple of digital humanities tools.

Text Mining

Now, text mining might sound intimidating, but it is useful for both research and educational purposes. Say, for example, your work hinges on the claim that someone was the first to coin a term, or that a term did not exist until a certain time. Or you want to confirm that a word has, or once had, a specific usage different from what is assumed. How can you possibly know this information and be fairly certain of its accuracy? While text mining through tools like Google Ngram and Voyant is by no means 100% accurate, it is at least a step toward discovery, and potentially toward validating or disqualifying such claims.

Using tools like Google Ngram and Voyant, you can enter a word or group of words, and the software will chart the prevalence of that word after its algorithm searches through every OCR’d word in its arsenal: millions of books, more than a person could read in ten lifetimes. Just because a book contains a word and shows up on the usage chart, however, does not mean the method is unproblematic. To say nothing of its neglect of oral history and its language barriers, context is key here: is a book using a term as part of its own vocabulary, or repeating something outdated? Is the term in colloquial use at the time, or under scrutiny? All of these contexts are essential to consider. A book with a talking dog as its main character will have a different context from an analysis of Pavlov’s dog, but both will appear in the search.
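The core idea behind an Ngram-style usage chart can be sketched in a few lines of Python: count how often a term appears in a corpus of dated texts, then tally the counts by year. The mini-corpus and search terms below are invented placeholders, not real data, and this is of course a toy version of what tools like Google Ngram do at scale.

```python
from collections import Counter

# Hypothetical mini-corpus: (publication year, OCR'd text) pairs.
corpus = [
    (1901, "the phonograph was a marvel and the phonograph sang"),
    (1901, "critics debated the phonograph at length"),
    (1950, "the turntable replaced the phonograph in many homes"),
    (1950, "every turntable spins and the turntable hums"),
]

def usage_by_year(corpus, term):
    """Count how many times `term` appears per year, Ngram-chart style."""
    counts = Counter()
    for year, text in corpus:
        n = text.lower().split().count(term.lower())
        if n:  # only record years where the term actually occurs
            counts[year] += n
    return dict(counts)

print(usage_by_year(corpus, "phonograph"))  # {1901: 3, 1950: 1}
```

Note that this toy counter has exactly the limitation described above: it tallies every occurrence without knowing whether the word is used, quoted, or criticized.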

If you control the variables enough, however, this can be an immensely helpful research and educational tool. Perhaps you want to use it the way the DataBasic site does, where text mined from a spreadsheet is used to generate network maps. It can also be helpful for personal use: you can search any OCR’d research materials you have for keywords or phrases in common, or, very simply, find terms and quotes you remember vaguely but cannot seem to locate on the page. Control/Command + “F” in an OCR’d PDF is just as much text mining as anything else. Overall, text mining can be helpful, with the potential to corroborate or challenge research, lead to new questions, and act as a research and educational tool.
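That Control/Command + “F” habit generalizes easily: the sketch below searches a whole set of documents at once and reports the file and line of every match. The filenames and notes are hypothetical, and the documents are held in memory as strings purely to keep the example self-contained; in practice you would read your OCR’d files from disk.

```python
def find_phrase(documents, phrase):
    """Return (title, line number, line) for every line containing `phrase`."""
    hits = []
    for title, text in documents.items():
        for lineno, line in enumerate(text.splitlines(), start=1):
            if phrase.lower() in line.lower():
                hits.append((title, lineno, line.strip()))
    return hits

# Hypothetical OCR'd research notes.
notes = {
    "interview_1903.txt": "The singer spoke of ragtime.\nRagtime was everywhere that year.",
    "review_1921.txt": "Jazz, not ragtime, filled the halls.",
}

for title, lineno, line in find_phrase(notes, "ragtime"):
    print(f"{title}:{lineno}: {line}")
```

Because the match is case-insensitive, “Ragtime” at the start of a sentence is found along with lowercase uses, which is exactly the behavior you want when hunting down a half-remembered quote.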

Data Analysis and Display (Charts, oh my!)

Most forms of research are no stranger to charts. Even musicological work can include music-theory form charts and conceptualization charts. These help readers take in ideas and see trends in data. They are also often given a great deal of trust by the uncritical reader, so we must be careful about how we present data and enter it into the chart.

Tidy Data

The cleaner the data, the better the chart. Knowing how a mapping software will find geolocations can help you format your Excel sheets, for example, and knowing which fields you want to display helps you chart your course through the organization of your sheets. In other words, knowing something about where you’re going can help you have the cleanest data possible. It is unfortunate to get a third of the way through entering spreadsheet data and realize you forgot a column for the year of each piece, just as you have decided to map the pieces chronologically. It’s all in the details.
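One way to avoid discovering a missing column a third of the way in is to decide your fields up front and validate every row against them as you go. Here is a minimal sketch using only the Python standard library; the column names and the two concert pieces are made-up examples, not a prescribed schema.

```python
import csv
import io

# Decide the columns before entering data: one row per piece.
FIELDNAMES = ["title", "composer", "year", "city"]

raw = """title,composer,year,city
Symphony No. 5,Beethoven,1808,Vienna
Rhapsody in Blue,Gershwin,1924,New York
"""

def load_tidy(text, fieldnames):
    """Read rows, failing loudly if a planned column is missing or empty."""
    rows = list(csv.DictReader(io.StringIO(text)))
    for i, row in enumerate(rows, start=1):
        for field in fieldnames:
            if not row.get(field):
                raise ValueError(f"row {i} is missing '{field}'")
        row["year"] = int(row["year"])  # numeric years sort chronologically
    return rows

pieces = load_tidy(raw, FIELDNAMES)
print(sorted(p["title"] for p in pieces))
```

Validating at entry time means the moment a year is missing you get an error naming the exact row, rather than a mapping tool silently dropping that piece later.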

Graphs and data, whether they be pie charts or flow charts, bar graphs or scatter plots, can make statistical arguments that corroborate claims. In an often anecdotal field like musicology or art history, sometimes it is easier to make a claim that applies to many things if you can prove that it does in fact apply just by looking at the numbers. This corroboration of specific stories through big data is legitimizing in many contexts – it does not have to make a new argument or create some breakthrough for it to be relevant, which is often the desire of the people who say “so what?” Odds are, if you’re saying “so what,” you aren’t thinking creatively enough.

How are new theories created? How are new methodologies established? They emerge when someone understands that it takes a lot of creative thinking to connect disparate themes, to tackle binaries, to fully legitimize something that once seemed impossible. We will not expand or contribute to our fields by working only with what is comfortable and relevant. I believe we should lean into these moments of seeming irrelevance to discover something truly original, something outside of the box that destroys the box altogether. In embracing our disinterest and discomfort, we may fail, but we may also discover greatly.

One Reply to ““Well, well well, how the turntables…” (A humanities scholar cleaning big data)”

  1. Emily – your posts are always so thoughtful and thorough – so much I want to say in response! First, I appreciate your acknowledgement of the issues within, and polarizing nature of, a lot of data analysis tools, and DAH in general. Like you, I think these issues are worth talking about, but I also think a lot of these tools still have their place in our work and shouldn’t be discredited because they aren’t perfect or because they can’t do exactly what we want. I’m embarrassed to admit it, but I hadn’t even considered the ramifications of using a text analysis tool like Voyant for oral histories. In some instances, the particulars and uniqueness of language are essential elements of how oral histories are studied. I know from transcribing oral histories that oftentimes you end up using words that aren’t technically ‘real’ words. How do we deal with OCRing those and having text mining software handle them appropriately? So much to consider! I also appreciate you validating the use of charts and data visualization, and bringing up the idea that if you’re always asking, ‘so what?’, you might not be thinking creatively enough! As digital humanities are still fairly new, there is still value in these basic digital tools, and we shouldn’t be so quick to dismiss them. I think tools that make data easier to digest are crucial when applying for funding or proving the worth of your research to those either outside the field or in higher-up admin positions. Thanks again for all your thoughts!
