Pico Safari: Active Gaming in Integrated Environments

With the recent release of Pokemon Go, I’m posting my presentation notes for a similar game called Pico Safari, a collaboration with Lucio Gutierrez, Garry Wong, and Calen Henry in late 2009, advised by Drs. Sean Gouglas, Geoffrey Rockwell, and Eleni Stroulia. There is no chest-thumping here: the concept of virtual creatures in the real world follows so naturally from the technological affordances of the past few years, with ARG-enabling technologies in our phones and the evergreen motivation of “collecting stuff” (remember Victorian-era insect collecting?), that it seemed obvious. We were creating a game we wanted to play (I pitched the idea to the group as ‘Pokemon in the real world’), and I love that there’s now the real deal with Nintendo’s magic.

The talk below was presented at SDH-SEMI 2011, earning the Ian Lancashire Student Promise award.


Understanding Classified Languages in the HathiTrust


The HTRC Extracted Features (EF) dataset provides two forms of language information: the volume-level bibliographic metadata (what the library record says), as well as machine-classified tags for each page of each volume. To get a sense of when the machine tags are useful, I looked at the 1.8 billion page classifications in the dataset and where they conflict with existing language metadata.

Of the 4.8 million volumes in the dataset, there are 379,839 books where the most-likely language across all pages differs from the bibliographic language, about 8% of the collection. The reasons for these discrepancies are not always clear: they can indicate issues with either the language classifier or the human cataloguing.
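The comparison described above can be sketched in a few lines. This is a minimal illustration, not the code behind the numbers in the post: the toy `volumes` dictionary, the volume ids, and both function names are hypothetical, and it assumes the bibliographic and classified languages already use the same code scheme (real data may need a mapping between schemes first).

```python
from collections import Counter


def most_likely_language(page_langs):
    """Return the most frequent page-level language tag, or None if no page was tagged."""
    counts = Counter(lang for lang in page_langs if lang)
    if not counts:
        return None
    return counts.most_common(1)[0][0]


def find_conflicts(volumes):
    """Yield (volume id, bibliographic language, classified language) where the two disagree."""
    for vol_id, (bib_lang, page_langs) in volumes.items():
        classified = most_likely_language(page_langs)
        if classified is not None and classified != bib_lang:
            yield vol_id, bib_lang, classified


# Hypothetical toy data: volume id -> (bibliographic language, per-page classified languages)
volumes = {
    "vol.001": ("en", ["en", "en", "fr", "en"]),
    "vol.002": ("de", ["la", "la", "de", "la"]),
    "vol.003": ("fr", [None, None]),  # no page could be classified, so no verdict
}

conflicts = list(find_conflicts(volumes))
print(conflicts)
```

Volumes with no classified pages are skipped rather than counted as conflicts, which is one possible choice; a stricter audit might flag them separately.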

When do you trust the bibliographic record and when do you trust the machine classifier? The simple answer is neither: you trust them when they agree, and stay leery when they don’t.

A Dataset of Term Stats in Literature

Following up on Term Weighting for Humanists, I’m sharing data and code to apply term weighting to literature in the HTRC’s Extracted Features dataset.

I’ve uploaded two datasets, crunched from 235,000 Language and Literature (i.e. LCC Class P) volumes: IDF values for the terms, and a more generally useful dataset of frequency counts.

Term Weighting for Humanists

This post is about words. Specifically, an intuition about words, one which is meant to grasp at the aboutness of a text. The idea is simple: that not all words are equally valuable. Here, I’ll introduce one of the foundational ways that people have tried to formalize this intuition: TF-IDF.
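To make the intuition concrete, here is a minimal sketch of one common TF-IDF variant: raw term count multiplied by log(N / df), where N is the number of documents and df is the number of documents containing the term. Many other weighting variants exist, and the three toy documents and the function name `tf_idf` are made up for illustration.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf is the raw count of the term in the document; idf is log(N / df),
    where N is the number of documents and df is how many contain the term.
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update(set(tokens))  # count each term once per document

    weights = []
    for tokens in docs:
        counts = Counter(tokens)
        weights.append(
            {term: count * math.log(n_docs / doc_freq[term]) for term, count in counts.items()}
        )
    return weights


# Hypothetical toy corpus, already tokenized
docs = [
    "the whale and the sea".split(),
    "the garden and the rose".split(),
    "the whale hunt".split(),
]

weights = tf_idf(docs)
for term, weight in sorted(weights[0].items(), key=lambda item: -item[1]):
    print(term, round(weight, 3))
```

A word like "the" appears in every document, so its IDF is log(1) and its weight drops to zero no matter how often it occurs, while a word found in only one document keeps the highest weight.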


HTRC Feature Reader 2.0

I’ve released an overhaul of the HTRC Feature Reader, a Python library that makes it easy to work with the Extracted Features (EF) dataset from the HathiTrust. EF provides page-level feature counts for 4.8 million volumes, including part-of-speech tagged term counts, line and sentence counts, and counts of which characters occur in the far right and left sides of the text. The Feature Reader provides easy parsing of the dataset format and in-memory access to different views of the features. This new version works in service of the SciPy stack of data analysis tools, particularly Pandas. I’ve also transferred the code to the HathiTrust Research Center organization, and it is the first version that can be installed by pip:

pip install htrc-feature-reader

If you want to jump into using the HTRC Feature Reader, the README walks you through the classes and their methods, the documentation provides more low-level detail, and the examples folder features Jupyter notebooks with various small tutorials. One such example is how to plot sentiment in the style of Jockers’s plot arcs. The focus of this post is explaining the new version of the Feature Reader.

Chart from the Within Books Sentiment Trends tutorial


Git tip: Automatically converting iPython notebook READMEs to Markdown

A small but useful tip today, on using iPython notebooks for a git project README while keeping an auto-generated version in the Markdown format that GitHub prefers.
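One way to automate this, sketched here under assumptions rather than taken from the full post, is a small Python helper that a git pre-commit hook could call. It assumes Jupyter's `nbconvert` is installed and that the notebook is named `README.ipynb`; the function names are hypothetical.

```python
import subprocess


def nbconvert_command(notebook="README.ipynb"):
    """Build the command that renders a notebook to Markdown beside the original."""
    return ["jupyter", "nbconvert", "--to", "markdown", notebook]


def markdown_path(notebook="README.ipynb"):
    """Return the path nbconvert writes to: same name, with an .md extension."""
    return notebook.rsplit(".", 1)[0] + ".md"


def convert_and_stage(notebook="README.ipynb"):
    """Convert the notebook, then stage the generated Markdown so it joins the commit."""
    subprocess.run(nbconvert_command(notebook), check=True)
    subprocess.run(["git", "add", markdown_path(notebook)], check=True)
```

An executable `.git/hooks/pre-commit` script could import this module and call `convert_and_stage()`, so the Markdown copy is regenerated every time the notebook is committed. With `check=True`, a failed conversion raises an error and aborts the commit instead of silently shipping a stale README.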


MARC Fields in the HathiTrust

At the HathiTrust Research Center, we’re often asked about metadata coverage for the nearly 15 million HathiTrust records. Though we provide additional derived features like author gender and language inference, the metadata is generally what arrives from institutional providers via HathiTrust. HathiTrust provides some baseline guidelines for partners, but beyond those, what you can expect is dependent on what institutional partners provide.

For a sense of what that is, below is a list of the most common MARC fields. Crunching this didn’t involve any special access through the Research Center: you could easily access the same records via HathiTrust’s Bibliographic API (and hey, some code!).

The good news is that at the scale of the HathiTrust’s collection, even a small random fraction of the full collection is sufficient to see many quantitative patterns. We see this often at the Research Center, where we work largely at aggregate levels over thousands or millions of volumes: term frequencies, topic models, and other types of distributions converge much, much earlier. The bad news is that you can’t be sure that the differences between the missing and included data are random. For that, you’ll have to look more closely at a field that you’re interested in.

This data is presented for curiosity with little commentary, but I will offer one pro-tip: If you’re looking to get a thematic classification of a volume, the ~50% coverage of Library of Congress call numbers is a good place to start.
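For readers who want to tally field coverage themselves, here is a minimal sketch of the counting step. It assumes each record is already in hand as a MARCXML string (for instance from the Bibliographic API mentioned above; fetching is not shown). The two sample records are hypothetical and heavily trimmed, and the function names are illustrative rather than the code used for this post.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Namespace used by MARCXML elements
MARC_NS = "{http://www.loc.gov/MARC21/slim}"
FIELD_ELEMENTS = (MARC_NS + "controlfield", MARC_NS + "datafield")


def marc_tags(marc_xml):
    """Return the set of control and data field tags present in one MARCXML record."""
    root = ET.fromstring(marc_xml)
    return {elem.get("tag") for elem in root.iter() if elem.tag in FIELD_ELEMENTS}


def field_coverage(records):
    """Return {tag: fraction of records that contain the tag at least once}."""
    counts = Counter()
    for marc_xml in records:
        counts.update(marc_tags(marc_xml))
    return {tag: n / len(records) for tag, n in counts.items()}


# Hypothetical, heavily trimmed records for illustration only
RECORD_A = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">000000001</controlfield>
  <datafield tag="050" ind1="0" ind2="0"><subfield code="a">PR4034</subfield></datafield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">Emma</subfield></datafield>
</record>"""

RECORD_B = """<record xmlns="http://www.loc.gov/MARC21/slim">
  <controlfield tag="001">000000002</controlfield>
  <datafield tag="245" ind1="1" ind2="0"><subfield code="a">Untitled</subfield></datafield>
</record>"""

coverage = field_coverage([RECORD_A, RECORD_B])
for tag, share in sorted(coverage.items()):
    print(tag, f"{share:.0%}")
```

Field 050 holds the Library of Congress call number, so its share in a tally like this corresponds to the call-number coverage mentioned above. Repeated fields within a record are counted once, since the question is how many records carry a field at all.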
