When talking about quantitative features in text analysis, token counts are king, but other features can help infer the content and context of a page. I demonstrate visually how the characters at the margins of a page reveal intuitively sensible patterns in text.
With the recent release of Pokemon Go, I’m posting my presentation notes from designing a similar game called Pico Safari in collaboration with Lucio Gutierrez, Garry Wong, and Calen Henry in late 2009. The concept of virtual creatures in the real world follows so nicely from the technological affordances of the past few years, with ARG-enabling technologies in our phones and the evergreen motivation of “collecting stuff” (remember Victorian-era insect collecting?).
The talk below was presented at SDH-SEMI 2011, earning the Ian Lancashire Student Promise award.
The HTRC Extracted Features (EF) dataset provides two forms of language information: volume-level bibliographic metadata (what the library record says) and machine-classified tags for each page of each volume. To get a sense of when the machine tags are useful, I looked at the 1.8 billion page classifications in the dataset and at where they conflict with the existing language metadata.
Of the 4.8 million volumes in the dataset, there are 379,839 books where the most-likely language across all pages differs from the bibliographic language, about 8% of the collection. The reasons for these discrepancies are not always clear: they can indicate issues with either the language classifier or the human cataloguing.
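The comparison behind that count can be sketched in plain Python: take the most frequent per-page tag as the volume-level guess and flag disagreement with the catalogue record. The function name and the list-of-tags input here are illustrative, not the EF schema:

```python
from collections import Counter

def flags_language_conflict(page_langs, bib_lang):
    """Return True when the most-likely language across a volume's
    pages disagrees with its bibliographic language."""
    most_likely, _ = Counter(page_langs).most_common(1)[0]
    return most_likely != bib_lang

# A volume catalogued as German whose pages mostly classify as English
print(flags_language_conflict(["en", "en", "de", "en"], "de"))  # True
```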
When do you trust the bibliographic record and when do you trust the machine classifier? The simple answer is neither: you trust them when they agree, and stay leery when they don’t. Continue reading “Understanding Classified Languages in the HathiTrust”
From 235,000 Language and Literature (i.e. LCC Class P) volumes, I’ve uploaded two crunched datasets: IDF values for the terms, and a more generally useful dataset of frequency counts.
Continue reading “A Dataset of Term Stats in Literature”
This post is about words. Specifically, an intuition about words, one which is meant to grasp at the aboutness of a text. The idea is simple: that not all words are equally valuable. Here, I’ll introduce one of the foundational ways that people have tried to formalize this intuition: TF-IDF.
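As a rough sketch of how that intuition gets formalized, here is a minimal TF-IDF scorer using the common log-scaled variant (not necessarily the exact form the post walks through): words that appear in every document score zero, while rarer, more distinctive words score higher.

```python
import math

def tf_idf(term, doc, corpus):
    """Score a term's importance to one document, relative to a corpus.

    tf: how often the term appears in the document.
    idf: log of (number of documents / documents containing the term),
         which discounts words that appear everywhere.
    """
    tf = doc.count(term)
    df = sum(1 for d in corpus if term in d)
    return tf * math.log(len(corpus) / df)

corpus = [["the", "whale", "swam"],
          ["the", "cat", "sat"],
          ["the", "whale", "sang"]]

# "the" occurs in every document, so its idf (and score) is zero
print(tf_idf("the", corpus[0], corpus))    # 0.0
print(tf_idf("whale", corpus[0], corpus))  # positive: rarer, so more telling
```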
I’ve released an overhaul of the HTRC Feature Reader, a Python library that makes it easy to work with the Extracted Features (EF) dataset from the HathiTrust. EF provides page-level feature counts for 4.8 million volumes, including part-of-speech tagged term counts, line and sentence counts, and counts of which characters occur at the far right and left sides of the text. The Feature Reader provides easy parsing of the dataset format and in-memory access to different views of the features. This new version works in service of the SciPy stack of data analysis tools, particularly Pandas. I’ve also transferred the code to the HathiTrust Research Center organization, and this is the first version that can be installed with pip:
pip install htrc-feature-reader
If you want to jump into using the HTRC Feature Reader, the README walks you through the classes and their methods, the documentation provides more low-level detail, and the examples folder features Jupyter notebooks with various small tutorials. One such example is how to plot sentiment in the style of Jockers’s plot arcs. The focus of this post is explaining the new version of the Feature Reader.
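Since the new version centers on Pandas, here is a rough sketch of the kind of view that enables. The DataFrame below is mock data shaped loosely like a page-level, POS-tagged token list; it is not output from the library itself, and the real EF column layout may differ:

```python
import pandas as pd

# Mock page-level, POS-tagged token counts (illustrative, not real EF output)
tokens = pd.DataFrame({
    "page":  [1, 1, 2, 2, 2],
    "token": ["whale", "sea", "whale", "ship", "sea"],
    "pos":   ["NN", "NN", "NN", "NN", "NN"],
    "count": [3, 1, 2, 4, 2],
})

# With counts in a DataFrame, whole-volume aggregations are one-liners
per_term = tokens.groupby("token")["count"].sum().sort_values(ascending=False)
print(per_term)
```

This is the design payoff of targeting Pandas: once page-level counts are tabular, slicing by page, part of speech, or term is ordinary groupby work rather than custom parsing code.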
A small but useful tip today, on using IPython notebooks for a git project README while keeping an auto-generated version in the Markdown format that GitHub prefers.
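One general way to keep such a pair in sync, assuming the notebook is named README.ipynb, is nbconvert; this is a standard recipe, not necessarily the exact workflow the post describes:

```shell
# Convert the notebook to Markdown, producing README.md alongside it
jupyter nbconvert --to markdown README.ipynb
```

Running this after each notebook edit (by hand, or from a git pre-commit hook) keeps the Markdown copy from drifting out of date.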