Readers are idiosyncratic. Recent work with library data has underscored this fact more than ever, but it has also drawn attention to specific patterns in reading behavior, many of them unexpected. What, then, is the relation between borrowing patterns and borrower idiosyncrasy at the broader level of overall library use, across many patrons’ full borrowing histories?
I had the pleasure of presenting on this subject at the excellent Library Circulation Histories Workshop over the past couple of weeks. I explored approaches to logistic modeling with data from What Middletown Read, the database of library checkout records from Muncie, Indiana, in order to test the degree to which patrons’ borrowing histories are predictive of whether they borrowed a book by any given author.
In the process, my goal was to evaluate the assumptions and consequences of modeling borrower behavior itself given the peculiarities of library checkout data and the mechanics of logistic regression. I suggest that we should treat models of borrowing habits or patron predictability as measures of the legibility of taste. In the recording below, I also discuss implications of:
- Treating the models' latent variable as taste (rather than utility);
- Class imbalance, the fact that borrowing a book by any given author is a statistically “rare event”;
- The relative unhelpfulness of super users and ultra-popular novels vs the relative instructiveness, in the aggregate, of moderate-volume borrowers;
- Interpreting distribution of error for cultural choices as pertaining to individual idiosyncrasy;
- False positives, which may indicate what patrons should have read (valid “recommended reading”) or did read outside the span of the data (at a different time or in a different context); and
- The necessity of understanding library and print culture when working with such data.
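To make the class-imbalance point in the list above concrete, here is a minimal sketch in Python. It is not the code from the presentation: the counts are hypothetical, and the inverse-frequency weighting shown is just one common heuristic for rare events, not necessarily the sampling optimization I describe in the recording.

```python
from collections import Counter


def majority_baseline_accuracy(labels):
    """Accuracy of a 'model' that always predicts the most common class."""
    counts = Counter(labels)
    return max(counts.values()) / len(labels)


def balanced_class_weights(labels):
    """Weight each class inversely to its frequency: n / (k * count)."""
    counts = Counter(labels)
    n, k = len(labels), len(counts)
    return {cls: n / (k * count) for cls, count in counts.items()}


# Hypothetical data (an assumption, not drawn from What Middletown Read):
# 5 of 100 patrons borrowed the author (1) and 95 did not (0).
labels = [1] * 5 + [0] * 95

baseline = majority_baseline_accuracy(labels)
weights = balanced_class_weights(labels)

print(f"always-predict-'did not borrow' accuracy: {baseline:.2f}")
print(f"class weights: {weights}")
```

With these made-up counts, a model that never predicts a borrowing is right 95% of the time while identifying no borrowers at all, which is why accuracy alone says little about a rare event. The weights make each borrower count for roughly as much as nineteen non-borrowers when a model is fit.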
Here is a copy of the slides that I used as well.
Though I won’t go into the detail here that I do in my presentation, the figure below shows concluding results for several important authors (after a few key sampling and method optimizations). Each dot represents a patron who either borrowed (orange) or did not borrow (blue) a book by the author named in each plot; the y-axis marks the probability assigned by the model to each case. The plots thus correspond to a classic confusion matrix: true positives in upper left, false positives in upper right, false negatives in lower left, and true negatives in lower right. I’ve printed the accuracy (proportion of all predictions that are correct) and sensitivity (proportion of actual borrowers whom the model correctly predicts) for each author model.
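For readers who want the two metrics spelled out, the following Python sketch computes them from a list of model probabilities and observed outcomes. The 0.5 cutoff, the function names, and the eight example patrons are all assumptions for illustration; they are not the values or code behind the figure.

```python
def confusion_counts(probs, borrowed, threshold=0.5):
    """Count TP, FP, FN, TN; the 0.5 threshold is an assumed cutoff."""
    tp = fp = fn = tn = 0
    for p, y in zip(probs, borrowed):
        predicted = p >= threshold
        if predicted and y:
            tp += 1  # predicted to borrow, and did
        elif predicted and not y:
            fp += 1  # predicted to borrow, but did not
        elif not predicted and y:
            fn += 1  # predicted not to borrow, but did
        else:
            tn += 1  # predicted not to borrow, and did not
    return tp, fp, fn, tn


def accuracy(tp, fp, fn, tn):
    """Proportion of all predictions that are correct."""
    return (tp + tn) / (tp + fp + fn + tn)


def sensitivity(tp, fn):
    """Proportion of actual borrowers the model correctly predicts."""
    return tp / (tp + fn) if (tp + fn) else float("nan")


# Hypothetical probabilities and outcomes for eight patrons.
probs = [0.9, 0.7, 0.3, 0.6, 0.2, 0.1, 0.4, 0.05]
borrowed = [1, 1, 1, 0, 0, 0, 0, 0]

tp, fp, fn, tn = confusion_counts(probs, borrowed)
print(f"accuracy={accuracy(tp, fp, fn, tn):.2f}, sensitivity={sensitivity(tp, fn):.2f}")
```

In this toy case the model is right for six of eight patrons (accuracy 0.75) but finds only two of the three actual borrowers (sensitivity of about 0.67), which is the gap the per-author plots are meant to expose.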
Based on the qualified success of prediction despite limitations outlined above, we can identify four key factors that improve the fit of models of whether patrons borrowed an author’s work – in other words, four key scenarios in which a patron’s underlying taste is more legible or internally consistent.
- Authors who were more popular as measured by more total checkouts, such as the sentimental novelist Mary Johnston and romance writer Booth Tarkington – but not extremely popular, as Louisa May Alcott was;
- Authors with many books held by the library, such as the prolific Horatio Alger despite his broad appeal and extraordinary popularity;
- Authors who wrote in a particular genre, movement, or otherwise relatively niche part of the literary marketplace, such as the realist Harold Frederic and the detective writer Anna Katharine Green;
- Authors who did not belong to the emergent popular – and thus intersectional – canon, unlike Sam Clemens aka Mark Twain, Nathaniel Hawthorne, and Charles Dickens (not pictured).
Several co-participants and I subsequently had a great forum discussion that touched on some additional issues surrounding multicollinearity and temporal unevenness when working with cultural data that I could only briefly address in my presentation. There’s clearly a lot more exciting work to be done in this area (and if anyone reading this is doing it, I hope you’ll drop me a line).