Monday, July 18, 2016

Doing Digital Research

One of the benefits of twenty-first-century technology is the availability of texts online in digital format. For printed government records or antiquarian books long out of print or copyright, one of the best repositories is the Internet Archive, "a non-profit library of millions of free books, movies, software, music, websites, and more." The site is very user-friendly, and there are literally billions of resources available, from the materials I'm looking for -- .pdf scans of eighteenth-century Acts and Resolves of the Rhode Island General Assembly and John Russell Bartlett's Records of the Colony of Rhode Island and Providence Plantations -- to an extensive audio library of thousands of Grateful Dead concerts, 2.3 million book titles from dozens and dozens of American Libraries, and over 491 billion web pages in an Internet "Wayback Machine," curating an important component of recent history that would otherwise be lost in cyberspace.

Of course with all that cool stuff just one click away, one must be disciplined and not begin exploring all the rabbit holes at the Internet Archive...

Google Books is another online resource for digital documents, but it is, in my humble opinion, less useful than the Internet Archive. Their downloadable scans are image scans rather than OCR scans, so they are not keyword searchable (more on OCR later). And since the last time I did any serious digital research (I have purposefully taken the last two summers off from pursuing any new research projects to work on other things), Google appears to have taken a lot of documents that were previously downloadable and put them into a viewer system that I find cumbersome and difficult to navigate. While these are keyword searchable, in my experience serendipity plays a larger role than one might suspect -- I like to see the entire page rather than just the narrow slice of text in the viewer. One never knows what is right before or after the text that comes up in a keyword search -- often it is of little interest, but it happens often enough that the rest of the page turns out to be more important than the search term... Of course, all of this -- the unsearchable document scans, the snippets in the viewer -- is due to Google being sued in Authors Guild v. Google and the resulting decision that found in favor of Google in large part because of their "snippets" policy.

Interestingly, while the Internet Archive and its Wayback Machine have, like Google, been targeted by lawsuits alleging copyright infringement, the Internet Archive, as a member of the Open Book Alliance, was one of "the most outspoken critics of the Google Book Settlement" and (unsuccessfully) challenged the court ruling that allowed Google Books to continue.

Then comes researching the texts of the .pdf files I have downloaded from the Internet. For this phase, my weapon of choice is the PDF-XChange Viewer. Unlike Adobe, which costs beaucoup bucks and is constantly spamming unfortunate users with its latest "security update," PDF-XChange is free and doesn't relentlessly bug users to update it. In fact, it has never bothered me to do anything ever after I installed it, though there is a commercial upgrade, the PDF-XChange Editor. The Editor has some very useful functionality and, at $43, it is far less cheddar than Adobe's cheapest .pdf-editing program, which starts at $119. (Disclaimer: I bought a copy of it for the WRICHS Archive PC, and it has been a great tool for us that didn't break the bank.)

Now that I have downloaded my sources and opened them in the .pdf editing program, the next step is to use the editor's OCR (optical character recognition) to "rasterize" the document. This is a CPU-intensive task and fairly time-consuming, even on a relatively new computer. For instance, as I type this I am having PDF-XChange OCR Volume IV of John Bartlett's Records of the Colony of Rhode Island and Providence Plantations, usually abbreviated as RICR. At 636 pages and 15-20 seconds per page, it will be roughly two and a half to three and a half hours before the file is rendered searchable (longer if I opt to use my computer while it rasterizes in the background -- such as writing this blog entry about digital research). When the OCR is done, I will be able to enter a search term and find all the instances where it appears.

In this case, the term I will be looking for in RICR Volume IV is pox, for an article I am writing about smallpox in 17th and 18th century Rhode Island. Once the .pdf has been rasterized, I'll type the term "pox" into the search window, and if it is anywhere in the text, the viewer will take me to each page on which "pox" appears, starting with the first instance. Then I can screenshot the page using IrfanView (another great free program useful for quickly editing images like screenshots) and paste the screenshot into a Word .doc I keep open. When I am finished, I will have a repository with all the references to smallpox from Bartlett in one place. If I decide I would like to quote from the original .pdf, I can manually transcribe it or I can use the copy function in PDF-XChange to highlight, ctrl-c and ctrl-v the text right into the draft of my article. Note that the idiosyncrasies of eighteenth-century typeface don't always translate 100% with ye old "cut and paste" from a rasterized source.
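For anyone who would rather script this step than click through a viewer, here is a minimal Python sketch of the same page-by-page keyword search. It assumes the OCR text has already been exported as one string per page; the function name find_keyword_pages and the sample sentences are invented for illustration and are not quoted from RICR or taken from PDF-XChange.

```python
import re

def find_keyword_pages(pages, keyword, context=40):
    """Return (page_number, snippet) for every occurrence of keyword.

    pages: list of strings, one per page of OCR text (page 1 first).
    The match is case-insensitive and hits inside longer words count.
    """
    pattern = re.compile(re.escape(keyword), re.IGNORECASE)
    hits = []
    for page_number, text in enumerate(pages, start=1):
        for match in pattern.finditer(text):
            # Keep a little surrounding text so each hit can be judged in context.
            start = max(match.start() - context, 0)
            end = min(match.end() + context, len(text))
            snippet = " ".join(text[start:end].split())
            hits.append((page_number, snippet))
    return hits

# Invented sample text standing in for OCR output; NOT quoted from RICR.
sample_pages = [
    "The Assembly met at Newport and adjourned.",
    "Ordered, that vessels with the small pox aboard be kept from the wharf.",
    "An act concerning the fmall-pox, and a further order on the Small Pox.",
]

for page_number, snippet in find_keyword_pages(sample_pages, "pox"):
    print(page_number, snippet)
```

In this toy sample, the short stem "pox" catches "small pox," "Small Pox," and a long-s misreading like "fmall-pox," while a search for "smallpox" as one word finds nothing. Real OCR output will be messier than this.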

So far, my searches have identified no references to smallpox in Bartlett earlier than 1690, when a serious outbreak struck Rhode Island that crippled the colony's legislature and court system and left several town and colony officials dead. Thereafter, references to smallpox become more frequent. The colony eventually addressed the problem by passing strict quarantine laws for both towns and ships in the first quarter of the eighteenth century.

One question emerges -- why are there no references to smallpox in Bartlett's RICR before 1690? Certainly, smallpox did not appear in Rhode Island for the first time in 1690. Several possible answers come to mind. First, colonists did not travel much in the early years of the colony. Rhode Island utterly lacked what would be considered passable roads, relying on "Indian paths" until the King's Highway was surveyed and built after 1703. Also, since Rhode Islanders were regarded as religious and social pariahs by the Puritans in neighboring Massachusetts and Connecticut, few Englishmen from neighboring colonies desired to travel through the colony. In any event it was far easier to travel around Rhode Island by water than through it by land in the 1600s, which limited the chances of travelers introducing the infection. Likewise, Newport's mercantile economy did not emerge until the 1690s, so opportunities for smallpox to enter the colony through trade were far fewer in the seventeenth century than they would become once Newport and later Providence became centers of Atlantic commerce.

Second, the majority of people living in Rhode Island before 1675 were not Englishmen but rather the Narragansett. It is unlikely that the laconic English records would have noted outbreaks of smallpox among the Indian population, even if they were quite severe. Perhaps the worst outbreak of smallpox among the native population in southern New England occurred from 1632 to 1634; the Narragansett experienced an epidemic in 1633 and another in 1635 that killed hundreds of tribal members, ending before Roger Williams founded Rhode Island in 1636.

In the wake of the mass movement of both Natives and English during King Philip's War, a smallpox epidemic struck southern New England, as noted in Boston records. But given that nearly every building on the mainland in Rhode Island had been damaged or destroyed during the war, it is not surprising that an outbreak of smallpox was overlooked (or records of it lost) at a time when so many inhabitants were homeless and the colony nearly destroyed. It is important to note that Rhode Island's seventeenth-century records are spotty even in times of health and prosperity. This pattern continued well into the eighteenth century; for instance, it did not occur to Rhode Island's government to bind all its laws into a single manuscript until 1705, and the laws remained unprinted and inaccessible to the public until 1719.

Finally, the RICR are themselves notoriously incomplete -- if Bartlett did not consider a particular fact "important" enough in the original hand-written records he was working from, he did not transcribe and include it. Historians have noted such discrepancies between his printed transcriptions and the original handwritten manuscripts (referred to as Colony Records) in the Rhode Island State Archives. However, this issue is more common the later (and more voluminous) the original manuscripts were. Rhode Island also began having the General Assembly's hand-written records transcribed and professionally printed circa 1750. For the years where there are printed Acts and Resolves of the General Assembly (also all scanned and available on the Internet Archive) it is useful to supplement Bartlett with those sources.

Another notable problem is the weak indexing of colonial-era government records and other antiquarian sources. A word to the wise -- do not rely on the index to find information! Each volume of Bartlett's Records of the Colony of Rhode Island and Providence Plantations has an index, but most of the references to a particular term in the text are not found there. In fact, a keyword may not appear in the index at all despite appearing repeatedly in the text. For example, the index in RICR Volume VI has a single listing for smallpox -- that the General Assembly passed a smallpox inoculation act (see below; note the highlighting of the keyword in the text by the OCR). However, a digital keyword search for pox in Volume VI turned up five discrete instances of the use of the term, including a lengthy obituary for former Rhode Island governor Samuel Ward, who died of smallpox in Philadelphia in March 1776 while representing the state in the Continental Congress, a 1772 resolution allowing a lottery to fund the rebuilding of Newport's smallpox hospital on Coaster's Harbor Island, and another resolution during the Revolutionary War ordering eleven towns across the state to designate smallpox inoculation hospitals.
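The gap between index and text is easy to quantify once both lists of pages are in hand. The following is only a rough Python sketch: the helper name missed_by_index is made up, and the page numbers are hypothetical placeholders rather than the actual pages in RICR Volume VI.

```python
def missed_by_index(index_pages, search_pages):
    """Pages the full-text search found that the printed index does not list."""
    return sorted(set(search_pages) - set(index_pages))

# Hypothetical page numbers for illustration only; NOT the real Volume VI pages.
index_pages = [120]                       # the single index entry
search_pages = [45, 120, 233, 410, 587]   # five hits from a keyword search

print(missed_by_index(index_pages, search_pages))
```

With one index entry and five search hits, four of the five references would have been missed by relying on the index alone.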

In any event, working from home beats driving to Providence and pulling these same sources off the shelf or loading them into a microfilm viewer (though the Rhode Island State Archives ARE air-conditioned, unlike my house...) Ultimately, keyword searches are far more efficient than reading through literally thousands of pages of irrelevant (and often distracting) text to find (or just as likely, miss) that first reference to smallpox in 1690, 54 years into the records. Software simply cannot make the errors that human beings may, with the result being that digital research is more thorough than would be otherwise humanly possible.

Somewhere near obsession...
