Photographing Cabeza de Vaca’s La Relación

Erin and I have had a blast visiting the Wittliff Collections on the 7th floor of Alkek Library while we photographed Cabeza de Vaca’s La Relación this week.

Digital & Web Services is partnering with the Wittliff Collections for a forthcoming update to the current online exhibit.

Here are a few behind-the-scenes photos I made while we worked today.

Jeremy Moore

jeremy.moore@txstate.edu

Old Main as seen from the 7th floor of Alkek Library. Photograph by Jeremy Moore

Erin Mazzei photographing Cabeza de Vaca’s La Relación.

Photographing Cabeza de Vaca’s La Relación.

Cabeza de Vaca’s La Relación

Searching the Library of Congress with Python and their new JSON API, Part 2

This post builds on my last one: Searching the Library of Congress with Python and their new JSON API, which is why I’ve added Part 2 to the end of the title. Before we dive back into the Library of Congress’s JSON API, some housekeeping items:

  1. Even though the Library of Congress’s website is loc.gov, the abbreviation for Library of Congress is LC
    1. I tried to find a press-ready image of NBC’s old The More You Know logo I could add here, but
      1. the updated logo doesn’t make me hear the jingle in my head
      2. I did find Megan Garber’s 2014 article covering the PSA series for The Atlantic that has some classic video I enjoyed
  2. As of October 2017, LC has expressly stated in a disclaimer that their JSON API is a work in progress. Use at your own risk!! We might (will likely) change this!

Recap on Stryker’s Negatives Project

I recently came across Michael Bennett’s article, Countering Stryker’s Punch: Algorithmically Filling the Black Hole, in the latest edition of code4lib. Great stuff!

He’s using Adobe Photoshop and GIMP to digitally restore the blank areas left in images where a hole punch was taken to the physical negative.

Current Task

Use the Library of Congress’s JSON API to download all of the hole punch images and their associated metadata.
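
As a starting point, here’s a rough sketch of how that download loop might look, reusing the search-and-pagination pattern covered in Part 1 (further down this page). It only saves each item’s metadata JSON to disk; the output folder name is my own choice, and pulling the actual image files would still require checking which key in the item-level record exposes a downloadable image URL.

# sketch only: walk every page of the hole-punch query and save each item's JSON metadata
from __future__ import print_function # Python 2.7, matching the rest of this post
import json
import os
import requests
from time import sleep

url = 'https://www.loc.gov/pictures/search/'
query = 'Negative has a hole punch made by FSA staff to indicate that the negative should not be printed.'
payload = {'q': query, 'fo': 'json'}

out_dir = 'hole_punch_metadata' # my own choice of folder name
if not os.path.exists(out_dir):
    os.makedirs(out_dir)

response_json = requests.get(url, params=payload).json()

while True:
    for result in response_json['results']:
        # each result links to its own item-level JSON record
        item_url = 'https:' + result['links']['item'] + '?&fo=json'
        item_json = requests.get(item_url).json()

        # save the metadata, one file per item, named by the result's index
        out_path = os.path.join(out_dir, 'item_{}.json'.format(result['index']))
        with open(out_path, 'w') as f:
            json.dump(item_json, f)

    next_page = response_json['pages']['next']
    if next_page is None:
        break
    response_json = requests.get('https:' + next_page, params={'fo': 'json'}).json()
    sleep(2) # be polite between page requests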


Searching the Library of Congress with Python and their new JSON API

I recently read Michael J. Bennett’s article, Countering Stryker’s Punch: Algorithmically Filling the Black Hole, in the latest edition of code4lib and wow, great stuff!

Bennett compares Adobe Photoshop’s Content-Aware Fill tool with a GIMP technique I hadn’t seen before as part of a workflow to digitally restore the blank areas in photographs where a hole punch was taken to the physical negative, part of Roy Stryker’s legacy at the FSA. You can read more about the negatives in Bennett’s article, in a recent feature by Alan Taylor in The Atlantic, and in Eric Banks’s review for The Paris Review of what sounds like an amazing 2010 New York gallery show by William E. Jones.

Library of Congress Search Results totaled 2,474

I have been dipping my toes into computer vision recently and thought expanding on Bennett’s work would be a good, educational challenge. A quick search at the Library of Congress returns nearly 2,500 results, so this would be a good-sized data set for a beginner to work on: a bunch of exceptions and edge cases to tweak for, but also a LOT of images and metadata to gather!

Coincidentally, the Library of Congress, which holds digital scans of the punched negatives, recently advertised a new JSON API for accessing their materials, so one could learn how to hook Python into the API and automate the data gathering.

I highly suggest starting with the Library of Congress’s GitHub account and their data-exploration repository, specifically the LOC.gov JSON API.ipynb Jupyter Notebook. If you’re new to working with code, notebooks make getting started much simpler. A Jupyter Notebook runs in your browser and consists of a column of cells; each cell is a text box that can contain Markdown or code. You can run any code you write right in the page, and it will print its output below the cell. You can display multimedia, too. I think of it like a blog post with in-line code and use it as a sandbox for building ideas.
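
For example, a single notebook cell can run code and render media inline. A minimal sketch (the image URL below is just a placeholder):

# run this inside a Jupyter Notebook cell
from IPython.display import Image, display

print('code output appears right below the cell')
display(Image(url='https://example.com/some-image.jpg')) # placeholder URL; any web image works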

I need to get a handle on version control and backing my code up to GitHub so it’s easily accessible, but the following is a very edited account of my recent travels in cyberspace.

Getting JSON back from a query

# import libraries
from __future__ import print_function # I'm using Python 2.7
import requests

# set our search url and payload as shown in the quick start guide for requests
# http://docs.python-requests.org/en/master/user/quickstart/
url = 'https://loc.gov/pictures/search/'

# payload info is pulled from: https://www.loc.gov/pictures/api
payload = {'q': 'Negative has a hole punch made by FSA staff to indicate that the negative should not be printed.', 'fo': 'json'}

response_json = requests.get(url, params=payload).json()
print(response_json) # print the response

Output

{u'search': {u'hits': 2458, u'focus_item': None, u'sort_by': None, u'field': None, u'sort_order': None, u'do_facets': True, u'query': u'Negative has a hole punch made by FSA staff to indicate that the negative should not be printed.', u'type': u'search'}, u'links': {u'json': u'//loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&fo=json', u'html': u'//loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.', u'rss': u'//loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&fo=rss'}, u'views': {u'current': u'list', u'list': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.', u'grid': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&st=grid', u'gallery': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&st=gallery', u'slideshow': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&st=slideshow'}, u'facets': [{u'type': u'displayed', u'filters': [{u'count': 2458, u'on': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&fa=displayed%3Aanywhere&sp=1&fo=json', u'selected': False, u'off': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&sp=1&fo=json', u'title': u'Larger image available anywhere'}, {u'count': 0, u'on': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&fa=displayed%3Alc_only&sp=1&fo=json', u'selected': False, u'off': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&sp=1&fo=json', u'title': u'Larger image available only at the Library of Congress'}, {u'count': 0, u'on': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&fa=displayed%3Anot_digitized&sp=1&fo=json', u'selected': False, u'off': u'//www.loc.gov/pictures/search/?q=Negative+has+a+hole+punch+made+by+FSA+staff+to+indicate+that+the+negative+should+not+be+printed.&sp=1&fo=json', u'title': u'Not Digitized'}], u'title': u'displayed'}], u'suggestions': {u'possible': []}, u'focus': None, u'collection': None, u'results': [{u'source_created': u'1997-05-08 00:00:00', u'index': 1, u'medium': u'1 negative : nitrate ; 35 mm.', u'reproduction_number': u'LC-USF33-002983-M3 (b&w film nitrate neg.)\nLC-DIG-fsa-8a10673 (digital file from original neg.)', u'links': {u'item': u'//www.loc.gov/pictures/item/2017724471/', u'resource': u'//www.loc.gov/pictures/item/2017724471/resource/'}, u'title': u'[Untitled photo, possibly related to: U.M.W.A. (United Mine Workers of Ameri

That was only 10% of the whole output!

Well, it was actually the number of characters received divided by 10 without the remainder.* The JSON is pretty readable, but still a lot to digest.
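
If you’d rather stay in Python, integer division does the same truncation; here’s a hypothetical equivalent of the Terminal trick described at the end of this post:

# keep only the first tenth of the printed response; // is integer (floor) division
text = str(response_json)
print(text[:len(text) // 10])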

Two things in the output I notice right away are the word hits and the number 2458. This is probably the total number of hits returned by our query. The total number of results, or hits, is a metric I’ll want for just about every search, so here are some baby functions to save typing later.

# import libraries
from __future__ import print_function # I'm using Python 2.7
import requests

# define functions
# get the json
def get_loc_json(url, payload):
    return requests.get(url, params=payload).json()

# search function with optional collection
def query_loc_json(query, collection=None):

    # we'll try everything in https
    url = 'https://www.loc.gov/pictures/search/'

    # we need a payload of parameters for our search function
    # that are in key: value pairs
    # Code for payload comes from the loc.gov API page
    if collection is not None:
        payload = {
            'q': query,
            'co': collection,
            'fo': 'json'
        }
    else:
        payload = {
            'q': query,
            'fo': 'json'
        }

    # give us back json!
    return get_loc_json(url, payload)

# how many hits for a query do we receive
def get_loc_hits(response_json):
    return response_json['search']['hits']

query = 'Negative has a hole punch made by FSA staff to indicate that the negative should not be printed.'
response_json = query_loc_json(query)
loc_hits = get_loc_hits(response_json)

print("Your search '{}' returned {} hits.".format(query, loc_hits))

Output

Your search 'Negative has a hole punch made by FSA staff to indicate that the negative should not be printed.' returned 2458 hits.

How about titles?

Yep, we can get those too, but instead of living under search like hits does, the information we want now is the actual title of each item returned by our query. Looking at the API documentation, we can see that items are returned in results and each result has a title. I am appending a new line to the end of each title by adding '\n' to item['title'] just so the output is a bit easier to read.

for item in response_json['results']:
    print(item['title'] + '\n')

Output

[Untitled photo, possibly related to: U.M.W.A. (United Mine Workers of America) official, Herrin, Illinois. General caption: Williamson County, Illinois once produced 11,000,000 tons of coal a year, and led the state in output.  Since 1923 output has steadily declined until now it falls short of 2,000,000 tons. At one time, sixteen mine whistles blowing to work could be heard from the center of Herrin. Now only two mines are working and these two will probably be abandoned within the next year. The Herrin office of the U.M.W.A. was once the most active in the state. Today it is no longer self sustaining. These pictures were taken in the Herrin U.M.W.A office on a day when the mines were not working]

[Untitled photo, possibly related to: Sunday school picnic. Much of the food brought into abandoned mining town of Jere, West Virginia by "neighboring folk" from other parishes. There is a great deal of "hard feelings" and many fights between Catholics and Protestants. Miners as a whole are not very religious, many not having any connections with church, though they may have religious background. Hard times has caused this to a certian extent. As one said, "The Catholic church expected us to give when we just didn't have it"]

[Untitled photo, possibly related to: Loafers' wall, at courthouse, Batesville, Arkansas. Here from sun up until well into the night these fellows, young and old, "set". Once a few years ago a political situation was created when an attempt was made to remove the wall. It stays. When asked what they do there all day, one old fellow replied: "W-all we all just a 'set'; sometimes a few of 'em get up and move about to 'tother side when the sun gets too strong, the rest just 'sets'."]

[Untitled photo, possibly related to: Loafers' wall, at courthouse, Batesville, Arkansas. Here from sun up until well into the night these fellows, young and old, "set". Once a few years ago a political situation was created when an attempt was made to remove the wall. It stays. When asked what they do there all day, one old fellow replied: "W-all we all just a 'set'; sometimes a few of 'em get up and move about to 'tother side when the sun gets too strong, the rest just 'sets'."]

[Untitled photo, possibly related to: Loafers' wall, at courthouse, Batesville, Arkansas. Here from sun up until well into the night these fellows, young and old, "set". Once a few years ago a political situation was created when an attempt was made to remove the wall. It stays. When asked what they do there all day, one old fellow replied: "W-all we all just a 'set'; sometimes a few of 'em get up and move about to 'tother side when the sun gets too strong, the rest just 'sets'."]

[Untitled photo, possibly related to: Wife of a prospective client, Brown County, Indiana. Husband and wife will be resettled on new land when their property has been purchased by the government]

[Untitled photo, possibly related to: Wife of a prospective client, Brown County, Indiana. Husband and wife will be resettled on new land when their property has been purchased by the government]

[Untitled photo, possibly related to: Wife of a prospective client, Brown County, Indiana. Husband and wife will be resettled on new land when their property has been purchased by the government]

[Untitled photo, possibly related to: Wife of a prospective client, Brown County, Indiana. Husband and wife will be resettled on new land when their property has been purchased by the government]

[Untitled photo, possibly related to: Jefferson furnace-made iron for "Monitor" in Civil War, not far from Jackson, Ohio]

[Untitled photo, possibly related to: Home of a family of ten that has been on relief for eighteen months, Brown County, Indiana]

[Untitled photo, possibly related to: Home of a family of ten that has been on relief for eighteen months, Brown County, Indiana]

[Untitled photo, possibly related to: Home of a family of ten that has been on relief for eighteen months, Brown County, Indiana]

[Untitled photo, possibly related to: Home of a family of ten that has been on relief for eighteen months, Brown County, Indiana]

[Untitled photo, possibly related to: Home of a family of ten that has been on relief for eighteen months, Brown County, Indiana]

[Untitled photo, possibly related to: Home of a family of ten that has been on relief for eighteen months, Brown County, Indiana]

[Untitled photo, possibly related to: Home of a family of ten that has been on relief for eighteen months, Brown County, Indiana]

[Untitled photo, possibly related to: Home of a family of ten that has been on relief for eighteen months, Brown County, Indiana]

[Untitled photo, possibly related to: The "water company" was formed by the people in abandoned mining town of Jere, West Virginia, after the coal company cut off all public services because they were abandoning the mine. The coal company used to charge one dollar and twenty-cove cents per month for water. The present "people's" water company charges its members twenty-five cents per month and makes money at that even when everyone can't pay dues. There are dividends of flour, sugar, lard, etc. The lock is necessary to keep people from other camps from stealing the water, which is very scarce. It's still necessary to change the lock about every four months]

[Untitled photo, possibly related to: One of the more elaborate gopher holes, equipped with a screen shaker, Williamson County, Illinois]

Long titles!

Yes, and how many titles did we receive? We could count, but we can quickly find the number of items in results with len()

len(response_json['results'])

Output

20

How long are the individual titles?

Good thinking! If we can find the number of items in results (20) and we have already located the title for each result, can we figure out how long each of those titles is?

Can we also identify the shortest and longest titles out of our 20 while we’re at it?

Could we Tweet full titles?

# we will use enumerate to get an index number for each result
# https://docs.python.org/2/library/functions.html#enumerate

# start with no longest or shortest title yet
longest_title = None
shortest_title = None

for index, result in enumerate(response_json['results'], start=1): # default start=0

    title = result['title'] # title for each result
    title_length = len(title) # length of title for each result

    # is it the longest?
    if longest_title is None: # if we have no longest title yet
        longest_title = title # this title must be the longest
    elif title_length > len(longest_title): # else if our title is longer than the longest title
        longest_title = title # our title is the new longest_title

    # is it the shortest?
    if shortest_title is None:
        shortest_title = title
    elif title_length < len(shortest_title):
        shortest_title = title

    # is it Tweetable?
    if title_length <= 140: # is it less than or equal to 140?
        is_tweetable = True # then yes it is!
    else:
        is_tweetable = False # Otherwise, nope

    print('Title {}: {} characters.'.format(index, title_length))
    print('Is tweetable: {}'.format(is_tweetable))

print('\n' + 'Longest title contains {} characters.'.format(len(longest_title))) # add a new line so it's easier to read
print(longest_title)

print('\n' + 'Shortest title contains {} characters.'.format(len(shortest_title)))
print(shortest_title)

Output

Title 1: 708 characters.
Is tweetable: False
Title 2: 531 characters.
Is tweetable: False
Title 3: 485 characters.
Is tweetable: False
Title 4: 485 characters.
Is tweetable: False
Title 5: 485 characters.
Is tweetable: False
Title 6: 195 characters.
Is tweetable: False
Title 7: 195 characters.
Is tweetable: False
Title 8: 195 characters.
Is tweetable: False
Title 9: 195 characters.
Is tweetable: False
Title 10: 121 characters.
Is tweetable: True
Title 11: 129 characters.
Is tweetable: True
Title 12: 129 characters.
Is tweetable: True
Title 13: 129 characters.
Is tweetable: True
Title 14: 129 characters.
Is tweetable: True
Title 15: 129 characters.
Is tweetable: True
Title 16: 129 characters.
Is tweetable: True
Title 17: 129 characters.
Is tweetable: True
Title 18: 129 characters.
Is tweetable: True
Title 19: 665 characters.
Is tweetable: False
Title 20: 137 characters.
Is tweetable: True

Longest title contains 708 characters.
[Untitled photo, possibly related to: U.M.W.A. (United Mine Workers of America) official, Herrin, Illinois. General caption: Williamson County, Illinois once produced 11,000,000 tons of coal a year, and led the state in output.  Since 1923 output has steadily declined until now it falls short of 2,000,000 tons. At one time, sixteen mine whistles blowing to work could be heard from the center of Herrin. Now only two mines are working and these two will probably be abandoned within the next year. The Herrin office of the U.M.W.A. was once the most active in the state. Today it is no longer self sustaining. These pictures were taken in the Herrin U.M.W.A office on a day when the mines were not working]

Shortest title contains 121 characters.
[Untitled photo, possibly related to: Jefferson furnace-made iron for "Monitor" in Civil War, not far from Jackson, Ohio]

Where are the other 2,438 items?

This is where the Library of Congress’s JSON API examples were very helpful: moving from page to page. My code ended up a bit different from theirs, though.

response_json = query_loc_json(query)

while True: # only get a next page if there is one!
    for result in response_json['results']:
        print(result['index']) # we didn't have to enumerate before as LOC provides us with an index for each item in the JSON
    next_page = response_json['pages']['next'] # this line differs from LOC JSON API guide
    if next_page is not None:
        url = 'https:' + next_page # the JSON does not supply the https: for us so we need to add it
        payload = {'fo': 'json'} # and we still need to supply format to get the JSON
        response_json=get_loc_json(url, payload)
    else:
        break

Output

1
2
3
4
5
6
7 ...
... 2455
2456
2457
2458

2458!

Yep, we actually have 2,458 titles, one for each hit, though I cut a few so you didn’t have to scroll through them all. In addition to the next page, response_json['pages'] can tell us how many results we have per page, our current page, and the total number of pages. Here’s the code to see which fields are exposed to us under pages.

print("Fields in response_json['pages']\n")

for field in response_json['pages']:
    print(field)

Output

Fields in response_json['pages']

perpage
last
results
first
next
current
page_list
total
previous

Lists & Key Pairs

We have a list of keys, each paired with a value, though one of those possible values is None. So what is contained in each of the fields under response_json['pages']?

Let’s run this one with a new query, in case you’ve already surmised the total for Stryker’s Negatives. I also want to double-check that the code works for more than just one query.

# input new entry
query = 'Texas'

# fill the payload with the new search term[s] and get a response in JSON
response_json = query_loc_json(query)

print("Your search '{}' returned {} hits.".format(query, get_loc_hits(response_json)))

# What is the current page?
current_page = response_json['pages']['current']
print('\ncurrent page: {}'.format(current_page))

# What about total, perpage, results, and next?
fields = ['total', 'perpage', 'results', 'next']

for field in fields: # for each field we want
    if response_json['pages'][field]: # if it has a value
        print('\n{}: {}'.format(field, response_json['pages'][field]))


Output

Your search 'Texas' returned 31550 hits.

current: 1

total: 1578

perpage: 20

results: 1 - 20

next: //www.loc.gov/pictures/search/?q=Texas&sp=2

1578 total pages

If we do the math with 20 per page, that’s a possible 31,560 items. We have 31,550 hits, so there must be only 10 hits on the last page. We can find the number of items on the last page quickly using len() and last.

last_page = response_json['pages']['last']

last_page_url = 'https:' + last_page + '&fo=json'

last_page_json = requests.get(last_page_url).json()

len(last_page_json['results'])

Output

10

Let’s Jump Forward

We’re now able to output more of the metadata in a controlled fashion: for instance, the name of the photographer, the date, and the location associated with the first 25 images from the negatives Roy Stryker had hole punched.
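
I’ll skip ahead here; below is a condensed sketch of the kind of loop that produced this output. The full script, with pagination and cleanup of the names, dates, and locations, follows later in this post. It assumes response_json still holds the results of the hole-punch query.

import requests # already imported at the top of this post; repeated here for clarity

for item in response_json['results']:
    if item['index'] < 26: # first 25 images only
        print('image {}'.format(item['index']))

        # each result links to its own item-level JSON record
        item_url = 'https:' + item['links']['item'] + '?&fo=json'
        item_json = requests.get(item_url).json()

        # photographer is under item > creators > title,
        # date under item > date, and location under item > place > title
        for creator in item_json['item']['creators']:
            print('photographer: {}'.format(creator['title']))
        print('date: {}'.format(item_json['item']['date']))
        for place in item_json['item']['place']:
            print('location: {}'.format(place['title']))
        print('')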

Output

image 1
photographer: Rothstein, Arthur, 1915-1985
date: [1939 Jan.]
location: Illinois--Herrin

image 2
photographer: Wolcott, Marion Post, 1910-1990
date: [1938 Sept.]
location: West Virginia--Jere

image 3
photographer: Mydans, Carl
date: [1936 June]
location: Arkansas--Batesville

image 4
photographer: Mydans, Carl
date: [1936 June]
location: Arkansas--Batesville

image 5
photographer: Mydans, Carl
date: [1936 June]
location: Arkansas--Batesville

image 6
photographer: Jung, Theodor, 1906-1996
date: [1935 Oct.]
location: Indiana--Brown County

image 7
photographer: Jung, Theodor, 1906-1996
date: [1935 Oct.]
location: Indiana--Brown County

image 8
photographer: Jung, Theodor, 1906-1996
date: [1935 Oct.]
location: Indiana--Brown County

... ...

image 23
photographer: Rothstein, Arthur, 1915-1985
date: [1935 Oct.]
location: Virginia--Shenandoah National Park

image 24
photographer: Rothstein, Arthur, 1915-1985
date: [1935 Oct.]
location: Virginia--Shenandoah National Park

image 25
photographer: Rothstein, Arthur, 1915-1985
date: [1935 Oct.]
location: Virginia--Shenandoah National Park

Skipped entries for brevity’s sake

This is great info, but to get a clearer picture, let’s remove some of it and get a better sense of our data.

#import libraries
from __future__ import print_function # this is Python 2.7 code
import requests
from time import sleep
import os

# set our search url and payload as shown in requests's quick guide:
# http://docs.python-requests.org/en/master/user/quickstart/
url = 'https://loc.gov/pictures/search/' # try to always use https
# payload info from:
# https://www.loc.gov/pictures/api#Search_15626900281916867_13804_7288827267039479
payload = {'q': 'Negative has a hole punch made by FSA staff to \
indicate that the negative should not be printed.', 
    'fo': 'json'}

response_json = requests.get(url, params=payload).json()

page_count = 0

while page_count < 2: # stop after two pages; only get a next page if there is one!
    for item in response_json['results']:

        # let's limit to a max index of 25
        item_index = item['index']
        if item_index < 26:
            print('image {}'.format(item['index']))
            
            # answer more questions with metadata from the item page
            item_url = 'https:' + item['links']['item'] + '?&fo=json'
            item_json = requests.get(item_url).json()

            for field in item_json['item']['place']:
                location = field['title']

            # photographers name is found under item > creators > title
            for field in item_json['item']['creators']:
                name = field['title']

            # and date right under date
            date = '{}'.format(item_json['item']['date'])

            # first print of data
            #print('photographer: {}\n'.format(name) + \
            #      'date: {}\n'.format(date) + \
            #      'location: {}\n'.format(location))

            # can we simplify this more?
            # image 12
            # photographer: Jung, Theodor, 1906-1996
            # date: [1935 Oct.]
            # location: Indiana--Brown County

            # for name we can split on commas
            # NOTE: some names do NOT have life dates!
            try:
                last, first, life = [x.strip() for x in name.split(',')]
                name = first + ' ' + last
            except ValueError:
                last, first = [x.strip() for x in name.split(',')]
                name = first + ' ' + last

            # keep it simple and just find the digits
            # then cat them together to get the date
            digits = [x for x in date if x.isdigit()]
            date = ''.join(digits)

            # state before county or park
            state, county = [x.strip() for x in location.split('--')]
            location = state

            # 2nd time printing
            print('photographer: {}\n'.format(name) + \
                  'date: {}\n'.format(date) + \
                  'location: {}\n'.format(location))


    next_page = response_json['pages']['next']
    if next_page is not None:
        url = 'https:' + next_page # JSON does not supply https:
        payload = {'fo': 'json'} # and we still need to request JSON
        response_json = requests.get(url, params=payload).json()
        page_count += 1 # increment page counter

        sleep(2) # sleep for 2 seconds between pages
    else:
        break

Output

image 1
photographer: Arthur Rothstein
date: 1939
location: Illinois

image 2
photographer: Marion Post Wolcott
date: 1938
location: Illinois

image 3
photographer: Carl Mydans
date: 1936
location: Illinois

image 4
photographer: Carl Mydans
date: 1936
location: Illinois

image 5
photographer: Carl Mydans
date: 1936
location: Illinois

image 6
photographer: Theodor Jung
date: 1935
location: Illinois

image 7
photographer: Theodor Jung
date: 1935
location: Illinois

... ...

image 24
photographer: Arthur Rothstein
date: 1935
location: Illinois

image 25
photographer: Arthur Rothstein
date: 1935
location: Illinois

Pretty powerful stuff

This is a good point to end this blog post. It has been a quick introduction to accessing the Library of Congress’s website with Python through their new JSON API. Below are a graph and some fiddly bits I learned about copying to and pasting from my clipboard in Terminal on macOS. I don’t know whether comments are turned on for this blog, but you can find me at jeremy.moore@txstate.edu.

#import libraries
from __future__ import print_function # this is Python 2.7 code
import requests
from time import sleep
import pandas as pd
# Jupyter/IPython magic so the plot below displays inline in the notebook
%matplotlib inline
from collections import Counter

location_counter = Counter()

# set our search url and payload as shown in requests's quick guide:
# http://docs.python-requests.org/en/master/user/quickstart/
url = 'https://loc.gov/pictures/search/' # try to always use https
# payload info from:
# https://www.loc.gov/pictures/api#Search_15626900281916867_13804_7288827267039479
payload = {'q': 'Negative has a hole punch made by FSA staff to \
indicate that the negative should not be printed.', 
    'fo': 'json'}

response_json = requests.get(url, params=payload).json()

page_count = 0

while page_count < 2: # stop after two pages; only get a next page if there is one!
    for item in response_json['results']:

        # let's limit to a max index of 25
        item_index = item['index']
        if item_index < 26:
            print('image {}'.format(item['index']))
            
            # answer more questions with metadata from the item page
            item_url = 'https:' + item['links']['item'] + '?&fo=json'
            item_json = requests.get(item_url).json()

            for field in item_json['item']['place']:
                location = field['title']

            # state before county or park
            state, county = [x.strip() for x in location.split('--')]
            location = state

            location_counter[location] += 1

    next_page = response_json['pages']['next']
    if next_page is not None:
        url = 'https:' + next_page # JSON does not supply https:
        payload = {'fo': 'json'} # and we still need to request JSON
        response_json = requests.get(url, params=payload).json()
        page_count += 1 # increment page counter

        sleep(2) # sleep for 2 seconds between pages
    else:
        break

location = pd.Series(location_counter)
location.sort_values(ascending=True).plot(kind='barh', figsize=(8, 8))
for value in location.sort_values(ascending=False):
    print(value)

Output

12
4
3
3
2
1

Graph of locations in first 25 results returned by our query

  • NOTE: How I quickly calculated the first 1/10 of my output using the Terminal on my MacBook:

pbpaste and pbcopy will paste from and copy to your Mac’s clipboard, so I selected all of the text from our output and entered the command below in Terminal. The ‘|‘ pipe character (found on the key under delete on my MacBook keyboard) takes output from one program and sends it to another. Think of Mario and Luigi’s pipes: instead of a plumber moving between levels, it’s your data moving between programs.

In the examples below, $ some stuff is what I entered on the command line (everything after ‘$ ‘), and the line beneath it is the output.

$ pbpaste | wc -c
 32052
Microsoft Word character count

The above command took all of the text in my clipboard and sent it to the program wc with option -c, which returned 32052. wc is a word counter, and option -c counts the number of bytes in the input; with the alphabet I am using, 1 byte = 1 character. I also pasted the text into Microsoft Word and took a screenshot of its character count.

I knew I only wanted the first 1/10 of the output, as it was just too much text to include in a blog post, so I loaded the character count into a variable. We can set a variable by wrapping our previous command in $() and assigning the result to a name, in this case the letter x.

$ x=$(pbpaste | wc -c)

If we echo x we just get back x.

$ echo x
x

We don’t want the letter x, but the variable we named x. We call it back by putting a $ in front of the variable name, like $<variable> or $x.

$ echo $x
32052

Now we can do math on our variable. We can evaluate an arithmetic expression by enclosing it within 2 sets of parentheses preceded by a dollar sign. It takes longer to explain than to show: $((<math>))

$ echo $(($x/10))
3205

Note that our output is 3205 and not 3205.2 because we are doing integer math and integers are whole numbers. We can use a modulo operation to find the remainder by using a percent sign % in our arithmetic expression, but that’s really getting far off-topic.

$ echo $(($x%10))
2

I used the program head, which returns the beginning of its input, asked for only the first 3205 bytes with option -c, and piped that back to my clipboard with pbcopy to paste into WordPress.

$ pbpaste | head -c 3205 | pbcopy

So now if I call pbpaste, it only contains what was last copied into it, the first 3205 bytes:

$ pbpaste | wc -c
3205

 

San Marcos Daily Record Negatives

Top row: SMDR_1950s-SF-11_May 16 2017_13-38-29, SMDR_1930s-56_027 Bottom row: SMDR_1940s50s-88_001, SMDR_1930s-58_004, SMDR_1930s-26_016

In January of 2016 University Archives received an estimated 800,000 photo negatives, transparent strips of film that depict an image with the colors inverted, from the San Marcos Daily Record. The collection contains images spanning from the 1930s to the 2000s. The negatives consist of a mixture of nitrate and safety film. Nitrate film, a flexible plastic film base, was created in the late 1800s as a replacement for glass plates, and safety film was later created as a substitute for nitrate. Nitrate film is the same film that was used in motion pictures, where it caused many devastating fires during film screenings in the early 1900s, and it becomes less stable and more likely to auto-ignite as it deteriorates. Safety film, as the name suggests, is much safer to use and store; however, it still degrades over time.

Both the flammability and the dilapidated state of the oldest negatives forced us to move quickly with plans to digitize the entire collection. Upon receiving a Texas State Library and Archives TexTreasures Grant, funded by IMLS, we started looking for the right equipment with the hope of creating a new digitization system. Our previous digitization process relied on flatbed scanners, where the scanning alone can take five minutes per image.

We built the new process with the goal of digitizing the negatives more quickly and producing higher quality images than the scanners we had been using. While there have been a few hiccups, the project as a whole has been a success. We began digitizing in April of 2017 and, as of now, we have over 6,000 images. Regarding image quality, it can feel a bit like overkill, as our process captures more information than exists in the earliest negatives, but it allows us to preserve as much of the detail in each negative as possible. These silhouette images, for instance, likely printed with the heavy contrast of a typical silhouette given the printing processes of the 1930s, but with our system we can see much more detail than just the silhouettes.

Silhouettes

SMDR_1930s_67_001, SMDR_1930s_67_004, SMDR_1930s_67_003

Through research in the San Marcos Daily Record’s microfilm, we’ve discovered many engaging stories behind the negatives. Just a brief flip through some of the old newspapers added quite a bit to our understanding of the collection. While we did find articles that recounted events seen in the negatives, very few of the earliest images we have were included in the paper, likely because the image printing process of the time was too slow.

At the start of the collection, we can see the photographers not only documenting big events but also capturing the realities of living in Central Texas, which resulted in some interesting photos.

SMDR_1930s-40_003, SMDR_1930s-56_052, SMDR_1930s-58_001, SMDR_1930s-176_005, SMDR_1930s-29_001, SMDR_1930s-87_002

The fact that their pictures were not included in the newspaper also means many of the earliest images read more like a family photo album than a periodical. This “photo album” is where we get stories like that of Tommy and his many adventures.

“Tommy in Airplane, Car” – SMDR_1930s-88_003

“Tommy Playing Outdoors at Home” – SMDR_1930s-87_013

“Tommy at Farm Misc., Outdoors” – SMDR_1930s-103_12

This project is invaluable both for the research material it will provide and for the chance to connect with the people of San Marcos while recreating local history. Our goal is to make these images accessible not only to researchers but also to the San Marcos community. We hope to tell the story of San Marcos with the help of those who know it best.

Check here for more updates and stories related to this project.


The provided images are a preview of work by the Digital & Web Services Department in conjunction with the University Archives.

 

Catalog Cards and Fiche

Texas State University has recently completed building an Archives & Research Center which will serve as an offsite storage facility for collections. In preparation, units have been weeding the collection and identifying items to be moved offsite.

The old backup microfiche copy of the library online catalog is one such collection identified for weeding.

Back in the day, after the card catalog was retired, this microfiche provided alternative access to the collection if the OPAC was down or the building lost power. Although it was not up to date, it still proved useful to me on several occasions when I worked the public services desk as a Reference Librarian.

Seeing the microfiche reminded me that I had been present when the original card catalog was retired in 1999.

I saved a few cards as souvenirs.

Other remnants of older library technology can still be found in the building. You can still find punch cards from the first computerized circulation system in the backs of some books.

The National Tour of Texas


This unique project brought together specialists in GIS, web development, and digitization to create an interactive website based on Dick J. Reavis’s 1987 year-long journey driving every highway in Texas.

Dick J. Reavis is the award-winning author of seven books, including The Ashes of Waco: An Investigation; If White Kids Die: Memories of a Civil Rights Movement Volunteer; Catching Out: The Secret World of Day Laborers and more. He has been a Nieman Fellow at Harvard University, a Senior Editor at Texas Monthly, and a finalist for a National Magazine Award. He is now Emeritus Associate Professor of English at North Carolina State University.  While on assignment for Texas Monthly in the late 1980s, Mr. Reavis got lost and pulled from his glove box his official Texas Highway map. Looking at the map, the thought occurred to him, “there sure are a lot of roads on that map … I wonder if anyone has ever driven them all.” He decided he would try.

He proposed to Texas Monthly that he spend a year driving every single road on the map and write a series of articles for the magazine about his experiences. Mr. Reavis divided Texas into 48 Regions and drove over 117,000 miles in a Chevy Suburban. The Dick J. Reavis Papers at the Wittliff Collections contain thousands of photographs, postcards, notes and even a logbook chronicling each day’s driving. Mr. Reavis published fourteen articles in Texas Monthly throughout his tour.

Our exhibit contains digitized photographs, postcards, notes and copies of the articles as they appeared in Texas Monthly. We decided to digitize his original hand-shaded map, overlay it on Google Maps, and use that as the foundation for navigating the online exhibit.  The exhibit also features a new video interview with Mr. Reavis and a collection of photographs from 1987 matched with corresponding images from Google Maps demonstrating the changing landscape of rural Texas during the last three decades.

We decided to use Omeka as the primary platform and the Google Maps API and JavaScript to integrate the geospatial features. The first part of the process was digitizing Reavis’s 67 x 90 cm Texas Highway map with the RCAM and Phase One digital back. Nathaniel Dede-Bamfo, GIS Specialist, used this to create KML layers. The difficulty in this part of the project was that Reavis created his own regions: although they often followed county boundaries and rivers, they frequently included only parts of counties. Nathaniel pulled data from the US Census Bureau and the Texas Natural Resources Information System (TNRIS) into ArcGIS to perform georeferencing, editing and dissolving, and symbolizing, then converted the layers to KML files.

Georeferencing

Symbolizing

Our programmer, Jason Long, then used two KML files created by Nathaniel to build a custom page in Omeka using the Google Maps API. The first KML was an image overlay of Reavis’s digitized shaded map, and the second was a series of polygons for Reavis’s regions. Since the Google API does not generate mouseover events for polygons when loading a KML layer, Jason used the geoxml3 extension as a KML processor for Version 3 of the Google Maps JavaScript API. First he generated a map with the Google Maps JavaScript API, next he parsed the KML polygon layer with geoxml3, and finally he used AJAX to create links to items in the Omeka exhibit based on mouse events.

Putting it all together, Jason created a custom theme in Omeka and used the Unite Gallery JavaScript library for the slide shows.

We thought it would be interesting to try to show some of the changes since Reavis’s tour. We pulled some excerpts from his tour notes and added links to photos he took at the time and also links to current Google Maps Street Views of the same locations.

One of Dick Reavis’s motivations for making the tour was to try to view a part of Texas that he knew would soon disappear. The then-and-now photos show that he was unfortunately correct in his prediction.

 

History of Spring Lake

In a previous post I discussed a project to digitize materials and create an exhibit related to the history of Aquarena Springs. Entitled "History of Spring Lake," the online exhibit is now available. Only a portion of the archives’ materials are included in this exhibit.

This "History of Spring Lake" exhibit was initially planned and constructed by Jason Crouch, a graduate student in the Public History Program at the Texas State University Center for Texas Public History. Digitization support was provided by Digital Media Specialist Jeremy Moore. Programming support and customization of the Omeka site were provided by Jason Long. Additional support was provided by Todd Peters, Head of Digital & Web Services.

This exhibit was edited and revised to feature a variety of primary source materials from the University Archives. The purpose of this exhibit is to provide a brief history of Spring Lake; it is not meant to be an exhaustive history of the people, places, or details.

The University Archives would like to thank Anna Huff and John Fletcher for providing content representing The Meadows Center, as well as the local repositories and local collections that allowed us to feature their materials in this exhibit.

And the earth did not swallow him

We recently completed a fun project that is notable for a few reasons. The first is that the subject of the project was creating an online exhibit on the making of Severo Perez’s beautiful film, … and the earth did not swallow him, based on Tomás Rivera’s classic 1971 Chicano novel, …y no se lo tragó la tierra, a semi-autobiographical work that recounts the lives of the workers and families in the migrant camps where his family stayed while doing farm work. In 1995 Severo Perez wrote an English screenplay using his own translation, then produced and directed a film version of the novel. The film was well received, earning critical acclaim and several film awards.

Severo Perez – 2014

In 2014 Severo Perez donated the production archives from the film to the Wittliff Collections at Texas State University. The Severo Perez Archive is a comprehensive collection that traces the development of all of his major works, from the first drafts to the finished productions. Included are scripts, correspondence, location photos, storyboards, animation cells, casting photos, production forms, continuity photos, rough cuts, outtakes, master reels, sound reels, editing logs, artifacts, and publicity materials.

The second reason the project is notable is the participation of the donor and filmmaker in the project. During the Fall 2015 semester, Severo Perez was an artist-in-residence at Texas State, sponsored by the Center for the Study of the Southwest and its director, Dr. Frank de la Teja. Severo Perez’s presence on campus coincided with the university’s 2015-2016 Common Experience theme: “Bridged Through Stories: Shared Heritage of the United States and Mexico, an Homage to Dr. Tomás Rivera.”

He offered his time to help with the online exhibit and went through his archives with Steve Davis, the curator of the Southwestern Writers Collection.

The third and possibly most interesting feature of the exhibit is its arrangement. We were able to create a unique web resource that not only provides information on the making of this film but also explores the filmmaking process in general, from beginning to end, drawing from the extensive materials in the Severo Perez collection. With the assistance of the Texas State Instructional Technology Services video production team, we conducted a new video interview with Mr. Perez talking about the making of the film. Steve Davis reviewed the video, took notes, and included time codes with each note.

The next step is probably the most important part of the process. Mr. Davis rearranged his notes by topic and added topic headings based on his knowledge of the content and arrangement of the collection. These rearranged notes then became an outline, which our programmer, Jason Long, used to construct the framework for the exhibit in Omeka.
Todd Peters used the outline with time codes to create 109 short video clips from the 2.5-hour interview, and Jeremy Moore, the Digital Media Specialist, digitized objects selected by the curator. Jason put everything together into the site and we went live in late spring 2016.

We hope you enjoy the site.

Omeka site

http://exhibits.library.txstate.edu/thewittliffcollections/exhibits/show/severo-perez/

Then and Now

Several websites that feature juxtaposed historic and recent images have appeared over the last year. It is a fun way to showcase historic images from an archive. The Knight Lab at Northwestern University has created an easy tool for making photo juxtapositions: the software lets a user move a slider to swipe between two versions of an image.

We decided to use the JuxtaposeJS software to create a few test images and learn more about the process. We took a couple of prints of historic photos of Old Main we had recently digitized and tried to find the locations from where they were shot. For this pilot project we did not use the high-resolution Phase One digital camera; we wanted to keep the amount of equipment we needed to carry to a minimum on our first attempt, so Jeremy used a tripod and a smaller Olympus OM-D E-M5 II capable of stitching together 40 MB images from several shots.

The resulting images were scaled and their visual reference points lined up in Adobe Photoshop. The JuxtaposeJS website automatically creates the HTML embed code to insert into a website.

Click to view juxtaposed Then and Now images of Old Main.

 

Old School Work Study

Thacher R. Gary, 1956

During the early part of the University’s history, "work study" apparently meant something different than it does today. While listening to a digitized 1974 oral history, I stumbled upon an amusing recollection from Biology professor Thacher R. Gary and his wife Nawona (both also 1940 graduates of Southwest Texas State Teachers College).

Jessie and Claude Kellam, 1923

They recalled the hardships students faced in the 1920s and 1930s: most students worked, and many could only afford one meal a day. Recalling that some would do anything to stay in school, they recounted that several students, including J. C. Kellam and his brother, kept cows on campus in the early years and sold the milk to other students to help earn money.