Walker Sampson

walker[dot]sampson[at]icloud[dot]com

Category: repository

Repercussions of Amassed Data

I had the pleasure of meeting Mél Hogan while she was doing her postdoctoral work at CU Boulder. I think her research area is vital, though it’s difficult to summarize. But that won’t stop me, so here goes: investigating how one can “account for the ways in which the perceived immateriality and weightlessness of our data is in fact with immense humanistic, environmental, political, and ethical repercussions” (The Archive as Dumpster).

Data flows and water woes: The Utah Data Center is a good entry point for this line of inquiry. The article explores the above quoted concerns (humanistic, environmental, political, and ethical) at the NSA’s Utah Data Center, near Bluffdale. It has suffered outages and other operational setbacks since construction. These initial failures are themselves illuminating, but even assuming such disruptions are minimized in the future, the following excerpt clarifies a few of the material constraints of the effort:

Once restored, the expected yearly maintenance bill, including water, is to be $20 million (Berkes, 2013). According to The Salt Lake Tribune, Bluffdale struck a deal with the NSA, which remains in effect until 2021; the city sold water at rates below the state average in exchange for the promise of economic growth that the new waterlines paid for by the NSA would purportedly bring to the area (Carlisle, 2014; McMillan, 2014). The volume of water required to propel the surveillance machine also invariably points to the center’s infrastructural precarity. Not only is this kind of water consumption unsustainable, but the NSA’s dependence on it renders its facilities vulnerable at a juncture at which the digital, ephemeral, and cloud-like qualities are literally brought back down to earth. Because the Utah Data Center plans to draw on water provided by the Jordan Valley River Conservancy District, activists hope that a state law can be passed banning this partnership (Wolverton, 2014), thus disabling the center’s activities.

As hinted at in a previous post on Lanier, I often encounter a sort of breathlessness invoked when descriptions of cloud-based reserves of data and computational prowess are discussed. Reflecting on the material conditions of these operations, as well as their inevitable failures and inefficiencies (e.g. the apparently beleaguered Twitter archive at the Library of Congress, though I would be more interested in learning about the constraints and stratagems of private operations) is a wise counterbalance that can help refocus discussions on the humanistic repercussions of such operations. And to be sure, I would not exclude archives from that scrutiny.

Advertisements

Help the Digital Preservation Q&A at StackExchange

Stack Exchange Q&A site proposal: Digital Preservation

Join us.

I’ve recently committed to the Digital Preservation Q&A proposal at StackExchange. This is a resource I really hope  comes to fruition, as there’s a lack of sites to support exchange of strategies and advice for people involved in digital preservation, as well as to field questions from persons familiarizing themselves with the practice.

This latter audience has been on my mind particularly since leaving the DPOE program last year. Although we have fielded questions over an email listserv, this venue has a few significant weaknesses:

  • It’s difficult to bookmark or reference back to advice or information within a thread.
  • The email body and thread is not friendly to text formatting, links, and other formatting that would make information more readable, digestible and inclusive.
  •  The information is unstructured — one can not apply tags, select a topic as a favorite, vote up a discussion, or track edits in any systematic way.

By contrast, the StackExchange approach is a mix between a question-and-answer site and Wikipedia, with some reward elements to provide incentive for good contributions. There are a host of topics covered under the network, from gardening to LEGOs to electrical engineering. The network hosts an Area 51 site, which maintains all the topics proposed presently that users are interested in, but which are not yet formal sites. There’s a lot there, and you’d likely be interested in a few.

Why StackExchange? It features all the methods to structure information I described above. I really can’t imagine a better format (at least, not one already set up and sorted out) for building up a knowledge base in digital preservation, and one that can adjust with time. Digital preservation is a practice that will change immensely with time. There will be an assortment of questions and procedures, ranging from the obscure rescue efforts to large scale and contemporary migration processes.

As part of the state archives here in Mississippi, I do a good bit of training to state employees on electronic records management and preservation. Required retention periods for born-digital objects can range from three to fifteen or more years, while many are marked for permanent retention and will be deposited here at the archives. Considered planning for digital content repeatedly comes up. A single good resource to point them to would be very welcomed.

Consider committing if the topic interests you. It’s especially helpful if you’re already engaged in other StackExchange sites, and as noted there are a whole lot of topics to join, so there’s ample opportunity to get involved with StackExchange. Any interest does help!

From My Archives: Derrida’s Archive Fever

Green Fire

Below is a review of Derrida’s Archive Fever. The idea was to relate the lecture to practicing archivists and record managers. This was a really engaging read, and I think Derrida successfully articulates the archive impulse, with all its attendant richness and strangeness.

Archive Fever: A Freudian Impression. Jacques Derrida. Chicago: University of Chicago Press, 1998. Translated by Eric Prenowitz. 113 pages. ISBN 0-226-14367-8 paper. $14.98.

French philosopher Jacques Derrida (1930-2004) is most commonly known as the founder of deconstruction, an investigative thinking that identifies contradictions in a subject and demonstrates the essentialness of this contradiction to the meaning of the subject. For a thinker so adept at analyzing the valences of meaning in language, Derrida was unsurprisingly hesitant about the broad appeal and use of the deconstruction term, and no doubt would find fault with an overly mechanistic summation as perhaps written here. In Archive Fever, Derrida applies his intensely critical thought and evaluation to the notion of the archive as it is manifested in Sigmund Freud’s oeuvre.

Archive Fever: A Freudian Impression is a translation from the French of a published lecture Derrida delivered in 1994, and is divided into six parts: an opening note, an exergue, a preamble, foreword, theses and postscript. Derrida delivered this lecture to an international colloquium entitled “Memory: The Question of the Archives.” This leads to two caveats for the interested reader. Although blurbs on the paperback reference Derrida’s discussion of electronic media and more broadly the role of inscription technology in the psyche and in the archives, this is not the focus of his discussion, but is only part of a larger examination of the archive notion in Freud’s works. The reader should also know that this a later work of Derrida, and as such references ideas and investigations discussed in earlier works, particularly the essay Freud and the Scene of Writing (1972). This means some of Derrida’s passages can be disorienting if the reader is not familiar with the works of Derrida and Freud. Thankfully Derrida takes pains to convey his meaning through multiple expressions, so the reader has many opportunities to understand the ideas at play.

Read the rest of this entry »

Handling Results in Fedora’s REST API

Lately I’ve been working to put in more development time with the Fedora repository at the Goodwill Computer Museum.

A PHP ingest interface we’ve set up is certainly the most developed of the our repository’s services, but there’s a strong need to relate one object to another as it is being ingested. To do this I want to provide the user with a drop down menu of objects in the repository which fulfill some criteria (say, the object represents a donator or creator). The user can select one during the ingest phase, relating the ingested object to this other object. That relationship would be recorded in the RELS-EXT datastream as RDF/XML, creating a triple. The predicate of that triple will come from either Fedora’s own ontology [RDF schema] or another appropriate namespace.

Below is PHP code using the cURL client library to call Fedora’s REST API and get this list of relevant objects. I encountered a few stumbling blocks putting this together, so I thought I’d share in case others were curious or looking at a similar problem.

The first step is to compose your query, and then initiate a cURL session with the query.


<?php
$request = "http://your.address.domain:port/fedora/objects?query=yourQuery&resultFormat=xml";
$session = curl_init($request);

curl_setopt($session, CURLOPT_RETURNTRANSFER, true);

$response = curl_exec($session);
$responseResult = simplexml_load_string($response);
$resultsArray = array();

foreach ($responseResult->{'resultList'} as $result) {
     foreach ($result->{'objectFields'} as $entry) {
          foreach ($entry as $value) {
               $resultsArray[] = $value;
          }
     }
}
curl_close($session);

while (!empty($token)) {
     $nextQuery = "http://your.address.domain:port/fedora/objects?sessionToken=" . urlencode($token) . "&query=yourQuery&resultFormat=xml";
     $nextSession = curl_init($nextQuery);

     curl_setopt($nextSession, CURLOPT_RETURNTRANSFER, true);

     $nextResponse = curl_exec($nextSession);
     $nextResponseResult = simplexml_load_string($nextResponse);

     foreach ($nextResponseResult->{'resultList'} as $result) {
          foreach ($result->{'objectFields'} as $entry) {
               foreach ($entry as $value) {
                    $resultsArray[] = $value;
               }
          }
     $token = $nextResponseResult->{'listSession'}->{'token'};
     print "$token<br />\n";

     curl_close($nextSession);

} //while
?>

On line 2 I’ve specified my query results to be returned as XML and not HTML (resultFormat=xml). This is because I don’t want a simple browser view of the results — I want to work with them some first, so XML is appropriate.

On line 5 the cURL option CURLOPT_RETURNTRANSFER to ‘true’. This directs cURL to deliver the return of its Fedora query as a string return value to the curl_exec() variable, in this case $response.

On line 8 $response, now an XML structure, is loaded into $responseResult as a PHP5 object. The object is a tree structure containing arrays for the result list, the entries, and the entries’ value arrays, all of which we can work through to get to the record values of interest. The specific contents will depend on your query. You can get a good look at the object with print_r():

print_r($responseResult);

The two Fedora REST commands used are findObjects and resumeFindObjects. We need both of these commands because findObjects will not return more than 100 results, regardless the value you set on maxResults.

Instead it returns the results along with a token. This token is a long-ish string you can then supply to resumeFindObjects, which will continue retrieving your results for you. Just like findObjects, resumeFindObjects will never return more than 100 results, instead giving you another unique token. Once again, you can supply that token to a new resumeFindObjects command to continue getting your results.

The two loops for each of these commands should fill resultsArray[] with all the results available in the repository.

You can use this array in a HTML drop down:


<?php
echo "<select name=\"donators\">";
foreach ($responseResult->{'resultList'} as $result) {
	foreach ($result->{'objectFields'} as $entry) {
		$pid    = (string) $entry->pid;
		$title  = (string) $entry->title;
		echo "<option value=\"$pid\">$title</option>";
	}
}
echo "</select>";
?>

Keep in mind that values like $entry->pid and $entry->title are only going to be in the results if those fields have been requested in your queries.

This approach has given me a good understanding of calling and manipulating objects in Fedora through PHP. I have found that setting maxResults to a smaller number (say 5, 10, or 20) is faster than setting it to its maximum 100. And of course, if you are going to be fetching hundreds or thousands of objects, it’s best not to dump them all in a drop down or to fetch them all at once.

Migrating Data from MySQL

This is a simple PHP script to write data from a CSV file into a new FOXML file for ingest into the repository.  The process has been straightforward thus far. MySQL provides a quick way to export data from a table into a makeshift CSV file, and PHP’s fgetcsv function helps parse this file into a series of arrays which can be stepped through, value by value. This script has parsed the ~1000 rows of tabular data from the Hardware table into FOXML files. These files have been ingested into Fedora through the batch ingest tool available on the Java administration client.

Having this amount of records present in the repository has helped clarify how much functionality the base instance of Fedora possesses. The default search tool is fairly capable, and runs through several prominent Dublin Core fields specified in the FOXML file. Except for some notable absences, such as model number (for which there is no analogous DC term), the hardware records are as searchable and discoverable as they are on the MySQL/PHP interface. The major deficiency thus far is the lack of services associated with the RELS-EXT datastreams defined in the FOXML files. Associated these records with images and other parts is key, so this will continue to need to be developed.

Read the rest of this entry »