I work on a system at Talis which posts MARC records from customer library databases into a MARC to RDF transformer. The resulting RDF generated from the MARC is sent into the Talis Platform, where it's used to power Prism.
Over the last day or so I've been working on a bug which has prevented some records going correctly through this process. Along the way, I noticed another bug occurring somewhere between the post from the customer site and our MARC to RDF transformer. It looked as if line break characters in the original MARC record were being lost somewhere along the way. Consequently, when the MARC was pushed into the transformer, the record got spat out as invalid, because the length specified in the MARC leader no longer matched the length of the record (which had lost its line break characters). (By the way, working directly with byte streams is the only way to work with MARC, for precisely this reason.)
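To make that concrete: the first five bytes of a MARC 21 leader hold the total record length as zero-padded ASCII digits, so a check along these lines is enough to spot a record that has lost bytes in transit. This is only a minimal sketch. The method name and the fake record are made up for illustration, and it assumes the data holds exactly one record.

```ruby
# Hypothetical helper: true if the length declared in the MARC leader
# (bytes 0-4, zero-padded ASCII digits) equals the actual byte count.
# Assumes `bytes` holds exactly one record.
def marc_length_matches?(bytes)
  raw = bytes.b                      # treat the data as raw bytes
  declared = raw.byteslice(0, 5).to_s
  return false unless declared =~ /\A\d{5}\z/
  declared.to_i == raw.bytesize
end

# A fake 30-byte "record": 5-digit length, then filler containing a line break.
record = "00030" + "abc\ndef".ljust(25, "x")

puts marc_length_matches?(record)                 # => true
puts marc_length_matches?(record.delete("\r\n"))  # => false (one byte short)
```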
I had a sudden insight on the way home, triggered by remembering issues I'd had with curl (the command line HTTP client) while working on a personal project. On that project, I'd been trying to post RDF triples in ntriple format into my application using curl. However, the application only seemed to recognise the first RDF triple in the posted file. I couldn't understand why.
Then, when I echoed the body of the HTTP request, as received by my app from curl, I realised the issue: curl was sending the body of the request WITHOUT LINE BREAKS. As line break characters act as the delimiter between triples in RDF ntriple format, my app was only seeing a single RDF ntriple. When I tried an alternative tool to send the posts (the extremely useful Poster add-on for Firefox), the ntriples were received correctly.
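The effect is easy to reproduce without curl. The sketch below assumes a naive parser that treats each line as one triple. The example triples and the method name are invented for illustration, and the `delete` call merely imitates a body arriving with its line breaks removed.

```ruby
# Hypothetical line-based splitter: one ntriple statement per line.
def split_triples(body)
  body.each_line.map(&:strip).reject(&:empty?)
end

ntriples = <<~NT
  <http://example.org/a> <http://example.org/p> "one" .
  <http://example.org/b> <http://example.org/p> "two" .
NT

# Imitate a request body that has lost its CR and LF characters.
stripped = ntriples.delete("\r\n")

puts split_triples(ntriples).length  # => 2
puts split_triples(stripped).length  # => 1
```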
Once I remembered this, I decided to do some debugging of the kind of requests curl would send if it were posting MARC records. My hypothesis was that curl was stripping line break characters from the MARC record (which is bad, as they are valid characters in MARC), and hence causing the record to be shorter than the leader said it should be.
First step was to put together something to echo and/or save HTTP request bodies. Rack is ideal for this sort of thing, so I used this little Rack web server program:
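require 'rubygems'
require 'rack'

# Save the raw request body to disk (binary mode, so no bytes get translated) and return it.
def save_body(body)
  File.open('last_raw_request', 'wb') { |f| f.write(body) }
  body
end

# Echo the body back. A Rack response body must respond to #each, hence the array.
app = lambda { |env| [200, {}, [save_body(env['rack.input'].read)]] }
Rack::Handler::WEBrick.run(app, :Port => 7777)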
This saves the raw request body to a file called "last_raw_request".
I first posted a MARC file with line breaks in it (attached) using Poster (with Content-Type set to application/marc21) through Firefox. The MARC file came through intact and still valid.
I then posted a MARC file with line breaks in it using curl:
curl -d @marcfile.mrc -H "Content-Type:application/marc21" http://localhost:7777/
This produced an invalid MARC file, with the line breaks missing.
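One way to confirm that line breaks are the only casualty is to compare the original file with what the server saved. This is a rough sketch: the method name is my own, the file names are simply the ones used above, and it assumes carriage returns and line feeds are the only bytes that could have gone missing.

```ruby
# Hypothetical check: is `received` exactly `original` with every CR and LF
# byte removed (and was there at least one such byte to remove)?
def only_line_breaks_lost?(original, received)
  stripped = original.b.delete("\r\n")
  stripped != original.b && stripped == received.b
end

# Compare the posted file with the saved request body, if both are present.
if File.exist?('marcfile.mrc') && File.exist?('last_raw_request')
  original = File.binread('marcfile.mrc')
  received = File.binread('last_raw_request')
  puts "sent #{original.bytesize} bytes, received #{received.bytesize} bytes"
  puts "only line breaks lost: #{only_line_breaks_lost?(original, received)}"
end
```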
The solution is to use the --data-binary switch whenever curl is sending binary data. We're not currently doing this when sending MARC from the customer site. Mostly it doesn't matter, but it does when the MARC record contains line break characters.
Namely:
curl --data-binary @marcfile.mrc -H "Content-Type:application/marc21" http://localhost:7777/
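For comparison, here is roughly what the same post might look like from Ruby using Net::HTTP from the standard library, which sends the body bytes as given. This is a sketch rather than anything we actually run: the method names are hypothetical, and the URL and file name in the commented-out call are just the ones from the curl example.

```ruby
require 'net/http'
require 'uri'

# Hypothetical helper: build a POST request carrying the MARC bytes untouched.
def build_marc_post(uri, marc_bytes)
  request = Net::HTTP::Post.new(uri.request_uri)
  request['Content-Type'] = 'application/marc21'
  request.body = marc_bytes
  request
end

# Hypothetical helper: read the file in binary mode and send it.
def post_marc(url, path)
  uri = URI(url)
  request = build_marc_post(uri, File.binread(path))
  Net::HTTP.start(uri.host, uri.port) { |http| http.request(request) }
end

# Example call (needs the echo server above to be running):
# post_marc('http://localhost:7777/', 'marcfile.mrc')
```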