Pre-Jane-athon testing

I wanted to see how things might go on The Day after the inputting, etc., was over, especially the use of Windows Explorer to join the files from each user/table, so I constructed the following test scenario.

First, I searched these MARC sources to gather some sample data:

  1. LC MARC English (7M records)
  2. LC MARC Foreign (5M records)

The search was for any occurrence in any field of 'Jane Austen' or 'Austen, Jane'.
I took the matching records, concatenated them together, and deduped them on LCCN.
In the end I assembled 1188 MARC records in a file called 'jane-all-deduped.mrc'.
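
The concatenate-and-dedupe step can be sketched roughly as follows. This is only an illustration, not the tool actually used: records are modelled as plain dicts with a hypothetical 'lccn' key, the sample data is made up, and the LCCN normalization is a simplified assumption (whitespace is stripped, nothing more).

```python
import re

def normalize_lccn(raw):
    """Simplified assumption: strip all whitespace so padded LCCNs compare equal."""
    return re.sub(r"\s+", "", raw)

def dedupe_on_lccn(records):
    """Keep the first record seen for each LCCN; records without an LCCN are kept."""
    seen = set()
    unique = []
    for rec in records:
        lccn = normalize_lccn(rec.get("lccn") or "")
        if lccn and lccn in seen:
            continue  # a record with this LCCN was already kept
        if lccn:
            seen.add(lccn)
        unique.append(rec)
    return unique

# Made-up sample hits standing in for the two search result sets
english_hits = [{"lccn": "   12345678 ", "title": "Emma"},
                {"lccn": "87654321", "title": "Persuasion"}]
foreign_hits = [{"lccn": "12345678", "title": "Emma (duplicate)"}]

deduped = dedupe_on_lccn(english_hits + foreign_hits)
print(len(deduped))
```

Keeping the first occurrence means the order in which the sources are concatenated decides which copy of a duplicate survives.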

Next I created 11 folders in my RIMMF data area named: jata … jatk
Then I copied Deborah's 75 janebase records into each folder.

Next, I repeated the following steps, once for each folder (11 times):

  1. opened the rimmf .ini file
  2. changed the record prefix: (1. jaia … 11. jaik)
  3. changed the data folder: (1. jata … 11. jatk)
  4. saved the .ini
  5. started RIMMF, went to F3
  6. batchloaded* the next 100 records from the jane.mrc file
  7. closed the program

Notes: For the last folder, I batchloaded1) 188 records instead of the usual 100.
The time to run 100 records through F3 ranged from 15 to 18 minutes (Saturday afternoon).
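
The folder, prefix, and batch bookkeeping for the 11 runs can be sketched like this. The function name plan_batches is hypothetical, and the sketch only computes the plan; it does not touch the .ini file or RIMMF itself.

```python
import string

def plan_batches(total_records, folders=11, batch_size=100):
    """Return (data_folder, record_prefix, start, stop) for each run.
    The last folder absorbs whatever records are left over."""
    plan = []
    for i in range(folders):
        letter = string.ascii_lowercase[i]  # a .. k for 11 folders
        start = i * batch_size
        stop = total_records if i == folders - 1 else start + batch_size
        plan.append(("jat" + letter, "jai" + letter, start, stop))
    return plan

plan = plan_batches(1188)
for folder, prefix, start, stop in plan:
    print(folder, prefix, stop - start)
```

With 1188 records this gives ten batches of 100 and a final batch of 188 for 'jatk'.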

The results of this processing:

folder	prefix	jbase	marc	import	folder total
jata	jaia	75	100	380	455
jatb	jaib	75	100	336	411
jatc	jaic	75	100	369	444
jatd	jaid	75	100	386	461
jate	jaie	75	100	326	401
jatf	jaif	75	100	338	413
jatg	jaig	75	100	326	401
jath	jaih	75	100	350	425
jati	jaii	75	100	377	452
jatj	jaij	75	100	365	440
jatk	jaik	75	188	588	663

tot		825	1188	4141	4966

In addition, since it was the main point of the test, I tracked the links too2):

folder	prefix	J2J	O2J	total links
jata	jaia	148	63	211
jatb	jaib	148	69	217
jatc	jaic	148	87	235
jatd	jaid	148	77	225
jate	jaie	148	65	213
jatf	jaif	148	92	240
jatg	jaig	148	129	277
jath	jaih	148	110	258
jati	jaii	148	84	232
jatj	jaij	148	112	260
jatk	jaik	148	178	326

tot		1628	1066	2694

J2J is the number of 'janebase' to 'janebase' links.
O2J is the number of 'jai.' to 'janebase' links, i.e. links added to a folder by the MARC import processing.

Next I ran the test of the janeathon 'merge':

  1. created a new folder named 'jane1'
  2. copied the 75 records from the janebase into it
  3. for each of the 11 'jat.' folders, dragged and dropped its contents into 'jane1'

When asked how to handle the 75 janebase records (i.e. overwrite or skip), I elected 'Don't copy'.
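
For anyone who would rather script the merge than drag and drop, here is a minimal sketch of the same "Don't copy" behaviour. It assumes each record is a separate file and that a file name already present in the target should be skipped; the file names in the demo are made up.

```python
import shutil
import tempfile
from pathlib import Path

def merge_folders(sources, target):
    """Copy every file from each source folder into target,
    skipping any file name that is already there ("Don't copy")."""
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    copied, skipped = 0, 0
    for src in sources:
        for path in sorted(Path(src).iterdir()):
            if not path.is_file():
                continue
            dest = target / path.name
            if dest.exists():
                skipped += 1  # same name already merged, leave it alone
            else:
                shutil.copy2(path, dest)
                copied += 1
    return copied, skipped

# Tiny demo with two source folders sharing one made-up 'janebase' file
with tempfile.TemporaryDirectory() as root:
    root = Path(root)
    for name, files in {"jata": ["base1", "jaia1"], "jatb": ["base1", "jaib1"]}.items():
        (root / name).mkdir()
        for f in files:
            (root / name / f).write_text(f)
    (root / "jane1").mkdir()
    (root / "jane1" / "base1").write_text("base1")
    result = merge_folders([root / "jata", root / "jatb"], root / "jane1")
    merged_names = sorted(p.name for p in (root / "jane1").iterdir())

print(result, merged_names)
```

The skip-on-existing rule is what keeps the shared janebase records from being overwritten once per folder.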

Next, I loaded 'jane1' into RIMMF.

The first time I thought it was not going to work. It took a long time, a very long time: well over an hour, maybe an hour and a half. Not looking too good.

But the next day I started looking for bottlenecks. The main one was the 'feature' whereby RIMMF reconstructs missing links when it loads the EI. Completely rewriting that routine knocked off an hour. Getting better. That in turn made it possible to step through the code and find and fix more bottlenecks.

The stats for the resulting EI are:

folder	jbase	marc	import	records	J2J	O2J	total links
jane1	75			4216	148	1066	1214

The cumulated 'O2J' count seems to indicate that the merge using Explorer will work.
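
The reasoning behind that check can be written out as arithmetic on the per-folder figures from the tables above. The 75 janebase records and their 148 J2J links are counted once, since the duplicate copies were skipped during the merge; the variable names are just for illustration.

```python
# Per-folder figures copied from the two tables above (jata .. jatk)
imports = [380, 336, 369, 386, 326, 338, 326, 350, 377, 365, 588]
o2j = [63, 69, 87, 77, 65, 92, 129, 110, 84, 112, 178]

JANEBASE_RECORDS = 75   # shared by every folder, merged only once
JANEBASE_LINKS = 148    # J2J links, likewise counted once

expected_records = JANEBASE_RECORDS + sum(imports)
expected_o2j = sum(o2j)
expected_links = JANEBASE_LINKS + expected_o2j

print(expected_records, expected_o2j, expected_links)  # 4216 1066 1214
```

These match the 'jane1' row: 4216 records, 1066 O2J links, and 1214 links in total.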

Currently, the RIMMF start up time is 2 mins 30 secs. This includes adding all the missing links, plus making the EI and all of the support tables.

This isn't great, but workable. Knock a minute off that time on a fast computer, but what about a laptop? I will try it on my laptop this week and see how much slower it is.

I have also added support for EI caching to RIMMF3 (which we had in RIMMF2, but which has been turned off for RIMMF3 thus far).

Thus, with caching, after the initial build, the program will be ready in about 2 or 3 seconds the next time it is started. The cache files can be distributed with the results if necessary (they are machine-independent).
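
The general build-once, load-afterwards idea looks something like the sketch below. This is not RIMMF's cache format or code: the JSON file, the function names, and the identifiers in the demo are all assumptions chosen for illustration, and the sketch makes no attempt to detect a stale cache.

```python
import json
import tempfile
from pathlib import Path

def load_or_build(cache_path, build):
    """Return (index, cache_hit). Build the index only if no cache file exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8")), True
    index = build()  # the slow path, taken on the first start only
    cache_path.write_text(json.dumps(index), encoding="utf-8")
    return index, False

calls = []

def build_index():
    """Stand-in for the expensive EI build; the identifiers are made up."""
    calls.append(1)
    return {"jaia00000001": ["janebase0001"]}

with tempfile.TemporaryDirectory() as tmp:
    cache_file = Path(tmp) / "ei-cache.json"
    first, first_hit = load_or_build(cache_file, build_index)
    second, second_hit = load_or_build(cache_file, build_index)

print(first_hit, second_hit, len(calls))
```

A plain-text format such as JSON is one simple way to keep a cache file portable between machines.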

In addition to the above, note that running the program with the 4200 records is noticeably slower in some areas (all of which need to be addressed):

  • F3/Import (the EI dupe-checking routines are slow)
  • Loading an RTree that includes Jane Austen ;-)
  • Discarding a set of imported records from the RTree

I exported the 'jane1' EI as RDF/N-Triples and tried to run it through various web validators. (The export to RDF was extremely slow at first; it was later optimized down to about a minute and a half, but still needs improvement.) None of these tools seems able to handle a file of any size, let alone produce a visual graph, which is what I was hoping for.

By way of validation I ran it through Raptor conversions on our server, submitted it to an RDF distiller, loaded it into an in-house triplestore (AllegroGraph), etc. There are about 72K statements excluding the reifications and 'confusing' metadata, but including RDA classes, rimmfIdentifiers, and labels for them.

One problem found during validation came from importing tag 670 from the LC authority records (which we map to 'sourceConsulted…'). These MARC fields might contain URLs, but in a few cases the way they are entered in the MARC fields confused RIMMF, so that the resulting RDF object was output as a URI instead of a string containing a URL.

Here's an example from the MARC3):

670 $a http://www.english.ox.ac.uk/about-faculty/faculty-members/research-centre-college-staff/byrne-dr-sandie, May 9, 2014:

and the resulting RDF object in RIMMF:

<http://rimmfdata.com/r/jaie00004048> <http://rimmf.com/vocab/sourceConsultedPerson> 
  <http://www.english.ox.ac.uk/about-faculty/faculty-members/research-centre-college-staff/byrne-dr-sandie, May 9, 2014:>
 

This was fixed (as the validators trip on it) by inserting a label at the head of the text during mapping, so that it now looks like this:

<http://rimmfdata.com/r/jaie00004048> <http://rimmf.com/vocab/sourceConsultedPerson> 
  "Url: http://www.english.ox.ac.uk/about-faculty/faculty-members/research-centre-college-staff/byrne-dr-sandie, May 9, 2014:"
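
The idea behind the fix can be sketched as below. This is not the RIMMF mapping code: the function names are hypothetical, and it assumes the label is added only when the text begins with a URL scheme. The result is always written as a quoted N-Triples literal, never as a <...> URI.

```python
def escape_literal(text):
    """Escape the characters that must be escaped inside an N-Triples literal."""
    return (text.replace("\\", "\\\\")
                .replace('"', '\\"')
                .replace("\n", "\\n")
                .replace("\r", "\\r"))

def source_consulted_object(text):
    """Return the object term for a 670-derived value as a string literal.
    Assumption: text starting with a URL scheme gets a 'Url: ' label in front."""
    text = text.strip()
    if text.lower().startswith(("http://", "https://")):
        text = "Url: " + text
    return '"' + escape_literal(text) + '"'

field_670a = ("http://www.english.ox.ac.uk/about-faculty/faculty-members/"
              "research-centre-college-staff/byrne-dr-sandie, May 9, 2014:")
obj = source_consulted_object(field_670a)
print(obj)
```

Because the value is always quoted, trailing text such as ', May 9, 2014:' can no longer end up inside a URI.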

Data available for download

1) batchload refers to a function I am working on; in brief, the user can direct the F3 form to take its searches from a file (of ISBNs, LCCNs, or in this case, MARC records). The program then iterates over the file, constructs a search term from each item, and performs the standard F3 processing from there without asking the user for any confirmations.
2) The following table is revised and corrected from the original.
3) In my example the 670 $b is omitted, but in actual practice it's included in the string directly after the 670 $a.
janetest.txt · Last modified: 2023/06/07 20:39 by 127.0.0.1
CC Attribution-Share Alike 4.0 International