Differences

This shows you the differences between two versions of the page.

@@ Line 1: / Line 1: @@
+====== Pre-Jane-athon testing ======
+I wanted to see how things might go on **The Day** once the inputting, etc., was over, especially the use of Windows Explorer to join the files from each user/table, so I constructed the following test scenario.
+First, I searched these MARC sources to gather some sample data:
+	-LC MARC English (7M records)
+	-LC MARC Foreign (5M records)
+The search was for any occurrence in any field of 'Jane Austen' or 'Austen, Jane'. \\
+I took the matching records, concatenated them together, and deduped them on LCCN.\\
+In the end I had 1188 MARC records in a file called 'jane-all-deduped.mrc'.
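A minimal sketch of the dedupe-on-LCCN step, in Python. The tooling actually used for this step is not described here, so everything below is illustrative: it assumes binary (ISO 2709) MARC input, reads the LCCN from 010 $a, and keeps the first record seen for each LCCN. Records without an 010 are kept as-is. The function names are hypothetical.

```python
RECORD_END = b"\x1d"   # ISO 2709 record terminator
FIELD_END = b"\x1e"    # field terminator
SUBFIELD = b"\x1f"     # subfield delimiter


def split_records(data: bytes):
    """Yield each raw MARC record (terminator included) from a byte string."""
    for chunk in data.split(RECORD_END):
        if chunk.strip():
            yield chunk + RECORD_END


def lccn_of(record: bytes):
    """Return the 010 $a of a raw MARC record, or None if it has none."""
    base = int(record[12:17])          # leader 12-16: base address of data
    directory = record[24:base - 1]    # 12-byte entries: tag, length, start
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        if entry[:3] != b"010":
            continue
        length, start = int(entry[3:7]), int(entry[7:12])
        field = record[base + start:base + start + length]
        for sub in field.rstrip(FIELD_END).split(SUBFIELD)[1:]:
            if sub[:1] == b"a":
                return sub[1:].decode("utf-8", "replace").strip()
    return None


def dedupe_on_lccn(data: bytes) -> bytes:
    """Keep the first record for each LCCN; keep all records lacking one."""
    seen, kept = set(), []
    for record in split_records(data):
        key = lccn_of(record)
        if key is not None:
            if key in seen:
                continue
            seen.add(key)
        kept.append(record)
    return b"".join(kept)
```

Reading the concatenated search results as bytes and passing them through `dedupe_on_lccn` would give the contents of a deduped file. A MARC library would be the more robust choice for real data; the hand-rolled parsing is only there to keep the sketch dependency-free.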
+Next I created 11 folders in my RIMMF data area named: jata ... jatk \\
+Then I copied Deborah's 75 [[http://rballs.info/zip/JaneBase-150106.zip|janebase records]] into each folder.
+Next, I repeated the following steps, once for each folder (11 times):
+	-opened the rimmf .ini file
+	-changed the record prefix: (1. jaia ... 11. jaik)
+	-changed the data folder:   (1. jata ... 11. jatk)
+	-saved the .ini
+	-started RIMMF, went to F3
+	-batchloaded* the next 100 records from the jane.mrc file
+	-closed the program
+Notes:
+For the last folder, I batchloaded((batchload refers to a function I am working on; in brief, the user can direct the F3 form to take its searches from a file (of ISBNs, LCCNs, or in this case, MARC records). The program then iterates the file, constructs a search term from each item, and performs the standard F3 processing from there without asking the user for any confirmations)) 188 records instead of the usual 100.\\
+The time to run 100 records through F3 ranged from 15 to 18 minutes (on a Saturday afternoon).
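The per-folder loop above can be written down as a small plan, sketched here in Python. It only restates the numbers given in the text (11 folders jata ... jatk, prefixes jaia ... jaik, 100 records per batch, the remaining 188 in the last one). The function name is hypothetical, and nothing here drives RIMMF itself.

```python
from string import ascii_lowercase

TOTAL_RECORDS = 1188   # records in the deduped MARC file
BATCH_SIZE = 100       # records batchloaded per folder
FOLDERS = 11           # jata ... jatk


def batch_plan(total=TOTAL_RECORDS, size=BATCH_SIZE, folders=FOLDERS):
    """Return (folder, prefix, first, last) tuples; the last folder takes the remainder."""
    plan = []
    for i, letter in enumerate(ascii_lowercase[:folders]):
        first = i * size
        last = total if i == folders - 1 else first + size
        plan.append((f"jat{letter}", f"jai{letter}", first, last))
    return plan


for folder, prefix, first, last in batch_plan():
    print(f"{folder}  {prefix}  records {first + 1}-{last}  ({last - first})")
```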
+The results of this processing:
+<code>
+folder	prefix	jbase	marc	import	folder total
+jata	jaia	75	100	380	455
+jatb	jaib	75	100	336	411
+jatc	jaic	75	100	369	444
+jatd	jaid	75	100	386	461
+jate	jaie	75	100	326	401
+jatf	jaif	75	100	338	413
+jatg	jaig	75	100	326	401
+jath	jaih	75	100	350	425
+jati	jaii	75	100	377	452
+jatj	jaij	75	100	365	440
+jatk	jaik	75	188	588	663
+tot		825	1188	4141	4966
+</code>
+In addition, since it was the main point of the test, I tracked the links too((the following table is revised and corrected from the original)):
+<code>
+folder	prefix	J2J	O2J	total links
+jata	jaia	148	63	211
+jatb	jaib	148	69	217
+jatc	jaic	148	87	235
+jatd	jaid	148	77	225
+jate	jaie	148	65	213
+jatf	jaif	148	92	240
+jatg	jaig	148	129	277
+jath	jaih	148	110	258
+jati	jaii	148	84	232
+jatj	jaij	148	112	260
+jatk	jaik	148	178	326
+tot		1628	1066	2694
+</code>
+**J2J** is the number of 'janebase' to 'janebase' links\\
+**O2J** is the number of 'jai.' to 'janebase' links, i.e. links added to a folder by the MARC import processing
+Next I ran the test of the Jane-athon 'merge':
+  - created a new folder named 'jane1'
+  - copied the 75 records from the janebase into it
+  - for each of the 11 'jat.' folders, dragged and dropped its contents into 'jane1'. \\
+When asked how to handle the 75 janebase records (i.e. overwrite or skip), I chose 'Don't copy'.
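For anyone who would rather script the merge than drag and drop, here is a minimal Python sketch of the same idea. It is not part of RIMMF. It assumes each record is a plain file directly inside its folder and that the janebase records carry the same file names in every folder (which is what triggers the overwrite/skip prompt). `merge_folders` is a hypothetical helper name.

```python
import shutil
from pathlib import Path


def merge_folders(sources, target):
    """Copy every file from each source folder into target, never overwriting.

    Returns (copied, skipped) counts. Skipping an existing name mirrors
    choosing "Don't copy" in the Explorer dialog.
    """
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    copied = skipped = 0
    for source in sources:
        for item in sorted(Path(source).iterdir()):
            if not item.is_file():
                continue
            destination = target / item.name
            if destination.exists():
                skipped += 1
            else:
                shutil.copy2(item, destination)
                copied += 1
    return copied, skipped
```

Called with the 11 'jat.' folders as `sources` and 'jane1' as `target`, the skipped count should equal the number of duplicate janebase files if the assumptions above hold.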
+Next, I loaded 'jane1' into RIMMF.
+The first time I thought it was not going to work. It took a long time--a very long time. Well over an hour, maybe an hour and a half. Not looking too good.
+But the next day I started looking for bottlenecks. The main one was the 'feature' whereby RIMMF reconstructs missing links when it loads the EI. Completely rewriting that routine knocked off an hour. Getting better. That in turn made it possible to step through the code and find and fix more bottlenecks.
+The stats for the resulting EI are:
+<code>
+folder	jbase	marc	import	records	J2J	O2J	total links
+jane1	75			4216	148	1066	1214
+</code>
+The cumulated 'O2J' count in 'jane1' (1066) matches the sum of the O2J counts of the 11 source folders, which suggests that the merge using Windows Explorer will work.
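The arithmetic behind that check, as a short Python sketch. The per-folder numbers are copied from the two tables above. The expectation that the merged folder holds one copy of the 75 janebase records plus every imported record is my reading of the test setup, not something RIMMF reports.

```python
# Per-folder counts copied from the tables above: (folder, import, O2J)
FOLDER_STATS = [
    ("jata", 380, 63), ("jatb", 336, 69), ("jatc", 369, 87),
    ("jatd", 386, 77), ("jate", 326, 65), ("jatf", 338, 92),
    ("jatg", 326, 129), ("jath", 350, 110), ("jati", 377, 84),
    ("jatj", 365, 112), ("jatk", 588, 178),
]
JANEBASE = 75   # janebase records, present once after the merge
J2J = 148       # janebase-to-janebase links, also present once

expected_records = JANEBASE + sum(imported for _, imported, _ in FOLDER_STATS)
expected_o2j = sum(o2j for _, _, o2j in FOLDER_STATS)
expected_links = J2J + expected_o2j

print(expected_records, expected_o2j, expected_links)  # 4216 1066 1214
```

All three values agree with the 'jane1' row: 4216 records, 1066 O2J links, 1214 total links.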
+Currently, the RIMMF start-up time is 2 mins 30 secs. This includes adding all the missing links, plus making the EI and all of the support tables.
+This isn't great, but it is workable. A fast computer might knock a minute off that time, but what about a laptop? I will try it on my laptop this week and see how much slower it is.
+I have also added support for EI caching to RIMMF3 (which we had in RIMMF2, but which has been turned off for RIMMF3 thus far).
+With caching, after the initial build, the program will be ready in about 2 or 3 seconds the next time it is started. The cache files can be distributed with the results if necessary (they are machine-independent).
+In addition to the above, note that running the program with the 4200 records is noticeably slower in some areas (all of which need to be addressed):
+  * F3/Import (the EI dupe-checking routines are slow)
+  * Loading an RTree that includes Jane Austen ;-)
+  * Discarding a set of imported records from the RTree
+I exported the 'jane1' EI as RDF/N-Triples and tried to run it through various web validators. (The export to RDF was extremely slow at first; it was later optimized down to about a minute and a half, but still needs improvement.) None of these tools seems able to handle a file of any size, let alone produce a visual graph, which is what I was hoping for.
+By way of validation I ran it through Raptor conversions on our server, submitted it to an RDF distiller, loaded it into an in-house triplestore (AllegroGraph), etc. There are about 72K statements, excluding the reifications and 'confusing' metadata but including RDA classes, rimmfIdentifiers, and labels for them.
+One problem found during validation came from importing tag 670 from the LC authority records (which we map to 'sourceConsulted...'). These MARC fields may contain URLs, and in a few cases the way they are entered confused RIMMF, so that the resulting RDF object was output as a URI instead of a string containing a URL.
+Here's an example from the MARC((in my example the 670 $b is omitted, but in actual practice it's included in the string directly after the 670 $a)):
+$a http://www.english.ox.ac.uk/about-faculty/faculty-members/research-centre-college-staff/byrne-dr-sandie, May 9, 2014:
+and the resulting RDF object in RIMMF:
+  <http://rimmfdata.com/r/jaie00004048> <http://rimmf.com/vocab/sourceConsultedPerson>
+    <http://www.english.ox.ac.uk/about-faculty/faculty-members/research-centre-college-staff/byrne-dr-sandie, May 9, 2014:>
+This was fixed (as the validators trip on it) by inserting a label at the head of the text during mapping, so that it now looks like this:
+  <http://rimmfdata.com/r/jaie00004048> <http://rimmf.com/vocab/sourceConsultedPerson>
+    "Url: http://www.english.ox.ac.uk/about-faculty/faculty-members/research-centre-college-staff/byrne-dr-sandie, May 9, 2014:"
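The idea of the fix, sketched in Python. This is not RIMMF's code: the function names are hypothetical, and the rule "prefix anything that starts with http:// or https://" is an assumption about how such values could be recognised. Only the "Url: " label itself comes from the example above.

```python
def nt_literal(text: str) -> str:
    """Serialize text as an N-Triples string literal with basic escaping."""
    escaped = (text.replace("\\", "\\\\")
                   .replace('"', '\\"')
                   .replace("\n", "\\n")
                   .replace("\r", "\\r"))
    return f'"{escaped}"'


def map_source_consulted(value: str) -> str:
    """Return the N-Triples object for a 670 string.

    Values that begin with a URL get a "Url: " label, so they are always
    emitted as string literals and never as <...> IRIs.
    """
    text = value.strip()
    if text.lower().startswith(("http://", "https://")):
        text = "Url: " + text
    return nt_literal(text)
```

Always emitting a quoted literal matters here because the 670 text contains spaces and a trailing date, which are not allowed inside an IRI; that is why the validators tripped on the original output.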
+__Data available for download__
+  * http://rimmf.com/data/jane-all-deduped.mrc	--1188 MARC records about Jane
+  * http://rimmf.com/data/jane-4k.nt		--the jane1 folder exported as RDF
+  * http://rimmf.com/data/jane-rimmf-test.zip	--the 11 inputs and the 'jane1' result folder; unzip to RIMMF3\data
+  * http://rimmf.com/data/jane1-ei-cache.zip	--a pre-built EI for the jane1 folder; unzip to RIMMF3\tables
+  * http://www.marcofquality.com/sft/setupRimmf3-150119b.exe  --RIMMF update that includes the DB improvements mentioned above
+  * http://www.marcofquality.com/sft/setupRimmf3-150119b.zip
