— | janetest [2023/06/07 20:39] (current) – created - external edit 127.0.0.1 | ||
---|---|---|---|
Line 1: | Line 1: | ||
+ | ====== Pre-Jane-athon testing ====== | ||
+ | |||
+ | I wanted to see how things might go on **The Day** once the inputting, etc., was over, especially regarding the use of Windows Explorer to join the files from each user/table, so I constructed the following test scenario. | ||
+ | |||
+ | First, I searched these MARC sources to gather some sample data: | ||
+ | -LC MARC English (7M records) | ||
+ | -LC MARC Foreign (5M records) | ||
+ | |||
+ | The search was for any occurrence of 'Jane Austen' in any field. | ||
+ | I took the matching records, concatenated them together, and deduped them on LCCN.\\ | ||
+ | In the end I assembled 1188 MARC records into a file called ' | ||
+ | |||
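The concatenate-then-dedupe step can be sketched roughly as follows. This is only an illustration, not the tool actually used: records are represented as plain dicts with a hypothetical `lccn` key (in MARC the LCCN lives in field 010 $a), the sample values are made up, and the whitespace normalization is an assumption about how near-identical LCCNs might be matched.

```python
def normalize_lccn(lccn):
    """Remove all whitespace so differently padded LCCNs compare equal."""
    return "".join(lccn.split())


def dedupe_on_lccn(records):
    """Keep the first record seen for each LCCN, preserving input order.

    `records` is a hypothetical list of dicts with an 'lccn' key; records
    without an LCCN are all kept, since there is no key to compare on.
    """
    seen = set()
    unique = []
    for record in records:
        lccn = record.get("lccn")
        if not lccn:
            unique.append(record)
            continue
        key = normalize_lccn(lccn)
        if key in seen:
            continue  # duplicate LCCN: drop this copy
        seen.add(key)
        unique.append(record)
    return unique


# Two result sets concatenated, then deduped (made-up sample values).
english = [{"lccn": "n 79032879", "title": "A"}, {"lccn": "12345678", "title": "B"}]
foreign = [{"lccn": "n  79032879", "title": "A (dup)"}, {"lccn": "87654321", "title": "C"}]
merged = dedupe_on_lccn(english + foreign)
print([r["title"] for r in merged])
```

Keeping the first occurrence is an arbitrary choice here; a real run might prefer the fuller or more recent record.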
+ | Next I created 11 folders in my RIMMF data area named: jata ... jatk \\ | ||
+ | Then I copied Deborah' | ||
+ | |||
+ | Next, I repeated the following steps, once for each folder (11 times): | ||
+ | |||
+ | -opened the rimmf .ini file | ||
+ | -changed the record prefix: (1. jaia ... 11. jaik) | ||
+ | -changed the data folder: | ||
+ | -saved the .ini | ||
+ | -started RIMMF, went to F3 | ||
+ | -batchloaded* the next 100 records from the jane.mrc file | ||
+ | -closed the program | ||
+ | |||
+ | Notes: | ||
+ | For the last folder, I batchloaded((batchload refers to a function I am working on; in brief, the user can direct the F3 form to take its searches from a file (of ISBNs, LCCNs, or in this case, MARC records). The program then iterates the file, constructs a search term from each item, and performs the standard F3 processing from there without asking the user for any confirmations)) 188 records instead of the usual 100.\\ | ||
+ | The time to run 100 records through F3 ranged from 15 to 18 minutes (on a Saturday afternoon). | ||
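The way the 1188 records were spread over the 11 folders can be sketched as a small planning function. The folder and prefix names follow the pattern given above (jata ... jatk, jaia ... jaik); the function name and the tuple layout are just illustrative, and nothing here touches RIMMF or its .ini file.

```python
import string


def plan_batches(total_records, folder_count, batch_size):
    """Return (folder, prefix, first_record, record_count) tuples.

    Every folder gets `batch_size` records except the last, which takes
    whatever remains.
    """
    plan = []
    start = 0
    for i in range(folder_count):
        letter = string.ascii_lowercase[i]
        is_last = i == folder_count - 1
        count = total_records - start if is_last else batch_size
        plan.append((f"jat{letter}", f"jai{letter}", start, count))
        start += count
    return plan


for folder, prefix, first, count in plan_batches(1188, 11, 100):
    print(f"{folder}  {prefix}  records {first + 1}-{first + count}  ({count})")
```

With these inputs the last folder (jatk) ends up with the remaining 188 records, matching the note above.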
+ | |||
+ | |||
+ | The results of this processing: | ||
+ | < | ||
+ | folder prefix jbase marc import folder total | ||
+ | jata jaia 75 100 380 455 | ||
+ | jatb jaib 75 100 336 411 | ||
+ | jatc jaic 75 100 369 444 | ||
jatd jaid 75 100 386 461
+ | jate jaie 75 100 326 401 | ||
+ | jatf jaif 75 100 338 413 | ||
+ | jatg jaig 75 100 326 401 | ||
+ | jath jaih 75 100 350 425 | ||
+ | jati jaii 75 100 377 452 | ||
+ | jatj jaij 75 100 365 440 | ||
+ | jatk jaik 75 188 588 663 | ||
+ | |||
tot 825 1188 4141 4966
+ | </ | ||
+ | |||
+ | In addition, since it was the main point of the test, I tracked the links too((the following table is revised and corrected from the original)): | ||
+ | |||
+ | < | ||
+ | folder prefix J2J O2J total links | ||
+ | jata jaia 148 63 211 | ||
+ | jatb jaib 148 69 217 | ||
+ | jatc jaic 148 87 235 | ||
jatd jaid 148 77 225
+ | jate jaie 148 65 213 | ||
+ | jatf jaif 148 92 240 | ||
+ | jatg jaig 148 129 277 | ||
+ | jath jaih 148 110 258 | ||
+ | jati jaii 148 84 232 | ||
+ | jatj jaij 148 112 260 | ||
+ | jatk jaik 148 178 326 | ||
+ | |||
+ | tot 1628 1066 2694 | ||
+ | </ | ||
+ | |||
+ | **J2J** is the number of ' | ||
+ | **O2J** is the number of ' | ||
+ | |||
+ | Now run the test of the janeathon ' | ||
+ | |||
+ | - create a new folder named ' | ||
+ | - copy the 75 records from the janebase into it | ||
+ | - for each of the 11 ' | ||
+ | |||
+ | When asked how to handle the 75 janebase records (i.e., overwrite or skip), elect: ' | ||
+ | |||
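For anyone who would rather script the join than drag folders around in Windows Explorer, here is a minimal sketch of the same idea. It assumes one file per record and that a shared record has the same file name in every folder; the `.rec` names, the `merged` folder name, and the `on_conflict` parameter are all made up for the illustration, and the policy is left as a parameter rather than fixed.

```python
import shutil
import tempfile
from pathlib import Path


def merge_folders(sources, target, on_conflict="skip"):
    """Copy every file from each source folder into `target`.

    on_conflict: "skip" keeps the copy already in `target`,
    "overwrite" replaces it. Returns (copied, conflicts) counts.
    """
    if on_conflict not in ("skip", "overwrite"):
        raise ValueError(f"unknown policy: {on_conflict}")
    target = Path(target)
    target.mkdir(parents=True, exist_ok=True)
    copied = conflicts = 0
    for source in sources:
        for path in sorted(Path(source).iterdir()):
            if not path.is_file():
                continue
            destination = target / path.name
            if destination.exists():
                conflicts += 1
                if on_conflict == "skip":
                    continue
            shutil.copy2(path, destination)
            copied += 1
    return copied, conflicts


# Tiny demonstration with two made-up folders sharing one base record.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    for folder, names in {"jata": ["base1.rec", "jaia1.rec"],
                          "jatb": ["base1.rec", "jaib1.rec"]}.items():
        (root / folder).mkdir()
        for name in names:
            (root / folder / name).write_text(folder)
    result = merge_folders([root / "jata", root / "jatb"], root / "merged")
    merged_names = sorted(p.name for p in (root / "merged").iterdir())
    print(result, merged_names)
```

The returned conflict count gives a quick check that the shared base records were the only collisions.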
+ | Next: load ' | ||
+ | |||
+ | The first time I thought it was not going to work. It took a long time--a very long time. Well over an hour, maybe an hour and a half. Not looking too good. | ||
+ | |||
+ | But the next day I started looking for bottlenecks. The main one was the ' | ||
+ | |||
+ | The stats for the resulting EI are | ||
+ | < | ||
+ | folder jbase marc import records J2J O2J total links | ||
+ | jane1 75 4216 148 1066 1214 | ||
+ | </ | ||
+ | |||
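The merged figures can be cross-checked against the per-folder tables with a few lines of arithmetic. The sketch below assumes, as the numbers suggest, that the 75 janebase records and their 148 J2J links are shared by every folder and therefore counted only once after the merge; the variable names are my own.

```python
JANEBASE_RECORDS = 75   # present in all 11 folders, kept once after merging
J2J_LINKS = 148         # identical in every folder

# Per-folder values copied from the two tables above (jata ... jatk).
imports = [380, 336, 369, 386, 326, 338, 326, 350, 377, 365, 588]
o2j_links = [63, 69, 87, 77, 65, 92, 129, 110, 84, 112, 178]

merged_records = JANEBASE_RECORDS + sum(imports)
merged_links = J2J_LINKS + sum(o2j_links)

print(merged_records)  # 4216
print(merged_links)    # 1214
```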
+ | The cumulated ' | ||
+ | |||
+ | Currently, the RIMMF startup time is 2 minutes 30 seconds. This includes adding all the missing links, plus making the EI and all of the support tables. | ||
+ | |||
+ | This isn't great, but it is workable. A fast computer should knock a minute off that time, but what about a laptop? I will try it on my laptop this week and see how much slower it is. | ||
+ | |||
+ | I have also added support for EI caching to RIMMF3 (which we had in RIMMF2, but which has been turned off for RIMMF3 thus far). | ||
+ | |||
+ | Thus, with caching, after the initial build, the program will be ready in about 2 or 3 seconds the next time it is started. The cache files can be distributed with the results if necessary (they are machine-independent). | ||
+ | |||
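The general build-once, reload-afterwards pattern behind this kind of caching can be sketched as below. This is not RIMMF's implementation: the index structure, the JSON cache format (chosen here only because it is portable between machines), and every name in the code are assumptions for the sake of the example.

```python
import json
import tempfile
from pathlib import Path

build_calls = 0  # counts how often the slow build step actually runs


def build_index(records):
    """Stand-in for the slow step: map each record id to its sorted links."""
    global build_calls
    build_calls += 1
    return {record["id"]: sorted(record["links"]) for record in records}


def load_index(records, cache_path):
    """Return the index, reading the cache file when one exists."""
    cache_path = Path(cache_path)
    if cache_path.exists():
        return json.loads(cache_path.read_text(encoding="utf-8"))
    index = build_index(records)
    cache_path.write_text(json.dumps(index), encoding="utf-8")
    return index


records = [{"id": "jaia1", "links": ["base2", "base1"]},
           {"id": "jaia2", "links": []}]
with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "index.cache.json"
    first = load_index(records, cache)    # builds and writes the cache
    second = load_index(records, cache)   # served from the cache file
print(first == second, build_calls)
```

A real cache would also need some way to notice that the records have changed since the cache was written; that part is left out here.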
+ | In addition to the above, note that running the program with the 4200 records is noticeably slower in some areas (all of which need to be addressed): | ||
+ | * F3/Import (the EI dupe-checking routines are slow) | ||
+ | * Loading an RTree that includes Jane Austen ;-) | ||
+ | * Discarding a set of imported records from the RTree | ||
+ | |||
+ | I exported the ' | ||
+ | |||
+ | By way of validation I ran it through Raptor conversions on our server, submitted it to an RDF distiller, loaded it into an in-house triplestore (AllegroGraph), | ||
+ | |||
+ | One problem found during validation was based on importing tag 670 from the LC authority records (which we map to ' | ||
+ | |||
+ | Here's an example from the MARC((in my example the 670 $b is omitted, but in practice it's included in the string directly after the 670 $a)): | ||
+ | 670 $a http:// | ||
+ | |||
+ | and the resulting RDF object in RIMMF: | ||
+ | < | ||
+ | < | ||
+ | |||
+ | This was fixed (as the validators trip on it) by inserting a label at the head of the text during mapping, so that it now looks like this: | ||
+ | < | ||
+ | "Url: http:// | ||
+ | |||
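The labelling fix amounts to a small check at mapping time. A minimal sketch, assuming the rule is simply "if the text starts with a URL, put a label in front of it"; the function name, the scheme test, and the sample strings are illustrative and not taken from RIMMF.

```python
def label_leading_url(text, label="Url: "):
    """Prefix `text` with `label` when it starts with a URL scheme."""
    stripped = text.lstrip()
    if stripped.lower().startswith(("http://", "https://")):
        return label + stripped
    return text


print(label_leading_url("http://example.org/page (viewed 2014)"))
print(label_leading_url("Her Pride and prejudice, 1813"))
```

Text that has already been labelled no longer starts with a URL scheme, so running the function twice leaves it unchanged.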
+ | __Data available for download__ | ||
+ | |||
+ | * http:// | ||
+ | * http:// | ||
+ | * http:// | ||
+ | * http:// | ||
+ | * http:// | ||
+ | * http:// | ||
+ | |||
+ | |||