MARC Report 248: Program enhancements

Split By Data: Options

MARC Split allows you to break a large MARC file into a number of smaller ones. As of version 248, there are three ways to split a file:

by the number of records per file,
by the number of bytes per file,
by the data values in the file itself

Here is a screenshot of the updated MARC Split options form, with the new options for 'By Data' highlighted:

The By Data option might be useful if you need to split a file by the value of a MARC content designator, and that content designator contains relatively consistent data.

For example, you might need to split a MARC file

by language code
by holdings code
by media type
by publication date

and so on.

How does it work

For a quick demo, try these steps.

Select a reasonably large MARC file and then select MARC Split from the Utilities menu.

On the 'Split' form, select the By Data radio button and set up the form as follows:


Tag/Subf	008/15
Data length	3

Down in the Output folder section, click on the 'ellipsis button' and select a folder where the results will be output.

Now press Run. The green progress bar should start moving. Press 'Cancel' if you need to abort the job for any reason.

When the green bar reaches the end of the file, a report will pop up. Have a look at it, and then press Cancel. Read on to find out what's happening.

This utility differs from the others in that it runs two passes on the source file. The first pass is a report mode, in which the program applies the options on the form to determine which files will be created. When this pass completes, it pops up the proposed results and asks you to confirm them¹⁾. If you confirm the proposed results, the utility runs a second time to actually write the output files.
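The two-pass flow described above can be sketched in a few lines of Python. This is only an illustration, not MARC Report's actual code: the function names (plan_split, write_split, split_by_data) and the key_of and confirm callbacks are all hypothetical, and the records are assumed to be raw bytes held in a list that can be read twice.

```python
from collections import Counter
from pathlib import Path


def plan_split(records, key_of):
    """First pass (report mode): count records per key; write nothing."""
    return Counter(key_of(rec) for rec in records)


def write_split(records, key_of, out_dir):
    """Second pass: write each record to the file named after its key."""
    out_dir = Path(out_dir)
    seen = set()
    for rec in records:
        key = key_of(rec)
        # Create the file the first time a key is seen, append afterwards.
        mode = "ab" if key in seen else "wb"
        seen.add(key)
        with open(out_dir / f"{key}.mrc", mode) as fh:
            fh.write(rec)


def split_by_data(records, key_of, out_dir, confirm):
    """Run the report pass, ask for confirmation, then write the files."""
    plan = plan_split(records, key_of)
    if not confirm(plan):      # user pressed Cancel on the report
        return None
    write_split(records, key_of, out_dir)
    return plan
```

Passing a confirm callback that always returns False corresponds to pressing Cancel on the report: the plan is computed, but no file is created.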

The reason for this two-stage processing is that the Split By Data utility has the potential to create a lot of files, possibly thousands of them, because the filenames are generated from the actual data found in the specified Tag/Subf. So in the example above, which uses the MARC Country code in the 008, the resulting output filenames might be:

nyu.mrc
cau.mrc
enk.mrc
mau.mrc

etc.
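To make the link between the 008/15 setting and these filenames concrete, here is a minimal Python sketch that pulls three characters from offset 15 of the 008 field of a raw ISO 2709 (MARC) record and turns them into a filename. It is not the program's own code. The names control_field_slice, output_filename and make_record are hypothetical, and replacing blanks with '#' is an assumption based on how the sample reports on this page display blanks.

```python
FIELD_END = b"\x1e"    # ISO 2709 field terminator
RECORD_END = b"\x1d"   # ISO 2709 record terminator


def control_field_slice(record, tag, offset, length):
    """Return `length` characters at `offset` of the first `tag` control
    field, or None if the field is missing or too short."""
    base = int(record[12:17])          # leader/12-16: base address of data
    directory = record[24:base - 1]    # 12-byte entries, terminator dropped
    for i in range(0, len(directory), 12):
        entry = directory[i:i + 12]
        if entry[:3] != tag.encode("ascii"):
            continue
        field_len = int(entry[3:7])
        start = base + int(entry[7:12])
        data = record[start:start + field_len - 1]   # drop field terminator
        piece = data[offset:offset + length]
        if len(piece) != length:
            return None
        return piece.decode("ascii", "replace")
    return None


def output_filename(value):
    # Assumption: blanks become '#', as in the sample report listings.
    return value.replace(" ", "#") + ".mrc"


def make_record(fields):
    """Build a tiny ISO 2709 record for demonstration (hypothetical helper)."""
    directory = b""
    data = b""
    for tag, value in fields:
        body = value + FIELD_END
        directory += f"{tag}{len(body):04d}{len(data):05d}".encode("ascii")
        data += body
    base = 24 + len(directory) + 1
    total = base + len(data) + 1
    leader = f"{total:05d}nam a22{base:05d} a 4500".encode("ascii")
    return leader + directory + FIELD_END + data + RECORD_END


# A 40-character 008 with the country code 'nyu' at positions 15-17.
field_008 = b"170314s2016    " + b"nyu" + b" " * 22
rec = make_record([("008", field_008)])
print(output_filename(control_field_slice(rec, "008", 15, 3)))  # nyu.mrc
```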

This will not normally be a problem for data like that found in fixed fields, as coded data values are usually very well controlled and relatively limited in number. But suppose that, instead of the scenario above, we tried:


Tag/Subf	700e
Data length	32

You will no doubt find quite a different story, as the control we have over our variable field data is not as good. Here's a piece of a report from the same file, with Tag/Subf set to 700e instead of 008/15:

film co director                 1
film composer                    2
film diector                     2
film directer                    1
film director                    6268
film director editor of moving i 1
film director film producer      2
film director screenwriter       2
film diretor                     1
film distributor                 4
film editor                      43
film narrator                    1
film photographer                1
film pproducer                   1
film pro ducer                   1
film procer                      1
film prodcuer                    4
film produce                     1
film producer                    7064
film producer editor of moving i 1
film proudcer                    2
film publisher                   1
film screenwriter                2
filmmaker                        75
filmproducer                     3
films producer                   3
fim director                     1
fim producer                     1

Lots of typos here; the catalogers must not be MARC Report users at this library!

Nota Bene

So this brings us to an important note about Split By Data:

There is no attempt to validate the contents of the specified Tag/Subf.

Keep this in mind when using this option.

To avoid a scenario where thousands of files are output to an unwitting user's hard drive, we have arbitrarily set a limit on the number of files that Split By Data will create. If that limit is exceeded, the first pass will fail²⁾. The limit in version 248 is 1,000 files. That should accommodate most needs; if it doesn't, let us know about it.
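A guard of this kind could look like the Python sketch below. The 1,000-file figure comes from the text above; everything else (the names first_pass and TooManyFilesError, and the choice to stop as soon as the limit is passed rather than at the end of the pass) is an assumption made for illustration, not a description of how the program is written.

```python
from collections import Counter

MAX_FILES = 1000  # the limit stated for version 248


class TooManyFilesError(Exception):
    """Raised when the first pass would create more files than allowed."""


def first_pass(keys, max_files=MAX_FILES):
    """Count records per key, failing once the number of distinct keys
    (one output file each) exceeds max_files."""
    counts = Counter()
    for key in keys:
        counts[key] += 1
        if len(counts) > max_files:
            raise TooManyFilesError(
                f"more than {max_files} output files would be created"
            )
    return counts
```

Because the check runs during the report pass, nothing has been written to disk when the failure occurs.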

Finally, just a reminder of the obvious: the MARC Split Utility, and especially Split By Data, takes a MARC file and splits it into smaller files based on your options. If you don't want all these files, and simply want a report or list of, say, every MARC Country code in your file and the number of times it appears, or every relationship designator, etc., this is not the tool to use.

Instead, use the Custom List option in MARC Analysis. It is much faster and much more flexible, and it has none of the limits that have been imposed on Split By Data.

Sample report 1: MARC Country code

MARC REPORT 2.48		03/14/17 3:02 PM

Split By Data on 008/15, Length=3, Repeat='First occ only'

MARC Source File:       D:\un\_big_marc\verified-161001.mrc
Records processed:      220695
Report filename:        D:\un\_big_marc\splitResults\SplitReport-17031401.txt

The current split options will generate the following results:

Number of files:        177
Number of records:      220695
Number of no-hits:      0

Filename                Record count
nyu                     107916
cau                     25161
mnu                     8791
meu                     8108
mau                     7364
ilu                     5136
enk                     5037
miu                     4663
onc                     3925
nju                     3462
azu                     3095
xxu                     2798
mdu                     2539
ctu                     2317
pau                     2293
ohu                     2017
flu                     1906
tnu                     1701
vau                     1655
dcu                     1637
wiu                     1574
sp#                     1574
oru                     1403
txu                     1389
cou                     1027
xx#                     990
wau                     923
inu                     781
mx#                     697
ncu                     598
vtu                     552
nmu                     537
utu                     528
###                     464
iau                     415
gau                     374
oku                     364
scu                     341
mou                     291
bcc                     262
riu                     255
at#                     253
nbu                     246
nhu                     241
nvu                     237
ksu                     210
gw#                     194
deu                     184
quc                     182
ag#                     160
mtu                     150
aru                     130
alu                     129
lau                     103
kyu                     100
%%%                     84
fr#                     79
ja#                     77
cc#                     73
vra                     70
ne#                     62
xxk                     60
si#                     50
ck#                     49
idu                     43
stk                     40
it#                     38
xxc                     36
msu                     35
wyu                     32
xna                     26
nz#                     26
ii#                     25
hiu                     24
sz#                     23
abc                     22
aku                     19
sw#                     17
ie#                     16
wvu                     14
ve#                     12
pr#                     12
ko#                     12
is#                     11
ch#                     10
au#                     10
[report truncated; another 90 items follow]

Sample report 2: RDA Carrier type

MARC REPORT 2.48		03/14/17 12:15 PM

Split By Data on 338a, Length=32, Repeat='First occ only'

MARC Source File:       D:\un\_big_marc\verified-161001.mrc
Records processed:      220695
Report filename:        D:\un\_big_marc\splitResults\SplitReport-17031401.txt

The current split options will generate the following results:

Number of files:        12
Number of records:      221154
Number of no-hits:      163782

Filename                Record count
_SplitByDataNoHits      163782
volume                  42741
videodisc               9163
audio disc              4926
other                   460
sheet                   35
computer disc           30
object                  8
videocassette           5
unspecified             2
card                    1
audiocassette           1

No records will be output more than once.

¹⁾

You can see a sample report at the bottom of this page.

²⁾

As a reward to those who read footnotes: if you have a real need to exceed this limit, send us an email and we'll tell you how.
