phelp:helpmarcsplit [[MARC Report]]

UTILITIES–MARC SPLIT

MARC Split allows you to break a large MARC file into several/many smaller ones.

WHY USE MARC SPLIT

Here are a few scenarios where MARC SPLIT would be useful.

First, say you need to transfer a large database across the internet. You decide (or are told) to split the database into several smaller files, then FTP each file separately. In this case, if the FTP job aborts, you will only need to resend one smaller file. MARC Split's 'By Records' option will do it for you.

Second, say you want to copy your database to a set of removable media (floppy, CD, USB drive, etc.), and want each disk to contain a readable MARC file of roughly the same size. MARC Split's 'By Size' option will do this. (If you do not require each disk to contain a readable MARC file, a zip program that supports multi-volume archives might also suffice.)

Finally, you might use MARC Split to split your database into files by holdings code, record type, media type, or any other distinct MARC element. For example, say you have 12 holdings codes in your system, and you need to generate 12 files, where each file contains all the bib records to which one particular holdings code is attached. MARC Split's 'By Data' option will do it for you.

Note: To split a file into two files based on the presence/absence of a pattern, use the MARC Review Utility.

SPLIT OPTIONS

There are three ways to split a file:

by the number of records per file,
by the number of bytes per file,
by the data values in the file itself

By Records

This option is suitable for the first scenario discussed above. Click on 'By Records'; then, in the box labeled 'Number of units', enter the number of records that you want written to each file. The number of files created will equal the number of records in your source file divided by the number of 'units', rounded up. Note, however, that the last file created will typically have fewer than the specified number of records (i.e., the remainder of the division operation).

The size of each file in bytes will vary, of course, due to the variable-length nature of MARC.
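The division described above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the utility's own code; the function name `files_by_records` is hypothetical.

```python
def files_by_records(total_records: int, units: int) -> tuple[int, int]:
    """Return (number_of_files, records_in_last_file) for a By Records split."""
    if units < 1:
        raise ValueError("units must be at least 1")
    if total_records == 0:
        return 0, 0
    # Integer ceiling division: a partial final file still counts as a file.
    n_files = (total_records + units - 1) // units
    # The last file holds whatever is left after the full files are written.
    last = total_records - (n_files - 1) * units
    return n_files, last


print(files_by_records(10500, 1000))  # (11, 500)
```

Setting `units` to 1 gives one file per record, which is the case described in the next paragraph.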

If for some reason (perhaps for a web catalog) you need to save each record in your database to a separate file, you can do it by setting the Number of units to 1. Be careful if doing this for a large file. First, check that there is enough disk space, since, as a rule, each file created by the Windows operating system uses several times more disk space than the size of the MARC record itself. Second, be sure to direct the output to an empty folder; this will make cleanup a lot easier if something goes wrong, or if you just change your mind.

By Size

This option is suitable when you are working with a known, limited amount of disk space, such as removable media. Click on the 'By Size' option; then enter the capacity of the storage medium, or a maximum file size, in kilobytes, in the 'Number of units' box. The program will make sure that the source file is split into smaller files of NO MORE THAN this size.

The number of files created will be roughly equal to the size of your source file in bytes divided by the number of bytes entered. The number of records in each file will vary, of course, due to the variable-length nature of MARC.
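One way to picture a By Size split is as a greedy grouping of whole records under a byte limit. The sketch below is an assumption about how such a split could work, not a description of the utility's internals: the function name `plan_by_size` is hypothetical, 1 KB is taken to be 1024 bytes, and a record larger than the limit is simply given a file of its own.

```python
def plan_by_size(record_lengths, max_kb):
    """Group record lengths (in bytes) into files of no more than max_kb kilobytes."""
    limit = max_kb * 1024  # assumption: 1 KB = 1024 bytes
    files = []
    current = []
    current_size = 0
    for length in record_lengths:
        # Start a new file when adding this record would exceed the limit.
        if current and current_size + length > limit:
            files.append(current)
            current, current_size = [], 0
        current.append(length)
        current_size += length
    if current:
        files.append(current)
    return files


print(plan_by_size([600, 600, 600, 300], 1))  # [[600], [600], [600, 300]]
```

Because records are never cut in half, each output file remains a readable MARC file, and the record count per file varies with record length.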

Note that the size of a storage device these days is typically given in gigabytes (e.g., '8 GB') or megabytes (e.g., '650 MB'). You may find it necessary to convert this rounded-off number into kilobytes. There are many free converters on the web that will do this (search for 'byte converter'). Or, get out your calculator and multiply the number of gigabytes by 1048576, or the number of megabytes by 1024.
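For readers who prefer a script to a calculator, the same conversion can be written as follows. The helper name `to_kilobytes` is hypothetical; the factors are the ones given above.

```python
KB_PER_MB = 1024
KB_PER_GB = 1024 * 1024  # 1048576


def to_kilobytes(amount: float, unit: str) -> int:
    """Convert a size in KB, MB or GB to kilobytes."""
    factors = {"KB": 1, "MB": KB_PER_MB, "GB": KB_PER_GB}
    return int(amount * factors[unit.upper()])


print(to_kilobytes(8, "GB"))    # 8388608
print(to_kilobytes(650, "MB"))  # 665600
```

The advertised capacity of a device is usually rounded, so it may be prudent to enter a somewhat smaller number than the converted value.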

By Data

The By Data option is useful if you need to split a file by the value of a MARC content designator, and that content designator contains relatively consistent data. For example, you might need to split a file

by language code
by holdings code
by media type
by publication date
or by any other MARC content designator

In a 'By Data' run, the output files are named using the strings contained in the specified MARC content designators (to which we add the file extension '.mrc'). So, if you are splitting a file by RDA Media Type (337 $a), the resulting files might be named: 'audio.mrc', 'computer.mrc', 'unmediated.mrc', etc., where every record output to 'audio.mrc' will contain the string 'audio' in 337 $a, and every record output to 'computer.mrc' will contain the string 'computer' in 337 $a, and so on.

In practice, Split By Data is a two-phase operation.

The utility first processes the selected MARC file in read-only mode; the purpose of this run is to gather statistics on the proposed files to be created. When that's done, the proposed file list is displayed, and you can either approve it, or cancel it and go back to the options. If you approve the proposed file list, the utility runs a second time to create the output files.

It's important to note that the Split utility does not attempt to validate the contents of the specified content designator. Any typos in your data will end up becoming filenames. Thus, continuing with the above example for 337 $a, you might end up with filenames like: 'unmedia ted.mrc', 'unmediaetd.mrc', 'unmediate.mrc', 'unmedited.mrc', etc. This isn't necessarily a bad thing (as it will help you to clean up the data and make it more usable), and the report that is displayed after the first phase will alert you to this and give you the chance to cancel the job.

Using the options

Enter the location of the data in the 'Tag/Subf' box. For variable tags, enter a three-digit MARC tag, followed by a subfield code. Here are a few examples:

040a - original cataloging agency code
337a - media type (string value)
852a - holdings code
949l - holdings code

For fixed fields, enter the three-digit tag, followed by '/', followed by the starting offset. Here are a few examples:

000/06 - record type
008/07 - publication date
008/35 - language code

In the 'Data length' option, enter the length of the data. For variable tags, enter a number between 1 and 32. For fixed fields, enter a number between 1 and 6 (there is no fixed field element with a length greater than 6). For example, for language code, enter '3'; for leader record type, enter '1'. For data that varies in length (like some holdings codes), enter a number that will accommodate the largest unique value.

The Data Length option is important since it will determine, for non-coded values, the number of files created–the greater the Data Length, the greater the number of files that will be created.
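The two 'Tag/Subf' notations and the 'Data length' ranges described above can be illustrated with a small parser. This is a sketch under assumptions: the function names and the dictionary layout are hypothetical, and the patterns accept only the forms shown in the examples (three digits plus one subfield code, or three digits, a slash, and an offset).

```python
import re


def parse_location(spec: str) -> dict:
    """Split a Tag/Subf entry such as '852a' or '008/35' into its parts."""
    spec = spec.strip()
    fixed = re.fullmatch(r"([0-9]{3})/([0-9]{1,2})", spec)
    if fixed:
        return {"kind": "fixed", "tag": fixed.group(1), "offset": int(fixed.group(2))}
    variable = re.fullmatch(r"([0-9]{3})([0-9a-z])", spec)
    if variable:
        return {"kind": "variable", "tag": variable.group(1), "subfield": variable.group(2)}
    raise ValueError(f"unrecognised Tag/Subf value: {spec!r}")


def check_data_length(kind: str, length: int) -> bool:
    """Apply the ranges given in the text: 1-6 for fixed fields, 1-32 for variable tags."""
    upper = 6 if kind == "fixed" else 32
    return 1 <= length <= upper


print(parse_location("852a"))
print(parse_location("008/35"))
```

Here '000' stands for the leader, as in the '000/06' example above.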

The 'If data repeats' option tells the program what to do when the value of a content designator repeats.

The first option is 'First occ only', which means Split will process only the first occurrence of each content designator. So, if you are splitting on holdings code, and a holdings field has data like this:

 852   $a MAIN $a NORTH $a SOUTH

–only 'MAIN' will generate an output record (to a file named 'MAIN.mrc')–the other occurrences of $a will be ignored.

On the other hand, if 'Every occ' is selected, then the program will output one copy of the record for each unique occurrence of a string. Referring to the previous example:

 852   $a MAIN $a NORTH $a SOUTH

–there will be one record output for 'MAIN', one record for 'NORTH', and one record for 'SOUTH'. Note that this option will typically create results where the sum of the records output is greater than the number of records input. See also the note about the 'Deduping' option, below.

If 'concat to one' is selected, the program will combine the occurrences, separated by hyphens, into one string. For example, if splitting by media type (337 $a), which is repeatable, and a record contains two media types, the program will output that record to a file named by concatenating the two media types. The output files might then be named: 'audio.mrc', 'audio-video.mrc', 'computer.mrc', and so on.

Finally, if 'Split to file' is selected, and a record contains more than one occurrence, that record will be output to a file named (literally) '_SplitByDataTooManyOccs.mrc'.
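The four 'If data repeats' choices can be compared side by side in a short sketch. The mode keys ('first', 'every', 'concat', 'split_to_file') and the function name are hypothetical stand-ins for the options on the screen, and duplicate values are not treated specially here.

```python
TOO_MANY = "_SplitByDataTooManyOccs"  # literal filename quoted in the text


def target_files(values, mode):
    """Return the output filenames one record would be written to."""
    if not values:
        return []  # records without the data are handled separately
    if mode == "first":
        names = [values[0]]
    elif mode == "every":
        names = list(values)
    elif mode == "concat":
        names = ["-".join(values)]
    elif mode == "split_to_file":
        names = [values[0]] if len(values) == 1 else [TOO_MANY]
    else:
        raise ValueError(f"unknown mode: {mode}")
    return [name + ".mrc" for name in names]


holdings = ["MAIN", "NORTH", "SOUTH"]
print(target_files(holdings, "first"))              # ['MAIN.mrc']
print(target_files(holdings, "every"))              # ['MAIN.mrc', 'NORTH.mrc', 'SOUTH.mrc']
print(target_files(["audio", "video"], "concat"))   # ['audio-video.mrc']
print(target_files(holdings, "split_to_file"))      # ['_SplitByDataTooManyOccs.mrc']
```

Only the 'every' mode can write the same record to more than one file, which is why it is the mode that inflates the output record count.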

Normalization and deduping

The following normalizations will be applied to each data value found (whether the 'Normalize' option is selected or not):

blank spaces in fixed field values are replaced with '#'
leading and trailing blank spaces are deleted; internal blanks are preserved
any character present that is not permitted in a Windows filename is replaced with '%'
strings longer than the value set in the 'Data length' option are truncated to that length
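A minimal sketch of these always-applied steps follows. The function name `clean_value` is hypothetical, and the order of the steps (in particular, truncating last) is an assumption, since the text does not state it. The forbidden set is the usual list of characters Windows rejects in filenames, plus control characters.

```python
# Characters Windows does not allow in filenames (control characters are checked separately).
FORBIDDEN = set('<>:"/\\|?*')


def clean_value(value: str, data_length: int, fixed_field: bool = False) -> str:
    """Apply the always-on normalizations to one data value."""
    if fixed_field:
        # Blanks in fixed field values become '#'.
        value = value.replace(" ", "#")
    # Leading and trailing blanks are dropped; internal blanks are kept.
    value = value.strip(" ")
    # Anything Windows would reject in a filename becomes '%'.
    value = "".join("%" if ch in FORBIDDEN or ord(ch) < 32 else ch for ch in value)
    # Assumed order: truncate to the 'Data length' setting last.
    return value[:data_length]


print(clean_value("  MAIN/REF  ", 32))            # MAIN%REF
print(clean_value("  d", 3, fixed_field=True))    # ##d
```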

If 'Normalize' is selected, then for each value:

MARC-8 diacritics are converted to ASCII approximations
All punctuation marks except '#', '&', and '+' are deleted
Consecutive blank spaces within a string are reduced to a single space
the string is shifted to lowercase
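The 'Normalize' steps can be approximated as below. This is a sketch under stated assumptions, not the utility's implementation: it expects an already-decoded Unicode string rather than raw MARC-8 bytes, it approximates diacritic removal by dropping combining marks, it takes 'punctuation' to mean the ASCII set in Python's `string.punctuation`, and it applies the steps in the order listed. The function name `normalize` is hypothetical.

```python
import re
import string
import unicodedata

KEEP = set("#&+")                        # punctuation the text says is retained
DELETE = set(string.punctuation) - KEEP  # assumption: ASCII punctuation only


def normalize(value: str) -> str:
    """Approximate the optional 'Normalize' steps for one data value."""
    # Decompose accented letters, then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", value)
    value = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Delete punctuation other than '#', '&' and '+'.
    value = "".join(ch for ch in value if ch not in DELETE)
    # Reduce runs of blanks to a single blank.
    value = re.sub(r" {2,}", " ", value)
    return value.lower()


print(normalize("R&D #1 + More!"))  # r&d #1 + more
```

With normalization on, values that differ only in case, punctuation, or spacing collapse into one filename, which usually reduces the number of files created.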

If 'dedupe' is selected, then, after normalization, any data value that occurs two or more times in a record is reduced to a single occurrence. This option applies only if the selected repeat option is neither 'First occ only' nor 'Split to file'. For example, if splitting by holdings code, with a repeat option of 'Every occ':

 852   $a MAIN $a NORTH $a SOUTH $a SOUTH

–deduping would remove the second occurrence of 'SOUTH' from the list of candidates to be output.
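In Python terms, this kind of deduping amounts to removing repeats while keeping the first occurrence of each value in place. The function name `dedupe` is hypothetical; this is an illustration rather than the utility's code.

```python
def dedupe(values):
    """Drop repeated values, keeping the first occurrence and the original order."""
    # dict keys are unique and preserve insertion order.
    return list(dict.fromkeys(values))


print(dedupe(["MAIN", "NORTH", "SOUTH", "SOUTH"]))  # ['MAIN', 'NORTH', 'SOUTH']
```

Since the comparison happens after normalization, values such as 'South' and 'SOUTH' would count as duplicates only when 'Normalize' has already folded them to the same string.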

By default, any records that lack the specified Tag/Subf are written to a separate file, named (literally) '_SplitByDataNoHits.mrc'. If for some reason you want to avoid this, select the 'Ignore no-hits' option.

FILENAME PREFIX

This option applies to By Records and By Size splits, but not to the By Data option.

Enter up to seven characters to be used as a filename prefix. The default is the letter 'F'. Whatever you enter here will be used to name the output files. For example, if you enter 'NEW', the resulting output files will be named 'NEW000001', 'NEW000002', and so on.

Notes

The sequence number in each filename is always six digits long (padded with leading zeroes) and does not depend on the number of files generated.

The file extension of a MARC file created by the Split utility will always be '.mrc'.
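Putting the prefix, the six-digit sequence number, and the extension together, the naming scheme can be sketched as follows. The function name `output_name` is hypothetical, and combining the example names with the '.mrc' extension is an inference from the two notes above.

```python
def output_name(prefix: str, sequence: int) -> str:
    """Build a By Records / By Size output filename."""
    if len(prefix) > 7:
        raise ValueError("prefix may be at most seven characters")
    # Six-digit, zero-padded sequence number followed by the fixed extension.
    return f"{prefix}{sequence:06d}.mrc"


print(output_name("NEW", 1))  # NEW000001.mrc
print(output_name("F", 42))   # F000042.mrc
```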

OUTPUT FOLDER

Enter the folder to which you want the split files to be output. The folder-select dialog (activated by pressing the '…' button on the right) will include an option to make a new folder.

Note: for a split By Data type of run, the output is always written to a subfolder of the 'Output folder'. So, if you set the output folder to 'D:\work\', then the results of the split By Data run will actually be written to a folder named 'D:\work\splitResults\'. If that folder already exists, a uniqueness qualifier will be added to it, e.g., 'splitResults(1)', 'splitResults(2)', and so on.

If you type in the folder name manually, be sure to end with a trailing backslash.
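The uniqueness qualifier on the By Data results folder follows a familiar pattern, sketched here. The function name `results_folder_name` is hypothetical, and the set of existing names is passed in so the example does not touch the disk; whether the utility compares names case-insensitively is not stated and is not modelled.

```python
def results_folder_name(existing_names) -> str:
    """Pick 'splitResults', or 'splitResults(n)' if earlier results already exist."""
    base = "splitResults"
    name = base
    n = 0
    # Try (1), (2), ... until a name is free.
    while name in existing_names:
        n += 1
        name = f"{base}({n})"
    return name


print(results_folder_name(set()))                                # splitResults
print(results_folder_name({"splitResults", "splitResults(1)"}))  # splitResults(2)
```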

LAST WORDS

There are a few cases where the Split By Data option might fail. The most prominent of these would be when the specified Tag/Subf results in too many files being created. The current file creation limit is 1,000, and selecting a common MARC field could easily overrun it, even on a MARC file of modest size (with ensuing negative consequences to your system).

In fact, preventing this type of disaster is one of the reasons why Split By Data runs twice: first, in read-only mode, to discover the data and create a proposed list of files; then, after approval, it runs again to actually output the proposed files.
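The read-only first phase can be thought of as a counting pass that builds the proposed file list before anything is written. The sketch below assumes the data values have already been extracted per record; the function names are hypothetical, and treating the limit of 1,000 as inclusive is an assumption.

```python
from collections import Counter

FILE_LIMIT = 1000                 # limit quoted in the text
NO_HITS = "_SplitByDataNoHits"    # literal filename quoted in the text


def propose_files(values_per_record, ignore_no_hits=False):
    """Phase one: count how many records each proposed output file would receive."""
    counts = Counter()
    for values in values_per_record:
        if not values:
            if not ignore_no_hits:
                counts[NO_HITS + ".mrc"] += 1
            continue
        for value in values:
            counts[value + ".mrc"] += 1
    return counts


def within_limit(proposal, limit=FILE_LIMIT):
    """Phase two should only go ahead if the proposal stays within the file limit."""
    return len(proposal) <= limit


proposal = propose_files([["audio"], ["audio"], ["computer"], []])
print(sorted(proposal.items()))
print(within_limit(proposal))
```

Reviewing such a list before approval is also where typos in the data show up, since each misspelt value appears as its own proposed file.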

phelp/helpmarcsplit.txt · Last modified: 2021/12/29 16:21 (external edit)
