MARC SORT

This utility allows you to change the order of the records in a MARC file by sorting them on any MARC tag. You can also use MARC Sort to dedupe a MARC file. MARC Sort first appeared as part of our MARC Global software in 1996.

NOTE: Use the File|Select option in MARC Report to pick the source file you want to sort, then select 'MARC Sort' from the Utilities menu. The name of the source file is always visible in the status bar at the bottom of the MARC Sort options form. To switch to a different source file, double-click on the picture on the left before you start the job.

SORT KEYS

In our description of the MARC Sort options, we mention something called a “sort key”. When MARC Sort processes the source file, it will read each record and try to create a sort key for that record. It does this by first extracting the MARC tag, subfield, etc., that you have defined, and second by applying all of the selected options (Ignore Punctuation, Ignore Case, etc.) to the extracted data. The final result of this process is a string which is written to an index–we call this string the “sort key”. In the current version of MARC Sort, the maximum length of this sort key is 80 characters.

Each MARC record is represented by one (and only one) sort key.

DEFINE SORT KEYS

TAG 1/TAG 2

You must enter the three-digit MARC tag (000-999) that you want to sort the file on. For best results, you should select a non-repeatable field (however, see the section on 'MATCH' below).

The program allows you to create a sort key consisting of up to two tags: TAG 1 and TAG 2.

You can enter '1XX' in the TAG box to sort the file on the 100, 110, 111, etc. Note that '1XX' is safe as there should only be one tag beginning with '1' in a record. However, if you try this with other tags (like 4XX or 8XX), you should be aware of how 'XX' works: the first matching tag will be the one that is used to create the sort key.
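
The 'XX' behavior described above can be pictured with a short Python sketch. This is an illustration only, not the program's own code; the function names and the representation of a record as a list of (tag, data) pairs are assumptions made for the example.

```python
def tag_matches(pattern, tag):
    """Return True if a three-character tag fits a pattern such as '1XX'."""
    if len(pattern) != 3 or len(tag) != 3:
        return False
    # An 'X' in the pattern accepts any character in that position.
    return all(p in ("X", "x") or p == t for p, t in zip(pattern, tag))


def first_matching_tag(pattern, fields):
    """Return the first (tag, data) pair whose tag fits the pattern, or None."""
    for tag, data in fields:
        if tag_matches(pattern, tag):
            return tag, data
    return None


# Hypothetical record with two 4XX fields: only the first one would be used.
fields = [
    ("100", "Smith, John."),
    ("245", "An example title"),
    ("440", "First series"),
    ("490", "Second series"),
]
print(first_matching_tag("1XX", fields))
print(first_matching_tag("4XX", fields))
```

In this sketch, '4XX' selects the 440 and never reaches the 490, which is why a pattern like '1XX' is the safer choice.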

TAG 1 is a required field–you cannot leave it blank. TAG 2, however, is optional. You should define TAG 2 when TAG 1 does not represent a unique value. For example, use TAG 2 if you are running the program on a file of duplicates, and want to order the records so that the most recent record is last in each group of dupes; set up the match key in TAG 1 (e.g., 001, 010, 020, etc.), and then use a date key in TAG 2 (005, 008, etc.).

SUBF/POSITION

Use the SUBF option to limit the TAG to a specific subfield. If this option is blank (the default), all of the data in the tag will be used to create the sort key.

You can enter a string of subfields, such as 'anp', and the program will extract only these subfields from the tag. The subfields are added to the list in the order that they occur in the MARC record, not in the order that they are entered here.

Hint: To search for duplicate title fields (245), enter 'anpb' in the SUBF box.
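
The point about subfield order can be shown with a small Python sketch. It assumes, purely for illustration, that a field is held as a list of (code, value) pairs in record order; the function name is hypothetical.

```python
def extract_subfields(subfields, wanted):
    """Keep only the wanted subfield codes, preserving record order."""
    return [(code, value) for code, value in subfields if code in wanted]


# Hypothetical 245 field, as (code, value) pairs in the order they occur.
field_245 = [
    ("a", "Works."),
    ("n", "Part 1,"),
    ("p", "Letters :"),
    ("b", "a selection /"),
    ("c", "edited by an editor."),
]

# The codes are entered as 'pna', but the result follows the record: a, n, p.
selected = extract_subfields(field_245, "pna")
print(selected)
```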

For Fixed Fields (like the leader and 008), enter the (zero-based) position of the element you want to sort on. For example, to sort on the Record Type, enter '000' in the TAG box, and '06' in the SUBF box; or, to sort on Date 1, enter '008' in the TAG box and '07' in the SUBF box. The program will look up any Fixed Field element in a table and automatically determine the correct length to use when creating the sort key.

SKIP

This option can be used to skip the first x characters of a tag before creating the sort key–think of it as a non-filing indicator. It is set to 0 by default. An example of how this option can be useful is given below in the section on NUMERIC KEYS.

NOTE: This option is applied at the end of the sort key creation process, just before the data is written to the index.

MATCH

This option will pattern match data within the tag. You should only need to do this if the tag is repeatable. For example, if you want to sort your database on the 035 $a, enter your system prefix here and the program will create the sort key from your control number, and not from, say, an OCLC number. The default is to match any pattern specified. If you want to NOT match your pattern, prefix the pattern with '!' (an exclamation mark), as in '!OCoLC'. This example would match any tag that does not contain the string 'OCoLC'.

An extension to the 'MATCH' option, added in version 2.39, allows you to truncate match keys for variable fields to a specified number of bytes. The syntax for this is to enter the phrase

Len=

in the 'MATCH' box, followed by a number ('Len' is not case-sensitive), where the number will be the maximum number of bytes used in a sort key. For example, you might use

Len=20

when sorting a file on Tag 245 to find all records where the first twenty bytes of each title are the same. This truncation is performed after all other normalization.

NOTE: The program will use the first tag it finds that matches the TAG/SUBF/SKIP/MATCH options for the sort key, even if multiple tags match.
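
The following Python sketch illustrates how a MATCH entry might be interpreted. It is a guess at the logic, not the program's implementation: the function names are hypothetical, matching is assumed to be a plain case-sensitive substring test, and truncation is assumed to count UTF-8 bytes.

```python
def parse_match(option):
    """Split a MATCH entry into (pattern, negate, max_len)."""
    text = option.strip()
    if text.lower().startswith("len="):      # 'Len' is not case-sensitive
        return None, False, int(text[4:])
    if text.startswith("!"):                 # '!' means "does NOT contain"
        return text[1:], True, None
    return (text or None), False, None


def first_match(values, option):
    """Return the first field value that satisfies the MATCH entry, or None."""
    pattern, negate, _ = parse_match(option)
    for data in values:
        found = pattern is None or pattern in data
        if found != negate:
            return data                      # first match wins; later ones are ignored
    return None


def truncate_key(key, max_len):
    """Cut a finished sort key to at most max_len bytes (assumed UTF-8)."""
    return key.encode("utf-8")[:max_len].decode("utf-8", errors="ignore")


# Hypothetical repeated 035 $a values.
values = ["(OCoLC)12345", "(QPQ)67890"]
print(first_match(values, "QPQ"))
print(first_match(values, "!OCoLC"))
print(truncate_key("AN EXAMPLE TITLE OF SOME LENGTH", parse_match("len=20")[2]))
```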

KEY IS …

Use these options to tell the program what kind of data is being sorted; only one option from this checklist can be selected. Some of these options use built-in rules for creating the sort key and may ignore other options on this form.

ALPHANUMERIC–This is the default. It essentially means that the data you want to sort on is not one of the following types.

NUMERIC–This option indicates that the sort key will consist solely of numbers (like a control number that does not have a prefix). If you select this option, the program will normalize the numbers so that they sort numerically, instead of alphabetically. Here is a simple illustration of an alphabetic sort vs. a numeric sort:

Alphabetic sort    Numeric sort
8                  8
895                92
92                 895

You can create a numeric key from an alphanumeric key using the SKIP option. For example, if your control number always contains a prefix, like '(QPQ)12345', you could set SKIP to '5', then use the NUMERIC KEY option to force a normalized numeric sort to take place on the remaining digits ('12345').
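
One common way to make numbers sort numerically as strings is to pad them with leading zeros. The Python sketch below assumes that approach; the program's actual normalization is not documented here, and the function name and the padding width of 12 are hypothetical. For simplicity the sketch applies the skip to the raw data.

```python
def numeric_key(data, skip=0, width=12):
    """Skip a prefix, then zero-pad the digits so string order equals numeric order."""
    digits = data[skip:].strip()
    if not digits.isdigit():
        return None          # no usable numeric key in this sketch
    return digits.zfill(width)


numbers = ["8", "895", "92"]
print(sorted(numbers))                    # alphabetic order: 8, 895, 92
print(sorted(numbers, key=numeric_key))   # numeric order: 8, 92, 895

# A prefixed control number: skip the five characters of '(QPQ)'.
print(numeric_key("(QPQ)12345", skip=5))
```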

LCCN/ISBN/ISSN–This option will use a special set of built-in rules to normalize the number being used as a sort key; this is useful if you have a file where not every LCCN or ISBN has been entered correctly.

NOTE: If you select this option, many other sort options will be ignored (e.g., leading and trailing spaces will not be ignored for LCCNs, punctuation will not be ignored for ISSNs, etc.). Also, this option normalizes only the sort key; it does not normalize the underlying MARC data.

LCC CALL NUMBER–This option indicates that the sort key will be a Library of Congress Classification call number; the program will try to normalize the call number using built-in rules. NOTE: If Tag1 or Tag2 contains '050', this option will automatically be applied.

DDC CALL NUMBER–This option indicates that the sort key will be a Dewey Decimal Classification call number; the program will try to normalize the call number using built-in rules. NOTE: If Tag1 or Tag2 contains '082', this option will automatically be applied.

SUDOCS CALL NUMBER–This option indicates that the sort key will be a Superintendent of Documents Classification System number; the program will try to normalize the call number using built-in rules. NOTE: If Tag1 or Tag2 contains '086', this option will automatically be applied.

DATE–This option currently is useful only for a field like the 260 $c. This option will force the program to extract a four-digit year from the subfield and ignore everything else (of course, a better approach to sorting by date might be to use the 008/Date 1).

NOTE: You do not need to use this option when selecting a Fixed Field date as a sort key (since the length of the sort key will always be set to a fixed length).
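
As a rough picture of what the DATE option does with a 260 $c, the Python sketch below pulls the first run of four digits out of a string. The program's real rules may differ (for example, in how it treats ranges or incomplete dates); the function name is hypothetical.

```python
import re


def extract_year(subfield_c):
    """Return the first four-digit run found in the text, or None."""
    match = re.search(r"\d{4}", subfield_c)
    return match.group(0) if match else None


# Hypothetical 260 $c values.
for value in ["c1998.", "[2003?]", "1985, c1984.", "n.d."]:
    print(value, "->", extract_year(value))
```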

SORT OPTIONS

MARC Sort includes a large number of options, the goal being to make it possible for you to sort your file exactly as you would like it. In most cases, the default settings here will be just what you need. (In exceptional cases, however, even these options will not be enough; you may need to run a preliminary pass with MARC Review or MARC Global to 'set up' a file first in order for it to be sorted exactly the way you want.)

The sort options are generally applied after the data for the key(s) specified above is extracted; therefore, they modify the data that will be used as a sort key. These options will be treated in the order that the program applies them to the sort key (as opposed to the order in which they are presented on the options form).

IGNORE INDICATORS

If checked (the default), both indicators will be removed from all sort keys. This option applies only to variable fields.

IGNORE SUBFIELDS

If checked (the default), all subfields (the MARC delimiter + the subsequent subfield code) are deleted during sort key creation. This option applies only to variable fields.

IGNORE INITIAL ARTICLES

If checked, initial articles are deleted from sort keys that begin with one. You should only apply this option when sorting by title fields. The default is unchecked.

IGNORE PUNCTUATION

If checked (the default), punctuation characters are replaced with a blank space during sort key creation. Once this task has been completed, multiple blanks are reduced to a single blank. Punctuation is defined as one of the following characters:

'  [  ]  (  )  .  ,  :  ;  -  !  ?  /  "
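
The two steps described above (replace each punctuation character with a blank, then reduce runs of blanks to one) can be sketched in Python as follows. The character list is copied from this page; the function name is hypothetical and the code is only an illustration of the rule.

```python
import re

# The punctuation characters listed above.
PUNCTUATION = "'[]().,:;-!?/\""
_TO_BLANK = str.maketrans({ch: " " for ch in PUNCTUATION})


def ignore_punctuation(text):
    """Replace punctuation with blanks, then collapse multiple blanks."""
    blanked = text.translate(_TO_BLANK)
    return re.sub(r" {2,}", " ", blanked)


print(repr(ignore_punctuation("Hello, world: a (test)!")))
```

Note that a trailing blank can remain after this step; in the program that would be handled by the separate IGNORE TRAILING BLANKS option.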

IGNORE DIACRITICS

If checked (the default), diacritic characters are replaced with a blank space during sort key creation. Exercise care here when sorting a file that contains both MARC-8 and UTF-8 records, as each coding scheme has different rules for positioning diacritics.

IGNORE TRAILING BLANKS

If checked (the default), consecutive blank spaces at the end of a sort key are deleted.

IGNORE LEADING BLANKS

If checked, consecutive blank spaces at the beginning of a sort key are deleted. The default is unchecked.

IGNORE CASE

If checked (the default), all characters (except for MARC delimiters) are shifted to uppercase during sort key creation.

SORT ON DATA

This option should be used with care as it contravenes accepted cataloging practice.

If selected, 'Sort on Data' will sort all variable tags in the record in numerical order, and sort all matching tags in alphanumeric order (starting with indicator 1, unless the 'Ignore Indicators' option is selected, in which case the sort starts with the first subfield).

This option might be useful to produce reports on records that have large numbers of repeated tags (like holdings data, or 650s, 700s, etc), since it will put the repeated tags in alphabetical order, which may make the records more readable.

However, please be aware that Note fields (5XX) are (intentionally) not entered by catalogers in numeric order, and the first subject heading in an LC record is (intentionally) meant to be first. This option will have a negative effect on such cataloging practice, and other local practice not mentioned here, so please use it only to produce a file that will be used to generate a report.

TREAT ALL-BLANK KEYS AS NULL

If checked, and a sort key evaluates to all-blanks, it will be treated as a null key (i.e., it will be treated as if the tag were not present in the record). There are only a few situations when this option should be selected (for example, in certain Fixed Field elements, where a blank space has meaning), so by default, it is not selected.

ASCENDING SORT ORDER

If checked (the default), the file will be sorted in ascending order (from lowest to highest); if unchecked, the file will be sorted in descending order (from highest to lowest).

APPEND NULL KEYS

If checked (the default), records without sort keys will be added to the end of the results. If not checked, records without sort keys will be added to the beginning of the results.
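
The interaction between ASCENDING SORT ORDER and APPEND NULL KEYS can be pictured with the Python sketch below. It assumes each record is reduced to a (sort key, record id) pair, with None standing in for a missing key; the names are hypothetical.

```python
def order_records(keyed, ascending=True, append_nulls=True):
    """Sort records that have keys; place key-less records at the end or the start."""
    with_keys = [item for item in keyed if item[0] is not None]
    nulls = [item for item in keyed if item[0] is None]
    with_keys.sort(key=lambda item: item[0], reverse=not ascending)
    return with_keys + nulls if append_nulls else nulls + with_keys


# Hypothetical (sort key, record id) pairs; r2 has no sort key.
keyed = [("B", "r1"), (None, "r2"), ("A", "r3")]
print(order_records(keyed))
print(order_records(keyed, ascending=False, append_nulls=False))
```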

There is not a way, in the current version, to exclude from the results records that are determined not to have a sort key; you can easily accomplish this goal, however, by running MARC Review (and excluding these records) before you run MARC Sort.

BE NICE TO (MY) CPU

If checked (the default), the program will pause a millisecond here and there to try to make your system more responsive to other actions. This option will not have a noticeable impact when sorting smaller files. However, milliseconds do add up, and if you are sorting a very large file, you may want to uncheck this box (and find something else to do).

OUTPUT OPTIONS

MARC Sort contains several options that make it possible to dedupe a file of MARC records.

NOTE: By 'duplicate' records, we do not mean that two records are exactly the same; instead, we here use the term 'duplicate' to refer to two records that have sort keys that have evaluated to the same string after the sort options have been applied. For example, if you set TAG to 001, then all records that have the same 001 will be considered duplicates; if you set TAG to 008 and SUBF to 07, then all records that have the same Date 1 will be considered duplicates.

TYPE OF OUTPUT

SORT ONLY–The records will be sorted in order of the sort keys defined above; no records will be removed from the file, only the order of the records will be changed.

COUNT DUPES ONLY–This option runs the sort job only as far as is needed to report a count of the number of duplicates in the file.

By default, the program counts duplicates by the record. For example, if two sort keys are the same, this counts as '2' duplicate records and not '1'; if three sort keys are the same, this counts as '3' duplicates, and not '1'; etc.

The program keeps a second count, which it refers to as 'matches', which tracks the actual number of groups, or pairs of records, that have the same sort key. This number is reported in parentheses immediately after the duplicate record count described above.
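
The difference between the two counts can be shown with a few lines of Python. This is only an illustration of the arithmetic described above, using a hypothetical list of sort keys.

```python
from collections import Counter


def count_dupes(keys):
    """Return (duplicate records, matches) for a list of sort keys."""
    counts = Counter(keys)
    groups = [n for n in counts.values() if n > 1]
    # Every record in a group counts as a duplicate; each group is one match.
    return sum(groups), len(groups)


# Two records share key 'A' and three share key 'C'.
keys = ["A", "A", "B", "C", "C", "C"]
print(count_dupes(keys))   # 5 duplicate records in 2 groups
```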

PULL ALL DUPES–If selected, all records determined to be duplicates will be written to the 'Results' file. All other (non-duplicate) records will be skipped–they will not be output.

PULL ALL DUPES/SPLIT–If selected, all records determined to be duplicates will be written to the 'Results' file. All non-duplicate records will be output to the 'Split' file.

KEEP FIRST DUPE–If selected, only the first record in any group of records determined to be duplicates will be written to the 'Results' file. All other records will be skipped–they will not be output.

KEEP FIRST DUPE/SPLIT–If selected, the first duplicate record in any group of records determined to be duplicates will be written to the 'Results' file. All other records, including dupes (ie. dupes that were not the first occurring dupe) will be output to the 'Split' file.

KEEP LAST DUPE–If selected, only the last record in any group of records determined to be duplicates will be written to the 'Results' file. All other records will be skipped–they will not be output.

KEEP LAST DUPE/SPLIT–If selected, only the last duplicate record in any group of records determined to be duplicates will be written to the 'Results' file. All other records, including dupes (ie. dupes that were not the last occurring dupe) will be output to the 'Split' file.

DEDUPE/KEEP FIRST DUPE–If selected, the program will make two passes on the source file: the first will split the file into dupes and nondupes, and the second will then run through the dupes and select the first record in each group of duplicate records. Finally, the program will concatenate the results of the second pass with the nondupes from the first pass, thus creating a deduped file.

DEDUPE/KEEP LAST DUPE–If selected, the program will make two passes on the source file: the first will split the file into dupes and nondupes, and the second will then run through the dupes and select the last record in each group of duplicate records. Finally, the program will concatenate the results of the second pass with the nondupes from the first pass, thus creating a deduped file.

NOTES: In the related options above, 'first' and 'last' are determined by the ordering of the sort keys. If only a primary sort key is defined, then first and last will refer to the record's sequential position in the file. For this reason, if you are using one of the two dedupe options, you should always try to specify a secondary sort key (such as a date field like the 005, or 008/00) that will order all duplicate records according to their creation date (or a similar criterion).
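
To see why a secondary key matters, consider the Python sketch below. It works on records that are already sorted by (primary key, secondary key) and keeps the first or last member of each primary-key group. It is a single-pass simplification, not the two-pass process the program uses, and the names and sample data are hypothetical.

```python
from itertools import groupby


def dedupe(sorted_records, keep="first"):
    """Keep one record per primary key from a list sorted by (primary, secondary)."""
    kept = []
    for _, group in groupby(sorted_records, key=lambda rec: rec[0]):
        members = list(group)
        kept.append(members[0] if keep == "first" else members[-1])
    return kept


# Hypothetical (001 value, 005 date, record id) triples.
records = [
    ("ocm001", "20030505", "r2"),
    ("ocm002", "19990909", "r3"),
    ("ocm001", "20010101", "r1"),
]
ordered = sorted(records)                 # primary key first, then date
print(dedupe(ordered, keep="last"))       # most recent record of each group
print(dedupe(ordered, keep="first"))      # oldest record of each group
```

Without the date in the second position, the order within the 'ocm001' group would simply follow the order of the records in the file.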

RESULTS FILE

Click this button to open a Windows dialog that you can use to navigate to the directory on your system where you want to save the results of MARC Sort; you will then have to enter a filename. You can also just enter a filename in the edit box to the right.

SPLIT FILE

A 'Split' file is only created when deduping, and only if one of the three split options described above is selected. Click this button to open a Windows dialog that you can use to navigate to the directory on your system where you want to save the SPLIT results; you will then have to enter a filename. You can also just enter a filename in the edit box to the right.

NOTE: whenever you enter a filename manually, you should enter the full path of the filename (eg. 'D:\MARC\mysortedfile.mrc' and not 'mysortedfile.mrc'). Otherwise you may have trouble finding the file when the run is complete.

ADVANCED USERS–HOW TO CHANGE THE WORK PATH

If you are sorting a large MARC file, the location of the workpath will have a critical impact on the performance of the program. By default, MARC Report sets the workpath based on the program options in the Files and Directories section. For example, if you have set the option 'Output results to source', then the working folder will be set to the same folder as the source file. This may not always result in a good choice.

If you want more control over the working folder location, right-click on the status bar and select a different folder before you press the 'RUN' button.

If you have more than one drive on your system, set the workpath to a folder on a drive that is not your system drive (the system drive is typically where Windows is installed). Do not set the workpath to a network drive; do not set the workpath to an external drive (like a USB drive)–either choice will negatively impact the program (and perhaps other users in the former case).

MARC Sort was designed to not require any additional memory to run effectively. However, it does require extra disk space; a good rule of thumb would be to multiply the size of your file by two–you should have that much free disk space before beginning a sort job. Since most systems these days will have many GBs of free space this may never be a problem for you. All the same, if, for example, you run several sorts on a very large file and are not aware of this, your system may grind to a halt when the Windows swap file starts taking over–you have been warned.
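
If you want to check the rule of thumb before starting a large job, a few lines of Python will do it. This sketch is not part of MARC Sort; the function name is hypothetical and the paths in the usage comment are examples only.

```python
import os
import shutil


def enough_space(source_file, work_dir, factor=2):
    """Return True if work_dir has at least factor x the source file size free."""
    needed = os.path.getsize(source_file) * factor
    free = shutil.disk_usage(work_dir).free
    return free >= needed


# Example usage (hypothetical paths):
#   enough_space(r"D:\MARC\bigfile.mrc", r"D:\MARC")
```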

WARNING ABOUT RAID DRIVES

We do not recommend running this utility on a file of any size with a system based on RAID. It has been our (unfortunate) experience that this scenario doesn't work well (and often doesn't work at all) for this utility–RAID just does not handle the fast and furious merge file creation that is required by our MARC Sort.

If you have a RAID system, and have access to a non-RAID drive, see the note above on how to change the working folder.

PROGRAM OPERATION

After you have entered your options, press the RUN button to begin the sort.

TECHNICAL NOTES

The basic steps that MARC Sort follows are:

1. Creates an (unsorted) index for your file using the options you defined.
2. Organizes the index into smaller, sorted, merge files.
3. Merges the files created in step #2 into one sorted index.
4. Optionally, dedupes the index (if applicable).
5. Finalizes the sorted index (essentially, quickly rewrites the index in order to optimize the last step).
6. Writes out a new MARC file in the order of the sorted index.
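
Steps 2 and 3 describe what is generally known as an external merge sort. The Python sketch below shows the idea in miniature, with in-memory lists standing in for the merge files on disk. It is not the program's code; the function names, the run size, and the (sort key, record position) index format are assumptions.

```python
import heapq


def sorted_runs(index, run_size):
    """Step 2: cut the unsorted index into small chunks and sort each one."""
    for start in range(0, len(index), run_size):
        yield sorted(index[start:start + run_size])


def merge_sort_index(index, run_size=500):
    """Step 3: merge the sorted chunks into one sorted index."""
    runs = list(sorted_runs(index, run_size))
    return list(heapq.merge(*runs))


# Hypothetical index entries: (sort key, position of the record in the file).
index = [("SMITH", 0), ("ADAMS", 1), ("JONES", 2), ("BROWN", 3), ("ADAMS", 4)]
for key, position in merge_sort_index(index, run_size=2):
    print(key, position)
```

Because only one small chunk has to be sorted at a time, this approach needs little memory but a good deal of temporary disk space, which fits the advice given above about the work path.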

VIEW LOG

MARC Sort logs each step that it uses when it sorts your file; here is a standard log entry:

MARC Sort started at: 4/23/2003 5:06:52 PM
Building index for MARC file e:\marc\dupes001.abc.mrc
Indexing processed 1954 MARC records in e:\marc\dupes001.abc.mrc
Organizing the index into sorted workfiles …
1954 index entries were processed from e:\marc\_1_INDEX.TMP
4 sorted workfiles were created in e:\marc\
Merging the sorted workfiles …
4 sorted workfiles were merged to e:\marc\_3_MERGD.TMP
Preparing sorted index for MARC run …
1954 index entries were finalized to e:\marc\_6_FIN_1.TMP
Writing sorted Marc file …
1954 records output to e:\marc\mysorted.mrc
MARC Sort finished at: 4/23/2003 5:07:09 PM

Most of the steps in the log can be seen in the status bar while the program is running; however, if the file is small, the sort will happen very quickly, so it may be a good thing to have a log to fall back on. You can see this log at the end of the job by pressing the 'View Log' button. (Note that this log is cumulative; that is, it will not be cleared until you manually clear it.)

DELETE INDEXES

This option is selected by default. This option automatically deletes all of the temporary files created during the Sort run. The only reason to de-select this option would be to troubleshoot a sorting problem.

During the merge process (when the index is being sorted), the program may create a large number of temporary files. Although these files are not very large (less than 100 KB), if, for some reason, the program is interrupted during this phase, they may be left on your disk (“may” since we do try to clean up after ourselves). If this happens, you should try to find these files and delete them. The filename format for these files is '000001.tmp', and they will be located in the MARC Sort work directory (or, if that is undefined, in the directory of the source file).
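
If you need to hunt for leftover files after an interrupted run, a short Python script can list the candidates. This sketch assumes the names are always six digits followed by '.tmp', based on the single example given above; it only lists files and deletes nothing. The function name is hypothetical.

```python
import os
import re

# Assumed pattern, inferred from the example name '000001.tmp'.
WORKFILE_NAME = re.compile(r"\d{6}\.tmp", re.IGNORECASE)


def find_leftover_workfiles(work_dir):
    """Return the names in work_dir that look like leftover merge files."""
    return sorted(
        name for name in os.listdir(work_dir)
        if WORKFILE_NAME.fullmatch(name)
        and os.path.isfile(os.path.join(work_dir, name))
    )


# Example usage (hypothetical path):
#   for name in find_leftover_workfiles(r"D:\MARC"):
#       print(name)
```

Review the list before deleting anything, since other programs may also use numeric '.tmp' names.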