
Identifying same record dupes in MARC Global

It is now possible (with version 234) to use MARC Global to identify and delete duplicate fields in the same record.

Note: If you want to identify records in a file that have a duplicate title, match key (ISBN, LCCN, OCLC), or any other MARC Field, use the SORT Utility instead.

To run this type of task, start MARC Global, press Skip, select the 'Identify duplicate data' task, then press Next. The following options form should appear:

At the top of this form, enter the tag that you want to check for duplicate data. You can specify a single tag, or an 'X' tag here (for example, 65X, 6XX, XXX). If you enter 'XXX', the program will look for data that repeats anywhere within the record (see notes below for more details). For example, if your library code is present in an 003 field, an 049 field, and again in a 9XX field, then these fields will all be displayed as dupes in the report.
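As an illustration of how an 'X' tag selects fields, here is a minimal Python sketch. It assumes that each 'X' in the tag acts as a single-character wildcard; the function name `tag_matches` is hypothetical and is not part of MARC Global.

```python
def tag_matches(pattern: str, tag: str) -> bool:
    """Return True if a 3-character tag matches a pattern in which 'X' is a wildcard.

    Assumption: 'X' stands for any single character in that position.
    """
    if len(pattern) != 3 or len(tag) != 3:
        return False
    return all(p in ("X", "x") or p == t for p, t in zip(pattern, tag))


tags_in_record = ["003", "049", "650", "655", "910"]

# '65X' selects only the 650 and 655 fields.
print([t for t in tags_in_record if tag_matches("65X", t)])

# 'XXX' selects every field in the record.
print([t for t in tags_in_record if tag_matches("XXX", t)])
```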

Normalization

The following data normalizations can be applied to each field before it is compared.

Ignore blanks: remove leading and trailing blanks, and compress two blanks to a single blank.
Ignore case: all data (except for MARC subfield codes) will be shifted to uppercase.
Ignore indicators: if whole tags are being compared, the indicators will be removed.
Ignore punctuation: all punctuation marks will be converted to blank spaces, and then multiple blanks will be compressed to a single blank.
Ignore subfield codes: all MARC subfield delimiters (x1F) and the byte that immediately follows them (hopefully a subfield code) will be replaced with a single blank space.
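The normalizations above can be sketched in Python as follows. This is only an approximation under stated assumptions: a field is modeled as a string of two indicator characters followed by the subfield data, the order in which the options are applied is a guess, and runs of two or more blanks are compressed to one. The names `normalize` and `uppercase_except_codes` are hypothetical.

```python
import re
import string

SUBFIELD_DELIMITER = "\x1f"  # MARC subfield delimiter byte

# Translation table that maps every ASCII punctuation mark to a blank.
PUNCTUATION_TO_BLANK = str.maketrans(string.punctuation, " " * len(string.punctuation))


def uppercase_except_codes(data: str) -> str:
    """Uppercase everything except the character that follows a subfield delimiter."""
    out = []
    previous = ""
    for ch in data:
        out.append(ch if previous == SUBFIELD_DELIMITER else ch.upper())
        previous = ch
    return "".join(out)


def normalize(data, *, blanks=False, case=False, indicators=False,
              punctuation=False, subfield_codes=False):
    """Apply the selected normalizations (the order used here is an assumption)."""
    if indicators:
        data = data[2:]  # drop the two indicator characters
    if subfield_codes:
        # Replace the delimiter and the byte after it with a single blank.
        data = re.sub(SUBFIELD_DELIMITER + ".", " ", data, flags=re.DOTALL)
    if case:
        data = uppercase_except_codes(data)
    if punctuation:
        data = data.translate(PUNCTUATION_TO_BLANK)
        data = re.sub(" {2,}", " ", data)
    if blanks:
        # A literal space is used on purpose: Python treats \x1f as whitespace.
        data = re.sub(" {2,}", " ", data).strip(" ")
    return data


first = " 0\x1faCompact discs."
second = " 7\x1faCompact  Discs"
all_on = dict(blanks=True, case=True, indicators=True,
              punctuation=True, subfield_codes=True)

# With every option selected, the two fields compare as equal.
print(normalize(first, **all_on) == normalize(second, **all_on))
```

With no options selected the two example fields differ in indicators, case, spacing, and punctuation, so they would not be reported as duplicates.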

Additional options

Two additional options are available for this type of job.

Find duplicate data only in matching tags

This option is applicable only when the TAG includes an 'X'. If this option is not selected, which is the default, all duplicating data within the record is identified. If selected, then the search will discard all duplicate data fields that do not have the same tag number (for example, a 650 that is repeated as a 655 will not be reported as a duplicate because the tags are different).
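The effect of this option can be illustrated with a short Python sketch. It assumes the field data has already been normalized and that a record is modeled as a list of (tag, data) pairs; `find_duplicates` and its `same_tag_only` parameter are hypothetical names, not the program's own.

```python
from collections import defaultdict


def find_duplicates(fields, same_tag_only=False):
    """Group the positions of fields whose data repeats within one record.

    fields: list of (tag, data) pairs in record order (data already normalized).
    same_tag_only: when True, data counts as a duplicate only if the tag matches too.
    """
    groups = defaultdict(list)
    for index, (tag, data) in enumerate(fields):
        key = (tag, data) if same_tag_only else data
        groups[key].append(index)
    # Only groups with more than one member are duplicates.
    return [positions for positions in groups.values() if len(positions) > 1]


record = [
    ("650", "Compact discs"),
    ("655", "Compact discs"),
    ("650", "Jazz"),
    ("650", "Jazz"),
]

# Default: the 650/655 pair and the two 650 'Jazz' fields are both reported.
print(find_duplicates(record))

# Matching tags only: the 650/655 pair is discarded.
print(find_duplicates(record, same_tag_only=True))
```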

Delete duplicate data

This option will remove any duplicate data identified by the preceding options. There are some restrictions on the use of this option.

First, only duplicate data in the same tag will be removed. If the TAG entered on this form includes an 'X', then the 'Find duplicate data only in matching tags' option must also be selected; if it is not, an error message will appear when filling out the form.

Second, there is no way to specify which duplicate field to delete. At present, the program always keeps the first occurrence in the record and deletes the second. If a field repeats three or more times, the second, third, and all later copies are deleted.
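The deletion rule described above (keep the first occurrence, delete every later copy in the same tag) could be sketched like this in Python. The record model, a list of already-normalized (tag, data) pairs, and the function name `delete_duplicates` are assumptions made for illustration.

```python
def delete_duplicates(fields):
    """Keep the first occurrence of each (tag, data) pair and drop later copies.

    Returns a tuple (kept, deleted), both in record order.
    """
    seen = set()
    kept, deleted = [], []
    for tag, data in fields:
        key = (tag, data)
        if key in seen:
            deleted.append((tag, data))  # second, third, ... copies
        else:
            seen.add(key)
            kept.append((tag, data))  # first occurrence always survives
    return kept, deleted


record = [
    ("650", "Jazz"),
    ("650", "Jazz"),
    ("655", "Jazz"),  # same data, different tag: not deleted
    ("650", "Jazz"),
]

kept, deleted = delete_duplicates(record)
print(kept)
print(deleted)
```

Because the key includes the tag, the 655 field in this example is kept even though its data matches the 650 fields.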

Output options

Apart from MARC Output, two types of Text Output are available for this task: 'Custom', and 'Full record'.

A peculiarity of this task is that text output is written to the MARC Global log (specified on the 'Global Change' tab, and set to 'mglog.txt' by default) instead of the MARC Review report ('mreview.txt' by default). This is confusing, and we'll try to fix it in a coming version.

For best results, we recommend that you set the Output Format to 'Custom record', and enter a tag (such as your system control number) in the 'Tags to Output' box¹⁾:

If you are using the 'Delete duplicate data' option, you can get a very nice report by using the recommended setting:

Note the occurrence numbers after the tags. The '[DELETED]' flag indicates which fields were deleted²⁾. If the 'Delete duplicate data' option was not selected, this report would look the same except that there would be no '[DELETED]' flags.

NOTES

Please take care when using the 'XXX' tag and the 'Delete duplicate data' option. There is no way (as yet) to exclude fields from this processing. It's possible that certain local fields, such as those used in item processing, are intentionally repeated. In this case, it would be best to run the program in review mode first, then look at the resulting report and design a strategy for cleaning up the duplicate data in a more restricted manner (for example, separate runs using 6XX and 7XX, instead of one run using XXX).

This type of task can be saved to the saved reviews file.

Some of the normalization options have dependencies. For example, if a subfield is specified for the Tag being searched, then indicators are always ignored. Likewise, if a subfield is specified for the Tag being searched, the 'Ignore subfield codes' option is effectively itself ignored by the program.
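These dependencies can be summarized in a small Python sketch. The option names and the `subfield_specified` flag are hypothetical; the sketch only models the two rules stated above and is not the program's actual logic.

```python
def effective_options(options: dict, subfield_specified: bool) -> dict:
    """Return the options that actually take effect for a given search.

    Assumed rules, taken from the note above:
    - a subfield on the tag forces indicators to be ignored;
    - a subfield on the tag makes 'Ignore subfield codes' have no effect.
    """
    effective = dict(options)  # copy so the caller's settings are untouched
    if subfield_specified:
        effective["ignore_indicators"] = True
        effective["ignore_subfield_codes"] = False
    return effective


chosen = {"ignore_indicators": False, "ignore_subfield_codes": True}

print(effective_options(chosen, subfield_specified=True))
print(effective_options(chosen, subfield_specified=False))
```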

An 'XXX' search will usually turn up some unexpected results (it's enlightening to see how much data is repeated in MARC!). For example:

Control number fields are often repeated (001 and 010; 035 and 9XX)
Call number fields are often repeated (050 and 090; 082 and 092)
Authors in 100 often appear in 600 and 700 fields
Title fields (22X-24X) often duplicate after normalization is applied
Edition (250) statements are sometimes repeated in XXX notes
Note (500) fields are often repeated as 650 fields (e.g. 'Compact discs')
Subjects (6XX) often repeat verbatim with only a change in indicators
Untraced series (490) headings are almost always repeated in 830
Title (24X) and Series (4XX) headings are often repeated in 7XX fields

For this reason, we added the option that restricts deletion to duplicates in the same tag.

¹⁾

'Full record' output can also be used here, but that might make it difficult to quickly see which tags are being duplicated

²⁾

The 'nice report' in the example above is currently available only when using 'XXX' as the tag. If a specific tag is entered, this report format will not be produced; we hope to fix this in the next version.