MARC VERIFY

MARC Verify provides a quick way to remedy common errors in MARC files, along with a few utilities for users interested in batch processing. Under normal circumstances, you will never need to use this utility.

WHEN TO USE MARC VERIFY

Use this utility on your file if you get a 'MARC error' when running MARC Report.

WHAT IT DOES

The main result of running MARC Verify will be a clean USMARC file that can be read by any software that reads USMARC data. This file will be written to your Batch Report directory and named 'verified.mrc'. If errors are reported, then you should use this 'clean' file for any further MARC processing; otherwise, it is safe to continue to use the input file. At the end of the verify run, a pop-up message will report the results, and any errors that may have been encountered.

HOW IT WORKS

MARC Verify reads each byte of the input file and writes it to an output file. During this process, it attempts to remove any non-MARC formatting from the input file.

If the incoming byte is a null character (x00) or DOS EOF character (x1A), it is converted to a blank space (x20) before being written to the output file.

If the program encounters a MARC record terminator in a position that does not agree with where the leader says the terminator should be, and the leader appears to be correct, the 'errant' terminator is also converted to a blank space.

Any bytes that are found in between records (i.e., bytes that occur after the MARC record terminator (x1D) and before the next leader), or bytes that occur at the end of the file following the last MARC record, are not written to the output file.

Also, while reading the file, if a record is found that cannot be processed (i.e., the record was truncated, or the MARC structure of the record is in error), the record, or part thereof, is written to a textfile instead of the output file. We refer to this as a 'MARC Error'. (Older versions of the program called this a 'structural error'.)
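The byte-level pass described in this section can be sketched in Python as follows. This is only an illustration under simplifying assumptions, not MARC Verify's actual code: the function names are hypothetical, the leader's record length is trusted whenever it is five digits, the handling of 'errant' terminators is omitted, and processing simply stops at the first unreadable record.

```python
RECORD_TERMINATOR = 0x1D

def clean_bytes(chunk: bytes) -> bytes:
    """Convert null (x00) and DOS EOF (x1A) bytes to a blank space (x20)."""
    return chunk.translate(bytes.maketrans(b"\x00\x1a", b"  "))

def split_records(data: bytes):
    """Return (good_records, rejected_chunks) from a raw MARC byte string.

    Bytes between a record terminator and the next plausible leader are
    skipped, as are trailing bytes after the last record.
    """
    good, rejected = [], []
    pos = 0
    while pos + 5 <= len(data):
        length_field = data[pos:pos + 5]
        if not length_field.isdigit():
            pos += 1  # junk between records: not written anywhere
            continue
        length = int(length_field)
        record = data[pos:pos + length]
        if length < 40 or len(record) < length or record[-1] != RECORD_TERMINATOR:
            # Truncated or structurally wrong: set aside instead of output.
            rejected.append(data[pos:])
            break
        good.append(clean_bytes(record))
        pos += length
    return good, rejected
```

In this sketch a carriage return/line feed pair between two records, or a stray x1A at the end of the file, is dropped, while a null byte inside a record becomes a blank.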

VERIFICATION OPTIONS

Remove records with MARC errors: This option appears grayed out because it is always in effect. As noted above, a MARC error occurs when the MARC structure (leader, directory, data) of the record is incomplete or in error.

Each record with a MARC error is dumped to a separate textfile named MarcErr###.txt, where '###' is the record's sequential number in the file. This file is in ASCII format and may be viewed by any text editor.

Remove records with null characters: Select this option to remove records that contain null characters or multiple MARC record terminators from the file. The records will be written to a file called 'eHasNull'.

Remove record status = 'd': Select this option to remove all records where the leader Record Status value is 'd'. The records will be written to a file called 'eDeleted'.

Remove non-bibliographic records: Select this option to remove all records where the leader Record Type value is one of 'q', 'v', 'x', 'y', 'z'. The records will be written to a file called 'eNonBib'.

Remove, or Count, records with invalid indicators/subfields: Select the 'Remove' option to remove all records that contain invalid data in the indicator positions, or lack a subfield delimiter at the beginning of the data portion of a variable tag. Select 'Count' if you only want to count such records. The purpose of this option is to scan a file for 'scrambled' records: records where the MARC structural elements (leader, directory, and field offsets) are correct, but the presence of garbage characters in key positions indicates a problem with the data (e.g., a variable field was output without any indicators).

Remove, or Count, records without an 001: Select the 'Remove' option to remove all records that do not contain an 001 tag. The records will be written to a file called 'eLack001'. Select the 'Count' option if you only want to count the number of records lacking the 001 tag: they will not be removed from the file.

Remove, or Count, records without a Local Holdings tag: Select the 'Remove' option to remove all records that do not contain the specified local holdings tag. The records will be written to a file called 'eNoHldng'. Select the 'Count' option if you only want to count the number of records lacking the holdings tag: they will not be removed from the file.

Remove, or Count, records without a System Control tag: Select the 'Remove' option to remove all records that do not contain the specified system tag. The records will be written to a file called 'eNoSysId'. Select the 'Count' option if you only want to count the number of records lacking the system tag: they will not be removed from the file. This option will be most helpful when the unique control number used on your system is not the 001.

Remove, or Count, records with Record Length >= x, where x represents a valid MARC record length from 40 to 99999. This option allows you to remove or count records greater than a certain size; removed records will be written to a file called 'eHugeRec'. This option is useful if you are transferring records to a system with size limitations (for example, OCLC does not allow records greater than about 20,000 bytes to be sent via EDX). This option may also be useful if you are planning global fixes on a file, and want to identify records where there is no free space for additional data.

To specify a tag in the Local Holdings tag or System Control tag options above, enter the three-digit MARC tag, (optionally) followed by the subfield code. For example: '010a', '035a', '049a', '852p', and so on.

NOTE: For each of the above options, records are removed from the source file in the order that the options are listed. For example, if you have selected both Remove records without an 001, and Remove records without the Local Holdings tag 852, and a record lacks both the 001 and the 852, it will be written to 'eLack001' and not 'eNoHldng', as the 001 option precedes the Local Holdings tag option.
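The 'first matching option wins' rule can be illustrated with a short Python sketch. The function and parameter names are hypothetical, only four of the Remove options are shown, and all four are assumed to be selected; the record is represented simply by its 24-character leader and the set of tags it contains.

```python
def route_record(leader: str, tags: set, holdings_tag: str = "852") -> str:
    """Return the reject file a record would go to, or '' if it is kept.

    The checks run in the order the options are listed on the form, and
    the first one that matches decides where the record is written.
    """
    checks = [
        ("eDeleted", leader[5] == "d"),          # leader Record Status
        ("eNonBib", leader[6] in "qvxyz"),       # leader Record Type
        ("eLack001", "001" not in tags),
        ("eNoHldng", holdings_tag not in tags),
    ]
    for filename, matched in checks:
        if matched:
            return filename
    return ""
```

A record with neither an 001 nor an 852 is routed to 'eLack001' here, because that check comes first in the list.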

Reset Leader: Vendors often 'borrow' certain bytes in the leader to mark records for their own internal purposes. Typically, positions 22 and 23 are used for this purpose, but we have seen other positions used. The theory behind this borrowing may be that since these bytes are exactly the same in every MARC record, processing software will just ignore them anyway.

By default, MARC Verify does not change any byte in the leader (with the exceptions of a null character, or a DOS EOF character, each of which will automatically be converted to a blank).

However, if you want to clean up 'junky' leaders, select the first 'Reset' option: 'Reset Leader 10-11, 20-23'. This ensures that every record that is output has these bytes set to '22' and '4500', respectively, which are the only valid values defined by MARC21. Positions 10-11, and 20 cannot be edited in MARC Report (simply because they should never need to be!), so we provide a way to globally correct them here.

The second 'Reset' option will change several coded positions in the leader to blank. WARNING: You should only select this option if you know for a fact that the leader contains spurious characters that you want to clean up. The positions reset by the option 'Reset Leader 08-09, 17-19' are:

000/08: Type of control (usually blank)
000/09: Character coding scheme (usually blank, unless Unicode)
000/17: Encoding level (various codes apply; note that blank=Full Level record)
000/18: Descriptive cataloging form (usually 'a' for BIB; blank=non-ISBD)
000/19: Linked record requirement (usually blank)
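As a rough Python sketch of what the two 'Reset Leader' options do to a 24-byte leader (the function name and its parameters are illustrative assumptions, not part of MARC Verify):

```python
def reset_leader(leader: bytes, reset_fixed: bool = True,
                 reset_codes: bool = False) -> bytes:
    """Return a copy of the leader with the selected positions reset."""
    if len(leader) != 24:
        raise ValueError("a MARC leader is exactly 24 bytes")
    out = bytearray(leader)
    if reset_fixed:
        # 'Reset Leader 10-11, 20-23': force the values defined by MARC21.
        out[10:12] = b"22"
        out[20:24] = b"4500"
    if reset_codes:
        # 'Reset Leader 08-09, 17-19': blank the coded positions.
        out[8:10] = b"  "
        out[17:20] = b"   "
    return bytes(out)
```

With only the first option, a leader ending in '45XY' comes back ending in '4500' and the coded positions are left alone; the second option additionally blanks 08-09 and 17-19, which is why it should be used only when those positions are known to hold spurious characters.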

Note that the 'Reset Leader' options take precedence over the 'Control Code Translation' option if both are selected.

Identify unused local fields: This option will list, at the end of the Verify run, every 9xx tag that has not been used in the file. This may be helpful during database processing, where sometimes data must be added/copied/moved to temporary tags within the record.

NOTE: You can also get a list of unused 9XX fields (and a whole lot more) by running the MARC Analysis utility with the default options.

Control Code translation: This option will force Verify to convert any character in the sequence of Control Codes (x00-x20, x7F), with the exception of the Escape character (x1B) and the MARC delimiters (x1D-x1F), to a blank space. At the end of the Verify run, the report will include a table that lists each character converted, and the number of times it occurred in the source file. (In older versions of the program, this option was called 'Bad Character Translation'). By default, this option is not selected.
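The translation and the per-character counts can be sketched in Python as below. The names are hypothetical and this is not the program's own code; x20 is omitted from the table because it is already a blank.

```python
from collections import Counter

# Control codes x00-x1F and x7F, except Escape (x1B) and the MARC
# delimiters (x1D-x1F), which must be preserved.
KEEP = {0x1B, 0x1D, 0x1E, 0x1F}
CONTROL_CODES = bytes(b for b in [*range(0x00, 0x20), 0x7F] if b not in KEEP)
TABLE = bytes.maketrans(CONTROL_CODES, b" " * len(CONTROL_CODES))

def translate_control_codes(record: bytes):
    """Return (cleaned_record, counts of each control code converted)."""
    counts = Counter(b for b in record if b in CONTROL_CODES)
    return record.translate(TABLE), counts
```

The returned counter plays the role of the table in the report: one entry per converted character, with the number of times it occurred.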

NOTE: This option is safe for UTF-8 data because Control Code byte values never occur inside UTF-8 multibyte characters. This option would NOT ordinarily be safe for true Unicode data (e.g., UTF-16 as opposed to UTF-8). But because the current version of MARC Report does not recognize a Unicode file header (if you try to select a file with a BOM header using MARC Report, the 'This does not seem to be a MARC file' error will pop up), this potential problem is averted.

Repair Terminator Problems: If selected, the utility will repair fields that include multiple field terminators (x1E). It will also try to fix records where the field terminator has been omitted from the last field, and records where the record terminator was counted as part of the length of the last field. By default, this option is not selected. NOTE: This option is incompatible with 'Alphabetic Tag Support', described below.

Repair Delimiter Problems: In this section, the term 'delimiter' is used to refer to the MARC subfield delimiter (x1F). If this option is selected, the utility will try to fix the following problems with subfield delimiters:

control fields that begin with subfield $a (the $a will be deleted)
control fields that contain a subfield $a in position 3 and 4 (the first four bytes will be deleted)
control fields that contain the delimiter byte (it will either be deleted or replaced with a blank)
variable fields that begin with subfield $a (two blank spaces (indicators) will be inserted)
variable fields that contain an 'a' after two indicators (a delimiter byte will be inserted)
variable fields that begin with alphabetic data (two blank spaces and subfield $a will be inserted)
variable fields that contain two consecutive delimiters (the second delimiter will be deleted)
variable fields that contain a dollar sign after the indicators (the dollar sign will be replaced by a delimiter byte)
variable fields for numbers that begin with subfield $A ($A is replaced with $a)
indicator positions that contain invalid indicators (replace the invalid indicator(s) with blank space(s); a valid indicator is defined as a blank space, the numbers 0-9, or the fill character)

Any changes made via this option are logged to a file called vrfy_fix_log.txt (in the same directory as the results). By default, this option is not selected.

NOTE: As some systems use alphabetic indicators to mark local fields, the indicator fix for invalid indicators is no longer applied to any tag containing a '9' (019, 949, etc).
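The indicator rule, including the exception for tags containing a '9', might be sketched in Python like this. The function name is hypothetical, and the sketch covers only the indicator fix, not the other repairs in the list above.

```python
# A valid indicator is a blank, a digit 0-9, or the MARC fill character '|'.
VALID_INDICATORS = set(" 0123456789|")

def fix_indicators(tag: str, indicators: str) -> str:
    """Blank out invalid indicators, except in tags containing a '9'."""
    if "9" in tag:
        return indicators  # local fields may use alphabetic indicators
    return "".join(c if c in VALID_INDICATORS else " " for c in indicators)
```

For instance, indicators '1x' in a 245 become '1' followed by a blank, while 'ab' in a 949 is left untouched.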

Pause on MARC error: If selected, MARC Verify will stop on any MARC error and open a message window which contains the following information: the record number in the file, the disk block number (useful only to TMQ), the TMQ error code, the last good offset in the file (the position of the record terminator of the last successfully read MARC record), and the current file offset (where the error occurred).

Stop processing if MARC error count exceeds [10]: MARC Verify will stop reading the file if more than this number (default is 10) of MARC errors are encountered. The reason for this limit is that each record with a MARC error is dumped to a separate text file, and this limit ordinarily prevents thousands of records from being dumped in this manner if there is a problem with the file.

TMQ Option 1

This option is not selected by default. If selected, this option changes the behavior of two items in Verify: 1) how the default results filename is generated, and 2) whether or not a machine-readable statistics report is generated.

ALPHABETIC TAG SUPPORT

MARC21 does not support alphabetic tags, although the ISO standard (2709) upon which MARC21 is based does make a provision for this. The default action in MARC Verify is to treat any non-numeric character in the directory as an error, and to dump the record to a text error file. Therefore, if you have a file containing alphabetic tags, you will need to translate the alphabetic tags to numeric tags in order to use the file with MARC Report.

NOTE: We are using 'alphabetic tag' to mean any three-byte sequence in the tag position of the directory that contains at least one uppercase letter.

The 'Support Alphabetic Tags' option in MARC Verify makes such a translation or conversion possible. When this option is selected, the program will prompt you to select a conversion table from the MARC Report 'Options' folder. If the conversion table (as outlined below) is located and successfully parsed, MARC Verify will convert alphabetic tags to their corresponding numeric tags. If you select the 'Support alphabetic tags' option, and the conversion table is missing or invalid, you will be prompted to read this Help section :-(

ALPHABETIC TAG TABLE FORMAT

The format of the alphabetic tag conversion table must be (exactly) as follows: a 3-byte alphabetic tag in uppercase, followed by an '=' sign, followed by a 3-digit numeric tag. For example:

A01=961
CAT=962
FIN=963
FMT=964
LCS=965
SRC=966

Anything that does not match this pattern will be ignored. Thus, you can add comments before or after each line in the table; however, do not add anything on a line containing a valid entry (unless you wish to disable it).

Only uppercase alphabetic tags are supported at present. The table does not need to be in alphabetical order, although that is probably a good idea for ease of maintenance.

The numeric tags used on the right side of this table do not need to be 9XX tags; any numeric tags greater than '009' are permitted. However, if your records contain a mix of numeric and alphabetic tags, converting the alphabetic tags to a single tag range (such as 9XX) is a good way to separate the two types of tag in the output.
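A conversion table in this format is simple to parse. The following Python sketch shows one way to read it; the function name is hypothetical, and the strictness of the match (no surrounding whitespace, one entry per line) is an assumption based on the description above rather than a statement about how MARC Verify parses the file.

```python
import re

# Three uppercase letters/digits, '=', three digits, and nothing else.
ENTRY = re.compile(r"([A-Z0-9]{3})=([0-9]{3})")

def parse_tag_table(text: str) -> dict:
    """Map each alphabetic tag to its numeric tag; skip everything else."""
    table = {}
    for line in text.splitlines():
        match = ENTRY.fullmatch(line)
        if not match:
            continue  # comment, blank line, or disabled entry
        alpha, numeric = match.groups()
        if alpha.isdigit() or numeric <= "009":
            continue  # left side must contain a letter; 000-009 not allowed
        table[alpha] = numeric
    return table
```

Because anything extra on a line prevents a match, prefixing an entry with a character such as ';' disables it in this sketch.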

NOTES

You can create as many conversion tables as you want. Save these conversion tables to the folder: MyDocuments\MarcReport\Options. The program will use whatever conversion table was most recently selected (if any). To switch from one conversion table to another, you must toggle the 'Support Alphabetic Tags' checkbox.

If the data field referenced by the directory entry for an alphabetic tag lacks indicators or subfields, then two blank spaces and a subfield $a will be inserted into the tag.

When the program concludes, brief statistics on the alpha-to-numeric conversion will be added to the verify log. These statistics are added to the logfile only; they do not appear in the window that appears when the verify session completes. Therefore, check the log after every run, since any alphabetic tag in a record that is not in the conversion table will force the record to be regarded as an error and removed from the file.

If you need help getting a list of alphabetic tags for a file, check the MARC Report program folder for the utility called 'findAlphas.exe'. When you run this program, it will read the file you select and create a list of all alphabetic tags in the file and the number of times each one was found. Optionally, once the file has been read, this utility will generate an alphabetic tag conversion table using the information collected. You can edit the table yourself once it has been saved.

COPY ALPHABETIC TAGS TO $9

If selected, any 3-byte alphabetic tags that are converted will be copied to a subfield $9 and inserted at the start of the corresponding numeric field (ie. the first subfield, following the indicators) in the record when it is output. This option has no effect unless 'Support alphabetic tags' is also selected.
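To make the effect on a field concrete, here is a small Python sketch of the conversion with the $9 copy. The function name is hypothetical, and the test used to decide that a field 'lacks indicators or subfields' (no delimiter in the third byte) is an assumption made for illustration only.

```python
DELIM = b"\x1f"  # MARC subfield delimiter

def convert_field(alpha_tag: str, field: bytes, table: dict,
                  copy_to_9: bool = False):
    """Return (numeric_tag, field_data) for a field with an alphabetic tag.

    Raises KeyError if the tag is not in the conversion table; Verify
    would treat such a record as an error.
    """
    numeric = table[alpha_tag]
    if field[2:3] != DELIM:
        # No indicators/subfield: insert two blanks and a subfield $a.
        field = b"  " + DELIM + b"a" + field
    if copy_to_9:
        # Original tag goes into $9, right after the indicators.
        field = field[:2] + DELIM + b"9" + alpha_tag.encode("ascii") + field[2:]
    return numeric, field
```

With the table entry CAT=962, a CAT field holding '$aJSmith' would come out as a 962 field holding '$9CAT$aJSmith' when the copy option is on.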

FIXING MARC ERRORS

The text file that is created for a MARC error has the following format:

000 00921nam 2200289 a 4500
001 2002505120
003 DLC
005 20000227211754.0
008 880509s1988 nyua j 001 0 eng
010 $a 2002505120
020 $a0531171027 (lib. bdg.)
040 $aDLC$cDLC$dDLC$dWaU
042 $alcac
050 00$aQL737.M3$bB46 1988
082 00$a599.2$220
100 1 $aBender, Lionel.
245 10$aKangaroos and other marsupials /$cLionel Bender ; illustrations, George Thompson èõ ; editor, Denny Robson ; consultant, John Stidworthy.
TAG $aNew York :$bGloucester Press,$c1988.
TAG $a31 p. :$bill. (some col.) ;$c30 cm.
TAG 0$aFirst sight
TAG 0$aMarsupials$vJuvenile literature.
TAG 0$aKangaroos$vJuvenile literature.
TAG 1$aMarsupials.
TAG 1$aKangaroos.
TAG 1 $aThompson, George,$d1944-$eill.
TAG 1 $aRobson, Denny.
TAG 1 $aStidworthy, John,$d1943

— Record Number: 4 File Offset: 3965 Last Good EOR: 3042

In this example, all of the data in the record was recovered, but not all of the tags. The word 'TAG' is used as a placeholder, and it indicates entries where the directory could not be successfully validated. Note that in this text file, there is one space after every tag, no space between or after indicators, that '$' is used for the MARC subfield delimiter, and that lines wrap. The information following the '—' is debugging information and not part of the record.
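If you want to inspect MarcErr files with a script, the line format just described (tag, one space, two indicator characters, then '$'-delimited subfields) can be split apart with a few lines of Python. This is a hedged sketch with a hypothetical function name; it assumes each field sits on one unwrapped line and that '$' never occurs inside the data itself.

```python
def parse_error_line(line: str) -> dict:
    """Split one line of a MarcErr text file into its parts."""
    tag, _, rest = line.partition(" ")
    if tag.isdigit() and tag < "010":
        # Leader (000) and control fields have no indicators or subfields.
        return {"tag": tag, "data": rest}
    indicators, body = rest[:2], rest[2:]
    subfields = [(chunk[0], chunk[1:])
                 for chunk in body.split("$")[1:] if chunk]
    return {"tag": tag, "indicators": indicators, "subfields": subfields}
```

Lines that begin with the 'TAG' placeholder are parsed like any other variable field, so they can be listed and reviewed before the record is imported.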

It is quite easy to import this record into a MARC Report Edit Session. To do this, open the MarcErr textfile, edit the record (if necessary, but being careful to preserve the simple formatting outlined above) and copy it to your clipboard (select the record with your mouse, then press <Ctrl> + C).

Next, run an Edit Session in MARC Report on the MARC file you want to add this record to, and press <F9>. The record will be added to the session. You can then edit the record again in MARC Report if necessary.

Alternatively, if you import the record above without first fixing the 'TAG' errors, MARC Report will translate each 'TAG' to a 999; so, if you prefer, you could do this and just fix the tags in MARC Report.

PROBLEMS WITH RECORDS GREATER THAN 99999 BYTES

Some software that exports MARC records from library systems does not seem to realize that the largest number that is valid in the record length defined by MARC21 is 99999. We have seen leaders with record lengths that are 6 digits (not a good idea, as all offsets then become misaligned), and record lengths where the 6th digit has been dropped (e.g., a 600K record is in the file, but the record length in the leader has only the rightmost 5 digits).
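Both symptoms can be detected by comparing the length declared in the leader with the actual distance to the record terminator. The Python sketch below is only a diagnostic illustration with a hypothetical function name; it is not how Verify itself handles these records.

```python
def check_record_length(data: bytes, pos: int = 0) -> str:
    """Classify the record length declared at offset pos."""
    end = data.find(b"\x1d", pos)
    if end == -1:
        return "no record terminator found"
    actual = end - pos + 1  # length including the terminator
    if data[pos:pos + 6].isdigit():
        # Byte 5 should be the Record Status, not a sixth length digit.
        return "six-digit record length"
    declared = data[pos:pos + 5]
    if not declared.isdigit():
        return "record length is not numeric"
    if int(declared) == actual:
        return "ok"
    if actual > 99999 and int(declared) == actual % 100000:
        return "sixth digit dropped"
    return "length mismatch"
```

A record of 600123 bytes whose leader says '00123' is reported as 'sixth digit dropped' by this check.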

This happens more and more as systems export serial records with huge numbers of holdings. Alas, if only we had switched to MarcXml five years ago :-)

Best wishes aside, Verify will not be able to make sense of these records, and it is going to try to skip the record and dump it to a text 'MarcErr' file. It usually succeeds, but there may be situations where this fails and some sort of crash occurs.

In this case, after picking up the pieces, you might try the following. Start Verify, fill out the options form, and save it (a suggested name is 'Verify Huge Blocks.cfg'). Now, go find that file (in your MarcReport\Options folder) and open it in a text editor. On the last line, enter the following (left-justified, case-sensitive):

customBlocksize=1024000

Save the file. Start Verify, click Options|Load, and select the config created above. Then run the job. If it still fails, try the steps above again, this time upping the limit:

customBlocksize=5120000

Strange, but we've found this technique actually gets Verify to process files to completion that it would not process otherwise.