Encoding Repair

The Encoding repair tool checks whether the text encoding of a GEDCOM file is legal. It can also convert files to a modern text encoding so that modern applications can decode them without data corruption.

Fig 1. The Encoding repair tool.

Valid text encoding

Legal GEDCOM text encodings vary depending upon the version of the GEDCOM specification in use.

GEDCOM Version	Supported text encodings	CHAR tag?
1 and 2	Various - no standard	No
3.0, 4.0, 5.0 and 5.1	ANSEL and ASCII	Optional
5.2 and 5.3	Unicode (UTF-16), ANSEL and ASCII	Optional
5.4 and 5.5	Unicode (UTF-16), ANSEL and ASCII	Required
5.5.1 and 5.6	UTF-8, Unicode (UTF-16), ANSEL and ASCII	Required
7.x	UTF-8 only	No
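The table above can be expressed as a simple lookup when a declared encoding needs to be checked against a file's GEDCOM version. The following is a minimal Python sketch, not GEDCOM Validator's own code. The names `LEGAL_ENCODINGS` and `is_legal_encoding` are hypothetical. Versions 1 and 2 are omitted because they define no standard, and UTF-16 is written as `UNICODE`, the CHAR value used for it.

```python
# Legal encodings per GEDCOM version, transcribed from the table above.
_ANSEL_ASCII = frozenset({"ANSEL", "ASCII"})
_WITH_UTF16 = _ANSEL_ASCII | {"UNICODE"}   # UNICODE means UTF-16
_WITH_UTF8 = _WITH_UTF16 | {"UTF-8"}

LEGAL_ENCODINGS = {}
for versions, encodings in (
    (("3.0", "4.0", "5.0", "5.1"), _ANSEL_ASCII),
    (("5.2", "5.3", "5.4", "5.5"), _WITH_UTF16),
    (("5.5.1", "5.6"), _WITH_UTF8),
):
    for version in versions:
        LEGAL_ENCODINGS[version] = encodings


def is_legal_encoding(version: str, encoding: str) -> bool:
    """Return True if `encoding` is legal for the given GEDCOM version."""
    if version.startswith("7."):
        # GEDCOM 7.x permits UTF-8 only.
        return encoding.upper() == "UTF-8"
    # Unknown versions (including 1 and 2) yield False.
    return encoding.upper() in LEGAL_ENCODINGS.get(version, frozenset())
```

For example, `is_legal_encoding("5.5", "UTF-8")` is false, while the same encoding is accepted for version 5.5.1.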

The CHAR tag of the file header may state the text encoding used in a GEDCOM file, but this is only required in some versions of GEDCOM. GEDCOM files may optionally use a byte order mark (BOM) to identify the text encoding in the case of Unicode (UTF-16) and UTF-8 encoded files. In all cases, only encodings that are valid for the specific GEDCOM version should be used, and the BOM should be consistent with the CHAR tag value.
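A byte order mark can be recognised from the first few bytes of the file. The sketch below is illustrative only. It uses the BOM constants from Python's standard `codecs` module, and the function name `char_value_from_bom` is hypothetical. It returns the CHAR value that a BOM implies, or `None` when no UTF-8 or UTF-16 BOM is present.

```python
import codecs
from typing import Optional

# BOM byte sequences mapped to the CHAR value they imply.
_BOMS = (
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UNICODE"),
    (codecs.BOM_UTF16_BE, "UNICODE"),
)


def char_value_from_bom(data: bytes) -> Optional[str]:
    """Return the CHAR value implied by a leading BOM, or None if absent."""
    for bom, char_value in _BOMS:
        if data.startswith(bom):
            return char_value
    return None
```

Only the first four bytes of the file need to be read for this check.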

Problems with the text encoding of GEDCOM files fall into three categories:

Files are encoded using an illegal codepage

Some GEDCOM 5.5 files are encoded as UTF-8, which is not a valid text encoding for GEDCOM 5.5. Others have a CHAR tag value of ANSI, which is both invalid and misleading: ANSI can refer to several codepages, although in GEDCOM files it most commonly means codepage 1252. In other cases, the text is encoded using an entirely different codepage from the one specified in the CHAR tag.

Analysis of legacy GEDCOM applications can be used to infer the most likely codepage to use in cases where the CHAR tag value refers to an illegal codepage.

Files contain a byte order mark which does not agree with the CHAR tag value

A byte order mark (BOM) indicates the text encoding of the file to reading systems. It is common for a file to contain a byte order mark while its CHAR tag refers to a different text encoding entirely. In the case of a mismatch, the byte order mark should take precedence over the CHAR tag.
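The precedence rule can be sketched as follows. This is an assumption-laden illustration rather than the tool's implementation. `resolve_encoding` is a hypothetical name, the CHAR line is located with a simple regular expression, and a file without a BOM is scanned as Latin-1 purely to find the ASCII CHAR line.

```python
import codecs
import re
from typing import Optional, Tuple

# BOM bytes -> (Python codec for decoding, CHAR value the BOM implies)
_BOMS = (
    (codecs.BOM_UTF8, "utf-8-sig", "UTF-8"),
    (codecs.BOM_UTF16_LE, "utf-16", "UNICODE"),
    (codecs.BOM_UTF16_BE, "utf-16", "UNICODE"),
)

# Matches a level-1 CHAR line at the start of the text or after a line break.
_CHAR_LINE = re.compile(r"(?:^|[\r\n])[ \t]*1[ \t]+CHAR[ \t]+([^\r\n]+)")


def resolve_encoding(data: bytes) -> Tuple[Optional[str], Optional[str], bool]:
    """Return (codec from BOM, declared CHAR value, mismatch flag).

    When a BOM is present its codec is returned and should be used for
    decoding, even if the CHAR tag says otherwise.
    """
    codec = implied = None
    for bom, py_codec, char_value in _BOMS:
        if data.startswith(bom):
            codec, implied = py_codec, char_value
            break
    # Without a BOM, Latin-1 maps every byte and is enough to find the tag.
    text = data.decode(codec or "latin-1", errors="replace")
    match = _CHAR_LINE.search(text)
    declared = match.group(1).strip().upper() if match else None
    mismatch = implied is not None and declared is not None and declared != implied
    return codec, declared, mismatch
```

A file that starts with a UTF-8 BOM but declares `1 CHAR ANSEL` is reported as a mismatch, and the returned codec (`utf-8-sig`) is the one to decode with.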

Files do not contain a CHAR tag detailing the text encoding being used

Where a GEDCOM file does not contain a CHAR tag, it is difficult to determine the encoding of the original file automatically unless a BOM is present. This is the case for older versions of GEDCOM, which did not require a CHAR tag. The system default codepage is assumed in such cases, but you will need to manually check the output for data corruption.
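Falling back to the system default codepage might look like the sketch below. It assumes that Python's `locale.getpreferredencoding` is an acceptable stand-in for "system default codepage". The function name `decode_with_system_default` is hypothetical. The count of replacement characters gives a first hint of corruption but does not replace a manual check.

```python
import locale
from typing import Tuple


def decode_with_system_default(data: bytes) -> Tuple[str, str, int]:
    """Decode bytes using the system default encoding.

    Returns (encoding name, decoded text, number of undecodable sequences).
    """
    encoding = locale.getpreferredencoding(False)
    # Undecodable bytes become U+FFFD so they can be counted and reviewed.
    text = data.decode(encoding, errors="replace")
    suspect = text.count("\ufffd")
    return encoding, text, suspect
```

A count of zero only shows that every byte was decodable in the default codepage, not that the resulting characters are the ones the author intended.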

Selecting a custom source encoding

Sometimes you may be aware of the codepage used to create the GEDCOM file. For instance, if it was handwritten or subsequently edited in a text editor, the actual codepage in use may be any that was supported by the text editor. In these cases, you are advised to specify the source encoding and manually check the output for data corruption. You can preview possible character encoding issues in the GEDCOM file to determine if the selected source encoding will lead to data corruption.
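The idea of previewing encoding issues for a chosen source encoding can be sketched as below. This is an illustration under stated assumptions, not the tool's preview feature. `preview_decoding_issues` is a hypothetical name, `source_encoding` must be a codec name Python knows (ANSEL is not one), and the line splitting assumes an ASCII-compatible encoding, so it does not apply to UTF-16 files.

```python
from typing import List, Tuple


def preview_decoding_issues(
    data: bytes, source_encoding: str, limit: int = 10
) -> List[Tuple[int, str]]:
    """Return up to `limit` (line number, decoded line) pairs for lines
    containing bytes that are invalid in `source_encoding`."""
    issues = []
    for number, raw in enumerate(data.splitlines(), start=1):
        line = raw.decode(source_encoding, errors="replace")
        if "\ufffd" in line:  # U+FFFD marks an undecodable byte sequence
            issues.append((number, line))
            if len(issues) >= limit:
                break
    return issues
```

An empty result means every byte was valid in the selected encoding. It does not prove that the encoding is the right one, because single-byte codepages such as 1252 accept almost any byte, so the decoded names and places should still be checked by eye.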

Repairing files

Some GEDCOM files can be repaired by converting the file to use a valid modern text encoding and updating the corresponding CHAR tag value in the file. GEDCOM Validator can perform this repair operation on GEDCOM files with the following versions:

GEDCOM Version	Final text encoding	CHAR tag
5.2, 5.3, 5.4 and 5.5	Unicode (UTF-16)	`UNICODE`
5.5.1 and 5.6	Unicode (UTF-8)	`UTF-8`
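The repair described above, re-encoding the text and updating the CHAR tag to match, can be sketched in a few lines. This is a simplified illustration and not GEDCOM Validator's implementation. `repair_encoding` is a hypothetical name, `source_encoding` must be a Python codec name, and the sketch assumes lines end in LF or CRLF.

```python
import re

# CHAR value written to the file -> Python codec used to encode the output.
_TARGETS = {"UTF-8": "utf-8", "UNICODE": "utf-16"}

# Captures the "1 CHAR " prefix so only the value is replaced.
_CHAR_LINE = re.compile(r"^([ \t]*1[ \t]+CHAR[ \t]+)[^\r\n]*", re.MULTILINE)


def repair_encoding(data: bytes, source_encoding: str, char_value: str = "UTF-8") -> bytes:
    """Re-encode GEDCOM bytes and set the CHAR tag to `char_value`."""
    codec = _TARGETS[char_value]
    # Strict decoding: fail loudly rather than silently corrupt data.
    text = data.decode(source_encoding)
    text = text.lstrip("\ufeff")  # drop any BOM carried over from the source
    text = _CHAR_LINE.sub(lambda m: m.group(1) + char_value, text, count=1)
    # Python's "utf-16" codec writes a BOM; "utf-8" does not.
    return text.encode(codec)
```

For a 5.5.1 file mislabelled as `ANSI`, calling `repair_encoding(data, "cp1252")` would yield UTF-8 bytes with `1 CHAR UTF-8` in the header. Passing `char_value="UNICODE"` produces UTF-16 output for versions 5.2 to 5.5.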

All modern applications should be able to read UTF-16 and UTF-8 encoded GEDCOM files. Use of applications not supporting UTF-8 is not recommended for new projects.

Much older GEDCOM versions do not support modern text encodings, and applications may not interpret these files correctly unless they allow you to specify the actual text encoding used to create the file. GEDCOM Validator allows you to convert such files to Unicode (UTF-8) encoding. Whilst the result is not legal GEDCOM, modern applications and text editors should be able to correctly interpret text in any GEDCOM file that is encoded using Unicode (UTF-8).