The Encoding repair tool can check GEDCOM files to see if their text encoding is legal. It can also convert files to a modern text encoding so that modern applications can decode the GEDCOM file without data corruption.
Fig 1. The Encoding repair tool.
Valid text encoding
Legal GEDCOM text encodings vary depending upon the version of the GEDCOM specification in use.
|GEDCOM Version||Supported text encodings||CHAR tag?|
|1 and 2||Various - no standard||No|
|3.0, 4.0, 5.0 and 5.1||ANSEL and ASCII||Optional|
|5.2 and 5.3||Unicode (UTF-16), ANSEL and ASCII||Optional|
|5.4 and 5.5||Unicode (UTF-16), ANSEL and ASCII||Required|
|5.5.1 and 5.6||UTF-8, Unicode (UTF-16), ANSEL and ASCII||Required|
The CHAR tag in the file header may state the text encoding used in a GEDCOM file, but the tag is only required in some versions of GEDCOM. Unicode (UTF-16) and UTF-8 encoded files may optionally begin with a byte order mark (BOM) that identifies the text encoding. In all cases, only encodings that are valid for the specific GEDCOM version should be used, and the BOM should be consistent with the CHAR tag value.
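The consistency rule can be illustrated with a short Python sketch. It is not GEDCOM Validator's implementation: the names `VALID_ENCODINGS`, `char_from_bom` and `is_consistent` are hypothetical, and the lookup simply transcribes the table above for the versions that may carry a CHAR tag.

```python
import codecs

# CHAR tag values permitted per GEDCOM version, transcribed from the table above.
_ANSEL_ASCII = {"ANSEL", "ASCII"}
_WITH_UTF16 = _ANSEL_ASCII | {"UNICODE"}
_WITH_UTF8 = _WITH_UTF16 | {"UTF-8"}
VALID_ENCODINGS = {
    "3.0": _ANSEL_ASCII, "4.0": _ANSEL_ASCII, "5.0": _ANSEL_ASCII, "5.1": _ANSEL_ASCII,
    "5.2": _WITH_UTF16, "5.3": _WITH_UTF16, "5.4": _WITH_UTF16, "5.5": _WITH_UTF16,
    "5.5.1": _WITH_UTF8, "5.6": _WITH_UTF8,
}

# Byte order marks mapped to the CHAR tag value they imply.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UNICODE"),
    (codecs.BOM_UTF16_BE, "UNICODE"),
]


def char_from_bom(data: bytes):
    """Return the CHAR value implied by a leading BOM, or None if there is no BOM."""
    for bom, char_value in BOMS:
        if data.startswith(bom):
            return char_value
    return None


def is_consistent(data: bytes, version: str, char_tag: str) -> bool:
    """True if char_tag is valid for the version and agrees with any BOM present."""
    if char_tag not in VALID_ENCODINGS.get(version, set()):
        return False
    bom_value = char_from_bom(data)
    return bom_value is None or bom_value == char_tag
```

For example, a file that starts with a UTF-8 BOM and declares `UTF-8` passes for version 5.5.1 but fails for version 5.5, where UTF-8 is not a legal encoding.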
Problems with the text encoding of GEDCOM files fall into three categories:
Files are encoded using an illegal codepage
Some GEDCOM 5.5 files are encoded as UTF-8, which is not a valid text encoding for GEDCOM 5.5. Others have a CHAR tag value of ANSI, which is both invalid and misleading: ANSI can refer to multiple codepages, although in GEDCOM files it most commonly means codepage 1252. In other cases the text may be encoded using an entirely different codepage to the one specified in the CHAR tag.
Analysis of legacy GEDCOM applications can be used to infer the most likely codepage to use in cases where the CHAR tag value refers to an illegal codepage.
Files contain a byte order mark which does not agree with the CHAR tag value
A byte order mark (BOM) indicates the text encoding of the file to reading systems. It is common for a file to begin with a byte order mark while its CHAR tag names an entirely different text encoding. In the case of a mismatch, the byte order mark should take precedence over the CHAR tag.
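A minimal Python sketch of this precedence rule follows. The function name `effective_codec` and the `CHAR_TO_CODEC` mapping are illustrative assumptions, not part of GEDCOM Validator. The mapping of `ANSI` to codepage 1252 reflects only the most common case, and ANSEL is left out because the Python standard library has no ANSEL codec.

```python
import codecs

# Assumed mapping from CHAR tag values to Python codec names.
# ANSEL is absent: the standard library does not ship an ANSEL codec.
CHAR_TO_CODEC = {
    "UTF-8": "utf-8",
    "UNICODE": "utf-16",
    "ASCII": "ascii",
    "ANSI": "cp1252",  # illegal CHAR value; cp1252 is only the most likely meaning
}


def effective_codec(data: bytes, char_tag):
    """Pick a codec for decoding, letting a BOM override the CHAR tag."""
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig"  # decodes UTF-8 and strips the BOM
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16"     # uses the BOM to select the byte order
    # No BOM: fall back to the CHAR tag; None means the tag is unknown here.
    return CHAR_TO_CODEC.get(char_tag)
```

With this rule, a file that starts with a UTF-8 BOM but declares `1 CHAR ANSI` is decoded as UTF-8, so accented characters survive instead of being misread as codepage 1252.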
Files do not contain a CHAR tag detailing the text encoding being used
Where a GEDCOM file does not contain a CHAR tag, it is difficult to determine the encoding of the original file automatically unless a BOM is present. This is the case for older versions of GEDCOM, which did not require a CHAR tag. The system default codepage is assumed in such cases, but you will need to manually check the output for data corruption.
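The fallback can be sketched in Python as below. This is an illustration under assumptions rather than the tool's actual logic: `find_char_tag` and `codec_without_char_tag` are hypothetical names, the header scan only works for ASCII-compatible encodings (not UTF-16), and the "system default codepage" is approximated by `locale.getpreferredencoding`.

```python
import codecs
import locale
import re

# Matches a level-1 CHAR line such as b"1 CHAR ANSEL" at the start of a line.
CHAR_LINE = re.compile(rb"^1 CHAR[ \t]+(\S+)", re.MULTILINE)


def find_char_tag(data: bytes):
    """Return the CHAR tag value from raw bytes, or None if no CHAR line exists."""
    match = CHAR_LINE.search(data)
    return match.group(1).decode("ascii", errors="replace") if match else None


def codec_without_char_tag(data: bytes):
    """Choose a codec for a file with no CHAR tag.

    Returns (codec_name, needs_manual_check). A BOM gives a reliable answer;
    otherwise the system default is a guess that must be checked by hand.
    """
    if data.startswith(codecs.BOM_UTF8):
        return "utf-8-sig", False
    if data.startswith((codecs.BOM_UTF16_LE, codecs.BOM_UTF16_BE)):
        return "utf-16", False
    return locale.getpreferredencoding(False), True
```

The second element of the returned tuple records whether the result was a guess, which is the situation where the converted output needs a manual review.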
Selecting a custom source encoding
Sometimes you may know which codepage was used to create the GEDCOM file. For instance, if the file was written by hand or subsequently edited in a text editor, the actual codepage in use may be any that the text editor supported. In these cases you are advised to specify the source encoding and manually check the output for data corruption. You can preview possible character encoding issues in the GEDCOM file to determine whether the selected source encoding will lead to data corruption.
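One way to approximate such a preview is to decode the file with the candidate source encoding and list the lines that contain bytes the encoding cannot represent. The Python sketch below is an assumption about how a preview could work, not a description of the tool; `preview_issues` is a hypothetical name. It only catches undecodable bytes, so text that decodes without error but into the wrong characters still needs a visual check.

```python
def preview_issues(data: bytes, source_codec: str):
    """Return (line_number, text) pairs for lines that fail to decode cleanly.

    Undecodable bytes are turned into U+FFFD (the replacement character),
    which is then used as the marker for a problem line.
    """
    text = data.decode(source_codec, errors="replace")
    return [
        (number, line)
        for number, line in enumerate(text.splitlines(), start=1)
        if "\ufffd" in line
    ]


# A file written in codepage 1252, previewed under two candidate encodings.
sample = b"0 HEAD\n1 NAME Jos\xe9 /Garc\xeda/\n0 TRLR\n"
print(preview_issues(sample, "utf-8"))   # line 2 is flagged
print(preview_issues(sample, "cp1252"))  # no lines are flagged
```

An empty result for one candidate and flagged lines for another is a useful hint about the true source encoding, though it is not proof.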
Some GEDCOM files can be repaired by converting the file to use a valid modern text encoding and updating the corresponding CHAR tag value in the file. GEDCOM Validator can perform this repair operation on GEDCOM files with the following versions:
|GEDCOM Version||Final text encoding||CHAR tag|
|5.2, 5.3, 5.4 and 5.5||Unicode (UTF-16)||UNICODE|
|5.5.1 and 5.6||Unicode (UTF-8)||UTF-8|
All modern applications should be able to read UTF-16 and UTF-8 encoded GEDCOM files. Use of applications not supporting UTF-8 is not recommended for new projects.
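The repair described above amounts to decoding with the source encoding, rewriting the CHAR line, and re-encoding with the target from the table. The Python sketch below illustrates that idea under assumptions: `TARGETS` and `repair` are hypothetical names, the UTF-8 output is written without a BOM (the BOM is optional), and a file with no CHAR line is re-encoded without one being added.

```python
import re

# Target codec and CHAR tag value per GEDCOM version, from the table above.
TARGETS = {
    "5.2": ("utf-16", "UNICODE"),
    "5.3": ("utf-16", "UNICODE"),
    "5.4": ("utf-16", "UNICODE"),
    "5.5": ("utf-16", "UNICODE"),
    "5.5.1": ("utf-8", "UTF-8"),
    "5.6": ("utf-8", "UTF-8"),
}

# Matches the CHAR line up to, but not including, its line terminator.
CHAR_LINE = re.compile(r"^1 CHAR[ \t]+[^\r\n]*", re.MULTILINE)


def repair(data: bytes, source_codec: str, version: str) -> bytes:
    """Re-encode a GEDCOM file and update its CHAR tag to match."""
    target_codec, char_value = TARGETS[version]
    text = data.decode(source_codec)  # raises UnicodeDecodeError on bad input
    text = CHAR_LINE.sub(lambda m: f"1 CHAR {char_value}", text, count=1)
    # Python's "utf-16" codec writes a BOM followed by native-order code units.
    return text.encode(target_codec)
```

For instance, `repair(data, "cp1252", "5.5.1")` turns a codepage 1252 file declaring `1 CHAR ANSI` into UTF-8 bytes declaring `1 CHAR UTF-8`, leaving the original line endings untouched.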
Much older GEDCOM versions do not support modern text encodings and applications may not interpret these files correctly unless the application allows you to specify the actual text encoding used to create the file. GEDCOM Validator allows you to convert such files to use Unicode (UTF-8) encoding. Whilst this is not legal GEDCOM, modern applications and text editors should be able to correctly interpret text in any GEDCOM file which is encoded using Unicode (UTF-8).