The Encoding fixer tool can check GEDCOM files to see if their text encoding is legal. It can also correct the text encoding to ensure that modern applications can decode the GEDCOM file without data corruption.
Fig 1. The Encoding fixer tool.
Valid text encoding
Legal GEDCOM text encodings vary depending upon the version of the GEDCOM specification in use.
|GEDCOM Version||Supported text encodings|
|3.0, 4.0, 5.0 and 5.1||ANSEL and ASCII|
|5.2, 5.3, 5.4 and 5.5||Unicode (UTF-16), ANSEL and ASCII|
|5.5.1 and 5.6||UTF-8, Unicode (UTF-16), ANSEL and ASCII|
Valid GEDCOM files must state the text encoding being used in the CHAR tag of the file header. Files encoded as Unicode (UTF-16) or UTF-8 may optionally begin with a byte order mark (BOM) that identifies the text encoding. In all cases, only encodings that are valid for the GEDCOM version of the file should be used, and the BOM should be consistent with the CHAR tag value.
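The table above can be expressed as a simple lookup. The following is a minimal sketch, not the tool's actual implementation; the names `LEGAL_ENCODINGS` and `is_legal_encoding` are hypothetical, and it assumes UTF-16 is declared with the CHAR value `UNICODE`, as in the repair table later in this page.

```python
# Legal CHAR values per GEDCOM version, transcribed from the table above.
# Assumption: UTF-16 files declare themselves with the CHAR value "UNICODE".
LEGAL_ENCODINGS = {
    ("3.0", "4.0", "5.0", "5.1"): {"ANSEL", "ASCII"},
    ("5.2", "5.3", "5.4", "5.5"): {"UNICODE", "ANSEL", "ASCII"},
    ("5.5.1", "5.6"): {"UTF-8", "UNICODE", "ANSEL", "ASCII"},
}


def is_legal_encoding(version: str, char_value: str) -> bool:
    """Return True if char_value is a legal CHAR value for this GEDCOM version."""
    for versions, encodings in LEGAL_ENCODINGS.items():
        if version in versions:
            return char_value.strip().upper() in encodings
    # Unknown version: nothing can be confirmed as legal.
    return False
```

For example, `is_legal_encoding("5.5", "UTF-8")` is false, while the same CHAR value is accepted for version 5.5.1.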
Problems with the text encoding of GEDCOM files fall into 3 categories:
Files are encoded using an illegal codepage
Some GEDCOM 5.5 files are encoded as UTF-8, which is not a valid text encoding for GEDCOM 5.5. Others have a CHAR tag value of ANSI, which is both invalid and misleading: ANSI can refer to multiple codepages, although in GEDCOM files it most commonly means codepage 1252. In other cases the text may be encoded using an entirely different codepage from the one specified in the CHAR tag.
Analysis of legacy GEDCOM applications can be used to infer the most likely codepage to use in cases where the CHAR tag value refers to an illegal codepage.
Files contain a byte order mark which does not agree with the CHAR tag value
A byte order mark (BOM) is used to indicate to reading systems the text encoding of the file. It is common for a byte order mark to be specified and the CHAR tag to refer to a different type of text encoding entirely. The byte order mark should take precedence over the CHAR tag in the case of a mismatch.
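The precedence rule can be sketched as follows. This is an illustrative example only; `encoding_from_bom` and `effective_encoding` are hypothetical names, and only the UTF-8 and UTF-16 byte order marks are considered because those are the only BOM-bearing encodings that GEDCOM allows.

```python
import codecs

# BOM byte sequences mapped to the GEDCOM CHAR value they imply.
BOMS = [
    (codecs.BOM_UTF8, "UTF-8"),
    (codecs.BOM_UTF16_LE, "UNICODE"),
    (codecs.BOM_UTF16_BE, "UNICODE"),
]


def encoding_from_bom(data: bytes):
    """Return the CHAR value implied by a leading BOM, or None if there is none."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name
    return None


def effective_encoding(data: bytes, char_value):
    """Prefer the BOM over the CHAR tag value when both are present."""
    bom_encoding = encoding_from_bom(data)
    return bom_encoding if bom_encoding is not None else char_value
```

A file that starts with a UTF-8 BOM but declares `1 CHAR ANSEL` would therefore be treated as UTF-8.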
Files do not contain a CHAR tag detailing the text encoding being used
Where a GEDCOM file does not contain a CHAR tag, it is almost impossible to determine the encoding of the original file automatically unless a BOM is present. The system default codepage is assumed in such cases, but you will need to manually check the output for data corruption.
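A minimal sketch of this fallback is shown below. The function names are hypothetical, and the header scan assumes the file uses an ASCII-compatible encoding (it will not find the CHAR tag in a UTF-16 file, where a BOM check would be needed first). Python's `locale.getpreferredencoding` stands in for "the system default codepage".

```python
import locale


def declared_encoding(data: bytes):
    """Return the CHAR tag value from the file header, or None if it is absent."""
    for line in data.splitlines():
        parts = line.split(None, 2)
        # The header ends at the first level-0 record other than HEAD.
        if parts and parts[0] == b"0" and parts[1:2] != [b"HEAD"]:
            break
        if len(parts) == 3 and parts[0] == b"1" and parts[1] == b"CHAR":
            return parts[2].strip().decode("ascii", errors="replace")
    return None


def source_encoding(data: bytes) -> str:
    """Use the declared CHAR value, falling back to the system default codepage."""
    declared = declared_encoding(data)
    if declared is not None:
        return declared
    # No CHAR tag: assume the system default; the output needs a manual check.
    return locale.getpreferredencoding(False)
```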
Selecting a custom source encoding
Sometimes you may know which codepage was used to create the GEDCOM file. For instance, if the file was written by hand or subsequently edited in a text editor, the actual codepage in use may be any that the text editor supported. In these cases you are advised to select a custom source encoding and manually check the output for data corruption. You can preview possible character encoding issues in the GEDCOM file to determine whether the selected source encoding will lead to data corruption.
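The idea behind such a preview can be illustrated with a short sketch. This is not how the tool itself works; `preview_decoding_issues` is a hypothetical helper that tries to decode each line with the candidate encoding and reports the lines that fail. It assumes a single-byte or ASCII-compatible codepage, since it splits the raw bytes on line breaks.

```python
def preview_decoding_issues(data: bytes, encoding: str, limit: int = 10):
    """Return (line number, reason) pairs for lines that fail to decode."""
    issues = []
    for number, line in enumerate(data.splitlines(), start=1):
        try:
            line.decode(encoding)
        except UnicodeDecodeError as error:
            issues.append((number, error.reason))
            if len(issues) >= limit:
                break
    return issues
```

An empty result only shows that every byte is valid in the chosen codepage, not that the codepage is the right one, so a visual check of accented names and place names is still needed.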
GEDCOM files can be repaired by converting the file to a valid text encoding and updating the corresponding CHAR tag in the file. GEDCOM Validator can perform this repair operation on GEDCOM files of the following versions:
|GEDCOM Version||Final text encoding||CHAR tag|
|3.0, 4.0, 5.0 and 5.1||ANSEL||ANSEL|
|5.2, 5.3, 5.4 and 5.5||Unicode (UTF-16)||UNICODE|
|5.5.1 and 5.6||UTF-8||UTF-8|
All modern applications should be able to read UTF-16 and UTF-8 encoded GEDCOM files. Use of applications not supporting UTF-8 is not recommended.
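The repair described above amounts to decoding the file with its true source encoding, rewriting the CHAR line, and re-encoding. The sketch below illustrates this under stated assumptions: `repair` and `TARGETS` are hypothetical names, the source encoding is given as a Python codec name, and only the UTF-16 and UTF-8 targets are covered because the Python standard library has no ANSEL codec.

```python
import re

# Target Python codec and CHAR value per GEDCOM version, from the table above.
# ANSEL targets (versions 3.0 to 5.1) are omitted: Python has no built-in ANSEL codec.
TARGETS = {
    "5.2": ("utf-16", "UNICODE"),
    "5.3": ("utf-16", "UNICODE"),
    "5.4": ("utf-16", "UNICODE"),
    "5.5": ("utf-16", "UNICODE"),
    "5.5.1": ("utf-8", "UTF-8"),
    "5.6": ("utf-8", "UTF-8"),
}

# Matches the level-1 CHAR line and captures everything before its value.
CHAR_LINE = re.compile(r"^([ \t]*1[ \t]+CHAR)[ \t]+[^\r\n]*", re.MULTILINE)


def repair(data: bytes, source_codec: str, version: str) -> bytes:
    """Re-encode a GEDCOM file and update its CHAR tag to match."""
    if version not in TARGETS:
        raise ValueError(f"no supported target encoding for GEDCOM {version}")
    target_codec, char_value = TARGETS[version]
    # Decode with the real source encoding and drop any leading BOM character.
    text = data.decode(source_codec).lstrip("\ufeff")
    # Rewrite only the first CHAR line, leaving line endings untouched.
    text = CHAR_LINE.sub(lambda m: f"{m.group(1)} {char_value}", text, count=1)
    # Python's "utf-16" codec writes a BOM; "utf-8" does not.
    return text.encode(target_codec)
```

For example, a GEDCOM 5.5.1 file declared as `ANSI` and actually stored in codepage 1252 could be passed as `repair(data, "cp1252", "5.5.1")`. A file with no CHAR line at all is re-encoded but not given a new tag by this sketch.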