Hi Sancoder,
sancoder wrote:
As to character encodings.
We have several options to choose from: ASCII, UTF-8, other 1-byte-per-character codepages (cp1252, for Europe I guess), UTF-16, and other N-byte-per-character encodings (UCS-2, but it seems to be deprecated almost everywhere).
If ASCII went into the standard, the checkers developer world would be seen as dinosaurs. Come on, it's the 21st century. There is absolutely NO reason to use such an encoding.
Any other character encoding without full Unicode coverage (cp1252, cp1251 for the Cyrillic world, etc.) would cause problems from the beginning: the PDN file would need an external attribute, i.e. its encoding. The end user would have to choose which encoding a particular PDN file has, and I think this is not very good. For developers any encoding is a pain, since nobody understands all the details from the start.
There are several encodings with full Unicode coverage. UTF-8 is a superset of ASCII, and for this reason I would recommend it as the standard. UTF-16 comes in variants (big endian, little endian), but the main problem is that every character, even an ASCII one, would take at least 2 bytes. So UTF-16 (and its predecessor UCS-2) would break compatibility with existing ASCII files.
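To make the size and compatibility argument concrete, here is a small Python sketch. The sample tag lines are made up for illustration; only the encodings themselves come from the discussion above.

```python
# Illustrative tag lines; the Event value is a made-up example.
ascii_line = '[Event "World Championship"]'
unicode_line = '[White "Сергей Фадеев"]'

# Pure ASCII text produces exactly the same bytes in ASCII and in UTF-8.
same_bytes = ascii_line.encode("ascii") == ascii_line.encode("utf-8")

# UTF-16 (little endian, no BOM) needs 2 bytes for each of these characters.
utf8_len = len(ascii_line.encode("utf-8"))
utf16_len = len(ascii_line.encode("utf-16-le"))

# Non-ASCII characters show up in UTF-8 as bytes in the range 0x80..0xFF.
high_bytes = [b for b in unicode_line.encode("utf-8") if b >= 0x80]

# The same line cannot be written as ASCII at all.
try:
    unicode_line.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError:
    ascii_ok = False

print(same_bytes, utf8_len, utf16_len, len(high_bytes), ascii_ok)
```

So an existing ASCII PDN file is already a valid UTF-8 file byte for byte, while the same file in UTF-16 would be twice as large and unreadable for ASCII-only tools.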
For UTF-8 to be standardized, only one change to the grammar is necessary: to treat bytes from 0x80 to 0xFF as normal characters, i.e. to allow them.
There is still the question of the BOM (byte order mark) at the beginning of the file (the BOM signals that the file is in UTF-8 or UTF-16 encoding).
The reader should allow this marker. The writer ... I'm not sure, but I think it should not write this marker to the file (for compatibility reasons).
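A minimal sketch of that reader/writer policy in Python. The function names `decode_pdn` and `encode_pdn` are hypothetical and not part of any PDN library; the `utf-8-sig` codec is Python's standard codec that drops a leading UTF-8 BOM when decoding.

```python
import codecs


def decode_pdn(raw: bytes) -> str:
    """Decode PDN bytes as UTF-8, accepting (and dropping) a leading BOM."""
    return raw.decode("utf-8-sig")


def encode_pdn(text: str) -> bytes:
    """Encode PDN text as UTF-8 without writing a BOM."""
    return text.encode("utf-8")


with_bom = codecs.BOM_UTF8 + '[Black "高文龙"]'.encode("utf-8")
without_bom = '[Black "高文龙"]'.encode("utf-8")

# Both inputs decode to the same text; the output never starts with a BOM.
print(decode_pdn(with_bom) == decode_pdn(without_bom))
print(encode_pdn(decode_pdn(with_bom)).startswith(codecs.BOM_UTF8))
```

This covers only the UTF-8 BOM. A reader that also wants to accept UTF-16 files would have to inspect the first bytes and pick the codec accordingly.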
As to the dash in alphanumeric moves.
It should be optional. Disallowing 'a1b2' is very unfriendly.
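As an illustration of how little it costs a reader to accept both forms, here is a rough Python sketch. It assumes an 8x8 board with squares a1..h8 and handles only plain moves with a dash; the pattern and the name `parse_move` are my own illustration, not the PDN grammar.

```python
import re

# Two squares with an optional single dash between them.
# Assumption: 8x8 board, files a-h and ranks 1-8.
MOVE_RE = re.compile(r"([a-h][1-8])-?([a-h][1-8])")


def parse_move(token):
    """Return (from_square, to_square), or None if the token is not a move."""
    match = MOVE_RE.fullmatch(token)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_move("a1b2"))
print(parse_move("a1-b2"))
print(parse_move("a1--b2"))
```

Both 'a1b2' and 'a1-b2' give the same pair of squares, so the writer can keep emitting the dash while the reader stays tolerant.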
In PGN there is a section about character codes: http://www.saremba.de/chessgml/standard ... e.htm#c4.1. It recommends using only ASCII characters. This seems questionable to me, so I did not copy it into the PDN standard.
It seems to me that STRING values and COMMENT values should be allowed to contain Unicode characters. Note that the current token definitions already allow this. It depends on the parser generator whether or not Unicode characters are accepted. All of the example implementations accept the following UTF-8 input:
[White "Сергей Фадеев"]
[Black "高文龙"]
1.32-28 19-23
So I believe that the grammar does not need to be changed to support Unicode. Note that the current grammar does not allow Unicode characters to be used in other places, like tag names. This was done on purpose.
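For readers who want to try this without a parser generator, here is a rough Python sketch of the same idea: tag names restricted to ASCII, string values free to contain any Unicode character. The regular expression is a simplified stand-in that I made up for illustration; it is not the token definition from the PDN grammar, and `parse_tag` is a hypothetical helper.

```python
import re

# Assumption: a tag name is ASCII letters, digits or underscore.
# The quoted value may contain any character except an unescaped quote.
TAG_RE = re.compile(r'\[\s*([A-Za-z0-9_]+)\s+"((?:[^"\\]|\\.)*)"\s*\]')


def parse_tag(line):
    """Return (name, value) for a tag pair line, or None if it does not match."""
    match = TAG_RE.fullmatch(line.strip())
    if match is None:
        return None
    return match.group(1), match.group(2)


# Decode from UTF-8 bytes first, as a reader would when loading a file.
raw = '[White "Сергей Фадеев"]\n[Black "高文龙"]\n'.encode("utf-8")
tags = [parse_tag(line) for line in raw.decode("utf-8").splitlines()]
print(tags)

# A non-ASCII tag name is rejected on purpose.
print(parse_tag('[Белые "Сергей Фадеев"]'))
```

Because the sketch works on decoded text, the restriction on tag names is expressed in characters, not bytes, which matches how most parser generators treat Unicode input.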
On the one hand I can understand that it is useful to restrict PDN to a single format like UTF-8. On the other hand I find it questionable to disallow ASCII files. I can't imagine PDN readers that would reject a PDN file because it is encoded in ASCII. Are there more opinions about this?