Sealeopard
(KiX Master)
2003-12-05 04:13 PM
UNICODE support for READLINE

Currently, READLINE does not support UNICODE-formatted text files. These files are read in on a character-by-character basis instead of a line-by-line basis. Additionally, the CR and LF characters are dropped, which makes it nearly impossible to reassemble the text file unless one treats two consecutive empty 'lines' as an indicator of a CRLF combination.
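To illustrate the symptom, here is a minimal Python sketch (not KiXtart). It assumes the file is UTF-16 little-endian and models the reader as one that ends a "line" at every NUL byte and discards CR and LF. That model is an assumption made for illustration, not a description of KiXtart's actual internals, and `naive_read` is a hypothetical name.

```python
def naive_read(data: bytes) -> list[str]:
    """Toy model of a byte-oriented reader that ends a 'line' at every NUL byte."""
    pieces = data.split(b"\x00")
    if pieces and pieces[-1] == b"":
        pieces.pop()  # split() leaves an empty tail after the final NUL
    # CR and LF are dropped from each piece, as described in the post above.
    return [
        p.replace(b"\r", b"").replace(b"\n", b"").decode("latin-1")
        for p in pieces
    ]


# "AB", a CRLF, then "C", stored as UTF-16LE: every character is followed by a zero byte.
sample = "AB\r\nC".encode("utf-16-le")
print(naive_read(sample))  # ['A', 'B', '', '', 'C']
```

Under this model each character comes back as its own "line", and the CRLF pair shows up as two empty ones.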

Suggestion triggered by the following post: http://www.kixscripts.com/forum/tm.asp?m=3472


Howard Bullock
(KiX Supporter)
2003-12-05 04:59 PM
Re: UNICODE support for READLINE

This is a good request. Maybe something like supporting chr(0) in strings would be sufficient.

http://www.kixtart.org/ubbthreads/showthreaded.php?Cat=&Number=66178


Allen (Administrator)
(KiX Supporter)
2003-12-05 07:12 PM
Re: UNICODE support for READLINE

IIRC, I ran into the same problem while writing the Addprinter() UDF. The way I got around it was to shell out and TYPE the file to another file.

Code:
  
shell '%comspec% /c type "$driverinf">%temp%\addprinter.txt'



Once the new file was created, READLINE worked normally. However, you are right, it would be nice to be able to just read the Unicode file directly.
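For comparison, the same kind of conversion (UTF-16 in, single-byte text out) can be sketched in Python. This is an illustration only: the function name and the choice of code page 1252 as the target are assumptions, and the result is not claimed to match TYPE's output byte for byte.

```python
def utf16_to_single_byte(data: bytes, target: str = "cp1252") -> bytes:
    """Decode UTF-16 data (the BOM, if present, selects the byte order)
    and re-encode it in a single-byte code page."""
    text = data.decode("utf-16")
    # Characters the target code page cannot represent become '?'.
    return text.encode(target, errors="replace")


# Example: the contents of a small UTF-16 file, including its BOM.
wide = "[Version]\r\nSignature=x\r\n".encode("utf-16")
print(utf16_to_single_byte(wide))
```

In practice the bytes would come from `open(path, "rb").read()` and the result would be written to a second file, which a byte-oriented line reader can then process.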




Richard H. (Administrator)
(KiX Supporter)
2003-12-08 11:37 AM
Re: UNICODE support for READLINE

Quote:

Maybe something like supporting chr(0) in strings would be sufficient.




Supporting wide character/double-byte/unicode is a good idea and I believe will become more and more important. I think that simply changing the internal support for basic strings is not going to be sufficient to support Unicode.

Unicode characters are not simply an ASCII character preceded by a Chr(0). That Chr(0) is there for a reason, see the Unicode home page for the full spec.

This means that you need to be careful when reading, writing, taking substrings of, searching (InStr), concatenating, testing and converting strings, so that the character set information is preserved.

The better solution is to have a new "wide string" basic type and either update the string functions to auto-magically support both or add wide string functions.

Conversion between wide and non-wide strings would be automatic in the same way as (say) between strings and number types. The extra byte would of course be lost when converting from a wide to a non-wide string, and would have to be set to '0' when converting from a non-wide to a wide string.
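A minimal Python sketch of that conversion rule, assuming 16-bit little-endian characters (low byte first). The names `widen` and `narrow` are hypothetical and only illustrate the idea.

```python
def widen(narrow_bytes: bytes) -> bytes:
    """Non-wide to wide: each byte becomes a 16-bit unit with a zero high byte."""
    out = bytearray()
    for b in narrow_bytes:
        out += bytes((b, 0))  # low byte first, then the zero high byte
    return bytes(out)


def narrow(wide_bytes: bytes) -> bytes:
    """Wide to non-wide: keep only the low byte of each 16-bit unit (lossy)."""
    return wide_bytes[0::2]


print(widen(b"Hi"))                        # b'H\x00i\x00'
print(narrow(widen(b"Hi")))                # b'Hi'
print(narrow("\u20ac".encode("utf-16-le")))  # b'\xac'
```

The last line shows the loss: the Euro sign (&20AC) keeps only its low byte &AC, which is a different character once the high byte is gone.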

Specifying characters is also an interesting task. "Basic Latin" isn't a problem as it corresponds to 7-bit ASCII (&0000 - &007F), so the automatic conversion could handle that. How do you specify Cyrillic, Greek, box-drawing or mathematical characters though, especially when the high-order byte is often a non-printable character? Perhaps a new conversion function, so if you wanted the currency symbol for the Euro you would use:
Code:
$wsEuroSymbol=CWStr(&20AC)
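CWStr is a proposed function and does not exist in KiXtart. As a rough Python analogue of what it would do, `chr()` builds a character from its code point; the `cwstr` wrapper below is a hypothetical name used only to mirror the proposal.

```python
def cwstr(code_point: int) -> str:
    """Hypothetical analogue of the proposed CWStr(): code point to character."""
    return chr(code_point)


euro = cwstr(0x20AC)   # Euro sign
alpha = cwstr(0x03B1)  # Greek small letter alpha

# The two bytes each character occupies in a UTF-16LE file (low byte first).
print(euro.encode("utf-16-le"))   # b'\xac '
print(alpha.encode("utf-16-le"))  # b'\xb1\x03'
```

For alpha the high-order byte is &03, a control code, which is the kind of non-printable byte mentioned above.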



This is going to be a lot of work, so a short-term measure would be:
  • Update "OPEN", to recognise wide character files, and allow the specification of wide character when creating files.
  • Update "Readline" to silently drop the leading byte of each wide character.
  • Update "Writeline" to prefix each character with Chr(0).


While this doesn't actually provide Unicode support, it will allow the reading and writing of files which contain only the "Basic Latin" and "Latin-1 Supplement" codes which may be sufficient for administration purposes.
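The short-term measure can be sketched in Python as follows. This is an illustration under stated assumptions, not a proposed implementation: the file is taken to be UTF-16 little-endian with an FF FE byte-order mark (so the zero byte follows each character rather than preceding it), lines end in CRLF, and the function names are hypothetical.

```python
BOM_LE = b"\xff\xfe"  # UTF-16 little-endian byte-order mark


def read_lines_narrow(raw: bytes) -> list[str]:
    """'Readline' side: skip the BOM, drop the high byte of every 16-bit
    character, then split the remaining bytes into lines on CRLF."""
    if raw.startswith(BOM_LE):
        raw = raw[2:]
    text = raw[0::2].decode("latin-1")  # low bytes only
    lines = text.split("\r\n")
    if lines and lines[-1] == "":
        lines.pop()  # no extra empty line after the final CRLF
    return lines


def write_lines_wide(lines: list[str]) -> bytes:
    """'Writeline' side: emit the BOM, then pair every character with a zero byte."""
    out = bytearray(BOM_LE)
    for line in lines:
        for b in (line + "\r\n").encode("latin-1"):
            out += bytes((b, 0))
    return bytes(out)


data = write_lines_wide(["[Version]", "Name=caf\u00e9"])
print(read_lines_narrow(data))  # ['[Version]', 'Name=café']
```

Because code points &0000 to &00FF coincide with Latin-1, text in that range survives the round trip; anything outside it cannot be written at all with this scheme.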


Les
(KiX Master)
2005-09-20 10:08 PM
Re: UNICODE support for READLINE

Any chance of KiXing this up a notch? Today, more and more data is written in Unicode, and while we could call upon the FSO gods, it would be nice to support it natively.

While you are at it, any ideas on a way for KiX to detect if the file is unicode or not?


NTDOC (Administrator)
(KiX Master)
2005-09-20 10:32 PM
Re: UNICODE support for READLINE

Well, there is a UDF to detect it, so I'm sure doing it natively would be a walk in the park.

Les
(KiX Master)
2005-09-20 10:42 PM
Re: UNICODE support for READLINE

Yes, the question was directed at Ruud. I know there is a UDF.

I was thinking maybe Open() could return a different code for Unicode.
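One common way to detect such a file is to look at its first bytes for a byte-order mark (BOM). The Python sketch below illustrates the check; it is not the UDF mentioned above, the function name is hypothetical, and a file without a BOM is simply reported as undetected even though it could still be Unicode.

```python
def detect_bom(head: bytes):
    """Return an encoding name based on the leading bytes, or None if no BOM is found."""
    if head.startswith(b"\xef\xbb\xbf"):
        return "utf-8"
    if head.startswith(b"\xff\xfe"):
        # A UTF-32LE BOM also starts with FF FE; that case is ignored in this sketch.
        return "utf-16-le"
    if head.startswith(b"\xfe\xff"):
        return "utf-16-be"
    return None


# In practice 'head' would be the first few bytes of the file,
# for example open(path, "rb").read(4).
print(detect_bom(b"\xff\xfeA\x00"))  # utf-16-le
print(detect_bom(b"plain text"))     # None
```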