Skip to content

Added support for UTF-8 commands on Windows#261

Merged
gwsw merged 3 commits into
gwsw:masterfrom
Konstantin-Glukhov:master
May 2, 2022
Merged

Added support for UTF-8 commands on Windows#261
gwsw merged 3 commits into
gwsw:masterfrom
Konstantin-Glukhov:master

Conversation

@Konstantin-Glukhov

Copy link
Copy Markdown

PeekConsoleInputA() hangs when reading Unicode characters and needs to be replaced with PeekConsoleInputW().
On top of that, ReadConsoleInputW() returns Unicode character in one pass, while Linux read() gets Unicode character one byte at a time, so the windows Unicode character needs to be converted to multibyte to match the Linux read().

One more improvement is to suppress Alt-~ sequence to switch to IME Japanese mode.

@gwsw

gwsw commented Apr 18, 2022

Copy link
Copy Markdown
Owner

Thanks for finding this. I have a few questions about the code.

  1. In line 2943, I think the "while" should be "if". Since there is a return at the end of the clause, the clause can't execute more than once.
  2. I'm a little uneasy about using an unsigned int (utf8) to hold the multibyte string. WideCharToMultiByte expects an LPSTR. For type consistency, I think utf8 should be an unsigned char array. Also, if at some point in the future UTF-8 were changed back to allow 5-char or 6-char strings, using an unsigned int would fail.

@Konstantin-Glukhov

Konstantin-Glukhov commented Apr 18, 2022

Copy link
Copy Markdown
Author
  1. Logically both while and if work the same in this case. Linguistically while makes more sense to me, but if probably makes more sense from the computer science point of view. I have no strong preference and will change it to if.
  2. According to the specs maximum length of UTF-8 encoding is 4 bytes, but adding 4 more bytes would not waste too much memory :). Are you OK changing it to unsigned long? unsigned char array cannot be used with shift operator. It will make the code more complicated to achieve the same result as with word or double word.
  3. if statement on line 2965 is needed because utf8 variable is set only when AsciiChar != UnicodeChar.

Let me know if I should proceed with changes.

@gwsw

gwsw commented Apr 19, 2022

Copy link
Copy Markdown
Owner

"Are you OK changing it to unsigned long?"
My main concern is not to be able to store more than four bytes, but to use the types intended by the Windows API. My preference would be to use a char array even though it requires a couple more lines of code. Using a long pointer as an LPSTR and storing a string of bytes in an unsigned long and using shifts to extract them seems to me to unnecessarily obfuscate what's really happening.

@Konstantin-Glukhov

Copy link
Copy Markdown
Author

OK, I changed the code. See my latest commit.

@gwsw gwsw merged commit 7b977a2 into gwsw:master May 2, 2022
@gwsw

gwsw commented May 2, 2022

Copy link
Copy Markdown
Owner

@Konstantin-Glukhov: if WideCharToMultiByte returns 0, it appears the code returns an uninitialized variable utf8[0]. I think it should return a constant, like (char)0, in that case so that its behavior is consistent. Do you agree?

@Konstantin-Glukhov

Copy link
Copy Markdown
Author

The error condition (WideCharToMultiByte returns 0) should definitely be handled somehow. I don't know the rest of the code and am not sure what will happen with the displayed value on the command line. If it makes sense to you to return 0 then I support it :)

@gwsw

gwsw commented May 2, 2022

Copy link
Copy Markdown
Owner

Yes, I'm happy with WIN32getch returning 0 on error. It will get converted to \340 at the end of getchr() and then ignored.

@Konstantin-Glukhov

Konstantin-Glukhov commented May 3, 2022

Copy link
Copy Markdown
Author

If this error happens while parsing text for a search command, a character in the middle of the text will be ignored and the searched text will not be found even if it is present in the file.

@gwsw

gwsw commented May 4, 2022

Copy link
Copy Markdown
Owner

Maybe I'm not understanding when the error would occur. I thought it would only appear if the user typed some weird character sequence that didn't result in a valid Unicode character.

@Konstantin-Glukhov

Copy link
Copy Markdown
Author

Yes, the invalid input from keyboard or invalid byte sequence from pasting the contents of the system clipboard can potentially cause the error. In my opinion, if the error is detected, then it should be communicated to the interface with some message, e.g. "Input Error", and the current command should be aborted.

@gwsw

gwsw commented May 4, 2022

Copy link
Copy Markdown
Owner

Well, that's basically what will happen if you return a 0. The 0 will get converted to \340 and since \340 is not part of any valid command, less will (I think) beep and abort any current command.

@Konstantin-Glukhov

Copy link
Copy Markdown
Author

Do you want me to make the change, or you will take care of it?

@gwsw

gwsw commented May 5, 2022

Copy link
Copy Markdown
Owner

If you can make the change, that would be great. I don't have a Windows build environment and I don't like to make changes that I can't test.

@Konstantin-Glukhov

Copy link
Copy Markdown
Author

I made the change. See my latest pull request.

As a side note, funcs.h file is missing in the master branch, though it is included unconditionally in less.h. I don't know how anyone can build any OS version from the master branch now. I was only able to test by unzipping less-603-betta.zip from http://greenwoodsoftware.com/less

@gwsw

gwsw commented May 5, 2022

Copy link
Copy Markdown
Owner

Thanks for the change. But shouldn't line 2970 be checking utf8_size rather than utf8_next_byte?

Regarding funcs.h, you have to run "make -f Makefile.aut distfiles" before building from a git clone. See the instructions in README.

@Konstantin-Glukhov

Copy link
Copy Markdown
Author

Sorry, sloppy coding :( I fixed it, see my latest commit.

About Makefile.aut - The README file says "Unix & Linux systems only". How can I run "make -f Makefile.aut distfiles" in Windows environment?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants