Escape additional classes of characters in escape_debug_ext#158286

Open
Jules-Bertholet wants to merge 3 commits into rust-lang:main from Jules-Bertholet:unicode-deprecated

@Jules-Bertholet Jules-Bertholet commented Jun 23, 2026

Contributor

(See also #158057, #155527)

This PR escapes a few additional categories of characters in escape_debug_ext (used to implement char::escape_debug and the various Debug impls for characters and strings):

  • Deprecated characters: Unicode says that the use of these characters is "strongly discouraged".
  • Full composition exclusion characters: These characters cannot occur in strings normalized according to any of the Unicode normalization forms. Any string containing one of these characters is necessarily unnormalized.
  • NFC_Quick_Check=Maybe characters: Depending on context, these characters may appear in NFC-normalized text, or they may be removed by the application of NFC. Most characters in this category are also grapheme extenders, which we already escape, but a few are not. By escaping those remaining characters as well, we ensure that any string which is not normalized according to NFC gets escaped, which seems useful for debugging normalization-related issues. (Note: NFC_Quick_Check=No is equivalent to Full_Composition_Exclusion=Yes.)
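
As a rough illustration of the categories above, the sketch below runs `char::escape_debug` over one sample character per category. The helper `debug_escaped` and the choice of sample characters are illustrative assumptions, not part of the PR. Only the first three assertions reflect behavior that is already stable; the loop simply prints whatever the toolchain in use produces, since that output is what this PR proposes to change.

```rust
// Hypothetical helper (not part of the PR): collects the escape_debug output.
fn debug_escaped(c: char) -> String {
    c.escape_debug().to_string()
}

fn main() {
    // Behavior that is already stable today.
    assert_eq!(debug_escaped('a'), "a"); // printable, left as-is
    assert_eq!(debug_escaped('\n'), "\\n"); // control character, escaped
    assert_eq!(debug_escaped('\u{301}'), "\\u{301}"); // grapheme extender, escaped

    // One sample per category named in this PR. The output is printed rather
    // than asserted, because it depends on whether this change has landed.
    let samples = [
        ('\u{0149}', "Deprecated"),
        ('\u{0958}', "Full_Composition_Exclusion"),
        ('\u{1161}', "NFC_Quick_Check=Maybe, not Grapheme_Extend"),
    ];
    for (c, category) in samples {
        println!("U+{:04X} ({category}): {}", c as u32, debug_escaped(c));
    }
}
```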

Additionally, the second commit makes the Deprecated and Full_Composition_Exclusion data tables unstably public, behind the unicode_discouraged feature gate, so that users don't need to ship duplicate copies of std data tables. Libs-api can feel free to reject if unsure.

Making the NFC_Quick_Check=Maybe table public is left to future work, because the API for that is more complicated (quick_check returns a three-valued enum instead of a boolean, it is unclear whether we want to ship more normalization data alongside it, etc.). However, we do leave room for it by including the whole table, instead of optimizing the implementation by including only the diff from Grapheme_Extend.
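
To make the "3-valued enum" point concrete, here is a minimal sketch of what a quick-check style API could look like. Everything in it is hypothetical: the names `QuickCheck`, `nfc_quick_check_of`, and `quick_check` are invented for illustration, the three-entry match stands in for a real data table, and the canonical-ordering step of the full Unicode quick-check algorithm is deliberately omitted. It is not the API proposed here or anywhere in std.

```rust
/// Hypothetical three-valued result, mirroring the NFC_Quick_Check property values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum QuickCheck {
    Yes,
    No,
    Maybe,
}

/// Toy stand-in for a real property table (assumption: only three entries).
fn nfc_quick_check_of(c: char) -> QuickCheck {
    match c {
        // DEVANAGARI LETTER QA: a full composition exclusion, so NFC_Quick_Check=No.
        '\u{0958}' => QuickCheck::No,
        // COMBINING ACUTE ACCENT and HANGUL JUNGSEONG A: NFC_Quick_Check=Maybe.
        '\u{0301}' | '\u{1161}' => QuickCheck::Maybe,
        _ => QuickCheck::Yes,
    }
}

/// Simplified quick check: `No` wins immediately, otherwise any `Maybe`
/// downgrades the result. The combining-class ordering check is omitted.
fn quick_check(s: &str) -> QuickCheck {
    let mut result = QuickCheck::Yes;
    for c in s.chars() {
        match nfc_quick_check_of(c) {
            QuickCheck::No => return QuickCheck::No,
            QuickCheck::Maybe => result = QuickCheck::Maybe,
            QuickCheck::Yes => {}
        }
    }
    result
}

fn main() {
    println!("{:?}", quick_check("abc"));
    println!("{:?}", quick_check("a\u{301}"));
    println!("{:?}", quick_check("\u{958}"));
}
```

A boolean predicate cannot express the `Maybe` case, which is why this table needs a different API shape from the Deprecated and Full_Composition_Exclusion tables.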


Adds 403 bytes total to the Unicode data tables.

@rustbot label A-Unicode T-libs-api needs-fcp

@rustbot

rustbot commented Jun 23, 2026

Collaborator

library/core/src/unicode/unicode_data.rs is generated by the src/tools/unicode-table-generator tool.

If you want to modify unicode_data.rs, please modify the tool then regenerate the library source file via ./x run src/tools/unicode-table-generator instead of editing unicode_data.rs manually.

@rustbot rustbot added S-waiting-on-review Status: Awaiting review from the assignee but also interested parties. T-libs Relevant to the library team, which will review and decide on the PR/issue. labels Jun 23, 2026
@rustbot

rustbot commented Jun 23, 2026

Collaborator

r? @Darksonn

rustbot has assigned @Darksonn.
They will have a look at your PR within the next two weeks and either review your PR or reassign to another reviewer.

Use r? to explicitly pick a reviewer

Why was this reviewer chosen?

The reviewer was selected based on:

  • Owners of files modified in this PR: libs
  • libs expanded to 12 candidates
  • Random selection from Darksonn, JohnTitor, Mark-Simulacrum, clarfonthey, jhpratt

@rustbot rustbot added A-Unicode Area: Unicode T-libs-api Relevant to the library API team, which will review and decide on the PR/issue. needs-fcp This change is insta-stable, or significant enough to need a team FCP to proceed. labels Jun 23, 2026
@Darksonn

Member

Is this intended to supersede #158057, or be merged before/after that one?

@Jules-Bertholet

Jules-Bertholet commented Jun 23, 2026

Contributor Author

Is this intended to supersede #158057, or be merged before/after that one?

They are conceptually independent changes: that PR removes some characters from the list we escape, and this PR adds some. I'll need to rebase whichever one lands second, though.
