Add new unic-ucd-common component#172
Conversation
|
This is needed in implementation of the Segmentation algorithm. Tracker: #135 For the resolved properties put here, another option is to put them directly under Either way, it should be noted that neither of Numeric nor Alphanumeric are explicitly mentioned in the spec, and it's left to the application to decide how to extract words from boundaries. See http://www.unicode.org/reports/tr29/#Word_Boundaries. |
CAD97
left a comment
There was a problem hiding this comment.
Seems like a good addition; alpha/numeric is something that is commonly referenced.
Though in the case of segmentation:
Numeric: http://www.unicode.org/reports/tr29/#SB_Numeric
And I assume the part of the spec you're referencing is
Letters are not the only characters that can be used to determine the “significant” words; different implementations may include other types of characters such as digits or perform other analysis of the characters.
I don't directly see how these are required for segmentation, but I still think they're worthwhile to expose here.
|
That's The current implementation doesn't have SB char property or algorithm. But, it's a good idea to provide an API that allows application to choose how they want to detect words. A common implementation (and probably with its own function) would be to use
After Word Boundaries are detected, string is split into many pieces, some of which are only white spaces, punctuation, etc. The iterator to return "words" needs to distinguish between useful pieces and useless ones, in this perspective. This is where |
|
Following the lead of the std lib, I think we should put all these 5 functions here:
UPDATE: The This PR adds 3 of them. I can add the rest in another step, or just include here if there's any reason for it. |
Wouldn't this exclude Emoji? I'm pretty sure that what the spec is implying is a check for being whitespace? But then again I've only skimmed over it. I say add in the other two simple components here and then add it. |
There are some core properties that are commonly used in Unicode algorithsm, as well as in applications directly, and are not specific to any single area. Examples of these properties are `Alphabetic` and `White_Space`. Also, there are some *resolved* properties that are used commonly, like *Numeric* and *Alphanumeric*, which are commonly defined as based on *General_Category* and *Alphabetic* properties. Since they are common in applications, it makes sense to provided optimized implementations. This new componet, `unic-ucd-common`, hosts these properties.
|
Okay, I'm adding the other common properties here then before we land.
Yeah, that part of the API should definitely be expanded. I'll keep it minimal while I'm importing the existing code, and we will build on top of that. The only thing we need in the API at the moment is to let user choose the filtering for word iterator. The rest, I think we better keep when we get to string transformation API, which will be in a month or two. |
Add UCD `White_Space` property. Add `is_control()` method following the Rust std lib practice for defining it based on `General_Category = Cc`.
CAD97
left a comment
There was a problem hiding this comment.
Looks ready to me. Just one extremely minor stylistic thing inline.
| macro_rules! assert_char { | ||
| ($ch:expr, $x:expr => $y:expr) => ( | ||
| assert!( | ||
| !$x || $y, |
There was a problem hiding this comment.
These are logical equivalents, but given the below textual assertion, I think this would read better as !($x && !$y). I think that is closer to the actual assertion here: that if we have $x then $y must be true. Maybe, it's better written explicitly: if $x { assert!( $y, ... ) }
But this is a stylistic choice. I'll leave it up to you.
There was a problem hiding this comment.
Thanks, @CAD97, for the comments!
The reason the precondition is inside the assertion macro to be able to write out a meaningful error on failure. First I was only passing in $y to the macro, then realized that the text needs to know about the precondition, and it doesn't make sense to repeat it on each macro call-site.
About changing the logical form, IMHO the current form is more common and easier to read. The double-negation on the other form makes it harder to follow. (I think it's similar to the case we had last week, where we went for the simpler logical formula.)
There's only one thing that I want to change here, though, which is =>, which is a valid operator for the expressions in this context, and therefore, I want to update it to ==>, -> or -->, which cannot be part of the expressions.
Add new integration test to check, for all code-points, consistency between common properties themselves, and versus General_Category.
|
Okay, updated the macro be expand the support, make the syntax more readable by using |
|
Btw, later we should move this to the |
|
bors: r+ |
|
bors: ping |
|
pong |
|
bors: status? |
|
bors: status |
|
It says we're waiting in queue on my side (through the GitHub bit at the bottom) |
|
172: Add new unic-ucd-common component r=CAD97 a=behnam There are some core properties that are commonly used in Unicode algorithsm, as well as in applications directly, and are not specific to any single area. Examples of these properties are `Alphabetic` and `White_Space`. Also, there are some *resolved* properties that are used commonly, like *Numeric* and *Alphanumeric*, which are commonly defined as based on *General_Category* and *Alphabetic* properties. Since they are common in applications, it makes sense to provided optimized implementations. This new componet, `unic-ucd-common`, hosts these properties.
|
Hum, where did you get that log, @CAD97? The thing is that it shows the PR under "Awaiting review" on the bors dashboard, which is not correct. And looks like there's no command to query the latest status from it. |
|
bors: retry |
(source: https://bors.tech/documentation/reference/) So no retry/status command. Just re-r-ing it. Status came from https://app.bors.tech/repositories/871/log. |
|
Not awaiting review |
|
✌️ behnam can now approve this pull request |
|
Not awaiting review |
Canceled |
|
..... Bors I did not say that EDIT: oh... the table confused bors. A lot. 😆 bors: r+ |
172: Add new unic-ucd-common component r=CAD97 a=behnam There are some core properties that are commonly used in Unicode algorithsm, as well as in applications directly, and are not specific to any single area. Examples of these properties are `Alphabetic` and `White_Space`. Also, there are some *resolved* properties that are used commonly, like *Numeric* and *Alphanumeric*, which are commonly defined as based on *General_Category* and *Alphabetic* properties. Since they are common in applications, it makes sense to provided optimized implementations. This new componet, `unic-ucd-common`, hosts these properties.
|
(This is a hot mess, sorry...) |
|
No worries. Looks like "It Who Cannot Be Named" is back to work, at least. |
Build succeeded |
Build succeeded |
|
I updated the reference to describe how to not do that. Thanks for helping me get the docs into shape! 😝 |
There are some core properties that are commonly used in Unicode
algorithsm, as well as in applications directly, and are not specific to
any single area. Examples of these properties are
AlphabeticandWhite_Space.Also, there are some resolved properties that are used commonly, like
Numeric and Alphanumeric, which are commonly defined as based on
General_Category and Alphabetic properties. Since they are common in
applications, it makes sense to provided optimized implementations.
This new componet,
unic-ucd-common, hosts these properties.