Add new unic-ucd-common component by behnam · Pull Request #172 · open-i18n/rust-unic

behnam · 2017-10-02T23:09:12Z

There are some core properties that are commonly used in Unicode
algorithsm, as well as in applications directly, and are not specific to
any single area. Examples of these properties are Alphabetic and
White_Space.

Also, there are some resolved properties that are used commonly, like
Numeric and Alphanumeric, which are commonly defined as based on
General_Category and Alphabetic properties. Since they are common in
applications, it makes sense to provided optimized implementations.

This new componet, unic-ucd-common, hosts these properties.

behnam · 2017-10-02T23:12:54Z

This is needed in implementation of the Segmentation algorithm. Tracker: #135

For the resolved properties put here, another option is to put them directly under unic-segment.

Either way, it should be noted that neither of Numeric nor Alphanumeric are explicitly mentioned in the spec, and it's left to the application to decide how to extract words from boundaries. See http://www.unicode.org/reports/tr29/#Word_Boundaries.

CAD97

Seems like a good addition; alpha/numeric is something that is commonly referenced.

Though in the case of segmentation:

Numeric: http://www.unicode.org/reports/tr29/#SB_Numeric

And I assume the part of the spec you're referencing is

Letters are not the only characters that can be used to determine the “significant” words; different implementations may include other types of characters such as digits or perform other analysis of the characters.

I don't directly see how these are required for segmentation, but I still think they're worthwhile to expose here.

behnam · 2017-10-02T23:24:12Z

Side note: Rust's primitive char type already has similar methods, including:

behnam · 2017-10-02T23:29:53Z

Numeric: http://www.unicode.org/reports/tr29/#SB_Numeric

That's SB::Numeric, as in Sentence Break. The place is_alphanumeric() is called is in Word Break (WB).

The current implementation doesn't have SB char property or algorithm.

But, it's a good idea to provide an API that allows application to choose how they want to detect words. A common implementation (and probably with its own function) would be to use is_alphanumeric(), as defined here.

I don't directly see how these are required for segmentation, but I still think they're worthwhile to expose here.

After Word Boundaries are detected, string is split into many pieces, some of which are only white spaces, punctuation, etc. The iterator to return "words" needs to distinguish between useful pieces and useless ones, in this perspective. This is where has_alphanumeric is used on pieces.

behnam · 2017-10-02T23:32:15Z

Following the lead of the std lib, I think we should put all these 5 functions here:

is_alphabetic()
is_whitespace()
is_alphanumeric()
is_control()
is_numeric()

UPDATE: The is_digit() function is more complicated than a simple Char Property. Let's skip that one.

This PR adds 3 of them. I can add the rest in another step, or just include here if there's any reason for it.

CAD97 · 2017-10-02T23:37:47Z

The iterator to return "words" needs to distinguish between useful pieces and useless ones, in this perspective. This is where has_alphanumeric is used on pieces.

Wouldn't this exclude Emoji? I'm pretty sure that what the spec is implying is a check for being whitespace? But then again I've only skimmed over it. I say add in the other two simple components here and then add it.

There are some core properties that are commonly used in Unicode algorithsm, as well as in applications directly, and are not specific to any single area. Examples of these properties are `Alphabetic` and `White_Space`. Also, there are some *resolved* properties that are used commonly, like *Numeric* and *Alphanumeric*, which are commonly defined as based on *General_Category* and *Alphabetic* properties. Since they are common in applications, it makes sense to provided optimized implementations. This new componet, `unic-ucd-common`, hosts these properties.

behnam · 2017-10-03T05:01:13Z

Okay, I'm adding the other common properties here then before we land.

Wouldn't this exclude Emoji? I'm pretty sure that what the spec is implying is a check for being whitespace? But then again I've only skimmed over it. I say add in the other two simple components here and then add it.

Yeah, that part of the API should definitely be expanded. I'll keep it minimal while I'm importing the existing code, and we will build on top of that. The only thing we need in the API at the moment is to let user choose the filtering for word iterator.

The rest, I think we better keep when we get to string transformation API, which will be in a month or two.

Add UCD `White_Space` property. Add `is_control()` method following the Rust std lib practice for defining it based on `General_Category = Cc`.

CAD97

Looks ready to me. Just one extremely minor stylistic thing inline.

CAD97 · 2017-10-04T06:26:15Z

+macro_rules! assert_char {
+    ($ch:expr, $x:expr => $y:expr) => (
+        assert!(
+            !$x || $y,


These are logical equivalents, but given the below textual assertion, I think this would read better as !($x && !$y). I think that is closer to the actual assertion here: that if we have $x then $y must be true. Maybe, it's better written explicitly: if $x { assert!( $y, ... ) }

But this is a stylistic choice. I'll leave it up to you.

Thanks, @CAD97, for the comments!

The reason the precondition is inside the assertion macro to be able to write out a meaningful error on failure. First I was only passing in $y to the macro, then realized that the text needs to know about the precondition, and it doesn't make sense to repeat it on each macro call-site.

About changing the logical form, IMHO the current form is more common and easier to read. The double-negation on the other form makes it harder to follow. (I think it's similar to the case we had last week, where we went for the simpler logical formula.)

There's only one thing that I want to change here, though, which is =>, which is a valid operator for the expressions in this context, and therefore, I want to update it to ==>, -> or -->, which cannot be part of the expressions.

Add new integration test to check, for all code-points, consistency between common properties themselves, and versus General_Category.

behnam · 2017-10-04T21:42:30Z

Okay, updated the macro be expand the support, make the syntax more readable by using if, and remove yet more duplications and improve runtime perf by only checking preconditions one time.

behnam · 2017-10-04T21:43:38Z

Btw, later we should move this to the utils component, to use in various consistency tests we have. Many of them are too verbose at the moment because of not using something like this.

behnam · 2017-10-04T21:44:09Z

bors: r+

behnam · 2017-10-05T00:20:43Z

bors: ping

bors · 2017-10-05T00:20:44Z

pong

behnam · 2017-10-05T00:20:53Z

bors: status?

behnam · 2017-10-05T00:21:25Z

bors: status

CAD97 · 2017-10-05T00:21:54Z

It says we're waiting in queue on my side (through the GitHub bit at the bottom)

CAD97 · 2017-10-05T00:22:54Z

10/4/2017, 5:44:27 PM	Crash	batch	{:timeout, {GenServer, :call, [BorsNG.GitHub, {:get_file, {{:installation, 45082}, 94862886}, {"staging.tmp", ".github/bors.toml"}}, 5000]}}

(source: https://app.bors.tech/repositories/871/log)

CAD97

bors: r+

(retry?)
EDIT: Status page now says running

172: Add new unic-ucd-common component r=CAD97 a=behnam There are some core properties that are commonly used in Unicode algorithsm, as well as in applications directly, and are not specific to any single area. Examples of these properties are `Alphabetic` and `White_Space`. Also, there are some *resolved* properties that are used commonly, like *Numeric* and *Alphanumeric*, which are commonly defined as based on *General_Category* and *Alphabetic* properties. Since they are common in applications, it makes sense to provided optimized implementations. This new componet, `unic-ucd-common`, hosts these properties.

behnam · 2017-10-05T00:24:47Z

Hum, where did you get that log, @CAD97?

The thing is that it shows the PR under "Awaiting review" on the bors dashboard, which is not correct. And looks like there's no command to query the latest status from it.

behnam · 2017-10-05T00:24:55Z

bors: retry

CAD97 · 2017-10-05T00:28:16Z

Syntax	Description
bors r+	Run the test suite, and push to master if it passes. Short for “reviewed: looks good.”
bors r=	Same as r+, but the “reviewer” in the commit log will be recorded as the user(s) given as the argument.
bors r-	Cancel an r+ or r=.
bors try	Run the test suite, without pushing to master.
bors delegate+	Allow the pull request author to r+ their changes.
bors delegate=	Allow the listed users to r+ this pull request’s changes.
bors ping	Will respond if bors is set up.

(source: https://bors.tech/documentation/reference/)

So no retry/status command. Just re-r-ing it.

Status came from https://app.bors.tech/repositories/871/log.

bors · 2017-10-05T00:28:16Z

Not awaiting review

bors · 2017-10-05T00:28:16Z

✌️ behnam can now approve this pull request

bors · 2017-10-05T00:28:17Z

Not awaiting review

bors · 2017-10-05T00:28:20Z

Canceled

CAD97 · 2017-10-05T00:28:41Z

..... Bors I did not say that

EDIT: oh... the table confused bors. A lot. 😆

bors: r+

172: Add new unic-ucd-common component r=CAD97 a=behnam There are some core properties that are commonly used in Unicode algorithsm, as well as in applications directly, and are not specific to any single area. Examples of these properties are `Alphabetic` and `White_Space`. Also, there are some *resolved* properties that are used commonly, like *Numeric* and *Alphanumeric*, which are commonly defined as based on *General_Category* and *Alphabetic* properties. Since they are common in applications, it makes sense to provided optimized implementations. This new componet, `unic-ucd-common`, hosts these properties.

CAD97 · 2017-10-05T00:29:17Z

(This is a hot mess, sorry...)

behnam · 2017-10-05T00:41:28Z

No worries. Looks like "It Who Cannot Be Named" is back to work, at least.

bors · 2017-10-05T00:45:06Z

Build succeeded

continuous-integration/travis-ci/push

bors · 2017-10-05T00:58:35Z

Build succeeded

continuous-integration/travis-ci/push

notriddle · 2017-10-05T01:29:37Z

I updated the reference to describe how to not do that. Thanks for helping me get the docs into shape! 😝

behnam requested a review from CAD97 October 2, 2017 23:09

behnam self-assigned this Oct 2, 2017

behnam added A: lib-impl Library Implementation C: ucd Unicode Character Database labels Oct 2, 2017

behnam added this to the UNIC-0.7 milestone Oct 2, 2017

CAD97 reviewed Oct 2, 2017

View reviewed changes

behnam force-pushed the ucd-common branch from 8c7f914 to ac3365c Compare October 3, 2017 05:01

[ucd/common] Add White_Space and is_control

cb96897

Add UCD `White_Space` property. Add `is_control()` method following the Rust std lib practice for defining it based on `General_Category = Cc`.

behnam force-pushed the ucd-common branch from 2456352 to c36ba8e Compare October 4, 2017 00:39

CAD97 reviewed Oct 4, 2017

View reviewed changes

[ucd/common] Add consistency test

652f4c0

Add new integration test to check, for all code-points, consistency between common properties themselves, and versus General_Category.

behnam force-pushed the ucd-common branch from c36ba8e to 652f4c0 Compare October 4, 2017 21:41

CAD97 approved these changes Oct 5, 2017

View reviewed changes

bors Bot added a commit that referenced this pull request Oct 5, 2017

Try #172: | Run the test suite, without pushing to master.

f457b12

bors Bot merged commit 652f4c0 into master Oct 5, 2017

behnam deleted the ucd-common branch October 5, 2017 01:38

Conversation

behnam commented Oct 2, 2017

Uh oh!

behnam commented Oct 2, 2017

Uh oh!

CAD97 left a comment

Choose a reason for hiding this comment

Uh oh!

behnam commented Oct 2, 2017

Uh oh!

behnam commented Oct 2, 2017

Uh oh!

behnam commented Oct 2, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CAD97 commented Oct 2, 2017

Uh oh!

behnam commented Oct 3, 2017

Uh oh!

CAD97 left a comment

Choose a reason for hiding this comment

Uh oh!

CAD97 Oct 4, 2017

Choose a reason for hiding this comment

Uh oh!

behnam Oct 4, 2017

Choose a reason for hiding this comment

Uh oh!

behnam commented Oct 4, 2017

Uh oh!

behnam commented Oct 4, 2017

Uh oh!

behnam commented Oct 4, 2017

Uh oh!

behnam commented Oct 5, 2017

Uh oh!

bors Bot commented Oct 5, 2017

Uh oh!

behnam commented Oct 5, 2017

Uh oh!

behnam commented Oct 5, 2017

Uh oh!

CAD97 commented Oct 5, 2017

Uh oh!

CAD97 commented Oct 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CAD97 left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

behnam commented Oct 5, 2017

Uh oh!

behnam commented Oct 5, 2017

Uh oh!

CAD97 commented Oct 5, 2017

Uh oh!

bors Bot commented Oct 5, 2017

Uh oh!

bors Bot commented Oct 5, 2017

Uh oh!

bors Bot commented Oct 5, 2017

Uh oh!

bors Bot commented Oct 5, 2017

Canceled

Uh oh!

CAD97 commented Oct 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

CAD97 commented Oct 5, 2017

Uh oh!

behnam commented Oct 5, 2017 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

bors Bot commented Oct 5, 2017

Build succeeded

Uh oh!

bors Bot commented Oct 5, 2017

Build succeeded

Uh oh!

behnam commented Oct 2, 2017 •

edited

Loading

CAD97 commented Oct 5, 2017 •

edited

Loading

CAD97 left a comment •

edited

Loading

CAD97 commented Oct 5, 2017 •

edited

Loading

behnam commented Oct 5, 2017 •

edited

Loading

notriddle commented Oct 5, 2017 •

edited

Loading