Skip to content

Rewrite parser as recursive descent#591

Merged
binji merged 5 commits into
masterfrom
parser
Aug 15, 2017
Merged

Rewrite parser as recursive descent#591
binji merged 5 commits into
masterfrom
parser

Conversation

@binji

@binji binji commented Aug 11, 2017

Copy link
Copy Markdown
Member
  • Remove Bison dependency
  • Remove pre-generated parser files
  • Rename build config from no-re2c-bison to no-re2c
  • Add a simple make_unique implementation
  • Move handling of module bindings into ir.cc
  • Simplify lexer
    • Remove lookahead, the parser handles this now
    • Unify Token/LexerToken, it only contains terminal values now
    • Refactor setting token type and value into one function (e.g.
      LITERAL, RETURN => RETURN_LITERAL)
  • New Parser
    • Uses two tokens of lookahead (use Peek(), PeekAfter())
    • Consume() consumes one token of any kind
    • Match(t) consumes the current token if it matches
    • PeekMatch(t) returns true iff the token matches, but doesn't consume
    • Basic error synchronization; plenty of room for improvement here

* Remove Bison dependency
* Remove pre-generated parser files
* Rename build config from no-re2c-bison to no-re2c
* Add a simple make_unique implementation
* Move handling of module bindings into ir.cc
* Simplify lexer
  - Remove lookahead, the parser handles this now
  - Unify Token/LexerToken, it only contains terminal values now
  - Refactor setting token type and value into one function (e.g.
    LITERAL, RETURN => RETURN_LITERAL)
* New Parser
  - Uses two tokens of lookahead (use Peek(), PeekAfter())
  - Consume() consumes one token of any kind
  - Match(t) consumes the current token if it matches
  - PeekMatch(t) returns true iff the token matches, but doesn't consume
  - Basic error synchronization; plenty of room for improvement here
@binji binji requested review from KarlSchimpf and sbc100 August 11, 2017 23:48
Comment thread test/parse/bad-toplevel.txt Outdated
(foo bar)
(;; STDERR ;;;
out/test/parse/bad-toplevel.txt:2:2: error: unexpected token "foo"
out/test/parse/bad-toplevel.txt:2:1: error: unexpected token "(", expected a module field or a command.

@jayphelps jayphelps Aug 13, 2017

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I know this is a first pass conversion (and it's awesome) but this one jumped out at me as confusing. Perhaps when an invalid field is discovered it consumes the Lpar before erroring so the more descriptive/accurate token is given?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good call; this is handled in a few other places but I think I must have missed it here. 👍

@KarlSchimpf KarlSchimpf left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I have a few nits (on style/generality). However, this CL is very large (and time consuming to review). If you want to fix the nits in a later CL, that is fine.

Comment thread src/wast-lexer.cc
} \
#define FILL(n) \
do { \
if (Failed(Fill((n)))) { \

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if there were lookahead tokens? In such a case you would return eof too early.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never mind. I realize now that lookahead is handled in the parser.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I moved the lookahead tokens to the parser, so the lexer can always just return the immediate value. Seemed more natural to me this way, since only the parser actually cares about being able to look ahead.

Comment thread src/wast-lexer.cc Outdated
Token lval_;
};
const char* GetTokenTypeName(TokenType token_type) {
const char* s_names[] = {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Shouldn't this be static (i.e. initialized only once)?

Comment thread src/wast-parser.cc
}

bool WastParser::MatchLpar(TokenType type) {
if (PeekMatchLpar(type)) {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not a call to check if PeekAfter() == type?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's what PeekMatchLpar does, but it also checks that Peek() is TokenType::Lpar

Comment thread src/wast-parser.cc
Result WastParser::ParseQuotedText(std::string* text) {
WABT_TRACE(ParseQuotedText);
if (!PeekMatch(TokenType::Text))
return ErrorExpected({"a quoted string"}, "\"foo\"");

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

?foo?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is just for example text. Maybe not needed, but I thought it made the errors a little nicer:

error: unexpected token "(", expected a quoted string (e.g. "foo").

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Verified. Ok.

Comment thread src/wast-parser.cc Outdated
return GetToken().loc;
}

TokenType WastParser::Peek() {

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why doesn't this check if there available tokens?

I assume this is because you are assuming that multiple lookahead only happens in very restricted contexts (such as "(" "token"). If you are making this assumption, make it more clear.

Another choice would be to replace Peek() and PeekAfter() with:

TokenType WasmParser::Peek(size_t n = 0) {
while (tokens_.size() <= n)
tokens_.push_back(lexer_->GetToken(this);
return tokens_.at(n).,token_type;
}

This solution would allow more general peek behavior.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I thought about this, but since we only need two tokens of lookahead, I thought it would be better to make that clear by not making it general. Guess I should add some more documentation about all this.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While I agree that you only need two tokens of look-ahead, the current implementation was making some larger assumptions, such as you are always assuming that at least one token of look-ahead is being maintained (my solution removes this restriction).

I prefer a solution that doesn't have implicit assumptions because they are ALWAYS harder to maintain. In most contexts, an extension will come along that violates it, and then it is much harder to convince yourself what places need to be fixed. I understand that the Peek(0) (or simply Peek()) that I suggest does add an additional test (to see if a token needs to be retrieved), but I don't see that as a major performance cost.

I also like only having one "Peek" method, but that is my feeling.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK, I'll give it a shot :-)

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, PTAL

Comment thread src/wast-parser.cc
return Result::Error; \
} while (0)

#define EXPECT(token_type) CHECK_RESULT(Expect(TokenType::token_type))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not make this a function and let the compiler inline it?

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly, create a "check()" function

void check(Token_type ty) { CHECK_RESULT(expr); }

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and

void ExpectCheck(TokenType ty) { CHECK_RESULT(Expect(ty)); }

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made this a macro mostly because of the magic automatic control flow (since CHECK_RESULT will return on failure)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I didn't read it carefully enough. current form is acceptable.

* Add TokenTypePair for convenience when dealing with two tokens
* Some additional function documentation too

@KarlSchimpf KarlSchimpf left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

I like it!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants