7.5 KiB
Lexer component documentation
The lexer is responsible for providing tokens to the parser. The project comes with two lexers: PhpParser\Lexer
and
PhpParser\Lexer\Emulative
. The latter is an extension of the former, which adds the ability to emulate tokens of
newer PHP versions and thus allows parsing of new code on older versions.
This documentation discusses options available for the default lexers and explains how lexers can be extended.
Lexer options
The two default lexers accept an $options
array in the constructor. Currently only the 'usedAttributes'
option is
supported, which allows you to specify which attributes will be added to the AST nodes. The attributes can then be
accessed using $node->getAttribute()
, $node->setAttribute()
, $node->hasAttribute()
and $node->getAttributes()
methods. A sample options array:
$lexer = new PhpParser\Lexer(array(
'usedAttributes' => array(
'comments', 'startLine', 'endLine'
)
));
The attributes used in this example match the default behavior of the lexer. The following attributes are supported:
comments
: Array ofPhpParser\Comment
orPhpParser\Comment\Doc
instances, representing all comments that occurred between the previous non-discarded token and the current one. Use of this attribute is required for the$node->getComments()
and$node->getDocComment()
methods to work. The attribute is also needed if you wish the pretty printer to retain comments present in the original code.startLine
: Line in which the node starts. This attribute is required for the$node->getLine()
to work. It is also required if syntax errors should contain line number information.endLine
: Line in which the node ends. Required for$node->getEndLine()
.startTokenPos
: Offset into the token array of the first token in the node. Required for$node->getStartTokenPos()
.endTokenPos
: Offset into the token array of the last token in the node. Required for$node->getEndTokenPos()
.startFilePos
: Offset into the code string of the first character that is part of the node. Required for$node->getStartFilePos()
.endFilePos
: Offset into the code string of the last character that is part of the node. Required for$node->getEndFilePos()
.
Using token positions
Note: The example in this section is outdated in that this information is directly available in the AST: While
$property->isPublic()
does not distinguish betweenpublic
andvar
, directly checking$property->flags
for the$property->flags & Class_::VISIBILITY_MODIFIER_MASK) === 0
allows making this distinction without resorting to tokens. However the general idea behind the example still applies in other cases.
The token offset information is useful if you wish to examine the exact formatting used for a node. For example the AST
does not distinguish whether a property was declared using public
or using var
, but you can retrieve this
information based on the token position:
function isDeclaredUsingVar(array $tokens, PhpParser\Node\Stmt\Property $prop) {
$i = $prop->getAttribute('startTokenPos');
return $tokens[$i][0] === T_VAR;
}
In order to make use of this function, you will have to provide the tokens from the lexer to your node visitor using code similar to the following:
class MyNodeVisitor extends PhpParser\NodeVisitorAbstract {
private $tokens;
public function setTokens(array $tokens) {
$this->tokens = $tokens;
}
public function leaveNode(PhpParser\Node $node) {
if ($node instanceof PhpParser\Node\Stmt\Property) {
var_dump(isDeclaredUsingVar($this->tokens, $node));
}
}
}
$lexer = new PhpParser\Lexer(array(
'usedAttributes' => array(
'comments', 'startLine', 'endLine', 'startTokenPos', 'endTokenPos'
)
));
$parser = (new PhpParser\ParserFactory)->create(PhpParser\ParserFactory::ONLY_PHP7, $lexer);
$visitor = new MyNodeVisitor();
$traverser = new PhpParser\NodeTraverser();
$traverser->addVisitor($visitor);
try {
$stmts = $parser->parse($code);
$visitor->setTokens($lexer->getTokens());
$stmts = $traverser->traverse($stmts);
} catch (PhpParser\Error $e) {
echo 'Parse Error: ', $e->getMessage();
}
The same approach can also be used to perform specific modifications in the code, without changing the formatting in other places (which is the case when using the pretty printer).
Lexer extension
A lexer has to define the following public interface:
function startLexing(string $code, ErrorHandler $errorHandler = null): void;
function getTokens(): array;
function handleHaltCompiler(): string;
function getNextToken(string &$value = null, array &$startAttributes = null, array &$endAttributes = null): int;
The startLexing()
method is invoked whenever the parse()
method of the parser is called and is passed the source
code that is to be lexed (including the opening tag). It can be used to reset state or preprocess the source code or tokens. The
passed ErrorHandler
should be used to report lexing errors.
The getTokens()
method returns the current token array, in the usual token_get_all()
format. This method is not
used by the parser (which uses getNextToken()
), but is useful in combination with the token position attributes.
The handleHaltCompiler()
method is called whenever a T_HALT_COMPILER
token is encountered. It has to return the
remaining string after the construct (not including ();
).
The getNextToken()
method returns the ID of the next token (as defined by the Parser::T_*
constants). If no more
tokens are available it must return 0
, which is the ID of the EOF
token. Furthermore the string content of the
token should be written into the by-reference $value
parameter (which will then be available as $n
in the parser).
Attribute handling
The other two by-ref variables $startAttributes
and $endAttributes
define which attributes will eventually be
assigned to the generated nodes: The parser will take the $startAttributes
from the first token which is part of the
node and the $endAttributes
from the last token that is part of the node.
E.g. if the tokens T_FUNCTION T_STRING ... '{' ... '}'
constitute a node, then the $startAttributes
from the
T_FUNCTION
token will be taken and the $endAttributes
from the '}'
token.
An application of custom attributes is storing the exact original formatting of literals: While the parser does retain some information about the formatting of integers (like decimal vs. hexadecimal) or strings (like used quote type), it does not preserve the exact original formatting (e.g. leading zeros for integers or escape sequences in strings). This can be remedied by storing the original value in an attribute:
use PhpParser\Lexer;
use PhpParser\Parser\Tokens;
class KeepOriginalValueLexer extends Lexer // or Lexer\Emulative
{
public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null) {
$tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);
if ($tokenId == Tokens::T_CONSTANT_ENCAPSED_STRING // non-interpolated string
|| $tokenId == Tokens::T_ENCAPSED_AND_WHITESPACE // interpolated string
|| $tokenId == Tokens::T_LNUMBER // integer
|| $tokenId == Tokens::T_DNUMBER // floating point number
) {
// could also use $startAttributes, doesn't really matter here
$endAttributes['originalValue'] = $value;
}
return $tokenId;
}
}