php-parser/doc/component/Lexer.markdown
2014-02-06 20:52:01 +01:00

Lexer component documentation

The lexer is responsible for providing tokens to the parser. The project comes with two lexers: PhpParser\Lexer and PhpParser\Lexer\Emulative. The latter is an extension of the former, which adds the ability to emulate tokens of newer PHP versions and thus allows parsing of new code on older versions.

A lexer has to define the following public interface:

startLexing($code);
getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null);
handleHaltCompiler();

startLexing

The startLexing method is invoked when the parse() method of the parser is called. Its argument will be whatever was passed to the parse() method.

Even though startLexing is meant to accept a source code string, you could, for example, override it to accept a file name instead:

<?php

class FileLexer extends PhpParser\Lexer {
    public function startLexing($fileName) {
        if (!file_exists($fileName)) {
            throw new InvalidArgumentException(sprintf('File "%s" does not exist', $fileName));
        }

        parent::startLexing(file_get_contents($fileName));
    }
}

$parser = new PhpParser\Parser(new FileLexer);

var_dump($parser->parse('someFile.php'));
var_dump($parser->parse('someOtherFile.php'));

getNextToken

getNextToken returns the ID of the next token and sets some additional information in the three variables which it accepts by-ref. If no more tokens are available it has to return 0, which is the ID of the EOF token.
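The contract described above can be pictured with a small stand-in. This is only a sketch: ArrayLexer and collectTokenValues are hypothetical names that are not part of the library, and the token IDs in the usage note are arbitrary. It shows a consumer calling getNextToken until the EOF ID 0 comes back, while the by-ref arguments are filled in on each call.

```php
<?php

// Hypothetical stand-in for a lexer: replays a fixed list of
// array(tokenId, text, line) triples instead of lexing real code.
class ArrayLexer {
    private $tokens;
    private $pos = 0;

    public function __construct(array $tokens) {
        $this->tokens = $tokens;
    }

    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null) {
        $startAttributes = array();
        $endAttributes = array();

        // no more tokens: return 0, the ID of the EOF token
        if (!isset($this->tokens[$this->pos])) {
            return 0;
        }

        list($id, $value, $line) = $this->tokens[$this->pos++];
        $startAttributes['startLine'] = $line;
        $endAttributes['endLine'] = $line;

        return $id;
    }
}

// Hypothetical consumer: pulls tokens until the EOF ID is returned
// and collects the textual content of each token.
function collectTokenValues($lexer) {
    $values = array();
    while (0 !== $lexer->getNextToken($value)) {
        $values[] = $value;
    }
    return $values;
}
```

Feeding it array(array(1, 'echo', 1), array(2, '"hi"', 1), array(3, ';', 1)) yields the three token texts in order, and every later call to getNextToken returns 0.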

The first by-ref variable, $value, should contain the textual content of the token. It is what will be available as $1 etc. in the parser.

The other two by-ref variables $startAttributes and $endAttributes define which attributes will eventually be assigned to the generated nodes: The parser will take the $startAttributes from the first token which is part of the node and the $endAttributes from the last token that is part of the node.

E.g. if the tokens T_FUNCTION T_STRING ... '{' ... '}' constitute a node, then the $startAttributes from the T_FUNCTION token will be taken and the $endAttributes from the '}' token.
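One way to picture this combination is the sketch below. It is an illustration under stated assumptions, not the parser's actual code: nodeAttributes is a hypothetical helper, and each token is represented as an array with hypothetical 'start' and 'end' keys that hold the two attribute arrays the lexer produced for it.

```php
<?php

// Hypothetical helper: compute the attributes of a node that spans the
// given tokens. Assumes $tokens is a non-empty list in source order, where
// each entry is array('start' => array(...), 'end' => array(...)).
function nodeAttributes(array $tokens) {
    $first = reset($tokens); // first token that is part of the node
    $last = end($tokens);    // last token that is part of the node

    // start attributes come from the first token, end attributes from the last
    return $first['start'] + $last['end'];
}
```

For a function whose T_FUNCTION token is on line 3 and whose closing '}' is on line 7, the result carries startLine 3 (and the comments of the first token) together with endLine 7.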

By default the lexer creates the attributes startLine, comments (both part of $startAttributes) and endLine (part of $endAttributes).

If you don't want all these attributes to be added (to reduce memory usage of the AST) you can simply remove them by overriding the method:

<?php

class LessAttributesLexer extends PhpParser\Lexer {
    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null) {
        $tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);

        // only keep startLine attribute
        unset($startAttributes['comments']);
        unset($endAttributes['endLine']);

        return $tokenId;
    }
}

You can also add further attributes. E.g. in conjunction with the above FileLexer you might want to add a fileName attribute to all nodes:

<?php

class FileLexer extends PhpParser\Lexer {
    protected $fileName;

    public function startLexing($fileName) {
        if (!file_exists($fileName)) {
            throw new InvalidArgumentException(sprintf('File "%s" does not exist', $fileName));
        }

        $this->fileName = $fileName;
        parent::startLexing(file_get_contents($fileName));
    }

    public function getNextToken(&$value = null, &$startAttributes = null, &$endAttributes = null) {
        $tokenId = parent::getNextToken($value, $startAttributes, $endAttributes);

        // we could use either $startAttributes or $endAttributes here, because the fileName is always the same
        // (regardless of whether it is the start or end token). We choose $endAttributes, because it is slightly
        // more efficient (as the parser has to keep a stack for the $startAttributes).
        $endAttributes['fileName'] = $this->fileName;

        return $tokenId;
    }
}

handleHaltCompiler

The handleHaltCompiler method is invoked whenever a T_HALT_COMPILER token is encountered. It has to return the string that remains after the construct, not including the (); that follows the __halt_compiler keyword.
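The string handling involved can be sketched as follows. This is a minimal illustration and not the library's implementation: stripHaltCompilerParens is a hypothetical helper, and it assumes the caller has already extracted the source text that follows the T_HALT_COMPILER token.

```php
<?php

// Hypothetical helper: $textAfter is the source text that follows the
// __halt_compiler keyword, still including the "();" part.
function stripHaltCompilerParens($textAfter) {
    // the keyword must be followed by "();", optionally with whitespace in between
    if (!preg_match('~^\s*\(\s*\)\s*;~', $textAfter, $matches)) {
        throw new RuntimeException('__halt_compiler must be followed by "();"');
    }

    // return everything after the matched "();"
    return (string) substr($textAfter, strlen($matches[0]));
}
```

A handleHaltCompiler override could return the result of such a helper. The sketch only covers the "();" form mentioned above.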