Introduction
============

This project is a PHP 5.4 (and older) parser **written in PHP itself**.

What is this for?
-----------------

A parser is useful for [static analysis][0], manipulation of code, and basically any other
application dealing with code programmatically. A parser constructs an [Abstract Syntax Tree][1]
(AST) of the code and thus allows dealing with it in an abstract and robust way.

There are other ways of dealing with source code. One that PHP supports natively is using the
token stream generated by [`token_get_all`][2]. The token stream is much more low-level than
the AST and thus has different applications: it also allows you to analyze the exact formatting of
a file. On the other hand, the token stream is much harder to deal with for more complex analysis.
For example, an AST abstracts away the fact that in PHP variables can be written as `$foo`, but also
as `$$bar`, `${'foobar'}` or even `${!${''}=barfoo()}`. You don't have to worry about recognizing
all the different syntaxes from a stream of tokens.

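To make the difference concrete, here is a minimal sketch that prints the token names PHP's own
tokenizer produces for two of the variable syntaxes mentioned above. It only uses the built-in
`token_get_all` and `token_name` functions; the helper name `tokenNames` is made up for this
illustration and is not part of this project. Note how `$$bar` arrives as two separate tokens that
an analysis working on the token stream would have to recombine itself.

```php
<?php
// Hypothetical helper: returns the name of every token in $code.
function tokenNames($code) {
    $names = array();
    foreach (token_get_all($code) as $token) {
        // A token is either an array (id, text, line) or a single-character string.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

// T_OPEN_TAG, T_VARIABLE, ;
echo implode(', ', tokenNames('<?php $foo;')), "\n";

// T_OPEN_TAG, $, T_VARIABLE, ;
echo implode(', ', tokenNames('<?php $$bar;')), "\n";
```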
Another question is: Why would I want to have a PHP parser *written in PHP*? Well, PHP might not be
a language especially suited for fast parsing, but processing the AST is much easier in PHP than it
would be in other, faster languages like C. Furthermore, the people most likely to want to do
programmatic PHP code analysis are, incidentally, PHP developers, not C developers.

What can it parse?
------------------

The parser uses a PHP 5.4-compliant grammar, but lexing is done using the `token_get_all` tokenization
facility provided by PHP itself. This means that you will be able to parse pretty much any PHP code you
want, but there are some limitations to keep in mind:

* The PHP 5.4 grammar is implemented in such a way that it is backwards compatible, so parsing PHP 5.3
  and PHP 5.2 (and maybe older versions) is also possible. On the other hand, this means that the parser
  will let some code through that would be invalid in the newest version (for example, call-time pass
  by reference will *not* throw an error even though PHP 5.4 doesn't allow it anymore). This shouldn't
  normally be a problem, and if strict checking is required it can easily be implemented in a NodeVisitor.

* Even though the parser supports PHP 5.4, it depends on the internal tokenizer, which only supports
  the PHP version it runs on. So you will be able to parse PHP 5.4 if you are running PHP 5.4, but you
  won't be able to parse PHP 5.4 code (which uses one of the new features) on PHP 5.3. The support
  matrix looks roughly like this:

                      | parsing PHP 5.4 | parsing PHP 5.3 | parsing PHP 5.2
      ----------------------------------------------------------------------
      running PHP 5.4 | yes             | yes             | yes
      running PHP 5.3 | no              | yes             | yes
      running PHP 5.2 | no              | no              | yes

* The parser inherits all bugs of the `token_get_all` function. There are only two that I
  currently know of, namely lexing of `b"$var"` literals and nested HEREDOC strings. The former
  bug is circumvented by the `PHPParser_Lexer` wrapper which the parser uses, but the latter remains
  (though I seriously doubt it will ever occur in practical use).

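The support matrix above boils down to a version comparison: code can be parsed if the running PHP
version is at least as new as the version the code targets. The following is only a sketch of that
rule using the built-in `version_compare` function. The helper `canParse` is hypothetical, not
something this package provides, and it is only meaningful for the versions listed in the matrix.

```php
<?php
// Hypothetical helper mirroring the support matrix: the tokenizer of the
// running PHP version must be at least as new as the code being parsed.
function canParse($runningVersion, $codeVersion) {
    return version_compare($runningVersion, $codeVersion, '>=');
}

var_dump(canParse('5.4', '5.3')); // bool(true)
var_dump(canParse('5.3', '5.4')); // bool(false)

// Check the PHP version this script is actually running on.
var_dump(canParse(PHP_VERSION, '5.4'));
```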
What output does it produce?
----------------------------

The parser produces an [Abstract Syntax Tree][1] (AST), also known as a node tree. What this looks like
can best be seen in an example. The program `<?php echo 'Hi', 'World';` will give you a node tree
roughly looking like this:

```
array(
    0: Stmt_Echo(
        exprs: array(
            0: Scalar_String(
                value: Hi
            )
            1: Scalar_String(
                value: World
            )
        )
    )
)
```

This matches the semantics the program had: an echo statement, which takes two strings as expressions,
with the values `Hi` and `World`.

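To give a feeling for how such a tree is processed, the sketch below walks a stand-in for the tree
shown above and collects every string value. It deliberately uses plain nested arrays rather than
this package's node classes, so the `type` key and the `collectStrings` function are assumptions
made for this illustration only.

```php
<?php
// Stand-in for the node tree above, built from plain arrays.
$tree = array(
    array('type' => 'Stmt_Echo', 'exprs' => array(
        array('type' => 'Scalar_String', 'value' => 'Hi'),
        array('type' => 'Scalar_String', 'value' => 'World'),
    )),
);

// Recursively collect the value of every Scalar_String node.
function collectStrings(array $nodes) {
    $values = array();
    foreach ($nodes as $node) {
        if ($node['type'] === 'Scalar_String') {
            $values[] = $node['value'];
        }
        // Descend into any sub-node lists (here: the "exprs" array).
        foreach ($node as $child) {
            if (is_array($child)) {
                $values = array_merge($values, collectStrings($child));
            }
        }
    }
    return $values;
}

print_r(collectStrings($tree)); // contains "Hi" and "World"
```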
You can also see that the AST does not contain any whitespace or comment information (only doc comments
are saved), so using it for formatting analysis is not possible.

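By contrast, the token stream keeps that information. As a small sketch using only the built-in
`token_get_all` and `token_name` functions, the following concatenates the text of every token and
ends up with the original source again, including the comment and the irregular spacing. The
variable names are chosen for this example only.

```php
<?php
$source = "<?php\n// greet\necho  'Hi' ,  'World';\n";

$rebuilt = '';      // token texts glued back together
$kinds   = array(); // names of all array-style tokens seen
foreach (token_get_all($source) as $token) {
    if (is_array($token)) {
        $kinds[]  = token_name($token[0]);
        $rebuilt .= $token[1];
    } else {
        // Single-character tokens such as "," and ";" are plain strings.
        $rebuilt .= $token;
    }
}

var_dump($rebuilt === $source);                // bool(true)
var_dump(in_array('T_COMMENT', $kinds));       // bool(true)
var_dump(in_array('T_WHITESPACE', $kinds));    // bool(true)
```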
What else can it do?
--------------------

Apart from the parser itself, this package also bundles support for some other, related features:

* Support for pretty printing, which is the act of converting an AST into PHP code. Please note
  that "pretty printing" does not imply that the output is especially pretty. That's just what it's
  called ;)
* Support for serializing and unserializing the node tree to XML
* Support for dumping the node tree in a human-readable form (see the section above for an
  example of what the output looks like)
* Infrastructure for traversing and changing the AST (node traverser and node visitors)
* A node visitor for resolving namespaced names

 [0]: http://en.wikipedia.org/wiki/Static_program_analysis
 [1]: http://en.wikipedia.org/wiki/Abstract_syntax_tree
 [2]: http://php.net/token_get_all