Introduction
============
This project is a PHP 5.4 (and older) parser **written in PHP itself**.

What is this for?
-----------------
A parser is useful for [static analysis][0], for manipulating code, and for basically any other
application that deals with code programmatically. A parser constructs an [Abstract Syntax Tree][1]
(AST) of the code and thus allows you to deal with it in an abstract and robust way.

There are other ways of dealing with source code. One that PHP supports natively is using the
token stream generated by [`token_get_all`][2]. The token stream is much more low-level than
the AST and thus has different applications: it also allows you to analyze the exact formatting of
a file. On the other hand, the token stream is much harder to deal with for more complex analysis.
For example, an AST abstracts away the fact that in PHP variables can be written as `$foo`, but also
as `$$bar`, `${'foobar'}` or even `${!${''}=barfoo()}`. You don't have to worry about recognizing
all the different syntaxes from a stream of tokens.

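
To make the difference concrete, here is a minimal sketch that uses only PHP's built-in tokenizer
(no part of this library). The helper name `tokenNames` is made up for this illustration. The plain
variable should come out as a single `T_VARIABLE` token, while `${'foobar'}` is split into several
tokens that you would have to recognize as a variable yourself:

```php
<?php
// Hypothetical helper for illustration: list the token names that
// token_get_all() produces for a piece of code.
function tokenNames($code)
{
    $names = array();
    foreach (token_get_all($code) as $token) {
        // A token is either array(id, text, line) or a single-character string.
        $names[] = is_array($token) ? token_name($token[0]) : $token;
    }
    return $names;
}

$simple  = tokenNames('<?php $foo;');
$dynamic = tokenNames('<?php ${\'foobar\'};');

// Print both token lists so they can be compared side by side.
echo implode(' ', $simple), "\n";
echo implode(' ', $dynamic), "\n";
```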
Another question is: Why would I want to have a PHP parser *written in PHP*? Well, PHP might not be
a language especially suited for fast parsing, but processing the AST is much easier in PHP than it
would be in other, faster languages like C. Furthermore, the people most likely to want to do
programmatic PHP code analysis are, incidentally, PHP developers, not C developers.

What can it parse?
------------------
The parser uses a PHP 5.4-compliant grammar, but lexing is done using the `token_get_all` tokenization
facility provided by PHP itself. This means that you will be able to parse pretty much any PHP code you
want, but there are some limitations to keep in mind:

* The PHP 5.4 grammar is implemented in such a way that it is backwards compatible. So parsing PHP 5.3
  and PHP 5.2 is also possible (and maybe older versions). On the other hand, this means that the parser
  will let some code through that would be invalid in the newest version (for example, call-time
  pass-by-reference will *not* throw an error even though PHP 5.4 doesn't allow it anymore). This shouldn't
  normally be a problem, and if strict checking is required it can easily be implemented in a NodeVisitor.
* Even though the parser supports PHP 5.4, it depends on the internal tokenizer, which only supports
  the PHP version it runs on. So you will be able to parse PHP 5.4 if you are running PHP 5.4. But you
  won't be able to parse PHP 5.4 code (which uses one of the new features) on PHP 5.3. The support
  matrix looks roughly like this:

                        | parsing PHP 5.4 | parsing PHP 5.3 | parsing PHP 5.2
        ---------------------------------------------------------------------
        running PHP 5.4 | yes             | yes             | yes
        running PHP 5.3 | no              | yes             | yes
        running PHP 5.2 | no              | no              | yes

* The parser inherits all bugs of the `token_get_all` function. There are only two that I
  currently know of, namely the lexing of `b"$var"` literals and of nested HEREDOC strings. The former
  bug is circumvented by the `PHPParser_Lexer` wrapper which the parser uses, but the latter remains
  (though I seriously doubt it will ever occur in practical use).

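
The support matrix above boils down to one rule: the running PHP version must be at least as new as
the version whose features the code uses. A minimal sketch of that rule follows. The function
`canParse` is a made-up name for this illustration and is not part of the library; in practice you
would pass `PHP_VERSION` as the running version.

```php
<?php
// Hypothetical helper (not part of the library): returns true when code using
// features of $targetVersion can be tokenized by PHP $runningVersion.
function canParse($runningVersion, $targetVersion)
{
    // The internal tokenizer only knows syntax up to the version it runs on.
    return version_compare($runningVersion, $targetVersion, '>=');
}

// Rebuild the rows of the support matrix.
$versions = array('5.4', '5.3', '5.2');
foreach ($versions as $running) {
    foreach ($versions as $target) {
        printf(
            "running PHP %s, parsing PHP %s: %s\n",
            $running,
            $target,
            canParse($running, $target) ? 'yes' : 'no'
        );
    }
}
```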
What output does it produce?
----------------------------
The parser produces an [Abstract Syntax Tree][1] (AST), also known as a node tree. What this looks like
can best be seen in an example. The program `<?php echo 'Hi', 'World';` will give you a node tree
roughly looking like this:

```
array(
    0: Stmt_Echo(
        exprs: array(
            0: Scalar_String(
                value: Hi
            )
            1: Scalar_String(
                value: World
            )
        )
    )
)
```

This matches the structure of the program: an echo statement, which takes two strings as expressions,
with the values `Hi` and `World`.

You can also see that the AST does not contain any whitespace or comment information (only doc comments
are saved). So using it for formatting analysis is not possible.

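
By contrast, the raw token stream keeps every character of the source. The following sketch, which
again uses only PHP's built-in tokenizer, concatenates the text of all tokens and compares the result
with the input. It also records which token types occurred. Under the assumption that the tokenizer
behaves as documented, the rebuilt string should equal the original, and both `T_COMMENT` and
`T_WHITESPACE` tokens should be present:

```php
<?php
// Same program as above, but with a comment and irregular spacing.
$code = "<?php\n// greet\necho  'Hi' ,   'World';";

$rebuilt = '';
$kinds   = array();
foreach (token_get_all($code) as $token) {
    if (is_array($token)) {
        // array(id, text, line): keep the text and remember the token type.
        $rebuilt .= $token[1];
        $kinds[token_name($token[0])] = true;
    } else {
        // Single-character tokens such as ',' or ';' are plain strings.
        $rebuilt .= $token;
    }
}

// The token stream is lossless, so formatting analysis is possible here.
var_dump($rebuilt === $code);
var_dump(isset($kinds['T_COMMENT']), isset($kinds['T_WHITESPACE']));
```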
What else can it do?
--------------------
Apart from the parser itself, this package also bundles support for some other, related features:

* Support for pretty printing, which is the act of converting an AST into PHP code. Please note
  that "pretty printing" does not imply that the output is especially pretty. That's just what it's
  called ;)
* Support for serializing and unserializing the node tree to XML
* Support for dumping the node tree in a human-readable form (see the section above for an
  example of what the output looks like)
* Infrastructure for traversing and changing the AST (node traverser and node visitors)
* A node visitor for resolving namespaced names

[0]: http://en.wikipedia.org/wiki/Static_program_analysis
[1]: http://en.wikipedia.org/wiki/Abstract_syntax_tree
[2]: http://php.net/token_get_all