PHP/Zend "string types" patch

This is a proposed extension to the PHP language (and the Zend engine that drives the PHP interpreter). The patch introduces five types of strings: plain string, SQL string, HTML string, URL (query) string and undefined (unknown type) string. The types differ in which characters are escaped: those that have special meaning in SQL (quotes, NUL), in HTML (ampersand, less-than, greater-than, double-quote) or in a URL (nearly everything except plain letters and digits). Conversion between the types is done automatically when requested. This language extension is fully backwards-compatible; users who don't know about the new features (or don't want to know) need not worry: their existing scripts should work the same without any change. For users who do know about it and want to use it, I believe this new feature should bring more readable code, smaller code and fewer bugs.

Download

strtypes-v1.patch.gz - 130kB. The uncompressed patch is about 1.2MB, but most of that volume is changes to the parser, which is generated, so the real hand-made changes are not so big. The patch applies cleanly against the PHP 4.2.0 release.

Apply the patch by the command: "zcat strtypes-v1.patch.gz | patch -p1" in the root directory of the PHP source tree. Then compile as you normally would.

The license is the same as that of PHP, of course. (Whatever that may be.)

Detailed description

Every string in PHP has a type, one of the following: STR_PLAIN, STR_SQL, STR_HTML, STR_QUERY or STR_UNDEFINED. These names are constants both in PHP and in the C source (zend.h). The string type is stored in a new member: "int str_type" of zval.value.str.

New syntax

You create a string of a specific type either by using a special new syntax, or by conversion or casting. The new syntax looks like this:
		$plain = p"When A < 0, A is called 'negative'.";
		echo h"A definition of a math term: <B>$plain</B>";
		mysql_query(s"INSERT INTO table VALUES ('$plain')");
		$url = q"http://example.com/cgi-bin/script?param=$plain";
		
As you may have guessed, the first line creates a plain string, the second line an HTML string, the third line an SQL string and the fourth line a URL query string. This alone doesn't bring anything special. The exciting new thing about this is that on each of these lines except the first, an automatic conversion occurs when a variable is inserted into the string in place of $varname.

So on line 2, you can safely output HTML without calling HtmlSpecialChars on the included string, because this occurs automatically, so the less-than sign will be converted to &lt; on the fly. On line 3, you can safely issue the SQL command without worrying about the inserted string containing apostrophes - they will be escaped with backslashes (or doubled, if you have magic_quotes_sybase set to on) automatically. And on line 4, you can safely pass the string as a URL parameter to a CGI script, because it will, again, be automatically converted.
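For comparison, here is a rough sketch of what the same four lines look like in unpatched PHP, with every conversion written out by hand. It only approximates the behaviour described above: the choice of htmlspecialchars with ENT_COMPAT, addslashes and rawurlencode as stand-ins for the automatic conversions is my assumption, not something taken from the patch.

```php
<?php
// Stock-PHP approximation of the four example lines; no patch required.
$plain = "When A < 0, A is called 'negative'.";

// Line 2 (h"..."): escape & < > " by hand.
$html = "A definition of a math term: <B>" . htmlspecialchars($plain, ENT_COMPAT) . "</B>";

// Line 3 (s"..."): backslash-escape quotes, backslashes and NUL by hand.
$sql = "INSERT INTO table VALUES ('" . addslashes($plain) . "')";

// Line 4 (q"..."): percent-encode everything except letters, digits and -_.~
$url = "http://example.com/cgi-bin/script?param=" . rawurlencode($plain);

echo $html, "\n", $sql, "\n", $url, "\n";
```

Every one of these calls is a place where the escaping can be forgotten or applied twice, which is the kind of bug the typed strings are meant to prevent.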

New functions

There are three new built-in functions that are always available: str_type_get, str_type_set and str_type_convert.

New constants

STR_* - see above.

New semantics

When concatenating strings, the second is converted to the type of the first. When either of the two strings is STR_UNDEFINED, no conversion occurs. The resulting string is always of the same type as the first string.

When comparing strings, they are first converted to the same type. Again, when either of them is STR_UNDEFINED, no conversion occurs.
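The concatenation and comparison rules can be modelled in plain PHP with a small class. This is only a sketch that runs without the patch: TypedStr, its constants and its methods are hypothetical names, only the plain and HTML types are modelled, and converting the second operand to the first operand's type for comparison is my assumption (the text only says both are brought to the same type).

```php
<?php
// Hypothetical userland model; not the patch's implementation.
final class TypedStr
{
    const UNDEFINED = 'undefined';
    const PLAIN = 'plain';
    const HTML = 'html';

    public $text;
    public $type;

    public function __construct(string $text, string $type)
    {
        $this->text = $text;
        $this->type = $type;
    }

    // No conversion if either side is UNDEFINED or the types already match.
    public function convertTo(string $type): TypedStr
    {
        if ($this->type === self::UNDEFINED || $type === self::UNDEFINED || $this->type === $type) {
            return new TypedStr($this->text, $this->type);
        }
        // Go through the plain form: un-escape first, then escape for the target.
        $plain = $this->type === self::HTML
            ? htmlspecialchars_decode($this->text, ENT_COMPAT)
            : $this->text;
        $out = $type === self::HTML ? htmlspecialchars($plain, ENT_COMPAT) : $plain;
        return new TypedStr($out, $type);
    }

    // The second operand is converted; the result keeps the first operand's type.
    public function concat(TypedStr $other): TypedStr
    {
        return new TypedStr($this->text . $other->convertTo($this->type)->text, $this->type);
    }

    // Assumption: compare after converting the second operand to the first's type.
    public function equals(TypedStr $other): bool
    {
        return $this->text === $other->convertTo($this->type)->text;
    }
}

$tag   = new TypedStr('<B>', TypedStr::HTML);
$plain = new TypedStr('A < 0', TypedStr::PLAIN);
$undef = new TypedStr('A < 0', TypedStr::UNDEFINED);

echo $tag->concat($plain)->text, "\n"; // plain operand gets HTML-escaped
echo $tag->concat($undef)->text, "\n"; // undefined operand is appended as-is
```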

Code status

The code is very much in alpha status. It is nearly untested and is very likely to contain bugs. I warn you: this is the first time I have even seen PHP/Zend internals, let alone hacked them. Try it, test it, even use it if you want, but don't blame me for any catastrophes. And be aware that I will probably still make changes to the semantics and maybe even syntax. I do want to hear about the bugs you find, of course.

Help requested

As I said, this is the first time I have hacked PHP/Zend internals, so I can't really say I understand everything. If I may criticize the PHP developers a little, the code could sure use some commenting! Also, there are many macros, but they are rarely used: for example, there is ZVAL_STRING to set a zval to a given string value, but most of the code just sets the zval fields manually instead of using it. This makes it difficult to ensure that every string is initialized to a valid string type. I have searched for "[.>]type = IS_STRING" and added "<zval>.value.str.str_type = STR_UNDEFINED" everywhere, but there are still cases where a string somehow gets an invalid type. You can find these if you turn on reporting of E_NOTICE - see lines 521 and 544 in zend_operators.c, function _convert_to_string_type. This is where I could use some help, because currently I have no idea how to find these cases.
P.S.: Aha! Perhaps I missed some Z_TYPE_P(xxx) = IS_STRING cases - I think I didn't search for those!

Another problem is with the self-test suite that comes with PHP. After making my changes, I tried running "make test" and was quite shocked to find that it reported 45 failed tests. But then I ran it on the unmodified PHP 4.2.0, and the result was the same. So I must be doing something terribly wrong. README.TESTING says "You must build CGI SAPI". After untarring php-4.2.0.tar.bz2, I ran "./configure" without parameters and sure enough, it said (hidden in about a kilometer of useless stuff) "checking for chosen SAPI module... cgi". So I thought that was OK and ran "make test", but the result was "No rule to make target `/root/php/php-4.2.0/sapi/cli/php'". So then I tried "./configure --enable-cli". "make test" then ran, but as I said, 45 tests failed. Since this was on the unmodified PHP, it can't be because I broke PHP with my changes. :-) I must be doing something wrong, or the test suite simply doesn't work. This is another thing where I would like some help.

Also see section 'Known problems' below.

Known problems, things to do and things to think about

As mentioned in the previous section, sometimes strings are not initialized to a valid type. This surely occurs especially in the extensions.

The following outputs '00', but should output '11'. I don't know why.

	echo str_type_get(str_type_set(h"&lt;tag&gt;\n", STR_PLAIN));
	echo str_type_get(str_type_convert(h"&lt;tag&gt;\n", STR_PLAIN));
	

Currently, all strings are STR_UNDEFINED unless converted or cast, except GET/POST/COOKIE parameters, which are STR_SQL when magic_quotes_gpc is on, and STR_PLAIN otherwise. Other strings are unaffected (i.e. strings affected by magic_quotes_runtime are still always STR_UNDEFINED). This should be completely backwards-compatible, because unless you use the new features, only STR_UNDEFINED and exactly _one_ other string type are used, so no conversion should ever occur. We might want to give a type to strings affected by magic_quotes_runtime as well, but this would break compatibility in the case where magic_quotes_runtime is set differently from magic_quotes_gpc. I see two possibilities. One is to simply leave it as it is - after all, magic_quotes_runtime can be turned off at runtime, so it doesn't have the same problem as magic_quotes_gpc, which the programmer may have no way of changing. The other is to give these strings a type, but only when the programmer says so with a run-time setting. This would be better, but of course much more work to do.

Currently, when converting from a STR_SQL, STR_HTML or STR_QUERY string to another of these types, the string is first converted to STR_PLAIN (i.e. the escaped characters are un-escaped) and then to the target type. Perhaps this intermediate step should be skipped? It seems it would be more practical...
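To make the two-step conversion concrete, here is a small sketch in unpatched PHP of an HTML string being converted to an SQL string through the plain form. The function name html_to_sql_via_plain is hypothetical, and htmlspecialchars_decode and addslashes are stand-ins I chose for the patch's own conversion routines.

```php
<?php
// Two-step conversion HTML -> plain -> SQL, mirroring the description above.
function html_to_sql_via_plain(string $html): string
{
    // Step 1: un-escape &amp; &lt; &gt; &quot; to get the plain string.
    $plain = htmlspecialchars_decode($html, ENT_COMPAT);
    // Step 2: escape the plain string for SQL (quotes, backslash, NUL).
    return addslashes($plain);
}

echo html_to_sql_via_plain("say &quot;hi&quot; &amp; don't"), "\n";
```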

Unlike HtmlSpecialChars and HtmlEntities, the string type conversion doesn't handle multi-byte character sets. The conversion from HTML to PLAIN handles only the four HTML entities that the reverse conversion uses (amp, lt, gt, quot).
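A decoder limited to those four entities can be sketched in a few lines of plain PHP. The function name html_to_plain_minimal is hypothetical and this is not the patch's C code, just an illustration of the same restricted mapping.

```php
<?php
// Decode only the four entities named above; everything else passes through.
function html_to_plain_minimal(string $html): string
{
    // strtr() with an array does not rescan its own output, so "&amp;lt;"
    // becomes "&lt;" rather than "<".
    return strtr($html, [
        '&amp;'  => '&',
        '&lt;'   => '<',
        '&gt;'   => '>',
        '&quot;' => '"',
    ]);
}

echo html_to_plain_minimal('&lt;tag&gt; &amp;lt; &copy; &quot;x&quot;'), "\n";
```

Note that an entity outside the four, such as &copy; in the example, is left untouched.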

The conversion from and to a URL query string is equivalent to the "raw" URL encode/decode functions, i.e. it uses %20 as space, not '+'.
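The difference between the "raw" and the non-raw functions is easy to see in unpatched PHP, using only the standard rawurlencode, urlencode, rawurldecode and urldecode functions:

```php
<?php
$value = 'a b+c';

// Encoding: the raw variant writes a space as %20, the other one as '+'.
$raw    = rawurlencode($value);
$cooked = urlencode($value);

// Decoding: the raw variant leaves '+' alone, the other one turns it into a space.
$rawDecoded    = rawurldecode('a+b%20c');
$cookedDecoded = urldecode('a+b%20c');

echo $raw, "\n", $cooked, "\n", $rawDecoded, "\n", $cookedDecoded, "\n";
```

So a literal '+' in a query string survives the typed-string conversion, but a '+' that a browser form sent to mean a space would not be turned back into a space.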

var_dump and print_r should be changed to output the string type specifier. Also, (un)serialization must preserve the string type.

There are surely more unresolved problems and questions that I forgot to mention or that I don't even know of.

Additional information and links

There is a bug database entry (no. 16480), a Zend "Into the Future" forum thread and an announcement in the same forum. And there is a discussion (and another thread - sorry, my fault) going on in the php-dev mailing list.

Feedback

I want to know what you think. Do you like/dislike the idea? Do you have anything to say about the proposed changes? Did you try it and it worked? Great, tell me more! It didn't work? Not so great, but tell me still more! I'm keen to hear from you. And of course, if you are a member of the PHP/Zend team, I would like to know whether you are willing to include this in a future version of PHP. Write to me at vdvo@vdvo.net.


(c) 2000-2003 Vaclav Dvorak
Last modified: 2002-05-25 23:14:31