This Week on p5p 2000/07/02
- Notes
- More Unicode
- Speeding up method lookups
- my __PACKAGE__ $foo
- cfgperl
- Missing Methods
- Signals on Windows
- New File::Spec
- Another depressing regex engine bug
- s/// Appears to be Slower
- perlforce.pod
- \& prototype now works
- Call for Short Doc Patch
- More Bug Bounty
- sprintf tests
- Regression Tests and @INC setting
- asdgdasfasdgasdf;jklaskldhgauklhc dhacb;dh
- Various
Notes
You can subscribe to an email version of this summary by sending an empty message to p5p-digest-subscribe@plover.com.
Please send corrections and additions to mjd-perl-thisweek-YYYYMM@plover.com, where YYYYMM is the current year and month.
More Unicode
Simon continues to generate Unicode patches.
Patch that fixes concatenation operator.
Unicode Handling HOWTO
Simon wrote a clear and amusing summary of what Unicode is and how to deal with it. If you’ve been puzzled by all this Unicode stuff, you should certainly read it.
Unicode Regex Matching
Simon also asked what would happen if you did this:
$b = v300;
v196.172.200 =~ /^$b/;
(This is an issue because the UTF8 representation of $b is actually the two bytes with values 196 and 172.) But Gisle said that of course it should not match, because the target string does not in fact contain character #300.
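To make the byte values concrete, here is a small sketch that assumes a later Perl in which regex matching works on characters (it does not reflect the engine described in this report). The variable names are illustrative, and utf8::encode is used only to expose the UTF8 bytes of character #300.

```perl
use strict;
use warnings;

my $pat    = v300;            # one character, code point 300
my $target = v196.172.200;    # three characters: 196, 172, 200

# Copy the pattern string and turn the copy into its UTF8 byte encoding.
my $bytes = $pat;
utf8::encode($bytes);
my @byte_values = map { ord } split //, $bytes;

# Character-wise match: does the target begin with character #300?
my $matched = ($target =~ /^$pat/) ? 1 : 0;

print "UTF8 bytes of v300: @byte_values\n";
print "target starts with character #300: ", ($matched ? "yes" : "no"), "\n";
```

Under that assumption the bytes come out as 196 and 172, and the match fails, which is the outcome Gisle argued for.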
This led to a brief discussion of what the regex engine should do with UTF8 strings. The problem here goes back to the roots of the UTF8 implementation.
Larry’s original idea was that if use utf8 was in scope, operations would assume that all data was UTF8 strings, and if not, they would assume byte strings. This puts a lot of burden on the programmer and especially on the module writer. For example, suppose you had wanted to write a function that would return true if its argument were longer than 6 characters:
sub is_long {
    my ($s) = @_;
    length($s) > 6;
}
No, that would not work, because if the caller had passed in a UTF8 string, then your answer would be whether the string was longer than six bytes, not six characters. (Remember that characters in a UTF8 string may be longer than one byte each.) You would have had to write something like this instead:
sub is_long {
    my ($s) = @_;
    if (is_utf8($s)) {
        use utf8;           # UTF8 string: length() counts characters
        length($s) > 6;
    } else {
        length($s) > 6;     # byte string: length() counts bytes
    }
}
This approach was abandoned several versions ago, and you can see why. The current approach is that every scalar carries around a flag that says whether it is a UTF8 string or a plain byte string, and operations like length() are overloaded to work on both kinds of strings; length() returns the number of characters in the string whether or not the string is UTF8.
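A short sketch of the per-scalar behavior described above, as it works in later Perls: length() counts characters for a character string and bytes for its encoded form. utf8::is_utf8 and utf8::encode are core functions; the sample string is made up for illustration.

```perl
use strict;
use warnings;

sub is_long {
    my ($s) = @_;
    return length($s) > 6 ? 1 : 0;
}

my $chars = chr(300) x 4;      # four characters, each above 255
my $bytes = $chars;
utf8::encode($bytes);          # the same text as eight UTF8 bytes

printf "flagged as UTF8: %s\n", utf8::is_utf8($chars) ? "yes" : "no";
printf "characters: %d, bytes: %d\n", length($chars), length($bytes);
printf "is_long(chars): %d, is_long(bytes): %d\n",
    is_long($chars), is_long($bytes);
```

The same one-line is_long() gives the character-based answer for the flagged string, with no use utf8 branch needed in the function.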
Now here’s a dirty secret: Overloading the regex engine this way is difficult, and hasn’t been done yet. Regex matching ignores the UTF8 flag in its target. Instead, it uses the old method that was abandoned: if it was compiled with use utf8 in scope, it assumes that its argument is in UTF8 format, and if not, it assumes its argument is a byte string.
The right thing to do here is to fix the regex engine so that its behavior depends on the UTF8 flag in the target. The hard way (but the right way) is to really fix the regex engine. The easier way is to have the regex engine compile everything as if use utf8 was not in scope, and then later on, if it is called on to match a UTF8 string, recompile the regex as if use utf8 had been enabled, and stash that new compiled regex alongside the original one for use with UTF8 strings.
I18N FAQ
Jarkko posted a link to an excellent Perl I18N/L10N FAQ written by James.
Normalization
This led Simon to ask if Perl should have support for normalization. What is normalization? Unicode has a character for the letter ‘e’ (U+0065), and a character for a combining acute accent (U+0301), which looks something like ´ and is called a ‘combining character’ because it combines with the preceding character to yield an accented character; when a string containing a combining acute accent is displayed, the accent should be superimposed on the previous character. But Unicode also has a character for the letter e with an acute accent (U+00E9), é. This should be displayed the same way as the two-character sequence U+0065 U+0301.
Perl does not presently do this, and if you have two strings, produced by pack "U*", 0x65, 0x301 and by pack "U*", 0xE9, it reports them as different, which they certainly are. But clearly, for some applications, you would like them to be considered equivalent, and Perl presently has no built-in function to recognize this.
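For illustration only: later Perl releases ship a Unicode::Normalize module, which was not available when this discussion took place. The sketch below uses its NFC and NFD functions on the letter e followed by the combining acute accent (U+0301) and on the precomposed character U+00E9; the variable names are made up.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC NFD);

my $decomposed = "e\x{301}";   # 'e' followed by a combining acute accent
my $composed   = "\x{e9}";     # precomposed e with acute accent

# Plain string comparison sees two different strings.
my $raw_equal = ($decomposed eq $composed) ? 1 : 0;

# After normalizing both to the composed form they compare equal.
my $nfc_equal = (NFC($decomposed) eq NFC($composed)) ? 1 : 0;

# Decomposing the precomposed character yields two characters.
my $nfd_length = length(NFD($composed));

printf "raw equal: %d, NFC equal: %d, NFD length: %d\n",
    $raw_equal, $nfc_equal, $nfd_length;
```

An application that wants the two spellings treated as equivalent normalizes both sides to the same form before comparing.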
Sarathy said yes, we do want this, but not until the basic stuff is working.
Simon Stops Working on Unicode
Simon announced a temporary halt to his Unicode activities; he is going to work on the line disciplines feature next.
He also said that he would be happy if someone would help him with both Unicode and line disciplines.
Speeding up method lookups
Fergal Daly pointed out that Doug’s patch will break abstract base classes, because it extends the semantics of my Dog $spot to mean something new. Formerly, it meant that $spot was guaranteed to be implemented with a pseudohash, and that the fields in $spot were guaranteed to be a subset of those specified in %Dog::FIELDS. Doug’s patch now adds the meaning that method calls on $spot will be resolved at compile time by looking for them in class Dog. This is a change, because it used to be permissible to assign to $spot an object from some subclass of Dog, say Schnauzer, as long as its fields were laid out in a way that was compatible with %Dog::FIELDS. But now you cannot do that, because when you call $spot->meth you get Dog::meth instead of Schnauzer::meth.
Oops.
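The difference can be sketched in ordinary Perl. This is not Doug's patch: the direct call Dog::speak($spot) merely stands in for what compile-time resolution against Dog would amount to, and the speak method is a made-up example.

```perl
use strict;
use warnings;

package Dog;
sub new   { my $class = shift; return bless {}, $class }
sub speak { return "Dog::speak" }

package Schnauzer;
our @ISA = ('Dog');
sub speak { return "Schnauzer::speak" }

package main;

# Declared as a Dog, but holding an object of a subclass.
my Dog $spot = Schnauzer->new;

my $run_time     = $spot->speak;       # normal lookup, based on ref($spot)
my $compile_time = Dog::speak($spot);  # stand-in for resolving against Dog

print "run-time lookup:     $run_time\n";
print "compile-time lookup: $compile_time\n";
```

The two calls disagree as soon as a subclass overrides the method, which is the breakage Fergal described.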
Some discussion ensued. Sarathy suggested that the optimization only be enabled if, at the end of compilation, Dog has no subclasses. Fergal said it would be a shame to limit it to such cases, and it would not be much harder to enable the optimization for any method that was not overridden in any subclass.
Discussion is ongoing.
my __PACKAGE__ $foo
Doug MacEachern contributed a patch that allows my __PACKAGE__ $foo, where __PACKAGE__ represents the current package name. There was some discussion about whether the benefit was worth the cost of the code bloat. Doug said that it was useful for the same reasons that __PACKAGE__ is useful anywhere else. (As a side note, why is it that the word ‘bloat’ is never used except in connection with three-line patches?)
Andreas Koenig said that it would be even better to allow my CONSTANT $foo where CONSTANT is any compile-time constant at all, such as one that was created by use constant. Doug provided an amended patch to do that also.
Jan Dubois pointed out that this will break existing code that has a compile-time constant with the same name as an existing class. Andreas did not care.
Andreas Koenig: Who uses constants that have the same name as existing and actually used classes isn’t coding cleanly and should be shot anyway.
More persuasively, he pointed out that under such a circumstance, my Foo $x = Foo->new would not work either, because the Foo on the right would be interpreted as a constant instead of as a class name.
Andreas’ explanation of why he wants this feature
Doug then submitted an updated updated patch that enables my Foo:: $x as well.
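A minimal sketch of the two forms, assuming a later Perl that accepts both __PACKAGE__ and a use constant name in the type position of my. The Dog class, the clone_empty function, and the PET_CLASS constant are all invented for the example.

```perl
use strict;
use warnings;

package Dog;
sub new { my $class = shift; return bless {}, $class }

sub clone_empty {
    # __PACKAGE__ names the enclosing package, here Dog.
    my __PACKAGE__ $copy = __PACKAGE__->new;
    return $copy;
}

package main;
use constant PET_CLASS => 'Dog';   # hypothetical constant holding a class name

# The constant serves both as the declared type and as the invocant.
my PET_CLASS $pet = PET_CLASS->new;
my $copy = Dog::clone_empty();

print ref($pet), " ", ref($copy), "\n";
```

The PET_CLASS->new call also shows Andreas' point: a bareword that names a constant is treated as the constant, so the method is invoked on its value, not on a class called PET_CLASS.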
cfgperl
Last week I sent aggrieved email to a number of people asking what cfgperl was and why there appeared to be a secret source repository on Jarkko’s web site that was more up-to-date than the documented source repository. I was concerned that there was an inner circle of development going on with a hidden development branch that was not accessible to the rest of the world.
Jarkko answered me in some detail in email, and then posted to p5p to explain the real situation. cfgperl is simply the name for Jarkko’s private copy of the source, to which he applies patches that he deems worthy. It got ahead of the main repository because Sarathy was resting last month.
Missing Methods
Richard Soderberg responded to my call for a patch for this (see last week’s discussion) and produced one. Thank you very much, Richard!
Signals on Windows
Sarathy said that signals really couldn’t be emulated properly under Windows, but that people keep complaining about it anyway. So he put in a patch that tries to register the signal handler anyway, I guess in hopes of stopping them from complaining.
New File::Spec
Barrie Slaymaker submitted a set of changes to the File::Spec suite.
Another depressing regex engine bug
This can result in backreference variables being set incorrectly when they should be undef. Apparently state is not always restored properly on backtracking.
s/// Appears to be Slower
Perl Lindquist reported an example of s/// that runs much slower in 5.6.0 than in 5.004_03. The regex is bad, so that you would expect a quadratic search, but Mike Guy reported that in fact Perl was doing a cubic search.
Mike’s analysis and shorter test case
perlforce.pod
Simon claims that this document is three years old and that he was only sending a minor update, but I don’t find it in my copy of the development sources.
It is a document about how to use the Perforce repository in which the master copies of the Perl sources reside.
\& prototype now works
Larry sent a patch that permits a function to have \& in its prototype. It appears to be synonymous with &.
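As a rough illustration of a \& prototype in use, here is a sketch based on how backslashed prototype characters behave in later Perls: the caller writes the argument with a leading & and the function receives a code reference. The function names are made up.

```perl
use strict;
use warnings;

sub double { return $_[0] * 2 }

# \& asks for an argument written with a leading &;
# it arrives in @_ as a code reference.
sub apply_to_all (\&@) {
    my ($code, @items) = @_;
    return map { $code->($_) } @items;
}

# &double is passed by reference here, not called.
my @doubled = apply_to_all(&double, 1, 2, 3);
print "@doubled\n";
```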
Call for Short Doc Patch
The sequence \_ in a regex now elicits a warning where it didn’t before. Dominic Dunlop tracked down the patch that introduced this and pointed out that it needs to be documented (in perldelta and possibly perldiag) and probably also needs a test case. But nobody stepped up. Here’s an easy opportunity for someone to contribute a doc patch.
More Bug Bounty
Dominic Dunlop reported an interesting bug in the new printf "%v" specifier. The bug is probably not too difficult to investigate and fix, because it is probably localized to a small part of Perl that does not deal too much with Perl’s special data structures. So it is a good thing for a beginner to work on. Drop me a note if you are interested and if you need help figuring out where to start.
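For readers who have not met the specifier, this sketch shows what the vector flag does in its common %vd form: it formats the ordinal of each character in the string and joins the results with dots. It does not reproduce the reported bug, whose details are not given here.

```perl
use strict;
use warnings;

my $version = sprintf("%vd", v1.22.333);   # ordinals of a three-character v-string
my $letters = sprintf("%vd", "AB");        # ordinals of 'A' and 'B'

print "$version\n";
print "$letters\n";
```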
sprintf tests
Dominic also sent a patch that added 188 new tests to t/op/sprintf.t.
Regression Tests and @INC setting
Some time ago, Nicholas Clark pointed out that many regression tests will fail if you opt not to build all of Perl’s standard extension modules, such as Fcntl.
A sidetrack developed out of Nicholas’ patch to fix this, discussing the best way to make sure that tests get the test version of the library, and not the previously installed version of the library. Nicholas was using
unshift @INC, '../lib';
This is a common idiom in the test files. What’s wrong with it? It leaves the standard directories in @INC, which may not be appropriate, and it assumes that the library is in a sibling directory, so you cannot run the test without being in the t/ directory itself.
There was a little discussion of the right thing to do here. Mike Guy suggested that one solution would be to have the test harness set up the environment properly in the first place. The problem with that is that then you can’t run the tests without the harness. (For example, you might want to run a single test file; at present you can just say perl t/op/dog.t or whatever.)
Sarathy pointed out that having each test file begin with something like
BEGIN { @INC = split(/\|/, $ENV{PERL_TEST_LIB_PATH} || '../lib') }
might solve the problem. Then the harness can set PERL_TEST_LIB_PATH, but you can still run a single test manually if you are in the right place.
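The two cases can be sketched with a small helper. The helper name test_inc_dirs and the directory names are hypothetical; the '|' separator and the '../lib' fallback are taken from the snippet above.

```perl
use strict;
use warnings;

# Hypothetical helper: the directories a test file should place in @INC.
sub test_inc_dirs {
    my ($env_value) = @_;
    # Without a harness, fall back to the sibling lib directory.
    return ('../lib') unless defined $env_value && length $env_value;
    # The pipe is escaped so that it splits on a literal '|'.
    return split /\|/, $env_value;
}

my @manual  = test_inc_dirs(undef);                               # run by hand in t/
my @harness = test_inc_dirs('/build/perl/lib|/build/perl/ext');   # set by a harness

print "manual:  @manual\n";
print "harness: @harness\n";
```

A test file would pass $ENV{PERL_TEST_LIB_PATH} to such a helper inside a BEGIN block and assign the result to @INC.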
asdgdasfasdgasdf;jklaskldhgauklhc dhacb;dh
Another garbage bug report from the Czech republic. It was funny the first time; this time it is substantially less amusing.
Hey, Czech dude! Stop using perlbug to test your keyboard cables, or I will come to your house and chop off all eight of your fingers.
Various
A large collection of bug reports, bug fixes, non-bug reports (you can use a number as a reference!), questions, answers, and a small amount of spam. No flames.
Until next week I remain, your humble and obedient servant,