Author: comdog
Date: Sat Jun 7 17:16:05 2008
New Revision: 11390
Modified:
perlfaq/trunk/perlfaq.pod
perlfaq/trunk/perlfaq6.pod
Log:
* perlfaq6: How do I match XML, HTML, or other nasty, ugly things with a regex?
* suggested by Shlomi Fish since it's a frequent question on IRC
* index with X<sucking out, will to live>
Modified: perlfaq/trunk/perlfaq.pod
==============================================================================
--- perlfaq/trunk/perlfaq.pod (original)
+++ perlfaq/trunk/perlfaq.pod Sat Jun 7 17:16:05 2008
@@ -884,6 +884,10 @@
=item *
+How do I match XML, HTML, or other nasty, ugly things with a regex?
+
+=item *
+
I put a regular expression into $/ but it didn't work. What's wrong?
=item *
Modified: perlfaq/trunk/perlfaq6.pod
==============================================================================
--- perlfaq/trunk/perlfaq6.pod (original)
+++ perlfaq/trunk/perlfaq6.pod Sat Jun 7 17:16:05 2008
@@ -149,6 +149,47 @@
$. = 0 if eof; # fix $.
}
+=head2 How do I match XML, HTML, or other nasty, ugly things with a regex?
+X<regex, XML> X<regex, HTML> X<XML> X<HTML> X<pain> X<frustration>
+X<sucking out, will to live>
+
+(contributed by brian d foy)
+
+If you just want to get work done, use a module and forget about the
+regular expressions. The C<XML::Parser> and C<HTML::Parser> modules
+are good starts, although each namespace has other parsing modules
+specialized for certain tasks and different ways of doing it. Start at
+CPAN Search ( http://search.cpan.org ) and wonder at all the work people
+have done for you already! :)
+
+The problem with things such as XML is that they have balanced text
+containing multiple levels of balanced text, but sometimes it isn't
+balanced text, as in an empty tag (C<< <br/> >>, for instance). Even then,
+things can occur out-of-order. Just when you think you've got a
+pattern that matches your input, someone throws you a curveball.
+
+If you'd like to do it the hard way, scratching and clawing your way
+toward a right answer but constantly being disappointed, besieged by
+bug reports, and weary from the inordinate amount of time you have to
+spend reinventing a triangular wheel, then there are several things
+you can try before you give up in frustration:
+
+=over 4
+
+=item * Solve the balanced text problem from another question in L<perlfaq6>
+
+=item * Try the recursive regex features in Perl 5.10 and later. See L<perlre>
+
+=item * Try defining a grammar using Perl 5.10's C<(?DEFINE)> feature.
+
+=item * Break the problem down into sub-problems instead of trying to use a single regex
+
+=item * Convince everyone not to use XML or HTML in the first place
+
+=back
+
+Good luck!
+
=head2 I put a regular expression into $/ but it didn't work. What's wrong?
X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
X<$RS, regexes in>