Subject: [svn:perlfaq] r11390 - perlfaq/trunk
From: comdog@xxxxxxxxxxxx
Date: Sat, 7 Jun 2008 17:16:06 -0700 (PDT)
Newsgroups: perl.cvs.perlfaq

Author: comdog
Date: Sat Jun  7 17:16:05 2008
New Revision: 11390

Modified:
   perlfaq/trunk/perlfaq.pod
   perlfaq/trunk/perlfaq6.pod

Log:
* perlfaq6: How do I match XML, HTML, or other nasty, ugly things with a regex?
        * suggested by Shlomi Fish since it's a frequent question on IRC
        * index with X<sucking out, will to live>


Modified: perlfaq/trunk/perlfaq.pod
==============================================================================
--- perlfaq/trunk/perlfaq.pod   (original)
+++ perlfaq/trunk/perlfaq.pod   Sat Jun  7 17:16:05 2008
@@ -884,6 +884,10 @@
 
 =item *
 
+How do I match XML, HTML, or other nasty, ugly things with a regex?
+
+=item *
+
 I put a regular expression into $/ but it didn't work. What's wrong?
 
 =item *

Modified: perlfaq/trunk/perlfaq6.pod
==============================================================================
--- perlfaq/trunk/perlfaq6.pod  (original)
+++ perlfaq/trunk/perlfaq6.pod  Sat Jun  7 17:16:05 2008
@@ -149,6 +149,47 @@
                $. = 0 if eof;  # fix $.
        }
 
+=head2 How do I match XML, HTML, or other nasty, ugly things with a regex?
+X<regex, XML> X<regex, HTML> X<XML> X<HTML> X<pain> X<frustration>
+X<sucking out, will to live>
+
+(contributed by brian d foy)
+
+If you just want to get work done, use a module and forget about the
+regular expressions. The C<XML::Parser> and C<HTML::Parser> modules
+are good starts, although each namespace has other parsing modules
+specialized for certain tasks and different ways of doing it. Start at
+CPAN Search ( http://search.cpan.org ) and wonder at all the work people
+have done for you already! :)
+
+The problem with things such as XML is that they have balanced text
+containing multiple levels of balanced text, but sometimes it isn't
+balanced text, as in an empty tag (C<< <br/> >>, for instance). Even then,
+things can occur out-of-order. Just when you think you've got a
+pattern that matches your input, someone throws you a curveball. 
+
+If you'd like to do it the hard way, scratching and clawing your way
+toward a right answer but constantly being disappointed, besieged by
+bug reports, and weary from the inordinate amount of time you have to
+spend reinventing a triangular wheel, then there are several things
+you can try before you give up in frustration:
+
+=over 4
+
+=item * Solve the balanced text problem from another question in L<perlfaq6>
+
+=item * Try the recursive regex features in Perl 5.10 and later. See L<perlre>
+
+=item * Try defining a grammar using Perl 5.10's C<(?(DEFINE)...)> feature.
+
+=item * Break the problem down into sub-problems instead of trying to use a single regex
+
+=item * Convince everyone not to use XML or HTML in the first place
+
+=back
+
+Good luck!
+
 =head2 I put a regular expression into $/ but it didn't work. What's wrong?
 X<$/, regexes in> X<$INPUT_RECORD_SEPARATOR, regexes in>
 X<$RS, regexes in>
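
To make the "recursive regex" and grammar items in the new answer concrete, here is a minimal sketch of what such a pattern might look like. The grammar, the rule names (name, text, empty, open, close, element, content) and the helper is_well_nested() are all hypothetical, invented for illustration; they are not part of the commit or of perlfaq6. The toy grammar ignores attributes, comments, CDATA, entities, and everything else that makes real markup hard.

```perl
use strict;
use warnings;
use 5.010;    # named recursion and (?(DEFINE)...) need Perl 5.10 or later

# Hypothetical toy grammar: an element is either an empty tag such as
# <br/>, or an open tag, any mix of text and nested elements, and then
# a close tag. Tag names are NOT checked against each other.
my $well_nested = qr{
    (?(DEFINE)
        (?<name>    [A-Za-z][A-Za-z0-9]* )
        (?<text>    [^<>]++ )                 # possessive, to limit backtracking
        (?<empty>   < (?&name) / > )
        (?<open>    < (?&name) > )
        (?<close>   </ (?&name) > )
        (?<element> (?&empty) | (?&open) (?&content)* (?&close) )
        (?<content> (?&text) | (?&element) )
    )
    \A (?&element) \z
}x;

# Returns true if the whole string is a single element under the toy grammar.
sub is_well_nested {
    my ($string) = @_;
    return $string =~ $well_nested ? 1 : 0;
}

for my $sample ( '<p>hello <br/> world</p>', '<p>unclosed', '<a><b>x</a></b>' ) {
    say is_well_nested($sample) ? 'match:    ' : 'no match: ', $sample;
}
```

The last sample is accepted even though its close tags are in the wrong order, because this sketch only counts nesting depth. That is the kind of curveball the answer warns about, and one more reason to reach for a real parser.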