Saturday, September 20, 2008

Remove XML/HTML tags using a regular expression

I'm sharing my slow journey through the wonders of regular expressions here. Here's something I learned recently: searching some text for all instances of the regular expression
<.*?>
and replacing each match with nothing will quickly remove most XML tags.
(The key item here is the ?, which makes the expression non-greedy. A plain <.*> would grab everything from the first opening angle bracket to the last closing angle bracket on a line; with the ?, each match stops at the very next closing angle bracket.)
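To make the greedy versus non-greedy difference concrete, here is a small sketch in Python rather than emacs. The sample line is made up for illustration, and Python's re module is only standing in for replace-regexp here; the two patterns are the ones discussed above.

```python
import re

# Hypothetical sample line, invented for this illustration.
line = "<p>Hello, <b>world</b>!</p>"

# Greedy: .* runs from the first "<" to the last ">" on the line,
# so the whole line is treated as one big "tag" and removed.
greedy = re.sub(r"<.*>", "", line)

# Non-greedy: .*? stops at the next ">", so each tag is matched
# separately and the text between the tags survives.
lazy = re.sub(r"<.*?>", "", line)

print(repr(greedy))  # ''
print(repr(lazy))    # 'Hello, world!'
```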
I used this in emacs with the replace-regexp command, and it was like magic: quick and easy.
I haven't tested this exhaustively, so maybe someone else can improve it. When I ran it over a whole file, some tags that extended across lines didn't get removed, presumably because . does not match a newline. In my case this only affected some of the meta information at the top of the file.
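One possible improvement for tags that span lines, sketched in Python under the assumption that the cause is . refusing to match a newline: either let . match newlines (the re.DOTALL flag in Python), or use a negated character class such as <[^>]*>, which crosses line breaks on its own. The sample text below is invented, and this is still not a real parser, so comments or attribute values containing > will trip it up.

```python
import re

# Hypothetical sample: the meta tag is split across two lines.
text = '<meta name="description"\n      content="demo">\n<p>Hello</p>'

# Original pattern: "." will not cross the newline, so the
# two-line meta tag is left in place.
single_line = re.sub(r"<.*?>", "", text)

# Option 1: let "." match newlines as well.
dotall = re.sub(r"<.*?>", "", text, flags=re.DOTALL)

# Option 2: match anything that is not ">", which includes newlines.
negated = re.sub(r"<[^>]*>", "", text)

print(repr(single_line))  # meta tag still present, <p> tags removed
print(repr(dotall))       # '\nHello'
print(repr(negated))      # '\nHello'
```

The negated character class should also work in emacs, since [^>] matches a newline there too.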

2 comments:

  1. Or use xslt:

    <xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
    <xsl:output method="text"/>
    <xsl:template match="*"/>
    </xsl:stylesheet>

  2. Oops, make that:

    <xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
    <xsl:output method="text"/>
    <xsl:template match="*">
    <xsl:value-of select="normalize-space(.)"/>
    </xsl:template>
    </xsl:stylesheet>
