Anecdotal Evidence: Remove XML/HTML tags using a regular expression

Saturday, September 20, 2008

Remove XML/HTML tags using a regular expression

I'm sharing my slow journey through the wonders of regular expressions here. Here's something I learned recently: searching some text for all instances of the regular expression

<.*?>

and replacing it with nothing will quickly remove most XML tags.

(The key item here is the ?, which makes the expression non-greedy. A plain <.*> would grab everything between the first opening angle bracket and the last closing angle bracket on a line; the ? makes the expression stop at the very next closing angle bracket.)
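To see the difference between the greedy and non-greedy forms, here is a minimal sketch in Python rather than Emacs. The sample string is made up for illustration, and Python's re module is only a stand-in for replace-regexp.

```python
import re

# Hypothetical one-line sample; any tagged text would do.
sample = "<p>Hello, <b>world</b>!</p>"

# Greedy: .* runs to the LAST '>' on the line, so the whole line
# is treated as one giant "tag" and removed.
greedy = re.sub(r"<.*>", "", sample)

# Non-greedy: .*? stops at the NEXT '>', so each tag is removed
# on its own and the text between tags survives.
lazy = re.sub(r"<.*?>", "", sample)

print(repr(greedy))  # ''
print(repr(lazy))    # 'Hello, world!'
```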

I used this in Emacs with the replace-regexp command, and it was like magic: quick and easy.

I haven't tested this exhaustively, so maybe someone else can improve it. When I ran it on a whole file, some tags that extended across lines didn't get removed, presumably because . does not match a newline in Emacs regular expressions. This only affected some of the meta information at the top of my file, though.
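The multi-line miss is easy to reproduce. The sketch below again uses Python instead of Emacs, with a made-up sample whose meta tag spans two lines. It tries two possible workarounds: letting . match newlines (re.DOTALL, a Python-specific flag), or replacing .*? with the negated class [^>]*, which matches newlines on its own.

```python
import re

# Hypothetical sample: the <meta> tag is split across two lines.
sample = '<meta name="generator"\n      content="example">\n<p>Body text</p>'

# Default behaviour: '.' does not match a newline, so the
# two-line <meta> tag is left in place.
single_line = re.sub(r"<.*?>", "", sample)

# Workaround 1: let '.' match newlines as well.
dotall = re.sub(r"<.*?>", "", sample, flags=re.DOTALL)

# Workaround 2: "anything except '>'" also covers newlines,
# so no flag is needed.
negated = re.sub(r"<[^>]*>", "", sample)

print(repr(single_line))
print(repr(dotall))
print(repr(negated))
```

The negated-class form should carry over to Emacs as well, since a negated character class there can match a newline. Neither variant copes with a literal > inside an attribute value or a comment, so "most tags" is still the honest claim.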

2 comments:

Outis, 6:10 PM
Or use xslt:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="*"/>
</xsl:stylesheet>
Outis, 6:11 PM
Oops, make that:

<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="*">
    <xsl:value-of select="normalize-space(.)"/>
  </xsl:template>
</xsl:stylesheet>