I'm sharing my slow journey through the wonders of regular expressions here. Here's something I learned recently: searching some text for all instances of the regular expression
<.*?>
and replacing it with nothing will quickly remove most XML tags.
(The key item here is the ?, which makes the expression non-greedy. A plain <.*> is greedy: it would grab everything from the first < to the last > on a line. The ? makes the expression stop at the very next >.)
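To see the difference outside Emacs, here is a minimal sketch in Python (my choice of language for illustration; the sample string is made up). Python's re module uses the same ? suffix to make a quantifier non-greedy.

```python
import re

# Made-up sample line with nested tags.
sample = "<p>Hello <b>world</b></p>"

# Greedy: .* runs to the last > on the line, so the whole line is one match.
greedy = re.sub(r"<.*>", "", sample)

# Non-greedy: .*? stops at the very next >, so each tag is matched separately.
non_greedy = re.sub(r"<.*?>", "", sample)

print(repr(greedy))      # ''
print(repr(non_greedy))  # 'Hello world'
```

The greedy version throws away the text along with the tags, which is why the ? matters here.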
I used this in Emacs with the replace-regexp command, and it was like magic: quick and easy.
I haven't tested this exhaustively, so maybe someone else can improve it. When I ran it on a whole file, some tags that extended across lines didn't get removed, most likely because . does not match a newline. In my case this only affected some of the meta information at the top of the file.
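One possible improvement for tags that span lines, sketched in Python with a made-up sample: replace .*? with the negated character class [^>]*, which also matches newlines. This is an assumption about what would help rather than something tested on the original file. It will still misfire on a > inside an attribute value, a comment, or a CDATA section.

```python
import re

# Made-up sample: one tag that is split across two lines, then some text.
sample = '<meta name="generator"\n      content="example">\nSome text'

# . does not match a newline by default, so the split tag is left alone.
same_line_only = re.sub(r"<.*?>", "", sample)

# [^>]* matches any character except >, newlines included.
across_lines = re.sub(r"<[^>]*>", "", sample)

# Alternative in Python: keep .*? but let . match newlines with re.DOTALL.
dotall = re.sub(r"<.*?>", "", sample, flags=re.DOTALL)

print(same_line_only == sample)  # True
print(repr(across_lines))        # '\nSome text'
print(across_lines == dotall)    # True
```

In Emacs regular expressions a negated character class such as [^>] matches a newline too, so <[^>]*> should be usable with replace-regexp as well.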
Or use xslt:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="text"/>
<xsl:template match="*"/>
</xsl:stylesheet>
Oops, make that:
<xsl:stylesheet
xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
version="1.0">
<xsl:output method="text"/>
<xsl:template match="*">
<xsl:value-of select="normalize-space(.)"/>
</xsl:template>
</xsl:stylesheet>
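For comparison, here is a rough Python sketch of what the corrected stylesheet does: parse the document, collect all of its text, and collapse runs of whitespace. The function name xml_to_text and the sample document are hypothetical, and Python's split() treats a few more characters as whitespace than XPath's normalize-space does, so treat it as an approximation rather than an exact equivalent.

```python
import xml.etree.ElementTree as ET


def xml_to_text(xml_source: str) -> str:
    """Return the text content of an XML document with whitespace collapsed.

    Hypothetical helper, roughly mirroring normalize-space(.) applied to the
    root element.
    """
    root = ET.fromstring(xml_source)
    # itertext() yields every text fragment inside the element, in document order.
    raw = "".join(root.itertext())
    # split() with no arguments drops leading/trailing whitespace and splits
    # on runs of whitespace; joining with one space collapses each run.
    return " ".join(raw.split())


# Made-up sample document.
doc = "<doc><title>Regex notes</title>\n  <p>Remove <b>all</b> tags.</p></doc>"
text = xml_to_text(doc)
print(text)  # Regex notes Remove all tags.
```

Unlike the regular expression, this approach needs well-formed XML, but it is not confused by tags that span lines.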