Feature or enhancement
HTMLParser recognizes a CDATA section <![CDATA[...]]> in any context. According to the HTML5 specification, it should only be recognized in foreign content -- the content of svg and math elements. Elsewhere, <![CDATA[ starts a bogus comment which ends at the first >, not at ]]>. Using the wrong ending condition can make the parser see a different document structure than browsers do, which can have security consequences. This is the last unresolved item of gh-135661. The fix in #135665 was unsatisfactory: it pushed the problem onto the user, who has to maintain the tracking mechanism outside of HTMLParser and call the new private method _set_support_cdata().
I propose to automatically detect foreign content in HTMLParser itself, by following start and end tags, approximating the tree construction dispatcher and the rules for parsing tokens in foreign content.
>>> parser.feed('<![CDATA[a > b]]>') # bogus comment: comment '[CDATA[a '
>>> parser.feed('<svg><![CDATA[a > b]]>') # CDATA section: unknown decl 'CDATA[a > b'
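To try the two inputs above on a particular Python version, a small recording subclass is enough. This is only a sketch: RecordingParser and parse() are hypothetical helper names, not part of html.parser, and the events reported for the first input depend on whether the interpreter includes the proposed change.

```python
from html.parser import HTMLParser


class RecordingParser(HTMLParser):
    """Hypothetical helper: collects the callbacks relevant to CDATA handling."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        # A bogus comment is reported through the ordinary comment callback.
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        # A recognized CDATA section is reported here.
        self.events.append(("unknown_decl", data))


def parse(markup):
    """Feed markup to a fresh parser and return the recorded events."""
    parser = RecordingParser()
    parser.feed(markup)
    parser.close()
    return parser.events


for markup in ("<![CDATA[a > b]]>", "<svg><![CDATA[a > b]]>"):
    print(repr(markup), "->", parse(markup))
```

A fresh parser is used for each input so that state from the first document does not leak into the second.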
This also fixes RAWTEXT and RCDATA elements in foreign content: <svg><title>a<b>c</b></title> contains a b element, but HTMLParser currently parses the title content as text.
The new constructor parameter support_cdata controls this: None (default) -- automatic detection; True -- a CDATA section is recognized in any context, foreign content is not detected (the previous default behavior); False -- a CDATA section is never recognized. Calling _set_support_cdata() disables the automatic detection, so existing code which maintains its own tracking machinery works as before.
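For reference, the kind of user-side tracking that #135665 expects might look roughly like the sketch below. ForeignTrackingParser, FOREIGN_ROOTS, foreign_depth and _sync_cdata_flag are hypothetical names chosen for illustration. The private _set_support_cdata() is looked up with getattr() because it only exists in versions that include that fix. The sketch only counts nested svg and math elements and ignores HTML integration points and tags that break out of foreign content, so it is much cruder than the detection proposed here.

```python
from html.parser import HTMLParser

# Elements whose content is foreign content (simplified assumption).
FOREIGN_ROOTS = {"svg", "math"}


class ForeignTrackingParser(HTMLParser):
    """Hypothetical subclass that tracks foreign content by itself."""

    def __init__(self):
        super().__init__()
        self.foreign_depth = 0
        self.events = []
        self._sync_cdata_flag()

    def _sync_cdata_flag(self):
        # _set_support_cdata() is private and absent in older versions,
        # so call it only when the running interpreter provides it.
        setter = getattr(self, "_set_support_cdata", None)
        if setter is not None:
            setter(self.foreign_depth > 0)

    def handle_starttag(self, tag, attrs):
        if tag in FOREIGN_ROOTS:
            self.foreign_depth += 1
            self._sync_cdata_flag()
        # Record whether the element is inside (or opens) foreign content.
        self.events.append((tag, self.foreign_depth > 0))

    def handle_endtag(self, tag):
        # Ignore stray end tags so the counter never goes negative.
        if tag in FOREIGN_ROOTS and self.foreign_depth > 0:
            self.foreign_depth -= 1
            self._sync_cdata_flag()


parser = ForeignTrackingParser()
parser.feed("<p>x<svg><circle></circle></svg>y</p>")
parser.close()
print(parser.events)
```

Every HTMLParser user who cares about CDATA would need to carry something like this, which is the burden the automatic detection is meant to remove.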
Has this already been discussed elsewhere?
The last item of gh-135661, discussed also in #135665. Related: gh-137877, gh-140878.