Detect foreign content in HTMLParser for context-dependent parsing of CDATA sections #153027

Description

@serhiy-storchaka

Feature or enhancement

HTMLParser recognizes a CDATA section <![CDATA[...]]> in any context. According to the HTML5 specification, it should only be recognized in foreign content -- the content of svg and math elements. Otherwise <![CDATA[ starts a bogus comment, which ends at the first >, not at ]]>. Using the wrong ending condition can make the parser see a different document structure than browsers do, which can have security consequences. This is the last unresolved item of gh-135661. The fix in #135665 was unsatisfactory: it only shifted the burden to the user, who is expected to track foreign content outside of HTMLParser and call the new private method _set_support_cdata().
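To make the two ending rules concrete, here is a minimal string-level sketch. It does not use HTMLParser at all; the helper name `split_cdata_like` and its return shape are hypothetical and exist only for illustration.

```python
def split_cdata_like(text, in_foreign):
    """Split markup starting with '<![CDATA[' into (kind, content, rest).

    Hypothetical helper, not part of html.parser. Returns None when the
    terminator has not been seen yet.
    """
    prefix = "<![CDATA["
    if not text.startswith(prefix):
        raise ValueError("text must start with '<![CDATA['")
    if in_foreign:
        # Foreign content: a real CDATA section, terminated by ']]>'.
        end = text.find("]]>", len(prefix))
        if end < 0:
            return None
        return ("cdata", text[len(prefix):end], text[end + 3:])
    # HTML content: a bogus comment, terminated by the first '>'.
    # The comment text starts right after '<!'.
    end = text.find(">", 2)
    if end < 0:
        return None
    return ("comment", text[2:end], text[end + 1:])


print(split_cdata_like("<![CDATA[a > b]]>", in_foreign=False))
print(split_cdata_like("<![CDATA[a > b]]>", in_foreign=True))
```

With the same input, the HTML rule stops at the first `>` and leaves ` b]]>` to be parsed as ordinary markup, which is exactly where a parser and a browser can disagree about the document structure.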

I propose to automatically detect foreign content in HTMLParser itself, by following start and end tags, approximating the tree construction dispatcher and the rules for parsing tokens in foreign content.
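As a rough illustration of what "following start and end tags" could look like, here is a user-side sketch built on the public handler methods. It is an assumption-laden simplification, not the proposed implementation: the class name `ForeignTracker` and the attribute `foreign_stack` are hypothetical, and it ignores HTML integration points and the tags that break out of foreign content.

```python
from html.parser import HTMLParser

# Elements whose content is foreign content in HTML5.
FOREIGN_ROOTS = {"svg", "math"}


class ForeignTracker(HTMLParser):
    """Hypothetical sketch: tracks whether the parser is inside svg/math."""

    def __init__(self):
        super().__init__()
        self.foreign_stack = []  # currently open svg/math elements

    @property
    def in_foreign(self):
        return bool(self.foreign_stack)

    def handle_starttag(self, tag, attrs):
        # HTMLParser passes tag names lowercased.
        if tag in FOREIGN_ROOTS:
            self.foreign_stack.append(tag)

    def handle_endtag(self, tag):
        if tag in self.foreign_stack:
            # Pop up to and including the matching root element.
            while self.foreign_stack.pop() != tag:
                pass


tracker = ForeignTracker()
tracker.feed("<p><svg><g></g>")
print(tracker.in_foreign)
tracker.feed("</svg>")
print(tracker.in_foreign)
```

Self-closing tags such as `<svg/>` go through the default `handle_startendtag()`, which calls both handlers, so they push and pop in one step.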

>>> parser.feed('<![CDATA[a > b]]>')       # bogus comment: comment '[CDATA[a '
>>> parser.feed('<svg><![CDATA[a > b]]>')  # CDATA section: unknown decl 'CDATA[a > b'
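The session above omits the parser definition. A minimal sketch for reproducing it follows; `EventRecorder` and `record` are hypothetical names, and the events you see depend on the Python version, since the comments above describe the proposed behavior rather than what every released version does.

```python
from html.parser import HTMLParser


class EventRecorder(HTMLParser):
    """Hypothetical helper: records callbacks as (kind, value) tuples."""

    def __init__(self):
        super().__init__()
        self.events = []

    def handle_starttag(self, tag, attrs):
        self.events.append(("starttag", tag))

    def handle_data(self, data):
        self.events.append(("data", data))

    def handle_comment(self, data):
        # Bogus comments are reported here.
        self.events.append(("comment", data))

    def unknown_decl(self, data):
        # CDATA sections are reported here.
        self.events.append(("unknown_decl", data))


def record(markup):
    """Parse markup with a fresh parser and return the recorded events."""
    parser = EventRecorder()
    parser.feed(markup)
    parser.close()
    return parser.events


top_level = record("<![CDATA[a > b]]>")
in_svg = record("<svg><![CDATA[a > b]]>")
print(top_level)
print(in_svg)
```

Comparing the two printed lists on a given interpreter shows whether that version distinguishes the HTML and foreign-content cases.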

This also fixes RAWTEXT and RCDATA elements in foreign content: <svg><title>a<b>c</b></title> contains a b element, but HTMLParser currently parses the title content as text.

The new constructor parameter support_cdata controls this: None (default) -- automatic detection; True -- a CDATA section is recognized in any context, foreign content is not detected (the previous default behavior); False -- a CDATA section is never recognized. Calling _set_support_cdata() disables the automatic detection, so existing code which maintains its own tracking machinery works as before.
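The three-state semantics of support_cdata can be summarized as a small decision function. This is only a sketch of the rule as described above; `cdata_recognized` is a hypothetical helper, not an HTMLParser API, and `in_foreign` stands for whatever the automatic detection reports.

```python
def cdata_recognized(support_cdata, in_foreign):
    """Return True if '<![CDATA[' should start a CDATA section.

    support_cdata: None (automatic), True (always) or False (never).
    in_foreign: whether the parser is currently inside svg/math content.
    """
    if support_cdata is None:
        # Automatic detection: follow the current parsing context.
        return in_foreign
    # Explicit setting: the context is ignored.
    return bool(support_cdata)


for setting in (None, True, False):
    for foreign in (False, True):
        print(setting, foreign, cdata_recognized(setting, foreign))
```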

Has this already been discussed elsewhere?

The last item of gh-135661, discussed also in #135665. Related: gh-137877, gh-140878.



Labels: stdlib, type-feature
