Detect foreign content in HTMLParser for context-dependent parsing of CDATA sections #153027

# Feature or enhancement

`HTMLParser` recognizes a CDATA section `<![CDATA[...]]>` in any context. According to [the HTML5 specification](https://html.spec.whatwg.org/multipage/parsing.html#markup-declaration-open-state), it should only be recognized in foreign content -- the content of `svg` and `math` elements. Elsewhere, `<![CDATA[` starts a bogus comment which ends at the first `>`, not at `]]>`. Using the wrong ending condition can make the parser see a different document structure than browsers do, which can have security consequences. This is the last unresolved item of gh-135661. The fix in #135665 was not satisfactory: it just passed the ball to the user, who is supposed to maintain the tracking mechanism outside of `HTMLParser` and call the new private method `_set_support_cdata()`.
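For context, here is a rough sketch of the kind of user-side tracking that the current approach requires. The class name `ForeignTrackingParser`, the `foreign_depth` counter and the `seen` list are hypothetical, and the sketch assumes that `_set_support_cdata()` accepts a single boolean flag (the call is guarded because the method is private and absent in older versions). It only counts `svg` and `math` nesting and ignores HTML integration points and the tags that break out of foreign content, so it is an approximation and not what the specification prescribes.

```python
from html.parser import HTMLParser

# Elements whose content is treated as foreign content in this sketch.
FOREIGN_ROOTS = frozenset({"svg", "math"})


class ForeignTrackingParser(HTMLParser):
    """Hypothetical subclass that tracks svg/math nesting by hand."""

    def __init__(self):
        super().__init__()
        self.foreign_depth = 0
        self.seen = []  # (tag, inside_foreign_content) for each start tag
        self._sync()

    def _sync(self):
        # Assumption: the private method takes one boolean flag.
        # It is looked up dynamically because older versions lack it.
        setter = getattr(self, "_set_support_cdata", None)
        if setter is not None:
            setter(self.foreign_depth > 0)

    def handle_starttag(self, tag, attrs):
        if tag in FOREIGN_ROOTS:
            self.foreign_depth += 1
        self.seen.append((tag, self.foreign_depth > 0))
        self._sync()

    def handle_endtag(self, tag):
        if tag in FOREIGN_ROOTS and self.foreign_depth:
            self.foreign_depth -= 1
        self._sync()


parser = ForeignTrackingParser()
parser.feed("<p>x</p><svg><circle></circle></svg><b>y</b>")
parser.close()
print(parser.seen)
# [('p', False), ('svg', True), ('circle', True), ('b', False)]
```

Every subclass that cares about CDATA sections has to repeat bookkeeping like this, which is the burden this issue proposes to move into `HTMLParser`.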

I propose to automatically detect foreign content in `HTMLParser` itself, by following start and end tags, approximating [the tree construction dispatcher](https://html.spec.whatwg.org/multipage/parsing.html#tree-construction-dispatcher) and [the rules for parsing tokens in foreign content](https://html.spec.whatwg.org/multipage/parsing.html#parsing-main-inforeign).

```pycon
>>> parser.feed('<![CDATA[a > b]]>')       # bogus comment: comment '[CDATA[a '
>>> parser.feed('<svg><![CDATA[a > b]]>')  # CDATA section: unknown decl 'CDATA[a > b'
```
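The difference between the two ending conditions can be sketched with a small standalone function. This is only an illustration of the rule, not the code in `HTMLParser`: the name `scan_cdata_like` and its return convention are made up here, and it returns `None` for unterminated input instead of handling end of file the way the specification does.

```python
CDATA_OPEN = "<![CDATA["


def scan_cdata_like(data, start, in_foreign):
    """Return (kind, content, end) for markup starting with '<![CDATA['.

    Returns None if the terminator has not been seen yet.
    """
    if not data.startswith(CDATA_OPEN, start):
        raise ValueError("expected '<![CDATA[' at position %d" % start)
    if in_foreign:
        # Foreign content: a real CDATA section, terminated by ']]>'.
        body_start = start + len(CDATA_OPEN)
        close = data.find("]]>", body_start)
        if close < 0:
            return None
        return "cdata", data[body_start:close], close + 3
    # HTML content: a bogus comment after '<!', terminated by the first '>'.
    body_start = start + 2
    close = data.find(">", body_start)
    if close < 0:
        return None
    return "comment", data[body_start:close], close + 1


text = "<![CDATA[a > b]]>"
print(scan_cdata_like(text, 0, in_foreign=False))  # ('comment', '[CDATA[a ', 12)
print(scan_cdata_like(text, 0, in_foreign=True))   # ('cdata', 'a > b', 17)
```

In the non-foreign case everything after position 12 (here ` b]]>`) is parsed as ordinary markup again, which is how a parser using the wrong rule can end up disagreeing with browsers about where tags begin.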

Detecting foreign content also fixes the handling of elements that are RAWTEXT or RCDATA elements in HTML when they occur in foreign content: `<svg><title>a<b>c</b></title>` contains a `b` element, but `HTMLParser` currently parses the content of `title` as text.

The new constructor parameter *support_cdata* controls this:

* `None` (default) -- automatic detection.
* `True` -- a CDATA section is recognized in any context, and foreign content is not detected (the previous default behavior).
* `False` -- a CDATA section is never recognized.

Calling `_set_support_cdata()` disables the automatic detection, so existing code which maintains its own tracking machinery works as before.

### Has this already been discussed elsewhere?

The last item of gh-135661, discussed also in #135665. Related: gh-137877, gh-140878.

### Links to previous discussion of this feature:

* https://github.com/python/cpython/issues/135661
* https://github.com/python/cpython/pull/135665



### Linked PRs
* gh-153028
