File: //home/arjun/projects/env/lib/python3.10/site-packages/lxml/html/__pycache__/clean.cpython-310.pyc
o
weYn � @ s� d Z ddlmZ ddlZddlZddlZzddlmZ ddlm Z W n e
y3 ddlmZm Z Y nw ddlm
Z
ddlmZ dd lmZmZ dd
lmZmZ ze W n ey_ eZY nw ze W n eyo eZY nw ze W n
ey� eefZY nw g d�Ze�dejejB �jZ e�d
ej�jZ!ejdgej"d dkr�ej#fnd�R � j$Z%e�dej�j&Z'e�dej�j&Z(e�dej�j$Z)dd� Z*e�d�jZ+e�dejejB �Z,e
�-d�Z.e
j-ddeid�Z/G dd� de0�Z1e1� Z2e2j3Z3e�dej�e�dej�gZ4g d �Z5e�d!ej�e�d"ej�e�d#�gZ6d$gZ7e4e5e6e7fd%d&�Z8d'd(� Z9d)d*� Z:e8j e:_ g d+�Z;d,gZ<d-e;e<ed.�fd/d0�Z=d1d2� Z>d3d4� Z?e�d5ej�Z@d6d7� ZAdS )8zcA cleanup tool for HTML.
Removes unwanted tags and content. See the `Cleaner` class for
details.
� )�absolute_importN)�urlsplit)�unquote_plus)r r )�etree)�defs)�
fromstring�XHTML_NAMESPACE)�
xhtml_to_html�_transform_result)�
clean_html�clean�Cleaner�autolink�
autolink_html�
word_break�word_break_htmlzexpression\s*\(.*?\)z
@\s*importz</?[a-zA-Z]+|\son[a-zA-Z]+\s*=� � z:(javascript|jscript|livescript|vbscript|data|about|mocha):z (xml|svg)c C s8 d}t | �D ]
}t|�r dS |d7 }qtt| ��|kS )Nr T� )�_find_image_dataurls�_is_unsafe_image_type�len�_possibly_malicious_schemes)�s�safe_image_urls�
image_typer r �H/home/arjun/projects/env/lib/python3.10/site-packages/lxml/html/clean.py�_has_javascript_schemeV s
r z[\s\x00-\x08\x0B\x0C\x0E-\x19]+z\[if[\s\n\r]+.*?][\s\n\r]*>zdescendant-or-self::*[@style]z�descendant-or-self::a [normalize-space(@href) and substring(normalize-space(@href),1,1) != '#'] |descendant-or-self::x:a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#']�x)�
namespacesc @ s� e Zd ZdZdZdZdZdZdZdZ dZ
dZdZdZ
dZdZdZdZdZdZdZdZejZdZdZddhZdd � Zed
ddd
gd
d
d
dd�Zdd� Zdd� Zdd� Z dd� Z!dd� Z"d"dd�Z#dd� Z$e%�&de%j'�j(Z)dd� Z*d d!� Z+dS )#r
a
Instances clean the document of each of the possible offending
elements. The cleaning is controlled by attributes; you can
override attributes in a subclass, or set them in the constructor.
``scripts``:
Removes any ``<script>`` tags.
``javascript``:
Removes any JavaScript, like an ``onclick`` attribute. Also removes stylesheets
as they could contain JavaScript.
``comments``:
Removes any comments.
``style``:
Removes any style tags.
``inline_style``:
Removes any style attributes. Defaults to the value of the ``style`` option.
``links``:
Removes any ``<link>`` tags.
``meta``:
Removes any ``<meta>`` tags.
``page_structure``:
Removes structural parts of a page: ``<head>``, ``<html>``, ``<title>``.
``processing_instructions``:
Removes any processing instructions.
``embedded``:
Removes any embedded objects (flash, iframes).
``frames``:
Removes any frame-related tags.
``forms``:
Removes any form tags.
``annoying_tags``:
Removes tags that aren't *wrong* but are annoying: ``<blink>`` and ``<marquee>``.
``remove_tags``:
A list of tags to remove. Only the tags will be removed,
their content will get pulled up into the parent tag.
``kill_tags``:
A list of tags to kill. Killing also removes the tag's content,
i.e. the whole subtree, not just the tag itself.
``allow_tags``:
A list of tags to include (default: include all).
``remove_unknown_tags``:
Remove any tags that aren't standard parts of HTML.
``safe_attrs_only``:
If true, only include 'safe' attributes (specifically the list
from the feedparser HTML sanitisation web site).
``safe_attrs``:
A set of attribute names to override the default list of attributes
considered 'safe' (when safe_attrs_only=True).
``add_nofollow``:
If true, then any ``<a>`` tags will have ``rel="nofollow"`` added to them.
``host_whitelist``:
A list or set of hosts that you can use for embedded content
(for content like ``<object>``, ``<link rel="stylesheet">``, etc).
You can also implement/override the method
``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
implement more complex rules for what can be embedded.
Anything that passes this test will be shown, regardless of
the value of (for instance) ``embedded``.
Note that this parameter might not work as intended if you do not
make the links absolute before doing the cleaning.
Note that you may also need to set ``whitelist_tags``.
``whitelist_tags``:
A set of tags that can be included with ``host_whitelist``.
The default is ``iframe`` and ``embed``; you may wish to
include other tags like ``script``, or you may want to
implement ``allow_embedded_url`` for more control. Set to None to
include all tags.
This modifies the document *in place*.
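The constructor behaviour described above (options set as keyword arguments, ``inline_style`` falling back to ``style``, and ``allow_tags`` conflicting with ``remove_unknown_tags``) can be sketched roughly as follows. This is a simplified illustration, not the library's code: ``MiniCleaner`` is a hypothetical stand-in class that carries only four of the options, and its check for unknown parameters is looser than whatever the real ``Cleaner`` does.

```python
class MiniCleaner:
    """Hypothetical, reduced stand-in for the option handling of ``Cleaner``."""

    # Class-level defaults; a subclass may override any of them.
    style = False
    inline_style = None          # None means "follow the ``style`` option"
    allow_tags = None
    remove_unknown_tags = True

    def __init__(self, **kw):
        missing = object()       # sentinel for "no such option"
        for name, value in kw.items():
            # Assumption: only names that already exist as attributes are accepted.
            if getattr(self, name, missing) is missing:
                raise TypeError("Unknown parameter: %s=%r" % (name, value))
            setattr(self, name, value)

        # ``inline_style`` defaults to the value of ``style`` unless given explicitly.
        if self.inline_style is None and "inline_style" not in kw:
            self.inline_style = self.style

        # An explicit allow-list already decides which tags survive, so it
        # cannot be combined with removal of unknown tags.
        if kw.get("allow_tags"):
            if kw.get("remove_unknown_tags"):
                raise ValueError(
                    "It does not make sense to pass in both "
                    "allow_tags and remove_unknown_tags")
            self.remove_unknown_tags = False


# Usage: options can be passed to the constructor or overridden in a subclass.
strict = MiniCleaner(style=True)
allow_only = MiniCleaner(allow_tags=["p", "a"])
```

In this sketch, ``strict.inline_style`` follows ``style`` because it was not passed explicitly, and ``allow_only.remove_unknown_tags`` is switched off by the allow-list.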
TFNr �iframe�embedc K s� t � }|�� D ]-\}}t| ||�}|d ur.|dur.|dur.t|ttttf�s.td||f ��t | ||� q| j
d u rBd|vrB| j| _
|�d�rU|�d�rPt
d��d| _d S d S )NTFzUnknown parameter: %s=%r�inline_style�
allow_tags�remove_unknown_tags�IIt does not make sense to pass in both allow_tags and remove_unknown_tags)�object�items�getattr�
isinstance� frozenset�set�tuple�list� TypeError�setattrr"