File: //proc/2/cwd/lib64/python3.6/site-packages/lxml/html/__pycache__/clean.cpython-36.pyc
3
��]b�i  �            /   @   s�  d Z ddlZddlZyddlmZ W n  ek
rD   ddlmZ Y nX ddlmZ ddl	m
Z
 ddl	mZmZ ddl	m
Z
mZ ye W n ek
r�   eZY nX ye W n ek
r�   eZY nX ye W n ek
r�   eZY nX ye W n ek
�r
   eefZY nX dd	d
ddd
dgZejdejejB �jZejdej�jZejdej�jZejdej�j Z!ejdej�j Z"ejdej�j Z#dd� Z$ejd�jZ%ejdejejB �Z&ej'd�Z(ej'ddeid�Z)G dd
� d
e*�Z+e+� Z,e,j-Z-ejdej�ejdej�gZ.d d!d"d#d$d%gZ/ejd&ej�ejd'ej�ejd(�gZ0d)gZ1e.e/e0e1fd*d�Z2d+d,� Z3d-d� Z4e2j e4_ d!d d"gZ5d.gZ6d/e5e6ed0�fd1d
�Z7d2d� Z8d3d4� Z9ejd5ej�Z:d6d7� Z;dS )8zcA cleanup tool for HTML.
Removes unwanted tags and content.  See the `Cleaner` class for
details.
�    N)�urlsplit)�etree)�defs)�
fromstring�XHTML_NAMESPACE)�
xhtml_to_html�_transform_result�
clean_html�clean�Cleaner�autolink�
autolink_html�
word_break�word_break_htmlzexpression\s*\(.*?\)z
@\s*importz</?[a-zA-Z]+|\son[a-zA-Z]+\s*=z^data:image/(.+);base64,z:(javascript|jscript|livescript|vbscript|data|about|mocha):z	(xml|svg)c             C   s:   d}x t | �D ]}d}t|�rdS qW |r.dS tt| ��S )NFT)�_find_image_dataurls�_is_unsafe_image_type�bool�_is_possibly_malicious_scheme)�sZis_image_urlZ
image_type� r   �/usr/lib64/python3.6/clean.py�_is_javascript_schemeT   s    r   z[\s\x00-\x08\x0B\x0C\x0E-\x19]+z\[if[\s\n\r]+.*?][\s\n\r]*>zdescendant-or-self::*[@style]z�descendant-or-self::a  [normalize-space(@href) and substring(normalize-space(@href),1,1) != '#'] |descendant-or-self::x:a[normalize-space(@href) and substring(normalize-space(@href),1,1) != '#']�x)Z
namespacesc            	   @   s�   e Zd ZdZdZdZdZdZdZdZ	dZ
dZdZdZ
dZdZdZdZdZdZdZdZejZdZf Zeddg�Zdd� Zed	d
ddgd	d	d	d
d
�Zdd� Zdd� Z dd� Z!dd� Z"dd� Z#d!dd�Z$dd� Z%e&j'de&j(�j)Z*dd� Z+dd � Z,dS )"r   a  
    Instances clean the document of each of the possible offending
    elements.  The cleaning is controlled by attributes; you can
    override attributes in a subclass, or set them in the constructor.
    ``scripts``:
        Removes any ``<script>`` tags.
    ``javascript``:
        Removes any JavaScript, like an ``onclick`` attribute. Also removes stylesheets
        as they could contain JavaScript.
    ``comments``:
        Removes any comments.
    ``style``:
        Removes any style tags.
    ``inline_style``:
        Removes any style attributes.  Defaults to the value of the ``style`` option.
    ``links``:
        Removes any ``<link>`` tags.
    ``meta``:
        Removes any ``<meta>`` tags.
    ``page_structure``:
        Removes structural parts of a page: ``<head>``, ``<html>``, ``<title>``.
    ``processing_instructions``:
        Removes any processing instructions.
    ``embedded``:
        Removes any embedded objects (flash, iframes).
    ``frames``:
        Removes any frame-related tags.
    ``forms``:
        Removes any form tags.
    ``annoying_tags``:
        Removes tags that aren't *wrong* but are annoying: ``<blink>`` and ``<marquee>``.
    ``remove_tags``:
        A list of tags to remove.  Only the tags will be removed,
        their content will get pulled up into the parent tag.
    ``kill_tags``:
        A list of tags to kill.  Killing also removes the tag's content,
        i.e. the whole subtree, not just the tag itself.
    ``allow_tags``:
        A list of tags to include (by default, all tags are included).
    ``remove_unknown_tags``:
        Remove any tags that aren't standard parts of HTML.
    ``safe_attrs_only``:
        If true, only include 'safe' attributes (specifically the list
        from the feedparser HTML sanitisation web site).
    ``safe_attrs``:
        A set of attribute names to override the default list of attributes
        considered 'safe' (when safe_attrs_only=True).
    ``add_nofollow``:
        If true, then any ``<a>`` tags will have ``rel="nofollow"`` added to them.
    ``host_whitelist``:
        A list or set of hosts that you can use for embedded content
        (for content like ``<object>``, ``<link rel="stylesheet">``, etc).
        You can also implement/override the method
        ``allow_embedded_url(el, url)`` or ``allow_element(el)`` to
        implement more complex rules for what can be embedded.
        Anything that passes this test will be shown, regardless of
        the value of (for instance) ``embedded``.
        Note that this parameter might not work as intended if you do not
        make the links absolute before doing the cleaning.
        Note that you may also need to set ``whitelist_tags``.
    ``whitelist_tags``:
        A set of tags that can be included with ``host_whitelist``.
        The default is ``iframe`` and ``embed``; you may wish to
        include other tags like ``script``, or you may want to
        implement ``allow_embedded_url`` for more control.  Set to None to
        include all tags.
    This modifies the document *in place*.
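
    The ``add_nofollow`` option described above can be illustrated with a
    small standalone sketch. It uses only plain string handling and does not
    call lxml; the helper name ``merge_nofollow`` is hypothetical and is not
    part of this module. It assumes the rule is: leave ``rel`` alone when it
    already contains ``nofollow`` as a whole token, otherwise append it.

```python
def merge_nofollow(rel):
    """Return a rel value that contains the token 'nofollow'.

    Hypothetical helper for illustration only; not part of lxml.
    """
    if not rel:
        # No rel attribute (or an empty one): the result is just 'nofollow'.
        return "nofollow"
    # Pad with spaces so that only a whole-token match counts
    # ('xnofollow' must not be mistaken for 'nofollow').
    if " nofollow " in " %s " % rel:
        return rel
    # Keep the existing tokens and append 'nofollow'.
    return "%s nofollow" % rel


if __name__ == "__main__":
    for value in (None, "external", "external nofollow", "xnofollow"):
        print(repr(value), "->", repr(merge_nofollow(value)))
```

    Padding the value with spaces before testing is a simple way to match
    whole tokens without splitting the attribute into a list.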
    TFN�iframe�embedc             K   sZ   x:|j � D ].\}}t| |�s,td||f ��t| ||� q
W | jd krVd|krV| j| _d S )NzUnknown parameter: %s=%r�inline_style)�items�hasattr�	TypeError�setattrr   �style)�self�kw�name�valuer   r   r   �__init__�   s    
zCleaner.__init__�src�href�code�object)�script�link�appletr   r   �layer�ac             C   s�  t |d�r|j� }t|� x|jd�D ]
}d|_q&W | jsD| j|� t| jpNf �}t| j	p\f �}t| j
pjf �}| jr~|jd� | j
r�t| j�}x:|jtj�D ]*}|j}x|j� D ]}||kr�||= q�W q�W | j�r*| j
o�| jtjk�s(x@|jtj�D ]0}|j}x$|j� D ]}|jd��r||= �qW q�W |j| jdd� | j�s�x\t|�D ]P}|jd�}	td	|	�}
td	|
�}
| j|
��r�|jd= n|
|	k�rJ|jd|
� �qJW | j�s*x�t|jd��D ]p}|jd
d	�j � j!� dk�r�|j"�  �q�|j#�p�d	}	td	|	�}
td	|
�}
| j|
��rd|_#n|
|	k�r�|
|_#�q�W | j�s:| j$�rF|jtj%� | j$�rZ|jtj&� | j�rl|jd� | j�r�tj'|d� | j(�r�|jd
� nT| j�s�| j�r�xBt|jd
��D ]0}d|jdd	�j � k�r�| j)|��s�|j"�  �q�W | j*�r�|jd� | j+�r|j,d)� | j-�r�x\t|jd��D ]J}d}|j.� }x$|dk	�rX|jd*k�rX|j.� }�q6W |dk�r$|j"�  �q$W |j,d+� |j,d,� | j/�r�|j,tj0� | j1�r�|jd� |j,d-� | j2�r�|j,d.� g }
g }x`|j� D ]T}|j|k�r| j)|��r��q�|j3|� n&|j|k�r�| j)|��r"�q�|
j3|� �q�W |
�rb|
d" |k�rb|
j4d"�}d#|_|jj5�  n8|�r�|d" |k�r�|j4d"�}|jdk�r�d#|_|j5�  |j6�  x|D ]}|j"�  �q�W x|
D ]}|j7�  �q�W | j8�r�|�r�t9d$��ttj:�}|�rlg }x(|j� D ]}|j|k�r|j3|� �qW |�rl|d" |k�rT|j4d"�}d#|_|jj5�  x|D ]}|j7�  �qZW | j;�r�xdt<|�D ]X}| j=|��s~|jd�}|�r�d%|k�r�d&d'| k�r��q~d(| }nd%}|jd|� �q~W dS )/z&
        Cleans the document.
        �getrootZimageZimgr*   ZonF)Zresolve_base_hrefr    � �typeztext/javascriptz
/* deleted */r+   Z
stylesheet�rel�meta�head�html�title�paramNr,   r)   r   r   r-   Zform�button�input�select�textarea�blink�marqueer   ZdivzIIt does not make sense to pass in both allow_tags and remove_unknown_tagsZnofollowz
 nofollow z %s z%s nofollow)r4   r5   r6   )r,   r)   )r,   )r   r   r-   r)   r7   )r8   r9   r:   r;   )r<