A modern CSS selector implementation for BeautifulSoup

Soup Sieve

Overview

Soup Sieve is a CSS selector library designed to be used with Beautiful Soup 4. It aims to provide selecting, matching, and filtering using modern CSS selectors. Soup Sieve currently provides selectors from the CSS level 1 specifications up through the latest CSS level 4 drafts and beyond (though some are not yet implemented).

Soup Sieve was written with the intent to replace Beautiful Soup's built-in select feature, and as of Beautiful Soup version 4.7.0, it does 🎊. Soup Sieve can also be imported in order to use its API directly for more controlled, specialized parsing.

Soup Sieve has implemented most of the CSS selectors up through the latest CSS draft specifications, though there are a number that don't make sense in a non-browser environment. Selectors that cannot provide meaningful functionality simply do not match anything. Some of the supported selectors are:

  • .classes
  • #ids
  • [attributes=value]
  • parent child
  • parent > child
  • sibling ~ sibling
  • sibling + sibling
  • :not(element.class, element2.class)
  • :is(element.class, element2.class)
  • parent:has(> child)
  • and many more
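As a rough illustration of selecting, matching, and filtering with a few of the selectors above, here is a minimal sketch. It assumes `beautifulsoup4` and `soupsieve` are installed; the sample markup and variable names are made up for the example.

```python
from bs4 import BeautifulSoup
import soupsieve as sv

# Made-up sample markup, for illustration only.
html = """
<div id="main">
  <p class="a">one</p>
  <p class="b">two</p>
  <span class="a">three</span>
  <ul><li>item</li></ul>
</div>
"""
soup = BeautifulSoup(html, "html.parser")

# Selecting: every <p> that carries class "a" or class "b".
paragraphs = sv.select("p:is(.a, .b)", soup)
print([p.get_text() for p in paragraphs])

# Matching: test a single element against a selector.
first_p = soup.find("p")
print(sv.match("p.a", first_p))

# Filtering: keep only the tags that have an <li> as a direct child.
with_li = sv.filter(":has(> li)", soup.find_all(True))
print([tag.name for tag in with_li])
```

The same selectors can also be passed to Beautiful Soup's own `select()` method, which delegates to Soup Sieve from version 4.7.0 on.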

Installation

You must have Beautiful Soup already installed:

pip install beautifulsoup4

In most cases, assuming you've installed version 4.7.0 or later, that should be all you need to do. If you've installed Beautiful Soup via some alternative method and Soup Sieve was not automatically installed for you, you can install it directly:

pip install soupsieve

If you want to manually install it from source, navigate to the root of the project and run:

python setup.py build
python setup.py install

Documentation

Documentation is found here: https://facelessuser.github.io/soupsieve/.

License

MIT License

Copyright (c) 2018 - 2021 Isaac Muse [email protected]

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.

Comments
  • Suggestion: adding soupsieve.strain

    Hello @facelessuser,

    I need to create BeautifulSoup strainers to optimize scrapers for my tool here. As a convenience for my users, I am going to build them from simple CSS selectors such as li.item[align=left].

    I can do so by using your CSSParser class, which processes a selector and returns Selector instances, and then assessing whether the selector is simple enough to fit a strainer. If so, I can build a function that "applies" this selector to tell the strainer whether it should parse the current node, and so on.

    I will implement this for me in my tool but I was wondering if you'd like me to contribute to this lib instead by adding something like soupsieve.strain basically. It would return an arg (typically a function) you can give to bs4.SoupStrainer and should raise a custom error if the selector is found to be too complex for the task. If this is of any interest I can open a PR for this.
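The idea in this proposal can be sketched without touching Soup Sieve's internals. The block below is only an illustration: `strainer_args` and `SelectorTooComplex` are hypothetical names, and the regular expressions are a simplified stand-in for CSSParser that accepts only a tag name followed by `.class`, `#id`, `[attr]`, or `[attr=value]` components.

```python
import re


class SelectorTooComplex(ValueError):
    """Raised when a selector cannot be reduced to a tag name plus attributes."""


# One simple component: .class, #id, [attr] or [attr=value] (value optionally quoted).
_COMPONENT = re.compile(
    r"""
      \.(?P<cls>[\w-]+)
    | \#(?P<id>[\w-]+)
    | \[(?P<attr>[\w-]+)(?:=(?P<quote>["']?)(?P<value>[^\]"']*)(?P=quote))?\]
    """,
    re.VERBOSE,
)
_TAG_NAME = re.compile(r"\*|[A-Za-z][\w-]*")


def strainer_args(selector):
    """Return (name, attrs) for a simple selector, or raise SelectorTooComplex."""
    selector = selector.strip()
    if not selector:
        raise SelectorTooComplex("empty selector")

    name = None
    pos = 0
    tag = _TAG_NAME.match(selector)
    if tag:
        pos = tag.end()
        if tag.group() != "*":
            name = tag.group()

    attrs = {}
    classes = []
    while pos < len(selector):
        part = _COMPONENT.match(selector, pos)
        if part is None:
            # Combinators, pseudo-classes, selector lists, etc. end up here.
            raise SelectorTooComplex(
                f"unsupported syntax at position {pos}: {selector!r}"
            )
        if part.group("cls"):
            classes.append(part.group("cls"))
        elif part.group("id"):
            attrs["id"] = part.group("id")
        else:
            value = part.group("value")
            # A bare [attr] only requires the attribute to be present.
            attrs[part.group("attr")] = True if value is None else value
        pos = part.end()

    if len(classes) > 1:
        raise SelectorTooComplex("more than one class is not handled by this sketch")
    if classes:
        attrs["class"] = classes[0]
    return name, attrs


print(strainer_args("li.item[align=left]"))
```

The returned pair could then be handed to `bs4.SoupStrainer(name, attrs=attrs)` and used through the `parse_only` argument of `BeautifulSoup`. A real implementation built on CSSParser would cover far more of the selector grammar than this sketch does.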

    Have a good day and thanks for your work,

    T: feature P: maybe 
    opened by Yomguithereal 29
  • CDATA handling in HTML changed in lxml parser with libxml2 2.9.12

    After upgrading the system libxml2 to 2.9.12 (or 2.9.11; 2.9.10 is the previous working version I have here), the two following tests fail with lxml built against the system library:

    FAILED tests/test_extra/test_soup_contains.py::TestSoupContains::test_contains_cdata_html - AssertionError: Lists differ: ['1', '2'] != ['1']
    FAILED tests/test_extra/test_soup_contains_own.py::TestSoupContainsOwn::test_contains_own_cdata_html - AssertionError: Lists differ: ['1', '2']...
    

    The cause seems to be a different representation of CDATA:

            soup       = <html><body><div id="1">Testing that <span id="2">&lt;![CDATA[that]]&gt;</span>contains works.</div></body>
    </html>
    

    (i.e. &lt;![CDATA[... instead of <!--[CDATA[...)

    Note that in order to reproduce you need to both upgrade libxml2 and build lxml against the new version. Binary wheels are statically linked to an old version of libxml2, so they do not reproduce the issue yet. For example, I have been able to reproduce it with tox after swapping the installed lxml version:

    . .tox/py39/bin/activate
    pip uninstall lxml
    pip install lxml --no-binary lxml
    

    I am also not sure whether this isn't a bug in libxml2 or lxml.

    S: more-info-needed S: triage 
    opened by mgorny 21
  • 2.2.1: pytest based test suite is failing

    IMO it would be good to fix pytest support as pytest has a bit shorter list of dependencies than tox.

    + PYTHONPATH=/home/tkloczko/rpmbuild/BUILDROOT/python-soupsieve-2.2.1-2.fc35.x86_64/usr/lib64/python3.8/site-packages:/home/tkloczko/rpmbuild/BUILDROOT/python-soupsieve-2.2.1-2.fc35.x86_64/usr/lib/python3.8/site-packages
    + /usr/bin/python3 -Bm pytest -ra
    =========================================================================== test session starts ============================================================================
    platform linux -- Python 3.8.8, pytest-6.2.2, py-1.10.0, pluggy-0.13.1
    rootdir: /home/tkloczko/rpmbuild/BUILD/soupsieve-2.2.1, configfile: tox.ini
    plugins: flaky-3.6.1, forked-1.3.0, shutil-1.7.0, virtualenv-1.7.0, asyncio-0.14.0, expect-1.1.0, cov-2.11.1, mock-3.5.1, httpbin-1.0.0, xdist-2.2.1, flake8-1.0.7, timeout-1.4.2, betamax-0.8.1, hypothesis-6.8.1, pyfakefs-4.4.0, freezegun-0.4.2
    collected 360 items
    
    tests/test_api.py ..........................................                                                                                                         [ 11%]
    tests/test_bs4_cases.py .....                                                                                                                                        [ 13%]
    tests/test_versions.py ....                                                                                                                                          [ 14%]
    tests/test_extra/test_attribute.py ...                                                                                                                               [ 15%]
    tests/test_extra/test_custom.py ..........                                                                                                                           [ 17%]
    tests/test_extra/test_soup_contains.py ..F...............                                                                                                            [ 22%]
    tests/test_extra/test_soup_contains_own.py .F...                                                                                                                     [ 24%]
    tests/test_level1/test_active.py .                                                                                                                                   [ 24%]
    tests/test_level1/test_at_rule.py .                                                                                                                                  [ 24%]
    tests/test_level1/test_class.py ........                                                                                                                             [ 26%]
    tests/test_level1/test_comments.py ..                                                                                                                                [ 27%]
    tests/test_level1/test_descendant.py .                                                                                                                               [ 27%]
    tests/test_level1/test_escapes.py .                                                                                                                                  [ 28%]
    tests/test_level1/test_id.py ...                                                                                                                                     [ 28%]
    tests/test_level1/test_link.py ..                                                                                                                                    [ 29%]
    tests/test_level1/test_list.py ....                                                                                                                                  [ 30%]
    tests/test_level1/test_pseudo_class.py ..                                                                                                                            [ 31%]
    tests/test_level1/test_pseudo_element.py .                                                                                                                           [ 31%]
    tests/test_level1/test_type.py .....                                                                                                                                 [ 32%]
    tests/test_level1/test_visited.py .                                                                                                                                  [ 33%]
    tests/test_level2/test_attribute.py ..............................                                                                                                   [ 41%]
    tests/test_level2/test_child.py .....                                                                                                                                [ 42%]
    tests/test_level2/test_first_child.py .                                                                                                                              [ 43%]
    tests/test_level2/test_focus.py ..                                                                                                                                   [ 43%]
    tests/test_level2/test_hover.py .                                                                                                                                    [ 43%]
    tests/test_level2/test_lang.py ..                                                                                                                                    [ 44%]
    tests/test_level2/test_next_sibling.py ...                                                                                                                           [ 45%]
    tests/test_level2/test_universal_type.py .                                                                                                                           [ 45%]
    tests/test_level3/test_attribute.py ...                                                                                                                              [ 46%]
    tests/test_level3/test_checked.py .                                                                                                                                  [ 46%]
    tests/test_level3/test_disabled.py .......                                                                                                                           [ 48%]
    tests/test_level3/test_empty.py .                                                                                                                                    [ 48%]
    tests/test_level3/test_enabled.py ......                                                                                                                             [ 50%]
    tests/test_level3/test_first_of_type.py ...                                                                                                                          [ 51%]
    tests/test_level3/test_last_child.py ..                                                                                                                              [ 51%]
    tests/test_level3/test_last_of_type.py ...                                                                                                                           [ 52%]
    tests/test_level3/test_namespace.py ..............                                                                                                                   [ 56%]
    tests/test_level3/test_not.py ....                                                                                                                                   [ 57%]
    tests/test_level3/test_nth_child.py ......                                                                                                                           [ 59%]
    tests/test_level3/test_nth_last_child.py ..                                                                                                                          [ 60%]
    tests/test_level3/test_nth_last_of_type.py ..                                                                                                                        [ 60%]
    tests/test_level3/test_nth_of_type.py ..                                                                                                                             [ 61%]
    tests/test_level3/test_only_child.py .                                                                                                                               [ 61%]
    tests/test_level3/test_only_of_type.py .                                                                                                                             [ 61%]
    tests/test_level3/test_root.py ...........                                                                                                                           [ 64%]
    tests/test_level3/test_subsequent_sibling.py .                                                                                                                       [ 65%]
    tests/test_level3/test_target.py ..                                                                                                                                  [ 65%]
    tests/test_level4/test_any_link.py ....                                                                                                                              [ 66%]
    tests/test_level4/test_attribute.py .....                                                                                                                            [ 68%]
    tests/test_level4/test_current.py ....                                                                                                                               [ 69%]
    tests/test_level4/test_default.py .....                                                                                                                              [ 70%]
    tests/test_level4/test_defined.py ..                                                                                                                                 [ 71%]
    tests/test_level4/test_dir.py ...........                                                                                                                            [ 74%]
    tests/test_level4/test_focus_visible.py ..                                                                                                                           [ 74%]
    tests/test_level4/test_focus_within.py ..                                                                                                                            [ 75%]
    tests/test_level4/test_future.py ..                                                                                                                                  [ 75%]
    tests/test_level4/test_has.py ..............                                                                                                                         [ 79%]
    tests/test_level4/test_host.py ..                                                                                                                                    [ 80%]
    tests/test_level4/test_host_context.py .                                                                                                                             [ 80%]
    tests/test_level4/test_in_range.py .......                                                                                                                           [ 82%]
    tests/test_level4/test_indeterminate.py ..                                                                                                                           [ 83%]
    tests/test_level4/test_is.py ........                                                                                                                                [ 85%]
    tests/test_level4/test_lang.py ..................                                                                                                                    [ 90%]
    tests/test_level4/test_local_link.py ..                                                                                                                              [ 90%]
    tests/test_level4/test_matches.py ..                                                                                                                                 [ 91%]
    tests/test_level4/test_not.py .                                                                                                                                      [ 91%]
    tests/test_level4/test_nth_child.py ..                                                                                                                               [ 92%]
    tests/test_level4/test_optional.py ..                                                                                                                                [ 92%]
    tests/test_level4/test_out_of_range.py .......                                                                                                                       [ 94%]
    tests/test_level4/test_past.py ..                                                                                                                                    [ 95%]
    tests/test_level4/test_paused.py ..                                                                                                                                  [ 95%]
    tests/test_level4/test_placeholder_shown.py .                                                                                                                        [ 96%]
    tests/test_level4/test_playing.py ..                                                                                                                                 [ 96%]
    tests/test_level4/test_read_only.py .                                                                                                                                [ 96%]
    tests/test_level4/test_read_write.py .                                                                                                                               [ 97%]
    tests/test_level4/test_required.py ..                                                                                                                                [ 97%]
    tests/test_level4/test_scope.py ...                                                                                                                                  [ 98%]
    tests/test_level4/test_target_within.py ..                                                                                                                           [ 99%]
    tests/test_level4/test_user_invalid.py .                                                                                                                             [ 99%]
    tests/test_level4/test_where.py ..                                                                                                                                   [100%]
    
    ================================================================================= FAILURES =================================================================================
    ________________________________________________________________ TestSoupContains.test_contains_cdata_html _________________________________________________________________
    
    self = <tests.test_extra.test_soup_contains.TestSoupContains testMethod=test_contains_cdata_html>
    
        def test_contains_cdata_html(self):
            """Test contains CDATA in HTML5."""
    
            markup = """
            <body><div id="1">Testing that <span id="2"><![CDATA[that]]></span>contains works.</div></body>
            """
    
    >       self.assert_selector(
                markup,
                'body *:-soup-contains("that")',
                ['1'],
                flags=util.HTML
            )
    
    tests/test_extra/test_soup_contains.py:154:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    tests/util.py:122: in assert_selector
        self.assertEqual(sorted(ids), sorted(expected_ids))
    E   AssertionError: Lists differ: ['1', '2'] != ['1']
    E
    E   First list contains 1 additional elements.
    E   First extra element 1:
    E   '2'
    E
    E   - ['1', '2']
    E   + ['1']
    --------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------
    ----Running Selector Test----
    PATTERN:  body *:-soup-contains("that")
    ## PARSING: 'body *:-soup-contains("that")'
    TOKEN: 'tag' --> 'body' at position 0
    TOKEN: 'combine' --> ' ' at position 4
    TOKEN: 'tag' --> '*' at position 5
    TOKEN: 'pseudo_contains' --> ':-soup-contains("that")' at position 6
    ## END PARSING
    
    ====PARSER:  html5lib
    TAG:  div
    
    ====PARSER:  lxml
    TAG:  div
    TAG:  span
    _____________________________________________________________ TestSoupContainsOwn.test_contains_own_cdata_html _____________________________________________________________
    
    self = <tests.test_extra.test_soup_contains_own.TestSoupContainsOwn testMethod=test_contains_own_cdata_html>
    
        def test_contains_own_cdata_html(self):
            """Test contains CDATA in HTML5."""
    
            markup = """
            <body><div id="1">Testing that <span id="2"><![CDATA[that]]></span>contains works.</div></body>
            """
    
    >       self.assert_selector(
                markup,
                'body *:-soup-contains-own("that")',
                ['1'],
                flags=util.HTML
            )
    
    tests/test_extra/test_soup_contains_own.py:45:
    _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _
    tests/util.py:122: in assert_selector
        self.assertEqual(sorted(ids), sorted(expected_ids))
    E   AssertionError: Lists differ: ['1', '2'] != ['1']
    E
    E   First list contains 1 additional elements.
    E   First extra element 1:
    E   '2'
    E
    E   - ['1', '2']
    E   + ['1']
    --------------------------------------------------------------------------- Captured stdout call ---------------------------------------------------------------------------
    ----Running Selector Test----
    PATTERN:  body *:-soup-contains-own("that")
    ## PARSING: 'body *:-soup-contains-own("that")'
    TOKEN: 'tag' --> 'body' at position 0
    TOKEN: 'combine' --> ' ' at position 4
    TOKEN: 'tag' --> '*' at position 5
    TOKEN: 'pseudo_contains' --> ':-soup-contains-own("that")' at position 6
    ## END PARSING
    
    ====PARSER:  html5lib
    TAG:  div
    
    ====PARSER:  lxml
    TAG:  div
    TAG:  span
    ========================================================================= short test summary info ==========================================================================
    FAILED tests/test_extra/test_soup_contains.py::TestSoupContains::test_contains_cdata_html - AssertionError: Lists differ: ['1', '2'] != ['1']
    FAILED tests/test_extra/test_soup_contains_own.py::TestSoupContainsOwn::test_contains_own_cdata_html - AssertionError: Lists differ: ['1', '2'] != ['1']
    ====================================================================== 2 failed, 358 passed in 2.25s =======================================================================
    
    S: duplicate 
    opened by kloczek 21
  • perf: don't import bs4 in every `is_...` function

    In my app, this makes is_tag (the third hottest function according to profiling) about 24% faster:

    func              | ncalls  | time | owntime
    ----------------- | ------- | ---- | -------
    master/is_tag     | 2054073 | 1566 | 1179
    perf-istag/is_tag | 2054073 | 1191 | 775

    I assume unpacking all of the classes from bs4 might make things a little faster still, but it's probably not worth the mess?
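The effect described here can be observed with a small, self-contained timing sketch. It uses the standard library's `fractions` module as a stand-in for bs4, and the two function names are invented for the comparison; the absolute numbers will vary by machine and Python version.

```python
import timeit
import fractions  # stand-in for bs4; any already-imported module behaves the same way


def is_fraction_local(obj):
    # The import statement runs on every call: a module cache lookup plus a local binding.
    import fractions as _fractions
    return isinstance(obj, _fractions.Fraction)


def is_fraction_global(obj):
    # The module was bound once at import time; only a global lookup remains.
    return isinstance(obj, fractions.Fraction)


value = fractions.Fraction(1, 3)
for func in (is_fraction_local, is_fraction_global):
    elapsed = timeit.timeit(lambda: func(value), number=200_000)
    print(f"{func.__name__}: {elapsed:.3f}s")
```

Both functions return the same result. The only difference is how often the import machinery is consulted, which is what matters for a function called millions of times.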

    S: approved C: source C: css-matching 
    opened by akx 19
  • Improve CSS syntax error reporting

    This produces tracebacks like the following:

    Traceback (most recent call last):
      ...
      File "/home/mg/src/zopefoundation/zc.catalog/.tox/py37/lib/python3.7/site-packages/zope/testbrowser/browser.py", line 1370, in getControlLabels
        forlbls = html.select('label[for=%s]' % controlid)
      File "/home/mg/src/zopefoundation/zc.catalog/.tox/py37/lib/python3.7/site-packages/bs4/element.py", line 1376, in select
        return soupsieve.select(selector, self, namespaces, limit, **kwargs)
      File "/home/mg/src/soupsieve/soupsieve/__init__.py", line 108, in select
        return compile(select, namespaces, flags).select(tag, limit)
      File "/home/mg/src/soupsieve/soupsieve/__init__.py", line 59, in compile
        return cp._cached_css_compile(pattern, namespaces, flags)
      File "/home/mg/src/soupsieve/soupsieve/css_parser.py", line 192, in _cached_css_compile
        CSSParser(pattern, flags).process_selectors(),
      File "/home/mg/src/soupsieve/soupsieve/css_parser.py", line 930, in process_selectors
        return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
      File "/home/mg/src/soupsieve/soupsieve/css_parser.py", line 772, in parse_selectors
        key, m = next(iselector)
      File "/home/mg/src/soupsieve/soupsieve/css_parser.py", line 917, in selector_iter
        raise SelectorSyntaxError(msg, self.pattern)
      File "<string>", line 1
        label[for=BrowserAdd__zope.catalog.catalog.Catalog]
             ^
    soupsieve.css_parser.SelectorSyntaxError: Malformed attribute selector at position 5
    

    whereas before the traceback ended in

      File "/home/mg/src/zopefoundation/zc.catalog/.tox/py37/lib/python3.7/site-packages/soupsieve/css_parser.py", line 881, in selector_iter
        raise SyntaxError(msg)
      File "<string>", line None
    SyntaxError: Malformed attribute selector at position 5
    

    making it difficult to see what exactly was malformed about the selector.

    I've also chosen to introduce an exception subclass (SelectorSyntaxError), so that CSS parse errors could be distinguished from genuine Python syntax errors.
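A dedicated exception class lets calling code catch selector errors specifically. The sketch below assumes an installed Soup Sieve that exposes `SelectorSyntaxError` from `soupsieve.util`, the location shown in tracebacks from released versions; the markup and selector are made up.

```python
from bs4 import BeautifulSoup
from soupsieve.util import SelectorSyntaxError

soup = BeautifulSoup('<label for="form.field">Name</label>', "html.parser")

try:
    # Unquoted attribute values must be plain identifiers; the dot makes this one invalid.
    soup.select("label[for=form.field]")
except SelectorSyntaxError as exc:
    error = exc
    print(f"bad selector: {exc}")
else:
    error = None

# Quoting the value makes the selector well-formed.
labels = soup.select('label[for="form.field"]')
print(labels)
```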

    S: rejected T: maintenance 
    opened by mgedmin 18
  • An easy way to set priority?

    MDN (https://developer.mozilla.org/en-US/docs/Web/CSS/Specificity) says: "Using !important, however, is bad practice and should be avoided because it makes debugging more difficult by breaking the natural cascading in your stylesheets." But when I write complex selectors, they are hard for me to review. Could we group parts of a selector with something like parentheses, as in arithmetic (e.g. 3*(3+4)=21)?

    S: more-info-needed 
    opened by yjqiang 18
  • The :not selector doesn't work as expected.

    The minimal code that reproduces the bug is listed below:

    import bs4
    b = bs4.BeautifulSoup("<a href=\"http://www.example.com\"></a>") 
    b.body.a['foo'] = None  # str(b) ->  <html><body><a foo href="http://www.example.com"></a></body></html>
    b.select("a:not([foo])")  # -> [<a foo href="http://www.example.com"></a>]
    

    In this case, the a tag shouldn't be selected.

    T: bug 
    opened by jimages 16
  • Selectors '> tag', '+ tag', and '~ tag'

    Selectors that begin with one of the '>', '+', or '~' combinators worked in Beautiful Soup 4.6.x, but 4.7.x does not support them.

    For example, the code below raises a soupsieve.util.SelectorSyntaxError exception.

    from bs4 import BeautifulSoup
    BeautifulSoup('<a>test<b>test2</b></a>').a.select('> b')
    

    Result:

    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "D:\Programs\Programming\Python-3\lib\site-packages\bs4\element.py", line 1376, in select
        return soupsieve.select(selector, self, namespaces, limit, **kwargs)
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\__init__.py", line 112, in select
        return compile(select, namespaces, flags, **kwargs).select(tag, limit)
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\__init__.py", line 63, in compile
        return cp._cached_css_compile(pattern, namespaces, custom, flags)
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 205, in _cached_css_compile
        CSSParser(pattern, custom=custom_selectors, flags=flags).process_selectors(),
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 1010, in process_selectors
        return self.parse_selectors(self.selector_iter(self.pattern), index, flags)
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 888, in parse_selectors
        sel, m, has_selector, selectors, relations, is_pseudo, index
      File "D:\Programs\Programming\Python-3\lib\site-packages\soupsieve\css_parser.py", line 713, in parse_combinator
        index
    soupsieve.util.SelectorSyntaxError: The combinator '>' at postion 0, must have a selector before it
      line 1:
    > b
    ^
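One way to express the same query without a leading combinator is to anchor it with `:scope`. This is a minimal sketch, assuming the installed Soup Sieve supports `:scope`; it uses `html.parser` so that no extra parser is required.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup("<a>test<b>test2</b></a>", "html.parser")

# ":scope" stands for the element that select() is called on, so the
# combinator now has a selector in front of it.
children = soup.a.select(":scope > b")
print(children)
```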
    
    S: wontfix 
    opened by unreal666 14
  • XML default namespace leads to TypeError: __init__() keywords must be strings

    This is a bug with handling valid XML namespaces; soupsieve assumes all namespaces have a prefix:

    <prefix:tag xmlns:prefix="...">
    

    but the prefix can be omitted to define a default namespace:

    <tag xmlns="...">
    

    meaning that any element without a prefix: prepended to the tag name is in that namespace. See section 6.2 of the XML namespaces 1.1 spec.

    During parsing, lxml passes in a default namespace under the None key, e.g. {None: "..."}, and unique keys are accumulated in the soup._namespaces dictionary. soupsieve assumes the dictionary only ever has string keys, so an XML document with a default namespace leads to an exception.

    Test case (using BeautifulSoup 4.7 for convenience):

    >>> from bs4 import BeautifulSoup, __version__
    >>> __version__
    '4.7.0'
    >>> sample = b'''\
    ... <?xml version="1.1"?>
    ... <!-- unprefixed element types are from "books" -->
    ... <book xmlns='urn:loc.gov:books'
    ...       xmlns:isbn='urn:ISBN:0-395-36341-6'>
    ...     <title>Cheaper by the Dozen</title>
    ...     <isbn:number>1568491379</isbn:number>
    ... </book>
    ... '''
    >>> soup = BeautifulSoup(sample, 'xml')
    >>> soup._namespaces
    {'xml': 'http://www.w3.org/XML/1998/namespace', None: 'urn:loc.gov:books', 'isbn': 'urn:ISBN:0-395-36341-6'}
    >>> soup.select_one('title')
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1345, in select_one
        value = self.select(selector, namespaces, 1, **kwargs)
      File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/bs4/element.py", line 1377, in select
        return soupsieve.select(selector, self, namespaces, limit, **kwargs)
      File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 108, in select
        return compile(select, namespaces, flags).select(tag, limit)
      File "/Users/mj/Development/venvs/stackoverflow-latest/lib/python3.7/site-packages/soupsieve/__init__.py", line 50, in compile
        namespaces = ct.Namespaces(**(namespaces))
    TypeError: __init__() keywords must be strings
    

    where <title>Cheaper by the Dozen</title> was expected.
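The failure described above can be reproduced without Beautiful Soup, because Python rejects non-string keys when a dictionary is expanded with `**`. The sketch below is illustrative only: `make_namespaces` is a hypothetical stand-in for a constructor that takes namespaces as keyword arguments, and `normalize_namespaces` is an assumed workaround that renames the `None` key to an empty-string prefix. Whether the empty string is the right key for a default namespace in a given Soup Sieve version is an assumption to verify against its documentation.

```python
def make_namespaces(**kwargs):
    """Hypothetical stand-in: accepts namespaces as keyword arguments."""
    return dict(kwargs)


def normalize_namespaces(namespaces):
    """Assumed workaround: rename the None (default namespace) key to ''."""
    return {("" if prefix is None else prefix): uri
            for prefix, uri in namespaces.items()}


# The same mapping that soup._namespaces holds in the report above.
namespaces = {
    "xml": "http://www.w3.org/XML/1998/namespace",
    None: "urn:loc.gov:books",
    "isbn": "urn:ISBN:0-395-36341-6",
}

try:
    make_namespaces(**namespaces)
    error = None
except TypeError as exc:
    error = exc  # raised because None is not a valid keyword name

# With only string keys, the keyword expansion succeeds.
normalized = normalize_namespaces(namespaces)
result = make_namespaces(**normalized)
```

Passing the dictionary positionally instead of expanding it with `**` would avoid the error as well; the renaming step is only needed when keyword expansion is kept.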

    T: feature C: API S: rejected 
    opened by mjpieters 13
  • Did I make a mistake?

    import requests
    from bs4 import BeautifulSoup
    
    
    url = 'https://www.ptwxz.com/html/0/296/39948.html'
    
    cookie = ""
    
    # Adjacent string literals are concatenated as-is, so each piece needs its own trailing space.
    user_agent = ('Mozilla/5.0 (iPhone; CPU iPhone OS 11_2_6 like '
                  'Mac OS X) AppleWebKit/604.1.34 (KHTML, like Gecko) '
                  'CriOS/65.0.3325.152 Mobile/15D100 Safari/604.1')
    
    headers = {'User-Agent': user_agent, 'cookie': cookie}
    
    '''
    socks5 = 'socks5://127.0.0.1:10086'  # because this website is blocked by the government of China, I have to use proxy
    rsp = requests.get(
        url,
        headers=headers,
        proxies={'http': socks5, 'https': socks5})
    '''
    rsp = requests.get(url, headers=headers)
    rsp.encoding = 'gbk'
    text = rsp.text
    
    soups = BeautifulSoup(text, 'html.parser')
    print(soups.prettify())
    print('_____________________')
    
    tag_center = soups.select_one('table[align="center"]')
    list_tag_center_next_siblings = list(tag_center.next_siblings)
    for i in list_tag_center_next_siblings[-4:]:
        print(type(i), i)
    
    tag_head = soups.select_one('head')
    list_tag_head_children = list(tag_head.children)
    for i in list_tag_head_children[-4:]:
        print(type(i), i, '|')
    
    for x, y in zip(reversed(list_tag_center_next_siblings), reversed(list_tag_head_children)):
        assert x is y
    
    
    tags_after_center = soups.select('table[align="center"] ~ *')
    print(tags_after_center)
    
    T: support 
    opened by yjqiang 12
  • Help: is soupsieve case-insensitive?

    In [122]: xml = """<Envelope><Header>...</Header></Envelope>"""
    
    In [123]: s = BeautifulSoup(xml, "xml")
    
    In [124]: s.select("header")
    Out[124]: [<Header>...</Header>]
    
    In [125]: s.select("Header")
    Out[125]: []
    

    Before, BeautifulSoup accepted (and I think required) case-sensitive tag names in selectors.

    Now that BeautifulSoup uses soupsieve, it seems that only lower-case selectors are supported.

    I'm really not sure why or if I can change this behaviour.

    T: bug S: confirmed 
    opened by dimaqq 11
  • `:has()` is no longer forgiving

    CSS has resolved that :has() should no longer be forgiving in order to mitigate some jQuery issues. We have never really implemented true forgiveness, only forgiveness of leading and trailing commas and empty entries. We will need to drop such support for :has(). We can deprecate the behavior or just remove it. I have no idea if anyone relies on such behavior.

    C: css-parsing skip-triage T: enhancement 
    opened by facelessuser 0
  • LXML does not currently generate wheels for Python 3.11 on Windows

    Due to this, SoupSieve currently ignores any testing on Python 3.11 that requires LXML. In time, once LXML properly generates wheels for Windows, we will once again enable testing of LXML for Windows on Python 3.11.

    Related LXML issue: https://bugs.launchpad.net/lxml/+bug/1977998

    T: maintenance skip-triage 
    opened by facelessuser 0
  • Interesting pseudo-class to keep an eye on: `:in()`

    https://drafts.csswg.org/css-cascade-6/#in-scope-selector

    It would be way too early to expect that this gets implemented officially or that the spec wouldn't change right under us, but something to keep an eye on. It may be fun to play with to see how the code would actually look and how useful it is.

    If I'm feeling adventurous, maybe implement it under something like :--soup-in() for experimental purposes.

    T: feature C: css-custom P: maybe skip-triage 
    opened by facelessuser 8
  • Consider possibly deprecating [attr!=value]

    There is no rush to do so, but moving forward, I think we will shy away from syntax that deviates from the CSS specification. We've started moving custom pseudo-classes over to have prefixes to avoid future conflicts, and it is possible that [attr!=value] could one day have some meaning in the CSS spec, and it could be different than what we currently do.

    IIRC this syntax was borrowed from jQuery, but TBH, it really doesn't add functionality as you can do the same with :not([attr=value]).
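The equivalence mentioned above can be checked with a short sketch. It assumes Beautiful Soup 4.7 or later (so that `select` is backed by Soup Sieve) and uses a made-up `data-kind` attribute and sample markup purely for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical sample markup; data-kind is an illustrative attribute name.
html = """
<p data-kind="a">one</p>
<p data-kind="b">two</p>
<p>three</p>
"""
soup = BeautifulSoup(html, "html.parser")

# Non-standard form borrowed from jQuery.
custom = soup.select("p[data-kind!=a]")
# Standard CSS form that should select the same elements.
standard = soup.select("p:not([data-kind=a])")

custom_text = [p.get_text() for p in custom]
standard_text = [p.get_text() for p in standard]
```

Both forms also match the paragraph that has no `data-kind` attribute at all, since negating `[data-kind=a]` includes elements where the attribute is absent.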

    T: feature skip-triage 
    opened by facelessuser 0
  • Experimental: Language tag canonicalization

    There is talk about potentially having the CSS level 4 :lang() pseudo-class canonicalizing tags and ranges to better help in situations such as: :lang(yue, zh-yue, zh-HK). The idea is you could then just do something like: :lang(yue). For best matches, it is recommended to canonicalize both the range used in the pseudo-class and the tag it is comparing. Canonicalization would also output in the extlang form.

    Generally * are ignored in ranges except when at the start: *-yue. Things like en-*-US resolve to en-US, though implicit matching between tags will still match en-xxx-US with en-US.
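The wildcard handling described above can be sketched in a few lines. This is not the code from the pull request: `strip_inner_wildcards` is a hypothetical helper that only shows the idea of dropping every `*` that is not the first subtag, before any canonicalization would take place.

```python
def strip_inner_wildcards(language_range):
    """Drop every '*' subtag except one in the leading position."""
    subtags = language_range.split("-")
    kept = [subtag for index, subtag in enumerate(subtags)
            if subtag != "*" or index == 0]
    return "-".join(kept)


inner = strip_inner_wildcards("en-*-US")    # wildcard in the middle is dropped
leading = strip_inner_wildcards("*-yue")    # leading wildcard is kept
plain = strip_inner_wildcards("zh-HK")      # ranges without wildcards are unchanged
```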

    Currently, in this pull, we have canonicalization implemented according to RFC5646, but there are still some questions:

    1. Should we abandon canonicalization, like we are currently doing, when the tag is invalid? Or do we just canonicalize the valid parts and ignore the failing parts?

    2. As mentioned above, ranges can use *, so we strip out non-essential *s and then canonicalize the range. This seems like the only sane approach, but am I misunderstanding something?

    3. It is only suggested that we MAY order variants to improve matching. We decided to go ahead and do this. Should we though? We have also omitted any failures if the required prefixes for a given variant are not found in the tag. This is to help ensure that the tag's variant order is the same as the range's variant order, as a specified range may not explicitly define all required ranges and rely on implicit matching to grab those. This seems reasonable, but should we abort canonicalization if the prefixes are not found? It is not a MUST requirement in the spec, only a SHOULD.

    Anyways, some things to think about. Technically we could merge this as is and simply disable the canonicalization and it should behave exactly how it did before. We could also enable this functionality under an experimental flag if we wanted. Right now, we are simply waiting to see what is decided for the official level 4 CSS spec.

    C: docs S: work-in-progress C: infrastructure C: tests C: css-matching 
    opened by facelessuser 1
Releases(2.3.2.post1)
  • 2.3.2.post1(Apr 14, 2022)

  • 2.3.2(Apr 6, 2022)

  • 2.3.1(Nov 11, 2021)

  • 2.3(Nov 3, 2021)

    2.3

    • NEW: Officially support Python 3.10.
    • NEW: Add static typing.
    • NEW: :has(), :is(), and :where() now use a forgiving selector list. While not as forgiving as CSS might be, it will forgive such things as empty sets and empty slots due to multiple consecutive commas, leading commas, or trailing commas. Essentially, these pseudo-classes will match all non-empty selectors and ignore empty ones. As the scraping environment is different than a browser environment, it was chosen not to aggressively forgive bad syntax and invalid features to ensure the user is alerted that their program may not perform as expected.
    • NEW: Add support to output a pretty print format of a compiled SelectorList for debug purposes.
    • FIX: Some small corner cases discovered with static typing.
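The limited forgiveness described in this release (ignoring empty slots caused by leading, trailing, or consecutive commas) can be illustrated with a small sketch. `split_forgiving` is a hypothetical helper, not Soup Sieve's parser: it only tracks bracket depth and does not handle commas inside quoted strings.

```python
def split_forgiving(selector_list):
    """Split on top-level commas and drop empty slots."""
    parts, current, depth = [], [], 0
    for char in selector_list:
        if char in "([":
            depth += 1
        elif char in ")]":
            depth -= 1
        if char == "," and depth == 0:
            # A top-level comma closes the current slot.
            parts.append("".join(current).strip())
            current = []
        else:
            current.append(char)
    parts.append("".join(current).strip())
    # Forgiveness here means only this: empty slots are ignored.
    return [part for part in parts if part]


selectors = split_forgiving(", a.b, , :is(c, d),")
```

Invalid selectors inside a non-empty slot are deliberately left untouched in this sketch, mirroring the release note's point that bad syntax should still be reported rather than silently dropped.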
  • 2.2.1(Mar 19, 2021)

  • 2.2(Feb 9, 2021)

    2.2

    • NEW: :link and :any-link no longer include <link> due to a change in the level 4 selector specification. This actually yields more sane results.
    • FIX: BeautifulSoup, when using find, is quite forgiving of odd types that a user may place in an element's attribute value. Soup Sieve will also now be more forgiving and attempt to match these unexpected values in a sane manner by normalizing them before compare. (#212)
  • 2.1.0(Dec 10, 2020)

    2.1.0

    • NEW: Officially support Python 3.9.
    • NEW: Drop official support for Python 3.5.
    • NEW: In order to avoid conflicts with future CSS specification changes, non-standard pseudo classes will now start with the :-soup- prefix. As a consequence, :contains() will now be known as :-soup-contains(), though for a time the deprecated form of :contains() will still be allowed with a warning that users should migrate over to :-soup-contains().
    • NEW: Added new non-standard pseudo class :-soup-contains-own() which operates similar to :-soup-contains() except that it only looks at text nodes directly associated with the currently scoped element and not its descendants.
    • FIX: Import bs4 globally instead of in local functions, as it appears there are no adverse effects due to circular imports: bs4 does not immediately reference soupsieve functions and soupsieve does not immediately reference bs4 functions. This should give a performance boost to functions that had previously included bs4 locally.
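The difference between the two non-standard pseudo-classes named in this release can be seen in a short sketch. It assumes Beautiful Soup with Soup Sieve 2.1 or later installed; the markup is made up for illustration.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the word "inner" only appears inside the <span>.
soup = BeautifulSoup("<div>outer <span>inner</span></div>", "html.parser")

# :-soup-contains() also looks at the text of descendants.
with_descendants = soup.select('div:-soup-contains("inner")')

# :-soup-contains-own() only looks at text nodes directly under the element.
own_text_only = soup.select('div:-soup-contains-own("inner")')
span_own = soup.select('span:-soup-contains-own("inner")')
```

Here the `div` is matched by `:-soup-contains()` because its descendant holds the text, while `:-soup-contains-own()` only matches the `span` that directly owns it.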
  • 2.0.1(May 16, 2020)

  • 1.9.6(May 16, 2020)

    1.9.6

    Note: Last version for Python 2.7

    • FIX: Prune dead code.
    • FIX: Corner case with splitting namespace and tag name that have an escaped |.
  • 2.0.0(Feb 23, 2020)

    2.0.0

    • NEW: SelectorSyntaxError is derived from Exception not SyntaxError.
    • NEW: Remove deprecated comments and icomments from the API.
    • NEW: Drop support for EOL Python versions (Python 2 and Python < 3.5).
    • FIX: Corner case with splitting namespace and tag name that have an escaped |.
  • 1.9.5(Nov 2, 2019)

  • 1.9.4(Sep 26, 2019)

    1.9.4

    • FIX: :checked rule was too strict with option elements. The specification for :checked does not require an option element to be under a select element.
    • FIX: Fix level 4 :lang() wildcard match handling with singletons. Implicit wildcard matching should not match any singleton. Explicit wildcard matching (* in the language range: *-US) is allowed to match singletons.
  • 1.9.3(Aug 18, 2019)

    1.9.3

    • FIX: [attr!=value] pattern was mistakenly using :not([attr|=value]) logic instead of :not([attr=value]).
    • FIX: Remove undocumented _QUIRKS mode flag. Beautiful Soup was meant to use it to help with transition to Soup Sieve, but never released with it. Help with transition at this point is no longer needed.
  • 1.9.2(Jun 23, 2019)

    1.9.2

    • FIX: Shortcut last descendant calculation if possible for performance.
    • FIX: Fix issue where Doctype strings can be mistaken for a normal text node in some cases.
    • FIX: A top level tag is not a :root tag if it has sibling text nodes or tag nodes. This is an issue that mostly manifests when using html.parser as the parser will allow multiple root nodes.
  • 1.9.1(Apr 13, 2019)

    1.9.1

    • FIX: :root, :contains(), :default, :indeterminate, :lang(), and :dir() will properly account for HTML iframe elements in their logic when selecting or matching an element. Their logic will be restricted to the document for which the element under consideration applies.
    • FIX: HTML pseudo-classes will check that all key elements checked are in the XHTML namespace (HTML parsers that do not provide namespaces will assume the XHTML namespace).
    • FIX: Ensure that all pseudo-class names are case insensitive and allow CSS escapes.
  • 1.9.0(Mar 26, 2019)

    1.9.0

    • NEW: Allow :contains() to accept a list of text to search for. (#115)
    • NEW: Add new escape function for escaping CSS identifiers. (#125)
    • NEW: Deprecate comments and icomments functions in the API to ensure Soup Sieve focuses only on CSS selectors. comments and icomments will most likely be removed in 2.0. (#130)
    • NEW: Add Python 3.8 support. (#133)
    • FIX: Don't install test files when installing the soupsieve package. (#111)
    • FIX: Improve efficiency of :contains() comparison.
    • FIX: Null characters should translate to the Unicode REPLACEMENT CHARACTER (U+FFFD) according to the specification. This applies to CSS escaped NULL characters as well. (#124)
    • FIX: Escaped EOF should translate to U+FFFD outside of CSS strings. In a string, they should just be ignored, but as there is no case where we could resolve such a string and still have a valid selector, string handling remains the same. (#128)
  • 1.8.0(Feb 17, 2019)

    1.8.0

    • NEW: Add custom selector support. (#92)(#108)
    • FIX: Small tweak to CSS identifier pattern to ensure it matches the CSS specification exactly. Specifically, you can't have an identifier of only -. (#107)
    • FIX: CSS string patterns should allow escaping newlines to span strings across multiple lines. (#107)
    • FIX: Newline regular expression for CSS newlines should treat \r\n as a single character, especially in cases such as string escapes: \\\r\n. (#107)
    • FIX: Allow -- as a valid identifier or identifier start. (#107)
    • FIX: Bad CSS syntax now raises a SelectorSyntaxError, which is still currently derived from SyntaxError, but will most likely be derived from Exception in the future.
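The newline fixes in this release can be pictured with a small regular-expression sketch. The patterns below are assumptions written for illustration, not the ones used in Soup Sieve: the point is only that `\r\n` has to be tried before the single-character alternatives so that an escaped `\r\n` is consumed as one newline.

```python
import re

# \r\n is listed first so it is matched as a single CSS newline.
NEWLINE = re.compile(r"\r\n|[\r\n\f]")

# A backslash followed by one CSS newline continues a string on the next line.
LINE_CONTINUATION = re.compile(r"\\(?:\r\n|[\r\n\f])")


def join_continued_lines(css_string_body):
    """Remove escaped newlines so a multi-line CSS string reads as one line."""
    return LINE_CONTINUATION.sub("", css_string_body)


newlines = NEWLINE.findall("a\r\nb\nc\rd\fe")
joined = join_continued_lines("multi\\\r\nline")
```

If the continuation pattern treated `\r` and `\n` as separate newlines, the second call would leave a stray `\n` in the result.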
  • 1.7.3(Jan 23, 2019)

    1.7.3

    • FIX: Fix regression with tag names in regards to case sensitivity, and ensure there are tests to prevent breakage in the future.
    • FIX: XHTML should always be case sensitive like XML.
  • 1.7.2(Jan 18, 2019)

    1.7.2

    • FIX: Fix HTML detection for type selector.
    • FIX: Fixes for :enabled and :disabled.
    • FIX: Provide a way for Beautiful Soup to parse selectors in a quirks mode to mimic some of the quirks of the old select method prior to Soup Sieve, but with warnings. This is to help old scripts to not break during the transitional period with newest Beautiful Soup. In the future, these quirks will raise an exception as Soup Sieve requires selectors to follow the CSS specification.
  • 1.7.1(Jan 13, 2019)

    1.7.1

    • FIX: Fix issue with :has() selector where a leading combinator can only be provided in the first selector in a relative selector list.
  • 1.7.0(Jan 10, 2019)

    1.7.0

    • NEW: Add support for :in-range and :out-of-range selectors. (#60)
    • NEW: Add support for :defined selector. (#76)
    • FIX: Fix pickling issue when compiled selector contains a NullSelector object. (#70)
    • FIX: Better exception messages in the CSS selector parser and fix a position reporting issue that can occur in some exceptions. (#72, #73)
    • FIX: Don't compare prefixes when evaluating attribute namespaces, compare the actual namespace. (#75)
    • FIX: Split whitespace attribute lists by all whitespace characters, not just space.
    • FIX: :nth-* patterns were converting numbers to base 16 when they should have been converting to base 10.
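The effect of the base mix-up in the last fix is easy to see in isolation. `parse_nth_integer` is a hypothetical helper used only to show how the same digits in an `:nth-*` argument yield a different position depending on the base passed to `int`.

```python
def parse_nth_integer(text, base=10):
    """Convert the numeric part of an :nth-* argument to an integer."""
    return int(text, base)


# For a selector such as :nth-child(10):
correct = parse_nth_integer("10")           # decimal: the 10th child
buggy = parse_nth_integer("10", base=16)    # hexadecimal: the 16th child
```

Single-digit arguments behave the same in both bases, which may be why a bug of this kind can go unnoticed for small positions.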
  • 1.6.2(Jan 4, 2019)

    1.6.2

    • FIX: Fix pattern compile issues on Python < 2.7.4.
    • FIX: Don't use \d in Unicode Re patterns as they will contain characters outside the range of [0-9].
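The `\d` issue can be demonstrated with the standard `re` module alone. In Python 3 string patterns, `\d` matches any Unicode decimal digit, so a CSS parser that only wants ASCII digits is better served by an explicit `[0-9]` class. The sample character below is chosen for illustration.

```python
import re

arabic_indic_three = "\u0663"  # ARABIC-INDIC DIGIT THREE, a non-ASCII decimal digit

# \d accepts the non-ASCII digit in a Unicode (str) pattern.
unicode_digit_match = re.fullmatch(r"\d", arabic_indic_three) is not None

# An explicit ASCII class rejects it.
ascii_class_match = re.fullmatch(r"[0-9]", arabic_indic_three) is not None

# re.ASCII restricts \d to [0-9] as an alternative.
ascii_flag_match = re.fullmatch(r"\d", arabic_indic_three, re.ASCII) is not None
```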
  • 1.6.1(Jan 2, 2019)

  • 1.6.0(Dec 31, 2018)

  • 1.5.0(Dec 28, 2018)

    1.5.0

    • NEW: Add select_one method like Beautiful Soup has.
    • NEW: Add :dir() selector (HTML only).
    • FIX: Fix handling issues of HTML fragments (elements without a BeautifulSoup object as a parent).
    • FIX: Fix internal nth range check.
  • 1.4.0(Dec 27, 2018)

    1.4.0

    • NEW: Throw NotImplementedError for at-rules: @page, etc.
    • NEW: Match nothing for :host, :host(), and :host-context().
    • NEW: Add support for :read-write and :read-only.
    • NEW: Selector patterns can be annotated with CSS comments.
    • FIX: \r, \n, and \f cannot be escaped with \ in CSS. You must use Unicode escapes.
  • 1.3.1(Dec 24, 2018)

  • 1.3.0(Dec 22, 2018)

    1.3.0

    • NEW: Add support for :scope.
    • NEW: :user-invalid, :playing, :paused, and :local-link will not cause a failure, but all will match nothing as their use cases are not possible in an environment outside a web browser.
    • FIX: Fix [attr~=value] handling of whitespace. According to the spec, if the value contains whitespace, or is an empty string, it should not match anything.
    • FIX: Precompile internal patterns for pseudo-classes to prevent having to parse them again.
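The `[attr~=value]` rule mentioned in this release can be written out as a small matching function. This is a sketch of the rule as stated, not Soup Sieve's implementation; `matches_word` and the whitespace pattern are assumptions made for illustration.

```python
import re

# CSS whitespace: space, tab, carriage return, line feed, form feed.
CSS_WHITESPACE = re.compile(r"[ \t\r\n\f]+")


def matches_word(attr_value, word):
    """Return True if `word` is one of the whitespace-separated words in `attr_value`."""
    # An empty value, or one containing whitespace, can never match.
    if not word or CSS_WHITESPACE.search(word):
        return False
    words = CSS_WHITESPACE.split(attr_value.strip(" \t\r\n\f"))
    return word in words


tab_separated = matches_word("a b\tc", "c")   # splitting on all whitespace finds "c"
empty_value = matches_word("a b", "")         # empty value matches nothing
spaced_value = matches_word("a b", "a b")     # value with whitespace matches nothing
```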
  • 1.2.1(Dec 20, 2018)

    1.2.1

    • FIX: More descriptive exceptions. Exceptions will also now mention position in the pattern that is problematic.
    • FIX: filter ignores NavigableString objects in normal iterables and Tag iterables. Basically, it filters all Beautiful Soup document parts regardless of iterable type, whereas it used to only filter out a NavigableString in a Tag object. This is viewed as fixing an inconsistency.
    • FIX: DEBUG flag has been added to help with debugging CSS selector parsing. This is mainly for development.
    • FIX: If forced to search for language in meta tag, and no language is found, cache that there is no language in the meta tag to prevent searching again during the current select.
    • FIX: If a non BeautifulSoup/Tag object is given to the API to compare against, raise a TypeError.
  • 1.2.0(Dec 19, 2018)

Web Content Retrieval for Humans™

Lassie Lassie is a Python library for retrieving basic content from websites. Usage import lassie lassie.fetch('http://www.youtube.com/watch?v

Mike Helmick 570 Dec 19, 2022
Simple proxy scraper made by using ProxyScrape's api.

What is Moon? Moon is a lightweight and fast proxy scraper made by using ProxyScrape's api. What can i do with this? You can use proxies for varietys

1 Jul 04, 2022
Automated data scraper for Thailand COVID-19 data

The Researcher COVID data Automated data scraper for Thailand COVID-19 data Accessing the Data 1st Dose Provincial Vaccination Data 2nd Dose Provincia

Porames Vatanaprasan 31 Apr 17, 2022
Script used to download data for stocks.

This script is useful for downloading stock market data for a wide range of companies specified by their respective tickers. The script reads in the d

Carmelo Gonzales 71 Oct 04, 2022
A helper library to scrape data from Instagram effortlessly, using the Influencer Hunters APIs.

Instagram Scraper A utility library to scrape data from Instagram hassle-free Go to the website » View Demo · Report Bug · Request Feature About The

2 Jul 06, 2022
A tool for scraping and organizing data from NewsBank API searches

nbscraper Overview This simple tool automates the process of copying, pasting, and organizing data from NewsBank API searches. Curerntly, nbscrape onl

0 Jun 17, 2021
A Python module to bypass Cloudflare's anti-bot page.

cloudflare-scrape A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Reque

3k Jan 04, 2023
Scraping the data from each page of biocides listed on the BAUA website into a csv file

Scraping the data from each page of biocides listed on the BAUA website into a csv file

Eric DE MARIA 1 Nov 30, 2021
Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX)

mcc-mnc.com-webscraper Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX) A Python script for web scraping mcc-mnc.com Link: mcc

Anton Ivarsson 1 Nov 07, 2021
An experiment to deploy a serverless infrastructure for a scrapy project.

Serverless Scrapy project This project aims to evaluate the feasibility of an architecture based on serverless technology for a web crawler using scra

José Ferraz Neto 5 Jul 08, 2022
Creating Scrapy scrapers via the Django admin interface

django-dynamic-scraper Django Dynamic Scraper (DDS) is an app for Django which builds on top of the scraping framework Scrapy and lets you create and

Holger Drewes 1.1k Dec 17, 2022
A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

A Python Covid-19 cases tracker that scrapes data off the web and presents the number of Cases, Recovered Cases, and Deaths that occurred because of the pandemic.

Alex Papadopoulos 1 Nov 13, 2021
Web Scraping Instagram photos with Selenium by only using a hashtag.

Web-Scraping-Instagram This project is used to automatically obtain images by web scraping Instagram with Selenium in Python. The required input will

Sandro Agama 3 Nov 24, 2022
A list of Python Bots used to extract data from several websites

A list of Python Bots used to extract data from several websites. Data extraction is for products on e-commerce (ecommerce) websites. Data fetched i

Sahil Ladhani 1 Jan 14, 2022
Scrapes Every Email Address of Every Society in Every University

society-email-scrape Site Live at https://kcsoc.github.io/society-email-scrape/ How to automatically generate new data Go to unis.yml Add your uni Cre

Krishna Consciousness Society 18 Dec 14, 2022
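Make git downloads from GitHub 1000 times faster for users in China!

Preface GitHub hosts many great projects, but connecting to GitHub from within China is very slow, and each time a plugin or some other tool is needed to work around it. This time I made a small tool of my own: enter the original GitHub address and it is automatically replaced with a proxy address, so everyone can download faster. Installation pip install cit Main features and usage Main features change Converts the target address to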

35 Aug 29, 2022
This Spider/Bot is developed using Python and based on Scrapy Framework to Fetch some items information from Amazon

- Hello, This Project Contains Amazon Web-bot. - I've developed this bot for fetching some items information on Amazon. - Scrapy Framework in Python is

Khaled Tofailieh 4 Feb 13, 2022
A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response and scrap complete article - No need to write scrappers for articles fetching anymore

GNews 🚩 A Happy and lightweight Python Package that searches Google News RSS Feed and returns a usable JSON response 🚩 As well as you can fetch full

Muhammad Abdullah 273 Dec 31, 2022
Python web scraper

Website scraper Web scraping project in Python. Created for learning purposes. Start Install python Update configuration with websites Launch script

Nogueira Vitor 1 Dec 19, 2021
Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes.

Pyrics Pyrics is a tool to scrape lyrics, get rhymes, generate relevant lyrics with rhymes. ./test/run.py provides the full function in terminal cmd

MisterDK 1 Feb 12, 2022