A Python module to bypass Cloudflare's anti-bot page.

Overview

cloudflare-scrape

A simple Python module to bypass Cloudflare's anti-bot page (also known as "I'm Under Attack Mode", or IUAM), implemented with Requests. Python versions 2.6 - 3.7 are supported. Cloudflare changes their techniques periodically, so I will update this repo frequently.

This can be useful if you wish to scrape or crawl a website protected with Cloudflare. Cloudflare's anti-bot page currently just checks if the client supports JavaScript, though they may add additional techniques in the future.

Due to Cloudflare continually changing and hardening their protection page, cloudflare-scrape requires Node.js to solve JavaScript challenges. This allows the script to easily impersonate a regular web browser without explicitly deobfuscating and parsing Cloudflare's JavaScript.

Note: This only works when the regular Cloudflare anti-bot page is enabled (the "Checking your browser before accessing..." loading page). If there is a reCAPTCHA challenge, you're out of luck. Thankfully, the JavaScript check page is much more common.

For reference, this is the default message Cloudflare uses for these sorts of pages:

Checking your browser before accessing website.com.

This process is automatic. Your browser will redirect to your requested content shortly.

Please allow up to 5 seconds...

Any script using cloudflare-scrape will sleep for 5 seconds on its first visit to any site with Cloudflare anti-bot enabled, though no delay will occur after the first request.

Installation

Simply run pip install cfscrape. You can upgrade with pip install -U cfscrape. The PyPI package is at https://pypi.python.org/pypi/cfscrape/

Alternatively, clone this repository and run python setup.py install.

Node.js dependency

Node.js version 10 or above is required to interpret Cloudflare's obfuscated JavaScript challenge.

Your machine may already have Node installed (check with node -v). If not, you can install it with apt-get install nodejs on Ubuntu >= 18.04 and Debian >= 9 and brew install node on macOS. Otherwise, you can get it from Node's download page or their package manager installation page.
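If you want your script to fail fast with a clear message, here is a minimal sketch of automating that check by parsing `node -v` output. (The version strings below are only examples, and the helper name is ours, not part of cfscrape.)

```python
def node_version_ok(version_output, minimum=10):
    """Return True if a `node -v` string such as 'v10.16.0' meets the minimum major version."""
    major = int(version_output.strip().lstrip("v").split(".")[0])
    return major >= minimum

print(node_version_ok("v10.16.0"))  # True
print(node_version_ok("v8.11.3"))   # False
```

In a real script you would feed this the output of `subprocess.check_output(["node", "--version"])` and raise an error when it returns False.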

Updates

Cloudflare regularly modifies their anti-bot protection page and improves their bot detection capabilities.

If you notice that the anti-bot page has changed, or if this module suddenly stops working, please create a GitHub issue so that I can update the code accordingly.

  • Many issues are a result of users not updating to the latest release of this project. Before filing an issue, please run the following command to update cloudflare-scrape to the latest version:
pip install -U cfscrape

If you are still encountering a problem, create a GitHub issue and please include:

  • The version number from pip show cfscrape.
  • The relevant code snippet that's experiencing an issue or raising an exception.
  • The full exception and traceback, if applicable.
  • The URL of the Cloudflare-protected page which the script does not work on.
  • A Pastebin or Gist containing the HTML source of the protected page.

If you've upgraded and are still experiencing problems, create a GitHub issue and fill out the pertinent information.

Usage

The simplest way to use cloudflare-scrape is by calling create_scraper().

import cfscrape

scraper = cfscrape.create_scraper()  # returns a CloudflareScraper instance
# Or: scraper = cfscrape.CloudflareScraper()  # CloudflareScraper inherits from requests.Session
print(scraper.get("http://somesite.com").content)  # => "<!DOCTYPE html><html><head>..."

That's it. Any requests made from this session object to websites protected by Cloudflare anti-bot will be handled automatically. Websites not using Cloudflare will be treated normally. You don't need to configure or call anything further, and you can effectively treat all websites as if they're not protected with anything.

You use cloudflare-scrape exactly the same way you use Requests. (CloudflareScraper works identically to a Requests Session object.) Just instead of calling requests.get() or requests.post(), you call scraper.get() or scraper.post(). Consult Requests' documentation for more information.

Options

Existing session

If you already have an existing Requests session, you can pass it to create_scraper() to continue using that session.

import requests
import cfscrape

session = requests.session()
session.headers = ...
scraper = cfscrape.create_scraper(sess=session)

Unfortunately, not all of Requests' session attributes are easily transferable, so if you run into problems with this, you should replace your initial sess = requests.session() call with sess = cfscrape.create_scraper().
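Under the hood, create_scraper(sess=...) simply copies a fixed list of attributes from your session onto the new scraper. Here is a rough, self-contained sketch of that transfer; the FakeSession class is a stand-in for requests.Session so the example runs without dependencies:

```python
class FakeSession(object):
    """Stand-in for requests.Session, just to illustrate the attribute transfer."""
    pass

def copy_session_attrs(src, dst):
    # Attributes cloudflare-scrape attempts to carry over from an existing session
    for attr in ("auth", "cert", "cookies", "headers", "hooks", "params", "proxies"):
        val = getattr(src, attr, None)
        if val:  # only copy attributes the source session actually set
            setattr(dst, attr, val)

src = FakeSession()
src.headers = {"User-Agent": "Example/1.0"}
dst = FakeSession()
copy_session_attrs(src, dst)
print(dst.headers)  # {'User-Agent': 'Example/1.0'}
```

Attributes with more complex internal state (adapters, connection pools) are not on this list, which is why starting from cfscrape.create_scraper() directly is the safer option.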

Delays

Normally, when a browser is faced with a Cloudflare IUAM challenge page, Cloudflare requires the browser to wait 5 seconds before submitting the challenge answer. If a website is under heavy load, sometimes this may fail. One solution is to increase the delay (perhaps to 10 or 15 seconds, depending on the website). If you would like to override this delay, pass the delay keyword argument to create_scraper() or CloudflareScraper().

There is no need to override this delay unless cloudflare-scrape generates an error recommending you increase the delay.

scraper = cfscrape.create_scraper(delay=10)

Integration

It's easy to integrate cloudflare-scrape with other applications and tools. Cloudflare uses two cookies as tokens: one to verify you made it past their challenge page and one to track your session. To bypass the challenge page, simply include both of these cookies (with the appropriate user-agent) in all HTTP requests you make.

To retrieve just the cookies (as a dictionary), use cfscrape.get_tokens(). To retrieve them as a full Cookie HTTP header, use cfscrape.get_cookie_string().
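The two formats carry the same information; joining the token dict into a Cookie header value is a one-liner, mirroring what get_cookie_string does internally. (The token values here are placeholders, not real Cloudflare tokens.)

```python
# Placeholder tokens standing in for the dict returned by cfscrape.get_tokens()
tokens = {"cf_clearance": "example-clearance-token", "__cfduid": "example-cfduid"}

cookie_string = "; ".join("=".join(pair) for pair in tokens.items())
print(cookie_string)  # cf_clearance=example-clearance-token; __cfduid=example-cfduid
```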

get_tokens and get_cookie_string both accept Requests' usual keyword arguments (like get_tokens(url, proxies={"http": "socks5://localhost:9050"})). Please read Requests' documentation on request arguments for more information.

User-Agent Handling

The two integration functions return a tuple of (cookie, user_agent_string). You must use the same user-agent string for obtaining tokens and for making requests with those tokens, otherwise Cloudflare will flag you as a bot. That means you have to pass the returned user_agent_string to whatever script, tool, or service you are passing the tokens to (e.g. curl, or a specialized scraping tool), and it must use that passed user-agent when it makes HTTP requests.

If your tool already has a particular user-agent configured, you can make cloudflare-scrape use it with cfscrape.get_tokens("http://somesite.com/", user_agent="User-Agent Here") (also works for get_cookie_string). Otherwise, a randomly selected user-agent will be used.
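To make the pairing concrete, here is a sketch of handing both values to a plain Requests call. The token and user-agent values are placeholders, and the actual network call is left commented out:

```python
# Placeholders standing in for the (tokens, user_agent) tuple from cfscrape.get_tokens()
tokens = {"cf_clearance": "example-token", "__cfduid": "example-id"}
user_agent = "Example/1.0 (placeholder)"

# The User-Agent header must be identical to the one that solved the challenge
headers = {"User-Agent": user_agent}
# import requests
# resp = requests.get("http://somesite.com/", headers=headers, cookies=tokens)
print(headers["User-Agent"])  # Example/1.0 (placeholder)
```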


Integration examples

Remember, you must always use the same user-agent when retrieving or using these cookies. These functions all return a tuple of (cookie_dict, user_agent_string).

Retrieving a cookie dict through a proxy

get_tokens is a convenience function for returning a Python dict containing Cloudflare's session cookies. For demonstration, we will configure this request to use a proxy. (Please note that if you request Cloudflare clearance tokens through a proxy, you must always use the same proxy when those tokens are passed to the server. Cloudflare requires that the challenge-solving IP and the visitor IP stay the same.)

If you do not wish to use a proxy, just don't pass the proxies keyword argument. These convenience functions support all of Requests' normal keyword arguments, like params, data, and headers.

import cfscrape

proxies = {"http": "http://localhost:8080", "https": "http://localhost:8080"}
tokens, user_agent = cfscrape.get_tokens("http://somesite.com", proxies=proxies)
print(tokens)
# => {'cf_clearance': 'c8f913c707b818b47aa328d81cab57c349b1eee5-1426733163-3600', '__cfduid': 'dd8ec03dfdbcb8c2ea63e920f1335c1001426733158'}

Retrieving a cookie string

get_cookie_string is a convenience function for returning the tokens as a string for use as a Cookie HTTP header value.

This is useful when crafting an HTTP request manually, or working with an external application or library that passes on raw cookie headers.

import cfscrape
request = "GET / HTTP/1.1\r\n"

cookie_value, user_agent = cfscrape.get_cookie_string("http://somesite.com")
request += "Cookie: %s\r\nUser-Agent: %s\r\n" % (cookie_value, user_agent)

print(request)

# GET / HTTP/1.1\r\n
# Cookie: cf_clearance=c8f913c707b818b47aa328d81cab57c349b1eee5-1426733163-3600; __cfduid=dd8ec03dfdbcb8c2ea63e920f1335c1001426733158
# User-Agent: Some/User-Agent String

curl example

Here is an example of integrating cloudflare-scrape with curl. As you can see, all you have to do is pass the cookies and user-agent to curl.

import subprocess
import cfscrape

# With get_tokens() cookie dict:

# tokens, user_agent = cfscrape.get_tokens("http://somesite.com")
# cookie_arg = "cf_clearance=%s; __cfduid=%s" % (tokens["cf_clearance"], tokens["__cfduid"])

# With get_cookie_string() cookie header; recommended for curl and similar external applications:

cookie_arg, user_agent = cfscrape.get_cookie_string("http://somesite.com")

# With a custom user-agent string you can optionally provide:

# ua = "Scraping Bot"
# cookie_arg, user_agent = cfscrape.get_cookie_string("http://somesite.com", user_agent=ua)

result = subprocess.check_output(["curl", "--cookie", cookie_arg, "-A", user_agent, "http://somesite.com"])

Here is a trimmed-down version that prints the page contents of any site protected with Cloudflare, via curl. (Warning: shell=True can be dangerous to use with subprocess in real code.)

url = "http://somesite.com"
cookie_arg, user_agent = cfscrape.get_cookie_string(url)
cmd = 'curl --cookie "{cookie_arg}" -A "{user_agent}" "{url}"'
print(subprocess.check_output(cmd.format(cookie_arg=cookie_arg, user_agent=user_agent, url=url), shell=True))
Comments
  • Captcha issues


    The latest version of cfscrape should not encounter captchas, unless you're using Tor or another IP that Cloudflare has blacklisted. If you're getting a captcha error, first please run pip install -U cfscrape and try again. If you're still getting an error, please leave a comment.


    Please put all captcha challenge-related issues here.

    Please run the following to determine the OpenSSL version compiled with your Python binary and include the output in your comment:

    $ python3 -c 'import ssl; print(ssl.OPENSSL_VERSION)'
    OpenSSL 1.1.1b  26 Feb 2019
    

    (Or python instead of python3 if running on Python 2.)

    opened by Anorov 129
  • ReferenceError: atob is not defined


    Hello,

    I've been getting the error below for a couple of days: js2py.internals.simplex.JsException: ReferenceError: atob is not defined

    File "/home/maxx/.local/lib/python3.6/site-packages/js2py/base.py", line 1074, in get
      return self.prototype.get(prop, throw)
    File "/home/maxx/.local/lib/python3.6/site-packages/js2py/base.py", line 1079, in get
      raise MakeError('ReferenceError', '%s is not defined' % prop)
    js2py.internals.simplex.JsException: ReferenceError: atob is not defined

    Is anyone experiencing the same?

    Thank you!

    opened by Krylanc3lo 115
  • Pure Python CF parser


    EDIT: I updated it myself after all, see this comment.


    ORIGINAL POST: The logic inside the challenge is straightforward: it uses JSFuck plus some arithmetic. There's this project called UniversalScrapers (from the non-official, underground XBMC scene) where I first saw this; it's based on Anorov's but does the solving entirely in inline Python (no Node.js or js2py needed). It is broken now after these latest updates to the CF challenge, but it's a nice reference.

    I wish we could work on updates for this as it's more lightweight than the proposed alternatives.
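As a toy illustration of the JSFuck-style arithmetic involved, this sketch mirrors the substitutions that the parseJSString helper in the code below applies to turn a JSFuck digit expression into an integer. (The expression here is a simplified example, not a real challenge fragment.)

```python
def parse_jsfuck_number(expr):
    # Mirror of the parseJSString substitutions:
    #   !+[] -> 1,  !![] -> 1,  [] -> 0,  ( -> str(
    # Each parenthesised group evaluates to a digit string, and JavaScript's
    # string concatenation between groups becomes Python str + str.
    offset = 1 if expr[0] == '+' else 0
    return int(eval(expr.replace('!+[]', '1')
                        .replace('!![]', '1')
                        .replace('[]', '0')
                        .replace('(', 'str(')[offset:]))

# (1+1) and (1+1+1) become "2" and "3", which concatenate to "23"
print(parse_jsfuck_number("(!+[]+!![])+(!+[]+!![]+!![])"))  # 23
```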

    OLD CODE (needs fixes):

    import logging
    import random
    import re
    # Disables "InsecureRequestWarning: Unverified HTTPS request is being made" warnings.
    import requests
    from requests.packages.urllib3.exceptions import InsecureRequestWarning
    requests.packages.urllib3.disable_warnings(InsecureRequestWarning)
    from requests.sessions import Session
    from copy import deepcopy
    from time import sleep
    
    try:
        from urlparse import urlparse
    except ImportError:
        from urllib.parse import urlparse
    
    DEFAULT_USER_AGENTS = [
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/65.0.3325.181 Safari/537.36",
        "Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Ubuntu Chromium/65.0.3325.181 Chrome/65.0.3325.181 Safari/537.36",
        "Mozilla/5.0 (Linux; Android 7.0; Moto G (5) Build/NPPS25.137-93-8) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.137 Mobile Safari/537.36",
        "Mozilla/5.0 (iPhone; CPU iPhone OS 7_0_4 like Mac OS X) AppleWebKit/537.51.1 (KHTML, like Gecko) Version/7.0 Mobile/11B554a Safari/9537.53",
        "Mozilla/5.0 (Windows NT 6.1; Win64; x64; rv:60.0) Gecko/20100101 Firefox/60.0",
        "Mozilla/5.0 (Macintosh; Intel Mac OS X 10.13; rv:59.0) Gecko/20100101 Firefox/59.0",
        "Mozilla/5.0 (Windows NT 6.3; Win64; x64; rv:57.0) Gecko/20100101 Firefox/57.0"
    ]
    
    DEFAULT_USER_AGENT = random.choice(DEFAULT_USER_AGENTS)
    
    BUG_REPORT = ("Cloudflare may have changed their technique, or there may be a bug in the script.\n\nPlease read " "https://github.com/Anorov/cloudflare-scrape#updates, then file a "
    "bug report at https://github.com/Anorov/cloudflare-scrape/issues.")
    
    
    class CloudflareScraper(Session):
        def __init__(self, *args, **kwargs):
            super(CloudflareScraper, self).__init__(*args, **kwargs)
    
            if "requests" in self.headers["User-Agent"]:
                # Spoof Firefox on Linux if no custom User-Agent has been set
                self.headers["User-Agent"] = DEFAULT_USER_AGENT
    
        def request(self, method, url, *args, **kwargs):
            resp = super(CloudflareScraper, self).request(method, url, *args, **kwargs)
    
            # Check if Cloudflare anti-bot is on
            if ( resp.status_code == 503
                 and resp.headers.get("Server", "").startswith("cloudflare")
                 and b"jschl_vc" in resp.content
                 and b"jschl_answer" in resp.content
            ):
                return self.solve_cf_challenge(resp, **kwargs)
    
            # Otherwise, no Cloudflare anti-bot detected
            return resp
    
        def solve_cf_challenge(self, resp, **original_kwargs):
            sleep(8)  # Cloudflare requires a delay before solving the challenge
    
            body = resp.text
            parsed_url = urlparse(resp.url)
            domain = parsed_url.netloc
            submit_url = "%s://%s/cdn-cgi/l/chk_jschl" % (parsed_url.scheme, domain)
    
            cloudflare_kwargs = deepcopy(original_kwargs)
            params = cloudflare_kwargs.setdefault("params", {})
            headers = cloudflare_kwargs.setdefault("headers", {})
            headers["Referer"] = resp.url
            
            try:
                params["jschl_vc"] = re.search(r'name="jschl_vc" value="(\w+)"', body).group(1)
                params["pass"] = re.search(r'name="pass" value="(.+?)"', body).group(1)
                params["s"] = re.search(r'name="s" value="(.+?)"', body).group(1)
    
                # Extract the arithmetic operation
                init = re.findall('setTimeout\(function\(\){\s*var.*?.*:(.*?)}', body)[-1]
                builder = re.findall(r"challenge-form\'\);\s*(.*)a.v", body)[0]
                if '/' in init:
                    init = init.split('/')
                    decryptVal = self.parseJSString(init[0]) / float(self.parseJSString(init[1]))
                else:
                    decryptVal = self.parseJSString(init)
                lines = builder.split(';')
    
                for line in lines:
                    if len(line)>0 and '=' in line:
                        sections=line.split('=')
                        if '/' in sections[1]:
                            subsecs = sections[1].split('/')
                            line_val = self.parseJSString(subsecs[0]) / float(self.parseJSString(subsecs[1]))
                        else:
                            line_val = self.parseJSString(sections[1])
                        decryptVal = float(eval(('%.16f'%decryptVal)+sections[0][-1]+('%.16f'%line_val)))
    
                answer = float('%.10f'%decryptVal) + len(domain)
    
    
            except Exception as e:
                # Something is wrong with the page.
                # This may indicate Cloudflare has changed their anti-bot
                # technique. If you see this and are running the latest version,
                # please open a GitHub issue so I can update the code accordingly.
                logging.error("[!] %s Unable to parse Cloudflare anti-bots page. "
                              "Try upgrading cloudflare-scrape, or submit a bug report "
                              "if you are running the latest version. Please read "
                              "https://github.com/Anorov/cloudflare-scrape#updates "
                              "before submitting a bug report." % e)
                raise
    
            try: params["jschl_answer"] = str(answer) #str(int(jsunfuck.cfunfuck(js)) + len(domain))
            except: pass
    
            # Requests transforms any request into a GET after a redirect,
            # so the redirect has to be handled manually here to allow for
            # performing other types of requests even as the first request.
            method = resp.request.method
            cloudflare_kwargs["allow_redirects"] = False
    
            redirect = self.request(method, submit_url, **cloudflare_kwargs)
            redirect_location = urlparse(redirect.headers["Location"])
    
            if not redirect_location.netloc:
                redirect_url = "%s://%s%s" % (parsed_url.scheme, domain, redirect_location.path)
                return self.request(method, redirect_url, **original_kwargs)
            return self.request(method, redirect.headers["Location"], **original_kwargs)
    
    
        def parseJSString(self, s):
            try:
                offset=1 if s[0]=='+' else 0
                val = int(eval(s.replace('!+[]','1').replace('!![]','1').replace('[]','0').replace('(','str(')[offset:]))
                return val
            except:
                pass
    
    
        @classmethod
        def create_scraper(cls, sess=None, **kwargs):
            """
            Convenience function for creating a ready-to-go requests.Session (subclass) object.
            """
            scraper = cls()
    
            if sess:
                attrs = ["auth", "cert", "cookies", "headers", "hooks", "params", "proxies", "data"]
                for attr in attrs:
                    val = getattr(sess, attr, None)
                    if val:
                        setattr(scraper, attr, val)
    
            return scraper
    
    
        ## Functions for integrating cloudflare-scrape with other applications and scripts
    
        @classmethod
        def get_tokens(cls, url, user_agent=None, **kwargs):
            scraper = cls.create_scraper()
            if user_agent:
                scraper.headers["User-Agent"] = user_agent
    
            try:
                resp = scraper.get(url, **kwargs)
                resp.raise_for_status()
            except Exception as e:
                logging.error("'%s' returned an error. Could not collect tokens." % url)
                raise
    
            domain = urlparse(resp.url).netloc
            cookie_domain = None
    
            for d in scraper.cookies.list_domains():
                if d.startswith(".") and d in ("." + domain):
                    cookie_domain = d
                    break
            else:
                raise ValueError("Unable to find Cloudflare cookies. Does the site actually have Cloudflare IUAM (\"I'm Under Attack Mode\") enabled?")
    
            return ({
                        "__cfduid": scraper.cookies.get("__cfduid", "", domain=cookie_domain),
                        "cf_clearance": scraper.cookies.get("cf_clearance", "", domain=cookie_domain)
                    },
                    scraper.headers["User-Agent"]
                   )
    
        @classmethod
        def get_cookie_string(cls, url, user_agent=None, **kwargs):
            """
            Convenience function for building a Cookie HTTP header value.
            """
            tokens, user_agent = cls.get_tokens(url, user_agent=user_agent, **kwargs)
            return "; ".join("=".join(pair) for pair in tokens.items()), user_agent
    
    create_scraper = CloudflareScraper.create_scraper
    get_tokens = CloudflareScraper.get_tokens
    get_cookie_string = CloudflareScraper.get_cookie_string
    
    opened by doko-desuka 86
  • Doesn't work, it keeps running forever.


    Before creating an issue, first upgrade cfscrape with pip install -U cfscrape and see if you're still experiencing the problem. Please also confirm your Node version (node --version or nodejs --version) is version 10 or higher.

    Make sure the website you're having issues with is actually using anti-bot protection by Cloudflare and not a competitor like Imperva Incapsula or Sucuri. And if you're using an anonymizing proxy, a VPN, or Tor, Cloudflare often flags those IPs and may block you or present you with a captcha as a result.

    Please confirm the following statements and check the boxes before creating an issue:

    • [x] I've upgraded cfscrape with pip install -U cfscrape
    • [x] I'm using Node version 10 or higher
    • [x] The site protection I'm having issues with is from Cloudflare
    • [x] I'm not using Tor, a VPN, or an anonymizing proxy

    Python version number

    Run python --version and paste the output below:

    Python 3.7.5
    
    

    cfscrape version number

    Run pip show cfscrape and paste the output below:

    Name: cfscrape
    Version: 2.0.8
    Summary: A simple Python module to bypass Cloudflare's anti-bot page. See https://github.com/Anorov/cloudflare-scrape for more information.
    Home-page: https://github.com/Anorov/cloudflare-scrape
    Author: Anorov
    Author-email: [email protected]
    License: UNKNOWN
    Location: /usr/local/lib/python3.7/site-packages
    Requires: requests
    Required-by:
    

    Code snippet involved with the issue

    >>> import cfscrape
    >>> scraper = cfscrape.create_scraper()
    >>> res = scraper.get("https://altadefinizione01-nuovo.link")
    **... runs forever, doesn't return a result**
    

    Complete exception and traceback

    (If the problem doesn't involve an exception being raised, leave this blank)

    
    

    URL of the Cloudflare-protected page

    https://altadefinizione01-nuovo.link

    URL of Pastebin/Gist with HTML source of protected page

    https://pastebin.com/mkrSkaMi

    bug 
    opened by makovez 28
  • Error parsing Cloudflare IUAM Javascript challenge


    script was working reliably for weeks, but this morning spit out the above error.

    Here is a pastebin of the page text: https://pastebin.com/HXmdsm0E

    Page url in the pastebin.

    it's failing at the start of the script, with the line: cookie_arg, user_agent = cfscrape.get_cookie_string(url)

    opened by pl77 23
  • Update for latest Cloudflare challenge


    This PR supersedes #206

    • Only 3 regular expressions
    • Zero JS challenge removals
    • Minimal solution

    The headers and params are both ordered dictionaries to preserve ordering when it matters. Everything works the way you'd expect, you can pass headers, etc.. The delay is still configurable and defaults to the parsed delay when omitted. If parsing the delay fails, it will fallback to the former default delay of 8 seconds. The JS challenge can be sent with HTTP status 429 (Too Many Requests) so that's been addressed. The original request params are no longer being mixed into the challenge response. The manual redirect handling has been updated as seen in previous pull requests.

    All of the headers/UA combinations have been tested against https://pro-src.com on python 2 and 3. :heavy_check_mark:

    Feel free to scrutinize now. :sweat_smile:

    Issues that this PR will close

    Fixes #233, Fixes #232, Fixes #231, Fixes #229, Fixes #228 Close #206, Fixes #227, Fixes #225, Fixes #228, Fixes #227 Fixes #220, Fixes #219, Fixes #217, Fixes #205, Fixes #201 Close #199, Fixes #190, Fixes #181

    opened by ghost 19
  • New error can't collect tokens


    Running just updated version from yesterday..

    Code line is just

    cookies = cfscrape.get_tokens(url)
    

    Fetching Cloudflare cookies ...

    undefined:1
    t.charCodeAt(1)
      ^

    TypeError: Cannot read property 'charCodeAt' of undefined
        at eval (eval at (evalmachine.:1:1027), :1:3)
        at evalmachine.:1:1027
        at evalmachine.:1:1275
        at Script.runInContext (vm.js:107:20)
        at Script.runInNewContext (vm.js:113:17)
        at Object.runInNewContext (vm.js:296:38)
        at [eval]:1:27
        at Script.runInThisContext (vm.js:96:20)
        at Object.runInThisContext (vm.js:303:38)
        at Object. ([eval]-wrapper:6:22)
    ERROR:root:Error executing Cloudflare IUAM Javascript. Cloudflare may have changed their technique, or there may be a bug in the script.

    Please read https://github.com/Anorov/cloudflare-scrape#updates, then file a bug report at https://github.com/Anorov/cloudflare-scrape/issues." ERROR:root:'https://url.com' returned an error. Could not collect tokens.

    opened by chicc0 18
  • Not working


    For a few days now (last it worked was on 5/6/16), the module isn't working properly. No exception is thrown, but there is utterly no sort of response. When I stop my program with a ctrl+c, I get the following:

    Traceback (most recent call last):
      File "E:\Kissanime-dl\kissanime-dl.py", line 188, in <module>
        return bs(cfscraper.create_scraper().get(url).content, 'lxml')

    followed by an endless loop of:

      File "C:\Tools\Anaconda\lib\site-packages\requests\sessions.py", line 487, in get
        return self.request('GET', url, **kwargs)
      File "C:\Tools\Anaconda\lib\site-packages\cfscrape\__init__.py", line 30, in request
        return self.solve_cf_challenge(resp, **kwargs)
      File "C:\Tools\Anaconda\lib\site-packages\cfscrape\__init__.py", line 69, in solve_cf_challenge
        return self.get(submit_url, **kwargs)

    and finally:

      File "C:\Tools\Anaconda\lib\site-packages\cfscrape\__init__.py", line 36, in solve_cf_challenge
        time.sleep(5)  # Cloudflare requires a delay before solving the challenge

    The page I'm trying to scrape is KissAnime.
    And here is the source code of the page.

    I just noticed that somebody else had an issue with the same page, but that issue was marked closed. And my Traceback seems to be different.

    opened by 5vbz3r0 18
  • 2.0.1 and 2.0.2 crashes opening user_agents.json


    The pip installation does not ship user_agents.json; therefore, cfscrape (2.0.1 and 2.0.2) crashes:

    mkdir cfscrapetest
    cd cfscrapetest/
    pipenv --three
    pipenv install cfscrape
    pipenv run python
    Python 3.5.2 (default, Nov 23 2017, 16:37:01)
    [GCC 5.4.0 20160609] on linux
    Type "help", "copyright", "credits" or "license" for more information.
    >>> import cfscrape
    Traceback (most recent call last):
      File "<stdin>", line 1, in <module>
      File "/home/user/.local/share/virtualenvs/cfscrapetest-uMJOQPOp/lib/python3.5/site-packages/cfscrape/__init__.py", line 22, in <module>
        with open(USER_AGENTS_PATH) as f:
    FileNotFoundError: [Errno 2] No such file or directory: '/home/user/.local/share/virtualenvs/cfscrapetest-uMJOQPOp/lib/python3.5/site-packages/cfscrape/user_agents.json'
    >>>
    
    

    @Anorov @pro-src

    opened by lukastribus 16
  • cfscrape cannot solve captchas


    Hello! Is there any chance that somebody can help me? The server is a fresh instance set up specially for the test: Ubuntu 19.04 with Node.js installed. Neither of the Python versions (2.7, 3.7) seems to solve the issue. Thanks for any help! Just in case, I've attached the output from the terminal.

    Version number

    Run pip show cfscrape and paste the output below:

    Name: cfscrape
    Version: 2.0.7
    Summary: A simple Python module to bypass Cloudflare's anti-bot page. See https://github.com/Anorov/cloudflare-scrape for more information.
    Home-page: https://github.com/Anorov/cloudflare-scrape
    Author: Anorov
    Author-email: [email protected]
    License: UNKNOWN
    Location: /usr/local/lib/python2.7/dist-packages
    Requires: requests
    Required-by:

    Code snippet experiencing the issue

    import cfscrape

    scraper = cfscrape.create_scraper()
    print scraper.get("https://www.enotes.com/topics/alpha/").content

    Complete exception and traceback

    Traceback (most recent call last):
      File "test.py", line 5, in <module>
        print (scraper.get("https://www.enotes.com/topics/alpha/").content)  # => "..."
      File "/usr/local/lib/python2.7/dist-packages/requests/sessions.py", line 546, in get
        return self.request('GET', url, **kwargs)
      File "/usr/local/lib/python2.7/dist-packages/cfscrape/__init__.py", line 124, in request
        self.handle_captcha_challenge(resp, url)
      File "/usr/local/lib/python2.7/dist-packages/cfscrape/__init__.py", line 147, in handle_captcha_challenge
        raise CloudflareCaptchaError(error, response=resp)
    cfscrape.CloudflareCaptchaError: Cloudflare captcha challenge presented for www.enotes.com (cfscrape cannot solve captchas)

    URL of the Cloudflare-protected page

    https://www.enotes.com/topics/alpha/

    URL of Pastebin/Gist with HTML source of protected page

    https://pastebin.com/CXKapc0B

    bug 
    opened by bajburtskii 15
  • Js2Py to execute javascript


    Have you ever considered something like PyV8 (or irV8, though this fork seems particularly volatile) for executing the javascript? Such an addition would make this package better contained.

    The state of the PyV8 package is a sad one at the moment though, a lot of libraries are needed to build the package and the last prebuilt binaries are from 2012.

    It looks like Js2Py would be a more viable option.

    opened by Mattwmaster58 15
  • cloudflare issue


    Before creating an issue, first upgrade cfscrape with pip install -U cfscrape and see if you're still experiencing the problem. Please also confirm your Node version (node --version or nodejs --version) is version 10 or higher.

    Make sure the website you're having issues with is actually using anti-bot protection by Cloudflare and not a competitor like Imperva Incapsula or Sucuri. And if you're using an anonymizing proxy, a VPN, or Tor, Cloudflare often flags those IPs and may block you or present you with a captcha as a result.

    Please confirm the following statements and check the boxes before creating an issue:

    • [x] I've upgraded cfscrape with pip install -U cfscrape
    • [x] I'm using Node version 10 or higher
    • [x] The site protection I'm having issues with is from Cloudflare
    • [x] I'm not using Tor, a VPN, or an anonymizing proxy

    Python version number

    Run python --version and paste the output below:

    
    

    cfscrape version number

    Run pip show cfscrape and paste the output below:

    
    

    Code snippet involved with the issue

    
    

    Complete exception and traceback

    (If the problem doesn't involve an exception being raised, leave this blank)

    
    

    URL of the Cloudflare-protected page

    [LINK GOES HERE]

    URL of Pastebin/Gist with HTML source of protected page

    [LINK GOES HERE]

    bug 
    opened by muhammedalisahan 0
  • FAIL TO BYPASS. Try to check the newest Cloudflare Technique

    • [x] I've upgraded cfscrape with pip install -U cfscrape
    • [ ] I'm using Node version 10 or higher
    • [x] The site protection I'm having issues with is from Cloudflare
    • [x] I'm not using Tor, a VPN, or an anonymizing proxy

    Python version number

    Run python --version and paste the output below:

    Python 3.11.0
    

    cfscrape version number

    Run pip show cfscrape and paste the output below:

    Name: cfscrape
    Version: 2.1.1
    

    Code snippet involved with the issue

    from cfscrape import CloudflareScraper as cfs
    
    #! WEB 1
    req = cfs()
    resp = req.get('https://fpminer.com/')
    print(resp.text)
    
    from cfscrape import CloudflareScraper as cfs
    
    #! WEB 2
    req = cfs()
    resp = req.get('https://you.com')
    print(resp.text)
    

    Complete exception and traceback

    (If the problem doesn't involve an exception being raised, leave this blank)

    
    

    URL of the Cloudflare-protected page

    https://fpminer.com/ https://you.com

    URL of Pastebin/Gist with HTML source of protected page

    https://dpaste.org/QXXZn https://dpaste.org/uc5Yo

    bug 
    opened by Queday 0
  • Fails to get cookie cf_clearance and cfduid

    Python version 3.10.7

    cfscrape version 2.1.1

    Name: cfscrape
    Version: 2.1.1
    Summary: A simple Python module to bypass Cloudflare's anti-bot page. See https://github.com/Anorov/cloudflare-scrape for more information.
    Home-page: https://github.com/Anorov/cloudflare-scrape
    Author: Anorov
    Author-email: [email protected]
    License: UNKNOWN
    Location: c:\users\pc\appdata\local\programs\python\python310\lib\site-packages
    Requires: requests
    Required-by:

    Code snippet involved with the issue

    import cfscrape

    url = 'https://www.vinted.fr/mujer/ropa/vestidos/vestidos-formales/2449263747-vestido-mujer-sfera-l'

    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.77 Safari/537.36',
    }
    token, agent = cfscrape.get_tokens(url, headers=headers)
    token2, agent2 = cfscrape.get_cookie_string(url, headers=headers)

    print(token)
    print(token2)

    output

    {'__cfduid': '', 'cf_clearance': ''}
    __cfduid=; cf_clearance=
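For context, the cookie string is just the dict returned by get_tokens joined into a Cookie header value; empty values in the output above mean the challenge was never actually solved. A minimal sketch of the mapping, with made-up token values:

```python
# Sketch of how the tokens dict maps to the cookie string. The values
# below are invented for illustration; a real cf_clearance token is
# much longer.
tokens = {'__cfduid': 'abc123', 'cf_clearance': 'def456'}

# Join key=value pairs the way a Cookie request header expects them.
cookie_string = '; '.join(f'{k}={v}' for k, v in tokens.items())
print(cookie_string)  # __cfduid=abc123; cf_clearance=def456
```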

    URL of the Cloudflare-protected page

    https://www.vinted.fr/mujer/ropa/vestidos/vestidos-formales/2449263747-vestido-mujer-sfera-l

    URL of Pastebin/Gist with HTML source of protected page

    [LINK GOES HERE]

    bug 
    opened by Polo6767 0
  • Checking if the site connection is secure

    Python 3.10.6

    Node.js v14.21.1.

    Name: cfscrape
    Version: 2.1.1
    Summary: A simple Python module to bypass Cloudflare's anti-bot page. See https://github.com/Anorov/cloudflare-scrape for more information.
    Home-page: https://github.com/Anorov/cloudflare-scrape

    Response received:

    Checking if the site connection is secure
    https://www.fastpeoplesearch.com needs to review the security of your connection before proceeding.
    Ray ID: 76de5f6b3cdc1a1b
    Performance & security by Cloudflare

    bug 
    opened by ithjl521 0
  • Enable Javascript and Cookies in your browser.

    help me pls

    import cfscrape
    
    scraper = cfscrape.create_scraper()  # returns a CloudflareScraper instance
    # Or: scraper = cfscrape.CloudflareScraper()  # CloudflareScraper inherits from requests.Session
    print(scraper.get(site).content)
    
    bug 
    opened by hasanali586q 0
  • ckgsir.com protected by CF

    • [x] I've upgraded cfscrape with pip install -U cfscrape
    • [x] I'm using Node version 10 or higher
    • [x] The site protection I'm having issues with is from Cloudflare
    • [x] I'm not using Tor, a VPN, or an anonymizing proxy

    Python version number

    Run python --version and paste the output below:

    Python 3.8.7
    

    cfscrape version number

    Run pip show cfscrape and paste the output below:

    Name: cfscrape
    Version: 2.1.1
    Summary: A simple Python module to bypass Cloudflare's anti-bot page. See https://github.com/Anorov/cloudflare-scrape for more information.
    Home-page: https://github.com/Anorov/cloudflare-scrape
    Author: Anorov
    Author-email: [email protected]
    License: UNKNOWN
    Location: c:\users\myhero\appdata\local\programs\python\python38\lib\site-packages
    Requires: requests
    Required-by:
    

    Code snippet involved with the issue

     scraper = cfscrape.create_scraper()
     res = scraper.get("https://ckgsir.com")
     print(res.text)
     scraper.solve_challenge(res.text, "https://ckgsir.com")
    

    Complete exception and traceback

    (If the problem doesn't involve an exception being raised, leave this blank)

    EXCEPTION => ( <class 'ValueError'> )  Unable to identify Cloudflare IUAM Javascript on website. Cloudflare may have changed their technique, or there may be a bug in the script.
    
    Please read https://github.com/Anorov/cloudflare-scrape#updates, then file a bug report at https://github.com/Anorov/cloudflare-scrape/issues.
    Traceback (most recent call last):
      File "C:\Users\MyHero\AppData\Local\Programs\Python\Python38\lib\site-packages\cfscrape\__init__.py", line 249, in solve_challenge
        javascript = re.search(r'\<script type\=\"text\/javascript\"\>\n(.*?)\<\/script\>',body, flags=re.S).group(1) # find javascript
    AttributeError: 'NoneType' object has no attribute 'group'
    
    
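The AttributeError above comes from re.search returning None once the regex (copied verbatim from cfscrape's solve_challenge in the traceback) no longer matches Cloudflare's markup. A minimal reproduction with a hypothetical page body, guarding the match so the clearer ValueError surfaces instead:

```python
import re

# Hypothetical stand-in for a challenge page body; the regex is the
# one from cfscrape's solve_challenge shown in the traceback above.
body = '<script type="text/javascript">\nvar s = 1 + 2;\n</script>'

match = re.search(
    r'\<script type\=\"text\/javascript\"\>\n(.*?)\<\/script\>',
    body,
    flags=re.S,
)
if match is None:
    # When Cloudflare changes its markup, search() returns None, and
    # calling .group(1) on None raises the AttributeError reported above.
    raise ValueError("Unable to identify Cloudflare IUAM Javascript on website.")

javascript = match.group(1)
print(javascript.strip())  # var s = 1 + 2;
```

For the page linked in the gist, Cloudflare's markup has evidently changed enough that this pattern no longer matches, which is why solve_challenge fails.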

    URL of the Cloudflare-protected page

    https://ckgsir.com

    URL of Pastebin/Gist with HTML source of protected page

    https://gist.github.com/sh-erfan/3867c60f53fb1f68f20fd54c6e3510e2

    bug 
    opened by sh-erfan 0
Releases (2.1.1)