A scalable frontier for web crawlers

Frontera

Overview

Frontera is a web crawling framework consisting of a crawl frontier and distribution/scaling primitives, which together make it possible to build a large-scale online web crawler.

Frontera takes care of the logic and policies to follow during the crawl. It stores and prioritises links extracted by the crawler to decide which pages to visit next, and it can do so in a distributed manner.
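To make the idea of a crawl frontier concrete, here is a minimal single-process sketch in plain Python. It is not Frontera's API: the `SimpleFrontier` class and its methods are hypothetical, and the priority rule (shallowest links first, i.e. breadth-first) is an assumption chosen only for illustration.

```python
import heapq
from itertools import count


class SimpleFrontier:
    """Hypothetical in-memory frontier: stores links, returns them by priority."""

    def __init__(self):
        self._queue = []        # heap of (depth, insertion order, url)
        self._seen = set()      # URLs already scheduled, to avoid revisits
        self._order = count()   # tie-breaker that keeps insertion order stable

    def add_links(self, urls, depth):
        """Schedule newly extracted links found at the given depth."""
        for url in urls:
            if url not in self._seen:
                self._seen.add(url)
                heapq.heappush(self._queue, (depth, next(self._order), url))

    def get_next_requests(self, max_n):
        """Return up to max_n URLs to fetch next, shallowest first."""
        batch = []
        while self._queue and len(batch) < max_n:
            _, _, url = heapq.heappop(self._queue)
            batch.append(url)
        return batch


frontier = SimpleFrontier()
frontier.add_links(["http://example.com/"], depth=0)
frontier.add_links(["http://example.com/a", "http://example.com/b"], depth=1)
frontier.add_links(["http://example.com/a"], depth=2)  # duplicate, ignored
print(frontier.get_next_requests(max_n=2))
```

The crawler only ever asks "what next?" and reports new links; how links are stored and ordered stays inside the frontier, which is the separation the feature list below refers to.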

Main features

  • Online operation: small request batches, with parsing done right after fetch.
  • Pluggable backend architecture: low-level backend access logic is separated from crawling strategy.
  • Two run modes: single process and distributed.
  • Built-in SQLAlchemy, Redis and HBase backends.
  • Built-in Apache Kafka and ZeroMQ message buses.
  • Built-in crawling strategies: breadth-first, depth-first, Discovery (with support for robots.txt and sitemaps).
  • Battle tested: our biggest deployment is 60 spiders/strategy workers delivering 50-60M documents daily for 45 days, without downtime.
  • Transparent data flow, making it easy to integrate custom components using Kafka.
  • Message bus abstraction, providing a way to implement your own transport (ZeroMQ and Kafka are available out of the box).
  • Optional use of Scrapy for fetching and parsing.
  • 3-clause BSD license, allowing use in any commercial product.
  • Python 3 support.

Installation

$ pip install frontera

Documentation

Community

Join our Google group at https://groups.google.com/a/scrapinghub.com/forum/#!forum/frontera or check GitHub issues and pull requests.

Comments
  • Redesign codecs

    Issue discussed here: https://github.com/scrapinghub/frontera/issues/211#issuecomment-251931413

    Todo List

    • [x] Fix msgpack codec
    • [x] Fix json codec
    • [x] Integration test with HBase backend (manually)

    This PR fixes #211

    Other things done in this besides the todo list:

    • Added two methods, _convert and reconvert, in the json codec. These are needed because JSONEncoder accepts strings only as unicode. The convert method recursively converts objects to unicode and saves their type.
    • Made msgpack >=0.4 a requirement, as only versions from 0.4 onward support the changes made in this PR.
    • Fixed a buggy test case in test_message_bus_backend which got exposed after fixing the codecs.
    opened by voith 35
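The recursive conversion described in the codec PR above can be sketched roughly as follows. This is an illustration of the technique only, not the code from the PR: the function names `convert` and `reconvert`, the tag strings and the use of latin1 for bytes are all assumptions.

```python
import json


def convert(obj):
    """Recursively turn bytes into str, tagging each value with its original type."""
    if isinstance(obj, bytes):
        return ["bytes", obj.decode("latin1")]  # latin1 round-trips every byte value
    if isinstance(obj, str):
        return ["str", obj]
    if isinstance(obj, dict):
        return ["dict", [[convert(k), convert(v)] for k, v in obj.items()]]
    if isinstance(obj, (list, tuple)):
        return [type(obj).__name__, [convert(item) for item in obj]]
    return ["other", obj]  # ints, floats, bools and None pass through unchanged


def reconvert(tagged):
    """Invert convert(): rebuild the original object from its tagged form."""
    kind, value = tagged
    if kind == "bytes":
        return value.encode("latin1")
    if kind == "dict":
        return {reconvert(k): reconvert(v) for k, v in value}
    if kind == "list":
        return [reconvert(item) for item in value]
    if kind == "tuple":
        return tuple(reconvert(item) for item in value)
    return value  # "str" and "other" need no further work


meta = {b"fingerprint": b"abc123", "depth": 2, b"tags": ("x", b"y")}
encoded = json.dumps(convert(meta))
assert reconvert(json.loads(encoded)) == meta
```

Saving the type next to each value is what lets bytes keys and values survive a JSON round trip, since JSON itself has no bytes type.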
  • Distributed example (HBase, Kafka)

    The documentation is rather brief and does not explain how to integrate with Kafka and HBase for a fully distributed architecture. Could you please provide an example, in the examples folder, of a well-configured distributed Frontera config?

    opened by casertap 33
  • PY3 Syntactic changes.

    Most of the changes were produced using the modernize script. Changes include print syntax, error syntax, converting iterators and generators to lists, etc. Also includes some other changes which were missed by the script.

    opened by Preetwinder 32
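For readers unfamiliar with the kinds of changes listed in the PR above, here is a small illustrative sketch of each category. The snippets are generic Python, not taken from the Frontera code base, and the variable names are made up.

```python
counts = {"fetched": 3, "failed": 1}

# Print syntax. Python 2: print "fetched", counts["fetched"]
print("fetched", counts["fetched"])

# Error syntax. Python 2: except ValueError, exc:
try:
    int("not a number")
except ValueError as exc:
    message = str(exc)

# Iterators to lists. Python 2: keys = counts.keys(); keys.sort()
# In Python 3, keys() returns a view without sort(), so it is wrapped in list().
keys = list(counts.keys())
keys.sort()

# Python 2's map() returned a list; in Python 3 it is a lazy iterator.
doubled = list(map(lambda n: n * 2, counts.values()))
```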
  • Redirect loop when using distributed-frontera

    I am using the development version of distributed-frontera, frontera and scrapy for crawling. After a while my spider keeps getting stuck in a redirect loop. Restarting the spider helps, but after a while this happens:

    2015-12-21 17:23:22 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:22 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:23 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:24 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:25 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:26 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:27 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:32 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.reg.ru/domain/new-gtlds> from <GET http://www.reg.ru/domain/new-gtlds>
    2015-12-21 17:23:32 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:33 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:34 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:35 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:35 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:36 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:37 [scrapy] DEBUG: Discarding <GET http://www.reg.ru/domain/new-gtlds>: max redirections reached
    2015-12-21 17:23:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:23:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    2015-12-21 17:23:43 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    ...
    2015-12-21 17:45:38 [scrapy] DEBUG: Redirecting (302) to <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit> from <GET http://www.toraath.com/index.php?lid=35&name=Downloads&req=getit>
    2015-12-21 17:45:43 [scrapy] INFO: Crawled 10354 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
    

    This does not seem to be an issue with distributed-frontera since I could not find any code related to redirecting there.

    opened by lljrsr 25
  • [WIP] Added Cassandra backend

    This PR is a rebase of #128. Although I have completely changed the design and refactored the code, I have kept @wpxgit's commits (squashed) because this work was originally initiated by him.

    I have tried to follow the DRY methodology as much as possible, so I had to refactor some existing code.

    I have serialized dicts using Pickle; as a result, this backend won't have the problems discussed in #211.

    The PR includes unit tests and some integration tests with the backends integration testing framework.

    It's good that Frontera has an integration test framework for testing backends in single-threaded mode. However, a similar framework for the distributed mode is very much needed.

    I am open to all sorts of suggestions :)

    opened by voith 17
  • cluster kafka db worker doesnt recognize partitions

    Hi, I'm trying to use the cluster configuration. I've created topics in Kafka and have it up and running, but I'm running into trouble starting the database worker. I tried:

    python -m frontera.worker.db --config config.dbw --no-incoming --partitions 0,1

    and got an error that 0,1 is not recognized. Then I tried:

    python -m frontera.worker.db --config config.dbw --no-incoming --partitions 0

    I was getting the same issue as in #359, but somehow that stopped happening.

    Now I'm getting an error saying that the Kafka partitions are not recognized or not iterable (see the traceback below). I'm using Python 3.6 and Frontera from the repo (FYI, qzm and cachetools still needed to be installed manually). Any ideas?

    File "/usr/lib64/python3.6/runpy.py", line 193, in _run_module_as_main
      "__main__", mod_spec)
    File "/usr/lib64/python3.6/runpy.py", line 85, in _run_code
      exec(code, run_globals)
    File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 246, in <module>
      args.no_scoring, partitions=args.partitions)
    File "/usr/lib/python3.6/dist-packages/frontera/worker/stats.py", line 22, in __init__
      super(StatsExportMixin, self).__init__(settings, *args, **kwargs)
    File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 115, in __init__
      self.slot = Slot(self, settings, **slot_kwargs)
    File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 46, in __init__
      self.components = self._load_components(worker, settings, **kwargs)
    File "/usr/lib/python3.6/dist-packages/frontera/worker/db.py", line 55, in _load_components
      component = cls(worker, settings, stop_event=self.stop_event, **kwargs)
    File "/usr/lib/python3.6/dist-packages/frontera/worker/components/scoring_consumer.py", line 24, in __init__
      self.scoring_log_consumer = scoring_log.consumer()
    File "/usr/lib/python3.6/dist-packages/frontera/contrib/messagebus/kafkabus.py", line 219, in consumer
      return Consumer(self._location, self._enable_ssl, self._cert_path, self._topic, self._group, partition_id=None)
    File "/usr/lib/python3.6/dist-packages/frontera/contrib/messagebus/kafkabus.py", line 60, in __init__
      self._partitions = [TopicPartition(self._topic, pid) for pid in self._consumer.partitions_for_topic(self._topic)]

    opened by danmsf 16
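The traceback in the issue above is cut off before the actual error message, so the cause is not confirmed. One plausible reading, offered as an assumption: in kafka-python, `KafkaConsumer.partitions_for_topic()` returns `None` when it has no metadata for the topic, and iterating over `None` fails. The sketch below shows a defensive version of that last line. `FakeConsumer`, the stand-in `TopicPartition` and the topic name are hypothetical, used so the example runs without a Kafka installation.

```python
from collections import namedtuple

# Stand-in for kafka.TopicPartition so the sketch runs without kafka-python.
TopicPartition = namedtuple("TopicPartition", ["topic", "partition"])


def list_partitions(consumer, topic):
    """Build TopicPartition objects, failing clearly if the topic is unknown."""
    partition_ids = consumer.partitions_for_topic(topic)
    if partition_ids is None:
        # Iterating over None would raise a TypeError about NoneType not being iterable.
        raise RuntimeError("no partition metadata for topic %r; does it exist?" % topic)
    return [TopicPartition(topic, pid) for pid in sorted(partition_ids)]


class FakeConsumer:
    """Hypothetical test double exposing only partitions_for_topic()."""

    def __init__(self, topics):
        self._topics = topics

    def partitions_for_topic(self, topic):
        return self._topics.get(topic)  # None when the topic is missing


consumer = FakeConsumer({"frontier-score": {0, 1}})
print(list_partitions(consumer, "frontier-score"))
```

If this reading is right, checking that the topic names in the worker config match the topics actually created in Kafka would be a reasonable first step.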
  • [WIP] Downloader slot usage optimization

    Imagine we have a queue of 10K URLs from many different domains, and our task is to fetch them as fast as possible. At the same time, our prioritization tends to group URLs from the same domain. During downloading we want to be polite and limit per-host RPS. So picking just the top URLs from the queue wastes time, because the connection pool of the Scrapy downloader is underused most of the time.

    In this PR, I'm addressing this issue by propagating information about overused hostnames/IPs in the downloader pool.

    opened by sibiryakov 16
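The idea in the PR above, skipping URLs whose hosts are already saturated in the downloader, can be sketched like this. It is a simplified illustration, not the PR's implementation: `pick_batch` and its parameters are hypothetical, and the queue is a plain list assumed to be already sorted by priority.

```python
from urllib.parse import urlparse


def pick_batch(queue, overused_hosts, max_n):
    """Take up to max_n URLs in priority order, skipping overused hosts.

    URLs that are not taken stay in `remaining`, in their original order.
    """
    batch, remaining = [], []
    for url in queue:
        host = urlparse(url).hostname
        if len(batch) < max_n and host not in overused_hosts:
            batch.append(url)
        else:
            remaining.append(url)
    return batch, remaining


queue = [
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/1",
    "http://c.example/1",
]
batch, remaining = pick_batch(queue, overused_hosts={"a.example"}, max_n=2)
print(batch)
print(remaining)
```

Without the `overused_hosts` filter, the batch would contain two URLs from the same host, and the second one would just wait behind the politeness delay while other connections sit idle.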
  • Fixed scheduler process_spider_output() to yield requests

    Fixes #253. Here's a screenshot using the same code discussed here. [Screenshot, 2017-02-12]

    Nothing seems to break when testing this change manually. The only test that was failing was wrong IMO because it passed a list of requests and items and was only expecting items in return. I have modified that test to make it compatible with this patch.

    I've split this PR into three commits:

    • The first commit adds a test to reproduce the bug.
    • The second commit fixes the bug.
    • The third commit fixes the broken test discussed above.

    A note about the tests added:

    The tests might be a little difficult to understand at first sight. I would recommend reading the following code in order to understand the tests:
    • https://github.com/scrapy/scrapy/blob/master/scrapy/core/spidermw.py#L34-L73: This is to understand how scrapy processes the different methods of the spider middleware.
    • https://github.com/scrapy/scrapy/blob/master/scrapy/core/scraper.py#L135-L147: This is to understand how the scrapy core executes the spider middleware methods and passes the control to the spider callbacks.

    I have simulated the above discussed code in order to write the test.

    opened by voith 15
  • New DELAY_ON_EMPTY functionality on FronteraScheduler terminates crawl right at start

    Until this is solved, you can use this in your settings as a workaround:

    DELAY_ON_EMPTY=0.0
    

    The problem is in frontera.contrib.scrapy.schedulers.FronteraScheduler, method _get_next_requests. If there are no pending requests and the test self._delay_next_call < time() fails, an empty list is returned, which causes the crawl to terminate.

    bug 
    opened by plafl 14
  • Fix SQL integer type for crc32 field

    CRC32 is an unsigned 4-byte int, so it does not fit in a signed 4-byte int (Integer). There is no unsigned int type in the SQL standard, so I changed it to BigInteger instead. Without this change, both MySQL and Postgres complain that the host_crc32 field value is out of bounds. Another option (to save space) would be to convert CRC32 into a signed 4-byte int, but this would complicate things, and I'm not sure it's worth it.

    opened by lopuhin 12
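The range problem described in the PR above, and the space-saving alternative it mentions, can be shown with the standard library alone. This is an illustrative sketch; the helper names are hypothetical and the PR itself chose BigInteger rather than this conversion.

```python
import struct
import zlib

INT32_MAX = 2**31 - 1  # largest value a signed 4-byte SQL Integer can hold


def crc32_unsigned(hostname):
    """CRC32 of a hostname as an unsigned 32-bit integer (0 .. 2**32 - 1)."""
    return zlib.crc32(hostname.encode("utf8")) & 0xFFFFFFFF


def to_signed32(value):
    """Reinterpret an unsigned 32-bit value as a signed one (the space-saving option)."""
    return struct.unpack("<i", struct.pack("<I", value))[0]


def to_unsigned32(value):
    """Inverse of to_signed32()."""
    return struct.unpack("<I", struct.pack("<i", value))[0]


largest = 0xFFFFFFFF           # largest possible CRC32 value
print(largest > INT32_MAX)     # True: it does not fit a signed 4-byte column
print(to_signed32(largest))    # -1: the same 32 bits read as a signed integer
```

The conversion is lossless, but every read and write of the column would have to go through it, which is the complication the author was weighing against the extra storage of BigInteger.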
  • Use crawler settings as a fallback when there's no FRONTERA_SETTINGS

    This is a follow up to https://github.com/scrapinghub/frontera/pull/45.

    It enables the manager to receive the crawler settings and then instantiate the frontera settings accordingly. I added a few tests that should make the new behavior a little clearer.

    Is something along these lines acceptable? How can it be improved?

    opened by josericardo 12
  • how can I know it works when I use it with scrapy?

    I followed everything in the running-the-crawl document and then ran:

    scrapy crawl my-spider
    

    I can see items being crawled in the console, but I can't tell whether Frontera is working.

    What I did

    sandwarm/frontera/settings.py

    
    BACKEND = 'frontera.contrib.backends.sqlalchemy.Distributed'
    
    SQLALCHEMYBACKEND_ENGINE="mysql://acme:[email protected]:3306/acme"
    SQLALCHEMYBACKEND_MODELS={
        'MetadataModel': 'frontera.contrib.backends.sqlalchemy.models.MetadataModel',
        'StateModel': 'frontera.contrib.backends.sqlalchemy.models.StateModel',
        'QueueModel': 'frontera.contrib.backends.sqlalchemy.models.QueueModel'
    }
    
    SPIDER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerSpiderMiddleware': 1000,
    })
    
    DOWNLOADER_MIDDLEWARES.update({
        'frontera.contrib.scrapy.middlewares.schedulers.SchedulerDownloaderMiddleware': 1000,
    })
    
    SCHEDULER = 'frontera.contrib.scrapy.schedulers.frontier.FronteraScheduler'
    
    

    settings.py

    FRONTERA_SETTINGS = 'sandwarm.frontera.settings'
    
    

    Since I enabled the MySQL backend, I would expect to see a connection error, because I haven't started MySQL yet.

    Thanks for all your hard work, but please make the documentation easier for humans, for example with a very basic working example. Currently we need to piece together all the documents to get the basic idea, and, even worse, it still doesn't work at all. I have already spent a week trying to get a working example.

    opened by vidyli 1
  • Project Status?

    It's been a year since the last commit in the master branch. Do you have any plans to maintain this? I noticed that a lot of issues don't get resolved, and lots of PRs are still pending.

    opened by psdon 8
  • Message Decode Error

    I'm getting the following error when adding a URL to Kafka for Scrapy to parse:

    2020-09-07 20:12:46 [messagebus-backend] WARNING: Could not decode message: b'http://quotes.toscrape.com/page/1/', error unpack(b) received extra data.
    
    opened by ab-bh 0
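One plausible reading of the warning above, assuming the msgpack codec is in use: in the MessagePack format, any first byte in the range 0x00-0x7f is a complete one-byte integer ("positive fixint"). A raw URL starts with the letter "h", so a decoder would read that byte as an integer and then find the rest of the URL left over, which matches the "extra data" wording. The sketch below mimics only that one decoding rule in plain Python; `peek_msgpack_fixint` is a hypothetical helper, not part of msgpack or Frontera.

```python
def peek_msgpack_fixint(payload):
    """Decode only the MessagePack 'positive fixint' case (first byte 0x00-0x7f).

    Returns (value, leftover_bytes), or None if the first byte is another type.
    """
    if not payload:
        return None
    first = payload[0]
    if first <= 0x7F:
        return first, payload[1:]
    return None


raw = b"http://quotes.toscrape.com/page/1/"
value, leftover = peek_msgpack_fixint(raw)
print(value)           # the byte for "h" read as an integer
print(len(leftover))   # bytes a real decoder would report as extra data
```

If that is what happened, the message bus is expecting payloads already encoded by the configured codec rather than bare URLs written straight into the topic.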
  • The `KeyError` throw when running to to_fetch in StateContext class: b'fingerprint'

    https://github.com/scrapinghub/frontera/blob/master/frontera/core/manager.py I use the 0.8.1 code base in LOCAL_MODE. A KeyError is thrown when execution reaches to_fetch in the StatesContext class:

    from line 801:

    class StatesContext(object):
        ...
        def to_fetch(self, requests):
            requests = requests if isinstance(requests, Iterable) else [requests]
            for request in requests:
                fingerprint = request.meta[b'fingerprint'] # error occurred here!!!
    

    I think the reason is that the meta key b'fingerprint' is used before it is set:

    from line 302:

    class LocalFrontierManager(BaseContext, StrategyComponentsPipelineMixin, BaseManager):
        def page_crawled(self, response):
            ...
            self.states_context.to_fetch(response)  # b'fingerprint' is read here
            self.states_context.fetch()
            self.states_context.states.set_states(response)
            super(LocalFrontierManager, self).page_crawled(response) # but it is only set here!
            self.states_context.states.update_cache(response)
    

    from line 233:

    class BaseManager(object):
        def page_crawled(self, response):
            ...
            self._process_components(method_name='page_crawled',
                                     obj=response,
                                     return_classes=self.response_model) # b'fingerprint' will be set when the pipeline goes through here
    

    My current workaround is to add the following check to the to_fetch method of the StatesContext class:

        def to_fetch(self, requests):
            requests = requests if isinstance(requests, Iterable) else [requests]
            for request in requests:
                if b'fingerprint' not in request.meta:                
                    request.meta[b'fingerprint'] = sha1(request.url)
                fingerprint = request.meta[b'fingerprint']
                self._fingerprints[fingerprint] = request
    

    What is the correct way to fix this?

    opened by yujiaao 0
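The workaround in the issue above calls a `sha1` helper whose origin is not shown. As a standalone illustration of the same "set the fingerprint only if it is missing" idea, here is a sketch built on `hashlib`. `FakeRequest`, `url_fingerprint` and `ensure_fingerprint` are hypothetical names, and whether Frontera expects the fingerprint as str or bytes is not confirmed here.

```python
import hashlib


def url_fingerprint(url):
    """Hex SHA-1 digest of a URL; accepts str or bytes."""
    if isinstance(url, str):
        url = url.encode("utf8")  # hashlib only accepts bytes-like input
    return hashlib.sha1(url).hexdigest()


class FakeRequest:
    """Hypothetical stand-in for a request object with .url and .meta."""

    def __init__(self, url):
        self.url = url
        self.meta = {}


def ensure_fingerprint(request):
    """Set meta[b'fingerprint'] only when it is missing, then return it."""
    return request.meta.setdefault(b"fingerprint", url_fingerprint(request.url))


request = FakeRequest("http://example.com/")
print(ensure_fingerprint(request))
```

Note that passing a str URL directly to `hashlib.sha1()` raises a TypeError on Python 3, and the result is a hash object rather than a string until `hexdigest()` is called, so a plain `hashlib` import would not work as a drop-in for the `sha1` call in the workaround.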
  • KeyError [b'frontier'] on Request Creation from Spider

    Issue might be related to #337

    Hi,

    I have already read in discussions here, that the scheduling of requests should be done by frontera and apparently even the creation should be done by the frontier and not by the spider. However, in the documentation of scrapy and frontera it is written that requests shall be yielded in the spider parse function.

    What should the process look like if requests are to be created by the crawling strategy and not yielded by the spider? How does the spider trigger that?

    In my use case, I am using scrapy-selenium with scrapy and frontera (I use SeleniumRequests to be able to wait for JS loaded elements).

    I have to generate the URLs I want to scrape in two phases: first I yield them in the start_requests() method of the spider (instead of using a seeds file), and then I yield requests for extracted links in the first of two parse functions.

    Yielding SeleniumRequests from start_requests works, but yielding SeleniumRequests from the parse function afterwards results in the following error (only pasted an extract, as the iterable error prints the same errors over and over):

    return (_set_referer(r) for r in result or ())
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/urllength.py", line 37, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/spidermiddlewares/depth.py", line 58, in <genexpr>
        return (r for r in result or () if _filter(r))
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/scrapy/core/spidermw.py", line 64, in _evaluate_iterable
        for r in iterable:
      File "/Users/user/opt/anaconda3/envs/frontera-update/lib/python3.8/site-packages/frontera/contrib/scrapy/schedulers/frontier.py", line 112, in process_spider_output
        frontier_request = response.meta[b'frontier_request']
    KeyError: b'frontier_request'
    

    Very thankful for all hints and examples!

    opened by dkipping 3
Releases(v0.8.1)
  • v0.8.1(Apr 5, 2019)

  • v0.8.0.1(Jul 30, 2018)

  • v0.8.0(Jul 25, 2018)

    This is a major release containing many architectural changes. The goal of these changes is to make development and debugging of the crawling strategy easier. From now on, there is an extensive guide in the documentation on how to write a custom crawling strategy, a single process mode that makes it much easier to debug a crawling strategy locally, and the old distributed mode for production systems. Starting from this version, there is no requirement to set up Apache Kafka or HBase to experiment with crawling strategies on your local computer.

    We also removed unnecessary, rarely used features (the distributed spiders run mode and prioritisation logic from backends) to make Frontera easier to use and understand.

    Here is a (somewhat) full change log:

    • PyPy (2.7.*) support,
    • Redis backend (kudos to @khellan),
    • LRU cache and two cache generations for HBaseStates,
    • Discovery crawling strategy, respecting robots.txt and leveraging sitemaps to discover links faster,
    • Breadth-first and depth-first crawling strategies,
    • new mandatory component in backend: DomainMetadata,
    • filter_links_extracted method in crawling strategy API to optimise calls to backends for state data,
    • create_request in crawling strategy is now using FronteraManager middlewares,
    • many batch gen instances,
    • support of latest kafka-python,
    • statistics are sent to message bus from all parts of Frontera,
    • overall reliability improvements,
    • settings for OverusedBuffer,
    • DBWorker was refactored and divided into components (kudos to @vshlapakov),
    • seeds addition can be done using s3 now,
    • Python 3.7 compatibility.
  • v0.7.1(Feb 9, 2017)

    Thanks to @voith, a problem introduced when Python 3 support began, where Frontera supported only keys and values stored as bytes in .meta fields, is now solved. Because of it, many Scrapy middlewares weren't working or were working incorrectly. This is still not tested properly, so please report any bugs.

    Other improvements include:

    • batched states refresh in crawling strategy,
    • proper access to redirects in Scrapy converters,
    • more readable and simple OverusedBuffer implementation,
    • examples, tests and docs fixes.

    Thank you all, for your contributions!

  • v0.7.0(Nov 29, 2016)

    Long-awaited support for the kafka-python 1.x.x client. Frontera is now much more resistant to physical connectivity loss and uses the new asynchronous Kafka API. Other improvements:

    • SW consumes less CPU (because of less frequent state flushing),
    • the request creation API in BaseCrawlingStrategy has changed and is now batch-oriented,
    • new article in the docs on cluster setup,
    • disable scoring log consumption option in DB worker,
    • fix of hbase drop table,
    • improved tests coverage.
  • v0.6.0(Aug 18, 2016)

    • Full Python 3 support 👏 👍 🍻 (https://github.com/scrapinghub/frontera/issues/106), all thanks go to @Preetwinder.
    • canonicalize_url method removed in favor of w3lib implementation.
    • The whole Request (incl. meta) is propagated to DB Worker, by means of scoring log (fixes https://github.com/scrapinghub/frontera/issues/131)
    • Generating Crc32 from hostname the same way for both platforms: Python 2 and 3.
    • HBaseQueue now supports delayed requests. A ‘crawl_at’ field in meta, holding a timestamp, makes the request available to spiders only after that moment has passed. This is an important feature for revisiting.
    • The Request object is now persisted in HBaseQueue, making it possible to schedule requests with specific meta, headers, body and cookies parameters.
    • MESSAGE_BUS_CODEC option, allowing a codec other than the default message bus codec to be chosen.
    • Strategy worker refactoring to simplify its customization from subclasses.
    • Fixed a bug with extracted links distribution over spider log partitions (https://github.com/scrapinghub/frontera/issues/129).
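The delayed-request behaviour described in the list above (a ‘crawl_at’ timestamp in meta) can be illustrated with a small in-memory queue. This is only a sketch of the concept, not HBaseQueue: the `DelayedQueue` class is hypothetical, and using a plain string key and dict-shaped requests is an assumption.

```python
import heapq
import time


class DelayedQueue:
    """Hypothetical in-memory queue honouring a 'crawl_at' timestamp in meta."""

    def __init__(self):
        self._heap = []   # (crawl_at, sequence number, request dict)
        self._seq = 0     # unique tie-breaker, so dicts are never compared

    def schedule(self, url, meta=None):
        meta = dict(meta or {})
        crawl_at = meta.get("crawl_at", 0)  # no timestamp means "ready now"
        heapq.heappush(self._heap, (crawl_at, self._seq, {"url": url, "meta": meta}))
        self._seq += 1

    def get_ready(self, now=None):
        """Pop every request whose crawl_at moment has already passed."""
        now = time.time() if now is None else now
        ready = []
        while self._heap and self._heap[0][0] <= now:
            ready.append(heapq.heappop(self._heap)[2])
        return ready


queue = DelayedQueue()
queue.schedule("http://example.com/now")
queue.schedule("http://example.com/later", meta={"crawl_at": 2000})
print([r["url"] for r in queue.get_ready(now=1000)])
print([r["url"] for r in queue.get_ready(now=3000)])
```

For revisiting, a strategy would reschedule a page with `crawl_at` set to the current time plus the desired revisit interval.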
  • v0.5.3(Jul 22, 2016)

  • v0.5.2.3(Jul 18, 2016)

  • v0.5.2.2(Jun 29, 2016)

    • CONSUMER_BATCH_SIZE is removed and two new options are introduced: SPIDER_LOG_CONSUMER_BATCH_SIZE and SCORING_LOG_CONSUMER_BATCH_SIZE.
    • Traceback is thrown into log when SIGUSR1 is received in DBW or SW.
    • Finishing in SW is fixed when crawling strategy reports finished.
  • v0.5.2.1(Jun 24, 2016)

    Before this release, the default compression codec was Snappy. We found out that Snappy support is broken in certain Kafka versions, so we issued this release. The latest version has no compression codec enabled by default and allows choosing the compression codec with the KAFKA_CODEC_LEGACY option.

  • v0.5.2(Jun 21, 2016)

  • v0.5.1.1(Jun 2, 2016)

  • v0.5.0(Jun 1, 2016)

    Here is the change log:

    • latest SQLAlchemy unicode-related crashes are fixed,
    • corporate website friendly canonical solver has been added,
    • crawling strategy concept evolved: added ability to add to queue an arbitrary URL (with transparent state check), FrontierManager available on construction,
    • strategy worker code was refactored,
    • default state introduced for links generated during crawling strategy operation,
    • got rid of Frontera logging in favor of Python native logging,
    • logging system configuration by means of logging.config using file,
    • partitions to instances can be assigned from command line now,
    • improved test coverage from @Preetwinder.

    Enjoy!

  • v0.4.2(Apr 22, 2016)

    This release prevents installing kafka-python package versions newer than 0.9.5. Newer versions have significant architectural changes and require Frontera code adaptation and testing. If you are using the Kafka message bus, then you're encouraged to install this update.

  • v0.4.1(Jan 18, 2016)

    • fixed API docs generation on RTD,
    • added body field in Request objects, to support POST-type requests,
    • guidance on how to set MAX_NEXT_REQUESTS and settings docs fixes,
    • fixed colored logging.
  • v0.4.0(Dec 30, 2015)

    A tremendous amount of work was done:

    • distributed-frontera and frontera were merged together into a single project, to make it easier to use and understand,
    • Backend was completely redesigned. It now consists of Queue, Metadata and States objects for low-level code and higher-level Backend implementations for crawling policies,
    • Added a definition of run modes: single process, distributed spiders, distributed spider and backend.
    • The overall distributed concept is now integrated into Frontera, making the difference between usage of components in single process and distributed spiders/backend run modes clearer.
    • Significantly restructured and augmented documentation, addressing user needs in a more accessible way.
    • Much smaller configuration footprint.

    Enjoy this new year release and let us know what you think!

  • v0.3.3(Sep 29, 2015)

    • tldextract is no longer a minimum required dependency,
    • SQLAlchemy backend now persists headers, cookies, and method, also _create_page method added to ease customization,
    • Canonical solver code (needs documentation)
    • Other fixes and improvements
  • v0.3.2(Jun 19, 2015)

    It is now possible to configure Frontera from Scrapy settings. The order of precedence for configuration sources is as follows:

    1. settings defined in the module pointed to by FRONTERA_SETTINGS (highest precedence),
    2. settings defined in the Scrapy settings,
    3. default frontier settings.
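The precedence order in the list above can be pictured with a layered lookup. The sketch below uses `collections.ChainMap` purely as an illustration; it is not how Frontera loads settings, and all values shown are made up.

```python
from collections import ChainMap

# All values below are made up for illustration.
default_settings = {"MAX_NEXT_REQUESTS": 64, "DELAY_ON_EMPTY": 5.0}
scrapy_settings = {"MAX_NEXT_REQUESTS": 128}
frontera_module_settings = {"DELAY_ON_EMPTY": 0.0}

# ChainMap searches its maps left to right, so the first map wins.
settings = ChainMap(frontera_module_settings, scrapy_settings, default_settings)

print(settings["DELAY_ON_EMPTY"])      # taken from the FRONTERA_SETTINGS module
print(settings["MAX_NEXT_REQUESTS"])   # taken from the Scrapy settings
```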
  • v0.3.1(May 25, 2015)

    The main issue solved in this version is that request callbacks and request.meta contents are now successfully serialized and deserialized in the SQLAlchemy-based backend. Therefore, the majority of Scrapy extensions should no longer suffer from losing meta or callbacks when passing through Frontera. Second, there is a hot fix for the cold start problem, where seeds are added and Scrapy quickly finishes with no further activity. A well-thought-out solution for this will be offered later.

  • v0.3.0(Apr 15, 2015)

    • Frontera is the new name for Crawl Frontier.
    • The signature of the get_next_requests method has changed; it now accepts arbitrary key-value arguments.
    • Overused buffer (subject to removal in the future in favor of the downloader internal queue).
    • Backend internals became more customizable.
    • The scheduler now asks for new requests when there is free space in the Scrapy downloader queue, instead of waiting for it to be completely empty.
    • Several Frontera middlewares are disabled by default.
  • v0.2.0(Jan 12, 2015)

    • Added documentation (Scrapy Seed Loaders+Tests+Examples)
    • Refactored backend tests
    • Added requests library example
    • Added requests library manager and object converters
    • Added FrontierManagerWrapper
    • Added frontier object converters
    • Fixed script examples for new changes
    • Optional Color logging (only if available)
    • Changed Scrapy frontier and recorder integration to scheduler+middlewares
    • Changed default frontier backend
    • Added comment support to seeds
    • Added doc requirements for RTD build
    • Removed optional dependencies for setup.py and requirements
    • Changed tests to pytest
    • Updated docstrings and documentation
    • Changed frontier components (Backend and Middleware) to abc
    • Modified Scrapy frontier example to use seed loaders
    • Refactored Scrapy Seed loaders
    • Added new fields to Request and Response frontier objects
    • Added ScrapyFrontierManager (Scrapy wrapper for Frontier Manager)
    • Changed frontier core objects (Page/Link to Request/Response)
Owner
Scrapinghub
Turn web content into useful data
Scrapinghub
Web scrapper para cotizar articulos

WebScrapper Este web scrapper esta desarrollado en python 3.10.0 para buscar en la pagina de cyber puerta articulos dentro del catalogo. El programa t

Jordan Gaona 1 Oct 27, 2021
Create crawler get some new products with maximum discount in banimode website

crawler-banimode create crawler and get some new products with maximum discount in banimode website. این پروژه کوچک جهت یادگیری و کار با ابزار سلنیوم

nourollah rezaei 2 Feb 17, 2022
Generate a repository with mirror links for DriveDroid app

DriveDroid Repository Generator Generate a repository for the app that allow boot a PC using ISO files stored on your Android phone Check also an offi

Evgeny 11 Nov 19, 2022
This is a webscraper for a specific website

This is a webscraper for a specific website. It is tuned to extract the headlines of that website. With some little adjustments the webscraper is able to extract any part of the website.

Rahul Siyanwal 1 Dec 13, 2021
A way to scrape sports streams for use with Jellyfin.

Sportyfin Description Stream sports events straight from your Jellyfin server. Sportyfin allows users to scrape for live streamed events and watch str

axelmierczuk 38 Nov 05, 2022
A distributed crawler for weibo, building with celery and requests.

A distributed crawler for weibo, building with celery and requests.

SpiderClub 4.8k Jan 03, 2023
Console application for downloading images from Reddit in Python

RedditImageScraper Console application for downloading images from Reddit in Python Introduction This short Python script was created for the mass-dow

James 0 Jul 04, 2021
AssistScraper - program for /r/nba to use to find list of all players a player assisted and how many assists each player received

AssistScraper - program for /r/nba to use to find list of all players a player assisted and how many assists each player received

5 Nov 25, 2021
A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com

TDTV2-Direct Version 1.00.1 • A Simple Web Scraper made to Extract Download Links from Todaytvseries2.com :) How to Works?? install all dependencies v

Danushka-Madushan 1 Nov 28, 2021
Scraping Top Repositories for Topics on GitHub,

0.-Webscrapping-using-python Scraping Top Repositories for Topics on GitHub, Web scraping is the process of extracting and parsing data from websites

Dev Aravind D Satprem 2 Mar 18, 2022
Web Scraping Practica With Python

Web-Scraping-Practica Members: Guillem Vidal Pallarols, Lídia Bandrés Solé. Files: This document is the first one we find. Next we find

2 Nov 08, 2021
A leetcode scraper to compile all questions in leetcode free tier to text file. pdf also available.

A leetcode scraper to compile all questions in leetcode free tier to text file, pdf also available. if new questions get added, run again to get new questions.

3 Dec 07, 2021
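This is a Python script. It works by using Selenium to locate the relevant elements and then combining them with click events to automate the browser.

This is a Python script. It works by using Selenium to locate the relevant elements and then combining them with click events to automate the browser.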

N0el4kLs 5 Nov 19, 2021
News, full-text, and article metadata extraction in Python 3. Advanced docs:

Newspaper3k: Article scraping & curation Inspired by requests for its simplicity and powered by lxml for its speed: "Newspaper is an amazing python li

Lucas Ou-Yang 12.3k Jan 07, 2023
✂️🕷️ Spider-Cut is a Network Mapper Framework (NMAP Framework)

Spider-Cut is a Network Mapper Framework (NMAP Framework) Installation | Usage | Creators | Donate Installation # Kali Linux | WSL

XforWorks 3 Mar 07, 2022
mlscraper: Scrape data from HTML pages automatically with Machine Learning

🤖 Scrape data from HTML websites automatically with Machine Learning

Karl Lorey 798 Dec 29, 2022
A simple flask application to scrape gogoanime website.

gogoanime-api-flask A simple flask application to scrape gogoanime website. Used for demo and learning purposes only. How to use the API The base api

1 Oct 29, 2021
download NCERT books using scrapy

download_ncert_books download NCERT books using scrapy Downloading Books: You can either use the spider by cloning this repo and following the instruc

1 Dec 02, 2022
The core packages of security analyzer web crawler

Security Analyzer 🐍 A large scale web crawler (considered also as vulnerability scanner tool) to take an overview about security of Moroccan sites Cu

Security Analyzer 10 Jul 03, 2022
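Batch download all of a Douyin user's watermark-free videos

Douyincrawler Batch download all of a Douyin user's watermark-free videos. Run: install python3, install the dependencies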

28 Dec 08, 2022