Before I start, I just want to say that you have all done a great job developing this project. I love Gerapy and will probably start contributing to it. I will try to document this as well as I can so it can be helpful to others.
**Describe the bug**
I have a scrapy project which runs perfectly fine in the terminal using the following command:

```
scrapy crawl examplespider
```

However, when I schedule it as a task and run it on my local scrapyd client, the spider opens and immediately closes without doing anything, and it throws no errors. I think it is a config file issue. When I view the results of the job, it shows the following:
```
y.spidermiddlewares.urllength.UrlLengthMiddleware',
'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider opened
2022-12-15 07:03:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-15 07:03:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.002359,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 314439),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 63709184,
 'memusage/startup': 63709184,
 'start_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 312080)}
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider closed (finished)
```
In the logs it shows the following:
```
/home/ubuntu/env/scrape/bin/logs/examplescraper/examplespider
2022-12-15 07:03:21 [scrapy.utils.log] INFO: Scrapy 2.7.1 started (bot: examplescraper)
2022-12-15 07:03:21 [scrapy.utils.log] INFO: Versions: lxml 4.9.1.0, libxml2 2.9.14, cssselect 1.2.0, parsel 1.7.0, w3lib 2.1.1, Twisted 22.10.0, Python 3.8.10 (default, Nov 14 2022, 12:59:47) - [GCC 9.4.0], pyOpenSSL 22.1.0 (OpenSSL 3.0.7 1 Nov 2022), cryptography 38.0.4, Platform Linux-5.15.0-1026-aws-x86_64-with-glibc2.29
2022-12-15 07:03:21 [scrapy.crawler] INFO: Overridden settings:
{'BOT_NAME': 'examplescraper',
 'DOWNLOAD_DELAY': 0.1,
 'LOG_FILE': 'logs/examplescraper/examplespider/8d623d447c4611edad0641137877ddff.log',
 'NEWSPIDER_MODULE': 'examplespider.spiders',
 'SPIDER_MODULES': ['examplespider.spiders'],
 'USER_AGENT': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
               '(KHTML, like Gecko) Chrome/101.0.4951.67 Safari/537.36'}
2022-12-15 07:03:21 [py.warnings] WARNING: /home/ubuntu/env/scrape/lib/python3.8/site-packages/scrapy/utils/request.py:231: ScrapyDeprecationWarning: '2.6' is a deprecated value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting.
It is also the default value. In other words, it is normal to get this warning if you have not defined a value for the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting. This is so for backward compatibility reasons, but it will change in a future version of Scrapy.
See the documentation of the 'REQUEST_FINGERPRINTER_IMPLEMENTATION' setting for information on how to handle this deprecation.
return cls(crawler)
2022-12-15 07:03:21 [scrapy.utils.log] DEBUG: Using reactor: twisted.internet.epollreactor.EPollReactor
2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet Password: b11a24faee23f82c
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled extensions:
['scrapy.extensions.corestats.CoreStats',
 'scrapy.extensions.telnet.TelnetConsole',
 'scrapy.extensions.memusage.MemoryUsage',
 'scrapy.extensions.logstats.LogStats']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled downloader middlewares:
['scrapy.downloadermiddlewares.httpauth.HttpAuthMiddleware',
 'scrapy.downloadermiddlewares.downloadtimeout.DownloadTimeoutMiddleware',
 'scrapy.downloadermiddlewares.defaultheaders.DefaultHeadersMiddleware',
 'scrapy.downloadermiddlewares.useragent.UserAgentMiddleware',
 'scrapy.downloadermiddlewares.retry.RetryMiddleware',
 'scrapy.downloadermiddlewares.redirect.MetaRefreshMiddleware',
 'scrapy.downloadermiddlewares.httpcompression.HttpCompressionMiddleware',
 'scrapy.downloadermiddlewares.redirect.RedirectMiddleware',
 'scrapy.downloadermiddlewares.cookies.CookiesMiddleware',
 'scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware',
 'scrapy.downloadermiddlewares.stats.DownloaderStats']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled spider middlewares:
['scrapy.spidermiddlewares.httperror.HttpErrorMiddleware',
 'scrapy.spidermiddlewares.offsite.OffsiteMiddleware',
 'scrapy.spidermiddlewares.referer.RefererMiddleware',
 'scrapy.spidermiddlewares.urllength.UrlLengthMiddleware',
 'scrapy.spidermiddlewares.depth.DepthMiddleware']
2022-12-15 07:03:21 [scrapy.middleware] INFO: Enabled item pipelines:
[]
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider opened
2022-12-15 07:03:21 [scrapy.extensions.logstats] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min)
2022-12-15 07:03:21 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Closing spider (finished)
2022-12-15 07:03:21 [scrapy.statscollectors] INFO: Dumping Scrapy stats:
{'elapsed_time_seconds': 0.002359,
 'finish_reason': 'finished',
 'finish_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 314439),
 'log_count/DEBUG': 1,
 'log_count/INFO': 10,
 'log_count/WARNING': 1,
 'memusage/max': 63709184,
 'memusage/startup': 63709184,
 'start_time': datetime.datetime(2022, 12, 15, 7, 3, 21, 312080)}
2022-12-15 07:03:21 [scrapy.core.engine] INFO: Spider closed (finished)
```
In `/home/ubuntu/gerapy/logs`, the gerapy scheduler log shows:

```
~/gerapy/logs$ cat 20221215065310.log
INFO - 2022-12-15 14:53:18,043 - process: 480 - scheduler.py - gerapy.server.core.scheduler - 105 - scheduler - successfully synced task with jobs with force
INFO - 2022-12-15 14:54:15,011 - process: 480 - scheduler.py - gerapy.server.core.scheduler - 34 - scheduler - execute job of client LOCAL, project examplescraper, spider examplespider
```
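A sanity check that might help isolate this (my assumption: scrapyd on its default port 6800, with the project/spider names from the logs above) is to schedule the same spider straight through the scrapyd API, bypassing gerapy, and compare the resulting job log with the one above:

```bash
# Sanity check, bypassing gerapy. Assumes scrapyd on the default port 6800
# and the project/spider names shown in the logs above.

# List the spiders scrapyd actually sees inside the deployed egg.
curl "http://127.0.0.1:6800/listspiders.json?project=examplescraper"

# Schedule the spider directly; compare this job's log with the one gerapy produced.
curl -d project=examplescraper -d spider=examplespider \
  "http://127.0.0.1:6800/schedule.json"
```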
**To Reproduce**
Steps to reproduce the behavior:
1. AWS Ubuntu 20.04 instance.
2. Use a python3 virtual environment and follow the installation instructions.
3. Create a systemd service for scrapyd and gerapy by doing the following:
```
cd /lib/systemd/system
sudo nano scrapyd.service
```

Paste the following:
```
[Unit]
Description=Scrapyd service
After=network.target

[Service]
User=ubuntu
Group=ubuntu
WorkingDirectory=/home/ubuntu/env/scrape/bin
ExecStart=/home/ubuntu/env/scrape/bin/scrapyd

[Install]
WantedBy=multi-user.target
```
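One caveat from my side (not in the original steps): after creating a brand-new unit file, systemd may need to be told to rescan before it will see the service:

```bash
# May be needed before enable/start will find the newly created unit file.
sudo systemctl daemon-reload
```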
Issue the following commands:

```
sudo systemctl enable scrapyd.service
sudo systemctl start scrapyd.service
sudo systemctl status scrapyd.service
```

It should say `active (running)`.
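To confirm scrapyd is actually answering requests and not just marked active by systemd, a quick check against its status endpoint can help (assuming the default port):

```bash
# scrapyd's built-in status endpoint; "status": "ok" means it is serving.
curl "http://127.0.0.1:6800/daemonstatus.json"
```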
4. Create a script to run gerapy as a systemd service:

```
cd ~/virtualenv/exampleproject/bin/
nano runserver-gerapy.sh
```

Paste the following:
```
#!/bin/bash
cd /home/ubuntu/virtualenv
source exampleproject/bin/activate
cd /home/ubuntu/gerapy
gerapy runserver 0.0.0.0:8000
```
Give this file execute permissions:

```
sudo chmod +x runserver-gerapy.sh
```
Navigate back to systemd and create a service to run runserver-gerapy.sh:

```
cd /lib/systemd/system
sudo nano gerapy-web.service
```

Paste the following:
```
[Unit]
Description=Gerapy Webserver Service
After=network.target

[Service]
User=ubuntu
Group=ubuntu
WorkingDirectory=/home/ubuntu/virtualenv/exampleproject/bin
ExecStart=/bin/bash /home/ubuntu/virtualenv/exampleproject/bin/runserver-gerapy.sh

[Install]
WantedBy=multi-user.target
```
Again, issue the following:

```
sudo systemctl enable gerapy-web.service
sudo systemctl start gerapy-web.service
sudo systemctl status gerapy-web.service
```
Look for `active (running)`, then navigate to http://your.pub.ip.add:8000, http://localhost:8000, or http://127.0.0.1:8000 to verify that it is running. Reboot the instance to verify that the services start on system startup.
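A headless way to do the same verification (my addition, assuming the 0.0.0.0:8000 bind from the script above):

```bash
# Expect an HTTP response (e.g. 200) if the gerapy web UI is up.
curl -I "http://127.0.0.1:8000"
```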
5. Log in and create a client for the local scrapyd service. Use IP 127.0.0.1 and port 6800, with no auth. Save it as "Local" or "Scrapyd".
6. Create a project and select Clone. For testing I used the following GitHub scrapy project: https://github.com/eneiromatos/NebulaEmailScraper (actually a pretty nice starter project). Save the project, build the project, and deploy the project. (If you get an error when deploying, make sure you are running in the virtual env; you might need to reboot.) An optional check that the deploy actually landed is sketched after this list.
7. Create a task. Make sure the project name and spider name match what is in the scrapy.cfg and examplespider.py files, then save the task. Schedule the task and run it.
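The optional deploy check mentioned in step 6 (my assumption: the default scrapyd port and the project name used above):

```bash
# Verify which projects and egg versions scrapyd is serving after the deploy.
curl "http://127.0.0.1:6800/listprojects.json"
curl "http://127.0.0.1:6800/listversions.json?project=examplescraper"
```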
**Traceback**
See logs above ^^^
**Expected behavior**
The spider should run for at least 5 minutes and write its output to a file called emails.json in the project root folder (the folder containing the scrapy.cfg file).
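One caution for anyone reproducing this: if I understand scrapyd correctly, a relative feed path like emails.json resolves against the scrapyd process's working directory (here /home/ubuntu/env/scrape/bin), not the project checkout, so even a successful run may put the file somewhere unexpected. A blunt way to look for it:

```bash
# Hypothetical search: locate any emails.json the crawl may have written.
find /home/ubuntu -name emails.json 2>/dev/null
```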
**Screenshots**
I can upload screenshots if requested.
**Environment (please complete the following information):**
- OS: AWS Ubuntu 20.04
- Browser: Firefox
- Python Version: 3.8
- Gerapy Version: 0.9.11 (latest)