Visual scraping for Scrapy

Related tags

Web Crawlingportia
Overview

Portia

Portia is a tool that allows you to visually scrape websites without any programming knowledge required. With Portia you can annotate a web page to identify the data you wish to extract, and Portia will understand based on these annotations how to scrape data from similar pages.

Running Portia

The easiest way to run Portia is using Docker:

You can run Portia using Docker & official Portia-image by running:

docker run -v ~/portia_projects:/app/data/projects:rw -p 9001:9001 scrapinghub/portia

You can also set up a local instance with Docker-compose by cloning this repo & running from the root of the folder:

docker-compose up

For more detailed instructions, and alternatives to using Docker, see the Installation docs.

Documentation

Documentation can be found from Read the docs. Source files can be found in the docs directory.

Comments
  • unable to deploy with scrapyd-deploy

    unable to deploy with scrapyd-deploy

    Hello,

    Could you please help me figure out what I'm doing wrong ? Here are the steps: i followed the portia install manual - all ok i created a new project, entered an url, tagged an item - all ok clicked "continue browsing", browsed through site, items were being extracted as expected - all ok

    Next i wanted to deploy my spider: 1st try : i tried to run, as the docs specified, scrapyd-deploy your_scrapyd_target -p project_name - got error - scrapyd wasn't installed fix: pip install scrapyd 2nd try : i launched scrapyd server (also missing from the docs), accessed http://localhost:6800/ -all ok after a brief reading of scrapyd docs i found out i had to edit the file scrapy.cfg from my project : slyd/data/projects/new_project/scrapy.cfg added the following : [deploy:local] url = http://localhost:6800/

    went back to the console, checked all is ok : $:> scrapyd-deploy -l local http://localhost:6800/

    $:> scrapyd-deploy -L local default

    seemed ok so i gave it another try : $scrapyd-deploy local -p default Packing version 1418722113 Deploying to project "default" in http://localhost:6800/addversion.json Server response (200): {"status": "error", "message": "IOError: [Errno 21] Is a directory: '/Users/Mihai/Work/www/4ideas/MarketWatcher/portia_tryout/portia/slyd/data/projects/new_project'"}

    What am I missing ?

    opened by MihaiCraciun 40
  • ImportError: No module named jsonschema.exceptions

    ImportError: No module named jsonschema.exceptions

    After correct installation, when I try to run

    twistd -n slyd

    I get

    Traceback (most recent call last): File "/usr/local/bin/twistd", line 14, in run() File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 27, in run app.run(runApp, ServerOptions) File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 642, in run runApp(config) File "/usr/local/lib/python2.7/dist-packages/twisted/scripts/twistd.py", line 23, in runApp _SomeApplicationRunner(config).run() File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 376, in run self.application = self.createOrGetApplication() File "/usr/local/lib/python2.7/dist-packages/twisted/application/app.py", line 436, in createOrGetApplication ser = plg.makeService(self.config.subOptions) File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/tap.py", line 55, in makeService root = create_root(config) File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/tap.py", line 27, in create_root from slyd.crawlerspec import (CrawlerSpecManager, File "/home/euphorbium/Projects/mtg/scraper/portia-master/slyd/slyd/crawlerspec.py", line 12, in from jsonschema.exceptions import ValidationError ImportError: No module named jsonschema.exceptions

    opened by Euphorbium 28
  • How to install portia2.0 correctly by docker?

    How to install portia2.0 correctly by docker?

    when I install portia by docker I had a problems. when I run ember build -w ,the results is:

    Could not start watchman; falling back to NodeWatcher for file system events. Visit http://ember-cli.com/user-guide/#watchman for more info. components/browser-iframe.js: line 175, col 39, 'TreeMirror' is not defined. components/browser-iframe.js: line 212, col 9, '$' is not defined. components/browser-iframe.js: line 227, col 23, '$' is not defined. components/browser-iframe.js: line 281, col 20, '$' is not defined. 4 errors components/inspector-panel.js: line 92, col 16, '$' is not defined. components/inspector-panel.js: line 92, col 33, '$' is not defined. components/inspector-panel.js: line 102, col 16, '$' is not defined. 3 errors components/save-status.js: line 48, col 16, 'moment' is not defined. 1 error controllers/projects/project/conflicts/conflict.js: line 105, col 13, '$' is not defined. 1 error controllers/projects/project/spider.js: line 10, col 25, 'URI' is not defined. 1 error routes/projects/project/conflicts.js: line 5, col 16, '$' is not defined. 1 error services/web-socket.js: line 10, col 15, 'URI' is not defined. services/web-socket.js: line 15, col 12, 'URI' is not defined. 2 errors utils/browser-features.js: line 14, col 13, 'Modernizr' is not defined. 1 error utils/tree-mirror-delegate.js: line 30, col 20, '$' is not defined. utils/tree-mirror-delegate.js: line 66, col 17, '$' is not defined. 2 errors utils/utils.js: line 15, col 20, 'URI' is not defined. utils/utils.js: line 50, col 9, 'Raven' is not defined. utils/utils.js: line 56, col 9, 'Raven' is not defined. 3 errors ===== 10 JSHint Errors Build successful - 1641ms. Slowest Trees | Total
    ----------------------------------------------+--------------------- broccoli-persistent-filter:Babel > [Babel:... | 275ms
    JSHint app | 149ms
    SourceMapConcat: Concat: Vendor /assets/ve... | 89ms
    broccoli-persistent-filter:Babel | 85ms
    broccoli-persistent-filter:Babel | 82ms
    Slowest Trees (cumulative) | Total (avg)
    ----------------------------------------------+--------------------- broccoli-persistent-filter:Babel > [Ba... (2) | 294ms (147 ms)
    broccoli-persistent-filter:Babel (5) | 257ms (51 ms)
    JSHint app (1) | 149ms
    SourceMapConcat: Concat: Vendor /asset... (1) | 89ms
    broccoli-persistent-filter:TemplateCom... (2) | 88ms (44 ms)

    I wanna know where is wrong. By the way,this step I followed the some ISSUES,if I just follow the Documentation to install portia by docker(docker build -t portia .),when I run the portia,the http://localhost:9001/static/index.html will show me nothing. So,how to install portia2.0 correctly by docker?Could someone help me?

    opened by ChoungJX 19
  • Support for JS

    Support for JS

    Add support for JS based sites. It would be nice to have UI support for configuring instead of having to do it manually at the Scrapy level.

    Perhaps we can allow users to enable or disable sending requests via Splash.

    feature 
    opened by shaneaevans 19
  • Crawls duplicate/identical URLs

    Crawls duplicate/identical URLs

    If there are duplicate URLs on the page they will be crawled and exported as many times as you see the link.

    It is very unusual circumstances that you would need to crawl the same URL more than once.

    I have two proposals:

    1. Have a checkbox so you can tick "Avoid visiting duplicate links".
    2. Alternatively, add filtering options in the link crawler to filter by HTML markup too, that way only only links with certain classes.
    opened by jpswade 18
  • Question: how to store data in json

    Question: how to store data in json

    Hi, recently I have installed portia and it's quite good. But since I'm not very familiar with it, I have met a problem: In my template I used variants to scrape the website and it worked well when I clicked continue browsing or turned to a similiar page. However, when I started running the portia spider in cmd, it could successfully scrape data yet failed to store the whole data scraped in the json file. I found that in the json file only the data in the page I annotated were stored. I guess the problem is "WARNING: Dropped: Duplicate product scraped at http://bbs.chinadaily.com.cn/forum-83-206.html, first one was scraped at http://bbs.chinadaily.com.cn/forum-83-2.html" (it appeared in cmd when the spider was running), but I don't know how to solve the problem. I hope someone can help me as soon as possible. Thanks very much!

    opened by SophiaCY 17
  • Next page navigation

    Next page navigation

    Is the new portia scrape next page data? I build spider in portia with nextpage recorded option which is run in docker, but when i deployed it in scrapyd there is no data scraped i got output like

    2016-03-29 15:57:55 [scrapy] INFO: Spider opened 2016-03-29 15:57:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2016-03-29 15:57:55 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6023 2016-03-29 15:58:32 [scrapy] DEBUG: Retrying <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (failed 1 times): 504 Gateway Time-out 2016-03-29 15:58:55 [scrapy] INFO: Crawled 0 pages (at 0 pages/min), scraped 0 items (at 0 items/min) 2016-03-29 15:59:07 [scrapy] DEBUG: Retrying <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (failed 2 times): 504 Gateway Time-out 2016-03-29 15:59:38 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 15:59:39 [scrapy] DEBUG: Filtered offsite request to 'www.successfactors.com': <GET http://www.successfactors.com/> 2016-03-29 15:59:51 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 15:59:51 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 15:59:52 [scrapy] DEBUG: Filtered offsite request to 'www.cpchem.com': <GET http://www.cpchem.com/> 2016-03-29 15:59:52 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 15:59:55 [scrapy] INFO: Crawled 7 pages (at 7 pages/min), scraped 0 items (at 0 items/min) 2016-03-29 15:59:57 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 15:59:58 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 15:59:59 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 16:00:00 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 16:00:00 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 16:00:03 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 16:00:04 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None) 2016-03-29 16:00:04 [scrapy] DEBUG: Crawled (200) <POST http://localhost:8050/render.html?job_id=dfe9afdef59811e5b9ad000c29120335> (referer: None)

    there scrapyd and splash are running in local, is there any idea for this issue?

    thanks in advance

    opened by agramesh 16
  • Running Portia with Docker

    Running Portia with Docker

    For windows machine I installed docker and this command "docker build -t portia ." also successfully run . after that i received error while entering this section

    docker run -i -t --rm -v <PROJECT_FOLDER>/data:/app/slyd/data:rw
    -p 9001:9001
    --name portia
    portia

    how configure this command and what is <PROJECT_FOLDER>. Could please help on this to test.

    opened by agramesh 15
  • Cant visit http://localhost:8000/

    Cant visit http://localhost:8000/

    After installed vagrant and virtualbox on my win7 amd64 pc, I successfully launched the Ubuntu virtual machine. image

    However, when I visited http://localhost:8000/, I got this. image

    What am i missing?

    opened by fakegit 14
  • Try to resolve the

    Try to resolve the "No module named slyd.tap error"?

    When I start bin/slyd i received the following error in cent-os

    slyd]$ bin/slyd 2015-08-27 17:18:29+0530 [-] Log opened. 2015-08-27 17:18:29.318689 [-] Splash version: 1.6 2015-08-27 17:18:29.319315 [-] Qt 4.8.5, PyQt 4.10.1, WebKit 537.21, sip 4.14.6, Twisted 15.2.1, Lua 5.1 2015-08-27 17:18:29.319412 [-] Open files limit: 1024 2015-08-27 17:18:29.319473 [-] Open files limit increased from 1024 to 4096 2015-08-27 17:18:29.534741 [-] Xvfb is started: ['Xvfb', ':2431', '-screen', '0', '1024x768x24'] Xlib: extension "RANDR" missing on display ":2431". 2015-08-27 17:18:29.637987 [-] Traceback (most recent call last): 2015-08-27 17:18:29.638202 [-] File "bin/slyd", line 41, in 2015-08-27 17:18:29.638312 [-] splash.server.main() 2015-08-27 17:18:29.638408 [-] File "/usr/lib/python2.7/site-packages/splash/server.py", line 373, in main 2015-08-27 17:18:29.638572 [-] max_timeout=opts.max_timeout 2015-08-27 17:18:29.638665 [-] File "/usr/lib/python2.7/site-packages/splash/server.py", line 279, in default_splash_server 2015-08-27 17:18:29.638801 [-] max_timeout=max_timeout 2015-08-27 17:18:29.638887 [-] File "bin/slyd", line 15, in make_server 2015-08-27 17:18:29.638981 [-] from slyd.tap import makeService

    2015-08-27 17:18:29.639099 [-] ImportError: No module named slyd.tap

    How could it resolve

    opened by agramesh 14
  • error unexpected error

    error unexpected error

    hi all

    i dont know why this happen and how to solve this i got some error while i click new sample the GUI for enter new sample not happen but got this error appear

    portia error

    can anyone help me to solve this?

    opened by jemjov 13
  • How to get fields exported in correct order?

    How to get fields exported in correct order?

    I've got this working on Docker, and can run a spider with something along the lines of docker exec -t portialatest portiacrawl "/app/data/projects/nproject" "nproject.com" -o "/mnt/dc.csv" I've got spiders setup for a few sites which all scrape the same data, my problem is that the exported fields are in different orders, I've tried inserting FIELDS_TO_EXPORT and also CSV_EXPORT_FIELDS into settings.py but seems to have no effect. I'm wanting to get the fields output all in the same order - the name of the fields (Annotations) are Ref, Beds, Baths, Price, Location Could some kind soul point me to the correct solution please? Many thanks.

    opened by buttonsbond 0
  • fix(sec): upgrade scrapy-splash to 0.8.0

    fix(sec): upgrade scrapy-splash to 0.8.0

    What happened?

    There are 1 security vulnerabilities found in scrapy-splash 0.7.2

    What did I do?

    Upgrade scrapy-splash from 0.7.2 to 0.8.0 for vulnerability fix

    What did you expect to happen?

    Ideally, no insecure libs should be used.

    The specification of the pull request

    PR Specification from OSCS

    opened by chncaption 0
  • Fix portia_server

    Fix portia_server

    this PR is abandoned in favor of Gerapy

    fix #883 #920 #913 #907 #903 #902 #895 #877 #842 #812 #811 #790 #760 #742 ...

    todo

    • [x] fix portia_server
      • [x] bump version from 2.0.8 to 2.1.0
    • [x] fix slybot (at least the build is working, may produce runtime errors)
      • [ ] fix some snapshot tests
    • [x] fix portiaui
    • [x] fix slyd
    • [ ] support frames #494

    related https://github.com/scrapinghub/portia2code/pull/12 https://github.com/scrapy/scrapely/pull/122

    opened by milahu 6
  • Correct typo in docker setup docs

    Correct typo in docker setup docs

    Minor typo in docs - name of environment variable used in documentation info box (PROJECT_FOLDER) didn't match the actual name used in the example shell commands (PROJECTS_FOLDER)

    opened by mz8i 0
  • Browser version error message

    Browser version error message

    I am running Portia using Docker on Windows. Even though I just installed and updated Chrome today to Version 95.0.4638.54 (Official Build) (64-bit), I am getting an error that I need to use an up-to-date Browser as soon as I open the portia (localhost:9001).

    opened by fptmark 2
Owner
Scrapinghub
Turn web content into useful data
Scrapinghub
Scraping script for stats on covid19 pandemic status in Chiba prefecture, Japan

About 千葉県の地域別の詳細感染者統計(Excelファイル) をCSVに変換し、かつ地域別の日時感染者集計値を出力するスクリプトです。 Requirement POSIX互換なシェル, e.g. GNU Bash (1) curl (1) python = 3.8 pandas = 1.1.

Conv4Japan 1 Nov 29, 2021
Web-Scraping using Selenium Master

Web-Scraping using Selenium What is the need of Selenium? Some websites don't like to be scrapped and in that case you need to disguise your webscrapi

Md Rashidul Islam 1 Oct 26, 2021
Divar.ir Ads scrapper

Divar.ir Ads Scrapper Introduction This project first asynchronously grab Divar.ir Ads and then save to .csv and .xlsx files named data.csv and data.x

Iman Kermani 4 Aug 29, 2022
Get paper names from dblp.org

scraper-dblp Get paper names from dblp.org and store them in a .txt file Useful for a related literature :) Install libraries pip3 install -r requirem

Daisy Lab 1 Dec 07, 2021
A Very simple free proxy list scraper.

Scrappp A Very simple free proxy list scraper, made in python The tool scrape proxy from diffrent sites and api's. Screenshots About the script !!! RE

Joji aka Moncef 12 Oct 27, 2022
Pro Football Reference Game Data Webscraper

Pro Football Reference Game Data Webscraper Code Copyright Yeetzsche This is a simple Pro Football Reference Webscraper that can either collect all ga

6 Dec 21, 2022
Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

2 Nov 22, 2021
一款利用Python来自动获取QQ音乐上某个歌手所有歌曲歌词的爬虫软件

QQ音乐歌词爬虫 一款利用Python来自动获取QQ音乐上某个歌手所有歌曲歌词的爬虫软件,默认去除了所有演唱会(Live)版本的歌曲。 使用方法 直接运行python run.py即可,然后输入你想获取的歌手名字,然后静静等待片刻。 output目录下保存生成的歌词和歌名文件。以周杰伦为例,会生成两

Yang Wei 11 Jul 27, 2022
Discord webhook spammer with proxy support and proxy scraper

Discord webhook spammer with proxy support and proxy scraper

3 Feb 27, 2022
jd_maotai rpa 基于selenium驱动的jd抢购rpa机器人

jd_maotai rpa 基于selenium驱动的jd抢购rpa机器人, 照顾我们这样的马大哈, 不会忘记抢购了, 祝大家过年都能喝上茅台. 特别声明: 本仓库发布的jd_maotai_rpa项目定义为自动化rpa项目, 是用于防止忘记参与jd茅台的活动(由于本人时常忘记), 而不是为了秒杀和抢

35 Nov 18, 2022
This program will help you to properly scrape all data from a specific website

This program will help you to properly scrape all data from a specific website

MD. MINHAZ 0 May 15, 2022
优化版本的京东茅台抢购神器

优化版本的京东茅台抢购神器

1.8k Mar 18, 2022
Download images from forum threads

Forum Image Scraper Downloads images from forum threads Only works with forums which doesn't require a login to view and have an incremental paginatio

9 Nov 16, 2022
京东茅台抢购

截止 2021/2/1 日,该项目已无法使用! 京东:约满即止,仅限京东实名认证用户APP端抢购,2月1日10:00开始预约,2月1日12:00开始抢购(京东APP需升级至8.5.6版本及以上) 写在前面 本项目来自 huanghyw - jd_seckill,作者的项目地址我找不到了,找到了再贴上

abee 73 Dec 03, 2022
Scrape puzzle scrambles from csTimer.net

Scroodle Selenium script to scrape scrambles from csTimer.net csTimer runs locally in your browser, so this doesn't strain the servers any more than i

Jason Nguyen 1 Oct 29, 2021
The core packages of security analyzer web crawler

Security Analyzer 🐍 A large scale web crawler (considered also as vulnerability scanner tool) to take an overview about security of Moroccan sites Cu

Security Analyzer 10 Jul 03, 2022
爬取各大SRC当日公告 | 通过微信通知的小工具 | 赏金工具

OnTimeHacker V1.0 OnTimeHacker 是一个爬取各大SRC当日公告,并通过微信通知的小工具 OnTimeHacker目前版本为1.0,已支持24家SRC,列表如下 360、爱奇艺、阿里、百度、哔哩哔哩、贝壳、Boss、58、菜鸟、滴滴、斗鱼、 饿了么、瓜子、合合、享道、京东、

Bywalks 95 Jan 07, 2023
a Scrapy spider that utilizes Postgres as a DB, Squid as a proxy server, Redis for de-duplication and Splash to render JavaScript. All in a microservices architecture utilizing Docker and Docker Compose

This is George's Scraping Project To get started cd into the theZoo file and run: chmod +x script.sh then: ./script.sh This will spin up a Postgres co

George Reyes 7 Nov 27, 2022
Unja is a fast & light tool for fetching known URLs from Wayback Machine

Unja Fetch Known Urls What's Unja? Unja is a fast & light tool for fetching known URLs from Wayback Machine, Common Crawl, Virus Total & AlienVault's

Sheryar 10 Aug 07, 2022
Web Scraping Instagram photos with Selenium by only using a hashtag.

Web-Scraping-Instagram This project is used to automatically obtain images by web scraping Instagram with Selenium in Python. The required input will

Sandro Agama 3 Nov 24, 2022