Incredibly fast crawler designed for OSINT.

Overview


Photon
Photon

Incredibly fast crawler designed for OSINT.

pypi

demo

Photon WikiHow To UseCompatibilityPhoton LibraryContributionRoadmap

Key Features

Data Extraction

Photon can extract the following data while crawling:

  • URLs (in-scope & out-of-scope)
  • URLs with parameters (example.com/gallery.php?id=2)
  • Intel (emails, social media accounts, amazon buckets etc.)
  • Files (pdf, png, xml etc.)
  • Secret keys (auth/API keys & hashes)
  • JavaScript files & Endpoints present in them
  • Strings matching custom regex pattern
  • Subdomains & DNS related data

The extracted information is saved in an organized manner or can be exported as json.

save demo

Flexible

Control timeout, delay, add seeds, exclude URLs matching a regex pattern and other cool stuff. The extensive range of options provided by Photon lets you crawl the web exactly the way you want.

Genius

Photon's smart thread management & refined logic gives you top notch performance.

Still, crawling can be resource intensive but Photon has some tricks up it's sleeves. You can fetch URLs archived by archive.org to be used as seeds by using --wayback option.

Plugins

Docker

Photon can be launched using a lightweight Python-Alpine (103 MB) Docker image.

$ git clone https://github.com/s0md3v/Photon.git
$ cd Photon
$ docker build -t photon .
$ docker run -it --name photon photon:latest -u google.com

To view results, you can either head over to the local docker volume, which you can find by running docker inspect photon or by mounting the target loot folder:

$ docker run -it --name photon -v "$PWD:/Photon/google.com" photon:latest -u google.com

Frequent & Seamless Updates

Photon is under heavy development and updates for fixing bugs. optimizing performance & new features are being rolled regularly.

If you would like to see features and issues that are being worked on, you can do that on Development project board.

Updates can be installed & checked for with the --update option. Photon has seamless update capabilities which means you can update Photon without losing any of your saved data.

Contribution & License

You can contribute in following ways:

  • Report bugs
  • Develop plugins
  • Add more "APIs" for ninja mode
  • Give suggestions to make it better
  • Fix issues & submit a pull request

Please read the guidelines before submitting a pull request or issue.

Do you want to have a conversation in private? Hit me up on my twitter, inbox is open :)

Photon is licensed under GPL v3.0 license

Comments
  • error when scanning IP

    error when scanning IP

    Line 169 errors out if you run photon against an IP. Easiest fix might be to just add a try/except, but there is prob a more elgant solution.

    I'm pretty sure this was working before.

    [email protected]:/opt/Photon# python /opt/Photon/photon.py -u http://192.168.0.213:80
          ____  __          __
         / __ \/ /_  ____  / /_____  ____
        / /_/ / __ \/ __ \/ __/ __ \/ __ \
       / ____/ / / / /_/ / /_/ /_/ / / / /
      /_/   /_/ /_/\____/\__/\____/_/ /_/ v1.1.1
    
    Traceback (most recent call last):
      File "/opt/Photon/photon.py", line 169, in <module>
        domain = get_fld(host, fix_protocol=True) # Extracts top level domain out of the host
      File "/usr/local/lib/python2.7/dist-packages/tld/utils.py", line 387, in get_fld
        search_private=search_private
      File "/usr/local/lib/python2.7/dist-packages/tld/utils.py", line 339, in process_url
        raise TldDomainNotFound(domain_name=domain_name)
    tld.exceptions.TldDomainNotFound: Domain 192.168.0.213 didn't match any existing TLD name!
    
    opened by sethsec 18
  • Option to skip crawling of URLs that match a regex pattern

    Option to skip crawling of URLs that match a regex pattern

    Added option to skip crawling of URLs that match a regex pattern, changed handling of seeds, and removed double spaces.

    I am assuming this is what you meant in the ideas column tell me if I'm way off though :)

    opened by connorskees 17
  • Added two new options: -o/--output and --stdout

    Added two new options: -o/--output and --stdout

    Awesome tool. I've been looking for something like this for a while to integrate with something I am building! I added two new options for you to consider:

    1. -o/--output option that allows the user to specify an output directory (overriding the default).

    Command: python photon.py -u http://10.10.10.102:80 -l 2 -t100 -o /pentest/photontest

    In this case, all of the output files will be written to /pentest/photontest:

    [email protected]:/pentest/photontest# ls -ltr
    total 24
    -rw-r--r-- 1 root root    0 Jul 25 11:11 scripts.txt
    -rw-r--r-- 1 root root 3260 Jul 25 11:11 robots.txt
    -rw-r--r-- 1 root root 3260 Jul 25 11:11 links.txt
    -rw-r--r-- 1 root root   17 Jul 25 11:11 intel.txt
    -rw-r--r-- 1 root root  437 Jul 25 11:11 fuzzable.txt
    -rw-r--r-- 1 root root  146 Jul 25 11:11 files.txt
    -rw-r--r-- 1 root root    0 Jul 25 11:11 failed.txt
    -rw-r--r-- 1 root root   96 Jul 25 11:11 external.txt
    -rw-r--r-- 1 root root    0 Jul 25 11:11 endpoints.txt
    -rw-r--r-- 1 root root    0 Jul 25 11:11 custom.txt
    
    1. A --stdout option that allow user to print everything to stdout so they can pipe it into another tool or redirect all output to an output file with an operating system redirector.

    Command: [email protected]:/opt/dev/Photon# python photon.py -u http://10.10.10.9:80 -l 2 -t100 --stdout

    Output:

          ____  __          __
         / __ \/ /_  ____  / /_____  ____
        / /_/ / __ \/ __ \/ __/ __ \/ __ \
       / ____/ / / / /_/ / /_/ /_/ / / / /
      /_/   /_/ /_/\____/\__/\____/_/ /_/
    
    [+] URLs retrieved from robots.txt: 68
    [~] Level 1: 69 URLs
    [!] Progress: 69/69
    [~] Level 2: 9 URLs
    [!] Progress: 9/9
    [~] Crawling 0 JavaScript files
    
    --------------------------------------------------
    [+] URLs: 78
    [+] Intel: 1
    [+] Files: 1
    [+] Endpoints: 0
    [+] Fuzzable URLs: 9
    [+] Custom strings: 0
    [+] JavaScript Files: 0
    [+] External References: 3
    --------------------------------------------------
    [!] Total time taken: 0:32
    [!] Average request time: 0.40
    [+] Results saved in 10.10.10.9:80 directory
    
    All Results:
    
    http://10.10.10.9:80/themes/*.gif
    http://10.10.10.9:80/modules/*.png
    http://10.10.10.9:80/INSTALL.mysql.txt
    http://10.10.10.9:80/install.php
    http://10.10.10.9:80/scripts/
    http://10.10.10.9:80/node/add/
    http://10.10.10.9:80/?q=admin/
    http://10.10.10.9:80/themes/*.png
    http://10.10.10.9:80/modules/*.gif
    http://10.10.10.9:80
    http://10.10.10.9:80/includes/
    http://10.10.10.9:80/?q=user/password/
    http://10.10.10.9:80/INSTALL.txt
    http://10.10.10.9:80/profiles/
    http://10.10.10.9:80/themes/bartik/css/ie6.css?on28x3
    http://10.10.10.9:80/MAINTAINERS.txt
    http://10.10.10.9:80/themes/bartik/css/ie.css?on28x3
    http://10.10.10.9:80/modules/*.jpeg
    http://10.10.10.9:80/misc/*.gif
    

    The last thing i changed is the way you were wiping the directory each time you ran the tool so that you would get clean output. If you accept the -o option which allows the user to specify the directory, you can't just blindly delete the directory anymore (can't trust user input ;)). So i think i added a cleaner way to just overwrite each file (replacing the w+ with w), that should accomplish the same thing without needing to delete directories.

    enhancement 
    opened by sethsec 13
  • Added -v/--verbose option + fixed few logic error in detecting bad js files

    Added -v/--verbose option + fixed few logic error in detecting bad js files

    Changes:

    • Added the -v/--verbose option for verbose output.
    • Fixed 2 logical errors in detecting bad JS scripts. - With this fix, number of bad js files have almost gone to zero.
    opened by 0xInfection 11
  • RuntimeError: Set changed size during iteration in Photon 1.0.7

    RuntimeError: Set changed size during iteration in Photon 1.0.7

    [!] Progress: 1/1
    [~] Level 2: 387 URLs
    [!] Progress: 387/387
    [~] Level 3: 18078 URLs
    [!] Progress: 18078/18078
    [~] Level 4: 90143 URLs
    [!] Progress: 39750/90143^C
    [~] Crawling 0 JavaScript files
    
    Traceback (most recent call last):
      File "photon.py", line 454, in <module>
        for url in external:
    RuntimeError: Set changed size during iteration
    
    invalid 
    opened by thistehneisen 10
  • A couple minor issues

    A couple minor issues

    SSL issues

    You're ignoring SSL verification:

    /usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
      InsecureRequestWarning)
    

    You can fix this by adding the following to the top of the file (or wherever you feel like putting it):

    import warnings
    warnings.filterwarnings("ignore")
    

    Now this is bad practice, for obvious reasons, however, it makes sense why you're doing it. The above solution is the quickest and simplest solution to solve the problem and keep the errors from being annoying.

    An example with and without:

    Without: without

    With: with


    Dammit Unicode

    If I supply a URL that is in, lets say Russian and try to extract all the data:

          ____  __          __
         / __ \/ /_  ____  / /_____  ____
        / /_/ / __ \/ __ \/ __/ __ \/ __ \
       / ____/ / / / /_/ / /_/ /_/ / / / /
      /_/   /_/ /_/\____/\__/\____/_/ /_/ 
    
     URLs retrieved from robots.txt: 5
     Level 1: 6 URLs
     Progress: 6/6
     Level 2: 35 URLs
     Progress: 35/35
     Level 3: 7 URLs
     Progress: 7/7
     Crawling 7 JavaScript files
     Progress: 7/7
    Traceback (most recent call last):
      File "photon.py", line 429, in <module>
        f.write(x + '\n')
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u200b' in position 7: ordinal not in range(128)
    

    Easiest solution would be to just ignore things like this and continue, with a warning to the user that it was ignored.

    opened by Ekultek 10
  • Adding code to detect and report broken links.

    Adding code to detect and report broken links.

    Current Issue: Broken links for which server returns 404 status (or similar error) are not reported as failed links, instead these erroneous pages are parsed for text content.

    opened by snehm 9
  • Multiple improvements

    Multiple improvements

    The intels are searched only inside the page plain text to avoid retrieving tokens are garbage javascript code. Better regular expressions could allow searching inside javascript code for intels though.

    v1.3.0

    • Added more intels (GENERIC_URL, BRACKET_URL, BACKSLASH_URL, HEXENCODED_URL, URLENCODED_URL, B64ENCODED_URL, IPV4, IPV6, EMAIL, MD5, SHA1, SHA256, SHA512, YARA_PARSE, CREDIT_CARD)
    • Intel search only applied to text (not inside javascript or html tags)
    • proxy support with -p, --proxy option (http proxy only)
    • minor fixes and pep8 format

    Tested on

    os: Linux Mint 19.1
    python: 3.6.7
    
    opened by oXis 8
  • get_lfd

    get_lfd

    Hi After using the option --update i get this error:

    Traceback (most recent call last): File "photon.py", line 187, in domain = topLevel(main_url) File "photon.py", line 182, in topLevel toplevel = tld.get_fld(host, fix_protocol=True) AttributeError: 'module' object has no attribute 'get_fld'

    Any clue? I installed also on another system and still the same change target: same delete totally and clone again: same

    OS Kali3 Thanks

    invalid 
    opened by psychomad 7
  • Exceptions in threads during scanning in Level 1 & 2

    Exceptions in threads during scanning in Level 1 & 2

    Exception in thread Thread-4694: Traceback (most recent call last): File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner self.run() File "/usr/lib/python2.7/threading.py", line 754, in run self.__target(*self.__args, **self.__kwargs) File "photon.py", line 211, in extractor if is_link(link, processed, files): File "/usr/share/Photon/core/utils.py", line 41, in is_link is_file = url.endswith(BAD_TYPES) TypeError: endswith first arg must be str, unicode, or tuple, not list

    image

    opened by 4k4xs4pH1r3 6
  • Improve dev-experience

    Improve dev-experience

    Add .gitignore and requirements file just to improve development experience.

    Doing this:

    • You won't need to remove unnecessary python compiled files or either other not needed files.
    • You just "pip install -r requirements.txt", and you'll have all necessary third-party libraries to execute script.
    invalid 
    opened by dgarana 5
  • TLSCertVerificationDisabled

    TLSCertVerificationDisabled

    Under [core/requester.py] line 48 of code.

    response = SESSION.get

    Certificate verification is disabled by setting verify to False in get. This may lead to Man-in-the-middle attacks.

    opened by yonasuriv 0
  • .well-known files

    .well-known files

    I see that this project examines robots.txt and sitemap.xml. I was wondering if you could add some of the other .well-known files like ads.txt, security.txt and others found https://well-known.dev/resources/

    opened by WebBreacher 0
  • No generating a DNS map

    No generating a DNS map

    So i cant get an dns map only creates when getting from google and example.com.

    Im using python 3 and command: python3 photon.py -u "http://example.com" --dns

    opened by DEPSTRCZ 0
Releases(v1.3.0)
  • v1.3.0(Apr 5, 2019)

    • Dropped Python < 3.2 support
    • Removed Ninja mode
    • Fixed a bug in link parsing
    • Fixed Unicode output
    • Fixed a bug which caused URLs to be treated as files
    • Intel is now associated with the URL where it was found
    Source code(tar.gz)
    Source code(zip)
  • v1.2.1(Jan 26, 2019)

  • v1.1.6(Jan 25, 2019)

  • v1.1.5(Oct 24, 2018)

  • v1.1.4(Sep 18, 2018)

  • v1.1.3(Sep 6, 2018)

  • v1.1.2(Aug 7, 2018)

    • Code refactor
    • Better identification of external URLs
    • Fixed a major bug that made several intel URLs pass under the radar
    • Fixed a major bug that caused non-html type content to be marked a crawlable URL
    Source code(tar.gz)
    Source code(zip)
  • v1.1.1(Sep 4, 2018)

    • Added --wayback
    • Fixed progress bar for python > 3.2
    • Added /core/config.py for easy customization
    • --dns now saves subdomains in subdomains.txt
    Source code(tar.gz)
    Source code(zip)
  • v1.1.0(Aug 29, 2018)

    • Use of ThreadPoolExecutor for x2 speed (for python > 3.2)
    • Fixed mishandling of urls starting with //
    • Removed a redundant try-except statement
    • Evaluate entropy of found keys to avoid false positives
    Source code(tar.gz)
    Source code(zip)
  • v1.0.9(Aug 19, 2018)

  • v1.0.8(Aug 4, 2018)

    • added --exclude option
    • Better regex and code logic to favor performance
    • Fixed a bug that caused dnsdumpster to fail if target was a subdomain
    • Fixed a bug that caused a crash if run outside "Photon" directory
    • Fixed a bug in file saving (specific to python3)
    Source code(tar.gz)
    Source code(zip)
  • v1.0.7(Jul 28, 2018)

    • Added --timeout option
    • Added --output option
    • Added --user-agent option
    • Replaced lxml with regex
    • Better logic for favoring performance
    • Added bigger and separate file for user-agents
    Source code(tar.gz)
    Source code(zip)
  • v1.0.6(Jul 26, 2018)

  • v1.0.5(Jul 26, 2018)

  • v1.0.4(Jul 25, 2018)

    • Fixed an issue which caused regular links to be saved in robots.txt
    • Simplified flash function
    • Removed -n as an alias of --ninja
    • Added --only-urls option
    • Refactored code for readability
    • Skip saving files if the content is empty
    Source code(tar.gz)
    Source code(zip)
  • v1.0.3(Jul 24, 2018)

    • Introduced plugins
    • Added dnsdumpster plugin
    • Fixed non-ascii character handling, again
    • 404 pages are now added to failed list
    • Handling exceptions in jscanner
    Source code(tar.gz)
    Source code(zip)
  • v1.0.2(Jul 23, 2018)

    • Proper handling of null response from robots.txt & sitemap.xml
    • Python2 compatibility
    • Proper handling of non-ascii chars
    • Added ability to specify custom regex pattern
    • Display total time taken and average time per request
    Source code(tar.gz)
    Source code(zip)
  • v1.0.1(Jul 23, 2018)

Owner
Somdev Sangwan
I make things, I break things and I make things that break things.
Somdev Sangwan
Consulta de CPF e CNPJ na Receita Federal com Web-Scraping

Repositório contendo scripts Python que realizam a consulta de CPF e CNPJ diretamente no site da Receita Federal.

Josué Campos 5 Nov 29, 2021
Webservice wrapper for hhursev/recipe-scrapers (python library to scrape recipes from websites)

recipe-scrapers-webservice This is a wrapper for hhursev/recipe-scrapers which provides the api as a webservice, to be consumed as a microservice by o

1 Jul 09, 2022
Minimal set of tools to conduct stealthy scraping.

Stealthy Scraping Tools Do not use puppeteer and playwright for scraping. Explanation. We only use the CDP to obtain the page source and to get the ab

Nikolai Tschacher 88 Jan 04, 2023
Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers

Raspi-scraper is a configurable python webscraper that checks raspberry pi stocks from verified sellers.

Louie Cai 13 Oct 15, 2022
This project was created using Python technology and flask tools to scrape a music site

python-scrapping This project was created using Python technology and flask tools to scrape a music site You need to install the following packages to

hosein moradi 1 Dec 07, 2021
An IpVanish Proxies Scraper

EzProxies Tired of searching for good proxies for hours? Just get an IpVanish account and get thousands of good proxies in few seconds! Showcase Watch

11 Nov 13, 2022
Create crawler get some new products with maximum discount in banimode website

crawler-banimode create crawler and get some new products with maximum discount in banimode website. این پروژه کوچک جهت یادگیری و کار با ابزار سلنیوم

nourollah rezaei 2 Feb 17, 2022
Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc)

Python based Web Scraper which can discover javascript files and parse them for juicy information (API keys, IP's, Hidden Paths etc).

Amit 6 Aug 26, 2022
A package that provides you Latest Cyber/Hacker News from website using Web-Scraping.

cybernews A package that provides you Latest Cyber/Hacker News from website using Web-Scraping. Latest Cyber/Hacker News Using Webscraping Developed b

Hitesh Rana 4 Jun 02, 2022
crypto currency scraping

SCRYPTO What ? Crypto currencies scraping (At the moment, only bitcoin and ethereum crypto currencies are supported) How ? A python script is running

15 Sep 01, 2022
An arxiv spider

An Arxiv Spider 做为一个cser,杰出男孩深知内核对连接到计算机上的硬件设备进行管理的高效方式是中断而不是轮询。每当小伙伴发来一篇刚挂在arxiv上的”热乎“好文章时,杰出男孩都会感叹道:”师兄这是每天都挂在arxiv上呀,跑的好快~“。于是杰出男孩找了找 github,借鉴了一下其

Jie Liu 11 Sep 09, 2022
Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

Semplice scraper realizzato in Python tramite la libreria BeautifulSoup

2 Nov 22, 2021
A simplistic scraper made to download tons of random screenshots made by people.

printStealer 1.1 What is this tool? This tool is developed to show the insecurity of the screenshot utility called prnt sc. It is a site that stores s

appelsiensam 4 Jul 26, 2022
This code will be able to scrape movies from a movie website and also provide download links to newly uploaded movies.

Movies-Scraper You are probably tired of navigating through a movie website to get the right movie you'd want to watch during the weekend. There may e

1 Jan 31, 2022
This is a module that I had created along with my friend. It's a basic web scraping module

QuickInfo PYPI link : https://pypi.org/project/quickinfo/ This is the library that you've all been searching for, it's built for developers and allows

OneBit 2 Dec 13, 2021
中国大学生在线 四史自动答题刷分(现仅支持英雄篇)

中国大学生在线 “四史”学习教育竞答 自动答题 刷分 (现仅支持英雄篇,已更新可用) 若对您有所帮助,记得点个Star 🌟 !!! 中国大学生在线 “四史”学习教育竞答 自动答题 刷分 (现仅支持英雄篇,已更新可用) 🥰 🥰 🥰 依赖 本项目依赖的第三方库: requests 在终端执行以下

XWhite 229 Dec 12, 2022
fork huanghyw/jd_seckill

Jd_Seckill 特别声明: 本仓库发布的jd_seckill项目中涉及的任何脚本,仅用于测试和学习研究,禁止用于商业用途,不能保证其合法性,准确性,完整性和有效性,请根据情况自行判断。 本项目内所有资源文件,禁止任何公众号、自媒体进行任何形式的转载、发布。

512 Jan 03, 2023
Scrapes Every Email Address of Every Society in Every University

society-email-scrape Site Live at https://kcsoc.github.io/society-email-scrape/ How to automatically generate new data Go to unis.yml Add your uni Cre

Krishna Consciousness Society 18 Dec 14, 2022
✂️🕷️ Spider-Cut is a Network Mapper Framework (NMAP Framework)

Spider-Cut is a Network Mapper Framework (NMAP Framework) Installation | Usage | Creators | Donate Installation # Kali Linux | WSL

XforWorks 3 Mar 07, 2022
Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX)

mcc-mnc.com-webscraper Scrapes mcc-mnc.com and outputs 3 files with the data (JSON, CSV & XLSX) A Python script for web scraping mcc-mnc.com Link: mcc

Anton Ivarsson 1 Nov 07, 2021