Incredibly fast crawler designed for OSINT.

Overview


Photon

Photon Wiki • How To Use • Compatibility • Photon Library • Contribution • Roadmap

Key Features

Data Extraction

Photon can extract the following data while crawling:

  • URLs (in-scope & out-of-scope)
  • URLs with parameters (example.com/gallery.php?id=2)
  • Intel (emails, social media accounts, amazon buckets etc.)
  • Files (pdf, png, xml etc.)
  • Secret keys (auth/API keys & hashes)
  • JavaScript files & Endpoints present in them
  • Strings matching custom regex pattern
  • Subdomains & DNS related data

The extracted information is saved in an organized manner or can be exported as json.


Flexible

Control timeout, delay, add seeds, exclude URLs matching a regex pattern and other cool stuff. The extensive range of options provided by Photon lets you crawl the web exactly the way you want.

Genius

Photon's smart thread management & refined logic gives you top-notch performance.

Still, crawling can be resource intensive, but Photon has some tricks up its sleeve. You can fetch URLs archived by archive.org to be used as seeds with the --wayback option.
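
For a sense of what the --wayback option relies on, here is a rough sketch of pulling archived URLs from archive.org's CDX API to use as extra seeds (the function name and parameters are illustrative; Photon's own implementation may differ):

import requests

def wayback_seeds(domain):
    # Ask the Wayback Machine CDX API for URLs archived under the target domain
    api = 'http://web.archive.org/cdx/search/cdx'
    params = {'url': domain + '/*', 'output': 'json', 'fl': 'original', 'collapse': 'urlkey'}
    rows = requests.get(api, params=params, timeout=30).json()
    # The first row is the header ("original"); the rest are archived URLs
    return [row[0] for row in rows[1:]]

# seeds = wayback_seeds('example.com')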

Plugins

Docker

Photon can be launched using a lightweight Python-Alpine (103 MB) Docker image.

$ git clone https://github.com/s0md3v/Photon.git
$ cd Photon
$ docker build -t photon .
$ docker run -it --name photon photon:latest -u google.com

To view the results, you can either head over to the local Docker volume, which you can find by running docker inspect photon, or mount the target loot folder:

$ docker run -it --name photon -v "$PWD:/Photon/google.com" photon:latest -u google.com

Frequent & Seamless Updates

Photon is under heavy development, and updates for fixing bugs, optimizing performance & adding new features are rolled out regularly.

If you would like to see the features and issues that are being worked on, you can do that on the Development project board.

Updates can be checked for & installed with the --update option. Photon has seamless update capabilities, which means you can update Photon without losing any of your saved data.

Contribution & License

You can contribute in the following ways:

  • Report bugs
  • Develop plugins
  • Add more "APIs" for ninja mode
  • Give suggestions to make it better
  • Fix issues & submit a pull request

Please read the guidelines before submitting a pull request or issue.

Do you want to have a conversation in private? Hit me up on my Twitter; my inbox is open :)

Photon is licensed under the GPL v3.0 license.

Comments
  • error when scanning IP

    Line 169 errors out if you run Photon against an IP. The easiest fix might be to just add a try/except, but there is probably a more elegant solution.

    I'm pretty sure this was working before.

    [email protected]:/opt/Photon# python /opt/Photon/photon.py -u http://192.168.0.213:80
          ____  __          __
         / __ \/ /_  ____  / /_____  ____
        / /_/ / __ \/ __ \/ __/ __ \/ __ \
       / ____/ / / / /_/ / /_/ /_/ / / / /
      /_/   /_/ /_/\____/\__/\____/_/ /_/ v1.1.1
    
    Traceback (most recent call last):
      File "/opt/Photon/photon.py", line 169, in <module>
        domain = get_fld(host, fix_protocol=True) # Extracts top level domain out of the host
      File "/usr/local/lib/python2.7/dist-packages/tld/utils.py", line 387, in get_fld
        search_private=search_private
      File "/usr/local/lib/python2.7/dist-packages/tld/utils.py", line 339, in process_url
        raise TldDomainNotFound(domain_name=domain_name)
    tld.exceptions.TldDomainNotFound: Domain 192.168.0.213 didn't match any existing TLD name!
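
    A minimal sketch of the try/except approach suggested above (assuming the domain is only needed for scoping, so it can fall back to the raw host when the target is an IP address):

    from tld import get_fld
    from tld.exceptions import TldDomainNotFound

    host = '192.168.0.213'  # e.g. the target from the command above

    try:
        # Extracts the top level domain out of the host
        domain = get_fld(host, fix_protocol=True)
    except TldDomainNotFound:
        # The target is an IP address (or has no valid TLD), so fall back to the host itself
        domain = host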
    
    opened by sethsec 18
  • Option to skip crawling of URLs that match a regex pattern

    Added option to skip crawling of URLs that match a regex pattern, changed handling of seeds, and removed double spaces.

    I am assuming this is what you meant in the ideas column; tell me if I'm way off, though :)

    opened by connorskees 17
  • Added two new options: -o/--output and --stdout

    Awesome tool. I've been looking for something like this for a while to integrate with something I am building! I added two new options for you to consider:

    1. An -o/--output option that allows the user to specify an output directory (overriding the default).

    Command: python photon.py -u http://10.10.10.102:80 -l 2 -t100 -o /pentest/photontest

    In this case, all of the output files will be written to /pentest/photontest:

    [email protected]:/pentest/photontest# ls -ltr
    total 24
    -rw-r--r-- 1 root root    0 Jul 25 11:11 scripts.txt
    -rw-r--r-- 1 root root 3260 Jul 25 11:11 robots.txt
    -rw-r--r-- 1 root root 3260 Jul 25 11:11 links.txt
    -rw-r--r-- 1 root root   17 Jul 25 11:11 intel.txt
    -rw-r--r-- 1 root root  437 Jul 25 11:11 fuzzable.txt
    -rw-r--r-- 1 root root  146 Jul 25 11:11 files.txt
    -rw-r--r-- 1 root root    0 Jul 25 11:11 failed.txt
    -rw-r--r-- 1 root root   96 Jul 25 11:11 external.txt
    -rw-r--r-- 1 root root    0 Jul 25 11:11 endpoints.txt
    -rw-r--r-- 1 root root    0 Jul 25 11:11 custom.txt
    
    2. A --stdout option that allows the user to print everything to stdout so they can pipe it into another tool or redirect all output to a file with an operating-system redirector.

    Command: [email protected]:/opt/dev/Photon# python photon.py -u http://10.10.10.9:80 -l 2 -t100 --stdout

    Output:

          ____  __          __
         / __ \/ /_  ____  / /_____  ____
        / /_/ / __ \/ __ \/ __/ __ \/ __ \
       / ____/ / / / /_/ / /_/ /_/ / / / /
      /_/   /_/ /_/\____/\__/\____/_/ /_/
    
    [+] URLs retrieved from robots.txt: 68
    [~] Level 1: 69 URLs
    [!] Progress: 69/69
    [~] Level 2: 9 URLs
    [!] Progress: 9/9
    [~] Crawling 0 JavaScript files
    
    --------------------------------------------------
    [+] URLs: 78
    [+] Intel: 1
    [+] Files: 1
    [+] Endpoints: 0
    [+] Fuzzable URLs: 9
    [+] Custom strings: 0
    [+] JavaScript Files: 0
    [+] External References: 3
    --------------------------------------------------
    [!] Total time taken: 0:32
    [!] Average request time: 0.40
    [+] Results saved in 10.10.10.9:80 directory
    
    All Results:
    
    http://10.10.10.9:80/themes/*.gif
    http://10.10.10.9:80/modules/*.png
    http://10.10.10.9:80/INSTALL.mysql.txt
    http://10.10.10.9:80/install.php
    http://10.10.10.9:80/scripts/
    http://10.10.10.9:80/node/add/
    http://10.10.10.9:80/?q=admin/
    http://10.10.10.9:80/themes/*.png
    http://10.10.10.9:80/modules/*.gif
    http://10.10.10.9:80
    http://10.10.10.9:80/includes/
    http://10.10.10.9:80/?q=user/password/
    http://10.10.10.9:80/INSTALL.txt
    http://10.10.10.9:80/profiles/
    http://10.10.10.9:80/themes/bartik/css/ie6.css?on28x3
    http://10.10.10.9:80/MAINTAINERS.txt
    http://10.10.10.9:80/themes/bartik/css/ie.css?on28x3
    http://10.10.10.9:80/modules/*.jpeg
    http://10.10.10.9:80/misc/*.gif
    

    The last thing I changed is the way you were wiping the directory each time you ran the tool so that you would get clean output. If you accept the -o option, which allows the user to specify the directory, you can't just blindly delete the directory anymore (can't trust user input ;)). So I added a cleaner way to just overwrite each file (replacing the w+ with w), which should accomplish the same thing without needing to delete directories.
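
    For illustration, a rough sketch of that overwrite approach (the helper name is hypothetical; the file name matches the listing above):

    import os

    def save_results(output_dir, name, lines):
        # Create the output directory if needed instead of wiping it on every run
        if not os.path.isdir(output_dir):
            os.makedirs(output_dir)
        # 'w' truncates any existing file, so stale results from a previous run get replaced
        with open(os.path.join(output_dir, name), 'w') as f:
            for line in lines:
                f.write(line + '\n')

    # save_results('/pentest/photontest', 'links.txt', links)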

    enhancement 
    opened by sethsec 13
  • Added -v/--verbose option + fixed few logic error in detecting bad js files

    Changes:

    • Added the -v/--verbose option for verbose output.
    • Fixed 2 logical errors in detecting bad JS scripts. With this fix, the number of bad JS files has almost gone to zero.
    opened by 0xInfection 11
  • RuntimeError: Set changed size during iteration in Photon 1.0.7

    [!] Progress: 1/1
    [~] Level 2: 387 URLs
    [!] Progress: 387/387
    [~] Level 3: 18078 URLs
    [!] Progress: 18078/18078
    [~] Level 4: 90143 URLs
    [!] Progress: 39750/90143^C
    [~] Crawling 0 JavaScript files
    
    Traceback (most recent call last):
      File "photon.py", line 454, in <module>
        for url in external:
    RuntimeError: Set changed size during iteration
    
    invalid 
    opened by thistehneisen 10
  • A couple minor issues

    SSL issues

    You're ignoring SSL verification:

    /usr/local/lib/python2.7/site-packages/requests/packages/urllib3/connectionpool.py:843: InsecureRequestWarning: Unverified HTTPS request is being made. Adding certificate verification is strongly advised. See: https://urllib3.readthedocs.io/en/latest/advanced-usage.html#ssl-warnings
      InsecureRequestWarning)
    

    You can fix this by adding the following to the top of the file (or wherever you feel like putting it):

    import warnings
    warnings.filterwarnings("ignore")
    

    Now, this is bad practice for obvious reasons; however, it makes sense why you're doing it. The above is the quickest and simplest way to solve the problem and keep the warnings from being annoying.

    Example screenshots with and without the fix are attached.


    Dammit Unicode

    If I supply a URL that is in, let's say, Russian and try to extract all the data:

          ____  __          __
         / __ \/ /_  ____  / /_____  ____
        / /_/ / __ \/ __ \/ __/ __ \/ __ \
       / ____/ / / / /_/ / /_/ /_/ / / / /
      /_/   /_/ /_/\____/\__/\____/_/ /_/ 
    
     URLs retrieved from robots.txt: 5
     Level 1: 6 URLs
     Progress: 6/6
     Level 2: 35 URLs
     Progress: 35/35
     Level 3: 7 URLs
     Progress: 7/7
     Crawling 7 JavaScript files
     Progress: 7/7
    Traceback (most recent call last):
      File "photon.py", line 429, in <module>
        f.write(x + '\n')
    UnicodeEncodeError: 'ascii' codec can't encode character u'\u200b' in position 7: ordinal not in range(128)
    

    The easiest solution would be to just ignore things like this and continue, with a warning to the user that it was ignored.
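
    A rough sketch of that approach, reusing the f and x names from the traceback above (the setup values and warning message are illustrative):

    # Illustrative setup; in Photon, f is the output file and x is the string being saved
    f = open('intel.txt', 'w')
    x = u'пример\u200bdata'

    try:
        f.write(x + '\n')
    except UnicodeEncodeError:
        # Drop the characters that can't be encoded and warn the user instead of crashing
        f.write(x.encode('ascii', 'ignore').decode('ascii') + '\n')
        print('[!] Some non-ASCII characters were ignored while saving results')

    f.close()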

    opened by Ekultek 10
  • Adding code to detect and report broken links.

    Current issue: broken links for which the server returns a 404 status (or a similar error) are not reported as failed links; instead, these erroneous pages are parsed for text content.

    opened by snehm 9
  • Multiple improvements

    The intel is searched only inside the page's plain text, to avoid retrieving tokens that are actually garbage JavaScript code. Better regular expressions could allow searching inside JavaScript code for intel, though.
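
    For illustration, a rough sketch of what "intel only from plain text" means, using a single email pattern (the names here are hypothetical; the PR's actual regex set is much larger):

    import re

    EMAIL = re.compile(r'[\w.+-]+@[\w-]+\.[\w.-]+')

    def extract_intel(plain_text):
        # Run the intel regexes over the visible page text only, not over raw HTML or JavaScript
        return set(EMAIL.findall(plain_text))

    # extract_intel('Contact us at admin@example.com') -> {'admin@example.com'}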

    v1.3.0

    • Added more intels (GENERIC_URL, BRACKET_URL, BACKSLASH_URL, HEXENCODED_URL, URLENCODED_URL, B64ENCODED_URL, IPV4, IPV6, EMAIL, MD5, SHA1, SHA256, SHA512, YARA_PARSE, CREDIT_CARD)
    • Intel search only applied to text (not inside javascript or html tags)
    • proxy support with -p, --proxy option (http proxy only)
    • minor fixes and pep8 format

    Tested on

    os: Linux Mint 19.1
    python: 3.6.7
    
    opened by oXis 8
  • get_lfd

    Hi, after using the --update option I get this error:

    Traceback (most recent call last):
      File "photon.py", line 187, in <module>
        domain = topLevel(main_url)
      File "photon.py", line 182, in topLevel
        toplevel = tld.get_fld(host, fix_protocol=True)
    AttributeError: 'module' object has no attribute 'get_fld'

    Any clue? I also installed it on another system and it's still the same. Changed the target: same. Deleted everything and cloned again: same.

    OS: Kali3. Thanks

    invalid 
    opened by psychomad 7
  • Exceptions in threads during scanning in Level 1 & 2

    Exception in thread Thread-4694:
    Traceback (most recent call last):
      File "/usr/lib/python2.7/threading.py", line 801, in __bootstrap_inner
        self.run()
      File "/usr/lib/python2.7/threading.py", line 754, in run
        self.__target(*self.__args, **self.__kwargs)
      File "photon.py", line 211, in extractor
        if is_link(link, processed, files):
      File "/usr/share/Photon/core/utils.py", line 41, in is_link
        is_file = url.endswith(BAD_TYPES)
    TypeError: endswith first arg must be str, unicode, or tuple, not list
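
    The error suggests BAD_TYPES is defined as a list, while str.endswith() only accepts a string or a tuple. A minimal sketch of one possible fix is to convert it before the call (or define it as a tuple in core/config.py to begin with); the values below are illustrative:

    BAD_TYPES = ['.png', '.jpg', '.pdf']  # illustrative; the real list lives in Photon's config

    def is_file(url):
        # str.endswith() accepts a str or a tuple of suffixes, but not a list
        return url.endswith(tuple(BAD_TYPES))

    # is_file('http://10.10.10.9/logo.png') -> True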


    opened by 4k4xs4pH1r3 6
  • Improve dev-experience

    Add a .gitignore and a requirements file, just to improve the development experience.

    Doing this:

    • You won't need to remove unnecessary compiled Python files or other unneeded files.
    • You just run "pip install -r requirements.txt", and you'll have all the necessary third-party libraries to execute the script.
    invalid 
    opened by dgarana 5
  • TLSCertVerificationDisabled

    In core/requester.py, at line 48:

    response = SESSION.get

    Certificate verification is disabled by setting verify to False in the get call. This may lead to man-in-the-middle attacks.
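
    A rough sketch of what enabling verification could look like (the URL and timeout are illustrative; Photon's requester wires this up differently):

    import requests

    SESSION = requests.Session()
    # Verify TLS certificates instead of passing verify=False;
    # requests ships with certifi's CA bundle by default
    response = SESSION.get('https://example.com', verify=True, timeout=10)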

    opened by yonasuriv 0
  • .well-known files

    I see that this project examines robots.txt and sitemap.xml. I was wondering if you could add some of the other .well-known files like ads.txt, security.txt, and others found at https://well-known.dev/resources/

    opened by WebBreacher 0
  • No generating a DNS map

    So I can't get a DNS map; it only gets created when scanning google and example.com.

    I'm using Python 3 and the command: python3 photon.py -u "http://example.com" --dns

    opened by DEPSTRCZ 0
Releases(v1.3.0)
  • v1.3.0(Apr 5, 2019)

    • Dropped Python < 3.2 support
    • Removed Ninja mode
    • Fixed a bug in link parsing
    • Fixed Unicode output
    • Fixed a bug which caused URLs to be treated as files
    • Intel is now associated with the URL where it was found
  • v1.2.1(Jan 26, 2019)

  • v1.1.6(Jan 25, 2019)

  • v1.1.5(Oct 24, 2018)

  • v1.1.4(Sep 18, 2018)

  • v1.1.3(Sep 6, 2018)

  • v1.1.2(Aug 7, 2018)

    • Code refactor
    • Better identification of external URLs
    • Fixed a major bug that made several intel URLs pass under the radar
    • Fixed a major bug that caused non-HTML content to be marked as a crawlable URL
  • v1.1.1(Sep 4, 2018)

    • Added --wayback
    • Fixed progress bar for python > 3.2
    • Added /core/config.py for easy customization
    • --dns now saves subdomains in subdomains.txt
  • v1.1.0(Aug 29, 2018)

    • Use of ThreadPoolExecutor for x2 speed (for python > 3.2)
    • Fixed mishandling of urls starting with //
    • Removed a redundant try-except statement
    • Evaluate entropy of found keys to avoid false positives
  • v1.0.9(Aug 19, 2018)

  • v1.0.8(Aug 4, 2018)

    • Added --exclude option
    • Better regex and code logic to favor performance
    • Fixed a bug that caused dnsdumpster to fail if target was a subdomain
    • Fixed a bug that caused a crash if run outside "Photon" directory
    • Fixed a bug in file saving (specific to python3)
  • v1.0.7(Jul 28, 2018)

    • Added --timeout option
    • Added --output option
    • Added --user-agent option
    • Replaced lxml with regex
    • Better logic for favoring performance
    • Added bigger and separate file for user-agents
  • v1.0.6(Jul 26, 2018)

  • v1.0.5(Jul 26, 2018)

  • v1.0.4(Jul 25, 2018)

    • Fixed an issue which caused regular links to be saved in robots.txt
    • Simplified flash function
    • Removed -n as an alias of --ninja
    • Added --only-urls option
    • Refactored code for readability
    • Skip saving files if the content is empty
  • v1.0.3(Jul 24, 2018)

    • Introduced plugins
    • Added dnsdumpster plugin
    • Fixed non-ascii character handling, again
    • 404 pages are now added to failed list
    • Handling exceptions in jscanner
  • v1.0.2(Jul 23, 2018)

    • Proper handling of null response from robots.txt & sitemap.xml
    • Python2 compatibility
    • Proper handling of non-ascii chars
    • Added ability to specify custom regex pattern
    • Display total time taken and average time per request
  • v1.0.1(Jul 23, 2018)

Owner
Somdev Sangwan
I make things, I break things and I make things that break things.